Srikanth Bhakthan
AI on AI Podcast
The OpenHands Software Agent SDK

The OpenHands Software Agent SDK

A Composable and Extensible Foundation for Production Agents

arxiv: https://arxiv.org/pdf/2511.03690


Executive summary

- What it is: A modular, production-grade SDK for software development agents that runs locally by default and scales to sandboxed, remote deployments without code changes.

- Why it matters: It addresses reliability, security, and deployment friction that hinder agent adoption, with event-sourced state, immutable configuration, and a clean separation between core, tools, workspaces, and server.

- Key gains: Optional sandboxing, deterministic replay, model-agnostic multi‑LLM routing, MCP-native tools, REST/WebSocket server, built‑in VS Code/VNC/Browser workspaces, and security analysis with confirmation controls.

- Proof points: 64k+ GitHub stars (OpenHands ecosystem), strong results on SWE‑Bench Verified and GAIA, and low-cost CI that validates agent behavior across models in minutes.

Main themes

- From monolith to modular SDK: A full architectural redesign (V0 → V1) to separate concerns and improve reliability, extensibility, and deployment portability.

- Local-first, deploy-anywhere: Run the same agent locally or in containers/remote runtimes by swapping a workspace implementation.

- Deterministic, recoverable execution: Event-sourced state with immutable agent/config components and a single source of truth for conversation state.

- Vendor-agnostic intelligence: Use 100+ model providers via LiteLLM, including non-function-calling models, with routing, extended thinking, and reasoning support.

- End-to-end production path: Integrated server with APIs and interactive workspaces (VS Code Web, VNC desktop, Chromium) for human-in-the-loop control.

Key findings and capabilities

- Architecture and modularity

- Four packages: openhands.sdk (core), openhands.tools (tool implementations), openhands.workspace (local/remote sandboxes), openhands.agent_server (REST/WebSocket server).

- Two-layer composability: assemble deployment components; safely extend capability components (Agent, Tool, LLM, MCP) without modifying core.

- Strict separation of concerns: SDK is a shared library across CLI, GUI, and integrations; decoupled from benchmarks and apps.

- State and reliability

- Event-sourced state model: Immutable event log with deterministic replay, selective persistence, and recovery to last processed event.

- Stateless components: Agents, tools, and LLMs are immutable and validated at construction. ConversationState is the single mutable source of truth.

- Context window management: Condensers summarize history when needed; logs retain full fidelity. The default condenser can cut API cost by up to 2× without performance loss (as reported).
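The event-sourced pattern above can be sketched in a few lines: every change is an appended event, and state is a pure fold over the log, so replaying the same log always reconstructs the same state. This is a minimal stdlib sketch of the pattern; the names `Event` and `ConversationState` are illustrative, not the SDK's actual classes.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    kind: str          # e.g. "message", "action", "observation"
    payload: str

@dataclass
class ConversationState:
    history: list = field(default_factory=list)

    @classmethod
    def replay(cls, log):
        """Deterministically rebuild state by folding over the event log."""
        state = cls()
        for event in log:
            state.history.append((event.kind, event.payload))
        return state

log = [Event("message", "fix the bug"), Event("action", "run tests")]
a = ConversationState.replay(log)
b = ConversationState.replay(log)   # same log -> identical state
```

Because components stay immutable and only the log grows, recovery after a crash is just a replay up to the last processed event.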

- Tooling and MCP integration

- Action–Execution–Observation pattern: Type-safe input validation (Pydantic), executor isolation, and structured observations for LLM consumption.

- MCP-native: Converts MCP tool schemas to SDK Action/Observation; MCP tools behave like native tools.

- Tool registry: Decouples specs from executors for cross-process/network execution and lazy, environment-specific instantiation.
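The Action–Execution–Observation pattern and the registry can be illustrated with a stdlib-only sketch (the SDK uses Pydantic for validation; plain dataclasses with manual checks stand in here, and all names are illustrative rather than the SDK's API).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunCommandAction:
    command: str

    def __post_init__(self):
        # Input validation at construction, mirroring Pydantic-style checks.
        if not self.command.strip():
            raise ValueError("command must be non-empty")

@dataclass(frozen=True)
class Observation:
    content: str        # structured output for LLM consumption

class EchoExecutor:
    """Toy executor; in the real SDK an executor would run the command
    in a (possibly sandboxed) workspace, locally or across a network."""
    def __call__(self, action: RunCommandAction) -> Observation:
        return Observation(content=f"ran: {action.command}")

# A registry maps tool names to executors, decoupling the spec (the Action
# schema shown to the LLM) from where and when the executor is instantiated.
REGISTRY = {"run_command": EchoExecutor()}

obs = REGISTRY["run_command"](RunCommandAction(command="ls"))
```

An MCP tool slots into the same shape: its schema is converted to an Action/Observation pair and registered like any native tool.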

- LLM abstraction and routing

- Supports 100+ providers (LiteLLM). Works with Chat Completions and OpenAI Responses APIs (including advanced reasoning/extended thinking fields).

- Non-native tool use: Enables function-calling for models that lack it via prompt/schema conversion and output parsing.

- Multi-LLM routing: RouterLLM selects models dynamically per message (e.g., route multimodal prompts to multimodal models).
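Per-message routing in the spirit of RouterLLM can be sketched as follows; the model names and the `Message` shape are assumptions for illustration, not the SDK's types.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    text: str
    images: list = field(default_factory=list)

class SimpleRouter:
    """Pick a model per message: multimodal prompts go to a vision-capable
    model, everything else to a cheaper text model."""
    def __init__(self, text_model: str, multimodal_model: str):
        self.text_model = text_model
        self.multimodal_model = multimodal_model

    def select(self, message: Message) -> str:
        return self.multimodal_model if message.images else self.text_model

router = SimpleRouter("cheap-text-model", "vision-model")
```

The same selection hook could equally route on cost, latency, or task type, which is how vendor-agnostic routing doubles as a cost-control lever.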

- Deployment and workspaces

- Conversation factory: Same Conversation API returns LocalConversation (in-process) or RemoteConversation (agent server) depending on workspace.

- Minimal code change to containerize: Replace a local path with DockerWorkspace to run in an isolated container.

- Agent server: FastAPI-based REST/WebSocket service that reconstructs agent configs, executes locally in container, and streams events in real time.

- Official Docker images: Include API server, VS Code Web, VNC desktop, and Chromium browser; per-agent isolation and multi-tenancy ready.

- Security and governance

- SecurityAnalyzer + ConfirmationPolicy: Rates action risk (low/medium/high/unknown) and enforces approvals; a WAITING_FOR_CONFIRMATION state supports approve/reject.

- SecretRegistry: Per-session isolation, late binding, masking/redaction, optional encryption, and live rotation; safe propagation to tools (e.g., env vars in shell).

- Built-in security analyzer: LLM-based risk assessment and configurable threshold policy provided by default.

- Performance and QA

- Benchmarks (as reported in text):

- SWE-Bench Verified: Claude Sonnet 4.5 72.8%; Claude Sonnet 4 68.0%; GPT‑5 (reasoning=high) 68.8%; Qwen3 Coder 480B A35B 65.2%.

- GAIA (val): Claude Sonnet 4.5 67.9%; Claude Sonnet 4 57.6%; GPT‑5 (reasoning=high) 62.4%; Qwen3 Coder 480B A35B 41.2%.

- CI strategy:

- Programmatic tests: Mock LLM to validate logic and APIs quickly.

- LLM-based tests: Integration and example tests across real models; $0.5–$3 per full run, <5 minutes.

- On-demand benchmarks: $100–$1000, hours per run.
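The cheapest tier, programmatic tests with a mocked LLM, looks roughly like this; `MockLLM` and `run_agent_step` are illustrative stand-ins, not SDK names.

```python
class MockLLM:
    """Returns scripted responses so agent control flow can be exercised
    in milliseconds with zero API cost."""
    def __init__(self, scripted):
        self.scripted = list(scripted)

    def complete(self, prompt: str) -> str:
        return self.scripted.pop(0)

def run_agent_step(llm, task: str) -> str:
    # Stand-in for one turn of an agent loop: ask the model for an action.
    return llm.complete(f"Task: {task}")

llm = MockLLM(["run_tests"])
result = run_agent_step(llm, "fix flaky test")
```

Real-model integration tests and on-demand benchmarks then sit above this tier, trading cost for coverage as the numbers above indicate.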

Unique differentiators vs provider SDKs (per text)

- Native sandboxed execution and isolation with first-party server and Docker images.

- Lifecycle control: pause/resume, sub-agent delegation, history restore, and deterministic replay.

- Model-agnostic multi-LLM routing across 100+ providers.

- Built-in security analyzer and confirmation policies.

- Integrated REST/WebSocket services and interactive workspace interfaces (VS Code Web, VNC, browser).

- QA instrumentation: unit tests, LLM-based integration tests, and benchmark harness.

Representative quotes from the text

- “Sandboxing should be opt-in, not universal.”

- “Stateless by default, one source of truth for state.”

- “Strict separation of concerns.”

- “Everything should be composable and safe to extend.”

- “The SDK defines an event-sourced state model with deterministic replay, an immutable configuration for agents, and a typed tool system with MCP integration.”

- “The same agent to run locally for prototyping or remotely in secure, containerized environments with minimal code changes.”

- “Compared with existing SDKs from OpenAI, Claude and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis.”

- “These elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.”

- “Two runs with identical parameters could still diverge subtly [in V0]; V1 treats all agents and components as immutable… The only mutable entity is the conversation state.”

- “This containerized design simplifies deployment and enables SaaS-style multi-tenancy while preserving workspace isolation.”

Business applications and value

- Software engineering automation

- Autonomous coding tasks, bug fixing (SWE-Bench class tasks), refactoring, codebase migration, and test generation.

- CI/CD agents that run gated actions in isolated environments with audit trails and human approval.

- Developer experience platforms

- Embedded agents in IDEs or web workspaces (VS Code Web/VNC/Browser) for long-running tasks and async assistance.

- Secure internal copilots with secret management and risk-based approvals.

- DevOps/SRE and IT automation

- Runbooks and incident response with controlled tool execution, pause/resume, and deterministic replay.

- Infra and environment orchestration via tools and MCP integrations.

- Governance, risk, and compliance

- Action-level risk ratings and approvals; session logs and reproducibility for audits.

- Secret rotation and masking to prevent leakage in logs and model context.

- SaaS and platform integrations

- Multi-tenant agent hosting with per-session isolation; offer agents as services behind APIs and WebSocket streams.

- Vendor-agnostic model routing to control cost/performance and diversify provider risk.

Adoption guide (quick start and extension points)

- Quick start

- Define LLM (via LiteLLM), get a default agent, create Conversation pointing at a local project path, send a message, run.

- To containerize, replace local path with DockerWorkspace; no other code changes required.

- Extend safely

- Tools: implement Action, ToolExecutor, Observation; register in tool registry; or import MCP tools seamlessly.

- LLM routing: subclass RouterLLM to select models dynamically.

- Policies: implement custom SecurityAnalyzer and ConfirmationPolicy.

- Context: add Skills (programmatic or markdown) and prompt prefixes/suffixes without changing agent logic.

- Delegation: spawn sub-agents via the delegation tool for parallel or hierarchical workflows.

Security and governance considerations

- Use opt-in sandboxing (DockerWorkspace) for untrusted or high-risk tasks; enforce per-conversation resource limits via container runtime.

- Require confirmations for medium/high/unknown risk actions; adjust thresholds by environment (dev vs prod).

- Manage secrets centrally via SecretRegistry integrations; enable rotation without restarting agents.

- Persist and encrypt state as required; use event logs for audits and deterministic replays.
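The SecretRegistry behaviors listed above (late binding, live rotation, redaction) reduce to a small pattern; this `SecretRegistry` is an illustrative stand-in, not the SDK class.

```python
class SecretRegistry:
    def __init__(self):
        self._secrets = {}

    def set(self, name: str, value: str):
        self._secrets[name] = value   # live rotation: overwrite in place

    def get(self, name: str) -> str:
        return self._secrets[name]    # late binding: resolved at use time

    def redact(self, text: str) -> str:
        """Mask any secret value before a line reaches logs or model context."""
        for value in self._secrets.values():
            text = text.replace(value, "<redacted>")
        return text

reg = SecretRegistry()
reg.set("GITHUB_TOKEN", "ghp_abc123")   # hypothetical token value
log_line = reg.redact("pushing with token ghp_abc123")
```

Routing every log write and every prompt through `redact` is what keeps secrets out of both audit trails and the model's context window.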

Operational notes and prerequisites

- Core stack: Python, Pydantic models, FastAPI server, LiteLLM, Docker (for sandboxing).

- Performance/cost control: Condenser to manage context tokens; model routing to allocate expensive models only where needed; daily LLM-based CI costs are small.

- Scalability: One container per agent session for strong isolation; deploy behind a load balancer; leverage Kubernetes via container images if needed.
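Condensation can be sketched as a fold over history: when the view exceeds a token budget, the oldest turns collapse into a summary entry while the full log is kept untouched for replay and audit. Whitespace word counting and the string-truncating "summary" are deliberate simplifications of what a real (often LLM-backed) condenser would do.

```python
def condense(history, budget: int):
    """Return (condensed_view, full_log); only the view is sent to the model."""
    def tokens(msgs):
        return sum(len(m.split()) for m in msgs)   # crude token proxy

    full_log = list(history)          # full fidelity retained for audit/replay
    view = list(history)
    while tokens(view) > budget and len(view) > 2:
        # Collapse the two oldest entries into a crude "summary" entry.
        merged = "SUMMARY: " + " / ".join(m[:20] for m in view[:2])
        view = [merged] + view[2:]
    return view, full_log
```

Because the full log survives, condensation changes what the model sees, never what the audit trail records.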

Risks and mitigations

- Model variability: Use deterministic replay and CI across multiple models to catch regressions; leverage routing to backstop failures.

- Security of actions: Enforce confirmation policies; run in containers; mask and encrypt secrets.

- State growth and cost: Use condensers; persist incrementally; prune artifacts as needed.

- Tool brittleness: Favor typed Action/Observation schemas; validate at boundaries; treat MCP tools as first-class but monitor server availability.

Key links and licensing

- SDK repository: https://github.com/OpenHands/software-agent-sdk

- Benchmarks: https://github.com/OpenHands/benchmarks

- License: MIT

Appendix: minimal workflow (conceptual)

- Local prototyping

- Instantiate LLM and default agent.

- Create Conversation(agent=..., workspace="/path/to/project").

- conversation.send_message("Task…"); conversation.run().

- Remote/containerized

- Replace workspace path with DockerWorkspace(...).

- Keep agent, tools, and messaging code unchanged.

- Observe streamed events via WebSocket for UI/monitoring.

What’s new versus OpenHands V0

- Optional isolation instead of universal sandboxing; unified code path avoids duplicated local MCP/tool logic.

- Immutable config with a single mutable ConversationState; deterministic replay and reliable recovery.

- Modular SDK separated from apps and benchmarks; smaller, faster core; independent releases.

- Typed, extensible tool system with MCP parity; distributed execution via registry; safer and more composable orchestration.

Decision checklist

- Do you need vendor-agnostic models, MCP tools, and a path from local to isolated deployment?

- Will you require human-in-the-loop approvals, secret management, and auditable logs?

- Do you need to route across models for cost/performance or multimodality?

- Are repeatability and deterministic recovery mandatory for your workflows?

- Do you plan to expose agents via APIs and interactive workspaces?

