Merged
225 changes: 225 additions & 0 deletions rfcs/001-openenv-spec.md
@@ -0,0 +1,225 @@
# RFC: OpenEnv Framework Spec for agent execution environments

**Status**: In Review
**Created**: 10/14/2025
**Authors**: @Darktex, @pankit-eng, @jspisak, @zkwentz
**RFC ID:** 001

## Summary

An end-to-end framework for creating, deploying, and using isolated execution environments for agentic RL training, built using Gymnasium-style APIs. It provides a clean client-server architecture in which environments run as FastAPI servers in Docker containers and clients interact with them via type-safe HTTP APIs.

## Motivation

### Problem Statement

Building execution environments for AI agents, code execution, or computational tasks typically involves:
- Complex setup and dependency management
- Security concerns with code execution
- Difficulty in scaling and deploying environments
- Lack of standardized interfaces between environments and their clients

### Goals

1. **Simplicity**: Simple APIs to interact with the environment from RL training code
2. **Type Safety**: Strongly-typed actions, observations, and state
3. **Isolation**: Each environment runs in its own Docker container
4. **Observability**: Leverage the side-car container pattern to observe the (action, observation) tuples of an RL training episode.


## Design

### Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│  RL code (Client Application)                           │
│  ┌────────────────┐              ┌──────────────────┐   │
│  │  Environment   │              │  Environment     │   │
│  │  Client        │              │  Client          │   │
│  │ (HTTPEnvClient)│              │ (HTTPEnvClient)  │   │
│  └────────┬───────┘              └────────┬─────────┘   │
└───────────┼───────────────────────────────┼─────────────┘
            │  HTTP (reset, step, state)    │  HTTP
            │                               │
┌───────────▼───────────────────────────────▼─────────────┐
│  Docker Containers (Isolated)                           │
│  ┌──────────────────────┐    ┌──────────────────────┐   │
│  │  FastAPI Server      │    │  FastAPI Server      │   │
│  │  Environment         │    │  Environment         │   │
│  │  Logic               │    │  Logic               │   │
│  └──────────────────────┘    └──────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```

Reviewer comment (on `HTTP (reset, step, state)`): I'm not sure we should expose state, as the model is that you keep that private and only return what you are allowed to see under the observation. If you are playing chess or having a 1:1 conversation, you are allowed to see everything, so it doesn't matter. But it does matter in many real-life applications that involve imperfect information (e.g. poker, where you don't see other people's hands, or a driving sim, where some cars move out of your view because they are occluded by buildings or other cars).

### Core Abstractions (already available on master)

#### 1. Environment (Server-Side)

```python
from abc import ABC, abstractmethod


class Environment(ABC):
    """Base class for all environments."""

    @abstractmethod
    def reset(self) -> Observation:
        """Initialize new episode."""

    @abstractmethod
    def step(self, action: Action) -> Observation:
        """Execute action and return observation."""

    @property
    @abstractmethod
    def state(self) -> State:
        """Get current episode state."""
```

Reviewer comment (on `class Environment(ABC)`): If we add a way of discovering actions (perhaps the topic of another RFC) it will have to backpropagate here.

Reviewer comment (on the `state` property): Nothing in Python is really private, so I don't know how to enforce this.

**Design Rationale**:
- Familiar interface for RL/environment practitioners
- Clear separation between action execution (step) and state management
- Abstract base class enforces contract across all environments
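
To make the contract concrete, here is a minimal sketch of a subclass. The `EchoAction`, `EchoObservation`, `EchoState`, and `EchoEnvironment` names are hypothetical stand-ins invented for this illustration, and the base class is restated locally so the snippet runs on its own; it is not the framework's actual implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical stand-ins for the framework's Action, Observation, and State types.
@dataclass
class EchoAction:
    message: str


@dataclass
class EchoObservation:
    echoed: str = ""
    done: bool = False


@dataclass
class EchoState:
    step_count: int = 0


class Environment(ABC):
    """Local restatement of the base class, typed against the stand-ins."""

    @abstractmethod
    def reset(self) -> EchoObservation: ...

    @abstractmethod
    def step(self, action: EchoAction) -> EchoObservation: ...

    @property
    @abstractmethod
    def state(self) -> EchoState: ...


class EchoEnvironment(Environment):
    """Toy environment that echoes each action back as the observation."""

    def __init__(self) -> None:
        self._state = EchoState()

    def reset(self) -> EchoObservation:
        self._state = EchoState()  # start a fresh episode
        return EchoObservation()

    def step(self, action: EchoAction) -> EchoObservation:
        self._state.step_count += 1
        return EchoObservation(echoed=action.message)

    @property
    def state(self) -> EchoState:
        return self._state


env = EchoEnvironment()
env.reset()
obs = env.step(EchoAction(message="hi"))
print(obs.echoed, env.state.step_count)
```

Because `Environment` is abstract, a subclass that omits any of the three members cannot be instantiated, which is how the base class enforces the contract.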

#### 2. HTTPEnvClient (Client-Side)

```python
from typing import Generic, TypeVar

ActT = TypeVar("ActT")
ObsT = TypeVar("ObsT")


class HTTPEnvClient(Generic[ActT, ObsT]):
    """Base class for HTTP environment clients."""

    def reset(self) -> StepResult[ObsT]:
        """Reset environment."""

    def step(self, action: ActT) -> StepResult[ObsT]:
        """Execute action."""

    def state(self) -> State:
        """Get current state."""

    def close(self) -> None:
        """Cleanup resources by signaling to the provider."""
```

**Design Rationale**:

The HTTPEnvClient serves as the primary interface for users to interact with environments, designed with several key principles:

- This base class handles all HTTP communication (requests and responses) with the environment
- Generic types (`Generic[ActT, ObsT]`) let static type checkers verify action and observation types
- Each environment's concrete client class parses the step, observation, and state responses from the server into the corresponding data models
- Example: `CodingEnv(HTTPEnvClient[CodeAction, CodeObservation])`
- `state()` method provides visibility into episode metadata
- Explicit `close()` ensures proper resource cleanup

Reviewer comment (on the `CodingEnv` example): We need a better naming convention here. I know that CodingEnvClient is a bit heavy, but I find the current convention of naming the server CodingEnvironment and the client CodingEnv to be deceptive/confusing.
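
The parsing responsibility described above could look roughly like the following sketch. The `StepResult` fields, the `_step_payload` and `_parse_result` hook names, and the response layout are all assumptions made for illustration; the RFC does not specify them, and the stand-in base class here omits the real HTTP calls so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Any, Dict, Generic, Optional, TypeVar

ActT = TypeVar("ActT")
ObsT = TypeVar("ObsT")


@dataclass
class StepResult(Generic[ObsT]):
    # Hypothetical shape; the RFC does not define StepResult's fields.
    observation: ObsT
    reward: Optional[float] = None
    done: bool = False


@dataclass
class CodeAction:
    code: str


@dataclass
class CodeObservation:
    stdout: str = ""
    exit_code: int = 0


class HTTPEnvClient(Generic[ActT, ObsT]):
    """Stand-in base: the real class would also send the HTTP requests."""

    def _step_payload(self, action: ActT) -> Dict[str, Any]:
        raise NotImplementedError

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[ObsT]:
        raise NotImplementedError


class CodingEnv(HTTPEnvClient[CodeAction, CodeObservation]):
    """Concrete client: converts between typed models and JSON-like dicts."""

    def _step_payload(self, action: CodeAction) -> Dict[str, Any]:
        return {"code": action.code}

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[CodeObservation]:
        obs = payload["observation"]
        return StepResult(
            observation=CodeObservation(stdout=obs["stdout"], exit_code=obs["exit_code"]),
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )


client = CodingEnv()
# Hand-written response dict standing in for a decoded server reply.
response = {"observation": {"stdout": "hi\n", "exit_code": 0}, "reward": 1.0, "done": False}
result = client._parse_result(response)
print(result.observation.exit_code, result.reward)
```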

#### 3. Container Providers

```python
from abc import ABC, abstractmethod


class ContainerProvider(ABC):
    """Abstract base for container orchestration."""

    @abstractmethod
    def start_container(self, image: str, ...) -> str:
        """Start container and return base URL."""

    @abstractmethod
    def stop_container(self) -> None:
        """Stop and remove container."""

    @abstractmethod
    def wait_for_ready(self, base_url: str, timeout_s: float) -> None:
        """Wait for container to be ready."""
```

Reviewer comment (on `start_container`): Since we call `.reset()` a lot, does it make sense to have a `.reset()` here too, to restart from a warmed-up image?

**Design Rationale**:
- Pluggable architecture supports multiple platforms (local Docker, K8s, other orchestration providers)
- Provider abstraction decouples the client from deployment and management details and integrates easily with existing orchestration solutions
- Consistent interface across all providers
- Higher-level RL frameworks can implement their own container providers to integrate with their existing orchestration solutions.
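
As one illustration of what a provider has to implement, readiness waiting is typically a poll-until-deadline loop. The sketch below is an assumption about how that could work, not the framework's code: it takes an injected `is_ready` callable (a hypothetical parameter) instead of a `base_url` so that it runs without a container, whereas a real provider would probe the container's URL.

```python
import time
from typing import Callable


def wait_for_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.05,
) -> None:
    """Poll `is_ready` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"container not ready after {timeout_s}s")
        time.sleep(poll_interval_s)


# Fake probe that succeeds on its third call, standing in for a health check.
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


wait_for_ready(fake_probe, timeout_s=2.0)
print(calls["n"])
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable if the system clock is adjusted while waiting.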

### Key Design Decisions

In this RFC, we want to align on four decisions that will shape the overall design of the framework.

#### Decision 1: Baseline API Set

**Chosen Approach**: Define three core APIs as the baseline interface for this framework: `step`, `reset`, and `state`.

**Rationale**:
- **`reset()`**: Initializes a new episode and returns initial observation, providing a clean starting point for agent interactions
- **`step(action)`**: Executes an action and returns an observation, forming the core interaction loop
- **`state()`**: Provides visibility into the current episode state and metadata

These three APIs establish the minimum viable interface for environment interaction and are sufficient for basic RL training workflows. They align with established patterns from Gymnasium and similar frameworks, making them immediately familiar to practitioners.
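
A rough sketch of how the three APIs combine into a rollout loop follows. `CountdownEnv` is a hypothetical in-process environment invented for this example, and the fixed `"noop"` action stands in for a real policy; actual clients would talk to a server instead.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    done: bool = False
    reward: Optional[float] = None


class CountdownEnv:
    """Hypothetical environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.steps = 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation()

    def step(self, action: str) -> Observation:
        self.steps += 1
        return Observation(done=self.steps >= self.horizon, reward=1.0)

    def state(self) -> dict:
        return {"step_count": self.steps}


def rollout(env: CountdownEnv, max_steps: int = 10) -> List[Tuple[str, Observation]]:
    """Collect (action, observation) pairs for one episode."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        if obs.done:
            break
        action = "noop"  # a real policy would choose this from `obs`
        obs = env.step(action)
        trajectory.append((action, obs))
    return trajectory


env = CountdownEnv(horizon=3)
trajectory = rollout(env)
print(len(trajectory), env.state())
```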

**Scope**: This RFC focuses exclusively on these baseline APIs. Additional APIs (e.g., `render()`, `seed()`, `close()`, `tools()` and environment-specific utilities) will be explored in follow-up RFCs.

Reviewer comment (on the Scope paragraph): I think we should call this out in the very first line, so that the reader is not going to be like "BUT WHAT ABOUT TOOLS".


#### Decision 2: Environment-Computed Rewards

**Chosen Approach**: Rewards are computed inside the environment and returned as part of the observation.

**Rationale**:
- **Encapsulation**: Reward logic stays with the environment where domain knowledge resides
- **Consistency**: Ensures reward computation is deterministic and reproducible across different client implementations
- **Flexibility**: Environments can use internal state and context not visible to clients for reward computation
- **Standard Pattern**: Aligns with Gymnasium/Gym conventions where rewards are returned from `step()`

The `Observation` base class includes a `reward` field that environments populate:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass(kw_only=True)
class Observation:
    """Base class for all environment observations."""
    done: bool = False
    reward: Union[bool, int, float, None] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
```

Reviewer comment (on the `reward` field sentence): Optional reward field.

Reviewer comment (on `done`): I would rename to `episode_over` given the distinction between termination and truncation, that is, episodes with a defined ending (answer given, etc.) and those where we end computation for an external reason. https://farama.org/Gymnasium-Terminated-Truncated-Step-API
While this differs from the old gym naming convention, it is clearer naming for new users to understand, in my opinion.

Reviewer comment (on `reward`): I would use a numpy array as your reward. If you have a single one that's fine, but it allows more than one reward without the API having to be changed significantly (see #107 and multi-reward RL - https://github.com/Farama-Foundation/MO-Gymnasium).
Further, this allows easier vectorization of environments.

This design enables environments to compute rewards based on:
- Action outcomes (e.g., exit codes, success/failure)
- Internal state transitions
- Multi-step trajectories
- Domain-specific metrics

Reviewer comment (suggested addition): "Note that the environment is just the place where these are returned, not necessarily where they are computed. For example, we recommend that you RPC to a GPU machine hosting your reward model"

Reviewer comment (follow-up): This brings the next question: what standard should said RPCs follow so that this code is shareable?

Clients receive fully-formed observations with rewards already computed, simplifying the client-side RL loop.

#### Decision 3: HTTP-Based Communication

**Chosen Approach**: Use HTTP/REST for client-server communication

**Rationale**:
- HTTP-based RPC is more universal and better understood than alternatives such as gRPC or Thrift
- Easy to debug with standard tools (curl, Postman)
- Supports language-agnostic clients
- FastAPI provides an excellent developer experience
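
Part of what makes HTTP easy to debug is that requests and responses are plain JSON. The sketch below shows one way a client might serialize an action and rebuild a typed observation; the `{"action": ...}` and `{"observation": ...}` envelope and the `POST /step` route are assumptions for illustration, since the RFC does not define the wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CodeAction:
    code: str


@dataclass
class CodeObservation:
    stdout: str
    exit_code: int
    done: bool = False


def encode_step_request(action: CodeAction) -> str:
    """Serialize an action as the JSON body of a hypothetical POST /step."""
    return json.dumps({"action": asdict(action)})


def decode_step_response(body: str) -> CodeObservation:
    """Rebuild the typed observation from a JSON response body."""
    return CodeObservation(**json.loads(body)["observation"])


request_body = encode_step_request(CodeAction(code="print('hi')"))
# Response a server might send back; hand-written here instead of fetched over HTTP.
response_body = json.dumps(
    {"observation": {"stdout": "hi\n", "exit_code": 0, "done": False}}
)
obs = decode_step_response(response_body)
print(request_body)
print(obs)
```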

#### Decision 4: Docker-Based Runtime Isolation and Packaging

**Chosen Approach**: Each environment runs in its own Docker container

**Rationale**:
- Stronger isolation boundaries than process-based isolation
- Reproducible environments with packaged dependencies
- Easy dependency management via Dockerfile
- Industry-standard tooling


### Example Environments

**Purpose**: Test infrastructure, demonstrate patterns, verify deployments

#### Coding Environment

Executes Python code in a sandboxed environment:

```python
from envs.coding_env import CodeAction, CodingEnv

client = CodingEnv.from_docker_image("coding-env:latest")
result = client.step(CodeAction(code="print('Hello, World!')"))
print(result.observation.stdout) # "Hello, World!\n"
print(result.observation.exit_code) # 0
client.close()
```