[RFC 001] - Baseline API and Interface Specifications #26
Conversation
updated the name to OpenEnv
will we link this to the top level readme to ensure folks see it?
adding a pytorch logo :)
Adding an experimental warning to the readme.
Creating a PR to update naming on the Readme
Co-authored-by: Copilot <[email protected]>
adding the CoC..
```
┌─────────────────────────────────────────────────────────┐
│ RL code(Client Application) │
│ RL code(Client Application) │
```
double line
```
│ │ (HTTPEnvClient)│ │ (HTTPEnvClient) │ │
│ └────────┬───────┘ └────────┬─────────┘ │
└───────────┼───────────────────────────────┼─────────────┘
│ HTTP (reset, step, state) │ HTTP
```
I'm not sure we should expose state, since the model is that you keep the state private and only return what the agent is allowed to see in the observation. If you are playing chess or having a 1:1 conversation you can see everything, so it doesn't matter. But it matters in many real-life applications that involve imperfect information (e.g. poker, where you don't see other players' hands; or a driving sim, where some cars move out of your view because they are occluded by buildings or other cars).
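A rough sketch of the concern, with illustrative names that are not part of the RFC: the server keeps the full state private and only projects what the acting player is allowed to see into the observation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical poker-style example: the full table state stays on the server,
# and the observation only contains the cards this seat may see.
@dataclass
class PokerState:
    hands: Dict[int, List[str]]                      # every player's hole cards (private)
    community_cards: List[str] = field(default_factory=list)

@dataclass
class PokerObservation:
    my_hand: List[str]                               # only the acting player's cards
    community_cards: List[str]
    done: bool = False

def observe(state: PokerState, player_id: int) -> PokerObservation:
    """Project the private server-side state onto what one player may see."""
    return PokerObservation(
        my_hand=state.hands[player_id],
        community_cards=list(state.community_cards),
    )
```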
```python
@property
@abstractmethod
def state(self) -> State:
```
Nothing in Python is really private, so I'm not sure how to enforce this.
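One way to get practical enforcement despite Python's lack of true privacy is at the transport layer (a sketch with hypothetical route names, not the actual server code): simply never register an endpoint for state, so remote clients can only reach reset and step.

```python
from typing import Any, Callable, Dict

# Sketch: the HTTP wrapper only exposes reset and step, so env.state stays
# server-side even though Python itself cannot hide the attribute.
def build_routes(env: Any) -> Dict[str, Callable[..., Any]]:
    return {
        "/reset": lambda: env.reset(),
        "/step": lambda action: env.step(action),
        # intentionally no "/state" route; full state remains internal
    }
```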
#### 1. Environment (Server-Side)

```python
class Environment(ABC):
```
If we add a way of discovering actions (perhaps the topic of another RFC), it will have to propagate back to this interface.
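For reference, a minimal sketch of the server-side base class as described in this RFC (reset and step plus the state property discussed above); the generic parameters here are for illustration and the exact signatures in the repo may differ.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

ActT = TypeVar("ActT")
ObsT = TypeVar("ObsT")
StateT = TypeVar("StateT")

class Environment(ABC, Generic[ActT, ObsT, StateT]):
    """Server-side environment: owns the full state, returns observations."""

    @abstractmethod
    def reset(self) -> ObsT:
        """Start a new episode and return the initial observation."""

    @abstractmethod
    def step(self, action: ActT) -> ObsT:
        """Apply an action and return the resulting observation (reward, done, ...)."""

    @property
    @abstractmethod
    def state(self) -> StateT:
        """Full internal state; kept server-side rather than exposed over HTTP."""
```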
- Generic types (`Generic[ActT, ObsT]`) provide compile-time type safety
- Each environment's concrete client class implements parsing of step, observation, and state responses from the server into the corresponding data models.
- Example: `CodingEnv(HTTPEnvClient[CodeAction, CodeObservation])`
We need a better naming convention here. I know that CodingEnvClient is a bit heavy, but I find the current convention of naming the server CodingEnvironment and the client CodingEnv to be deceptive/confusing
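To make the naming discussion concrete, a hedged sketch of what the suggestion might look like: the HTTP client gets an explicit `*Client` suffix so it cannot be mistaken for the server-side `CodingEnvironment`. The data models and the parsing method name below are illustrative, not the current API.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

ActT = TypeVar("ActT")
ObsT = TypeVar("ObsT")

class HTTPEnvClient(Generic[ActT, ObsT]):
    """Stand-in for the RFC's HTTP client base class (details omitted)."""

@dataclass
class CodeAction:
    command: str

@dataclass
class CodeObservation:
    stdout: str
    reward: Optional[float] = None
    done: bool = False

# Named *Client to keep it visually distinct from the server-side CodingEnvironment.
class CodingEnvClient(HTTPEnvClient[CodeAction, CodeObservation]):
    def parse_observation(self, payload: dict) -> CodeObservation:
        return CodeObservation(
            stdout=payload.get("stdout", ""),
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )
```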
| """Abstract base for container orchestration.""" | ||
|
|
||
| @abstractmethod | ||
| def start_container(self, image: str, ...) -> str: |
Since we call .reset() a lot, does it make sense to have a .reset() here too, e.g. to restart from a warmed-up image?
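A sketch of what that suggestion could look like; the class name, stop_container, and especially reset() are hypothetical additions for illustration, not the current interface.

```python
from abc import ABC, abstractmethod

class ContainerProvider(ABC):
    """Abstract base for container orchestration (sketch)."""

    @abstractmethod
    def start_container(self, image: str) -> str:
        """Start a container from an image and return its id."""

    @abstractmethod
    def stop_container(self, container_id: str) -> None:
        """Tear the container down."""

    # Hypothetical addition discussed above: reuse a warmed-up container
    # instead of paying the cold-start cost on every env.reset().
    @abstractmethod
    def reset(self, container_id: str) -> None:
        """Restore a running container to its initial, warmed-up state."""
```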
- **Flexibility**: Environments can use internal state and context not visible to clients for reward computation
- **Standard Pattern**: Aligns with Gymnasium/Gym conventions where rewards are returned from `step()`

The `Observation` base class includes a `reward` field that environments populate:
Optional reward field.
These three APIs establish the minimum viable interface for environment interaction and are sufficient for basic RL training workflows. They align with established patterns from Gymnasium and similar frameworks, making them immediately familiar to practitioners.
**Scope**: This RFC focuses exclusively on these baseline APIs. Additional APIs (e.g., `render()`, `seed()`, `close()`, `tools()`, and environment-specific utilities) will be explored in follow-up RFCs.
I think we should call this out in the very first line, so that the reader isn't going to be like "BUT WHAT ABOUT TOOLS".
```python
    metadata: Dict[str, Any] = field(default_factory=dict)
```

This design enables environments to compute rewards based on:
"Note that the environment is just the place where these are returned, not necessarily where they are computed. For example, we recommend that you RPC to a GPU machine hosting your reward model"
(This brings the next question: what standard should said RPCs follow so that this code is shareable?)
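A sketch of the pattern being suggested, with entirely hypothetical names and endpoint: the reward is returned from the environment's step(), but the number itself comes from an RPC to a remote reward model. Whether plain HTTP/JSON, gRPC, or something else should be the blessed standard is exactly the open question above; this sketch assumes HTTP/JSON.

```python
from dataclasses import dataclass
from typing import Optional

import requests  # assumption: plain HTTP/JSON is used for the reward RPC

@dataclass
class Observation:
    """Local stand-in for the RFC's Observation base class."""
    done: bool = False
    reward: Optional[float] = None

def remote_reward(prompt: str, completion: str,
                  url: str = "http://reward-host:8000/score") -> float:
    """Ask a GPU machine hosting a reward model to score a transition (hypothetical endpoint)."""
    resp = requests.post(url, json={"prompt": prompt, "completion": completion}, timeout=30)
    resp.raise_for_status()
    return float(resp.json()["reward"])

class ChatEnvironment:
    """Toy environment: rewards are returned from step(), computed remotely."""

    def __init__(self, prompt: str) -> None:
        self._prompt = prompt

    def step(self, action: str) -> Observation:
        reward = remote_reward(self._prompt, action)  # RPC, not local compute
        return Observation(done=False, reward=reward)
```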
Merging this to move quicker. Will refactor.
pseudo-rnd-thoughts left a comment:
I realise I'm a bit late to the conversation, but I'll give my two cents, having maintained Gymnasium and thought about what I would change if a Gymnasium v2 were to appear.
- Make it vectorized and multi-agent by default. Single-agent, single-environment setups are just special cases, and making them the default removes the need for another library to add these features later (the way PettingZoo exists as the multi-agent equivalent of Gymnasium). Why vectorized? It further reduces duplication later if you want a server that runs multiple environments at the same time.
Personally, I would plan to bring these into the API from the beginning.
These changes are easy to implement through `num_envs = 1` and `num_agents = 1` attributes (I would make them properties for easier customization by users).
Also, changing the rewards to be an array gives the developer freedom over the shape, i.e., `(1,)` and `(num_envs, num_agents, num_reward_types)` are both valid under a single API.
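A sketch of how the single-agent, single-environment special case could be expressed under this proposal; class and property names follow the comment and are not part of the current RFC.

```python
import numpy as np

class VectorMultiAgentEnv:
    """Sketch: vectorized and multi-agent by default; scalars are the special case."""

    @property
    def num_envs(self) -> int:
        return 1              # a single environment is just num_envs == 1

    @property
    def num_agents(self) -> int:
        return 1              # a single agent is just num_agents == 1

    def step(self, actions: np.ndarray) -> np.ndarray:
        """Return the reward array; observation handling is omitted in this sketch."""
        # The developer chooses the reward shape: (1,), (num_envs,), or
        # (num_envs, num_agents, num_reward_types) are all valid under one API.
        return np.zeros((self.num_envs, self.num_agents), dtype=np.float32)
```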
```python
@dataclass(kw_only=True)
class Observation:
    """Base class for all environment observations."""
    done: bool = False
```
I would rename it to episode_over given the distinction between termination and truncation, i.e., episodes with a defined ending (answer given, etc.) and those where we end computation for an external reason. https://farama.org/Gymnasium-Terminated-Truncated-Step-API
While this differs from the old Gym naming convention, it is clearer naming for new users to understand, in my opinion.
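A tiny sketch of the suggested rename (reviewer's proposal, not the RFC's current field name):

```python
from dataclasses import dataclass

@dataclass(kw_only=True)
class Observation:
    """Base class for all environment observations (suggested naming)."""
    episode_over: bool = False   # replaces `done`; covers both termination and truncation
```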
```python
class Observation:
    """Base class for all environment observations."""
    done: bool = False
    reward: Union[bool, int, float, None] = None
```
I would use a numpy array as your reward: if you have a single reward that's fine, but it allows more than one reward without the API having to change significantly (see #107 and multi-reward RL - https://github.com/Farama-Foundation/MO-Gymnasium).
Further, this allows easier vectorization of environments.
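A sketch of the array-valued reward suggestion (the `np.ndarray` field is the reviewer's proposal, not the current spec; the shape is left to the environment):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass(kw_only=True)
class Observation:
    """Base class for all environment observations (array reward variant)."""
    done: bool = False
    # A (1,) array reproduces the scalar case; (num_envs, num_agents, num_reward_types)
    # also fits under the same API.
    reward: Optional[np.ndarray] = None
```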
This PR is to discuss the OpenEnv 0.1 RFC, with focus on the baseline API and interface specifications.
What is proposed here is already available on the master branch, to try out and to gather feedback from the current experience.
NOTE: Extensions supporting observability and MCP tools will follow up on this baseline API spec RFC.