Add Docker sandbox backend and GenericSandboxEnv environment#1490
Add Docker sandbox backend and GenericSandboxEnv environment#1490hamishivi merged 16 commits intomainallenai/open-instruct:mainfrom add-docker-sandbox-envallenai/open-instruct:add-docker-sandbox-envCopy head branch name to clipboard
Conversation
b145ea6 to
1506151
Compare
Summary of ChangesHello @hamishivi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reinforcement learning capabilities by integrating robust code execution environments. It introduces a Docker-based sandbox backend, allowing for secure and controlled execution of code and commands during RL training. Two new environments, Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
1506151 to
41cc028
Compare
There was a problem hiding this comment.
Code Review
The pull request introduces a robust Docker-based sandbox infrastructure for RL training, including two new environments (SandboxEnv and SandboxLMEnv). The implementation is well-structured, leveraging an abstract base class for backends and providing stateful bash execution. I have identified a few areas for improvement, primarily around path safety in shell commands, hardcoded resource limits, and minor path manipulation logic.
I am having trouble creating individual review comments. Click here to see my feedback.
open_instruct/environments/backends.py (128)
The path variable is injected directly into a shell command without quoting. This will fail if the path contains spaces or special characters. Using shlex.quote ensures the path is safely handled by the shell.
self._container.exec_run(["bash", "-c", f"echo '{encoded_content}' | base64 -d > {shlex.quote(path)}"])open_instruct/environments/backends.py (69)
The memory limits for the Docker container are currently hardcoded to 4GB. It would be better to make these configurable via the constructor to allow for different task requirements.
def __init__(self, image: str = "ubuntu:24.04", timeout: int = 1800, mem_limit: str = "4g"):
"""
Args:
image: Docker image to use (default: ubuntu:24.04)
timeout: Per-command timeout in seconds (default: 1800 / 30 min)
mem_limit: Memory limit for the container (default: 4g)
"""
self._image = image
self._timeout = timeout
self._mem_limit = mem_limit
self._container = None
self._client = None
open_instruct/environments/backends.py (88)
Using the configurable memory limit instead of a hardcoded value.
self._container = self._client.containers.run(
self._image, command="sleep infinity", detach=True, remove=True, mem_limit=self._mem_limit, memswap_limit=self._mem_limit
)open_instruct/environments/backends.py (134)
When executing cat, it's safer to use -- to signal the end of command options. This prevents cat from interpreting a filename starting with a hyphen as an option.
exit_code, output = self._container.exec_run(["cat", "--", path])open_instruct/environments/sandbox_lm.py (271)
Manual path manipulation using string splitting can be brittle. Since the sandbox is Linux-based, using posixpath.dirname is a more robust and standard way to extract the directory name.
import posixpath
parent = posixpath.dirname(path)
41cc028 to
882d68b
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a Docker-based sandbox for code execution in RL environments, adding a SandboxBackend abstraction, a DockerBackend implementation, and a GenericSandboxEnv that provides execute_bash and file editing tools. The changes are well-structured and the implementation is solid. I've provided a few suggestions to improve the robustness of the DockerBackend implementation, particularly around file I/O, and a minor correction to a system prompt file.
9ed0a7d to
552afa4
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request introduces a Docker sandbox backend and two new environments, SandboxEnv and SandboxLMEnv, for RL training with code execution. It also updates the CHANGELOG.md, open_instruct/environments/__init__.py, open_instruct/environments/tools/tools.py, and pyproject.toml files to integrate these new features and dependencies. A new debug script sandbox_lm_1gpu.sh and its corresponding system prompt file sandbox_lm_system_prompt.txt are added. The changes enhance the project's capability for code-aware RL agents by providing a sandboxed execution environment.
77c6a8d to
de5f289
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a Docker sandbox backend and two new environments, SandboxEnv and SandboxLMEnv, for RL training with code execution. The changes include new files for backend implementation (backends.py) and the sandbox environments (generic_sandbox.py), updates to __init__.py and tools.py to register these new components, and a new debug script (sandbox_lm_1gpu.sh) along with its system prompt. The pyproject.toml and uv.lock files have also been updated to include the docker dependency. The code generally looks good, with clear abstractions for the sandbox backend and well-defined tool interfaces. I've identified a few areas for improvement regarding error handling, logging, and consistency in the CHANGELOG.md entry.
b86560d to
23371fb
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a robust Docker-based sandbox infrastructure for RL training, featuring a stateful bash environment and a file editor tool. The implementation of state persistence in Docker containers via environment variable and CWD tracking is clever. However, there are a few issues regarding command defaults, potential hangs in file reading, and image-specific dependencies that should be addressed to ensure reliability across different Docker images.
23371fb to
67caf7a
Compare
1e537d8 to
ebf4cb1
Compare
Adds sandbox infrastructure for RL training with code execution: - backends.py: SandboxBackend ABC + DockerBackend (command timeout, output truncation, 4GB memory limit, put_archive file transfer) - generic_sandbox.py: GenericSandboxEnv with execute_bash + str_replace_editor (stateful bash, file editing, retry on reset) - Register as 'generic_sandbox' in TOOL_REGISTRY for auto-discovery - Add docker>=7.0.0 dependency - Add 1-GPU debug script and system prompt Co-authored-by: Cursor <cursoragent@cursor.com>
ebf4cb1 to
6072c71
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without env_name, GenericSandboxEnv tools never get dispatched because the rollout loop never acquires the env or registers its inner tool names. This finds the first stateful env (not a simple Tool) and sets it as the default so samples without an explicit env_config column get dispatched correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When using --tools generic_sandbox, the pool is keyed by "generic_sandbox" but the model calls "execute_bash" and "str_replace_editor". These inner tool names were missing from tool_call_names, causing validate_dataset_tools to reject datasets listing them and EnvStatistics to miss their metrics. Fix: collect tool definitions from CLI pools first, build known_tool_names set, and include inner function names in tools_config.tool_call_names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mounts /var/run/docker.sock from the host so GenericSandboxEnv can create Docker containers inside Beaker jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For stateful environments like GenericSandboxEnv, the pool key
(e.g., "generic_sandbox") was included alongside the actual tool
names ("execute_bash", "str_replace_editor") in tool_call_names.
This caused the env name to appear in metrics and validation.
Now only include tool definition names (what the model actually
calls), not pool/env keys.
Co-authored-by: Cursor <cursoragent@cursor.com>
With 16 unique prompts x 4 samples = 64 rollouts, pool_size=8 was a bottleneck causing long acquire waits. 16 gives one sandbox per unique prompt (64GB containers on Jupiter nodes). Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
run_code (write temp file + execute) is the same for all backends. Moved it to SandboxBackend as a concrete method so backends only need to implement run_command. Removed the DockerBackend override. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
natolambert
left a comment
There was a problem hiding this comment.
Seems fine -- one question is can we make docker an optional dependency? Not sure how main path you see this to be, especially given all the try-except on importing docker. If its required, we dont need try excepts?
There was a problem hiding this comment.
Where'd we get this? If we took it from somewhere can we credit?
There was a problem hiding this comment.
oh yeah let me add a comment to the script. Is modified from https://github.com/llm-in-sandbox/llm-in-sandbox
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
For now, I'll keep docker as a main dep, and remove the import guards. If it becomes onerous in the future we can change it fairly easily. |
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Adds Docker sandbox infrastructure for RL training with code execution, adapted from PR #1453:
backends.py—SandboxBackendABC +DockerBackend:mem_limit)put_archive/get_archivefor robust file I/O (handles large files, binary data)remove=Truecontainers with graceful stop, kill-on-failure fallbackrun_codevalidates language (currently Python only)read_filesupports binary mode, raisesFileNotFoundError/IsADirectoryErrorgeneric_sandbox.py—GenericSandboxEnv:execute_bash— stateful bash (env vars and cwd persist between calls via wrapper script)str_replace_editor— file viewer/editor (view/create/str_replace/insert) with correct line numbering in view rangescoerce_argsfor type-safe tool arguments from model outputFileNotFoundErrorin editor to return model-friendly errorsGenericSandboxEnvConfigwithbackend,image,mem_limit,penalty,write_prompt_file,timeoutIntegration:
generic_sandboxinTOOL_REGISTRYfor auto-discovery from datasetsopen_instruct.environmentspackagedocker>=7.0.0dependencyDebug:
scripts/train/debug/envs/sandbox_lm_1gpu.sh— 1-GPU training scriptscripts/train/debug/envs/sandbox_lm_system_prompt.txt— system prompt for sandbox tasksTest plan
pytest tests/test_environments.py)TOOL_REGISTRYincludesgeneric_sandboxmake style && make quality)sandbox_lm_1gpu.shon a GPU machine with DockerWorking 8-gpu beaker script: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJ4EKS3XC153FKTTJEN132PM?taskId=01KJ4EKS43ZGDVZMX0MXJ7F7X7&jobId=01KJ4EKS7WWBS76ZGC7C447GNB