Add Docker sandbox backend and GenericSandboxEnv environment by hamishivi · Pull Request #1490 · allenai/open-instruct

hamishivi · Feb 22, 2026

Summary

Adds Docker sandbox infrastructure for RL training with code execution, adapted from PR #1453:

backends.py — SandboxBackend ABC + DockerBackend:

Configurable image, command timeout, and memory limit (mem_limit)
put_archive / get_archive for robust file I/O (handles large files, binary data)
Command output truncated to 1MB to prevent OOM
remove=True containers with graceful stop, kill-on-failure fallback
run_code validates language (currently Python only)
read_file supports binary mode, raises FileNotFoundError / IsADirectoryError

generic_sandbox.py — GenericSandboxEnv:

execute_bash — stateful bash (env vars and cwd persist between calls via wrapper script)
str_replace_editor — file viewer/editor (view/create/str_replace/insert) with correct line numbering in view ranges
coerce_args for type-safe tool arguments from model output
Configurable error penalty (default -0.05)
Reset with 3-attempt retry, fresh backend per reset
Git init with identity config (only if git available in image)
Catches FileNotFoundError in editor to return model-friendly errors
Immutable tool definitions tuple
GenericSandboxEnvConfig with backend, image, mem_limit, penalty, write_prompt_file, timeout

Integration:

Registered as generic_sandbox in TOOL_REGISTRY for auto-discovery from datasets
Exported from open_instruct.environments package
Added docker>=7.0.0 dependency

Debug:

scripts/train/debug/envs/sandbox_lm_1gpu.sh — 1-GPU training script
scripts/train/debug/envs/sandbox_lm_system_prompt.txt — system prompt for sandbox tasks

Test plan

Existing environment tests pass (pytest tests/test_environments.py)
All imports resolve correctly
TOOL_REGISTRY includes generic_sandbox
Linter passes (make style && make quality)
Test sandbox_lm_1gpu.sh on a GPU machine with Docker

Working 8-gpu beaker script: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJ4EKS3XC153FKTTJEN132PM?taskId=01KJ4EKS43ZGDVZMX0MXJ7F7X7&jobId=01KJ4EKS7WWBS76ZGC7C447GNB

gemini-code-assist · Feb 22, 2026

Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reinforcement learning capabilities by integrating robust code execution environments. It introduces a Docker-based sandbox backend, allowing for secure and controlled execution of code and commands during RL training. Two new environments, SandboxEnv and SandboxLMEnv, provide varying levels of interaction, from basic script execution to a full-featured, stateful coding environment, enabling models to interact with a simulated system for complex problem-solving. This foundational work expands the types of tasks that can be tackled within the RL framework, particularly those requiring dynamic code interaction.

Highlights

New Sandbox Infrastructure: Introduced a SandboxBackend abstract base class and its DockerBackend implementation, enabling code execution within Docker containers with features like command timeouts, output truncation, and memory limits.
New RL Environments: Added SandboxEnv for basic code execution tasks with execute, run_code, and submit tools, and SandboxLMEnv which mirrors the llm-in-sandbox interface, offering stateful execute_bash and a str_replace_editor for advanced file manipulation.
Tool Integration and Configuration: Both new sandbox environments are registered in TOOL_REGISTRY for auto-discovery from datasets, and their configurations (SandboxEnvConfig, SandboxLMEnvConfig) are integrated into the environment system.
Dependency and Debugging Support: The docker Python SDK (version >= 7.0.0) has been added as a dependency, and a 1-GPU debug script (sandbox_lm_1gpu.sh) with a specific system prompt has been included for testing the SandboxLMEnv.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

CHANGELOG.md
- Added Docker sandbox backend and SandboxEnv/SandboxLMEnv environments for code execution during RL training. Includes DockerBackend with command timeout and output truncation, SandboxEnv (basic execute/run_code/submit), SandboxLMEnv (execute_bash + str_replace_editor mirroring llm-in-sandbox), and a 1-GPU debug script.

Activity

The pull request was created by hamishivi, introducing new sandbox environments and a Docker backend for RL training.
The author provided a detailed summary of the changes, including the new backend, environments, and a debug script.
A test plan was outlined, covering existing environment tests, import resolution, TOOL_REGISTRY inclusion, linter checks, and a pending test for the sandbox_lm_1gpu.sh script on a GPU machine with Docker.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The pull request introduces a robust Docker-based sandbox infrastructure for RL training, including two new environments (SandboxEnv and SandboxLMEnv). The implementation is well-structured, leveraging an abstract base class for backends and providing stateful bash execution. I have identified a few areas for improvement, primarily around path safety in shell commands, hardcoded resource limits, and minor path manipulation logic.

I am having trouble creating individual review comments. Click here to see my feedback.

open_instruct/environments/backends.py (128)

The path variable is injected directly into a shell command without quoting. This will fail if the path contains spaces or special characters. Using shlex.quote ensures the path is safely handled by the shell.

        self._container.exec_run(["bash", "-c", f"echo '{encoded_content}' | base64 -d > {shlex.quote(path)}"])

open_instruct/environments/backends.py (69)

The memory limits for the Docker container are currently hardcoded to 4GB. It would be better to make these configurable via the constructor to allow for different task requirements.

    def __init__(self, image: str = "ubuntu:24.04", timeout: int = 1800, mem_limit: str = "4g"):
        """
        Args:
            image: Docker image to use (default: ubuntu:24.04)
            timeout: Per-command timeout in seconds (default: 1800 / 30 min)
            mem_limit: Memory limit for the container (default: 4g)
        """
        self._image = image
        self._timeout = timeout
        self._mem_limit = mem_limit
        self._container = None
        self._client = None

open_instruct/environments/backends.py (88)

Using the configurable memory limit instead of a hardcoded value.

        self._container = self._client.containers.run(
            self._image, command="sleep infinity", detach=True, remove=True, mem_limit=self._mem_limit, memswap_limit=self._mem_limit
        )

open_instruct/environments/backends.py (134)

When executing cat, it's safer to use -- to signal the end of command options. This prevents cat from interpreting a filename starting with a hyphen as an option.

        exit_code, output = self._container.exec_run(["cat", "--", path])

open_instruct/environments/sandbox_lm.py (271)

Manual path manipulation using string splitting can be brittle. Since the sandbox is Linux-based, using posixpath.dirname is a more robust and standard way to extract the directory name.

        import posixpath
        parent = posixpath.dirname(path)

hamishivi · Feb 22, 2026

/gemini review

gemini-code-assist

Code Review

This pull request introduces a Docker-based sandbox for code execution in RL environments, adding a SandboxBackend abstraction, a DockerBackend implementation, and a GenericSandboxEnv that provides execute_bash and file editing tools. The changes are well-structured and the implementation is solid. I've provided a few suggestions to improve the robustness of the DockerBackend implementation, particularly around file I/O, and a minor correction to a system prompt file.

hamishivi · Feb 22, 2026

/gemini review

gemini-code-assist

Code Review

The pull request introduces a Docker sandbox backend and two new environments, SandboxEnv and SandboxLMEnv, for RL training with code execution. It also updates the CHANGELOG.md, open_instruct/environments/__init__.py, open_instruct/environments/tools/tools.py, and pyproject.toml files to integrate these new features and dependencies. A new debug script sandbox_lm_1gpu.sh and its corresponding system prompt file sandbox_lm_system_prompt.txt are added. The changes enhance the project's capability for code-aware RL agents by providing a sandboxed execution environment.

hamishivi · Feb 22, 2026

/gemini review

gemini-code-assist

Code Review

This pull request introduces a Docker sandbox backend and two new environments, SandboxEnv and SandboxLMEnv, for RL training with code execution. The changes include new files for backend implementation (backends.py) and the sandbox environments (generic_sandbox.py), updates to __init__.py and tools.py to register these new components, and a new debug script (sandbox_lm_1gpu.sh) along with its system prompt. The pyproject.toml and uv.lock files have also been updated to include the docker dependency. The code generally looks good, with clear abstractions for the sandbox backend and well-defined tool interfaces. I've identified a few areas for improvement regarding error handling, logging, and consistency in the CHANGELOG.md entry.

hamishivi · Feb 22, 2026

/gemini review

gemini-code-assist

Code Review

This pull request introduces a robust Docker-based sandbox infrastructure for RL training, featuring a stateful bash environment and a file editor tool. The implementation of state persistence in Docker containers via environment variable and CWD tracking is clever. However, there are a few issues regarding command defaults, potential hangs in file reading, and image-specific dependencies that should be addressed to ensure reliability across different Docker images.

Adds sandbox infrastructure for RL training with code execution: - backends.py: SandboxBackend ABC + DockerBackend (command timeout, output truncation, 4GB memory limit, put_archive file transfer) - generic_sandbox.py: GenericSandboxEnv with execute_bash + str_replace_editor (stateful bash, file editing, retry on reset) - Register as 'generic_sandbox' in TOOL_REGISTRY for auto-discovery - Add docker>=7.0.0 dependency - Add 1-GPU debug script and system prompt Co-authored-by: Cursor <cursoragent@cursor.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Without env_name, GenericSandboxEnv tools never get dispatched because the rollout loop never acquires the env or registers its inner tool names. This finds the first stateful env (not a simple Tool) and sets it as the default so samples without an explicit env_config column get dispatched correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When using --tools generic_sandbox, the pool is keyed by "generic_sandbox" but the model calls "execute_bash" and "str_replace_editor". These inner tool names were missing from tool_call_names, causing validate_dataset_tools to reject datasets listing them and EnvStatistics to miss their metrics. Fix: collect tool definitions from CLI pools first, build known_tool_names set, and include inner function names in tools_config.tool_call_names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mounts /var/run/docker.sock from the host so GenericSandboxEnv can create Docker containers inside Beaker jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

For stateful environments like GenericSandboxEnv, the pool key (e.g., "generic_sandbox") was included alongside the actual tool names ("execute_bash", "str_replace_editor") in tool_call_names. This caused the env name to appear in metrics and validation. Now only include tool definition names (what the model actually calls), not pool/env keys. Co-authored-by: Cursor <cursoragent@cursor.com>

With 16 unique prompts x 4 samples = 64 rollouts, pool_size=8 was a bottleneck causing long acquire waits. 16 gives one sandbox per unique prompt (64GB containers on Jupiter nodes). Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

run_code (write temp file + execute) is the same for all backends. Moved it to SandboxBackend as a concrete method so backends only need to implement run_command. Removed the DockerBackend override. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

natolambert

Seems fine -- one question is can we make docker an optional dependency? Not sure how main path you see this to be, especially given all the try-except on importing docker. If its required, we dont need try excepts?

natolambert · Feb 24, 2026

Where'd we get this? If we took it from somewhere can we credit?

oh yeah let me add a comment to the script. Is modified from https://github.com/llm-in-sandbox/llm-in-sandbox

Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · Feb 24, 2026

For now, I'll keep docker as a main dep, and remove the import guards. If it becomes onerous in the future we can change it fairly easily.

Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi force-pushed the add-docker-sandbox-env branch from b145ea6 to 1506151 Compare February 22, 2026 20:21

hamishivi force-pushed the add-docker-sandbox-env branch from 1506151 to 41cc028 Compare February 22, 2026 20:25