Tags: allenai/open-instruct
Add model merging scripts for Beaker (#1459)

* Add model merging scripts for Beaker

  Two approaches for merging HuggingFace models on Beaker:
  - mergekit_merge.sh: Uses mergekit for standard architectures
  - direct_merge.sh: Direct safetensors averaging for all architectures, including hybrid models that mergekit doesn't support

* Add changelog entry for merge scripts

* Validate weights length and zero-sum in direct_merge.py

* Print merge config in Beaker logs, add test run links to README

* Fix direct_merge.py: update docstring path, remove unused import, use +=, allow tokenizer overwrite

* Fix direct_merge.sh: rename /tmp/linear_merge.py to /tmp/direct_merge.py

* Address PR review: refactor merge_models, add tests, document dual approach

  - Move scripts/merge/direct_merge.py to open_instruct/merge_models.py
  - Apply all code review fixes: Path types, logger, math.isclose, model_weights rename, helper functions, module-level parser
  - Add 14 unit tests (synthetic + pythia-14m integration)
  - Update shell scripts to reference new module path
  - Fix naming inconsistencies (linear_merge -> direct_merge)
  - Document why both mergekit and direct merge are needed

* Move launch_merges.sh example into README

  Address PR feedback: launch_merges.sh is an example script, not something that should be checked in. Moved its content into a "Batch launching" section in the README.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
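The direct-merge approach above boils down to a validated weighted average of model state dicts. Below is a minimal pure-Python sketch of that idea, assuming the validation described in the changelog (weights length must match the model count, and weights must not sum to approximately zero, checked with math.isclose). The function name and the normalization step are illustrative assumptions; the real open_instruct/merge_models.py operates on safetensors tensors.

```python
import math


def merge_state_dicts(state_dicts, weights):
    """Weighted average of model state dicts (here: dicts of float lists).

    Validates that there is one weight per model and that the weights do
    not sum to (approximately) zero before normalizing and averaging.
    """
    if len(weights) != len(state_dicts):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if math.isclose(total, 0.0, abs_tol=1e-9):
        raise ValueError("weights must not sum to zero")
    normalized = [w / total for w in weights]
    merged = {}
    for key in state_dicts[0]:
        # elementwise weighted sum across all models for this parameter
        merged[key] = [
            sum(w * sd[key][i] for w, sd in zip(normalized, state_dicts))
            for i in range(len(state_dicts[0][key]))
        ]
    return merged
```

Merging {"w": [1.0, 3.0]} and {"w": [3.0, 5.0]} with equal weights yields {"w": [2.0, 4.0]}.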
Add OLMo-core based DPO training module (#1391)

* Add OLMo-core based DPO training module

  - Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
  - Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
  - Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
  - Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
  - Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
  - Update debug scripts to use torchrun with OLMo-core models

* Cleaned up PR.

* Add OLMo-core train modules for DPO training

* Fix SpeedMonitorCallback parameter name

  Change device_peak_flops_per_second to device_peak_flops to match the OLMo-core API.

* Fix CheckpointerCallback save_interval validation

  Set default checkpointing_steps to 500 when not specified, since the OLMo-core API requires save_interval >= 1.

* Move checkpointing_steps default value to config class

  Move the default value for checkpointing_steps (500) from dpo.py to the CheckpointConfig dataclass in dpo_utils.py. This centralizes the default and removes the conditional logic in the callback setup.

* Remove duplicate checkpointing_steps field from ExperimentConfig

  The checkpointing_steps field was defined in both CheckpointConfig (the parent class) and ExperimentConfig. The duplicate field in ExperimentConfig had default=None, which overrode the parent class's default of 500, causing a TypeError when int() was called on None in dpo.py.

* Add Saturn cluster to medium_dpo.sh script

  Add Saturn as an alternative cluster to help with multi-node scheduling reliability.
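The duplicate checkpointing_steps bug described above is a general dataclass pitfall: redeclaring an inherited field replaces the parent's default. A self-contained illustration, with simplified stand-in class names (not the actual open-instruct config classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfigSketch:  # stand-in for the real CheckpointConfig
    checkpointing_steps: Optional[int] = 500


@dataclass
class ExperimentConfigBuggy(CheckpointConfigSketch):
    # Redeclaring the field shadows the parent's default of 500, so
    # int(cfg.checkpointing_steps) raises TypeError downstream.
    checkpointing_steps: Optional[int] = None


@dataclass
class ExperimentConfigFixed(CheckpointConfigSketch):
    # Inherit the parent's default; do not redeclare the field.
    pass
```

Here ExperimentConfigBuggy().checkpointing_steps is None, while ExperimentConfigFixed().checkpointing_steps is 500.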
* updated changelog

* Remove explicit torchrun multi-node args from DPO scripts

  OLMo-core's prepare_training_environment() handles multi-node setup internally using Beaker's environment variables. The explicit --nnodes, --standalone, and --rdzv_backend=c10d arguments interfere with this and cause RendezvousTimeoutError on multi-node runs.

* fixed linter errors

* Refactor DPO OLMo-core: add parallelism support, fix HSDP order

  - Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
  - Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
  - Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
  - Fix critical bug: apply HSDP before computing reference logprobs cache
  - Add LoRA error check (not supported with OLMo-core)
  - Remove unreachable make_disable_adapter_context function
  - Reorganize DPO scripts to scripts/train/debug/dpo/
  - Add local.sh for testing without Beaker

* Fix race condition in reference logprobs cache directory creation

  Only the main process should create the cache directory and test write permissions. Other ranks now wait at a barrier until this is complete.

* Fix multi-node DPO post-training barrier failures

  Two barrier issues caused "Connection closed by peer" gloo errors during post-training cleanup:

  1. An unconditional barrier at the start of _handle_post_training, called even when distributed training wasn't active
  2. An asymmetric barrier inside the Beaker save conditional: only the main process reached this code due to the is_main_process check, so non-main processes hung at the barrier while the main process did file I/O

  Fix: gate the initial barrier on is_distributed() and remove the asymmetric inner barrier entirely, since only the main process enters that code block anyway.
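The asymmetric-barrier bug above generalizes to any SPMD code: every rank must reach the same barriers in the same order, and a barrier inside a rank-0-only branch hangs all other ranks. A sketch of the corrected pattern, using threading.Barrier as a stand-in for torch.distributed.barrier() (the worker structure and names are illustrative, not the actual open-instruct code):

```python
import threading


def worker(rank, barrier, results):
    """Each 'rank' runs this function, mimicking one distributed process."""
    results[rank] = "pre-save work done"
    # Correct: ALL ranks hit the same barrier before rank 0 does I/O.
    barrier.wait()
    if rank == 0:
        # Only rank 0 saves to disk -- and crucially, there is no
        # barrier inside this branch, since other ranks never enter it.
        results["saved"] = True


world_size = 4
barrier = threading.Barrier(world_size)
results = {}
threads = [
    threading.Thread(target=worker, args=(r, barrier, results))
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Had barrier.wait() been placed inside the `if rank == 0:` block instead, ranks 1-3 would never reach it and rank 0 would block forever, which is the deadlock the fix removes.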
* Remove redundant compute_loss_olmo wrapper function

  ExperimentConfig inherits from DPOConfig, so compute_loss() accepts ExperimentConfig directly. The wrapper was unnecessarily creating a new DPOConfig object when one wasn't needed.

* run urgent tests

* Fix case-insensitive beaker secret lookup

  Beaker stores secret names case-insensitively, but Python's `in` operator is case-sensitive. This caused lookups for `finbarrt_WANDB_API_KEY` to fail when the secret was stored as `FINBARRT_WANDB_API_KEY`.

* Updated mason.py

* Add uv run prefix to local DPO script

* Save DPO models in HuggingFace format for evals

  DPO training was saving models in olmo-core format, but eval jobs and push_folder_to_hub expect HuggingFace format. Use olmo-core's save_hf_model() to convert the trained model to HF format in output_dir/hf_model/ before launching evals or pushing to hub.

* Fix WEKA_CLUSTERS import in submit_eval_jobs.py

  WEKA_CLUSTERS is defined in launch_utils, not utils. Import launch_utils and use launch_utils.WEKA_CLUSTERS instead of utils.WEKA_CLUSTERS.

* Update GRPO single GPU script to use DPO-trained model

  Use the DPO-trained OLMo model from allenai/open_instruct_dev with revision dpo_olmo_core_debug_test instead of Qwen/Qwen3-1.7B.

* Add --add_bos flag for OLMo model in GRPO script

  OLMo models require the --add_bos flag to be set.

* Copy original HF config when saving DPO model

  The save_hf_model() function creates an incorrect config.json with wrong values for num_hidden_layers, eos_token_id, etc.
  Copy the original model's config.json to preserve the correct values.

* Use Weka path directly for DPO model in GRPO test

  The HuggingFace model config was still incorrect, so use the Weka path directly where the model was saved.

* Add logging for config.json save in DPO

  Helps debug issues with the model config not being saved correctly.

* Update GRPO script to use new DPO model path

  Use the latest DPO model that was saved with a correct config.json.

* Fix DPO HF model saving to use correct layer count

  The save_hf_model function from olmo-core was creating extra layers in the output. Instead, use convert_state_to_hf with the original HuggingFace config and save using transformers' native save_pretrained.

* Fix OLMo-2-0425-1B config mapping to use correct layer count

  The olmo2_1B config has 18 layers, but the actual HuggingFace model has 16 layers. Use olmo2_1B_v2, which has the correct 16 layers.

* Fix HF model loading to use from_config instead of from_pretrained

  Cannot pass state_dict together with a model name. Use from_config to create the model, then load_state_dict to load the weights.

* Revert to using save_hf_model for DPO model saving

  The convert_state_to_hf approach doesn't work with DTensors from distributed training. Use save_hf_model, which handles DTensors properly. The config mapping has been fixed, so save_hf_model should now produce correct layer counts.

* Update GRPO script to use DPO model with correct 16 layers

  Use the model saved from the DPO run with fixed config mapping.
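The case-insensitive secret lookup fix noted above comes down to comparing lowercased names instead of relying on Python's case-sensitive `in` operator. A minimal sketch (find_secret is a hypothetical helper name, not the actual function in the repo):

```python
def find_secret(secret_names, wanted):
    """Case-insensitive lookup of a secret name.

    Beaker stores secret names case-insensitively, so a plain
    `wanted in secret_names` check can miss a match that differs
    only in case. Returns the stored spelling, or None.
    """
    wanted_lower = wanted.lower()
    for name in secret_names:
        if name.lower() == wanted_lower:
            return name
    return None
```

For example, find_secret(["FINBARRT_WANDB_API_KEY"], "finbarrt_WANDB_API_KEY") returns "FINBARRT_WANDB_API_KEY", where a plain `in` test would report no match.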
* Copy original HF config after save_hf_model

  The save_hf_model function creates an incomplete config.json that is missing fields like max_position_embeddings. Copy the original model's config to ensure vLLM can load the model.

* Update GRPO script to use DPO model with complete config

  Use the model from the DPO run with the copied original config that includes max_position_embeddings.

* Add OLMo3-7B DPO script using OLMo-core trainer

  New script that uses dpo.py (OLMo-core + FSDP) instead of dpo_tune_cache.py (Accelerate + DeepSpeed) for DPO training. Configured for 2 nodes with 8k sequence length.

* Add documentation for adding OLMo-core models

* Add --no_auto_dataset_cache to DPO script

* Fix multi-node torchrun configuration for DPO

  Add missing torchrun multi-node parameters:
  - --nnodes to specify the total number of nodes
  - --node_rank for each node's rank
  - --master_addr for the coordinator address
  - --master_port for the coordinator port

  These use Beaker environment variables that get substituted at runtime. Without them, each node ran independently without distributed communication.

* Fix nnodes to use hardcoded value instead of BEAKER_NUM_REPLICAS

  BEAKER_NUM_REPLICAS is not a valid Beaker environment variable. Use a hardcoded value of 2 to match --num_nodes.

* Add torchrun multi-node parameters to debug DPO multi_node.sh

  Same fix as 7b_instruct_dpo_olmo_core.sh: add nnodes, node_rank, master_addr, and master_port for proper multi-node coordination.
* Add OLMO_SHARED_FS=1 env var for multi-node DPO scripts

  OLMo-core's checkpointing code requires this env var to be set when using a shared filesystem (like Weka) to avoid unnecessary distributed coordination for filesystem operations.

* Add comment about cache cleanup for corrupted dataset cache

* Remove cache cleanup comment

* Support separate model config and weights for OLMo-core DPO

  Allow users to specify a config_name separately from model_name_or_path, enabling local model paths to work with OLMo-core DPO training.

* Fix save_hf_model for FSDP-wrapped models in DPO

  Add an export_to_hf() function that builds an unwrapped model from config and loads the FSDP state dict before saving. This avoids the type-check failure in olmo-core's get_hf_config() for FSDP-wrapped models.

* Fix DTensor to Tensor conversion in export_to_hf

  Convert DTensors from the FSDP state dict to regular CPU tensors before loading into the unwrapped model.

* Fix FSDP state_dict collective operation for multi-node export

  All ranks must participate in model.state_dict(), as it is a collective operation for FSDP models. Only rank 0 now saves to disk.

* Add detailed logging to export_to_hf for debugging

  Log entry/exit for all ranks and each step in the export process.

* Fix DTensor full_tensor() collective operation in export

  The full_tensor() call on DTensors is a collective operation that requires all ranks to participate.
  Move the conversion outside the is_main_process check so all ranks call full_tensor().

* Clean up debug logging in export_to_hf

* Fix missing indices in DPO reference logprobs caching

  Add a drop_last parameter to HFDataLoader. When drop_last=False, pad the remainder with repeated indices to fill a complete batch, ensuring all dataset indices are processed. Use drop_last=False for the cache-building dataloader to prevent -inf values in the reference logprobs cache.

* Add MFU/memory/token metrics to cache building + 3x cache batch size

  The forward-only cache pass doesn't store activations, so we can use 3x the training batch size. Also display avg_tok/ex, MFU%, and mem_GB in the tqdm progress bar during cache building.

* Add --cache_logprobs_only flag for DPO cache forward-pass benchmarking

* Update DPO cache benchmark to match production OLMo3-7B config

* Avoid the torch warning

* 6x cache batch size + mem% in DPO cache tqdm

* Reduce cache batch multiplier to 4x (6x OOMed)

* Try unsharded cache build, fall back to FSDP on OOM

  The DPO reference logprobs cache is forward-only (no backward pass), so the full unsharded model may fit in GPU memory and avoids allgather communication overhead. If it OOMs, we catch the error, clear the CUDA cache, apply FSDP, and retry.
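The drop_last=False padding described in the reference logprobs caching fix can be sketched as follows. This is a hypothetical stand-in for HFDataLoader's index batching, showing how the final partial batch is padded with repeated indices so every dataset row is visited:

```python
def batch_indices(num_examples, batch_size, drop_last):
    """Yield fixed-size batches of dataset indices.

    With drop_last=True the trailing partial batch is discarded, leaving
    some rows unvisited. With drop_last=False the remainder is padded by
    repeating its own indices until the batch is full, so every index is
    processed at least once (preventing, e.g., -inf entries in a cache
    built from these batches).
    """
    indices = list(range(num_examples))
    full_end = (num_examples // batch_size) * batch_size
    for start in range(0, full_end, batch_size):
        yield indices[start:start + batch_size]
    remainder = indices[full_end:]
    if remainder and not drop_last:
        # Cycle through the remainder until the batch is full.
        padded = (remainder * batch_size)[:batch_size]
        yield padded
```

With 5 examples and batch size 2, drop_last=True yields [0, 1] and [2, 3] (index 4 is never cached), while drop_last=False additionally yields the padded batch [4, 4].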
* Fix data loader tests that used single_example_collator with batch_size > 1

* Fix attn_backend auto-detection: check flash_attn_3 availability

  The auto-detection was selecting flash_3 for H100 GPUs without checking whether the package is actually installed, causing a RuntimeError on startup.

* added export to HF function

* Added script to convert olmo core to HF format.

* Add example usage to olmo-core to HF conversion script

* Fix code review issues in convert_olmo_core_to_hf.py

  - Use logger instead of print for output
  - Remove unused model.load_state_dict() call

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
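The attn_backend auto-detection fix above can be sketched with importlib.util.find_spec, which reports whether a package is importable without importing it. The backend names, the capability threshold, and the function name here are illustrative assumptions, not the exact open-instruct logic:

```python
import importlib.util


def pick_attn_backend(device_capability):
    """Choose an attention backend, preferring flash-attention only when
    the corresponding package is actually installed.

    Selecting flash_3 purely from the GPU's compute capability (e.g.
    (9, 0) for H100) without this check caused a RuntimeError at startup
    when the flash_attn_3 package was missing.
    """
    if device_capability >= (9, 0) and importlib.util.find_spec("flash_attn_3"):
        return "flash_3"
    if importlib.util.find_spec("flash_attn"):
        return "flash_2"
    return "torch_sdpa"  # always-available fallback
```

On a machine without either flash-attention package, this degrades gracefully to the fallback instead of failing at startup.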