Testing optimization methods for training GPT models using Hydra configuration and SLURM scheduling.
./setup_env.sh
wandb login # optional, for experiment tracking
./tests/test_hydra.sh # run a small example on gpt-tiny with the Shakespeare data
# Basic run with Hydra
python run_hydra.py model=gpt-tiny optimizer=adamw data=shakespeare training=shakespeare
# Override specific parameters
python run_hydra.py \
model=gpt-medium \
optimizer=adamw \
optimizer.optimizer_params.lr=0.001 \
training.training_params.batch_size=32
Outputs are written to outputs/<model>/<dataset>/<optimizer>/<run_name>/
For running grid searches across multiple GPUs:
./slurm_scripts/submit.sh \
scripts/run_slim_pajama10B_adam_medium.sh \
param_configs/adamw.json \
experiment_name \
8 # number of GPUs
Example with a specific partition:
PARTITION=gpu ./slurm_scripts/submit.sh scripts/run_*.sh params.json exp_name 4
For distributed training across multiple nodes:
./slurm_scripts/submit_nodes_ddp.sh \
scripts/run_slim_pajama10B_adam_large.sh \
param_configs/adamw.json \
experiment_name \
4 # number of nodes
With custom GPUs, partition, or constraint:
./slurm_scripts/submit_nodes_ddp.sh \
scripts/run_*.sh \
params.json \
exp_name \
8 \
--num_gpus=8 \
--partition=gpu \
--constraint=a100
Options:
- --num_gpus=N: GPUs per node (default: 4)
- --partition=P: SLURM partition (default: gpuxl)
- --constraint=C: GPU constraint (e.g., h100, a100)
Hydra lets us assemble experiment configurations from small, composable pieces and override them at the command line without editing YAML files. Each config group in hydra_conf/ corresponds to a different aspect of the training stack (model architecture, optimizer, dataset, logging, etc.). Hydra merges the selected configs into a single runtime configuration, snapshots it in outputs/<run>/.hydra/, and makes every field accessible to Python code via the OmegaConf API.
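For example, Hydra's standard --cfg flag prints the composed configuration without launching a run, and the snapshot saved under .hydra/ can be inspected afterwards (a quick sanity check using the basic run command from above):
# Print the merged runtime config assembled from the selected groups
python run_hydra.py model=gpt-tiny optimizer=adamw data=shakespeare training=shakespeare --cfg job
# Inspect the config snapshot that a finished run saved next to its outputs
cat outputs/<model>/<dataset>/<optimizer>/<run_name>/.hydra/config.yaml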
For a deeper dive into how Hydra works, common patterns, and the reasoning behind our layout, see the dedicated guide in docs/hydra.md.
hydra_conf/
├── model/ # gpt-small, gpt-medium, gpt-large
├── optimizer/ # adamw, dap-ns-nadam, etc.
├── training/ # slim_pajama10B, fineweb1B, etc.
├── data/ # Dataset configurations
└── logging/ # default, wandb
Each directory is a config group. The default choice for a group is defined in hydra_conf/config.yaml, but CLI overrides let you swap any component on the fly. You can also enable additional configs without replacing the defaults by prefixing with +, which is useful for stacking logging or callback configs.
python run_hydra.py \
model=gpt-large \ # Use model/gpt-large.yaml
optimizer=adamw \ # Use optimizer/adamw.yaml
training=slim_pajama10B \ # Use training/slim_pajama10B.yaml
data=slim_pajama10B # Use data/slim_pajama10B.yaml
Hydra parses the dotted overrides (optimizer.optimizer_params.lr=...) and updates only that field, leaving the rest of the config untouched. You can chain as many overrides as needed, and they are type-checked against the schema in the YAML files.
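The + prefix mentioned above appends rather than replaces. A minimal sketch combining the two styles (the seed field is illustrative and assumes it is not already defined in the training config):
# Swap a default group, tweak an existing field, and append a new one
python run_hydra.py \
model=gpt-small \
optimizer.optimizer_params.lr=0.0003 \
+training.training_params.seed=42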
Hydra's multirun mode (-m) generates cartesian products of parameter values—handy for local sweeps before scaling out to SLURM:
python run_hydra.py -m \
optimizer.optimizer_params.lr=0.0003,0.001,0.003 \
training.training_params.batch_size=16,32
Runs are numbered under multirun/<timestamp>/ and each one captures the exact config it used. The SLURM sweep scripts reuse the same idea under the hood by materializing all override combinations.
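To see what a sweep produced, list the numbered run directories and check the overrides each one recorded (a sketch; the timestamped path is illustrative):
ls multirun/2024-01-01/12-00-00/
# 0/ 1/ 2/ 3/ 4/ 5/  -> one directory per parameter combination
cat multirun/2024-01-01/12-00-00/0/.hydra/overrides.yaml
# - optimizer.optimizer_params.lr=0.0003
# - training.training_params.batch_size=16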
hydra_conf/model/gpt-small.yaml
config:
n_embd: 768
n_layer: 12
n_head: 12
flash_attention: True
name: gpt-small
hydra_conf/optimizer/adamw.yaml
optimizer_params:
name: adamw
lr: 0.001
weight_decay: 0
lr_schedule: constant-linear
param_configs/my_sweep.json
{
"lr": [0.0001, 0.0003, 0.001],
"wd": [0.0, 0.01, 0.1]
}
scripts/my_experiment.sh
#!/bin/bash
lr=${1:-0.0001}
wd=${2:-0.1}
python run_hydra.py \
model=gpt-medium \
optimizer=adamw \
optimizer.optimizer_params.lr=$lr \
optimizer.optimizer_params.weight_decay=$wd
./slurm_scripts/submit.sh \
scripts/my_experiment.sh \
param_configs/my_sweep.json \
my_sweep_name \
9 # Run all 9 combinations (3×3) in parallel
This generates all parameter combinations and runs them in parallel. Each run is logged separately.
GPT-opt/
├── scripts/ # Training wrapper scripts
├── slurm_scripts/ # SLURM submission scripts
├── param_configs/ # Parameter sweep JSON files
├── hydra_conf/ # Modular Hydra configs
├── gptopt/ # Python package (models, optimizers, training)
├── logs/ # SLURM job metadata & task outputs
├── slurm_logs/ # SLURM system logs (.out, .err)
├── outputs/ # Training results & checkpoints
└── disbatch_logs/ # Task distribution logs
When you submit a job, logs are organized as:
logs/<experiment_name>/
└── run_info_N/ # Auto-incremented for each submission
├── train.sh # Copy of your training script
├── params.json # Copy of parameter file
├── tasks # Generated task file
└── logs/ # Per-task outputs
├── log_0.out
├── log_0.err
└── ...
Training outputs go to:
outputs/<model>/<dataset>/<optimizer>/<run_name>/
├── .hydra/config.yaml # Full config snapshot
├── task.log # Training log
├── wandb/ # WandB files (if enabled)
└── <optimizer>-*.json # Training metrics
SLURM system logs:
slurm_logs/
├── slurm_job_<id>.out
└── slurm_job_<id>.err
python run_hydra.py \
model=gpt-small \
logging=wandb # Use logging/wandb.yaml
Edit hydra_conf/logging/wandb.yaml:
logging_params:
wandb:
project: "my_project"
name: ${model.name}_${optimizer.optimizer_params.name}
When using submit.sh or submit_nodes_ddp.sh, all parameter combinations are automatically grouped in WandB under the experiment name for easy comparison.
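The same fields can also be overridden from the command line instead of editing the YAML (a sketch that assumes the logging group is packaged under the logging key, like the model and optimizer groups above):
python run_hydra.py \
model=gpt-small \
logging=wandb \
logging.logging_params.wandb.project=my_project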
# Check job status
squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R" --me
# Watch live
watch -n 2 'squeue --me'
# View task output
tail -f logs/<experiment_name>/run_info_N/logs/log_0.out
scancel <job_id> # Cancel a specific job
scancel --name=<experiment_name> # Cancel by name
scancel -u $USER # Cancel all your jobs
# Start an interactive GPU session
srun --gpus=1 --cpus-per-gpu=8 --time=4:00:00 --partition=gpu --pty bash
module load python
source venv/bin/activate
The scripts/ directory contains Bash scripts that call run_hydra.py with predefined configurations; they accept parameters as positional arguments.
Example: scripts/run_slim_pajama10B_adam_medium.sh
#!/bin/bash
lr=${1:-0.0001}
wd=${2:-0.1}
python run_hydra.py \
model=gpt-medium \
optimizer=adamw \
training=slim_pajama10B \
data=slim_pajama10B \
optimizer.optimizer_params.lr=$lr \
optimizer.optimizer_params.weight_decay=$wd
The slurm_scripts/ directory provides:

| Script | Purpose |
|---|---|
| submit.sh | Submit standard parameter sweeps |
| submit_nodes_ddp.sh | Submit multi-node DDP training |
| sbatch.sh | Execute standard GPU jobs |
| sbatch_ddp.sh | Execute multi-node DDP jobs |
| launch_ddp_local.sh | Launch torchrun for local DDP |
| rename_utils.sh | Auto-increment run_info directories |
Standard sweep workflow (submit.sh):
- Creates the logs/<experiment_name>/run_info/ directory
- Copies your training script and parameter file
- Renames the directory to run_info_N (auto-incrementing)
- Generates a task file with all parameter combinations (sketched below)
- Submits to SLURM via sbatch.sh
- Uses disBatch to distribute tasks across GPUs
- Saves each task's output to logs/<experiment_name>/run_info_N/logs/
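The task file is plain text with one shell command per line, which disBatch then farms out across the allocated GPUs. A hypothetical sketch for the 3×3 my_sweep.json example above (the actual lines are generated by submit.sh and may look different):
# tasks (illustrative): one line per (lr, wd) combination
bash train.sh 0.0001 0.0 > logs/log_0.out 2> logs/log_0.err
bash train.sh 0.0001 0.01 > logs/log_1.out 2> logs/log_1.err
bash train.sh 0.0001 0.1 > logs/log_2.out 2> logs/log_2.err
# ... remaining combinations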
Multi-node DDP workflow (submit_nodes_ddp.sh):
- Same setup as the standard workflow
- Prefixes each task with launch_ddp_local.sh
- Submits to SLURM via sbatch_ddp.sh
- Each node runs one task using torchrun with N GPUs (see the sketch below)
- PyTorch DDP handles distributed training coordination
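Roughly speaking, launch_ddp_local.sh wraps the training command in a torchrun invocation along these lines (an illustrative sketch, not the script's exact flags; the rendezvous settings are assumptions):
# Illustrative: one node of a 4-node job with 4 GPUs per node
torchrun \
--nnodes=4 \
--nproc_per_node=4 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
run_hydra.py model=gpt-large optimizer=adamw training=slim_pajama10B data=slim_pajama10B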
| Task | Command |
|---|---|
| Run locally | python run_hydra.py model=gpt-small optimizer=adamw data=shakespeare |
| Submit sweep | ./slurm_scripts/submit.sh scripts/run_*.sh params.json exp_name 8 |
| Submit DDP | ./slurm_scripts/submit_nodes_ddp.sh scripts/run_*.sh params.json exp_name 4 |
| Custom GPUs/partition | Add --num_gpus=8 --partition=gpu --constraint=h100 |
| Override param | python run_hydra.py optimizer.optimizer_params.lr=0.001 |
| Check jobs | squeue --me |
| View logs | tail -f logs/<exp>/run_info_N/logs/log_0.out |
# Old YAML-based config
python run.py --config configs/shakespeare.yaml
For questions or issues, check the code or contact the maintainers.