Learn Globally, Speak Locally:
Bridging the Gaps in Multilingual Reasoning

TL;DR: We introduce M2A and GeoFact-X to evaluate and improve multilingual reasoning in LLMs by aligning internal reasoning with the input language using language-consistency rewards.

Repository Structure

eval/: Evaluation tools for mathematical reasoning
dataset/: GeoFact-X dataset
factual_evaluation/: Factual reasoning evaluation scripts (GeoFact-X)
data/: Synthetic data generation scripts
scripts/: Shell scripts for training and evaluation
train/: Python training scripts
utils/: Utility functions and helpers

Training

Use the scripts in scripts/ to launch training.

Hardware recommendations:

Factual reasoning: ≥ 4 A100 GPUs
Mathematical reasoning: ≥ 8 A100 GPUs

Example: Multi-node training with Slurm

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
export MASTER_PORT=29500  # Change if needed
export NNODES=$SLURM_NNODES
GPUS_PER_NODE=$(nvidia-smi -L | wc -l)
export WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
export NODE_RANK=$SLURM_NODEID

echo "Master Node: $MASTER_ADDR"
echo "Running on $WORLD_SIZE GPUs across $NNODES nodes"

uid="$(date +%Y%m%d_%H%M%S)"

model_size=7
base_model="Qwen/Qwen2.5-${model_size}B-Instruct"
lr=1e-5
epochs=5
weight_decay=1e-4
micro_batch_size=1
gradient_accumulation_steps=1
push_to_hub=false

srun accelerate launch \
  --config_file deepspeed_zero3.yaml \
  --num_processes $WORLD_SIZE \
  --num_machines $NNODES \
  --main_process_ip $MASTER_ADDR \
  --machine_rank $NODE_RANK \
  --main_process_port $MASTER_PORT \
  --rdzv_backend c10d \
  train/math_m2a.py \
    --block_size=20000 \
    --train_file_path="simplescaling/s1K-1.1_tokenized" \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --model_name=${base_model} \
    --bf16=True \
    --eval_strategy="no" \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/M2A-${model_size}b-ep${epochs}-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True \
    --use-liger-kernel \
    --gradient_checkpointing=True \
    --use_grpo \
    --grpo_loss_coeff=0.01 \
    --sentence_level='mean+randomNegCos' \
    --metric='randomNegCos' \
    --mt5_max_len=15000
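
For orientation, the number of sequences per optimizer step implied by these settings is the product of the world size, the per-device micro batch size, and the gradient accumulation steps. The sketch below is a minimal illustration, assuming every process is a data-parallel worker (as with DeepSpeed ZeRO-3). The function name and the cluster shape (2 nodes with 4 GPUs each) are hypothetical and not taken from this repository.

```python
def effective_batch_size(nnodes: int, gpus_per_node: int,
                         micro_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Return the number of sequences that contribute to one optimizer step."""
    # Mirrors WORLD_SIZE=$((GPUS_PER_NODE * NNODES)) in the Slurm script.
    world_size = nnodes * gpus_per_node
    return world_size * micro_batch_size * gradient_accumulation_steps


# Hypothetical cluster: 2 nodes x 4 GPUs, with micro_batch_size=1 and
# gradient_accumulation_steps=1 as set in the script.
print(effective_batch_size(2, 4, 1, 1))  # 8
```

With a fixed GPU count, raising gradient_accumulation_steps is the usual way to increase this number without using more memory per device.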

Evaluation

Factual Reasoning

Update the model paths in the scripts as needed, then run:

sh scripts/eval_geofact-x.sh

Mathematical Reasoning

Mathematical reasoning is evaluated with a modified copy of lm-evaluation-harness, cloned at commit 4cec66e4e468d15789473d6d63c3a61a751fa524. To set it up:

cd eval/lm-evaluation-harness
pip install -e ".[math,vllm]"

To compute statistics (e.g., average thinking tokens) for an evaluation run, use:

python eval/compute_sample_stats.py path_to_samples_file.jsonl

All our evaluation result files are at: https://hf.co/datasets/simplescaling/results

The commands for running REBASE are in eval/rebase/run.sh.

MGSM for fast evaluation

tasks='mgsm_native_cot_bn,mgsm_native_cot_de,mgsm_native_cot_es,mgsm_native_cot_fr,mgsm_native_cot_ru,mgsm_native_cot_sw,mgsm_native_cot_te,mgsm_native_cot_th,mgsm_native_cot_zh,mgsm_native_cot_en,mgsm_native_cot_ja'
# Set model_name, num_gpu, and output_dir before running.
lm_eval --model vllm \
  --model_args pretrained=${model_name},dtype=auto,tensor_parallel_size=${num_gpu},gpu_memory_utilization=0.90,max_model_len=20000 \
  --tasks "$tasks" \
  --batch_size auto \
  --apply_chat_template \
  --output_path "${output_dir}" \
  --log_samples \
  --gen_kwargs "max_gen_toks=20000"
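
The task names above all follow the pattern mgsm_native_cot_<lang>, so the list can be generated from language codes instead of typed by hand. A minimal Python sketch; the helper name mgsm_tasks is hypothetical and not part of this repository:

```python
# Language codes in the same order as the tasks string above.
MGSM_LANGS = ["bn", "de", "es", "fr", "ru", "sw", "te", "th", "zh", "en", "ja"]


def mgsm_tasks(langs):
    """Join MGSM native-CoT task names into the comma-separated form lm_eval expects."""
    return ",".join(f"mgsm_native_cot_{lang}" for lang in langs)


print(mgsm_tasks(MGSM_LANGS))
```

Passing a subset such as mgsm_tasks(["en", "ja"]) gives a shorter run for a quick check.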

Measure Language Performance

python3 utils/language_detector.py
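
The internals of utils/language_detector.py are not described in this README. As a rough illustration of what a language-consistency measurement can look like, the sketch below checks whether a response is written mostly in the writing system expected for the input language. It is a script-level heuristic built on Python's unicodedata module, not the repository's detector. The function names, the language-to-script table, and the 0.5 threshold are all assumptions, and the heuristic cannot separate languages that share a script (en, de, es, fr, and sw all use Latin).

```python
import unicodedata

# Assumed mapping from MGSM language codes to Unicode character-name prefixes.
EXPECTED_SCRIPTS = {
    "bn": ("BENGALI",),
    "ru": ("CYRILLIC",),
    "te": ("TELUGU",),
    "th": ("THAI",),
    "zh": ("CJK",),
    "ja": ("HIRAGANA", "KATAKANA", "CJK"),
    "en": ("LATIN",),
    "de": ("LATIN",),
    "es": ("LATIN",),
    "fr": ("LATIN",),
    "sw": ("LATIN",),
}


def script_match_ratio(text: str, lang: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with an expected prefix."""
    prefixes = EXPECTED_SCRIPTS[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes))
    return hits / len(letters)


def consistency_rate(responses, lang: str, threshold: float = 0.5) -> float:
    """Share of responses whose script match ratio reaches the (assumed) threshold."""
    if not responses:
        return 0.0
    matches = sum(1 for r in responses if script_match_ratio(r, lang) >= threshold)
    return matches / len(responses)


# Two of the three responses to a Russian prompt are written in Cyrillic.
print(consistency_rate(["Привет", "Hello", "Ответ: 5"], "ru"))
```

A real evaluation would use a trained language identifier, since script alone says nothing about which Latin-script language a model reasons in.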

Acknowledgement

This codebase is based on https://github.com/simplescaling/s1.

Citation

If you use this code for your research, please cite our paper.

@article{hwang2025learn,
  title={Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning},
  author={Hwang, Jaedong and Tanmay, Kumar and Lee, Seok-Jin and Agrawal, Ayush and Palangi, Hamid and Ayush, Kumar and Fiete, Ila R and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2507.05418},
  year={2025}
}
