This README describes how to reproduce the paper [Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics].
This paper presents a unified framework for understanding and optimizing language model steering, explaining diverse intervention methods through dynamic weight updates and their preference–utility dynamics.
Our contributions are as follows:
- **Unified View.** We unify local weight fine-tuning, LoRA, and activation steering under a single framework of dynamic weight updates, revealing consistent preference–utility dynamics across intervention forms.
- **Preference–Utility Mechanism.** We introduce an activation-manifold analysis showing that preference arises from projection onto target directions, while utility degradation is primarily driven by off-manifold deviations, yielding predictive log-odds–control relationships.
- **SPILT Steering Method.** Guided by this mechanism, we propose SPILT, a preference–utility joint training objective that improves controllability while better preserving utility.
To set up the environment for running steering experiments, follow these steps:
```bash
git clone https://github.com/zjunlp/EasyEdit.git
conda create -n spilt python=3.10
conda activate spilt
pip install -r requirements_2.txt
```

After downloading, organize the resources as follows:
Place the model files in the `./models/` directory:

```
models/
└── {model_name}/
    ├── config.json
    ├── tokenizer.json
    └── ... (model files)
```
Place the dataset files in the appropriate data directories as specified in `hparams/Steer/dataset_format.yaml`:

```
data/
├── psychopathy/
│   ├── train.jsonl
│   └── test.jsonl
├── axbench/
│   └── ...
└── powerseeking/
    └── ...
```
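Each `train.jsonl` / `test.jsonl` file holds one JSON object per line. Below is a minimal, hypothetical loader sketch (`load_jsonl` is not part of the repository, and the field names inside each record depend on the dataset, so none are assumed here):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON-Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing/blank lines
                records.append(json.loads(line))
    return records
```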
Extract the pre-trained vectors to the following directory structure (note that the final directory is named `{method}_{intervention_method}`, e.g. `spilt_local_weight`):

```
vectors/
└── {model_name}/
    └── {method}/
        └── {dataset}/
            └── {method}_{intervention_method}/
                ├── layer_{layer_id}.pt
                └── metadata_layer_{layer_id}.jsonl (optional)
```
For example, for the gemma-2-9b-it model with the spilt method on the psychopathy dataset using the local_weight intervention, the vectors should be placed at:

```
vectors/gemma-2-9b-it/spilt/psychopathy/spilt_local_weight/
├── layer_20.pt
└── metadata_layer_20.jsonl (optional)
```
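Before launching a run, it can be useful to sanity-check that the expected vector files are in place. The following is a sketch; `find_vector_layers` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

def find_vector_layers(base_dir, model_name, method, dataset, intervention_method):
    """Return the layer ids for which a layer_{id}.pt file exists under
    vectors/{model_name}/{method}/{dataset}/{method}_{intervention_method}/."""
    vec_dir = (Path(base_dir) / "vectors" / model_name / method
               / dataset / f"{method}_{intervention_method}")
    if not vec_dir.is_dir():
        return []
    layers = []
    for f in sorted(vec_dir.glob("layer_*.pt")):
        layers.append(int(f.stem.split("_")[1]))  # "layer_20" -> 20
    return sorted(layers)
```

For the example above, `find_vector_layers(".", "gemma-2-9b-it", "spilt", "psychopathy", "local_weight")` should report layer 20.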
Below is an example of generating and applying steering vectors on the psychopathy dataset using the SPILT method with the local_weight intervention.

Run the script `run_SPILT.py` via the wrapper:

```bash
bash examples/run_SPILT.sh
```

Or run the Python script directly:

```bash
python run_SPILT.py \
    --dataset psychopathy \
    --method spilt \
    --model_name gemma-2-9b-it \
    --intervention_method local_weight \
    --mode both \
    --multipliers 1.0 \
    --device cuda:0 \
    --base_dir .
```
This command runs both vector generation and application for the psychopathy dataset using the SPILT method with the local_weight intervention. The arguments are explained below:
- `--dataset`: Specifies the dataset name. Options: `axbench`, `psychopathy`, `powerseeking`. This determines which dataset will be used for training and evaluation.
- `--method`: Specifies the steering method to use. Options: `caa`, `reps`, `sft`, `spilt`, or `all` (to run all methods). Each method implements a different approach to generating steering vectors:
  - `caa`: Contrastive Activation Addition
  - `reps`: Representation Engineering via Preference Steering
  - `sft`: Supervised Fine-tuning based Steering
  - `spilt`: Our SPILT method implementation
  - `all`: Run all available methods sequentially
- `--model_name`: Specifies the model name (e.g., `gemma-2-9b-it`, `qwen2.5-7b-it`). The model should be located in `./models/{model_name}/`.
- `--intervention_method`: Specifies how the steering vector is applied to the model. Options: `vector`, `lora`, `local_weight`, or `all` (to run all intervention methods):
  - `vector`: Direct vector addition to activations
  - `lora`: Low-Rank Adaptation style intervention
  - `local_weight`: Local weight modification intervention
  - `all`: Run all available intervention methods sequentially
- `--mode`: Specifies which phase to run. Options: `generate`, `apply`, `both`, or `loss` (default: `both`):
  - `generate`: Only generate steering vectors from training data
  - `apply`: Only apply pre-generated vectors for text generation
  - `both`: Run both generation and application sequentially
  - `loss`: Calculate training loss using pre-generated vectors (see the loss mode section)
- `--device`: Specifies the device to use for computation (default: `cuda:0`). Can be set to any valid CUDA device or `cpu`.
- `--multipliers`: Specifies multiplier values for vector application (default: `[1.0]`). Multiple values can be provided to test different steering strengths, e.g., `--multipliers 1.0 2.0 3.0`.
- `--base_dir`: Specifies the base directory for the project (default: `.`). All relative paths in the script are resolved relative to this directory.
- `--dry_run`: If specified, only prints the commands that would be executed without actually running them. Useful for verifying dataset/method routing and parameter configurations.
Note: When using `all` for either `--method` or `--intervention_method`, the script will automatically skip invalid combinations (e.g., CAA only supports the `vector` intervention, so `caa` + `lora` will be skipped).
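This skipping behavior can be sketched as a simple filter over the cross product of methods and interventions. Only the CAA restriction stated above is encoded here; the real script may exclude further combinations:

```python
from itertools import product

METHODS = ["caa", "reps", "sft", "spilt"]
INTERVENTIONS = ["vector", "lora", "local_weight"]

def is_valid(method, intervention):
    """CAA only supports the vector intervention (other rules may exist)."""
    if method == "caa" and intervention != "vector":
        return False
    return True

def expand_all(method_arg, intervention_arg):
    """Expand 'all' into concrete (method, intervention) pairs, skipping invalid ones."""
    methods = METHODS if method_arg == "all" else [method_arg]
    interventions = INTERVENTIONS if intervention_arg == "all" else [intervention_arg]
    return [(m, i) for m, i in product(methods, interventions) if is_valid(m, i)]
```

With `expand_all("all", "all")`, the pairs `("caa", "lora")` and `("caa", "local_weight")` are dropped.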
If you want to skip the vector generation phase and directly apply pre-trained steering vectors, modify the `run_SPILT.sh` script or run the Python script directly with `--mode apply`:

Edit `examples/run_SPILT.sh` and change the `--mode` parameter from `both` or `generate` to `apply`:
```bash
python run_SPILT.py \
    --dataset psychopathy \
    --method all \
    --model_name gemma-2-9b-it \
    --intervention_method all \
    --mode apply \
    --multipliers 1.0 \
    --device cuda:0 \
    --base_dir .
```

This will:
- Skip vector generation (since vectors already exist)
- Apply all available pre-trained vectors for all method-intervention combinations
- Generate text outputs with the specified multiplier value (1.0 in this example)
- Save results to `generation/{model_name}/{method}/{dataset}/{intervention_method}/m{multiplier}/`
Note: Make sure all required vector files exist before running with `--mode apply`. The script will skip combinations where vector files are missing and print a warning message.
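Since the output location follows a fixed template, downstream tooling can reconstruct it. A small sketch (the helper name is ours, not the repository's):

```python
from pathlib import Path

def generation_dir(base_dir, model_name, method, dataset, intervention_method, multiplier):
    """Build generation/{model_name}/{method}/{dataset}/{intervention_method}/m{multiplier}/."""
    return (Path(base_dir) / "generation" / model_name / method
            / dataset / intervention_method / f"m{multiplier}")
```

For example, `generation_dir(".", "gemma-2-9b-it", "spilt", "psychopathy", "local_weight", 1.0)` yields `generation/gemma-2-9b-it/spilt/psychopathy/local_weight/m1.0`.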
The loss mode calculates training loss for a dataset using pre-generated steering vectors. This mode automatically runs calculations for both `winning_only` and `losing_only` preference types, which helps analyze the preference–utility decomposition of model behavior.
Use this mode:

- After generating steering vectors (using `--mode generate` or `--mode both`)
- To analyze how different preference types affect the training loss
- To evaluate the effectiveness of steering vectors before applying them
Prerequisites:

- Pre-generated steering vectors must exist at `vectors/{model_name}/{method}/{dataset}/{method}_{intervention_method}/`
- The loss calculation hparam file must exist at `hparams/Steer/experiment_hparams/spilt_experiment/{dataset}/{model_name}/sft/generate_sft_loss.yaml`
- Note: Loss calculation is not supported for the `axbench` dataset
Calculate the loss for the psychopathy dataset using the SPILT method with the `local_weight` intervention:
```bash
python run_SPILT.py \
    --dataset psychopathy \
    --method spilt \
    --model_name gemma-2-9b-it \
    --intervention_method local_weight \
    --mode loss \
    --multipliers 0.2 \
    --device cuda:0 \
    --base_dir .
```

This command will:
- Load pre-generated vectors from `vectors/gemma-2-9b-it/spilt/psychopathy/spilt_local_weight/`
- Calculate training loss for both `winning_only` and `losing_only` preference types
- Use multiplier value 0.2 as the steering factor
- Save loss results to:
  - `vectors/{model_name}/get_sft_loss/{method}/{dataset}/{method}_{intervention_method}/{intervention_method}_{method}_m{multiplier}_winning_only/`
  - `vectors/{model_name}/get_sft_loss/{method}/{dataset}/{method}_{intervention_method}/{intervention_method}_{method}_m{multiplier}_losing_only/`
Each loss calculation run produces:
- `train.log`: Training log file
- `train_losses.csv`: CSV file containing loss values for each training step
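To compare the two preference conditions, the per-step losses in `train_losses.csv` can be averaged. This sketch assumes the file has a column named `loss`, which should be adjusted to match the actual header:

```python
import csv
from statistics import mean

def mean_loss(csv_path, loss_column="loss"):
    """Average a loss column from a per-step losses CSV.
    The column name 'loss' is an assumption; adjust to the actual header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        values = [float(row[loss_column]) for row in csv.DictReader(f)]
    return mean(values)
```

Running this on the `winning_only` and `losing_only` output directories gives one summary number per preference type.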
The loss values help understand how the steering vectors affect model behavior under different preference conditions (winning vs losing).
