# Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
This README describes how to reproduce the paper *Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics*.
This paper presents a unified framework for understanding and optimizing language model steering, explaining diverse intervention methods through dynamic weight updates and their preference–utility dynamics.
Our contributions are as follows:
- **Unified View.** We unify local weight fine-tuning, LoRA, and activation steering under a single framework of *dynamic weight updates*, revealing consistent preference–utility dynamics across intervention forms.
- **Preference–Utility Mechanism.** We introduce an activation-manifold analysis showing that preference arises from projection onto target directions, while utility degradation is primarily driven by off-manifold deviations, yielding predictive log-odds–control relationships.
- **SPILT Steering Method.** Guided by this mechanism, we propose **SPILT**, a preference–utility joint training objective that improves controllability while better preserving utility.
## Table of Contents
- [Requirements](#requirements)
- [Directory Structure](#directory-structure)
- [Quick Start](#quick-start)
- [Using Pre-trained Vectors](#using-pre-trained-vectors)
- [Loss Calculation](#loss-calculation)
## Requirements
### Environment Setup
To set up the environment for running steering experiments, follow these steps:
```bash
git clone https://github.com/zjunlp/EasyEdit.git
cd EasyEdit
conda create -n spilt python=3.10
conda activate spilt
pip install -r requirements_2.txt
```
## Directory Structure
After downloading, organize the resources as follows:
#### Models
Place the model files in the `./models/` directory:
```
models/
└── {model_name}/
├── config.json
├── tokenizer.json
└── ... (model files)
```
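If you do not already have the model weights locally, one way to obtain them in this layout is with `huggingface_hub`. This is a minimal sketch, not part of this repository's tooling; the repo id `google/gemma-2-9b-it` is an example (the repo is gated, so you may need to authenticate first), and any download method that produces the directory structure above works equally well.
```python
# Sketch: download a model into ./models/{model_name}/ using huggingface_hub.
# Assumes `pip install huggingface_hub`; adjust repo_id / local_dir for your model.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-9b-it",      # example repo id (assumption)
    local_dir="./models/gemma-2-9b-it",  # matches the layout expected by this README
)
```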
#### Datasets
Place the dataset files in the appropriate data directories as specified in `hparams/Steer/dataset_format.yaml`:
```
data/
├── psychopathy/
│ ├── train.jsonl
│ └── test.jsonl
├── axbench/
│ └── ...
└── powerseeking/
└── ...
```
#### Pre-trained Steering Vectors
Extract the pre-trained vectors to the following directory structure:
```
vectors/
└── {model_name}/
└── {method}/
└── {dataset}/
└── {intervention_method}/
├── layer_{layer_id}.pt
└── metadata_layer_{layer_id}.jsonl (optional)
```
For example, for the `gemma-2-9b-it` model with the `spilt` method on the `psychopathy` dataset using the `local_weight` intervention, the vectors should be placed at:
```
vectors/gemma-2-9b-it/spilt/psychopathy/spilt_local_weight/
├── layer_20.pt
└── metadata_layer_20.jsonl (optional)
```
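As a quick sanity check, a vector file can be inspected with `torch.load`. The exact contents of `layer_{layer_id}.pt` depend on the method and intervention, so the snippet below is only a minimal sketch for looking at what was saved, not a documented API:
```python
# Sketch: inspect a pre-trained steering artifact (contents vary by intervention method).
import torch

obj = torch.load(
    "vectors/gemma-2-9b-it/spilt/psychopathy/spilt_local_weight/layer_20.pt",
    map_location="cpu",
)
if isinstance(obj, torch.Tensor):
    print("tensor shape:", obj.shape, "dtype:", obj.dtype)
elif isinstance(obj, dict):
    # e.g. a state-dict-like payload for local_weight / lora interventions
    for key, value in obj.items():
        print(key, getattr(value, "shape", type(value)))
else:
    print(type(obj))
```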
## Quick Start
### Example: generating and applying steering vectors on the psychopathy dataset using the SPILT method with `local_weight` intervention
Run [run_SPILT.py](../run_SPILT.py) via the provided shell script:
```bash
bash examples/run_SPILT.sh
```
Or directly run the Python script:
```bash
python run_SPILT.py \
    --dataset psychopathy \
    --method spilt \
    --model_name gemma-2-9b-it \
    --intervention_method local_weight \
    --mode both \
    --multipliers 1.0 \
    --device cuda:0 \
    --base_dir .
```
This command runs both vector generation and application for the psychopathy dataset using the SPILT method with the local_weight intervention. The arguments are explained below:
### Required Arguments
- `--dataset`: Specifies the dataset name. Options: `axbench`, `psychopathy`, `powerseeking`. This determines which dataset will be used for training and evaluation.
- `--method`: Specifies the steering method to use. Options: `caa`, `reps`, `sft`, `spilt`, or `all` (to run all methods). Each method implements a different approach to generating steering vectors:
- `caa`: Contrastive Activation Addition
- `reps`: Representation Engineering via Preference Steering
- `sft`: Supervised Fine-tuning based Steering
- `spilt`: Our SPILT method implementation
- `all`: Run all available methods sequentially
- `--model_name`: Specifies the model name (e.g., `gemma-2-9b-it`, `qwen2.5-7b-it`). The model should be located in `./models/{model_name}/`.
- `--intervention_method`: Specifies how the steering vector is applied to the model (see the sketch after this list). Options: `vector`, `lora`, `local_weight`, or `all` (to run all intervention methods):
- `vector`: Direct vector addition to activations
- `lora`: Low-Rank Adaptation style intervention
- `local_weight`: Local weight modification intervention
- `all`: Run all available intervention methods sequentially
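To make the three intervention forms concrete, here is a minimal, non-authoritative sketch of how each one modifies a single linear layer. It is an illustration of the "dynamic weight update" view from the paper's unified framework, not the code used by `run_SPILT.py`; all shapes and values are placeholders.
```python
# Sketch: the three intervention forms applied to a linear layer y = W @ x.
import torch

d_out, d_in, rank = 8, 8, 2
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)
v = torch.randn(d_out)   # steering direction
alpha = 1.0              # steering strength (cf. --multipliers)

# vector: add the steering direction to the activation.
y_vector = W @ x + alpha * v

# lora: add a low-rank update B @ A to the weight.
A, B = torch.randn(rank, d_in), torch.randn(d_out, rank)
y_lora = (W + alpha * B @ A) @ x

# local_weight: directly modify (a block of) the weight matrix.
delta_W = torch.zeros_like(W)
delta_W[:, :2] = torch.randn(d_out, 2)   # edit only a few input columns
y_local = (W + alpha * delta_W) @ x

# For a given input x, the activation-level edit can also be read as a
# rank-1 dynamic weight update: W' = W + alpha * v x^T / ||x||^2.
W_rank1 = W + alpha * torch.outer(v, x) / (x @ x)
assert torch.allclose(W_rank1 @ x, y_vector, atol=1e-5)
```
Seen this way, all three options amount to (possibly constrained) updates of the same weights, which is what lets them be compared under one preference–utility framework.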
### Optional Arguments
- `--mode`: Specifies which phase to run. Options: `generate`, `apply`, or `both` (default: `both`):
- `generate`: Only generate steering vectors from training data
- `apply`: Only apply pre-generated vectors for text generation
- `both`: Run both generation and application sequentially
- `--device`: Specifies the device to use for computation (default: `cuda:0`). Can be set to any valid CUDA device or `cpu`.
- `--multipliers`: Specifies multiplier values for vector application (default: `[1.0]`). Multiple values can be provided to test different steering strengths, e.g., `--multipliers 1.0 2.0 3.0`.
- `--base_dir`: Specifies the base directory for the project (default: `.`). All relative paths in the script will be resolved relative to this directory.
- `--dry_run`: If specified, only prints the commands that would be executed without actually running them. Useful for verifying dataset/method routing and parameter configurations.
### Advanced Usage
#### Running the SPILT method with a specific intervention
```bash
python run_SPILT.py \
    --dataset psychopathy \
    --method spilt \
    --model_name gemma-2-9b-it \
    --intervention_method local_weight \
    --mode both \
    --base_dir .
```
**Note**: When using `all` for either `--method` or `--intervention_method`, the script will automatically skip invalid combinations (e.g., CAA only supports `vector` intervention, so `caa + lora` will be skipped).
## Using Pre-trained Vectors
If you want to skip the vector generation phase and directly apply pre-trained steering vectors, modify the `run_SPILT.sh` script or run the Python script directly with `--mode apply`:
#### Apply Vectors with modified run_SPILT.sh
Edit `examples/run_SPILT.sh` and change the `--mode` parameter from `both` or `generate` to `apply`:
```bash
python run_SPILT.py \
--dataset psychopathy \
--method all \
--model_name gemma-2-9b-it \
--intervention_method all \
--mode apply \
--multipliers 1.0 \
--device cuda:0 \
--base_dir .
```
This will:
1. Skip vector generation (since vectors already exist)
2. Apply all available pre-trained vectors for all method-intervention combinations
3. Generate text outputs with the specified multiplier value(s) (`1.0` in this example)
4. Save results to `generation/{model_name}/{method}/{dataset}/{intervention_method}/m{multiplier}/`
**Note**: Make sure all required vector files exist before running with `--mode apply`. The script will skip combinations where vector files are missing and print a warning message.
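A small helper like the following can be used to check which combinations have vectors in place before launching `--mode apply`. This is a sketch that simply assumes the directory layout described earlier in this README:
```python
# Sketch: list which (method, intervention) combinations have pre-trained vectors
# under vectors/{model_name}/{method}/{dataset}/{method}_{intervention}/.
from pathlib import Path

model_name, dataset = "gemma-2-9b-it", "psychopathy"
methods = ["caa", "reps", "sft", "spilt"]
interventions = ["vector", "lora", "local_weight"]

for method in methods:
    for intervention in interventions:
        vector_dir = Path("vectors") / model_name / method / dataset / f"{method}_{intervention}"
        layer_files = sorted(vector_dir.glob("layer_*.pt"))
        status = f"{len(layer_files)} layer file(s)" if layer_files else "missing"
        print(f"{method:>6} + {intervention:<12} -> {status}")
```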
## Loss Calculation
The `loss` mode calculates the training loss for a dataset using pre-generated steering vectors. This mode automatically runs calculations for both `winning_only` and `losing_only` preference types, which helps analyze the preference–utility decomposition of model behavior.
### When to Use Loss Mode
- After generating steering vectors (using `--mode generate` or `--mode both`)
- To analyze how different preference types affect the training loss
- To evaluate the effectiveness of steering vectors before applying them
### Requirements
- Pre-generated steering vectors must exist at: `vectors/{model_name}/{method}/{dataset}/{method}_{intervention_method}/`
- The loss calculation hparam file must exist at: `hparams/Steer/experiment_hparams/spilt_experiment/{dataset}/{model_name}/sft/generate_sft_loss.yaml`
- **Note**: Loss calculation is not supported for the `axbench` dataset
### Example Usage
Calculate loss for psychopathy dataset using SPILT method with local_weight intervention:
```bash
python run_SPILT.py \
--dataset psychopathy \
--method spilt \
--model_name gemma-2-9b-it \
--intervention_method local_weight \
--mode loss \
--multipliers 0.2 \
--device cuda:0 \
--base_dir .
```
This command will:
1. Load pre-generated vectors from `vectors/gemma-2-9b-it/spilt/psychopathy/spilt_local_weight/`
2. Calculate training loss for both `winning_only` and `losing_only` preference types
3. Use a multiplier of 0.2 as the steering strength
4. Save loss results to:
- `vectors/{model_name}/get_sft_loss/{method}/{dataset}/{method}_{intervention_method}/{intervention_method}_{method}_m{multiplier}_winning_only/`
- `vectors/{model_name}/get_sft_loss/{method}/{dataset}/{method}_{intervention_method}/{intervention_method}_{method}_m{multiplier}_losing_only/`
### Output Files
Each loss calculation run produces:
- `train.log`: Training log file
- `train_losses.csv`: CSV file containing loss values for each training step
The loss values help show how the steering vectors affect model behavior under the two preference conditions (winning vs. losing).
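To compare the two conditions after a run, the two `train_losses.csv` files can be read with pandas. This is a sketch only: the exact multiplier formatting in the output path and the loss column name are assumptions, so adapt them to the files your run actually produces.
```python
# Sketch: compare winning_only vs. losing_only losses from train_losses.csv.
# The "loss" column name is an assumption; check the CSV header of your run.
import pandas as pd

base = ("vectors/gemma-2-9b-it/get_sft_loss/spilt/psychopathy/spilt_local_weight/"
        "local_weight_spilt_m0.2_{pref}/train_losses.csv")

for pref in ["winning_only", "losing_only"]:
    df = pd.read_csv(base.format(pref=pref))
    loss_col = "loss" if "loss" in df.columns else df.columns[-1]  # fall back to last column
    print(f"{pref}: mean loss over {len(df)} steps = {df[loss_col].mean():.4f}")
```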