HiCI: Hierarchical Construction-Integration for Long-Context Attention

Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu


Table of Contents

  • News
  • Highlights
  • Requirements
  • Installation
  • Data Preparation
  • Models
  • Training
  • Weight Extraction
  • Merging
  • Evaluation
  • Citation
  • Acknowledgement
  • License

News

  • [2026.04] HiCI v2 updated on arXiv. Expanded results on LLaMA-3 and Qwen3.
  • [2026.03] HiCI paper released on arXiv.

Highlights

  1. HiCI injects a three-stage hierarchical attention module (Local Construction → Global Integration → Top-down Broadcast) into each transformer layer as a plug-in. It is fully compatible with Flash-Attention and requires no architectural changes at inference time (see the sketch after this list).
  2. We release fine-tuned models across multiple scales and context lengths, including Llama-2-7b-HiCI-100k, Llama-2-13b-HiCI-64k, Llama-3-8b-HiCI-32k, and Qwen3-8b-HiCI-48k.
  3. HiCI achieves consistent perplexity improvements over LongLoRA at equal context length and surpasses GPT-3.5-Turbo-16K on code comprehension — while adding only ~5.5% additional parameters.
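
The following is a minimal, self-contained sketch of how a three-stage module of this shape can be composed, using the default slot counts from the Training section. Class names, tensor shapes, and initialization here are illustrative assumptions only; see llama_attn_hici.py for the actual implementation.

# Toy sketch of Local Construction -> Global Integration -> Top-down Broadcast.
# NOT the repo's implementation; names and shapes are assumptions.
import torch
import torch.nn as nn

class HiCISketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_chunks=4,
                 num_local_slots=8, global_slots=4):
        super().__init__()
        self.num_chunks = num_chunks
        self.local_slots = nn.Parameter(torch.randn(num_local_slots, dim) * 0.02)   # M learnable queries
        self.global_slots = nn.Parameter(torch.randn(global_slots, dim) * 0.02)     # K learnable queries
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                     # x: (batch, seq_len, dim)
        b = x.size(0)
        # 1) Local Construction: M slots summarize each of the num_chunks segments.
        summaries = []
        for seg in x.chunk(self.num_chunks, dim=1):
            q = self.local_slots.unsqueeze(0).expand(b, -1, -1)
            s, _ = self.local_attn(q, seg, seg)
            summaries.append(s)
        local = torch.cat(summaries, dim=1)                   # (b, num_chunks * M, dim)
        # 2) Global Integration: K global slots attend over all local summaries.
        g = self.global_slots.unsqueeze(0).expand(b, -1, -1)
        g, _ = self.global_attn(g, local, local)              # (b, K, dim)
        # 3) Top-down Broadcast: every token reads the global summary back.
        out, _ = self.broadcast_attn(x, g, g)                 # (b, seq_len, dim)
        return x + out                                        # residual, so it plugs into a layer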

Requirements

To download and use the pre-trained models you will need:

  • a Hugging Face account (log in via huggingface-cli login)
  • for the LLaMA models, acceptance of the Meta license on the respective model pages (see Download Base Models below)

Installation

Environment: Python 3.11, CUDA 12.4. Training requires multiple GPUs (we use 8× H100 80GB for the LLaMA models and 8× H200 for Qwen3).

Step 1 — clone the repository:

git clone https://github.com/zengxyyu/HiCI.git && cd HiCI

Step 2 — install dependencies:

# For LLaMA-2 / LLaMA-3
pip install -r requirements.txt

# For Qwen3 (requires transformers 4.51)
pip install -r requirements-qwen3.txt

Step 3 — install Flash Attention (compiled from source):

pip install flash-attn==2.5.8 --no-build-isolation

If DeepSpeed reports a CUDA version mismatch: export DS_SKIP_CUDA_CHECK=1

After installation, you can either run inference with one of the released models or fine-tune your own (see Training).


Data Preparation

Pre-training data

python -c "
from datasets import load_dataset
dataset = load_dataset('ZengXiangyu/RedPajama-Data-1T-Sample', cache_dir='./cache')
dataset.save_to_disk('./cache/datasets')
"

SFT data (LongAlpaca-12k)

mkdir -p data/sft
wget -P data/sft https://huggingface.co/datasets/Yukang/LongAlpaca-12k/resolve/main/LongAlpaca-12k.json

Evaluation data (PG-19 and Proof-pile)

Option 1 — Download all pre-tokenized files at once (recommended):

huggingface-cli download ZengXiangyu/pg19-and-proof-pile \
    --repo-type dataset \
    --local-dir ./data \
    --local-dir-use-symlinks False

Or download individually (replace the path for other files):

# PG-19
wget -P data/pg19_llama2/ https://huggingface.co/datasets/ZengXiangyu/pg19-and-proof-pile/resolve/main/pg19_llama2/test.bin

# Proof-pile (128 docs sampled from test split, same as LongLoRA)
wget -P data/proof-pile_qwen3/ https://huggingface.co/datasets/ZengXiangyu/pg19-and-proof-pile/resolve/main/proof-pile_qwen3/test_sampled_data.bin

Option 2 — Prepare from scratch

First download the raw text (requires internet access):

python3 download_pg19.py --split test
# → data/pg19_raw/test.txt

Then tokenize for each model family:

# LLaMA-2
python3 -c "
import numpy as np, os
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('./models/Llama-2-7b-hf')
text = open('data/pg19_raw/test.txt').read()
tokens = tokenizer.encode(text)
os.makedirs('data/pg19_llama2', exist_ok=True)
np.array(tokens, dtype=np.uint16).tofile('data/pg19_llama2/test.bin')
print(f'Done: {len(tokens):,} tokens')
"

# LLaMA-3
python3 prepare_eval_data.py \
    --model_path ./models/Meta-Llama-3-8B \
    --text_file data/pg19_raw/test.txt \
    --output_dir data/pg19_llama3

# Qwen3
python3 prepare_eval_data.py \
    --model_path ./models/Qwen3-8B \
    --text_file data/pg19_raw/test.txt \
    --output_dir data/pg19_qwen3

Models

Models are currently private and will be released upon paper acceptance. Model page: https://huggingface.co/ZengXiangyu/models

| Model | Base | Context | Link |
|---|---|---|---|
| Llama-2-7b-HiCI-8k | LLaMA-2-7B | 8K | 🤗 |
| Llama-2-7b-HiCI-32k | LLaMA-2-7B | 32K | 🤗 |
| Llama-2-7b-HiCI-100k | LLaMA-2-7B | 100K | 🤗 |
| Llama-2-13b-HiCI-64k | LLaMA-2-13B | 64K | 🤗 |
| Llama-3-8b-HiCI-32k | LLaMA-3-8B | 32K | 🤗 |
| Qwen3-8b-HiCI-48k | Qwen3-8B | 48K | 🤗 |

Perplexity on PG-19 (↓ lower is better)

LLaMA-2

| Model | Train ctx | 2K | 4K | 8K | 16K | 32K |
|---|---|---|---|---|---|---|
| LLaMA-2-7B-HiCI | 8K | 7.27 | 7.01 | 6.93 | – | – |
| LLaMA-2-7B-HiCI | 16K | 7.55 | 7.24 | 7.02 | 6.93 | – |
| LLaMA-2-7B-HiCI | 32K | 7.87 | 7.50 | 7.26 | 7.09 | 7.11 |
| LLaMA-2-13B-HiCI | 8K | 6.68 | 6.46 | 6.34 | – | – |
| LLaMA-2-13B-HiCI | 16K | 6.95 | 6.65 | 6.43 | 6.28 | – |

LLaMA-3

| Model | Train ctx | Steps | 2K | 4K | 8K | 16K | 32K |
|---|---|---|---|---|---|---|---|
| LLaMA-3-8B | 8K | – | 9.19 | 8.71 | 8.38 | >100 | >100 |
| LLaMA-3-8B-HiCI | 32K | 1000 | 7.90 | 7.86 | 7.54 | 7.28 | 7.20 |

Qwen3

| Model | Train ctx | Steps | 2K | 4K | 8K | 16K | 32K | 48K |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (baseline) | 32K | – | 13.26 | 12.58 | 12.09 | 11.72 | 12.76 | 11.32 |
| Qwen3-8B-HiCI | 48K | 500 | 11.71 | 11.06 | 10.59 | 10.24 | 9.98 | 9.89 |
| Qwen3-8B-HiCI | 48K | 1000 | 11.46 | 10.84 | 10.38 | 10.06 | 9.82 | 9.73 |

See the paper for full results including Proof-pile, LongBench, and topic retrieval.


Training

Download Base Models

Log in to Hugging Face first. For LLaMA models, also accept the Meta license on the model page before downloading.

huggingface-cli login
# LLaMA-2-7B
huggingface-cli download meta-llama/Llama-2-7b-hf \
    --local-dir ./models/Llama-2-7b-hf \
    --local-dir-use-symlinks False \
    --max-workers 1

# LLaMA-2-13B
huggingface-cli download meta-llama/Llama-2-13b-hf \
    --local-dir ./models/Llama-2-13b-hf \
    --local-dir-use-symlinks False

# LLaMA-3-8B
huggingface-cli download meta-llama/Meta-Llama-3-8B \
    --local-dir ./models/Meta-Llama-3-8B \
    --local-dir-use-symlinks False

# Qwen3-8B
huggingface-cli download Qwen/Qwen3-8B \
    --local-dir ./models/Qwen3-8B \
    --local-dir-use-symlinks False \
    --max-workers 1

Script–model pairing

| Use case | Shell script | Python script | Attention module |
|---|---|---|---|
| LLaMA-2/3 continued pre-training | train_fine_tune_hici.sh | fine-tune_hici.py | llama_attn_hici.py |
| LLaMA-2/3 SFT | train_fine_tune_hici_sft.sh | fine-tune_hici_sft.py | llama_attn_hici_sft.py |
| Qwen3 continued pre-training | train_fine_tune_hici_qwen3.sh | fine-tune_hici_qwen3.py | qwen3_attn_hici.py |

LLaMA-2 / LLaMA-3

bash train_fine_tune_hici.sh

Or manually (LLaMA-2-7B, 8K context example):

torchrun --nproc_per_node 8 --master_port=38493 fine-tune_hici.py \
    --model_name_or_path ./models/Llama-2-7b-hf \
    --bf16 True \
    --output_dir ./checkpoints/Llama-2-7b-8k-hici \
    --cache_dir ./cache \
    --model_max_length 8192 \
    --use_flash_attn True \
    --low_rank_training True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --warmup_steps 20 \
    --lr_scheduler_type constant_with_warmup \
    --logging_steps 1 \
    --deepspeed ds_configs/stage2.json \
    --tf32 True \
    --max_steps 1000 \
    --num_chunks 4 \
    --num_local_slots 8 \
    --global_slots 4 \
    --num_heads 8 \
    --use_bottleneck True \
    --bottleneck_dim 512 \
    --shared_compress_dim 128 \
    --use_local_constructor True \
    --use_global_integrator True \
    --use_hierarchical_forward True \
    --use_llama_init False \
    --use_local_constructor_flash False \
    --trainable_params "embed,norm,local_constructor,global_integrator" \
    --hici_lr 2e-4 \
    --hici_grad_clip 0.3

Qwen3

bash train_fine_tune_hici_qwen3.sh

Or manually (Qwen3-8B, 48K context example):

torchrun --nproc_per_node 8 \
    --master_port=38493 \
    fine-tune_hici_qwen3.py \
    --model_name_or_path ./models/Qwen3-8B \
    --bf16 True \
    --output_dir ./checkpoints/Qwen3-8b-hici-48k \
    --cache_dir ./cache \
    --model_max_length 49152 \
    --use_flash_attn True \
    --low_rank_training True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --warmup_steps 20 \
    --lr_scheduler_type constant_with_warmup \
    --logging_steps 1 \
    --deepspeed ds_configs/stage3.json \
    --tf32 True \
    --max_steps 1000 \
    --save_steps 500 \
    --save_total_limit 2 \
    --num_chunks 4 \
    --num_local_slots 8 \
    --global_slots 4 \
    --num_heads 8 \
    --use_bottleneck True \
    --bottleneck_dim 512 \
    --shared_compress_dim 128 \
    --use_local_constructor True \
    --use_global_integrator True \
    --use_hierarchical_forward True \
    --use_attn_init False \
    --use_local_constructor_flash False \
    --trainable_params "embed,norm,local_constructor,global_integrator" \
    --hici_lr 2e-4 \
    --hici_grad_clip 0.3

If you have access to multiple nodes (e.g. two 4-GPU nodes), use the multi-node script instead. Run the following on each node simultaneously, passing the node rank (0 for master, 1, 2, … for workers):

bash train_fine_tune_hici_qwen3_multinode.sh 0   # master node
bash train_fine_tune_hici_qwen3_multinode.sh 1   # worker node

Supervised Fine-Tuning

SFT resumes from a HiCI pre-trained checkpoint to teach instruction-following while preserving long-context capabilities.

bash train_fine_tune_hici_sft.sh

Or manually (LLaMA-2-7B, 16K context example):

torchrun --nproc_per_node 8 \
    --master_port=38493 \
    fine-tune_hici_sft.py \
    --model_name_or_path ./models/Llama-2-7b-hf \
    --resume_from_checkpoint ./checkpoints/Llama-2-7b-hici-16k/checkpoint-1000 \
    --data_path ./data/sft/LongAlpaca-12k.json \
    --bf16 True \
    --output_dir ./checkpoints/Llama-2-7b-hici-sft-16k \
    --cache_dir ./cache \
    --model_max_length 16384 \
    --use_flash_attn True \
    --low_rank_training True \
    --num_train_epochs 15 \
    --max_steps 3000 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --warmup_steps 20 \
    --lr_scheduler_type constant_with_warmup \
    --logging_steps 1 \
    --deepspeed ds_configs/stage2.json \
    --tf32 True \
    --save_steps 500 \
    --save_total_limit 4 \
    --num_chunks 4 \
    --num_local_slots 8 \
    --global_slots 4 \
    --num_heads 8 \
    --use_bottleneck True \
    --bottleneck_dim 512 \
    --shared_compress_dim 128 \
    --use_local_constructor True \
    --use_global_integrator True \
    --use_hierarchical_forward True \
    --use_llama_init False \
    --use_local_constructor_flash False \
    --trainable_params "embed,norm,local_constructor,global_integrator" \
    --hici_lr 2e-4 \
    --hici_grad_clip 0.3

--resume_from_checkpoint is optional; omit it to fine-tune directly from the base model.

Key hyperparameters

| Argument | Default | Description |
|---|---|---|
| --num_local_slots | 8 | Learnable query slots per segment (local cardinality M) |
| --global_slots | 4 | Global context vectors (global cardinality K) |
| --num_heads | 8 | Attention heads in HiCI modules (use 40 for 13B) |
| --bottleneck_dim | 512 | Bottleneck compression dimension |
| --shared_compress_dim | 128 | Shared compressor intermediate dim for GlobalIntegratorShared (128 for 7B/8B, 160 for 13B) |
| --num_chunks | 4 | Number of segments to split the input into |
| --hici_lr | 2e-4 | Separate LR for HiCI modules (≈ 10× base LR) |
| --hici_grad_clip | 0.3 | Gradient clipping for HiCI modules |
| --use_local_constructor_flash | False | Use LocalConstructorFlash (flash-attn cross-attention); default False uses LocalConstructorMulti |
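
With the defaults above (and assuming the slot counts compose as described in Highlights), each layer's hierarchical pass first builds num_chunks × num_local_slots = 4 × 8 = 32 local slot vectors, then integrates them into global_slots = 4 global context vectors that are broadcast back to every token.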

Weight Extraction

After training, two steps are required before evaluation or merging.

Step 1 — Reconstruct full weights from DeepSpeed ZeRO shards:

cd ./checkpoints/Llama-3-8b-hici-32k/checkpoint-1000 && python zero_to_fp32.py . . && cd -

This produces pytorch_model.bin inside the checkpoint directory.

Step 2 — Extract LoRA and HiCI parameters:

python get_trainable_weights.py \
    --checkpoint_path ./checkpoints/Llama-3-8b-hici-32k/checkpoint-1000 \
    --trainable_params "embed,norm,local_constructor,global_integrator"

This produces trainable_params.bin, which is required by the eval and merge scripts.

Skipping either step will cause a "trainable_params.bin not found" error during evaluation or merging.
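
To confirm the extraction worked, you can list what the file contains (assuming it is a plain state dict saved with torch.save; adjust the path to your checkpoint):

python -c "
import torch
sd = torch.load('./checkpoints/Llama-3-8b-hici-32k/checkpoint-1000/trainable_params.bin', map_location='cpu')
print(len(sd), 'tensors')
for k, v in list(sd.items())[:5]:
    print(k, tuple(v.shape))
"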

Merging

Training produces a base model and a separate trainable_params.bin (LoRA + HiCI adapter weights). Merging combines them into a single self-contained HuggingFace model directory for easier distribution and loading. There are two options with different trade-offs:

  • Option A (LoRA adapters + embed/norm only): HiCI modules are discarded; the result is a standard transformer that works with any inference tool (vLLM, transformers, etc.) without any custom code.
  • Option B (LoRA adapters + embed/norm + HiCI modules): HiCI modules are included in the merged weights; loading still requires injecting the HiCI architecture via replace_llama_attn() / register_hici_to_model(), but the weights are fully self-contained without needing trainable_params.bin.

These two options correspond to the two inference modes reported in the paper.

Option A — LoRA adapters + embed/norm only (full-attention inference, no HiCI at prefill)

The merged model contains LoRA adapters + embed/norm weights from training, but excludes HiCI modules. Inference uses standard full attention.

python merge_lora_weights_and_save_hf_model.py \
    --base_model ./models/Llama-2-7b-hf \
    --peft_model ./checkpoints/Llama-2-7b-hici-8k/checkpoint-1000 \
    --context_size 8192 \
    --save_path ./models/merged/Llama-2-7b-hici-8k-merged

Option B — LoRA adapters + embed/norm + HiCI modules (HiCI hierarchical attention at prefill)

The merged model contains LoRA adapters + embed/norm weights + HiCI modules. Inference uses HiCI hierarchical attention during prefill.

# LLaMA-2/3
python merge_lora_weights_hici.py \
    --base_model ./models/Llama-2-7b-hf \
    --peft_model ./checkpoints/Llama-2-7b-hici-16k/checkpoint-1000 \
    --save_path ./models/merged/Llama-2-7b-HiCI-16k \
    --context_size 16384 \
    --num_local_slots 8 \
    --global_slots 4 \
    --num_heads 8 \
    --bottleneck_dim 512
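
To run inference with an Option B merged model, the HiCI attention must be registered before the weights are loaded. The snippet below is a minimal sketch under the assumption that replace_llama_attn() can be imported from llama_attn_hici.py and takes no required arguments; check the repo's eval scripts for the exact call and module path.

# Sketch only: import path and call signature of replace_llama_attn are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_attn_hici import replace_llama_attn   # injects HiCI hierarchical attention

replace_llama_attn()                             # must run before from_pretrained
path = './models/merged/Llama-2-7b-HiCI-16k'
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(path)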

Evaluation

Before running any evaluation, you need trained adapter weights. There are two ways to obtain them:

Option A — Download our released adapter weights from HuggingFace

# Example: Qwen3-8b-HiCI-48k
huggingface-cli download ZengXiangyu/Qwen3-8b-HiCI-48k \
    --local-dir ./checkpoints/Qwen3-8b-HiCI-48k \
    --local-dir-use-symlinks False

Option B — Use your own trained adapter weights (see Training)

For your own weights, follow the Weight Extraction steps first to produce trainable_params.bin inside the checkpoint directory. Downloaded weights already include this file.


Once you have adapter weights, choose how to use them for evaluation:

Path 1 — Evaluate directly without merging (pass the adapter weights directory via --peft_model):

--base_model ./models/Meta-Llama-3-8B \
--peft_model ./checkpoints/Llama-3-8b-HiCI-32k \

Path 2 — Merge first, then evaluate (omit --peft_model, pass the merged model via --base_model):

# LoRA only (standard full attention at inference)
python merge_lora_weights_and_save_hf_model.py \
    --base_model ./models/Meta-Llama-3-8B \
    --peft_model ./checkpoints/Llama-3-8b-HiCI-32k \
    --save_path ./models/merged/Llama-3-8b-HiCI-32k \
    --context_size 32768

# Then evaluate with the merged model (no --peft_model)
--base_model ./models/merged/Llama-3-8b-HiCI-32k \

See the Merging section for the full list of options.


Perplexity on PG-19 / Proof-pile

bash eval_distributed_hici.sh        # LLaMA-2/3
bash eval_distributed_hici_qwen3.sh  # Qwen3

Or manually (LLaMA-2-7B, 8K context example):

torchrun --nproc_per_node=8 \
    --master_port=38493 \
    eval_distributed_hici.py \
    --base_model ./models/Llama-2-7b-hf \
    --peft_model ./checkpoints/Llama-2-7b-8k-hici/checkpoint-1000 \
    --data_path ./data/pg19_llama2/test.bin \
    --seq_len 2048 \
    --context_size 8192 \
    --batch_size 1 \
    --flash_attn True \
    --use_local_constructor True \
    --use_global_integrator True \
    --num_local_slots 8 \
    --global_slots 4 \
    --num_heads 8 \
    --use_bottleneck True \
    --bottleneck_dim 512 \
    --use_hierarchical_forward True \
    --use_local_constructor_flash False \
    --eval_mode "full"

For LLaMA-3 use --data_path ./data/pg19_llama3/test.bin; for Qwen3 use ./data/pg19_qwen3/test.bin.

Proof-pile works identically — the .bin files are tokenizer-specific numpy memmaps, same format as PG-19. ./data/proof-pile/test_sampled_data.bin is tokenized with the LLaMA-2 tokenizer; for other model families, re-tokenize first using prepare_eval_data.py (see Option 2 above), then pass the resulting --data_path accordingly.
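
If you need to read one of these .bin files yourself (e.g. for a custom evaluation), each is a flat array of token ids written with numpy's tofile. A minimal reading sketch; the dtype must match what was written (uint16 for the LLaMA-2 snippet above; LLaMA-3 and Qwen3 vocabularies do not fit in uint16, so check prepare_eval_data.py for the dtype it writes):

# Reading a tokenized .bin as a numpy memmap (dtype is an assumption; match the writer).
import numpy as np
tokens = np.memmap('data/pg19_llama2/test.bin', dtype=np.uint16, mode='r')
print(f'{len(tokens):,} tokens; first ten: {tokens[:10].tolist()}')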

Multi-node (2 nodes × 4 GPUs): run bash eval_distributed_hici_multinode.sh 0 on the master node and bash eval_distributed_hici_multinode.sh 1 on the worker node.

--eval_mode options:

| Value | Description |
|---|---|
| None | HiCI attention, same as training — not used in paper results, for fairness |
| "full" | Standard full attention — used in all paper results for fair comparison |

ChunkLlama (Training-Free Baseline)

ChunkLlama is a training-free context extension method used as a baseline in our paper. We extend the original implementation to support Qwen3 (ChunkLlama/chunkqwen3_attn_replace.py), a model family the original ChunkLlama release did not cover.

bash eval_chunkdca_pg19.sh llama3          # DCA mode, LLaMA-3
bash eval_chunkdca_pg19.sh qwen3           # DCA mode, Qwen3
bash eval_chunkdca_pg19.sh llama3 baseline # original model, no DCA
bash eval_chunkdca_pg19.sh qwen3  baseline

Passkey Retrieval

python passkey_retrieval.py \
    --base_model ./models/merged/Llama-2-7b-HiCI-32k \
    --context_size 32768 \
    --max_tokens 57344 \
    --interval 1024
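
Following the standard passkey-retrieval setup, the script hides a short passkey inside long filler text and checks whether the model can repeat it back; here --max_tokens and --interval presumably set the longest prompt tried and the step between prompt lengths (the exact semantics are defined in passkey_retrieval.py).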

Topic Retrieval

Evaluation runs in two stages. First, eval_topic_retrieval_predict.sh runs the model and writes raw predictions to LongChat/longeval/evaluation/topics/predictions/<model-name>_full/. Then, the predictions can be scored in two ways: rule-based scoring via eval_topic_retrieval_score.sh (no API key, results saved to eval_topic_retrieval/<model-name>_score.txt), or LLM-based scoring via auto_topic_eval.py (requires an OpenAI API key).

# Stage 1: generate predictions (edit MODEL_NAME inside the script first)
bash eval_topic_retrieval_predict.sh

# Stage 2: score the predictions (two options)

Option A — Rule-based scoring (no API key required): eval_topic_retrieval_score.sh uses topic_retrieval_manual_eval.py, which checks whether the label string appears in the model's output. Fast and reproducible, but simple string matching may occasionally mis-score edge cases — spot-check the raw .txt files in eval_topic_retrieval/ if needed.

bash eval_topic_retrieval_score.sh
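
The core of the rule-based check is essentially a substring test; the sketch below is illustrative only, and topic_retrieval_manual_eval.py may differ in details such as normalization:

# Illustrative core of the rule-based scorer (not the repo's exact code).
def topic_recalled(label: str, prediction: str) -> bool:
    # Correct if the expected topic string appears anywhere in the model output.
    return label.strip().lower() in prediction.lower()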

Option B — GPT auto-scoring (requires OpenAI API key): uses auto_topic_eval.py inside LongChat for LLM-based judgement, which handles paraphrases and formatting variations that rule-based matching would miss.

export OPENAI_API_KEY='your-api-key'
cd LongChat/longeval
python3 auto_topic_eval.py --test_file evaluation/topics/predictions/<model-name>_full/*.txt

LongBench

Requires an SFT model (trained with fine-tune_hici_sft.py). Two options — baseline (no HiCI) and HiCI — both using run_pred.sh:

cd LongBench/LongBench

# Baseline: --ori disables HiCI, uses standard full attention
bash run_pred.sh --model <model-name> --ori --suffix "_ori"

# HiCI: HiCI hierarchical attention in prefill (entire sequence as one group, no segmentation)
bash run_pred.sh --model <model-name> --suffix "_hici"

# Score each run (--model must match the directory name created under pred/)
python eval.py --model <model-name>_ori
python eval.py --model <model-name>_hici

Citation

If you find this project useful in your research, please consider citing:

@article{zeng2026hici,
  title={HiCI: Hierarchical Construction-Integration for Long-Context Attention},
  author={Zeng, Xiangyu and Xu, Qi and Wang, Yunke and Xu, Chang},
  journal={arXiv preprint arXiv:2603.20843},
  year={2026}
}

Acknowledgement

  • We follow the training recipe of LongLoRA (ICLR 2024 Oral) — fine-tuning LoRA adapters together with embedding and LayerNorm weights — but replace Shift Short Attention with our HiCI hierarchical attention.
  • Pre-trained base models: LLaMA-2, LLaMA-3 by Meta, and Qwen3 by Alibaba.
  • We integrate ChunkLlama as a training-free baseline for comparison, and extend it to support Qwen3 (not covered in the original paper).
  • Training is accelerated by DeepSpeed, PEFT, and Flash-Attention 2.
  • We use LongChat for topic retrieval evaluation.
  • SFT data: LongAlpaca-12k by Yukang Chen et al.

License
