InstructGLM

基于ChatGLM-6B+LoRA在指令数据集上进行微调

开源指令数据集

斯坦福52k英文指令数据

instruction:52K 条指令中的每一条都是唯一的,答案由text-davinci-003模型生成得到的

BELLE项目生成的中文指令数据

生成方式基于种子prompt，调用openai的api生成中文指令

GuanacoDataset 多语言指令数据集

Guanaco 是在 Meta 的 LLaMA 7B 模型上训练的指令跟随语言模型。在 Alpaca 模型原始 52K 数据的基础上，我们添加了额外的 98,369 个条目，涵盖英语、简体中文、繁体中文（台湾）、繁体中文（香港）、日语、德语以及各种语言和语法任务。通过使用这些丰富的数据重新训练和优化模型，Guanaco 在多语言环境中展示了出色的性能和潜力。项目链接可以查看 https://guanaco-model.github.io/

alpaca中文指令微调数据集

与原始alpaca数据json格式相同,数据生成的方法是机器翻译和self-instruct

人工精调的中文对话数据集

加入除了alpaca之外的其他中文聊天对话人工微调，部分并不中文化的问题，我们将重新询问chatgpt或文心一言，重新获取回答并覆盖掉alpaca的回答

微调1：alpaca英文指令数据

斯坦福羊驼52k数据，原始数据格式如下：

{
    "instruction": "Evaluate this sentence for spelling and grammar mistakes",
    "input": "He finnished his meal and left the resturant",
    "output": "He finished his meal and left the restaurant."
}

数据集地址：https://github.com/tatsu-lab/stanford_alpaca

1.数据预处理

转化alpaca数据集为jsonl,这一步可以执行设置数据转换后格式，比如：

###Instruction:xxx###Input:xxxx###Response:xxx

python cover_alpaca2jsonl.py \
    --data_path data/alpaca_data.json \
    --save_path data/alpaca_data.jsonl

对文本进行tokenize,加快训练速度，文本长度可根据运行资源自行设置

python tokenize_dataset_rows.py \
    --jsonl_path data/alpaca_data.jsonl \
    --save_path data/alpaca \
    --max_seq_length 320

2. 模型训练

python finetune.py \
    --dataset_path data/alpaca \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output

微调2:BELLE中文指令数据

包含543314条由BELLE项目生成的中文指令数据,数据格式如下：

input	target
用一句话描述地球为什么是独一无二的。\n	地球上有适宜生命存在的条件和多样化的生命形式

数据集地址：https://huggingface.co/datasets/BelleGroup/generated_train_0.5M_CN

1.数据预处理

转化bell数据集为jsonl

python cover_alpaca2jsonl.py \
    --dataset_name BelleGroup/generated_train_0.5M_CN \
    --save_path data/belle_data.jsonl

文本长度统计

count    543314.000000
mean         83.536944
std          95.665178
min           4.000000
25%          33.000000
50%          51.000000
75%          88.000000
90%         194.000000
max        4410.000000
Name: input_len, dtype: float64

count    543314.000000
mean        121.079030
std         165.472722
min           1.000000
25%          27.000000
50%          67.000000
75%         151.000000
90%         296.000000
max        9463.000000
Name: target_len, dtype: float64

分词处理

python tokenize_dataset_rows.py \
    --jsonl_path data/belle_data.jsonl \
    --save_path data/belle \
    --max_seq_length 320

转换后的数据：

                                           input_ids  seq_len                                                                                                                   
0  [20005, 92863, 20012, 20005, 83864, 87784, 871...       20
1  [20005, 92863, 20012, 20005, 91432, 86523, 885...       80
2  [20005, 92863, 20012, 104069, 85056, 86334, 89...       61
3  [20005, 92863, 20012, 91492, 89122, 83866, 852...       24
4  [20005, 92863, 20012, 20005, 83834, 99899, 927...       24

2. 模型训练

基于原始chatglm-6b训练

python finetune.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output

基于alpaca的lora继续微调

python finetune.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output/belle \
    --is_resume True \
    --resume_path output/alpaca/chatglm-lora.pt

实验环境

安装所需要的包：pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
显卡：2xA100 80G

实验结果

训练好的lora权重


链接：https://pan.baidu.com/s/1c-zRSEUn4151YLoowPN4YA?pwd=hxbr
提取码：hxbr
--来自百度网盘超级会员V3的分享

alpaca数据微调效果

belle数据微淘效果

Reference

非常感谢以下作者的无私开源

Name	Name	Last commit message	Last commit date
Latest commit History 21 Commits 21 Commits
data	data
examples	examples
output	output
.gitignore	.gitignore
LICENSE	LICENSE
README.md	README.md
cli_demo.py	cli_demo.py
configuration_chatglm.py	configuration_chatglm.py
cover_alpaca2jsonl.py	cover_alpaca2jsonl.py
cover_belle2jsonl.py	cover_belle2jsonl.py
demo.py	demo.py
finetune.py	finetune.py
infer.ipynb	infer.ipynb
infer.py	infer.py
modeling_chatglm.py	modeling_chatglm.py
quantization.py	quantization.py
requirements.txt	requirements.txt
tokenization_chatglm.py	tokenization_chatglm.py
tokenize_dataset_rows.py	tokenize_dataset_rows.py
web_demo.py	web_demo.py
web_demo_test.py	web_demo_test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

InstructGLM

开源指令数据集

微调1：alpaca英文指令数据

1.数据预处理

2. 模型训练

微调2:BELLE中文指令数据

1.数据预处理

2. 模型训练

实验环境

实验结果

Reference

About

Uh oh!

Releases

Packages

Languages

Search code, repositories, users, issues, pull requests...

License

primeideals/InstructGLM

Folders and files

Latest commit

History

Repository files navigation

InstructGLM

开源指令数据集

微调1：alpaca英文指令数据

1.数据预处理

2. 模型训练

微调2:BELLE中文指令数据

1.数据预处理

2. 模型训练

实验环境

实验结果

Reference

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages