Description
Hi, I am running the command accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/coba_train_config.json --distributed_type "DeepSpeed".
My coba_train_config.json is as follows:
{
"data_paths": "/home/descfly/MFTCoder-main/CodeExercise-Python-27k",
"output_dir": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/output",
"tb_dir": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/output/tensorboard",
"pretrained_model_path": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/phi1.5/phi-1_5",
"model_type": "phi",
"load_raw_dataset": true,
"data_split": "95,5,0",
"padding_mode": "padding",
"use_dynamic_padding": true,
"tokenize_mode": "sft",
"tokenizer_type": "AutoTokenizer",
"weighted_loss_mode": "coba",
"coba_warmup_steps": 100,
"coba_history_length": 200,
"coba_tau": 5,
"coba_update_interval": 1,
"coba_sample_valid_num": 1,
"attn_implementation": "flash_attention_2",
"seq_length": 4096,
"seed": 1234,
"peft_type": "qlora",
"quantization": "4bit",
"lora_rank": 96,
"lora_alpha": 32,
"lora_dropout": 0.05,
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"learning_rate": 5e-5,
"min_lr": 5e-6,
"weight_decay": 0.1,
"gradient_accumulation_steps": 1,
"lr_scheduler_type": "cosine",
"num_warmup_steps": 300,
"num_train_epochs": 4,
"resume_from_checkpoint": null,
"log_interval": 10,
"checkpointing_steps": 100,
"evaluation_steps": 100,
"max_train_steps": null,
"epoch_checkpointing": true,
"shuffle_before_split": true,
"early_stopping": true,
"early_stopping_stall_num": 5,
"saving_limit": null
}
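For reference, here is a quick path check I can run against this config (just a rough sketch, independent of MFTCoder's own loading code; the *.jsonl glob is my guess at how the raw dataset files are discovered):

import json
import os
from glob import glob

# Rough sanity check on the paths referenced by coba_train_config.json.
# The "*.jsonl" pattern is only an assumption about how raw data files are found.
with open("configs/coba_train_config.json") as f:
    cfg = json.load(f)

# data_paths may be a comma-separated string, possibly wrapped in brackets
for path in cfg["data_paths"].strip("[]").split(","):
    jsonl_files = glob(os.path.join(path, "*.jsonl"))
    print(path, "| is dir:", os.path.isdir(path), "| jsonl files:", len(jsonl_files))

print("pretrained_model_path exists:", os.path.isdir(cfg["pretrained_model_path"]))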
My accelerate_ds_config.yaml is as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 2
  steps_per_print: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
The dataset is the provided CodeExercise-Python-27k.
The run then fails with the following error:
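The dataset directory itself can also be inspected directly (again only a sketch; the recursive glob is a guess at the folder layout, and nothing about the expected schema is assumed):

import json
from glob import glob

# Peek at the first record of the CodeExercise-Python-27k jsonl
# to see which fields it actually carries.
data_dir = "/home/descfly/MFTCoder-main/CodeExercise-Python-27k"
files = sorted(glob(data_dir + "/**/*.jsonl", recursive=True))
print("jsonl files found:", len(files))
if files:
    with open(files[0]) as f:
        first_record = json.loads(f.readline())
    print("keys of first record:", list(first_record.keys()))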
Initial eos_token_id 50256 from tokenizer
Tokenizer: <class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>
Length of tokenizer: 50295
build_tokenizer pad_token_id: 50256, eos_token_id: 50256
build_tokenizer pad_token : <|endoftext|>, eos_token: <|endoftext|>
padded vocab (size: 50257) with 15 dummy tokens (new size: 50272)
data splits: [95.0, 5.0, 0.0]
/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py:248: RuntimeWarning: invalid value encountered in divide
effective_token_rate.append(cur_dataset_num_tokens / (cur_dataset_sample_num * args.seq_length))
[Global Rank 0]shape of cur train dataset: (0,)
[Global Rank 0]num tokens: [0]
[Global Rank 0]effective token rate: [nan]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/pefts/mft_accelerate.py", line 571, in
[rank0]: main()
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/pefts/mft_accelerate.py", line 327, in main
[rank0]: train_dataset, valid_dataset = load_dataset_from_jsonl(
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 295, in load_dataset_from_jsonl
[rank0]: train_loss_weights = ds_fn(all_train_datasets_length)
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 72, in ds_weights_by_num_docs_sft
[rank0]: weights = [1 / i for i in l]
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 72, in
[rank0]: weights = [1 / i for i in l]
[rank0]: ZeroDivisionError: division by zero
[rank0]:[W324 15:24:17.031963716 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0324 15:24:17.659670 126203480196928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1036782) of binary: /home/descfly/anaconda3/envs/coba_wzw/bin/python
Traceback (most recent call last):
File "/home/descfly/anaconda3/envs/coba_wzw/bin/accelerate", line 8, in
sys.exit(main())
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
deepspeed_launcher(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
distrib_run.run(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Could you tell me why this happens?
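For what it's worth, my reading of the log is that the train split ends up empty (shape (0,), num tokens [0]), so the per-dataset SFT loss weights divide by zero. A minimal illustration of the failing line, using the value from my log (just an illustration, not MFTCoder code):

# With an empty train split, every dataset length is 0, so the
# loss-weight computation 1/i raises ZeroDivisionError.
all_train_datasets_length = [0]  # train dataset shape is (0,) in the log above
weights = [1 / i for i in all_train_datasets_length]  # -> ZeroDivisionError: division by zero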