Error when running accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/coba_train_config.json --distributed_type "DeepSpeed" #88

Open
@SYVAE

Description


Hello, I get an error when running the command accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/coba_train_config.json --distributed_type "DeepSpeed".
My coba_train_config.json is as follows:

{
"data_paths": "/home/descfly/MFTCoder-main/CodeExercise-Python-27k",
"output_dir": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/output",
"tb_dir": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/output/tensorboard",
"pretrained_model_path": "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/phi1.5/phi-1_5",
"model_type": "phi",
"load_raw_dataset": true,
"data_split": "95,5,0",
"padding_mode": "padding",
"use_dynamic_padding": true,
"tokenize_mode": "sft",
"tokenizer_type": "AutoTokenizer",
"weighted_loss_mode": "coba",
"coba_warmup_steps": 100,
"coba_history_length": 200,
"coba_tau": 5,
"coba_update_interval": 1,
"coba_sample_valid_num": 1,
"attn_implementation": "flash_attention_2",
"seq_length": 4096,
"seed": 1234,
"peft_type": "qlora",
"quantization": "4bit",
"lora_rank": 96,
"lora_alpha": 32,
"lora_dropout": 0.05,
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"learning_rate": 5e-5,
"min_lr": 5e-6,
"weight_decay": 0.1,
"gradient_accumulation_steps": 1,
"lr_scheduler_type": "cosine",
"num_warmup_steps": 300,
"num_train_epochs": 4,
"resume_from_checkpoint": null,
"log_interval": 10,
"checkpointing_steps": 100,
"evaluation_steps": 100,
"max_train_steps": null,
"epoch_checkpointing": true,
"shuffle_before_split": true,
"early_stopping": true,
"early_stopping_stall_num": 5,
"saving_limit": null
}
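
A quick pre-flight check of this config can rule out a missing or empty data directory before launching. The sketch below is an illustration only: the helper names `summarize_data_dir` and `check_train_config` are hypothetical, and it assumes the samples live in `.jsonl` files under `data_paths` (the traceback mentions `load_dataset_from_jsonl`), which may not match how MFTCoder actually parses this field.

```python
import json
from pathlib import Path


def summarize_data_dir(data_dir):
    """Map each .jsonl file under data_dir to its number of non-empty lines."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"data directory not found: {root}")
    counts = {}
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            counts[str(path.relative_to(root))] = sum(1 for line in fh if line.strip())
    return counts


def check_train_config(config_path):
    """Load the training config and fail early if no samples are found."""
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    # Assumption: "data_paths" holds a single directory path, as in the config above.
    counts = summarize_data_dir(config["data_paths"])
    if sum(counts.values()) == 0:
        raise ValueError(f"no .jsonl samples found under {config['data_paths']}")
    return counts
```

Calling `check_train_config("configs/coba_train_config.json")` would return a per-file sample count, or raise if the directory is missing or contains no samples.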

The YAML file is as follows:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 2

steps_per_print: 1

distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false

The dataset is the provided CodeExercise-Python-27k.
The run fails with the following error:

Initial eos_token_id 50256 from tokenizer
Tokenizer: <class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>
Length of tokenizer: 50295
build_tokenizer pad_token_id: 50256, eos_token_id: 50256
build_tokenizer pad_token : <|endoftext|>, eos_token: <|endoftext|>

padded vocab (size: 50257) with 15 dummy tokens (new size: 50272)
data splits: [95.0, 5.0, 0.0]
/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py:248: RuntimeWarning: invalid value encountered in divide
effective_token_rate.append(cur_dataset_num_tokens / (cur_dataset_sample_num * args.seq_length))
[Global Rank 0]shape of cur train dataset: (0,)
[Global Rank 0]num tokens: [0]
[Global Rank 0]effective token rate: [nan]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/pefts/mft_accelerate.py", line 571, in
[rank0]: main()
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/pefts/mft_accelerate.py", line 327, in main
[rank0]: train_dataset, valid_dataset = load_dataset_from_jsonl(
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 295, in load_dataset_from_jsonl
[rank0]: train_loss_weights = ds_fn(all_train_datasets_length)
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 72, in ds_weights_by_num_docs_sft
[rank0]: weights = [1 / i for i in l]
[rank0]: File "/home/descfly/MFTCoder-main/mftcoder_accelerate/src/data/multi_task_dataset.py", line 72, in
[rank0]: weights = [1 / i for i in l]
[rank0]: ZeroDivisionError: division by zero
[rank0]:[W324 15:24:17.031963716 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0324 15:24:17.659670 126203480196928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1036782) of binary: /home/descfly/anaconda3/envs/coba_wzw/bin/python
Traceback (most recent call last):
File "/home/descfly/anaconda3/envs/coba_wzw/bin/accelerate", line 8, in
sys.exit(main())
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
deepspeed_launcher(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
distrib_run.run(args)
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/descfly/anaconda3/envs/coba_wzw/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

What is causing this?
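
For reference, the traceback reduces to `1 / i` evaluated over a list of dataset lengths that contains 0, which agrees with the `shape of cur train dataset: (0,)` line in the log: no training samples were loaded. Below is a minimal sketch of that expression with an explicit guard. `inverse_count_weights` is a hypothetical stand-in, not MFTCoder's actual `ds_weights_by_num_docs_sft`, of which only one line is visible in the traceback.

```python
def inverse_count_weights(dataset_lengths):
    """Weight each dataset by the inverse of its sample count.

    Raises a descriptive error instead of ZeroDivisionError when a dataset is empty.
    """
    empty = [idx for idx, n in enumerate(dataset_lengths) if n == 0]
    if empty:
        raise ValueError(
            f"datasets at indices {empty} contain no samples; "
            "check the data path, the split ratios and the data file format"
        )
    return [1 / n for n in dataset_lengths]
```

With the lengths reported in the log, `inverse_count_weights([0])` raises a `ValueError` that names the empty dataset, which points at data loading rather than at the weighting step.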
