PyTorch/LanguageModeling/BERT/README.md
The following features are supported by this model.
[LAMB](https://arxiv.org/pdf/1904.00962.pdf) (Layerwise Adaptive Moments Based optimizer) is a large-batch optimization technique that helps accelerate the training of deep neural networks using large minibatches. It allows a global batch size of 65536 on sequence length 128 and 32768 on sequence length 512, compared to a batch size of 256 for [Adam](https://arxiv.org/pdf/1412.6980.pdf). The optimized implementation accumulates 1024 gradient batches in phase 1 and 4096 in phase 2 before updating weights once, resulting in a 15% training speedup. On multi-node systems, LAMB allows scaling up to 1024 GPUs, with training speedups of up to 72x compared to Adam. Adam is limited in the learning rates it can use because its learning rate is applied globally to all parameters, whereas LAMB follows a layerwise learning rate strategy.
NVLAMB adds the necessary tweaks to [LAMB version 1](https://arxiv.org/abs/1904.00962v1) to ensure correct convergence.
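The layerwise update at the heart of LAMB can be sketched as follows. This is an illustration only, not the repository's optimizer: bias correction and NVLAMB's gradient pre-normalization are omitted, and `lamb_update` is a hypothetical helper name.

```python
import numpy as np

def lamb_update(w, grad, m, v, lr, beta1=0.9, beta2=0.999,
                eps=1e-6, weight_decay=0.01):
    """One simplified LAMB step for a single layer's weights (sketch only)."""
    # Adam-style first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + weight_decay * w
    # Layerwise trust ratio: scale the step by ||w|| / ||update||,
    # giving each layer its own effective learning rate.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust_ratio * update, m, v
```

The trust ratio is what makes very large batches trainable: layers whose proposed update is large relative to their weight norm take proportionally smaller steps.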
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups can be achieved by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
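The loss-scaling step can be demonstrated numerically with NumPy. This is a minimal illustration of why scaling is needed; the constants are arbitrary, not values used by this repository.

```python
import numpy as np

# Loss scaling: tiny gradients that underflow to zero in FP16 are preserved
# by scaling the loss before backpropagation and unscaling the gradients in
# FP32 before the weight update.
true_grad = np.float32(1e-8)
loss_scale = np.float32(2.0 ** 14)

# Without scaling, the gradient underflows in half precision:
assert np.float16(true_grad) == 0.0

# With scaling, backprop produces scale * grad, which FP16 can represent:
scaled_grad = np.float16(true_grad * loss_scale)
assert scaled_grad > 0

# FP32 master weights are then updated with the unscaled gradient:
recovered = np.float32(scaled_grad) / loss_scale
```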
Depending on the speed of your internet connection, this process takes about a day to complete. The BookCorpus server can sometimes be overloaded and may contain broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry the download at a later time.
6. Start pretraining.
The available parameters for the `run_pretraining.py` script include:

```
  --output_dir OUTPUT_DIR
                        Path to the output directory where the model
                        checkpoints will be written.
  --init_checkpoint     Initial checkpoint to start pretraining from
                        (usually a BERT pretrained checkpoint)
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum total input sequence length after
                        WordPiece tokenization. Sequences longer than
                        this will be truncated.
  --gradient_accumulation_steps
                        Number of update steps to accumulate before
                        performing a backward/update pass.
  --allreduce_post_accumulation
                        If set to true, performs allreduce only after the
                        defined number of gradient accumulation steps.
  --allreduce_post_accumulation_fp16
                        If set to true, performs allreduce after gradient
                        accumulation steps in FP16.
  --fp16                If set, will perform computations using automatic
                        mixed precision.
  --phase1_end_step     The number of steps phase 1 was trained for. In order
                        to resume phase 2 the correct way, phase1_end_step
                        should correspond to the --max_steps phase 1 was
                        trained for.
```
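The `gradient_accumulation_steps` behavior can be sketched in plain Python. This is a simplified illustration, not the repository's training loop, and `train_with_accumulation` is a hypothetical helper.

```python
import numpy as np

def train_with_accumulation(w, micro_batch_grads, accumulation_steps, lr):
    """Apply one SGD update per `accumulation_steps` micro-batch gradients."""
    acc = np.zeros_like(w)
    for step, g in enumerate(micro_batch_grads, start=1):
        acc += g                                     # accumulate, no update yet
        if step % accumulation_steps == 0:
            w = w - lr * (acc / accumulation_steps)  # single optimizer step
            acc = np.zeros_like(w)
    return w
```

Accumulation lets a small-memory GPU emulate a large batch: memory holds one micro-batch at a time, while the optimizer sees the averaged gradient of the whole accumulated batch.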
Where:

- `<seed>` - random seed for the run.
- `<allreduce_post_accumulation>` - if set to `true`, performs allreduce only after the defined number of gradient accumulation steps.
- `<allreduce_post_accumulation_fp16>` - if set to `true`, performs allreduce after gradient accumulation steps in FP16.

Note: The above two options need to be set to `false` when running on FP32.

- `<training_batch_size_phase2>` - the per-GPU batch size used for training in phase 2. Larger batch sizes run more efficiently but require more memory.
- `<learning_rate_phase2>` - the base learning rate for training phase 2.
- `<warmup_proportion_phase2>` - the percentage of training steps used for warm-up at the start of phase 2 training.
- `<training_steps_phase2>` - the total number of training steps for phase 2, performed in addition to phase 1.
- `<gradient_accumulation_steps_phase2>` - an integer indicating the number of steps to accumulate gradients over in phase 2. Effective batch size = `training_batch_size_phase2` / `gradient_accumulation_steps_phase2`.
- `<init_checkpoint>` - a checkpoint to start the pretraining routine from (usually a BERT pretrained checkpoint).
For example:
`bash scripts/run_pretraining.sh`
Trains BERT-large from scratch on a DGX-1 32G using FP16 arithmetic. 90% of the training steps use sequence length 128 (phase 1 of training) and 10% use sequence length 512 (phase 2 of training).
To train on a DGX-1 16G, set `gradient_accumulation_steps` to `512` and `gradient_accumulation_steps_phase2` to `1024` in `scripts/run_pretraining.sh`.

To train on a DGX-2 32G, set `train_batch_size` to `4096`, `train_batch_size_phase2` to `2048`, `num_gpus` to `16`, `gradient_accumulation_steps` to `64` and `gradient_accumulation_steps_phase2` to `256` in `scripts/run_pretraining.sh`.

To run the pretraining routine from an initial checkpoint, do the following in `scripts/run_pretraining.sh`:

- point the `init_checkpoint` variable to the location of the checkpoint
- set `resume_training` to `true`

Note: The parameter value assigned to `BERT_CONFIG` during training should remain unchanged. Also, to resume pretraining on your own corpus, the training dataset should be created using the same vocabulary file used in `data/create_datasets_from_start.sh`.
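As a quick sanity check, the DGX-2 settings above reproduce the global batch sizes quoted in the LAMB section. The values below are copied from the text; the arithmetic is plain Python.

```python
# Settings from the DGX-2 32G example above.
train_batch_size = 4096             # per-GPU batch size, phase 1
train_batch_size_phase2 = 2048      # per-GPU batch size, phase 2
num_gpus = 16
gradient_accumulation_steps = 64
gradient_accumulation_steps_phase2 = 256

# Effective per-GPU batch size per forward/backward pass:
phase1_step_batch = train_batch_size // gradient_accumulation_steps
phase2_step_batch = train_batch_size_phase2 // gradient_accumulation_steps_phase2

# Global batch sizes match those quoted for LAMB (65536 and 32768):
phase1_global = train_batch_size * num_gpus
phase2_global = train_batch_size_phase2 * num_gpus
```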
##### Fine-tuning
###### Pre-training inference on NVIDIA DGX-1 with 32G

Our results were obtained by running the `scripts/run_pretraining_inference.sh` and `scripts/run_squad.sh` scripts in the pytorch:19.07-py3 NGC container on NVIDIA DGX-1 with (1x V100 32G) GPUs.

###### Pre-training inference on NVIDIA DGX-2 with 32G

Our results were obtained by running the `scripts/run_pretraining_inference.sh` and `scripts/run_squad.sh` scripts in the pytorch:19.07-py3 NGC container on NVIDIA DGX-2 with (1x V100 32G) GPUs.