Commit c4061eb
Merge branch 'master' into master
2 parents: 2a93a1e + 7d772b8

28 files changed: +2365 -492 lines
PyTorch/LanguageModeling/BERT/Dockerfile (+1 -1)

@@ -11,7 +11,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.08-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.10-py3
 FROM ${FROM_IMAGE_NAME}
 RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract
PyTorch/LanguageModeling/BERT/LICENSE (+2 -3)

@@ -1,3 +1,4 @@
+
 Apache License
 Version 2.0, January 2004
 http://www.apache.org/licenses/

@@ -175,8 +176,6 @@
 
 END OF TERMS AND CONDITIONS
 
-Copyright 2019 NVIDIA CORPORATION. All rights reserved.
-
 APPENDIX: How to apply the Apache License to your work.
 
 To apply the Apache License to your work, attach the following

@@ -200,4 +199,4 @@
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
-limitations under the License.
+limitations under the License.
PyTorch/LanguageModeling/BERT/README.md (+45 -25)
@@ -133,6 +133,10 @@ The following features are supported by this model.
 
 [LAMB](https://arxiv.org/pdf/1904.00962.pdf) (Layerwise Adaptive Moments based optimizer) is a large batch optimization technique that helps accelerate training of deep neural networks using large minibatches. It allows a global batch size of 65536 on sequence length 128 and 32768 on sequence length 512, compared to a batch size of 256 for Adam. The optimized implementation accumulates 1024 gradient batches in phase 1 and 4096 in phase 2 before updating weights once, which results in a 15% training speedup. On multi-node systems, LAMB allows scaling up to 1024 GPUs, yielding training speedups of up to 72x over [Adam](https://arxiv.org/pdf/1412.6980.pdf). Adam places limits on the usable learning rate since it is applied globally to all parameters, whereas LAMB follows a layerwise learning rate strategy.
 
+NVLAMB adds necessary tweaks to [LAMB version 1](https://arxiv.org/abs/1904.00962v1) to ensure correct convergence. The algorithm is as follows:
+
+![NVLAMB](images/nvlamb.png)
+
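The layerwise trust-ratio step at the heart of LAMB can be sketched in plain Python. This is a simplified illustration of the idea only, not the fused APEX implementation; the hyperparameter defaults and the two-weight example values are illustrative assumptions.

```python
import math

def lamb_step(param, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01, step=1):
    """One simplified LAMB update for a single layer's weights (flat lists)."""
    # Adam-style first and second moment estimates.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]  # bias correction
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * p
              for mh, vh, p in zip(m_hat, v_hat, param)]
    # Layerwise trust ratio: rescale the step by ||w|| / ||update|| per layer,
    # which is what lets LAMB use much larger global batch sizes than Adam.
    w_norm = math.sqrt(sum(p * p for p in param))
    u_norm = math.sqrt(sum(u * u for u in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    param = [p - lr * trust * u for p, u in zip(param, update)]
    return param, m, v

# One step on a tiny two-weight "layer" (illustrative values).
p, m, v = lamb_step([1.0, 2.0], [0.1, 0.1], [0.0, 0.0], [0.0, 0.0])
```

Because the trust ratio is computed per layer, layers with small weights take proportionally small steps, which is the layerwise learning rate strategy referred to above.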
 ### Mixed precision training
 
 Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are achieved by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
@@ -155,7 +159,7 @@ Automatic mixed precision can be enabled with the following code changes:
 from apex import amp
 if fp16:
     # Wrap optimizer and model
-    model, optimizer = amp.initialize(model, optimizer, opt_level=<opt_level>, loss_scale=dynamic)
+    model, optimizer = amp.initialize(model, optimizer, opt_level=<opt_level>, loss_scale="dynamic")
 
 if fp16:
     with amp.scale_loss(loss, optimizer) as scaled_loss:
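What `loss_scale="dynamic"` asks for can be illustrated with a toy scaler that backs off on overflow and grows the scale after a streak of clean steps. This is a simplified, framework-free sketch of the scheme, not Apex's actual implementation; the default scale and growth interval here are illustrative assumptions.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_inf):
        """Adjust the scale; return True if the optimizer step should run."""
        if found_inf:
            self.scale = max(self.scale / 2.0, 1.0)  # overflow: halve, skip step
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps % self.growth_interval == 0:
            self.scale *= 2.0  # long clean streak: try a larger scale again
        return True
```

The point of the dynamic policy is that the scale adapts to the loss magnitude over training, so no hand-tuned static scale is needed.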
@@ -247,7 +251,7 @@ This repository provides scripts to download, verify and extract the following d
 To download, verify, extract the datasets, and create the shards in hdf5 format, run:
 `/workspace/bert/data/create_datasets_from_start.sh`
 
-Depending on the speed of your internet connection, this process takes about a day to complete.
+Depending on the speed of your internet connection, this process takes about a day to complete. The BookCorpus server can become overloaded and may contain broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading later.
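A small retry-with-backoff helper of the kind that note suggests might look like the following. This is a hypothetical sketch, not part of the repository's download scripts; `fetch` is assumed to be a caller-supplied function returning an HTTP-style status code and a payload.

```python
import time

def download_with_retry(fetch, attempts=4, backoff=1.0, retriable=(403, 503)):
    """Call fetch() until it returns a non-retriable status or attempts run out."""
    status, data = None, None
    for i in range(attempts):
        status, data = fetch()
        if status not in retriable:
            return status, data
        time.sleep(backoff * (2 ** i))  # exponential backoff before retrying
    return status, data  # still failing: caller may choose to skip this file
```

Skipping files that never succeed matches the "skip the missing files" option above, since pretraining does not require every BookCorpus document.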

 6. Start pretraining.

@@ -337,6 +341,8 @@ The complete list of the available parameters for the `run_pretraining.py` scrip
   --output_dir OUTPUT_DIR - Path to the output directory where the model
                             checkpoints will be written.
+  --init_checkpoint       - Initial checkpoint to start pretraining from (usually a BERT pretrained checkpoint)
+
   --max_seq_length MAX_SEQ_LENGTH
                           - The maximum total input sequence length after
                             WordPiece tokenization. Sequences longer than
@@ -365,6 +371,10 @@ The complete list of the available parameters for the `run_pretraining.py` scrip
                           - Number of update steps to accumulate before
                             performing a backward/update pass.
+  --allreduce_post_accumulation      - If set to true, performs allreduce only after the defined number of gradient accumulation steps.
+
+  --allreduce_post_accumulation_fp16 - If set to true, performs allreduce after gradient accumulation steps in FP16.
+
   --fp16                  - If set, will perform computations using
                             automatic mixed precision.
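The interplay of gradient accumulation and the deferred allreduce can be sketched as pure-Python control flow. This illustrates only the scheduling; `grad_fn` and `apply_update` are hypothetical stand-ins for the backward pass and the (allreduce + optimizer) step, not functions from the repository.

```python
def train_with_accumulation(batches, grad_fn, apply_update, accumulation_steps):
    """Accumulate gradients over N micro-batches, then do one update."""
    accum = None
    for i, batch in enumerate(batches, start=1):
        g = grad_fn(batch)  # per-micro-batch gradients (flat list of floats)
        accum = g if accum is None else [a + b for a, b in zip(accum, g)]
        if i % accumulation_steps == 0:
            # With --allreduce_post_accumulation, the cross-GPU allreduce
            # would happen here, once per accumulated batch, not per micro-batch.
            apply_update([a / accumulation_steps for a in accum])
            accum = None
    return accum  # leftover gradients from an incomplete final group, if any

updates = []
leftover = train_with_accumulation([[1.0], [3.0], [5.0], [7.0]],
                                   lambda b: b, updates.append, 2)
```

Deferring the allreduce to the update boundary is what saves communication: gradients cross the network once per effective batch rather than once per micro-batch.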
@@ -386,6 +396,7 @@ The complete list of the available parameters for the `run_pretraining.py` scrip
   --phase1_end_step       - The number of steps phase 1 was trained for. In order to
                             resume phase 2 the correct way, phase1_end_step should correspond to the --max_steps phase 1 was trained for.
+
 ```
@@ -566,25 +577,30 @@ Where:
 - `<seed>` random seed for the run.
 - `<allreduce_post_accumulation>` - If set to `true`, performs allreduce only after the defined number of gradient accumulation steps.
 - `<allreduce_post_accumulation_fp16>` - If set to `true`, performs allreduce after gradient accumulation steps in FP16.
-- `<accumulate_into_fp16>` - If set to `true`, accumulates/sums the gradients in FP16.
 
-Note: The above three options need to be set to false when running on fp32.
+Note: The above two options need to be set to false when running on fp32.
 
 - `<training_batch_size_phase2>` is the per-GPU batch size used for training in phase 2. Larger batch sizes run more efficiently, but require more memory.
 - `<learning_rate_phase2>` is the base learning rate for training phase 2.
 - `<warmup_proportion_phase2>` is the percentage of training steps used for warm-up at the start of training.
 - `<training_steps_phase2>` is the total number of training steps for phase 2, to be continued in addition to phase 1.
 - `<gradient_accumulation_steps_phase2>` an integer indicating the number of steps to accumulate gradients over in phase 2. Effective batch size = `training_batch_size_phase2` / `gradient_accumulation_steps_phase2`.
+- `<init_checkpoint>` a checkpoint to start the pretraining routine from (usually a BERT pretrained checkpoint).
 
 For example:
 
 `bash scripts/run_pretraining.sh`
 
 Trains BERT-large from scratch on a DGX-1 32G using FP16 arithmetic. 90% of the training steps are done with sequence length 128 (phase 1 of training) and 10% of the training steps are done with sequence length 512 (phase 2 of training).
 
-In order to train on a DGX-1 16G, set `gradient_accumulation_steps` to `512` and `gradient_accumulation_steps_phase2` to `1024` in `scripts/run_pretraining.sh`.
+To train on a DGX-1 16G, set `gradient_accumulation_steps` to `512` and `gradient_accumulation_steps_phase2` to `1024` in `scripts/run_pretraining.sh`.
+
+To train on a DGX-2 32G, set `train_batch_size` to `4096`, `train_batch_size_phase2` to `2048`, `num_gpus` to `16`, `gradient_accumulation_steps` to `64` and `gradient_accumulation_steps_phase2` to `256` in `scripts/run_pretraining.sh`.
 
-In order to train on a DGX-2 32G, set `train_batch_size` to `4096`, `train_batch_size_phase2` to `2048`, `num_gpus` to `16`, `gradient_accumulation_steps` to `64` and `gradient_accumulation_steps_phase2` to `256` in `scripts/run_pretraining.sh`
+To run the pretraining routine from an initial checkpoint, do the following in `scripts/run_pretraining.sh`:
+- point the `init_checkpoint` variable to the location of the checkpoint
+- set `resume_training` to `true`
+- Note: The parameter value assigned to `BERT_CONFIG` during training should remain unchanged. Also, to resume pretraining on your corpus of choice, the training dataset should be created using the same vocabulary file used in `data/create_datasets_from_start.sh`.
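The relationship between these batch-size knobs can be checked with quick arithmetic, using the DGX-2 32G values quoted above; the helper name below is illustrative, not a function from the repository.

```python
def micro_batch_size(train_batch_size, gradient_accumulation_steps):
    """Per-GPU batch actually pushed through the model on each forward pass."""
    assert train_batch_size % gradient_accumulation_steps == 0
    return train_batch_size // gradient_accumulation_steps

# DGX-2 32G settings quoted above for scripts/run_pretraining.sh
phase1 = micro_batch_size(4096, 64)    # sequence length 128
phase2 = micro_batch_size(2048, 256)   # sequence length 512
```

Raising the accumulation steps is therefore how the same effective batch fits on GPUs with less memory: the per-pass micro-batch shrinks while the optimizer still sees the full batch.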

 ##### Fine-tuning
 
@@ -830,7 +846,6 @@ Our results were obtained by running the `scripts/run_pretraining.sh` and `scrip
 |8  | 4 | 8| 512| 68.16 | 247.04| 3.62| 7.57 | 7.64
 |16 | 4 | 8| 512| 135.68| 488.96| 3.60| 15.08| 15.13
-
 
 ###### Pre-training on multiple NVIDIA DGX-2H With 32G
 
 Note: Multi-node performance numbers below are on DGX-2H, whereas the single-node performance numbers above are on DGX-2.
@@ -870,47 +885,47 @@ Our results were obtained by running the `scripts/run_pretraining_inference.sh`
 
 ###### Pre-training inference on NVIDIA DGX-1 with 16G
 
-|GPUs | Throughput - FP32(sequences/sec)|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 28.32| 94.36
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 2/4                    | 28.32                             | 94.36                                        |
 
 ###### Fine-tuning inference on NVIDIA DGX-1 with 16G
 
-|GPUs | Throughput - FP32(sequences/sec)|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 37.64| 119.76
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 4/4                    | 37.64                             | 119.76                                       |
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 32G)
 
 Our results were obtained by running the `scripts/run_pretraining_inference.sh` and `scripts/run_squad.sh` scripts in the pytorch:19.07-py3 NGC container on NVIDIA DGX-1 with (1x V100 32G) GPUs.
 
 ###### Pre-training inference on NVIDIA DGX-1 with 32G
 
-|GPUs | Throughput(sequences/sec) - FP32|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 27.58| 90.16
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 4/8                    | 27.58                             | 90.16                                        |
 
 ###### Fine-tuning inference on NVIDIA DGX-1 with 32G
 
-|GPUs | Throughput(sequences/sec) - FP32|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 37.64| 119.76
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 4/4                    | 37.64                             | 119.76                                       |
 
 ##### Inference performance: NVIDIA DGX-2 (1x V100 32G)
 
 Our results were obtained by running the `scripts/run_pretraining_inference.sh` and `scripts/run_squad.sh` scripts in the pytorch:19.07-py3 NGC container on NVIDIA DGX-2 with (1x V100 32G) GPUs.
 
 ###### Pre-training inference on NVIDIA DGX-2 with 32G
 
-|GPUs | Throughput - FP32(sequences/sec)|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 30.24| 97.72
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 4/8                    | 30.24                             | 97.72                                        |
 
 ###### Fine-tuning inference on NVIDIA DGX-2 with 32G
 
-|GPUs | Throughput - FP32(sequences/sec)|Throughput - Mixed Precision(sequences/sec)
-|---------- |---------|---------------
-| 1| 35.76| 112.60
+| GPUs | Batch Size (FP32/FP16) | Throughput - FP32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
+|------|------------------------|-----------------------------------|----------------------------------------------|
+| 1    | 4/4                    | 35.76                             | 112.60                                       |
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -920,6 +935,11 @@ The inference performance metrics used were items/second.
 
 ### Changelog
 
+November 2019
+- Use LAMB from APEX
+- Code cleanup
+- Bug fix in BertAdam optimizer
+
 September 2019
 - Scripts to support multi-node launch
 - Update pretraining loss results based on the latest data preparation scripts
PyTorch/LanguageModeling/BERT/data/bertPrep.py (+0 -2)

@@ -158,7 +158,6 @@ def create_record_worker(filename_prefix, shard_id, output_format='tfrecord'):
     bert_preprocessing_command += ' --random_seed=' + str(args.random_seed)
     bert_preprocessing_command += ' --dupe_factor=' + str(args.dupe_factor)
     bert_preprocessing_process = subprocess.Popen(bert_preprocessing_command, shell=True)
-    bert_preprocessing_process.communicate()
 
     last_process = bert_preprocessing_process

@@ -198,7 +197,6 @@ def create_record_worker(filename_prefix, shard_id, output_format='hdf5'):
     bert_preprocessing_command += ' --random_seed=' + str(args.random_seed)
     bert_preprocessing_command += ' --dupe_factor=' + str(args.dupe_factor)
     bert_preprocessing_process = subprocess.Popen(bert_preprocessing_command, shell=True)
-    bert_preprocessing_process.communicate()
 
     last_process = bert_preprocessing_process
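The deleted `communicate()` calls blocked on each shard's subprocess before the next one was launched; removing them lets the shard subprocesses run concurrently, with a single wait afterwards. The pattern looks roughly like this (a minimal sketch; the placeholder shell commands stand in for the real `create_pretraining_data` invocations):

```python
import subprocess

def launch_shards(commands):
    """Start every shard's preprocessing subprocess without blocking in between."""
    procs = [subprocess.Popen(cmd, shell=True) for cmd in commands]
    # One wait at the end replaces the per-launch communicate() calls.
    return [p.wait() for p in procs]

# Placeholder shell commands stand in for the real preprocessing invocations.
codes = launch_shards(["exit 0", "exit 0"])
```

Note that launching many shards at once trades memory and CPU pressure for wall-clock time, so the number of concurrent shards may need capping on smaller machines.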

Binary file changed (86.1 KB)