[Resnet/Mxnet] Increasing batch size causes cudaMalloc error #691


Description

@karanveersingh5623

Related to Model/Framework(s)
Resnet/Mxnet

Describe the bug
On a Tesla T4 GPU, increasing the batch size from 192 to 256 with 2 iterations causes a cudaMalloc (out of memory) error; the trace is below. FYI, batch size 192 with 500 iterations works fine.
Requirement: run a 10k batch size with 10 iterations on 100,000 training images.

```
root@061df1cec673:/workspace/rn50# python3 benchmark.py -n 1 -b 256 --data-root /data/imagenet/train-val-recordio-passthrough/tmp --dtype float16 -o benchmark_report_fp16.json -i 2 -e 1 --mode train
[1,0]:2020-09-17 11:55:59,675:INFO: Start with arguments Namespace(amp=False, arch='resnetv15', batch_size=256, batchnorm_eps=1e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, begin_epoch=0, benchmark_iters=2, brightness=0, contrast=0, conv_layout='NHWC', dali_fuse_decoder=1, dali_nvjpeg_memory_padding=64, dali_prefetch_queue=2, dali_separ_val=False, dali_threads=3, dali_validation_threads=10, data_backend='dali-gpu', data_mxnet_threads=40, data_pred=None, data_train='/data/imagenet/train-val-recordio-passthrough/tmp/train.rec', data_train_idx='/data/imagenet/train-val-recordio-passthrough/tmp/train.idx', data_val='/data/imagenet/train-val-recordio-passthrough/tmp/val.rec', data_val_idx='/data/imagenet/train-val-recordio-passthrough/tmp/val.idx', data_val_resize=256, disp_batches=20, dtype='float16', fuse_bn_add_relu=1, fuse_bn_relu=1, gpus=[0], image_shape=[4, 224, 224], input_layout='NCHW', kv_store='horovod', label_smoothing=0.1, load=None, log='log.log', lr=0.256, lr_factor=0.256, lr_schedule='cosine', lr_steps=[], max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=0.05, min_random_aspect_ratio=0.75, min_random_scale=1, mixup=0, mode='train', model_prefix='model', mom=0.875, no_metrics=True, num_classes=1000, num_epochs=1, num_examples=1281167, num_groups=32, num_layers=50, optimizer='sgd', pca_noise=0, pooling_layout='NHWC', random_crop=0, random_mirror=1, random_resized_crop=1, report='benchmark_report_fp16.json-1,256', rgb_mean=[123.68, 116.779, 103.939], rgb_std=[58.393, 57.12, 57.375], saturation=0, save_frequency=-1, seed=None, test_io=False, test_io_mode='train', warmup_epochs=5, wd=3.0517578125e-05)
[1,0]:2020-09-17 11:56:05,036:WARNING: DALI iterator does not support resetting while epoch is not finished. Ignoring...
[1,0]:2020-09-17 11:56:05,037:INFO: Starting epoch 0
[1,0]:[11:56:05] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:120: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/IB : No device found.
[1,0]:NCCL version 2.4.7+cuda10.1
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO Setting affinity for GPU 0 to aaaaaa,aaaaaaaa,aaaaaaaa
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO comm 0x7fd90c270d30 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
[1,0]:Traceback (most recent call last):
[1,0]: File "train.py", line 70, in <module>
[1,0]: fit.fit(args, model, data_loader)
[1,0]: File "/workspace/rn50/fit.py", line 518, in fit
[1,0]: mx.nd.waitall()
[1,0]: File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 166, in waitall
[1,0]: check_call(_LIB.MXNDArrayWaitAll())
[1,0]: File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
[1,0]: raise MXNetError(py_str(_LIB.MXGetLastError()))
[1,0]:mxnet.base.MXNetError: [11:56:09] src/storage/./pooled_storage_manager.h:157: cudaMalloc failed: out of memory
[1,0]:Stack trace:
[1,0]: [bt] (0) /usr/local/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7fdab7c3d783]
[1,0]: [bt] (1) /usr/local/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x21c) [0x7fdaba6ca2ec]
[1,0]: [bt] (2) /usr/local/lib/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x5a) [0x7fdaba6cc73a]
[1,0]: [bt] (3) /usr/local/lib/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x804) [0x7fdab7d3cb74]
[1,0]: [bt] (4) /usr/local/lib/libmxnet.so(mxnet::exec::StorageFallbackOpExecutor::PreFCompute(bool)+0x137a) [0x7fdab9e52a1a]
[1,0]: [bt] (5) /usr/local/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x34) [0x7fdab9e53154]
[1,0]: [bt] (6) /usr/local/lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::imperative::CreateEngineOp(mxnet::Context const&, std::vector<std::shared_ptr<mxnet::exec::OpExecutor>, std::allocator<std::shared_ptr<mxnet::exec::OpExecutor> > > const&)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x15e) [0x7fdab9f6e5be]
[1,0]: [bt] (7) /usr/local/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5b5) [0x7fdaba6ae155]
[1,0]: [bt] (8) /usr/local/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x176) [0x7fdaba6c40f6]
```
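For reference, a minimal sketch (standard MXNet 1.x APIs, not code from this repo) to check the T4's free memory and to rule out the cuDNN autotuning workspace as the thing pushing batch 256 over the edge:

```python
# Minimal sketch, assuming the MXNet 1.x context API shipped in the
# 19.07 NGC container; not part of benchmark.py.
import os

# Disable the autotuning pass from the log above ("Running performance
# tests to find the best convolution algorithm..."); it allocates extra
# workspace while searching. Set before any convolutions run.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx

free_bytes, total_bytes = mx.context.gpu_memory_info(0)
print('GPU 0: %.2f GiB free / %.2f GiB total'
      % (free_bytes / 2**30, total_bytes / 2**30))
```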

To Reproduce
Steps to reproduce the behavior:

  1. python3 benchmark.py -n 1 -b 256 --data-root /data/imagenet/train-val-recordio-passthrough/tmp --dtype float16 -o benchmark_report_fp16.json -i 2 -e 1 --mode train

Expected behavior
The benchmark should run a 10k batch size with 10 iterations on 100,000 training images without a cudaMalloc failure.
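
A 10k batch will not physically fit in the T4's 16 GB in a single step, so the effective batch size presumably has to be reached by gradient accumulation. Below is a minimal plain-Gluon sketch of that technique; it is not the repo's benchmark.py path, and `per_step_batch`, the dummy `data_iter`, and the model/loss choices are illustrative stand-ins:

```python
# Gradient-accumulation sketch, assuming "10k batch" means an
# *effective* batch size reached over several smaller steps.
import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)
per_step_batch = 64           # small enough to fit the T4's 16 GB
accum_steps = 160             # 160 x 64 = 10,240 effective batch size

net = gluon.model_zoo.vision.resnet50_v1(classes=1000)
net.initialize(ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.256, 'momentum': 0.875})

# grad_req='add' makes backward() accumulate into the existing
# gradient buffers instead of overwriting them each step.
for p in net.collect_params().values():
    p.grad_req = 'add'

def data_iter():
    # Hypothetical stand-in for the DALI pipeline: random images/labels.
    for _ in range(accum_steps):
        yield (mx.nd.random.uniform(shape=(per_step_batch, 3, 224, 224)),
               mx.nd.random.randint(0, 1000,
                                    shape=(per_step_batch,)).astype('float32'))

for step, (data, label) in enumerate(data_iter()):
    with autograd.record():
        out = net(data.as_in_context(ctx))
        loss = loss_fn(out, label.as_in_context(ctx))
    loss.backward()
    if (step + 1) % accum_steps == 0:
        # Normalize by the full effective batch, then reset the buffers.
        trainer.step(per_step_batch * accum_steps)
        for p in net.collect_params().values():
            p.zero_grad()
```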

Environment
  • Container version: MXNet 19.07-py3 NGC container
  • GPUs in the system: Tesla T4
  • CUDA driver version: 10.1
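
A small assumed helper (not from the issue template) to collect the details above from inside the container:

```python
# Prints the framework version and visible GPU count; standard
# MXNet 1.x calls available in the 19.07 NGC container.
import mxnet as mx

print('MXNet version:', mx.__version__)
print('GPUs visible :', mx.context.num_gpus())
```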
