When I resume training Mask R-CNN with Detectron, training goes well at the beginning, but as it continues, the time per iteration grows progressively. After hundreds or thousands of iterations, training crashes with the CUDA error below:
E0628 19:13:22.730840 3624 net_dag.cc:195] Exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/roi_feat_3rd" input: "gpu_3/fc1_3rd_w" input: "gpu_3/fc1_3rd_b" output: "gpu_3/fc1_3rd" name: "" type: "FC" arg { name: "use_cudnn" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "cudnn_exhaustive_search" i: 0 } device_option { device_type: 1 cuda_gpu_id: 3 }
E0628 19:13:22.730844 3631 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'WeightedSum'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/fc1_2nd_w_grad" input: "gpu_3/one" input: "gpu_3/fc1_2nd_w" input: "gpu_3/wd" output: "gpu_3/fc1_2nd_w_grad" name: "" type: "WeightedSum" device_option { device_type: 1 cuda_gpu_id: 3 }
E0628 19:13:22.730901 3635 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/_[mask]fcn1" output: "gpu_3/[mask]_fcn1" name: "" type: "Relu" arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN"
F0628 19:13:22.730955 3631 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.730959 3624 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.730976 3635 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731009 3632 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731016 3636 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731016 3628 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731050 3626 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731067 3630 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731077 3622 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731089 3633 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731108 3634 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731112 3627 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731127 3623 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731139 3637 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731155 3629 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731155 3625 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
*** Check failure stack trace: ***
I googled this error, and it seems to be related to a GPU memory leak, but during training GPU memory usage is stable and normal until the process crashes. I tried rebooting my server, but that didn't help. Can you help me out with this?
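To back up the observation that memory usage stays flat, per-GPU memory could be sampled periodically while training runs and the first and last samples compared. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH; the function names are illustrative and not part of Detectron or Caffe2.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" row per GPU (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'index, used, total' rows into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def sample_gpu_memory():
    """Take one sample of memory usage for every visible GPU."""
    out = subprocess.check_output(QUERY).decode("utf-8")
    return parse_memory_csv(out)


def memory_growth(samples, gpu_index):
    """Used-memory difference (MiB) between the last and first sample of one GPU."""
    return samples[-1][gpu_index][0] - samples[0][gpu_index][0]
```

Calling `sample_gpu_memory()` every few hundred iterations and passing the collected list to `memory_growth(samples, 3)` would show whether the GPU named in the log (`cuda_gpu_id: 3`) is actually accumulating memory.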
I checked context_gpu.h at line 107; the code there is:
~ThreadLocalCUDAObjects() noexcept {
  for (int i = 0; i < CAFFE2_COMPILE_TIME_MAX_GPUS; ++i) {
    for (auto& handle : cublas_handles_[i]) {
      if (handle) {
        CUBLAS_CHECK(cublasDestroy(handle));
      }
    }
    for (auto& stream : cuda_streams_[i]) {
      if (stream) {
        CUDA_CHECK(cudaStreamDestroy(stream));  // line 107: the failing check
      }
    }
    for (auto& handle : cudnn_handles_[i]) {
      if (handle) {
        CUDNN_CHECK(cudnnDestroy(handle));
      }
    }
  }
}
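Since this check sits in a destructor, it is probably only where the error was detected, not where it originated. CUDA kernel launches are asynchronous, and a launch failure is reported by later CUDA calls. One way to narrow it down might be to rerun with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous so the failing operator is reported closer to its source. This is a minimal sketch; the helper name is hypothetical and the training command is left to the caller.

```python
import os


def blocking_launch_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, each kernel launch blocks until it completes, so an
    # error surfaces at the launch that caused it instead of a later call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.call` together with whatever command is normally used to start training. Training will run slower in this mode, so it is only meant for debugging.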