Check failed: error == cudaSuccess unspecified launch failure #3

@wenhe-jia

Description


When I tried to resume training Mask R-CNN with Detectron, training went well at the beginning, but as it continued the time per iteration grew progressively. After hundreds or thousands of iterations, training broke down with the CUDA error below:

E0628 19:13:22.730840 3624 net_dag.cc:195] Exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/roi_feat_3rd" input: "gpu_3/fc1_3rd_w" input: "gpu_3/fc1_3rd_b" output: "gpu_3/fc1_3rd" name: "" type: "FC" arg { name: "use_cudnn" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "cudnn_exhaustive_search" i: 0 } device_option { device_type: 1 cuda_gpu_id: 3 }
E0628 19:13:22.730844 3631 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'WeightedSum'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/fc1_2nd_w_grad" input: "gpu_3/one" input: "gpu_3/fc1_2nd_w" input: "gpu_3/wd" output: "gpu_3/fc1_2nd_w_grad" name: "" type: "WeightedSum" device_option { device_type: 1 cuda_gpu_id: 3 }
E0628 19:13:22.730901 3635 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: unspecified launch failure Error from operator:
input: "gpu_3/_[mask]fcn1" output: "gpu_3/[mask]_fcn1" name: "" type: "Relu" arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN"
F0628 19:13:22.730955 3631 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.730959 3624 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.730976 3635 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731009 3632 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731016 3636 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731016 3628 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731050 3626 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731067 3630 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731077 3622 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731089 3633 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731108 3634 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731112 3627 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731127 3623 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731139 3637 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731155 3629 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
F0628 19:13:22.731155 3625 context_gpu.h:107] Check failed: error == cudaSuccess unspecified launch failure
*** Check failure stack trace: ***
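
Logs like the one above are long and repetitive. Below is a minimal sketch, assuming the log has been saved as plain text, of a Python helper that counts which operator types and GPU ids appear in the dumped operator definitions and which source locations raised fatal checks. The names `summarize`, `FATAL`, `OP_TYPE`, and `GPU_ID` are hypothetical and not part of Caffe2 or Detectron.

```python
import re
from collections import Counter

# A glog fatal entry: "F" + MMDD, a timestamp, a thread id, then "file:line]".
# Not anchored to the line start, so entries fused onto one line are still found.
FATAL = re.compile(r'F\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+(\d+) ([\w.]+):(\d+)\]')
# Operator type inside a dumped operator definition, e.g. type: "FC".
OP_TYPE = re.compile(r'\btype: "(\w+)"')
# Device id inside device_option, e.g. cuda_gpu_id: 3.
GPU_ID = re.compile(r'\bcuda_gpu_id: (\d+)')


def summarize(log_text):
    """Return (ops, fatal_sites, fatal_threads) for a Caffe2/glog log.

    ops           -- Counter keyed by (gpu_id, operator_type)
    fatal_sites   -- Counter keyed by "file:line" of each fatal check
    fatal_threads -- set of thread ids that logged a fatal check
    """
    ops = Counter()
    fatal_sites = Counter()
    fatal_threads = set()
    for line in log_text.splitlines():
        for match in FATAL.finditer(line):
            fatal_threads.add(int(match.group(1)))
            fatal_sites["%s:%s" % (match.group(2), match.group(3))] += 1
        op, gpu = OP_TYPE.search(line), GPU_ID.search(line)
        if op and gpu:
            ops[(int(gpu.group(1)), op.group(1))] += 1
    return ops, fatal_sites, fatal_threads
```

Calling `summarize(open("train.log").read())` (the file name is an assumption) gives a compact view of whether the failures cluster on one device. In the log above, every dumped operator carries `cuda_gpu_id: 3` and every fatal check points at `context_gpu.h:107`.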

I googled this error, and it seems to be related to a GPU memory leak, but during training GPU memory usage is stable and normal until the process breaks down. I tried rebooting my server, but that didn't help. Can you help me out with this?
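
To back up the observation that memory usage stays stable, the readings can be logged rather than watched by eye. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH and reports numeric values for every GPU. The helper names `parse_rows` and `sample_gpu_memory` are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi; values are reported in MiB with "nounits".
QUERY = "index,memory.used,memory.total"


def parse_rows(csv_text):
    """Parse 'index, used, total' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "total_mib": total})
    return rows


def sample_gpu_memory():
    """Query nvidia-smi once and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_rows(out)
```

Calling `sample_gpu_memory()` every few seconds alongside training and appending the result to a file would show whether memory on any GPU drifts before the crash.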
I checked context_gpu.h at line 107; the code is:

 98   ~ThreadLocalCUDAObjects() noexcept {
 99     for (int i = 0; i < CAFFE2_COMPILE_TIME_MAX_GPUS; ++i) {
100       for (auto& handle : cublas_handles_[i]) {
101         if (handle) {
102           CUBLAS_CHECK(cublasDestroy(handle));
103         }
104       }
105       for (auto& stream : cuda_streams_[i]) {
106         if (stream) {
107           CUDA_CHECK(cudaStreamDestroy(stream));
108         }
109       }
110       for (auto& handle : cudnn_handles_[i]) {
111         if (handle) {
112           CUDNN_CHECK(cudnnDestroy(handle));
113         }
114       }
115     }
116   }
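
Line 107 sits in a destructor, so the check there may only be reporting an error raised earlier: CUDA kernel launches are asynchronous, and a fault inside a kernel is typically surfaced by a later runtime call. Below is a minimal sketch for re-running training with `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous so the failure is reported closer to the launch that caused it. The helper names `blocking_env` and `run_training` are hypothetical, and the command line in the comment is an assumption.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(cmd, base_env=None):
    """Run the training command with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.call(cmd, env=blocking_env(base_env))


# Example (hypothetical command line; substitute the real config file):
# run_training([sys.executable, "tools/train_net.py", "--cfg", "path/to/config.yaml"])
```

Synchronous launches slow training down, so this is meant for a diagnostic run only.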