[ELECTRA / TF2] One GPU's utilization periodically falls to 0% during multi-GPU training with Horovod #1276

@goreng2

Description


Related to Model/Framework(s)
ELECTRA / TF2

Describe the bug
First, there is no problem when training on four A6000 GPUs (no NVLink). But on two A100 GPUs (with NVLink), one GPU's utilization (in my case GPU #1, not #0) periodically falls to 0% during multi-GPU training on the nvcr.io/nvidia/tensorflow:YY.MM-tf2-py3 Docker image.

There is no problem with single-GPU training, so I believe all of the GPU hardware is healthy. I also swapped the two GPUs with each other, but the problem remained: the physical GPU changed, yet GPU #1 still does not work properly.

I also suspected a CUDA version mismatch (driver = 11.4, Docker image = 11.7 via nvcr.io/nvidia/tensorflow:22.04-tf2-py3), so I tested Docker images with CUDA 11.4 (21.07) and CUDA 11.3 (21.06), but the problem remained there too.
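To correlate the periodic drop with training steps, I log per-GPU utilization over time with a small polling script. The `nvidia-smi` query flags below are standard; the polling interval, the 0% threshold, and the helper names are just a sketch for this investigation, not part of the ELECTRA scripts.

```python
import subprocess
import time


def parse_utilization(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`
    into a {gpu_index: utilization_percent} dict."""
    util = {}
    for line in csv_text.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util


def watch(interval_s=1.0):
    """Poll nvidia-smi and report whenever any GPU drops to 0% utilization."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        for idx, pct in parse_utilization(out).items():
            if pct == 0:
                print(f"{time.strftime('%H:%M:%S')}  GPU {idx} utilization dropped to 0%")
        time.sleep(interval_s)
```

Running `watch()` alongside training shows the drops are periodic on GPU #1 only; `nvidia-smi dmon` gives a similar timeline if you prefer a built-in tool.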

To Reproduce
Follow the same Quick Start Guide; the problem reproduces with the unmodified steps.

Expected behavior
All GPUs should stay steadily at 100% utilization, as they do when using the A6000 GPUs.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): I tested nvcr.io/nvidia/tensorflow:22.04-tf2-py3, nvcr.io/nvidia/tensorflow:21.07-tf2-py3, and nvcr.io/nvidia/tensorflow:21.06-tf2-py3
  • GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB): 2x NVIDIA A100 80GB (NVLink)
  • CUDA driver version (e.g. 418.67): 470.103.01
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   70C    P0   120W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   55C    P0    87W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working)

