[nvidia-ctk-installer] do not revert cri-o config on shutdown #1360
base: main
Conversation
Force-pushed from 29dca9c to 7d12529
This commit updates the behavior of the nvidia-ctk-installer for cri-o. On shutdown, we no longer delete the drop-in config file as long as none of the nvidia runtime handlers are set as the default runtime.

This change was made to work around an issue observed when uninstalling the gpu-operator: management containers launched with the nvidia runtime handler would get stuck in the terminating state with the below error message:

```
failed to find runtime handler nvidia from runtime list map[crun:... runc:...], failed to "KillPodSandbox" for ...
```

There appears to be a race condition where the nvidia-ctk-installer removes the drop-in file and restarts cri-o. After the cri-o restart, if there are still pods / containers to terminate that were started with the nvidia runtime, then cri-o fails to terminate them. The behavior of cri-o, and its in-memory runtime handler cache, appears to differ from that of containerd, as we have never encountered such an issue with containerd.

This commit can be considered a stop-gap solution until a more robust solution is developed.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
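For illustration, a minimal Go sketch of the shutdown behavior described above, under assumed names: the `Options` type, the `RevertConfig` method, and the drop-in path are inventions for this sketch (only the `SetAsDefault` field is taken from the diff quoted later in this conversation), so this is not the installer's actual implementation.

```go
// Sketch only: type, method, and path names are assumptions, not the
// actual nvidia-ctk-installer code.
package main

import (
	"fmt"
	"os"
)

// Options holds the pieces of installer state relevant to this sketch.
type Options struct {
	// SetAsDefault is true when an nvidia handler was configured as the
	// default runtime at install time (name taken from the quoted diff).
	SetAsDefault bool
	// DropInConfig is the cri-o drop-in file written at install time
	// (example path; the real location may differ).
	DropInConfig string
}

// RevertConfig mirrors the behavior described in the commit message: on
// shutdown, keep the drop-in file unless an nvidia handler is the default
// runtime, so cri-o can still terminate pods that were started with the
// nvidia handler.
func (o *Options) RevertConfig() error {
	if !o.SetAsDefault {
		// nvidia is not the default runtime: leave the drop-in in place
		// to avoid the "failed to find runtime handler nvidia" race on
		// teardown.
		return nil
	}
	// nvidia is the default runtime: the config must be reverted, or every
	// future pod would reference a runtime whose binaries are about to be
	// removed.
	if err := os.Remove(o.DropInConfig); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to remove drop-in config: %w", err)
	}
	// In the real installer a cri-o restart (e.g. via systemd) would follow
	// here; omitted in this sketch.
	return nil
}

func main() {
	o := &Options{
		SetAsDefault: false,
		DropInConfig: "/etc/crio/crio.conf.d/99-nvidia.conf", // example path
	}
	if err := o.RevertConfig(); err != nil {
		fmt.Println("revert failed:", err)
	}
	// With SetAsDefault == false the drop-in file is left untouched.
}
```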
Force-pushed from 7d12529 to c9c7aa7
One question: Even if we don't delete the drop-in file, do we still remove the binaries?
if !o.SetAsDefault {
    return nil
}
Don't we still need to check whether we're using a drop-in file? We could be modifying the top-level config directly.
I am not sure I follow why that matters? Regardless of whether we use a drop-in file or not, removing the nvidia runtime from the cri-o config can lead to the issues described in the PR description.
The reason I thought to add this conditional is that we must attempt to revert the config if nvidia is the default runtime, or else all future pods will fail to function (after the nvidia runtime binaries are removed).
I believe we still do, yes.
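To summarize the thread above, here is a small self-contained sketch of the decision being discussed; the function name and signature are hypothetical and not part of the installer.

```go
// Hypothetical helper, for illustration only.
package main

import "fmt"

// shouldRevertCrioConfig returns true only when an nvidia handler is the
// default runtime. Per the discussion above, the answer is the same whether
// the installer wrote a drop-in file or modified the top-level cri-o config
// in place: reverting while nvidia-launched pods are still terminating risks
// the "failed to find runtime handler" error, while leaving nvidia as the
// default runtime after its binaries are removed would break all future pods.
func shouldRevertCrioConfig(setAsDefault bool) bool {
	return setAsDefault
}

func main() {
	fmt.Println(shouldRevertCrioConfig(false)) // false: drop-in kept on shutdown
	fmt.Println(shouldRevertCrioConfig(true))  // true: config reverted
}
```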
Force-pushed from d32d3f9 to e6d4b6e
…eanup Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Force-pushed from e6d4b6e to 521c243