Conversation

@cdesiniotis cdesiniotis commented Oct 16, 2025

This commit updates the behavior of the nvidia-ctk-installer for cri-o.
On shutdown, we no longer delete the drop-in config file as long as
none of the nvidia runtime handlers is set as the default runtime.
This change was made to work around an issue observed when uninstalling
the gpu-operator -- management containers launched with the nvidia
runtime handler would get stuck in the terminating state with the
following error message:

```
failed to find runtime handler nvidia from runtime list map[crun:... runc:...], failed to "KillPodSandbox" for ...
```
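For context, the drop-in file in question is a cri-o config fragment that registers the nvidia runtime handler. Its exact path and contents vary by installation; a representative (illustrative, not verbatim) fragment might look like:

```toml
# e.g. /etc/crio/crio.conf.d/99-nvidia.conf (illustrative path)
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
runtime_type = "oci"
```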

There appears to be a race condition where the nvidia-ctk-installer removes the drop-in file
and restarts cri-o. After the cri-o restart, if there are still pods/containers to terminate
that were started with the nvidia runtime, cri-o fails to terminate them. The behavior
of cri-o and its in-memory runtime handler cache appears to differ from that of containerd, as
we have never encountered such an issue with containerd.

This commit can be considered a stop-gap solution until a more robust one is developed.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>

@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 7 times, most recently from 29dca9c to 7d12529 Compare October 17, 2025 05:49
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch from 7d12529 to c9c7aa7 Compare October 17, 2025 05:50
@cdesiniotis cdesiniotis marked this pull request as ready for review October 17, 2025 05:50
@cdesiniotis cdesiniotis changed the title [nvidia-ctk-installer] do not revert cri-o config when nvidia is not … [nvidia-ctk-installer] do not revert cri-o config on shutdown Oct 17, 2025
elezar commented Oct 17, 2025

One question: Even if we don't delete the drop-in file, do we still remove the binaries?

Comment on lines +183 to +185
if !o.SetAsDefault {
return nil
}
Member

Don't we still need to check whether we're using a drop-in file? We could be modifying the top-level config directly.

Contributor Author

I am not sure I follow why that matters? Regardless of whether we use a drop-in file or not, removing the nvidia runtime from the cri-o config can lead to the issues described in the PR description.

The reason I thought to add this conditional is that we must attempt to revert the config if nvidia is the default runtime, or else all future pods will fail to function (after the nvidia runtime binaries are removed).

@cdesiniotis
Contributor Author

One question: Even if we don't delete the drop-in file, do we still remove the binaries?

I believe we still do, yes.

pkg/config/engine/config.go (outdated review thread, resolved)
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 2 times, most recently from d32d3f9 to e6d4b6e Compare October 17, 2025 20:57
…eanup

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch from e6d4b6e to 521c243 Compare October 17, 2025 20:59
pkg/config/engine/containerd/config.go (outdated review thread, resolved)
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>