Conversation

@cdesiniotis cdesiniotis commented Oct 16, 2025

This commit updates the behavior of the nvidia-ctk-installer for cri-o.
On shutdown, we no longer delete the drop-in config file as long as
none of the nvidia runtime handlers is set as the default runtime.
This change was made to work around an issue observed when uninstalling
the gpu-operator -- management containers launched with the nvidia
runtime handler would get stuck in the terminating state with the
following error message:

```
failed to find runtime handler nvidia from runtime list map[crun:... runc:...], failed to "KillPodSandbox" for ...
```
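For context, the drop-in file in question is a cri-o config fragment that registers the nvidia runtime handler. Its exact path and contents vary by installation; a representative (illustrative, not verbatim) fragment might look like:

```toml
# e.g. /etc/crio/crio.conf.d/99-nvidia.conf (illustrative path)
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
runtime_type = "oci"
```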

There appears to be a race condition where the nvidia-ctk-installer removes the drop-in file
and restarts cri-o. After the cri-o restart, if there are still pods/containers to terminate
that were started with the nvidia runtime, cri-o fails to terminate them. The behavior
of cri-o and its in-memory runtime handler cache appears to differ from that of containerd, as
we have never encountered such an issue with containerd.

This commit can be considered a stop-gap solution until a more robust one is developed.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>

@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 7 times, most recently from 29dca9c to 7d12529 Compare October 17, 2025 05:49
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch from 7d12529 to c9c7aa7 Compare October 17, 2025 05:50
@cdesiniotis cdesiniotis marked this pull request as ready for review October 17, 2025 05:50
@cdesiniotis cdesiniotis changed the title [nvidia-ctk-installer] do not revert cri-o config when nvidia is not … [nvidia-ctk-installer] do not revert cri-o config on shutdown Oct 17, 2025
elezar commented Oct 17, 2025

One question: Even if we don't delete the drop-in file, do we still remove the binaries?

Comment on lines +183 to +185
if !o.SetAsDefault {
return nil
}
Member

Don't we still need to check whether we're using a drop-in file? We could be modifying the top-level config directly.

Contributor Author

I am not sure I follow why that matters? Regardless of whether we use a drop-in file or not, removing the nvidia runtime from the cri-o config can lead to the issues described in the PR description.

The reason I thought to add this conditional is that we must attempt to revert the config if nvidia is the default runtime, or else all future pods will fail to function (after the nvidia runtime binaries are removed).

@cdesiniotis
Contributor Author

One question: Even if we don't delete the drop-in file, do we still remove the binaries?

I believe we still do, yes.

pkg/config/engine/config.go (outdated review thread, resolved)
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 2 times, most recently from d32d3f9 to e6d4b6e Compare October 17, 2025 20:57
…eanup

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch from e6d4b6e to 521c243 Compare October 17, 2025 20:59
pkg/config/engine/containerd/config.go (outdated review thread, resolved)
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>