WIP: Retry pod admission when the device manager fails #131190
Currently admission is a binary succeed/reject, but this doesn't correctly model that admission requests can sometimes _fail_. This is primarily true when interfacing with devices and other external components that may be racing the Kubelet to start or recycle configuration. Here we introduce the concept of failure to the interface to allow for the Kubelet to make better decisions about pod admission and retries.
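The difference between a rejection and a failure can be sketched as a three-way result. This is only an illustration under stated assumptions: the names `Outcome`, `Result`, `admitDevices`, and `errPluginNotReady` are hypothetical and are not taken from the PR or from the kubelet's actual admission interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a hypothetical three-way admission outcome (illustrative names only).
type Outcome int

const (
	Admitted Outcome = iota // the pod may run
	Rejected                // a definitive "no": retrying will not help
	Failed                  // admission could not be decided; a retry may succeed
)

// Result pairs an outcome with the error that caused a failure, if any.
type Result struct {
	Outcome Outcome
	Err     error
}

// errPluginNotReady is an assumed error for a device plugin that has not registered yet.
var errPluginNotReady = errors.New("device plugin not registered yet")

// admitDevices is a toy stand-in for a device admission check, not kubelet code.
func admitDevices(requested, allocatable int, pluginReady bool) Result {
	if !pluginReady {
		// The external component may still be starting, so this is a failure, not a rejection.
		return Result{Outcome: Failed, Err: errPluginNotReady}
	}
	if requested > allocatable {
		return Result{Outcome: Rejected}
	}
	return Result{Outcome: Admitted}
}

func main() {
	fmt.Println(admitDevices(1, 2, true).Outcome == Admitted)
	fmt.Println(admitDevices(3, 2, true).Outcome == Rejected)
	fmt.Println(admitDevices(1, 0, false).Outcome == Failed)
}
```

Keeping `Failed` separate from `Rejected` is what lets a caller decide whether a retry is worthwhile.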
Currently admission is a binary succeed/reject, but this doesn't correctly model that admission requests can sometimes _fail_. This is primarily true when interfacing with devices and other external components that may be racing the Kubelet to start or recycle configuration. Here we introduce the concept of failure to scoped TM admission to allow for the Kubelet to make better decisions about pod admission and retries.
This is a very coarse hack in the process of resolving kubernetes#128043. I'm not happy with the retryable error concept, nor am I sure about the exact paths that we want to retry. Very much open to changes here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Currently, Pods that consume devices may fail during a node reboot due to a few different issues and races within the device plugin flow. This PR introduces a new mechanism that allows pod admission to fail, rather than being a simple binary admit/reject. We then use this signal to implement a single 5s retry during the kubelet sync loop; if the retry does not succeed, the pod is rejected even when the error is retryable (this last part is still to be implemented).
This is a partial improvement to the node reliability issue, with further improvements to the admission cycle requiring a KEP, but should help many cases in the meantime.
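The retry behaviour described above can be sketched roughly as follows. This is a minimal sketch, not the PR's implementation: `admitWithSingleRetry`, `errRetryable`, `errDenied`, and the injected `sleep` callback are all hypothetical names, and the real change hooks into the kubelet sync loop rather than sleeping inline.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryDelay mirrors the single 5s retry mentioned in the description.
const retryDelay = 5 * time.Second

// Hypothetical sentinel errors: one for failures worth retrying, one for hard rejections.
var (
	errRetryable = errors.New("admission failed; may succeed later")
	errDenied    = errors.New("admission rejected")
)

// admitWithSingleRetry runs admit once. On a retryable failure it waits via
// sleep and tries exactly once more; any error on the second attempt,
// retryable or not, rejects the pod.
func admitWithSingleRetry(admit func() error, sleep func()) (admitted bool, attempts int) {
	err := admit()
	attempts = 1
	if err == nil {
		return true, attempts
	}
	if !errors.Is(err, errRetryable) {
		return false, attempts // hard rejection: do not retry
	}
	sleep()
	attempts++
	return admit() == nil, attempts
}

func main() {
	calls := 0
	// flaky fails on the first call only, like a device plugin that is still starting.
	flaky := func() error {
		calls++
		if calls == 1 {
			return errRetryable
		}
		return nil
	}
	// The demo sleeps briefly; a real caller would wait retryDelay.
	sleep := func() { time.Sleep(10 * time.Millisecond) }
	ok, n := admitWithSingleRetry(flaky, sleep)
	fmt.Printf("admitted=%v attempts=%d (real delay would be %s)\n", ok, n, retryDelay)
}
```

Injecting `sleep` keeps the sketch testable without waiting the full delay. Bounding the retry to one attempt means a pod whose device never becomes ready is still rejected promptly.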
Which issue(s) this PR fixes:
Partially resolves #128043
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/hold