Add concurrent worker configuration for DisruptionController #131386
base: master
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sun Apr 20 01:33:00 UTC 2025.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @xigang. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig apps
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/remove-sig api-machinery
/cc @liggitt @wojtek-t @deads2k @alculquicondor Could you help review this? Thanks!
cc @ricky1993 as this is a feature, a release note may be needed.
Release note added.
@siyuanfoundation Thanks for the reply! I'll submit a KEP later:)
/remove-sig api-machinery
@xigang I would suggest discussing this first before writing a KEP.
A couple of questions:
- How often do we observe this, and under what conditions / cluster load?
- Do we have some metrics to support that?
	if obj.ConcurrentDisruptionSyncs == 0 {
		obj.ConcurrentDisruptionSyncs = 5
	}
	if obj.ConcurrentDisruptionStalePodSyncs == 0 {
		obj.ConcurrentDisruptionStalePodSyncs = 5
	}
}
Why did we choose 5 as a default here? Is it a good idea to change the defaults and spin up 10 goroutines for current users?
Currently, it's unclear what the default value should be set to. For now, we're referring to the Deployment controller's default value of 5.
obj.ConcurrentDeploymentSyncs = 5
The default value should be the same as the one used today, which IIUIC is 1.
@aojea Agreed. The ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs parameters have been set to their original default value of 1.
done.
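For clarity, a minimal sketch of the revised defaulting, mirroring the hunk above but with the agreed default of 1 (the enclosing function is omitted here, as in the diff fragments above):

```go
// Sketch only: defaulting both new fields to 1 preserves today's
// single-worker behavior for existing users.
if obj.ConcurrentDisruptionSyncs == 0 {
	obj.ConcurrentDisruptionSyncs = 1
}
if obj.ConcurrentDisruptionStalePodSyncs == 0 {
	obj.ConcurrentDisruptionStalePodSyncs = 1
}
```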
go wait.Until(dc.recheckWorker, time.Second, ctx.Done())
go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)

for i := 0; i < stalePodWorkers; i++ {
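Since the hunk above is truncated, here is a hedged sketch of what worker startup could look like with configurable counts. The variables disruptionWorkers and stalePodWorkers are assumed to be derived from the new ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs settings, and dc.worker is assumed to be the existing PDB sync worker:

```go
// Sketch only: start the configured number of workers instead of one of each.
// dc.worker drains the PDB queue; dc.stalePodDisruptionWorker clears stale
// DisruptionTarget conditions; dc.recheckWorker stays single-threaded.
for i := 0; i < disruptionWorkers; i++ {
	go wait.UntilWithContext(ctx, dc.worker, time.Second)
}
go wait.Until(dc.recheckWorker, time.Second, ctx.Done())
for i := 0; i < stalePodWorkers; i++ {
	go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)
}
```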
Is it useful to increase the number of workers for both the pdb processing and stale conditions?
+1, can you break down the perf implications of each option? @xigang
I don't have specific performance metrics yet, but I was concerned that a single worker might limit throughput, so I increased the number of concurrent workers.
issue: #82930
in addition, have we checked that the reconcile loop can be parallelized?
@aojea Yes, the stalePodDisruptionWorker reconcile loop is safe to parallelize. It uses thread-safe workqueues and processes pod keys independently. Each worker safely updates the Pod's DisruptionTarget condition to False when it becomes stale (after stalePodDisruptionTimeout).
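A minimal sketch of the workqueue pattern being described (names approximate the upstream controller; the exact signatures are assumptions). Several goroutines can run this loop at once because the workqueue hands a given key to only one worker at a time, and each sync touches a single Pod independently:

```go
// Sketch only: each worker drains the shared stale-pod-disruption queue.
func (dc *DisruptionController) stalePodDisruptionWorker(ctx context.Context) {
	for dc.processNextStalePodDisruptionWorkItem(ctx) {
	}
}

func (dc *DisruptionController) processNextStalePodDisruptionWorkItem(ctx context.Context) bool {
	key, quit := dc.stalePodDisruptionQueue.Get()
	if quit {
		return false
	}
	defer dc.stalePodDisruptionQueue.Done(key)
	// Clear the stale DisruptionTarget condition for this pod key; requeue on error.
	if err := dc.syncStalePodDisruption(ctx, key); err != nil {
		dc.stalePodDisruptionQueue.AddRateLimited(key)
		return true
	}
	dc.stalePodDisruptionQueue.Forget(key)
	return true
}
```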
@atiratree I currently don't have any load metrics related to PDBs. I noticed that @ricky1993 in the community encountered a performance issue (#82930) due to single-threaded processing of PDBs, so a PR was submitted to add multi-worker support to the Disruption Controller to address it. In large-scale clusters (for example, a single cluster with 150,000 Pods), the Disruption Controller can run into performance issues during reconciliation because it relies on a ListWatch of Pods to evaluate PDBs. I think it makes sense for the Disruption Controller to expose a configurable parameter for the number of workers. What do you think?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: xigang, yashpawar6849. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I am afraid it is insufficient to merge user-facing code without analyzing the impact. This is especially true because the disruption controller workers can result in an increased number of API requests.
@ricky1993 Could you share the performance issue data for the PDB? Looking forward to your response.
Signed-off-by: xigang <wangxigang2014@gmail.com>
/test pull-kubernetes-unit
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-kind-ipv6
@atiratree We can expose the number of concurrent workers as a configurable parameter so that operators can tune it for their cluster scale.
/remove-sig api-machinery
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds configuration options for concurrent workers in the DisruptionController to allow scaling of PDB processing and stale pod cleanup. This enables better performance tuning for clusters with many PDBs or pods requiring disruption processing.
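Below is a hypothetical sketch of the new configuration fields this description refers to; the exact type name, package, and flag wiring are not shown in this PR description:

```go
// Hypothetical sketch of the configuration surface; names other than the two
// fields (which appear in the diff hunks above) are illustrative.
type DisruptionControllerConfiguration struct {
	// ConcurrentDisruptionSyncs is the number of workers that reconcile
	// PodDisruptionBudgets concurrently.
	ConcurrentDisruptionSyncs int32
	// ConcurrentDisruptionStalePodSyncs is the number of workers that clear
	// stale DisruptionTarget conditions from pods concurrently.
	ConcurrentDisruptionStalePodSyncs int32
}
```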
Which issue(s) this PR fixes:
Fixes #82930
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: