Add concurrent worker configuration for DisruptionController #131386


Open

wants to merge 1 commit into base: master
Conversation

@xigang xigang (Member) commented Apr 20, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds configuration options for concurrent workers in the DisruptionController to allow scaling of PDB processing and stale pod cleanup. This enables better performance tuning for clusters with many PDBs or pods requiring disruption processing.
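
For illustration, a minimal sketch of the two new settings (the field names ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs come from the diff discussed below; the surrounding type name and comments are assumptions, not the merged code):

	// Hypothetical configuration type for the DisruptionController.
	type DisruptionControllerConfiguration struct {
		// ConcurrentDisruptionSyncs is the number of PodDisruptionBudget
		// objects that may be synced concurrently.
		ConcurrentDisruptionSyncs int32
		// ConcurrentDisruptionStalePodSyncs is the number of workers that
		// concurrently clear stale DisruptionTarget conditions from pods.
		ConcurrentDisruptionStalePodSyncs int32
	}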

Which issue(s) this PR fixes:

Fixes #82930

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add flags to configure the number of concurrent workers in the DisruptionController.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 20, 2025
@k8s-ci-robot (Contributor)

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sun Apr 20 01:33:00 UTC 2025.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2025
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

Hi @xigang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 20, 2025
@k8s-ci-robot k8s-ci-robot requested review from atiratree and dims April 20, 2025 07:02
@k8s-ci-robot k8s-ci-robot added area/code-generation kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 20, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 20, 2025
@xigang xigang (Member, Author) commented Apr 20, 2025

/sig apps

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@siyuanfoundation (Contributor)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 22, 2025
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 23, 2025
@xigang xigang (Member, Author) commented Apr 23, 2025

/cc @liggitt @wojtek-t @deads2k @alculquicondor

Could you help review this? Thanks!

@pacoxu pacoxu (Member) commented Apr 28, 2025

cc @ricky1993
/ok-to-test

As this is a feature, a release note may be needed.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 28, 2025
@xigang xigang (Member, Author) commented Apr 28, 2025

Release note added.

@siyuanfoundation (Contributor)

@xigang Thanks for the PR. For a new feature like this, you should probably follow the KEP process to provide more context and discuss the alternatives.

  1. discuss this in the sig-apps meeting
  2. raise a KEP PR

@xigang xigang (Member, Author) commented Apr 28, 2025

@siyuanfoundation Thanks for the reply! I'll submit a KEP later :)

@yongruilin (Contributor)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 1, 2025
@atiratree atiratree (Member) left a comment

@xigang I would suggest discussing this first before writing a KEP.

Couple of questions:

  • How often do we observe this, and under what conditions / cluster load?
  • Do we have some metrics to support that?

Comment on lines 25 to 31

	if obj.ConcurrentDisruptionSyncs == 0 {
		obj.ConcurrentDisruptionSyncs = 5
	}
	if obj.ConcurrentDisruptionStalePodSyncs == 0 {
		obj.ConcurrentDisruptionStalePodSyncs = 5
	}
}
Member

Why did we choose 5 as a default here? Is it a good idea to change the defaults and spin up 10 goroutines for current users?

@xigang xigang (Member, Author) May 6, 2025

Currently, it's unclear what the default value should be set to. For now, we're referring to the Deployment controller's default value of 5.

obj.ConcurrentDeploymentSyncs = 5

Member

The default value should be the same as what is used today, which IIUC is 1.

@xigang xigang (Member, Author) May 19, 2025

@aojea Agreed. The ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs parameters have been set to their original default value of 1.

Done.
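
A minimal sketch of what that defaulting plausibly looks like after the change (the field names and the value 1 come from this thread; the function and type names are assumptions):

	// Hypothetical defaulting function; only sets values left unspecified.
	func SetDefaults_DisruptionControllerConfiguration(obj *DisruptionControllerConfiguration) {
		if obj.ConcurrentDisruptionSyncs == 0 {
			obj.ConcurrentDisruptionSyncs = 1
		}
		if obj.ConcurrentDisruptionStalePodSyncs == 0 {
			obj.ConcurrentDisruptionStalePodSyncs = 1
		}
	}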

	go wait.Until(dc.recheckWorker, time.Second, ctx.Done())
	go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)

	for i := 0; i < stalePodWorkers; i++ {
Member

Is it useful to increase the number of workers for both the PDB processing and the stale-condition cleanup?

Contributor

+1, can you break down the perf implications of each option? @xigang

@xigang xigang (Member, Author) May 12, 2025

I don't have specific performance metrics yet, but I was concerned that a single worker might limit throughput, so I increased the number of concurrent workers.

issue: #82930

Member

In addition, have we checked that the reconcile loop can be parallelized?

@xigang xigang (Member, Author) May 19, 2025

@aojea Yes, the stalePodDisruptionWorker reconcile loop is safe for parallelization. It uses thread-safe workqueues and processes pod keys independently. Each worker safely updates the Pod's DisruptionTarget condition to False when it becomes stale (after stalePodDisruptionTimeout).
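
A small, self-contained sketch (not the controller code) of why several workers can safely share one workqueue: Get and Done are synchronized, and the queue never hands the same key to two workers at once, so each pod key is reconciled by at most one goroutine at a time. All identifiers below are illustrative only.

	package main

	import (
		"fmt"
		"sync"

		"k8s.io/client-go/util/workqueue"
	)

	func main() {
		queue := workqueue.New()
		for i := 0; i < 10; i++ {
			queue.Add(fmt.Sprintf("default/pod-%d", i))
		}
		// Stop accepting new items; workers drain the queue and then exit.
		queue.ShutDown()

		const workers = 4 // analogous to ConcurrentDisruptionStalePodSyncs
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(id int) {
				defer wg.Done()
				for {
					key, shutdown := queue.Get()
					if shutdown {
						return
					}
					// In the real controller this is where a stale
					// DisruptionTarget condition would be set to False.
					fmt.Printf("worker %d handled %v\n", id, key)
					queue.Done(key)
				}
			}(w)
		}
		wg.Wait()
	}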

@xigang xigang (Member, Author) commented May 5, 2025

@atiratree I currently don't have any load metrics related to PDBs. I noticed that @ricky1993 in the community encountered a performance issue (#82930) due to single-threaded processing of PDBs, so a PR was submitted to add multi-worker support to the Disruption Controller to address this issue.

In large-scale cluster scenarios (like a single cluster with 150,000 Pods), the Disruption Controller can run into performance issues during reconciliation because it relies on a ListWatch of Pods to evaluate PDBs.

I think it makes sense for the Disruption Controller to expose a configurable parameter for the number of workers. What do you think?

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xigang, yashpawar6849
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree (Member)

I am afraid it is insufficient to merge user-facing code without analyzing the impact. This is especially true because the disruption controller workers can result in an increased number of API requests.

@xigang xigang (Member, Author) commented May 15, 2025

@ricky1993 Could you share the performance issue data for the PDB? Looking forward to your response.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 19, 2025
Signed-off-by: xigang <wangxigang2014@gmail.com>
@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-unit

@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-kind-ipv6

@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-e2e-kind-ipv6

@xigang xigang (Member, Author) commented May 19, 2025

@atiratree We can expose the ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs parameters and set their default values to 1. This approach helps avoid putting excessive pressure on the API server, while also allowing us to tune the parameters in scenarios where the PDB controller is processing slowly.
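
For illustration, assuming the two values are plumbed into the controller's Run method (only wait.UntilWithContext and the stalePodWorkers loop appear in the diff above; the other identifiers are assumptions), raising a value simply adds goroutines over the shared workqueues, and the defaults of 1 keep today's single-worker behavior:

	// Sketch only: one goroutine per configured worker.
	for i := 0; i < pdbWorkers; i++ {
		go wait.UntilWithContext(ctx, dc.worker, time.Second)
	}
	for i := 0; i < stalePodWorkers; i++ {
		go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)
	}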

@Jefftree (Member)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 20, 2025
Labels
area/code-generation
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/api-change - Categorizes issue or PR as related to adding, removing, or otherwise changing an API
kind/feature - Categorizes issue or PR as related to a new feature.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.
