Add concurrent worker configuration for DisruptionController #131386


Open

wants to merge 1 commit into base: master
Conversation

@xigang xigang (Member) commented Apr 20, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds configuration options for concurrent workers in the DisruptionController to allow scaling of PDB processing and stale pod cleanup. This enables better performance tuning for clusters with many PDBs or pods requiring disruption processing.
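
For illustration, a minimal sketch of the two new settings (the field names ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs come from the diff discussed below; the surrounding type name and comments are assumptions, not the merged code):

	// Hypothetical configuration type for the DisruptionController.
	type DisruptionControllerConfiguration struct {
		// ConcurrentDisruptionSyncs is the number of PodDisruptionBudget
		// objects that may be synced concurrently.
		ConcurrentDisruptionSyncs int32
		// ConcurrentDisruptionStalePodSyncs is the number of workers that
		// concurrently clear stale DisruptionTarget conditions from pods.
		ConcurrentDisruptionStalePodSyncs int32
	}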

Which issue(s) this PR fixes:

Fixes #82930

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add flags to configure the number of concurrent workers in the DisruptionController.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 20, 2025
@k8s-ci-robot (Contributor)

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sun Apr 20 01:33:00 UTC 2025.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2025
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

Hi @xigang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 20, 2025
@k8s-ci-robot k8s-ci-robot requested review from atiratree and dims April 20, 2025 07:02
@k8s-ci-robot k8s-ci-robot added area/code-generation kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 20, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 20, 2025
@xigang xigang (Member, Author) commented Apr 20, 2025

/sig apps

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@siyuanfoundation (Contributor)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 22, 2025
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 23, 2025
@xigang xigang (Member, Author) commented Apr 23, 2025

/cc @liggitt @wojtek-t @deads2k @alculquicondor

Could you help review this? Thanks!

@pacoxu pacoxu (Member) commented Apr 28, 2025

cc @ricky1993
/ok-to-test

As this is a feature, a release note may be needed.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 28, 2025
@xigang xigang (Member, Author) commented Apr 28, 2025

Release note added.

@siyuanfoundation (Contributor)

@xigang Thanks for the PR. For a new feature like this, you should probably follow the KEP process to provide more context and discuss the alternatives.

  1. discuss this in the sig-apps meeting
  2. raise a KEP PR

@xigang xigang (Member, Author) commented Apr 28, 2025

@siyuanfoundation Thanks for the reply! I'll submit a KEP later :)

@yongruilin (Contributor)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 1, 2025
@atiratree atiratree (Member) left a comment

@xigang I would suggest discussing this first before writing a KEP.

Couple of questions:

  • How often do we observe this, and under what conditions / cluster load?
  • Do we have some metrics to support that?

Comment on lines 25 to 31

	if obj.ConcurrentDisruptionSyncs == 0 {
		obj.ConcurrentDisruptionSyncs = 5
	}
	if obj.ConcurrentDisruptionStalePodSyncs == 0 {
		obj.ConcurrentDisruptionStalePodSyncs = 5
	}
}
Member

Why did we choose 5 as a default here? Is it a good idea to change the defaults and spin up 10 goroutines for current users?

@xigang xigang (Member, Author) May 6, 2025

Currently, it's unclear what the default value should be set to. For now, we're referring to the Deployment controller's default value of 5.

obj.ConcurrentDeploymentSyncs = 5

Member

The default value should be the same as what is used today, which IIUC is 1.

@xigang xigang (Member, Author) May 19, 2025

@aojea Agreed. The ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs parameters have been set to their original default value of 1.

Done.
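
A minimal sketch of what that defaulting plausibly looks like after the change (the field names and the value 1 come from this thread; the function and type names are assumptions):

	// Hypothetical defaulting function; only sets values left unspecified.
	func SetDefaults_DisruptionControllerConfiguration(obj *DisruptionControllerConfiguration) {
		if obj.ConcurrentDisruptionSyncs == 0 {
			obj.ConcurrentDisruptionSyncs = 1
		}
		if obj.ConcurrentDisruptionStalePodSyncs == 0 {
			obj.ConcurrentDisruptionStalePodSyncs = 1
		}
	}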

	go wait.Until(dc.recheckWorker, time.Second, ctx.Done())
	go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)

	for i := 0; i < stalePodWorkers; i++ {
Member

Is it useful to increase the number of workers for both the PDB processing and the stale-condition cleanup?

Contributor

+1, can you break down the perf implications of each option? @xigang

@xigang xigang (Member, Author) May 12, 2025

I don't have specific performance metrics yet, but I was concerned that a single worker might limit throughput, so I increased the number of concurrent workers.

issue: #82930

Member

In addition, have we checked that the reconcile loop can be parallelized?

@xigang xigang (Member, Author) May 19, 2025

@aojea Yes, the stalePodDisruptionWorker reconcile loop is safe for parallelization. It uses thread-safe workqueues and processes pod keys independently. Each worker safely updates the Pod's DisruptionTarget condition to False when it becomes stale (after stalePodDisruptionTimeout).
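
A small, self-contained sketch (not the controller code) of why several workers can safely share one workqueue: Get and Done are synchronized, and the queue never hands the same key to two workers at once, so each pod key is reconciled by at most one goroutine at a time. All identifiers below are illustrative only.

	package main

	import (
		"fmt"
		"sync"

		"k8s.io/client-go/util/workqueue"
	)

	func main() {
		queue := workqueue.New()
		for i := 0; i < 10; i++ {
			queue.Add(fmt.Sprintf("default/pod-%d", i))
		}
		// Stop accepting new items; workers drain the queue and then exit.
		queue.ShutDown()

		const workers = 4 // analogous to ConcurrentDisruptionStalePodSyncs
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(id int) {
				defer wg.Done()
				for {
					key, shutdown := queue.Get()
					if shutdown {
						return
					}
					// In the real controller this is where a stale
					// DisruptionTarget condition would be set to False.
					fmt.Printf("worker %d handled %v\n", id, key)
					queue.Done(key)
				}
			}(w)
		}
		wg.Wait()
	}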

@xigang xigang (Member, Author) commented May 5, 2025

@atiratree I currently don't have any load metrics related to PDBs. I noticed that @ricky1993 in the community encountered a performance issue (#82930) due to single-threaded processing of PDBs, so a PR was submitted to add multi-worker support to the Disruption Controller to address this issue.

In large-scale cluster scenarios (like a single cluster with 150,000 Pods), the Disruption Controller can run into performance issues during reconciliation because it relies on a ListWatch of Pods to evaluate PDBs.

I think it makes sense for the Disruption Controller to expose a configurable parameter for the number of workers. What do you think?

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xigang, yashpawar6849
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree (Member)

I am afraid it is insufficient to merge user-facing code without analyzing the impact. This is especially true because the disruption controller workers can result in an increased number of API requests.

@xigang xigang (Member, Author) commented May 15, 2025

@ricky1993 Could you share the performance issue data for the PDB? Looking forward to your response.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 19, 2025
Signed-off-by: xigang <wangxigang2014@gmail.com>
@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-unit

@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-kind-ipv6

@xigang xigang (Member, Author) commented May 19, 2025

/test pull-kubernetes-e2e-kind-ipv6

@xigang xigang (Member, Author) commented May 19, 2025

@atiratree We can expose the ConcurrentDisruptionSyncs and ConcurrentDisruptionStalePodSyncs parameters and set their default values to 1. This approach helps avoid putting excessive pressure on the API server, while also allowing us to tune the parameters in scenarios where the PDB controller is processing slowly.
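
For illustration, assuming the two values are plumbed into the controller's Run method (only wait.UntilWithContext and the stalePodWorkers loop appear in the diff above; the other identifiers are assumptions), raising a value simply adds goroutines over the shared workqueues, and the defaults of 1 keep today's single-worker behavior:

	// Sketch only: one goroutine per configured worker.
	for i := 0; i < pdbWorkers; i++ {
		go wait.UntilWithContext(ctx, dc.worker, time.Second)
	}
	for i := 0; i < stalePodWorkers; i++ {
		go wait.UntilWithContext(ctx, dc.stalePodDisruptionWorker, time.Second)
	}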

@Jefftree (Member)

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 20, 2025
Labels
area/code-generation
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/api-change - Categorizes issue or PR as related to adding, removing, or otherwise changing an API
kind/feature - Categorizes issue or PR as related to a new feature.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.
