Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

@emerbe
Copy link
Contributor

@emerbe emerbe commented Oct 23, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR modifies DRA test logic a bit so it's parametrized to simplify running test with different drivers and with different scale.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @emerbe. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: emerbe
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2025
@emerbe
Copy link
Contributor Author

emerbe commented Oct 23, 2025

/assign @alaypatel07

@emerbe emerbe force-pushed the update-dra-test-config branch 3 times, most recently from 65b6445 to 93485a2 Compare October 24, 2025 17:40
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the update-dra-test-config branch from 93485a2 to e231a1a Compare October 30, 2025 15:53
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the update-dra-test-config branch 3 times, most recently from 7128f8c to e7621d2 Compare October 30, 2025 16:02
clusterloader2/testing/dra/config.yaml Outdated Show resolved Hide resolved
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "10m"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value does not have any impact except from printing the threshold value. Why is the change needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those thresholds define the acceptable limits for the pod startup times.

The initial values in this PT were too stretched for 100 nodes but in bigger scale I observed that those need to be higher. Even for 100 nodes in GKE 5 seconds is not enough:

E1102 21:24:30.207908 25461 clusterloader.go:258] Errors: [measurement call PodStartupLatency - FastFillPodStartupLatency error: pod startup: too high latency 99th percentile: got 8.792346318s expected: 5s]

I've been using this test with different than drivers and observed that times needed for different measurements may vary.

# Node resource configuration
{{$gpusPerNode := DefaultParam .CL2_GPUS_PER_NODE 8}}
{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$workerNodeCount := MultiplyInt $resourceSlicesPerNode .Nodes}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be named $totalResourceSliceCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

# dra
{{$draNamespace := DefaultParam .CL2_DRA_NAMESPACE "dra-example-driver"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the point of exposing these as env variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it and made namepsace name in dra.go generic

@emerbe emerbe force-pushed the update-dra-test-config branch from e7621d2 to d405c3e Compare November 2, 2025 21:46
Copy link
Contributor Author

@emerbe emerbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alaypatel07 answered your comments. PTAL

{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

# dra
{{$draNamespace := DefaultParam .CL2_DRA_NAMESPACE "dra-example-driver"}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it and made namepsace name in dra.go generic

# Node resource configuration
{{$gpusPerNode := DefaultParam .CL2_GPUS_PER_NODE 8}}
{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$workerNodeCount := MultiplyInt $resourceSlicesPerNode .Nodes}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

clusterloader2/testing/dra/config.yaml Outdated Show resolved Hide resolved
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "10m"}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those thresholds define the acceptable limits for the pod startup times.

The initial values in this PT were too stretched for 100 nodes but in bigger scale I observed that those need to be higher. Even for 100 nodes in GKE 5 seconds is not enough:

E1102 21:24:30.207908 25461 clusterloader.go:258] Errors: [measurement call PodStartupLatency - FastFillPodStartupLatency error: pod startup: too high latency 99th percentile: got 8.792346318s expected: 5s]

I've been using this test with different than drivers and observed that times needed for different measurements may vary.

}

func isResourceSlicesPublished(config *dependency.Config, namespace string) (bool, error) {
// Get a list of all nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why changes in this file are needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that I haven't deleted commented code in the previous PR :/

clusterloader2/pkg/dependency/dra/dra.go Outdated Show resolved Hide resolved
@emerbe emerbe force-pushed the update-dra-test-config branch from d405c3e to cba8b3d Compare November 5, 2025 12:53
{{$CHURN_POD_STARTUP_PERC50_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC50_THRESHOLD "5s"}}
{{$CHURN_POD_STARTUP_PERC90_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC90_THRESHOLD "5s"}}
{{$CHURN_POD_STARTUP_PERC99_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC99_THRESHOLD "5s"}}
{{$FINISHED_JOBS_THRESHOLD := DefaultParam .CL2_FINISHED_JOBS_THRESHOLD "10m"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CL2_LONG_LIVED_JOBS_WAIT_THRESHOLD and CL2_CHURN_JOBS_WAIT_THRESHOLD will good env variable names for this.

Also other places where we refer to these jobs in variable names can be changed for example:

{{$smallJobCompletions := DefaultParam .CL2_SMALL_JOB_COMPLETIONS 10}}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

clusterloader2/testing/dra/config.yaml Outdated Show resolved Hide resolved
@emerbe emerbe force-pushed the update-dra-test-config branch from cba8b3d to b8380d8 Compare November 6, 2025 09:11
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "5s"}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unclear about the meaning of this when we have UPSIZE_PERC50_THRESHOLD and UPSIZE_PERC90_THRESHOLD

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's absolute threshold.

I've changed name to: LONG_LIVED_JOBS_STARTUP_PERC100_THRESHOLD

{{$UPSIZE_PERC90_THRESHOLD := DefaultParam .CL2_UPSIZE_PERC90_THRESHOLD "5s"}}
{{$CHURN_POD_STARTUP_PERC50_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC50_THRESHOLD "5s"}}
{{$CHURN_POD_STARTUP_PERC90_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC90_THRESHOLD "5s"}}
{{$CHURN_POD_STARTUP_PERC99_THRESHOLD := DefaultParam .CL2_CHURN_POD_STARTUP_PERC99_THRESHOLD "5s"}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make a lot of sense when we use the same default value for P50, P90 and P99.

Copy link
Contributor Author

@emerbe emerbe Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are defaults for 100 nodes that can be changed for larger scale.

@emerbe emerbe force-pushed the update-dra-test-config branch 3 times, most recently from 194c6c7 to c2d5837 Compare November 6, 2025 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.