Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Implement DRA Device Binding Conditions (KEP-5007) #130160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 44 commits into
base: master
Choose a base branch
Loading
from

Conversation

KobayashiD27
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements the KEP-5007 DRA device binding conditions. This feature ensures that the scheduler waits in the PreBind phase until any DRA devices that need preparation are ready.

Which issue(s) this PR fixes:

Related to kubernetes/enhancements#5007

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add new optional APIs in ResouceSlice.Basic and ResourceClaim.Status.AllocatedDeviceStatus.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/5007

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @KobayashiD27. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2025
@KobayashiD27
Copy link
Contributor Author

We are pleased to share the initial implementation of the KEP-5007 DRA device binding conditions. While this PR aligns with the outlined in the KEP, we recognize that there may be areas for improvement. We invite the community to review the implementation and provide feedback and insights to help refine and enhance this feature.

@pohly @johnbelamaric
First, from the perspective of DRA, I would like to confirm that the overall approach aligns with the intended direction. Your insight would be invaluable in ensuring that we are on the right track.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments. I have looked at the API and allocator, but not the scheduler plugin.

The API lacks validation, but that's of course okay when the main goal for now is to try out the functionality.

const (
// IsPrepared indicates the device ready state.
// If NeedToPreparing is True and IsPrepared is True, the scheduler proceeds to Bind.
IsPrepared = "dra.example.com/is-prepared"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just an example in the KEP. It doesn't belong into the upstream API. Same for PreparingFailed.

})
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the device status only when needed by a device.

You also have to add feature gate checking: I don't remember whether it was spelled out explicitly in the KEP (if not, please add in a follow-up), but what would make sense to me is to ignore devices which have binding conditions when the feature is turned off. In other words, don't select them because the code which waits during binding wouldn't be active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add allocatedDeviceStatus only when BindingConditions exist.
and I pass the featureGate status of BindingConditions to the allocator and check it.

@KobayashiD27
Copy link
Contributor Author

@pohly
Thank you for the initial feedback. To ensure we are on the right track, I would like to highlight a few key areas where community consensus is crucial. It would be helpful if you could prioritize reviewing the following aspects:

  1. Transferring AllocatedDeviceStatus from the allocator to the scheduler plugin
  2. Clearing the allocation state in the UnReserve phase

Early feedback on these sections would be very helpful.

Additionally, regarding the comment about the lack of API validation, are you referring to pkg/apis/resources/validation/validation.go?

@pohly
Copy link
Contributor

pohly commented Feb 19, 2025

/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 19, 2025
pkg/apis/resource/types.go Outdated Show resolved Hide resolved
pkg/apis/resource/types.go Outdated Show resolved Hide resolved
pkg/apis/resource/types.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/memorymanager/memory_manager_test.go Outdated Show resolved Hide resolved
staging/src/k8s.io/api/resource/v1alpha3/types.go Outdated Show resolved Hide resolved
staging/src/k8s.io/api/resource/v1alpha3/types.go Outdated Show resolved Hide resolved
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated Show resolved Hide resolved
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated Show resolved Hide resolved
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated Show resolved Hide resolved
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 1c36f41 to f07e273 Compare February 25, 2025 00:18
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 57f494a to 8e56894 Compare May 9, 2025 02:21
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2025
@KobayashiD27
Copy link
Contributor Author

@pohly @johnbelamaric (for DRA)
@dom4ha @macsko (for Scheduler)
@thockin (for API)

Could you please review the full implementation from each of your respective perspectives?
It looks like the release cycle for v1.34 hasn’t been announced yet, but I’d like to move this forward as much as possible in the meantime.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API review... haven't looked at the implementation yet.

pkg/apis/resource/types.go Outdated Show resolved Hide resolved
pkg/apis/resource/types.go Outdated Show resolved Hide resolved
// feature gate.
//
// +optional
// +featureGate=DRADeviceBindingConditions,DRAResourceClaimDeviceStatus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could imagine use cases for this besides DRADeviceBindingConditions, so making it depend on that feature gate might be too strict.

But introducing a new field in this KEP with just DRAResourceClaimDeviceStatus as feature gate (not really the right one) or no feature gate at all (core DRA = structured parameters) would be odd. Let's keep it like this.

pkg/apis/resource/types.go Show resolved Hide resolved
pkg/apis/resource/validation/validation.go Outdated Show resolved Hide resolved
Device: goodName,
AdminAccess: ptr.To(false),
BindingConditions: []string{"condition1", "condition2"},
BindingFailureConditions: []string{"condition3", "condition4"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extend this to four conditions in both slices (current maximum) and include a complex label:
https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added those test cases.

pkg/features/kube_features.go Outdated Show resolved Hide resolved
@@ -1173,6 +1181,10 @@ var defaultVersionedKubernetesFeatureGates = map[featuregate.Feature]featuregate
{Version: version.MustParse("1.32"), Default: false, PreRelease: featuregate.Beta},
},

DRADeviceBindingConditions: {
{Version: version.MustParse("1.33"), Default: false, PreRelease: featuregate.Alpha},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move it before DRADeviceTaints and change it to 1.34.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I will move it over to arrange it in alphabetical order.
My understanding is that the client hasn't been updated to v1.34 yet. If that's the case, there might be issues with testing, etc. Therefore I'll wait to change it to v1.34 until we can confirm the client is updated.

}(),
bindingConditions: true,
deviceStatus: true,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of completeness: let's add bindingConditions: true, deviceStatus: false and bindingConditions: false, deviceStatus: true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added those test cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the same "drop disabled fields" in resourceclaim/strategy.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add dropDisabledFields for bindingConditions in resourceclaim/strategy.go

KobayashiD27 and others added 7 commits May 15, 2025 10:32
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scheduler_perf review: would be easier to understand with a bit more documentation.

These new test cases need a lot of new plumbing (= operations). I'm undecided whether this is the right approach for them. Let me experiment a bit with doing the same under test/integration/dra.

@@ -235,6 +238,10 @@ func resourceSlice(driverName, nodeName string, capacity int) *resourceapi.Resou
Capacity: map[resourceapi.QualifiedName]resourceapi.DeviceCapacity{
"memory": {Value: resource.MustParse("1Gi")},
},
UsageRestrictedToNode: ptr.To(true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering how this works in other tests until I realized that these fields get stripped when the feature gate is off.

Please add a comment explaining this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will add a comment here.

@@ -363,3 +371,278 @@ claims:
tCtx.Fatalf("Could not allocate claim %d out of %d", i, len(claims))
}
}

type updateDeviceConditionsOp struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation for the new operations:

  • what do they do?
  • what parameters do they have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added documentation for the new op and values.

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented May 16, 2025

@KobayashiD27: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-kind-alpha-beta-features dba0508 link false /test pull-kubernetes-e2e-kind-alpha-beta-features

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. area/code-generation area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: Assigned
Status: Needs Triage
Status: !SIG Auth
Status: Needs Triage
Archived in project
Status: 👀 In review
Status: Tracked
Development

Successfully merging this pull request may close these issues.

Morty Proxy This is a proxified and sanitized view of the page, visit original site.