Implement DRA Device Binding Conditions (KEP-5007) #130160
Conversation
Hi @KobayashiD27. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We are pleased to share the initial implementation of the KEP-5007 DRA device binding conditions. While this PR aligns with the approach outlined in the KEP, we recognize that there may be areas for improvement. We invite the community to review the implementation and provide feedback and insights to help refine and enhance this feature. @pohly @johnbelamaric
Some quick comments. I have looked at the API and allocator, but not the scheduler plugin.
The API lacks validation, but that's of course okay when the main goal for now is to try out the functionality.
@pohly
Early feedback on these sections would be very helpful. Additionally, regarding the comment about the lack of API validation, are you referring to |
/milestone v1.33
/assign @mortent
For a first pass. You are probably not familiar with the scheduler plugin, though.
@KobayashiD27: at some point I had implemented integration tests for this feature. Did you copy those into your branch? If not, then please do.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: johnbelamaric, KobayashiD27
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
@mortent
Hi @pohly, thank you for your comment. I’ve reviewed your suggestion and confirmed that the integration tests for this feature were not initially included in my branch. I’ve now copied them over and verified that they are working as expected. Originally, the integration tests were failing due to changes introduced by the new API. I’ve also addressed those issues and updated the tests so that they now pass successfully.
podTemplatePath: templates/pod-with-claim-template.yaml
countParam: $measurePods
steadyState: true
collectMetrics: true
collectMetrics is blocking, so this step will create pods and wait until they are scheduled before going to the next op (updateDeviceConditions).
https://github.com/kubernetes/kubernetes/blob/master/test/integration/scheduler_perf/scheduler_perf.go#L2325
Based on the comments, it appears that we want every workload to include at least one CreatePods operation where collectMetrics is set to true.
But what is this test case expected to do if all pods are already scheduled at this point and updateDeviceConditions is meaningless for them?
Btw, I see this comment is a bit outdated, as startCollectingMetrics and stopCollectingMetrics can be used for that as well.
I realize that I misunderstood your comment. If steadyState is true, I believe it will not cause any blocking.
Therefore, updateDeviceConditions is working fine.
Which approach would be appropriate in this case?
Steady state is blocking, see other usages.
Let's try to run these new tests without the last two steps - updateDeviceConditions and checkPodScheduled. You will see that the test is still successful.
OK, I see. Instead of using collectMetrics, I'll change it to use startCollectingMetrics and stopCollectingMetrics.
But what's the purpose of this test if the pods can just be scheduled without a device conditions update? Shouldn't the pods stay unschedulable until the conditions get updated?
In the previous test case, steadyState was set to true due to my misunderstanding, so there were 0 pods when checkPodScheduled was called.
Now, with skipWaitToCompletion: true in the createPods op, the expected check should work.
If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.
If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.

But what about the other test? Why is it still successful even if there is no update? Should it work like that? I'd expect the scheduling to be blocked (on PreBind) until the device conditions are added.
// deviceRequestAllocationResult returns a DeviceRequestAllocationResult object for testing purposes,
// specifying the driver, pool, device, usage restriction, binding conditions,
// binding failure conditions, and binding timeout.
func deviceRequestAllocationResult(request, driver, pool, device string, bindingConditions, bindingFailureConditions []string, bindingTimeout *int64) resourceapi.DeviceRequestAllocationResult {
We already have a function called deviceAllocationResult that is very similar; it just supports adminAccess rather than binding conditions. Can we combine them, or at the very least create a naming convention that makes it a bit less confusing?
Since deviceAllocationResult is already heavily used, I think it would be better to rename this function. What do you think about the name change to deviceRequestAllocationResultWithBindingConditions?
I have changed the name of the function.
needBindingSlice.Spec.Devices = nil
nodeSlice := slice.DeepCopy()
nodeSlice.Spec.Devices = nil
for _, device := range slice.Spec.Devices {
This splits devices from a single ResourceSlice into two separate slices with the same name. I have two concerns about this:
- I don't think this will work with the partitionable devices feature, since it relies on managing the consumption of counters across devices within a ResourceSlice. This duplicates the counters, which will almost certainly lead to oversubscription of the underlying devices.
- We end up with two ResourceSlice objects with the same name, which means that lookup of a specific ResourceSlice by name will not work. Have we checked whether there is any logic today that relies on this? In any case, it seems like it would be surprising behavior for anyone adding features in the future.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
As far as I understand, this operation does not split a single ResourceSlice into two. They are then combined again into a single pool by the addSlice() operation. In this sequence, we are just switching the order.
If my understanding is incorrect, I would appreciate it if you could let me know.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
I'll check if it's necessary in this case.
After reviewing the code again, it looks like two slices with the same name could appear in the pool. So I'll update it to just swap their order and add the slice to the pool without splitting it.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
I'll check if it's necessary in this case.
I think this should be added to this case too. I will add it.
I fixed the issue with the slices being split and added handling for the next case too.
@@ -3325,6 +3356,40 @@ func TestAllocator(t *testing.T) {
	deviceAllocationResult(req0, driverA, pool1, device1, false),
)},
},
"device-binding-conditions": {
We should add tests here to make sure this feature works with the partitionable devices feature, and probably some of the other features like device taints and prioritized list too, unless we know they are orthogonal.
I'll try and see if I can add a suitable test.
cc @pohly @johnbelamaric
If you know which other features are orthogonal to this one, could you please let me know?
I added some test cases.
It's the other ResourceSlice ones I would be more concerned about than the ResourceClaim based ones. So, partitionable for one, and eventually consumable capacity - but that's not merged yet either so we can deal with it later.
…t to address review comment
Instead of using `collectMetrics`, use `startCollectingMetrics` and `stopCollectingMetrics`
Add test cases for multiple feature gates
Update allocator test code and fix some confusing function names
Fix GatherPool() logic
Update validation code to use validateSlice
@@ -896,6 +938,66 @@ func TestStatusStrategyUpdate(t *testing.T) {
	}
},
},
"keep-fields-binding-conditions": {
I think we also need a test here to verify that the fields are kept if they already exist, even if the feature is disabled.
@@ -410,12 +446,79 @@ func TestResourceSliceStrategyUpdate(t *testing.T) {
	return obj
}(),
},
"drop-fields-binding-conditions": {
Same as above, I think we also need a test here to verify that we don't drop the fields if they are already in use even if the feature is disabled.
// If deviceBindingConditions are enabled, we need to populate the AllocatedDeviceStatus.
if a.features.DeviceBinding {
	for _, device := range internal.slice.Spec.Devices {
		if device.Name == internal.id.Device && len(device.Basic.BindingConditions) > 0 {
Why are we only checking the length of device.Basic.BindingConditions here and not device.Basic.BindingFailureConditions?
I do not believe that it is desirable for a device that only contains BindingFailureConditions to cause a wait in PreBind.
I agree. Even better, such a device should be considered invalid (= rejected during validation) because specifying BindingFailureConditions without at least one BindingCondition does not make sense.
That makes sense. Agree with @pohly that handling this in validation would be useful. Otherwise the BindingFailureConditions will just get dropped here if there aren't any BindingConditions.
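For illustration, a minimal sketch of the validation rule discussed here, assuming a standard field.ErrorList-style validator; the helper name and field path are illustrative, not the PR's actual code:

```go
import "k8s.io/apimachinery/pkg/util/validation/field"

// validateBindingConditions (hypothetical helper) rejects devices that set
// binding failure conditions without any binding conditions.
func validateBindingConditions(bindingConditions, bindingFailureConditions []string, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if len(bindingFailureConditions) > 0 && len(bindingConditions) == 0 {
		allErrs = append(allErrs, field.Required(fldPath.Child("bindingConditions"),
			"must be set when bindingFailureConditions is set"))
	}
	return allErrs
}
```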
}
}
if len(needBindingSlice.Spec.Devices) > 0 {
	// If there are devices that need binding, we create a separate slice
This comment needs to be updated I think.
// If there are devices that need binding, we create a separate slice
// for them, so that they can be handled separately.
nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, needBindingSlice.Spec.Devices...)
needBindingSlices = append(needBindingSlices, nodeSlice)
The naming here is a bit confusing since we don't add the needBindingSlice to the needBindingSlices list. Also, do we need the needBindingSlice at all? Can't we just use a list of devices?
@@ -88,6 +109,26 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node
	device.String(), slice.Name, err)
}
if match {
	if hasBindingConditions(slice) {
Is there a reasonable way to avoid the duplication here?
I will try extracting the sorting process as a function.
nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, device)
	}
}
if len(needBindingSlice.Spec.Devices) > 0 {
Is there any way this check doesn't pass when the call to hasBindingConditions on line 112 passed?
If I understand correctly, you're saying this check is unnecessary because line 112 already covers the condition, right? If so, I agree. I'll remove it. Thanks for catching that!
Yeah, it seems like hasBindingConditions has already established that there is at least one device with binding conditions in the slice, so I'm not sure how the list of devices in needBindingSlice can be empty here.
@@ -107,13 +148,39 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node
}

if len(needBindingSlices) > 0 {
From what I understand, the logic here puts devices that use binding conditions last in the device list for each ResourceSlice, then puts the ResourceSlices with devices with binding conditions last in each pool, and finally puts those pools last for the allocator search.
What is the goal here? This makes it less likely that a device with conditions is selected, but the allocator can still end up choosing a device with conditions even though a device without them is available.
I'm a bit uncertain about changing the order of the devices in a ResourceSlice. Since the allocator does a first-fit search, the order can be important. For example, a request for a GPU with at least 1Gi of memory will select one with 40Gi available if it is listed first in the ResourceSlice, even though that might be a poor fit. Giving the ResourceSlice author control of the order makes it a bit easier to avoid poor allocation choices, although scoring is obviously the real solution here. Maybe not a big issue since this only happens for devices with binding conditions, which is also controlled by the ResourceSlice author.
Yes, you are right.
The goal of this change is to prioritize devices without BindingConditions during allocation. By adjusting the search order, we aim to reduce the likelihood of selecting devices with conditions when other suitable options are available.
Originally, the implementation only changed the order of ResourceSlices within a pool. Based on this discussion, I added logic to also reorder devices within a ResourceSlice.
However, I’ve realized that the discussion around whether device reordering is appropriate hasn’t been fully settled. If there are concerns about changing the order of devices, one alternative could be to revert that part of the logic and instead document that ResourceSlices containing devices with BindingConditions are less likely to be selected by the allocator.
@pohly
Any thoughts on sorting the devices?
What we really want is to "order all devices in the pool" according to the triplet binding conditions yes/no, name of slice, index within slice.
The "name of the slice" is just there to ensure that we don't compare device indices from different slices, which would be meaningless. The name is random, so drivers cannot rely on that to prefer some devices over others.
If a driver does not use binding conditions, then this is equivalent to the current "look at random slices, iterate over devices in them in the order chosen by the DRA driver".
If a driver uses binding conditions, then it can publish some devices with binding conditions in one slice and others without binding conditions in another and it will work as intended. The sorting in the allocator wouldn't be necessary if all devices were in the same slice, but we cannot assume that.
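As an illustration only, a rough sketch of that triplet ordering over a hypothetical flattened view of the pool (poolDevice and its fields are made up for this example, not the allocator's actual types):

```go
import "sort"

type poolDevice struct {
	hasBindingConditions bool
	sliceName            string
	indexInSlice         int
}

func sortPoolDevices(devices []poolDevice) {
	sort.Slice(devices, func(i, j int) bool {
		a, b := devices[i], devices[j]
		// Devices without binding conditions come first.
		if a.hasBindingConditions != b.hasBindingConditions {
			return !a.hasBindingConditions
		}
		// The slice name only keeps indices from different slices apart.
		if a.sliceName != b.sliceName {
			return a.sliceName < b.sliceName
		}
		// Within a slice, preserve the order chosen by the DRA driver.
		return a.indexInSlice < b.indexInSlice
	})
}
```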
Maintaining such a list is a change to how the allocator currently iterates over devices and also adds overhead. So as a less intrusive change, sorting slices based on "some device uses binding conditions" and not sorting within the slice should be sufficient.
It's just an approximation, but DRA driver authors can make it work for them by not mixing devices with and without binding conditions in the same slice, and if they need to do that, by using only a single slice with the devices that use binding conditions coming last.
I had violated that constraint in #130160 (comment) and didn't get the expected outcome. But if we clearly document this for driver authors, they can avoid such a mistake, so I am fine with the original implementation.
So my understanding is that the original implementation just ordered ResourceSlices based on whether they had devices with BindingConditions in them. Agree that it should be sufficient since users can control how devices are split across ResourceSlices (with some caveats around Partitionable Devices) and they control the order of devices within each ResourceSlice.
@@ -363,3 +371,278 @@ claims:
	tCtx.Fatalf("Could not allocate claim %d out of %d", i, len(claims))
}
}

type updateDeviceConditionsOp struct {
Can we update these to explain it in a little more detail? It would be helpful to understand how the ResourceClaims to be updated are identified and how the process is configured with the parameters defined on this op.
claimName := item.GetName()
var claim *resourceapi.ResourceClaim

err = wait.PollUntilContextTimeout(ctx, 1*time.Second, op.Duration.Duration, true, func(ctx context.Context) (bool, error) {
Do we plan to use these tests for performance? If so, how much will the result be influenced on how fast this code can update the conditions? I just want to understand what piece of code is actually being tested in that case.
I'm not intending to use these tests for performance measurement.
However, it’s important that the code updates the conditions faster than the timeout value specified by BindingTimeoutSeconds. Otherwise, the test may unintentionally fail due to the Pod not being scheduled in time. So while performance isn’t the goal here, the timing does need to meet that threshold to ensure the test behaves as expected.
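To make the timing concrete, here is a minimal sketch of the kind of condition update that has to land before BindingTimeoutSeconds expires. It assumes the v1beta1 resource API and that the allocated devices are already listed in claim.Status.Devices; the helper name, reason, and message are illustrative, not the op's actual code.

```go
import (
	resourceapi "k8s.io/api/resource/v1beta1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markBindingConditionTrue (hypothetical helper) sets the given binding
// condition to True on every allocated device status of the claim.
func markBindingConditionTrue(claim *resourceapi.ResourceClaim, condType string) {
	for i := range claim.Status.Devices {
		apimeta.SetStatusCondition(&claim.Status.Devices[i].Conditions, metav1.Condition{
			Type:    condType,
			Status:  metav1.ConditionTrue,
			Reason:  "AttachmentSucceeded", // illustrative reason
			Message: "device prepared by the external controller",
		})
	}
}
```

The test would then still need to write the updated status back through the API server before the timeout elapses.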
The reason for my question is mostly around whether it would be better to handle integration tests as part of https://github.com/kubernetes/kubernetes/tree/master/test/integration/dra rather than here.
@@ -345,3 +347,42 @@ func dropDeallocatedStatusDevices(newClaim, oldClaim *resource.ResourceClaim) {
}

// TODO: add tests after partitionable devices is merged (code conflict!)
Is this an out-of-date TODO? If so, please remove it.
This is not a TODO I added, and I don't think it's a good idea for me to delete it in this PR.
Agreed, I'll take a look.
=> #132927
@@ -278,6 +279,44 @@ var objWithAdminAccessStatus = &resource.ResourceClaim{
	},
}

var objWithDeviseBindingConditions = &resource.ResourceClaim{
s/Devise/Device/
bound, err := pl.isClaimBound(claim)
// If the claim is not bound, assume a timeout and clean it up.
if err != nil || !bound {
	unavailableClaims = append(unavailableClaims, index)
Looking below, this has been implemented (using AllocationTimestamp). If the integration tests are done too (I haven't gotten there yet), can we resolve this comment?
@@ -884,6 +959,98 @@ func (pl *DynamicResources) bindClaim(ctx context.Context, state *stateData, ind
	return claim, nil
}

// isClaimBound checks whether a given resource claim is successfully
// bound to a device.
I think we're overusing the word "binding" here. The "binding conditions" are the conditions needed to be able to proceed to the Bind() phase for the Pod. They are not about binding a device to a claim. I think instead what you want is "isClaimBindable" or "isClaimReadyForBinding" or maybe just "areBindingConditionsMet" or "claimNotBlockingBinding" or something like that.
OK, I see. I will change the function name to isClaimReadyForBinding.
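As a hedged sketch of what such a check could look like, assuming binding conditions are recorded on the allocation results and their statuses in claim.Status.Devices (the code below is illustrative, not the plugin's actual implementation):

```go
import (
	resourceapi "k8s.io/api/resource/v1beta1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
)

// isClaimReadyForBinding reports whether every binding condition of every
// allocated device is True in the claim's device statuses.
func isClaimReadyForBinding(claim *resourceapi.ResourceClaim) bool {
	if claim.Status.Allocation == nil {
		return false
	}
	for _, result := range claim.Status.Allocation.Devices.Results {
		for _, condType := range result.BindingConditions {
			if !deviceConditionTrue(claim, result, condType) {
				return false
			}
		}
	}
	return true
}

// deviceConditionTrue finds the AllocatedDeviceStatus entry for the allocated
// device and checks the named condition.
func deviceConditionTrue(claim *resourceapi.ResourceClaim, result resourceapi.DeviceRequestAllocationResult, condType string) bool {
	for _, status := range claim.Status.Devices {
		if status.Driver == result.Driver && status.Pool == result.Pool && status.Device == result.Device {
			return apimeta.IsStatusConditionTrue(status.Conditions, condType)
		}
	}
	return false
}
```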
@@ -974,6 +1043,82 @@ func TestPlugin(t *testing.T) {
	},
},
},
"bound-claim-with-successed-binding-conditions": {
Can we add a case where a single pod has two different claims that have binding conditions, and then have tests for both succeed, one fails, one fails with timeout, etc.? Just to be sure we have something covering the multiple claims cases.
//
// +optional
// +featureGate=DRADeviceBindingConditions,DRAResourceClaimDeviceStatus
UsageRestrictedToNode *bool
BindsToNode sounds best to me (from https://github.com/kubernetes/kubernetes/pull/130160/files#r2201675405). It ties the field to the corresponding feature, and the present tense isn't wrong like "bound" was.
Thank you, I see. I will change the name.
}
}

time.Sleep(1 * time.Second)
Use a ticker for that (time.NewTicker).
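A short sketch of the suggested replacement, assuming the surrounding loop re-checks pod state once per second until the context expires (the helper and its check callback are hypothetical):

```go
import (
	"context"
	"time"
)

// pollWithTicker re-runs check once per second until it succeeds or the
// context is done; deadline handling stays with the caller.
func pollWithTicker(ctx context.Context, check func() bool) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if check() {
				return
			}
		}
	}
}
```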
if ctx.Err() == context.DeadlineExceeded {
	if op.ExpectedScheduled {
		tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
	} else if lastViolation != "" {
		tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
	} else {
		// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
		tCtx.Logf("All pods remained unscheduled for the duration")
		return
	}
} else {
	tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
You could flatten the conditions:
if ctx.Err() != context.DeadlineExceeded {
	tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if op.ExpectedScheduled {
	tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if lastViolation != "" {
	tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
}
// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
tCtx.Logf("All pods remained unscheduled for the duration")
return
currentViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
	pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)
lastViolation = currentViolation
Do it directly:
lastViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
	pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)
if stableStart == nil {
	now := time.Now()
	stableStart = &now
	tCtx.Logf("All pods are unscheduled, starting stability check")
It would be good to document somewhere how the stable start works here
for _, device := range slice.Spec.Devices {
	if device.Basic != nil && device.Basic.BindingConditions != nil {
		hasBindingFlag = true
		break
This break works only for the inner loop. You could consider using a function where you would just return.
Or you could use the hasBindingConditions method here as well.
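For reference, a sketch of what that helper presumably looks like (assuming the BasicDevice shape used elsewhere in this PR), which also sidesteps the inner-loop break issue:

```go
import resourceapi "k8s.io/api/resource/v1beta1"

// hasBindingConditions reports whether any device in the slice publishes
// binding conditions.
func hasBindingConditions(slice *resourceapi.ResourceSlice) bool {
	for _, device := range slice.Spec.Devices {
		if device.Basic != nil && len(device.Basic.BindingConditions) > 0 {
			return true
		}
	}
	return false
}
```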
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements the KEP-5007 DRA device binding conditions. This feature ensures that the scheduler waits in the PreBind phase until any DRA devices that need preparation are ready.
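For illustration, a hedged sketch of what a device published with binding conditions might look like; the condition names are made up, and the exact field shape may differ from what finally merges:

```go
import resourceapi "k8s.io/api/resource/v1beta1"

// exampleDevice shows a device that publishes binding conditions.
var exampleDevice = resourceapi.Device{
	Name: "gpu-0",
	Basic: &resourceapi.BasicDevice{
		// The scheduler waits in PreBind until all of these conditions are
		// True on the allocated device's status...
		BindingConditions: []string{"example.com/AttachmentSuccess"},
		// ...and stops waiting (and recovers) if any of these becomes True.
		BindingFailureConditions: []string{"example.com/AttachmentFailure"},
	},
}
```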
Which issue(s) this PR fixes:
Related to kubernetes/enhancements#5007
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: