Implement DRA Device Binding Conditions (KEP-5007) #130160
Conversation
Hi @KobayashiD27. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We are pleased to share the initial implementation of the KEP-5007 DRA device binding conditions. While this PR aligns with the approach outlined in the KEP, we recognize that there may be areas for improvement. We invite the community to review the implementation and provide feedback and insights to help refine and enhance this feature. @pohly @johnbelamaric
Some quick comments. I have looked at the API and allocator, but not the scheduler plugin.
The API lacks validation, but that's of course okay when the main goal for now is to try out the functionality.
@pohly
Early feedback on these sections would be very helpful. Additionally, regarding the comment about the lack of API validation, are you referring to |
/milestone v1.33
/assign @mortent
For a first pass. You are probably not familiar with the scheduler plugin, though.
@KobayashiD27: at some point I had implemented integration tests for this feature. Did you copy those into your branch? If not, then please do.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: johnbelamaric, KobayashiD27
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
@mortent
Hi @pohly, thank you for your comment. I’ve reviewed your suggestion and confirmed that the integration tests for this feature were not initially included in my branch. I’ve now copied them over and verified that they are working as expected. Originally, the integration tests were failing due to changes introduced by the new API. I’ve also addressed those issues and updated the tests so that they now pass successfully.
podTemplatePath: templates/pod-with-claim-template.yaml
countParam: $measurePods
steadyState: true
collectMetrics: true
collectMetrics is blocking, so this step will create pods and wait until they are scheduled before going to the next op (updateDeviceConditions).
https://github.com/kubernetes/kubernetes/blob/master/test/integration/scheduler_perf/scheduler_perf.go#L2325
Based on the comments, it appears that we want every workload to include at least one CreatePods operation where collectMetrics is set to true.
But what is this test case expected to do if all pods are already scheduled at this point and updateDeviceConditions is meaningless for them?
Btw, I see this comment is a bit outdated, as startCollectingMetrics and stopCollectingMetrics can be used for that as well.
I realize that I misunderstood your comment. If steadyState is true, I believe it will not cause any blocking.
Therefore, updateDeviceConditions is working fine.
Which approach would be appropriate in this case?
Steady state is blocking, see other usages.
Let's try to run these new tests without the last two steps - updateDeviceConditions and checkPodScheduled. You will see that the test is still successful.
OK, I see. Instead of using collectMetrics, I'll change it to use startCollectingMetrics and stopCollectingMetrics.
But what's the purpose of this test if the pods can just be scheduled without a device conditions update? Shouldn't the pods stay unschedulable until the conditions get updated?
In the previous test case, steadyState was set to true due to my misunderstanding, so there were 0 pods when checkPodScheduled was called.
Now, with skipWaitToCompletion: true in the createPods op, the expected check should work.
If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.
If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.

But what about the other test? Why is it still successful even if there is no update? Should it work like that? I'd expect the scheduling to be blocked (on PreBind) until the device conditions are added.
// deviceRequestAllocationResult returns a DeviceRequestAllocationResult object for testing purposes,
// specifying the driver, pool, device, usage restriction, binding conditions,
// binding failure conditions, and binding timeout.
func deviceRequestAllocationResult(request, driver, pool, device string, bindingConditions, bindingFailureConditions []string, bindingTimeout *int64) resourceapi.DeviceRequestAllocationResult {
We already have a function called deviceAllocationResult that is very similar; it just supports adminAccess rather than binding conditions. Can we combine them, or at the very least create a naming convention that makes it a bit less confusing?
Since deviceAllocationResult is already heavily used, I think it would be better to rename this function. What do you think about the name change to deviceRequestAllocationResultWithBindingConditions?
I have changed the name of the function.
needBindingSlice.Spec.Devices = nil
nodeSlice := slice.DeepCopy()
nodeSlice.Spec.Devices = nil
for _, device := range slice.Spec.Devices {
This splits devices from a single ResourceSlice into two separate slices with the same name. I have two concerns about this:
- I don't think this will work with the partitionable devices feature, since it relies on managing the consumption of counters across devices within a ResourceSlice. This duplicates the counters, which will almost certainly lead to oversubscription of the underlying devices.
- We end up with two ResourceSlice objects with the same name, which means that lookup of a specific ResourceSlice by name will not work. Have we checked whether there is any logic today that relies on this? In any case, it seems like it would be surprising behavior for anyone adding features in the future.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
As far as I understand, this operation does not split a single ResourceSlice into two. They are then combined again into a single pool by the addSlice() operation. In this sequence, we are just switching the order.
If my understanding is incorrect, I would appreciate it if you could let me know.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
I'll check if it's necessary in this case.
After reviewing the code again, it looks like two slices with the same name could appear in the pool. So I'll update it to just swap their order and add the slice to the pool without splitting it.
Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?
I'll check if it's necessary in this case.
I think this should be added to this case too. I will add it.
I fixed the issue with the slices being split and added handling for the next case too.
@@ -3325,6 +3356,40 @@ func TestAllocator(t *testing.T) {
	deviceAllocationResult(req0, driverA, pool1, device1, false),
)},
},
"device-binding-conditions": {
We should add tests here to make sure this feature works with the partitionable devices feature, and probably some of the other features like device taints and prioritized list too, unless we know they are orthogonal.
I'll try and see if I can add a suitable test.
cc @pohly @johnbelamaric
If you know which other features are orthogonal to this one, could you please let me know?
I added some test cases.
It's the other ResourceSlice ones I would be more concerned about than the ResourceClaim based ones. So, partitionable for one, and eventually consumable capacity - but that's not merged yet either so we can deal with it later.
…t to address review comment
Instead of using `collectMetrics`, use `startCollectingMetrics` and `stopCollectingMetrics`
Add test cases for multiple feature gates
Update allocator test code and fix some confusing function names
Fix GatherPool() logic
Update validation code to use validateSlice
@@ -896,6 +938,66 @@ func TestStatusStrategyUpdate(t *testing.T) {
	}
},
},
"keep-fields-binding-conditions": {
I think we also need a test here to verify that the fields are kept if they already exist, even if the feature is disabled.
@@ -410,12 +446,79 @@ func TestResourceSliceStrategyUpdate(t *testing.T) {
	return obj
}(),
},
"drop-fields-binding-conditions": {
Same as above, I think we also need a test here to verify that we don't drop the fields if they are already in use even if the feature is disabled.
// If deviceBindingConditions are enabled, we need to populate the AllocatedDeviceStatus.
if a.features.DeviceBinding {
	for _, device := range internal.slice.Spec.Devices {
		if device.Name == internal.id.Device && len(device.Basic.BindingConditions) > 0 {
Why are we only checking the length of device.Basic.BindingConditions here and not device.Basic.BindingFailureConditions?
I do not believe that it is desirable for a device that only contains BindingFailureConditions to cause a wait in PreBind.
I agree. Even better, such a device should be considered invalid (= rejected during validation) because specifying BindingFailureConditions without at least one BindingCondition does not make sense.
That makes sense. Agree with @pohly that handling this in validation would be useful. Otherwise the BindingFailureConditions will just get dropped here if there aren't any BindingConditions.
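For illustration, a minimal sketch of the validation rule discussed here, assuming a standard field.ErrorList-style validator; the helper name and field path are illustrative, not the PR's actual code:

```go
import "k8s.io/apimachinery/pkg/util/validation/field"

// validateBindingConditions (hypothetical helper) rejects devices that set
// binding failure conditions without any binding conditions.
func validateBindingConditions(bindingConditions, bindingFailureConditions []string, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if len(bindingFailureConditions) > 0 && len(bindingConditions) == 0 {
		allErrs = append(allErrs, field.Required(fldPath.Child("bindingConditions"),
			"must be set when bindingFailureConditions is set"))
	}
	return allErrs
}
```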
}
}
if len(needBindingSlice.Spec.Devices) > 0 {
	// If there are devices that need binding, we create a separate slice
This comment needs to be updated I think.
// If there are devices that need binding, we create a separate slice
// for them, so that they can be handled separately.
nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, needBindingSlice.Spec.Devices...)
needBindingSlices = append(needBindingSlices, nodeSlice)
The naming here is a bit confusing since we don't add the needBindingSlice to the needBindingSlices list. Also, do we need the needBindingSlice at all? Can't we just use a list of devices?
@@ -88,6 +109,26 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node
	device.String(), slice.Name, err)
}
if match {
	if hasBindingConditions(slice) {
Is there a reasonable way to avoid the duplication here?
I will try extracting the sorting process as a function.
nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, device)
	}
}
if len(needBindingSlice.Spec.Devices) > 0 {
Is there any way this check doesn't pass when the call to hasBindingConditions on line 112 passed?
If I understand correctly, you're saying this check is unnecessary because line 112 already covers the condition, right? If so, I agree. I'll remove it. Thanks for catching that!
Yeah, it seems like hasBindingConditions has already established that there is at least one device with binding conditions in the slice, so I'm not sure how the list of devices in needBindingSlice can be empty here.
@@ -107,13 +148,39 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node
}

if len(needBindingSlices) > 0 {
From what I understand, the logic here puts devices that use binding conditions last in the device list for each ResourceSlice, then puts the ResourceSlices with devices with binding conditions last in each pool, and finally puts those pools last for the allocator search.
What is the goal here? This makes it less likely that a device with conditions is selected, but the allocator can still end up choosing a device with conditions even though a device without them is available.
I'm a bit uncertain about changing the order of the devices in a ResourceSlice. Since the allocator does a first-fit search, the order can be important. For example, a request for a GPU with at least 1Gi of memory will select one with 40Gi available if it is listed first in the ResourceSlice, even though that might be a poor fit. Giving the ResourceSlice author control of the order makes it a bit easier to avoid poor allocation choices, although scoring is obviously the real solution here. Maybe not a big issue since this only happens for devices with binding conditions, which is also controlled by the ResourceSlice author.
Yes, you are right.
The goal of this change is to prioritize devices without BindingConditions during allocation. By adjusting the search order, we aim to reduce the likelihood of selecting devices with conditions when other suitable options are available.
Originally, the implementation only changed the order of ResourceSlices within a pool. Based on this discussion, I added logic to also reorder devices within a ResourceSlice.
However, I’ve realized that the discussion around whether device reordering is appropriate hasn’t been fully settled. If there are concerns about changing the order of devices, one alternative could be to revert that part of the logic and instead document that ResourceSlices containing devices with BindingConditions are less likely to be selected by the allocator.
@pohly
Any thoughts on sorting the devices?
What we really want is to "order all devices in the pool" according to the triplet binding conditions yes/no, name of slice, index within slice.
The "name of the slice" is just there to ensure that we don't compare device indices from different slices, which would be meaningless. The name is random, so drivers cannot rely on that to prefer some devices over others.
If a driver does not use binding conditions, then this is equivalent to the current "look at random slices, iterate over devices in them in the order chosen by the DRA driver".
If a driver uses binding conditions, then it can publish some devices with binding conditions in one slice and others without binding conditions in another and it will work as intended. The sorting in the allocator wouldn't be necessary if all devices were in the same slice, but we cannot assume that.
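As an illustration only, a rough sketch of that triplet ordering over a hypothetical flattened view of the pool (poolDevice and its fields are made up for this example, not the allocator's actual types):

```go
import "sort"

type poolDevice struct {
	hasBindingConditions bool
	sliceName            string
	indexInSlice         int
}

func sortPoolDevices(devices []poolDevice) {
	sort.Slice(devices, func(i, j int) bool {
		a, b := devices[i], devices[j]
		// Devices without binding conditions come first.
		if a.hasBindingConditions != b.hasBindingConditions {
			return !a.hasBindingConditions
		}
		// The slice name only keeps indices from different slices apart.
		if a.sliceName != b.sliceName {
			return a.sliceName < b.sliceName
		}
		// Within a slice, preserve the order chosen by the DRA driver.
		return a.indexInSlice < b.indexInSlice
	})
}
```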
Maintaining such a list is a change to how the allocator currently iterates over devices and also adds overhead. So as a less intrusive change, sorting slices based on "some device uses binding conditions" and not sorting within the slice should be sufficient.
It's just an approximation, but DRA driver authors can make it work for them by not mixing devices with and without binding conditions in the same slice, and if they need to do that, by using only a single slice with the devices that use binding conditions coming last.
I had violated that constraint in #130160 (comment) and didn't get the expected outcome. But if we clearly document this for driver authors, they can avoid such a mistake, so I am fine with the original implementation.
So my understanding is that the original implementation just ordered ResourceSlices based on whether they had devices with BindingConditions in them. Agree that it should be sufficient since users can control how devices are split across ResourceSlices (with some caveats around Partitionable Devices) and they control the order of devices within each ResourceSlice.
@@ -363,3 +371,278 @@ claims:
	tCtx.Fatalf("Could not allocate claim %d out of %d", i, len(claims))
}
}

type updateDeviceConditionsOp struct {
Can we update these to explain it in a little more detail? It would be helpful to understand how the ResourceClaims to be updated are identified and how the process is configured with the parameters defined on this op.
claimName := item.GetName()
var claim *resourceapi.ResourceClaim

err = wait.PollUntilContextTimeout(ctx, 1*time.Second, op.Duration.Duration, true, func(ctx context.Context) (bool, error) {
Do we plan to use these tests for performance? If so, how much will the result be influenced on how fast this code can update the conditions? I just want to understand what piece of code is actually being tested in that case.
I'm not intending to use these tests for performance measurement.
However, it’s important that the code updates the conditions faster than the timeout value specified by BindingTimeoutSeconds. Otherwise, the test may unintentionally fail due to the Pod not being scheduled in time. So while performance isn’t the goal here, the timing does need to meet that threshold to ensure the test behaves as expected.
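To make the timing concrete, here is a minimal sketch of the kind of condition update that has to land before BindingTimeoutSeconds expires. It assumes the v1beta1 resource API and that the allocated devices are already listed in claim.Status.Devices; the helper name, reason, and message are illustrative, not the op's actual code.

```go
import (
	resourceapi "k8s.io/api/resource/v1beta1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markBindingConditionTrue (hypothetical helper) sets the given binding
// condition to True on every allocated device status of the claim.
func markBindingConditionTrue(claim *resourceapi.ResourceClaim, condType string) {
	for i := range claim.Status.Devices {
		apimeta.SetStatusCondition(&claim.Status.Devices[i].Conditions, metav1.Condition{
			Type:    condType,
			Status:  metav1.ConditionTrue,
			Reason:  "AttachmentSucceeded", // illustrative reason
			Message: "device prepared by the external controller",
		})
	}
}
```

The test would then still need to write the updated status back through the API server before the timeout elapses.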
The reason for my question is mostly around whether it would be better to handle integration tests as part of https://github.com/kubernetes/kubernetes/tree/master/test/integration/dra rather than here.
@@ -345,3 +347,42 @@ func dropDeallocatedStatusDevices(newClaim, oldClaim *resource.ResourceClaim) {
}

// TODO: add tests after partitionable devices is merged (code conflict!)
Is this an out-of-date TODO? If so, please remove it.
This is not a TODO I added, and I don't think it's a good idea for me to delete it in this PR.
Agreed, I'll take a look.
=> #132927
@@ -278,6 +279,44 @@ var objWithAdminAccessStatus = &resource.ResourceClaim{
	},
}

var objWithDeviseBindingConditions = &resource.ResourceClaim{
s/Devise/Device/
bound, err := pl.isClaimBound(claim)
// If the claim is not bound, assume a timeout and clean it up.
if err != nil || !bound {
	unavailableClaims = append(unavailableClaims, index)
Looking below, this has been implemented (using AllocationTimestamp). If the integration tests are done too (I haven't gotten there yet), can we resolve this comment?
@@ -884,6 +959,98 @@ func (pl *DynamicResources) bindClaim(ctx context.Context, state *stateData, ind
	return claim, nil
}

// isClaimBound checks whether a given resource claim is successfully
// bound to a device.
I think we're overusing the word "binding" here. The "binding conditions" are the conditions needed to be able to proceed to the Bind() phase for the Pod. They are not about binding a device to a claim. I think instead what you want is "isClaimBindable" or "isClaimReadyForBinding" or maybe just "areBindingConditionsMet" or "claimNotBlockingBinding" or something like that.
OK, I see. I will change the function name to isClaimReadyForBinding.
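As a hedged sketch of what such a check could look like, assuming binding conditions are recorded on the allocation results and their statuses in claim.Status.Devices (the code below is illustrative, not the plugin's actual implementation):

```go
import (
	resourceapi "k8s.io/api/resource/v1beta1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
)

// isClaimReadyForBinding reports whether every binding condition of every
// allocated device is True in the claim's device statuses.
func isClaimReadyForBinding(claim *resourceapi.ResourceClaim) bool {
	if claim.Status.Allocation == nil {
		return false
	}
	for _, result := range claim.Status.Allocation.Devices.Results {
		for _, condType := range result.BindingConditions {
			if !deviceConditionTrue(claim, result, condType) {
				return false
			}
		}
	}
	return true
}

// deviceConditionTrue finds the AllocatedDeviceStatus entry for the allocated
// device and checks the named condition.
func deviceConditionTrue(claim *resourceapi.ResourceClaim, result resourceapi.DeviceRequestAllocationResult, condType string) bool {
	for _, status := range claim.Status.Devices {
		if status.Driver == result.Driver && status.Pool == result.Pool && status.Device == result.Device {
			return apimeta.IsStatusConditionTrue(status.Conditions, condType)
		}
	}
	return false
}
```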
@@ -974,6 +1043,82 @@ func TestPlugin(t *testing.T) {
	},
},
},
"bound-claim-with-successed-binding-conditions": {
Can we add a case where a single pod has two different claims that have binding conditions, and then have tests for both succeed, one fails, one fails with timeout, etc.? Just to be sure we have something covering the multiple claims cases.
//
// +optional
// +featureGate=DRADeviceBindingConditions,DRAResourceClaimDeviceStatus
UsageRestrictedToNode *bool
BindsToNode sounds best to me (from https://github.com/kubernetes/kubernetes/pull/130160/files#r2201675405). It ties the field to the corresponding feature, and the present tense isn't wrong like "bound" was.
Thank you, I see. I will change the name.
}
}

time.Sleep(1 * time.Second)
Use a ticker for that (time.NewTicker).
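A short sketch of the suggested replacement, assuming the surrounding loop re-checks pod state once per second until the context expires (the helper and its check callback are hypothetical):

```go
import (
	"context"
	"time"
)

// pollWithTicker re-runs check once per second until it succeeds or the
// context is done; deadline handling stays with the caller.
func pollWithTicker(ctx context.Context, check func() bool) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if check() {
				return
			}
		}
	}
}
```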
if ctx.Err() == context.DeadlineExceeded {
	if op.ExpectedScheduled {
		tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
	} else if lastViolation != "" {
		tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
	} else {
		// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
		tCtx.Logf("All pods remained unscheduled for the duration")
		return
	}
} else {
	tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
You could flatten the conditions:
if ctx.Err() != context.DeadlineExceeded {
	tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if op.ExpectedScheduled {
	tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if lastViolation != "" {
	tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
}
// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
tCtx.Logf("All pods remained unscheduled for the duration")
return
currentViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
	pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)
lastViolation = currentViolation
Do it directly:
lastViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
	pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)
if stableStart == nil {
	now := time.Now()
	stableStart = &now
	tCtx.Logf("All pods are unscheduled, starting stability check")
It would be good to document somewhere how the stable start works here
for _, device := range slice.Spec.Devices {
	if device.Basic != nil && device.Basic.BindingConditions != nil {
		hasBindingFlag = true
		break
This break works only for the inner loop. You could consider using a function where you would just return.
Or you could use the hasBindingConditions method here as well.
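For reference, a sketch of what that helper presumably looks like (assuming the BasicDevice shape used elsewhere in this PR), which also sidesteps the inner-loop break issue:

```go
import resourceapi "k8s.io/api/resource/v1beta1"

// hasBindingConditions reports whether any device in the slice publishes
// binding conditions.
func hasBindingConditions(slice *resourceapi.ResourceSlice) bool {
	for _, device := range slice.Spec.Devices {
		if device.Basic != nil && len(device.Basic.BindingConditions) > 0 {
			return true
		}
	}
	return false
}
```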
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements the KEP-5007 DRA device binding conditions. This feature ensures that the scheduler waits in the PreBind phase until any DRA devices that need preparation are ready.
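For illustration, a hedged sketch of what a device published with binding conditions might look like; the condition names are made up, and the exact field shape may differ from what finally merges:

```go
import resourceapi "k8s.io/api/resource/v1beta1"

// exampleDevice shows a device that publishes binding conditions.
var exampleDevice = resourceapi.Device{
	Name: "gpu-0",
	Basic: &resourceapi.BasicDevice{
		// The scheduler waits in PreBind until all of these conditions are
		// True on the allocated device's status...
		BindingConditions: []string{"example.com/AttachmentSuccess"},
		// ...and stops waiting (and recovers) if any of these becomes True.
		BindingFailureConditions: []string{"example.com/AttachmentFailure"},
	},
}
```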
Which issue(s) this PR fixes:
Related to kubernetes/enhancements#5007
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: