Implement DRA Device Binding Conditions (KEP-5007) #130160


Open · wants to merge 8 commits into master from dra-device-binding-conditions

Conversation

KobayashiD27
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements the KEP-5007 DRA device binding conditions. This feature ensures that the scheduler waits in the PreBind phase until any DRA devices that need preparation are ready.

Which issue(s) this PR fixes:

Related to kubernetes/enhancements#5007

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add new optional APIs in ResourceSlice.Basic and ResourceClaim.Status.AllocatedDeviceStatus.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/5007
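
For reviewers who have not read the KEP, here is a rough sketch of the shape of the proposed fields (names follow the KEP and this thread; this is not a released API, and the condition names are made up for illustration). A driver publishes a device whose preparation must finish before Pod binding and declares which condition types signal readiness or failure and how long the scheduler may wait in PreBind:

package example

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/utils/ptr"
)

// exampleDevice illustrates a ResourceSlice entry that opts into binding conditions.
var exampleDevice = resourceapi.Device{
	Name: "fabric-gpu-0",
	Basic: &resourceapi.BasicDevice{
		// The scheduler waits in PreBind until this condition is True in the
		// claim's AllocatedDeviceStatus.
		BindingConditions: []string{"dra.example.com/is-prepared"},
		// If this condition becomes True instead, binding is aborted and the
		// allocation is cleaned up.
		BindingFailureConditions: []string{"dra.example.com/preparation-failed"},
		// Upper bound on how long the scheduler waits before giving up.
		BindingTimeoutSeconds: ptr.To[int64](600),
	},
}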

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2025
@k8s-ci-robot
Contributor

Hi @KobayashiD27. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2025
@KobayashiD27
Contributor Author

We are pleased to share the initial implementation of the KEP-5007 DRA device binding conditions. While this PR aligns with the approach outlined in the KEP, we recognize that there may be areas for improvement. We invite the community to review the implementation and provide feedback and insights to help refine and enhance this feature.

@pohly @johnbelamaric
First, from the perspective of DRA, I would like to confirm that the overall approach aligns with the intended direction. Your insight would be invaluable in ensuring that we are on the right track.

Contributor

@pohly pohly left a comment

Some quick comments. I have looked at the API and allocator, but not the scheduler plugin.

The API lacks validation, but that's of course okay when the main goal for now is to try out the functionality.

staging/src/k8s.io/api/resource/v1beta1/types.go Outdated
@KobayashiD27
Contributor Author

@pohly
Thank you for the initial feedback. To ensure we are on the right track, I would like to highlight a few key areas where community consensus is crucial. It would be helpful if you could prioritize reviewing the following aspects:

  1. Transferring AllocatedDeviceStatus from the allocator to the scheduler plugin
  2. Clearing the allocation state in the Unreserve phase

Early feedback on these sections would be very helpful.

Additionally, regarding the comment about the lack of API validation, are you referring to pkg/apis/resource/validation/validation.go?

@pohly
Contributor

pohly commented Feb 19, 2025

/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 19, 2025
pkg/apis/resource/validation/validation.go Outdated
pkg/apis/resource/validation/validation.go Outdated
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated
pkg/apis/resource/types.go Outdated
pkg/apis/resource/types.go Outdated
pkg/apis/resource/types.go Outdated
pkg/kubelet/cm/memorymanager/memory_manager_test.go Outdated
staging/src/k8s.io/api/resource/v1alpha3/types.go Outdated
staging/src/k8s.io/api/resource/v1alpha3/types.go Outdated
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated
staging/src/k8s.io/api/resource/v1beta1/types.go Outdated
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 1c36f41 to f07e273 Compare February 25, 2025 00:18
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch 3 times, most recently from 9cb01c9 to e0b29e2 Compare July 4, 2025 09:57
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 4, 2025
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from e0b29e2 to 5c4350f Compare July 4, 2025 10:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 4, 2025
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 5c4350f to 03312ae Compare July 4, 2025 10:41
@pohly
Contributor

pohly commented Jul 4, 2025

/assign @mortent

For a first pass. You are probably not familiar with the scheduler plugin, though.

@KobayashiD27: at some point I had implemented integration tests for this feature. Did you copy those into your branch? If not, then please do.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: johnbelamaric, KobayashiD27
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@KobayashiD27
Contributor Author

/retest

@KobayashiD27
Contributor Author

/assign @mortent

For a first pass. You are probably not familiar with the scheduler plugin, though.

@mortent
I know you are busy, but I would appreciate it if you could review this PR.

@KobayashiD27: at some point I had implemented integration tests for this feature. Did you copy those into your branch? If not, then please do.

Hi @pohly, thank you for your comment.

I’ve reviewed your suggestion and confirmed that the integration tests for this feature were not initially included in my branch. I’ve now copied them over and verified that they are working as expected.

Originally, the integration tests were failing due to changes introduced by the new API. I’ve also addressed those issues and updated the tests so that they now pass successfully.

@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 44082f0 to e19f3ff Compare July 7, 2025 08:00
podTemplatePath: templates/pod-with-claim-template.yaml
countParam: $measurePods
steadyState: true
collectMetrics: true
Member

collectMetrics is blocking, so this step will create pods and wait until they are scheduled, before going to the next op (updateDeviceConditions)

Contributor Author

https://github.com/kubernetes/kubernetes/blob/master/test/integration/scheduler_perf/scheduler_perf.go#L2325
Based on the comments, it appears that we want every workload to include at least one CreatePods operation where collectMetrics is set to true.

Member

But what is this test case expected to do if all pods are already scheduled at this point and updateDeviceConditions is meaningless for them?

Btw, I see this comment is a bit outdated as the startCollectingMetrics and stopCollectingMetrics can be used for that as well.

Contributor Author

I realize that I misunderstood your comment. If steadyState is true, I believe it will not cause any blocking.
Therefore, updateDeviceConditions is working fine.

Which approach would be appropriate in this case?

Member

Steady state is blocking, see other usages.

Let's try to run these new tests without two last steps - updateDeviceConditions and checkPodScheduled. You will see that the test is still successful.

Contributor Author

OK, I see. Instead of using collectMetrics I'll change it to use startCollectingMetrics and stopCollectingMetrics.

Member

But, what's the purpose of this test, if the pods can be just scheduled without device conditions update. Shouldn't the pods stay unschedulable until the conditions get updated?

Contributor Author

In the previous test case, steadyState was set to true due to my misunderstanding, so there were 0 pods when checkPodScheduled was called.
Now, with skipWaitToCompletion: true in createPods op, the expected check should work.

Contributor Author

If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.

Member

If we don't run updateDeviceConditions, the AttachmentSuccess test will fail as expected.

But what about the other test? Why is it still successful even though there is no update? Should it work like that? I'd expect scheduling to be blocked (in PreBind) until the device conditions are added.

test/integration/scheduler_perf/dra.go Outdated
test/integration/scheduler_perf/dra.go Outdated
pkg/apis/resource/validation/validation.go Outdated
pkg/registry/resource/resourceclaim/strategy.go Outdated
// deviceRequestAllocationResult returns a DeviceRequestAllocationResult object for testing purposes,
// specifying the driver, pool, device, usage restriction, binding conditions,
// binding failure conditions, and binding timeout.
func deviceRequestAllocationResult(request, driver, pool, device string, bindingConditions, bindingFailureConditions []string, bindingTimeout *int64) resourceapi.DeviceRequestAllocationResult {
Member

We already have a function called deviceAllocationResult that is very similar, it just supports adminAccess rather than binding conditions. Can we combine them, or at the very least, create a name convention that makes it a bit less confusing?

Contributor Author

Since deviceAllocationResult is already heavily used, I think it would be better to rename this function.
What do you think about the name change to deviceRequestAllocationResultWithBindingConditions?

Contributor Author

I have changed the name of the function.

needBindingSlice.Spec.Devices = nil
nodeSlice := slice.DeepCopy()
nodeSlice.Spec.Devices = nil
for _, device := range slice.Spec.Devices {
Member

This splits devices from a single ResourceSlice into two separate slices with the same name. I have two concerns about this:

  • I don't think this will work with the partitionable devices feature, since it relies on managing the consumption of counters across devices within a ResourceSlice. This duplicates the counters, which will almost certainly lead to oversubscription of the underlying devices.
  • We end up with two ResourceSlice objects with the same name, which means that lookup of a specific ResourceSlice by name will not work. Have we checked whether there is any logic today that relies on this? In any case, it seems like it would be surprising behavior for anyone adding features in the future.

Member

Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?

Contributor Author

@KobayashiD27 KobayashiD27 Jul 8, 2025

As far as I understand, this operation does not split a single ResourceSlice into two. They are then combined again into a single pool by the addSlice() operation. In this sequence, we are just switching the order.
If my understanding is incorrect, I would appreciate it if you could let me know.

Contributor Author

Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?

I'll check if it's necessary in this case.

Contributor Author

After reviewing the code again, it looks like two slices with the same name could appear in the pool. So I'll update it to just swap their order and add the slice to the pool without splitting it.

Contributor Author

Also, doesn't the next case statement that covers perDeviceNodeSelection also need to handle this?

I'll check if it's necessary in this case.

I think this should be added to this case too. I will add it.

Contributor Author

I fixed the issue with the slices being split and added handling for the next case too.

@@ -3325,6 +3356,40 @@ func TestAllocator(t *testing.T) {
deviceAllocationResult(req0, driverA, pool1, device1, false),
)},
},
"device-binding-conditions": {
Member

We should add tests here to make sure this feature works with the partitionable devices feature, and probably some of the other features like device taints and prioritized lists too, unless we know they are orthogonal.

Contributor Author

I'll try and see if I can add a suitable test.

cc @pohly @johnbelamaric
If you know whether this feature is orthogonal to the other features, could you please let me know?

Contributor Author

I added some test cases.

Member

It's the other ResourceSlice ones I would be more concerned about than the ResourceClaim based ones. So, partitionable for one, and eventually consumable capacity - but that's not merged yet either so we can deal with it later.

…t to address review comment

Instead of using `collectMetrics`, use  `startCollectingMetrics` and `stopCollectingMetrics`
Add test cases for multiple featuregates
Update allocator testcode and fix some confusing function name
Fix GatherPool() logic
Update validation code to use validateSlice
@KobayashiD27 KobayashiD27 force-pushed the dra-device-binding-conditions branch from 3b390d0 to 86a597b Compare July 8, 2025 09:00
@KobayashiD27 KobayashiD27 requested review from macsko and mortent July 8, 2025 09:40
@@ -896,6 +938,66 @@ func TestStatusStrategyUpdate(t *testing.T) {
}
},
},
"keep-fields-binding-conditions": {
Member

I think we also need a test here to verify that the fields are kept if they already exist, even if the feature is disabled.

@@ -410,12 +446,79 @@ func TestResourceSliceStrategyUpdate(t *testing.T) {
return obj
}(),
},
"drop-fields-binding-conditions": {
Member

Same as above, I think we also need a test here to verify that we don't drop the fields if they are already in use even if the feature is disabled.

// If deviceBindingConditions are enabled, we need to populate the AllocatedDeviceStatus.
if a.features.DeviceBinding {
for _, device := range internal.slice.Spec.Devices {
if device.Name == internal.id.Device && len(device.Basic.BindingConditions) > 0 {
Member

Why are we only checking the length of device.Basic.BindingConditions here and not device.Basic.BindingFailureConditions?

Contributor Author

I do not believe that it is desirable for a device that only contains BindingFailureConditions to cause a wait in PreBind.

Contributor

I agree. Even better, such a device should be considered invalid (= rejected during validation) because specifying BindingFailureConditions without at least one BindingCondition does not make sense.

Member

That makes sense. Agree with @pohly that handling this in validation would be useful. Otherwise the BindingFailureConditions will just get dropped here if there aren't any BindingConditions.
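
A minimal sketch of that validation rule, assuming the internal pkg/apis/resource types and the field names proposed in this PR (how it plugs into validateSlice may differ):

// validateDeviceBindingConditions rejects a device that specifies
// BindingFailureConditions without at least one BindingCondition.
func validateDeviceBindingConditions(device resource.BasicDevice, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if len(device.BindingFailureConditions) > 0 && len(device.BindingConditions) == 0 {
		allErrs = append(allErrs, field.Required(fldPath.Child("bindingConditions"),
			"required when bindingFailureConditions are specified"))
	}
	return allErrs
}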

}
}
if len(needBindingSlice.Spec.Devices) > 0 {
// If there are devices that need binding, we create a separate slice
Member

This comment needs to be updated I think.

// If there are devices that need binding, we create a separate slice
// for them, so that they can be handled separately.
nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, needBindingSlice.Spec.Devices...)
needBindingSlices = append(needBindingSlices, nodeSlice)
Member

The naming here is a bit confusing since we don't add the needBindingSlice to the needBindingSlices list. Also, do we need the needBindingSlice at all? Can't we just use a list of devices?

@@ -88,6 +109,26 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node
device.String(), slice.Name, err)
}
if match {
if hasBindingConditions(slice) {
Member

Is there a reasonable way to avoid the duplication here?

Contributor Author

I will try extracting the sorting process as a function.

nodeSlice.Spec.Devices = append(nodeSlice.Spec.Devices, device)
}
}
if len(needBindingSlice.Spec.Devices) > 0 {
Member

Is there any way this check doesn't pass when the call to hasBindingConditions on line 112 passed?

Contributor Author

If I understand correctly, you're saying this check is unnecessary because line 112 already covers the condition, right? If so, I agree. I'll remove it. Thanks for catching that!

Member

Yeah, it seems like hasBindingConditions has already established that there is at least one device with binding conditions in the slice, so I'm not sure how the list of devices in needBindingSlice can be empty here.

@@ -107,13 +148,39 @@ func GatherPools(ctx context.Context, slices []*resourceapi.ResourceSlice, node

}

if len(needBindingSlices) > 0 {
Member

From what I understand, the logic here puts devices that use binding conditions last in the device list for each ResourceSlice, then puts the ResourceSlices with such devices last in each pool, and finally puts those pools last for the allocator search.

What is the goal here? This makes it less likely that a device with conditions is selected, but the allocator can still end up choosing a device with conditions even though a device without them is available.

I'm a bit uncertain about changing the order of the devices in a ResourceSlice. Since the allocator does a first-fit search, the order can be important. For example, a request for a GPU with at least 1Gi of memory will select one with 40Gi available if it is listed first in the ResourceSlice, even though that might be a poor fit. Giving the ResourceSlice author control of the order makes it a bit easier to avoid poor allocation choices, although scoring is obviously the real solution here. Maybe not a big issue since this only happens for devices with binding conditions, which is also controlled by the ResourceSlice author.

Contributor Author

From what I understand, the logic here puts devices that use binding conditions last in the device list for each ResourceSlice, then puts the ResourceSlices with such devices last in each pool, and finally puts those pools last for the allocator search.

Yes, you are right.

What is the goal here? This makes it less likely that a device with conditions is selected, but the allocator can still end up choosing a device with conditions even though a device without them is available.

I'm a bit uncertain about changing the order of the devices in a ResourceSlice. Since the allocator does a first-fit search, the order can be important. For example, a request for a GPU with at least 1Gi of memory will select one with 40Gi available if it is listed first in the ResourceSlice, even though that might be a poor fit. Giving the ResourceSlice author control of the order makes it a bit easier to avoid poor allocation choices, although scoring is obviously the real solution here. Maybe not a big issue since this only happens for devices with binding conditions, which is also controlled by the ResourceSlice author.

The goal of this change is to prioritize devices without BindingConditions during allocation. By adjusting the search order, we aim to reduce the likelihood of selecting devices with conditions when other suitable options are available.

Originally, the implementation only changed the order of ResourceSlices within a pool. Based on this discussion, I added logic to also reorder devices within a ResourceSlice.

However, I’ve realized that the discussion around whether device reordering is appropriate hasn’t been fully settled. If there are concerns about changing the order of devices, one alternative could be to revert that part of the logic and instead document that ResourceSlices containing devices with BindingConditions are less likely to be selected by the allocator.

@pohly
Any thoughts on sorting the devices?

Contributor

What we really want is "order all devices in the pool" according to the triplet binding conditions yes/no, name of slice, index within slice.

The "name of the slice" is just there to ensure that we don't compare device indices from different slices, which would be meaningless. The name is random, so driver's cannot rely on that to prefer some devices over others.

If a driver does not use binding conditions, then this is equivalent to the current "look at random slices, iterate over devices in them in the order chosen by the DRA driver".

If a driver uses binding conditions, then it can publish some devices with binding conditions in one slice and others without binding conditions in another and it will work as intended. The sorting in the allocator wouldn't be necessary if all devices were in the same slice, but we cannot assume that.

Contributor

Maintaining such a list is a change of how the allocator currently iterates over devices and also adds overhead. So as a less intrusive change, sorting slices based on "some device uses binding conditions" and not sorting within the slice should be sufficient.

It's just an approximation, but DRA driver authors can make it work for them by not mixing devices with and without binding conditions in the same slice; if they do need to mix them, they can use a single slice with the devices that have binding conditions listed last.

I had violated that constraint in #130160 (comment) and didn't get the expected outcome. But if we clearly document this for driver authors, they can avoid such a mistake, so I am fine with the original implementation.

Member

So my understanding is that the original implementation just ordered ResourceSlices based on whether they had devices with BindingConditions in them. Agree that it should be sufficient since users can control how devices are split across ResourceSlices (with some caveats around Partitionable Devices) and they control the order of devices within each ResourceSlice.
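
A minimal sketch of that less intrusive ordering, assuming the hasBindingConditions helper already used in GatherPools and the resourceapi alias for k8s.io/api/resource/v1beta1; it reorders whole ResourceSlices within a pool and leaves the driver-chosen device order inside each slice untouched:

// sortSlicesForBindingConditions moves ResourceSlices that contain any device
// with binding conditions behind those that do not, so the first-fit allocator
// tries plain devices first. The stable sort keeps the existing order within
// each group.
func sortSlicesForBindingConditions(slices []*resourceapi.ResourceSlice) {
	sort.SliceStable(slices, func(i, j int) bool {
		return !hasBindingConditions(slices[i]) && hasBindingConditions(slices[j])
	})
}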

@@ -363,3 +371,278 @@ claims:
tCtx.Fatalf("Could not allocate claim %d out of %d", i, len(claims))
}
}

type updateDeviceConditionsOp struct {
Member

Can we update these to explain it in a little more detail? It would be helpful to understand how the ResourceClaims to be updated are identified and how the process is configured with the parameters defined on this op.

claimName := item.GetName()
var claim *resourceapi.ResourceClaim

err = wait.PollUntilContextTimeout(ctx, 1*time.Second, op.Duration.Duration, true, func(ctx context.Context) (bool, error) {
Member

Do we plan to use these tests for performance? If so, how much will the result be influenced on how fast this code can update the conditions? I just want to understand what piece of code is actually being tested in that case.

Contributor Author

I'm not intending to use these tests for performance measurement.
However, it’s important that the code updates the conditions faster than the timeout value specified by BindingTimeoutSeconds. Otherwise, the test may unintentionally fail due to the Pod not being scheduled in time. So while performance isn’t the goal here, the timing does need to meet that threshold to ensure the test behaves as expected.

Member

The reason for my question is mostly around whether it would be better to handle integration tests as part of https://github.com/kubernetes/kubernetes/tree/master/test/integration/dra rather than here.

pkg/apis/resource/types.go
@@ -345,3 +347,42 @@ func dropDeallocatedStatusDevices(newClaim, oldClaim *resource.ResourceClaim) {
}

// TODO: add tests after partitionable devices is merged (code conflict!)
Member

Is this an out-of-date TODO? If so please remove it

Contributor Author

This is not a TODO I added, and I don't think it's a good idea for me to delete it in this PR.

Contributor

Agreed, I'll take a look.

Contributor

=> #132927

@@ -278,6 +279,44 @@ var objWithAdminAccessStatus = &resource.ResourceClaim{
},
}

var objWithDeviseBindingConditions = &resource.ResourceClaim{
Member

s/Devise/Device/

bound, err := pl.isClaimBound(claim)
// If the claim is not bound, assume a timeout and clean it up.
if err != nil || !bound {
unavailableClaims = append(unavailableClaims, index)
Member

Looking below, this has been implemented (using AllocationTimestamp). If the integration tests are done too (I haven't gotten there yet), can we resolve this comment?

@@ -884,6 +959,98 @@ func (pl *DynamicResources) bindClaim(ctx context.Context, state *stateData, ind
return claim, nil
}

// isClaimBound checks whether a given resource claim is successfully
// bound to a device.
Member

I think we're overusing the word "binding" here. The "binding conditions" are the conditions needed to be able to proceed to the Bind() phase for the Pod. They are not about binding a device to a claim. I think instead what you want is "isClaimBindable" or "isClaimReadyForBinding" or maybe just "areBindingConditionsMet" or "claimNotBlockingBinding" or something like that.

Contributor Author

OK, I see. I will change the function name to isClaimReadyForBinding.
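
For context, a rough sketch of what isClaimReadyForBinding could check, pieced together from this thread; the BindingConditions and BindingFailureConditions fields on the allocation result and the Conditions on AllocatedDeviceStatus are the ones proposed here, and the helper is illustrative rather than the PR's actual code:

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/api/meta"
)

func isClaimReadyForBinding(claim *resourceapi.ResourceClaim) (bool, error) {
	if claim.Status.Allocation == nil {
		return false, nil
	}
	for _, result := range claim.Status.Allocation.Devices.Results {
		if len(result.BindingConditions) == 0 {
			// No binding conditions: nothing blocks PreBind for this device.
			continue
		}
		// Look up the device's status entry, if the driver has reported one yet.
		var status *resourceapi.AllocatedDeviceStatus
		for i := range claim.Status.Devices {
			d := &claim.Status.Devices[i]
			if d.Driver == result.Driver && d.Pool == result.Pool && d.Device == result.Device {
				status = d
				break
			}
		}
		if status == nil {
			return false, nil
		}
		for _, cond := range result.BindingFailureConditions {
			if meta.IsStatusConditionTrue(status.Conditions, cond) {
				return false, fmt.Errorf("device %s reported failure condition %q", result.Device, cond)
			}
		}
		for _, cond := range result.BindingConditions {
			if !meta.IsStatusConditionTrue(status.Conditions, cond) {
				// Still waiting; the caller retries until the binding timeout expires.
				return false, nil
			}
		}
	}
	return true, nil
}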

@@ -974,6 +1043,82 @@ func TestPlugin(t *testing.T) {
},
},
},
"bound-claim-with-successed-binding-conditions": {
Member

Can we add a case where a single pod has two different claims that have binding conditions, and then have tests for both succeed, one fails, one fails with timeout, etc.? Just to be sure we have something covering the multiple claims cases.


//
// +optional
// +featureGate=DRADeviceBindingConditions,DRAResourceClaimDeviceStatus
UsageRestrictedToNode *bool
Contributor

BindsToNode sounds best to me (from https://github.com/kubernetes/kubernetes/pull/130160/files#r2201675405).

It ties the field to the corresponding feature and the present tense isn't wrong like "bound" was.

Contributor Author

Thank you, I see. I will change the name.
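
For reference, a sketch of how the renamed field might read; only the name BindsToNode comes from this thread, while the doc text and placement are assumptions:

type BasicDevice struct {
	// ...existing fields elided...

	// BindsToNode restricts use of the allocation to the node that was
	// selected when the claim was allocated. Only meaningful together with
	// BindingConditions.
	//
	// +optional
	// +featureGate=DRADeviceBindingConditions,DRAResourceClaimDeviceStatus
	BindsToNode *bool
}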

}
}

time.Sleep(1 * time.Second)
Member

Use ticker for that (time.NewTicker)
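
A small sketch of that suggestion (an illustration, not the PR's code): a time.NewTicker drives the poll loop, and the context stops it promptly instead of sleeping blindly.

import (
	"context"
	"time"
)

// pollEverySecond calls check once per second until it reports done, returns
// an error, or the context is cancelled.
func pollEverySecond(ctx context.Context, check func() (done bool, err error)) error {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			done, err := check()
			if err != nil || done {
				return err
			}
		}
	}
}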

Comment on lines +1075 to +1087
if ctx.Err() == context.DeadlineExceeded {
if op.ExpectedScheduled {
tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
} else if lastViolation != "" {
tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
} else {
// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
tCtx.Logf("All pods remained unscheduled for the duration")
return
}
} else {
tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
Member

You could flatten the conditions:

Suggested change
if ctx.Err() != context.DeadlineExceeded {
tCtx.Fatalf("Context cancelled: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if op.ExpectedScheduled {
tCtx.Fatalf("Timeout waiting for pods to be scheduled. Context error: %v. Last violation: %s", ctx.Err(), lastViolation)
}
if lastViolation != "" {
tCtx.Fatalf("Expected pods to remain unscheduled, but some pods were scheduled. Last violation: %s", lastViolation)
}
// If ExpectedScheduled is false (pods should be unscheduled), timeout means success.
tCtx.Logf("All pods remained unscheduled for the duration")
return

Comment on lines +1104 to +1106
currentViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)
lastViolation = currentViolation
Member

Do it directly:

Suggested change
lastViolation = fmt.Sprintf("Pod %s: ExpectedScheduled=%v, ActualScheduled=%v (Node=%s)",
pod.Name, op.ExpectedScheduled, pod.Spec.NodeName != "", pod.Spec.NodeName)

if stableStart == nil {
now := time.Now()
stableStart = &now
tCtx.Logf("All pods are unscheduled, starting stability check")
Member

It would be good to document somewhere how the stable start works here
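
One way to document it, sketched here as a standalone helper rather than the PR's exact code: success is only declared after "all pods remain unscheduled" has held for a full stability window, and any scheduled pod resets that window.

// stabilityTracker captures the "stable start" idea used above.
type stabilityTracker struct {
	window time.Duration
	start  *time.Time
}

// observe records one poll result and reports whether the condition has now
// held continuously for the whole window.
func (s *stabilityTracker) observe(conditionHolds bool) bool {
	if !conditionHolds {
		s.start = nil // violated; the window restarts on the next success
		return false
	}
	if s.start == nil {
		now := time.Now()
		s.start = &now // first observation; start the window
		return false
	}
	return time.Since(*s.start) >= s.window
}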

for _, device := range slice.Spec.Devices {
if device.Basic != nil && device.Basic.BindingConditions != nil {
hasBindingFlag = true
break
Member

This break works only for the inner loop. You could consider using a function where you would just return.

Member

Or you could use hasBindingConditions method here as well
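
For reference, a plausible shape for that helper (the PR's actual signature may differ); returning from a dedicated function also sidesteps the inner-loop break issue raised above:

// hasBindingConditions reports whether any device in the slice declares
// binding conditions.
func hasBindingConditions(slice *resourceapi.ResourceSlice) bool {
	for _, device := range slice.Spec.Devices {
		if device.Basic != nil && len(device.Basic.BindingConditions) > 0 {
			return true
		}
	}
	return false
}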

Labels
api-review Categorizes an issue or PR as actively needing an API review. area/code-generation area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.