[WIP][FG:InPlacePodVerticalScaling] Fix Static CPU management policy alongside InPlacePodVerticalScaling #129719
base: master
Conversation
@Chunxia202410, @hshiina, @ffromani, @AnishShah, @tallclair, @SergeyKanzhelev, @vinaykul: I moved the old PR here due to a company transfer, so now I can continue working on this. I will prioritize updating the commit with the recent changes in InPlacePodVerticalScaling and the proposals shared by @Chunxia202410.
Force-pushed from 6b6da95 to cc91639.
Thanks @Chunxia202410 for the contributions; I merged your two PRs to check the tests. I have some questions about the second PR regarding the strategy; I thought it best to discuss them here and will ask later this week, as I need to do some more tests. Thanks for the API PR, it seems to be one of the options discussed. I will try to update the tests tomorrow or by the end of this week, to make sure the tests are covered and passing before the Beta in v1.33, so we have a point of reference. We will need to raise the final solution or solutions in SIG Node after v1.33 is finished and this PR is refactored against the ongoing InPlacePodVerticalScaling KEP changes. Thanks again for your PRs.
/retest
Force-pushed from e6f5d5a to dc25a5a.
Hi @ffromani and @kad, please take another look at this PR. I have followed the "promised" local-solution approach, as requested during the SIG Node meeting. Francesco, I don't think it adds value to split this into three issues as asked above, and I am not sure a KEP update is needed for this fix, but I am leaving that up to you and kad to decide. PS: I have kept the non-API parts of @Chunxia202410's 14df31c commit for the moment, but depending on the decision in #129719 (comment) I might revert it fully; in that case I will revert and then squash the commits together to make rebasing and review easier.
@@ -438,7 +562,7 @@ func (p *staticPolicy) allocateCPUs(s state.State, numCPUs int, numaAffinity bit
 		numAlignedToAlloc = numCPUs
 	}

-	allocatedCPUs, err := p.takeByTopology(alignedCPUs, numAlignedToAlloc)
+	allocatedCPUs, err := p.takeByTopology(alignedCPUs, numAlignedToAlloc, reusableCPUsForResize, mustKeepCPUsForResize)
Is it guaranteed that mustKeepCPUsForResize (and reusableCPUsForResize?) is a subset of alignedCPUs? It may be worth adding a check and throwing an error if it is not; otherwise we might end up leaking CPUs.
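A minimal sketch of such a guard, using the cpuset helpers from k8s.io/utils/cpuset; validateResizeCPUs and its parameter names are hypothetical, standing in for a check inside allocateCPUs, not code from this PR:

package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// validateResizeCPUs is a hypothetical guard: it errors out when the CPUs that
// must survive an in-place resize are not contained in the candidate set the
// static policy is about to allocate from, so CPUs cannot silently leak.
func validateResizeCPUs(aligned, mustKeep, reusable cpuset.CPUSet) error {
	if !mustKeep.IsSubsetOf(aligned) {
		return fmt.Errorf("must-keep CPUs %v are not a subset of candidate CPUs %v", mustKeep, aligned)
	}
	if !reusable.IsSubsetOf(aligned) {
		return fmt.Errorf("reusable CPUs %v are not a subset of candidate CPUs %v", reusable, aligned)
	}
	return nil
}

func main() {
	aligned := cpuset.New(0, 1, 2, 3)
	fmt.Println(validateResizeCPUs(aligned, cpuset.New(2, 3), cpuset.New(1))) // <nil>
	fmt.Println(validateResizeCPUs(aligned, cpuset.New(7), cpuset.New(1)))    // error: CPU 7 would leak
}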
klog.InfoS("AllocateCPUs", "numCPUs", numCPUs, "socket", numaAffinity) | ||
|
||
allocatableCPUs := p.GetAvailableCPUs(s).Union(reusableCPUs) | ||
allocatableCPUs := cpuset.New() |
Maybe worth exploring an alternate approach here. Instead of combining everything (reusableCPUsForResize, available CPUs) into allocatableCPUs and then attempting allocation, we could use a more tiered approach:
- Attempt allocation from mustKeepCPUsForResize.
- Allocate the remainder from reusableCPUsForResize.
- Then from NUMA-aligned CPUs.
- Then from other NUMA nodes.
With this approach, the takeByTopology() function would not have to change? A rough sketch of this ordering follows.
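The sketch below assumes plain cpuset operations; allocateTiered and its take helper are illustrative stand-ins for the real topology-aware allocator and do not reflect the actual takeByTopology signature:

package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// take is a stand-in for the real topology-aware allocator: it grabs up to n
// CPUs from the given pool (lowest IDs first, purely for illustration).
func take(pool cpuset.CPUSet, n int) cpuset.CPUSet {
	ids := pool.List()
	if n > len(ids) {
		n = len(ids)
	}
	return cpuset.New(ids[:n]...)
}

// allocateTiered illustrates the proposed order: must-keep CPUs first, then
// reusable CPUs freed by the resize, then NUMA-aligned CPUs, then the rest.
func allocateTiered(numCPUs int, mustKeep, reusable, aligned, others cpuset.CPUSet) (cpuset.CPUSet, error) {
	result := cpuset.New()
	for _, pool := range []cpuset.CPUSet{mustKeep, reusable, aligned, others} {
		remaining := numCPUs - result.Size()
		if remaining <= 0 {
			break
		}
		result = result.Union(take(pool.Difference(result), remaining))
	}
	if result.Size() != numCPUs {
		return cpuset.New(), fmt.Errorf("wanted %d CPUs, only found %d", numCPUs, result.Size())
	}
	return result, nil
}

func main() {
	got, err := allocateTiered(4,
		cpuset.New(2, 3),     // mustKeepCPUsForResize
		cpuset.New(4),        // reusableCPUsForResize
		cpuset.New(0, 1, 5),  // NUMA-aligned free CPUs
		cpuset.New(8, 9, 10)) // free CPUs on other NUMA nodes
	fmt.Println(got, err) // 0,2-4 <nil>
}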
klog.InfoS("Regenerating TopologyHints for CPUs already allocated", "pod", klog.KObj(pod), "containerName", container.Name) | ||
return map[string][]topologymanager.TopologyHint{ | ||
string(v1.ResourceCPU): p.generateCPUTopologyHints(allocated, cpuset.CPUSet{}, requested), | ||
} | ||
} |
I am still not sure this is handled. Do we throw an error if mustKeepCPUsForResize is not used during Allocation()? mustKeepCPUsForResize might not align with the new NUMA hint we generate here.
Use the new topology.Allocation struct (a CPU set plus alignment metadata) instead of a plain CPU set, due to rebase. Remove the duplicate, unnecessary SetDefaultCPUSet call as per the review comment.
- Revert introduction of the API env mustKeepCPUs
- Replace mustKeepCPUs with the local checkpoint "promised"
- Introduce "promised" in the CPUManagerCheckpointV3 format
- Add logic, refactor with the Beta candidate
- Fix lint issues
- TODO: improve/align resize tests; go through testing and corner cases
- TODO: improve CPUManagerCheckpointV3 tests
- TODO: address code review/feedback
- TODO: check init containers
- TODO: check migration from the v2 to the v3 CPU Manager checkpoint
- TODO: check kubectl failure when prohibited; can this be done earlier?
- TODO: update CPU Manager tests to use the refactored cpu_manager_test
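For context, the v2 checkpoint stores the policy name, the default CPU set, the per-pod/per-container assignments, and a checksum. A v3 layout carrying the "promised" CPUs could look roughly like the sketch below; the field names and the checksum representation are assumptions for illustration, not the format introduced by this PR:

package main

import (
	"encoding/json"
	"fmt"
)

// CPUManagerCheckpointV3 sketches a possible v3 checkpoint layout: the v2
// fields plus the CPUs "promised" to each container at start, which an
// in-place resize must keep honoring.
type CPUManagerCheckpointV3 struct {
	PolicyName    string `json:"policyName"`
	DefaultCPUSet string `json:"defaultCpuSet"`
	// Entries: pod UID -> container name -> currently assigned CPUs.
	Entries map[string]map[string]string `json:"entries,omitempty"`
	// Promised: pod UID -> container name -> CPUs promised at container start.
	Promised map[string]map[string]string `json:"promised,omitempty"`
	Checksum uint64                       `json:"checksum"`
}

func main() {
	cp := CPUManagerCheckpointV3{
		PolicyName:    "static",
		DefaultCPUSet: "0-1",
		Entries:       map[string]map[string]string{"pod-uid": {"ctr": "2-3"}},
		Promised:      map[string]map[string]string{"pod-uid": {"ctr": "2-3"}},
	}
	out, _ := json.MarshalIndent(cp, "", "  ")
	fmt.Println(string(out))
}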
@esotsal: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
It is needed to resize the allocated CPUs of a Guaranteed QoS class Pod with integer CPU requests, without restart, when the static CPU management policy is used alongside InPlacePodVerticalScaling.
The current PR retains the CPUs "promised" to the container at its start.
Which issue(s) this PR fixes:
The issue is mentioned in the last bullet of https://kubernetes.io/blog/2023/05/12/in-place-pod-resize-alpha/#known-issues.
I wasn't able to find an existing Kubernetes issue; I am posting this PR following the "I found it, I fix it" ethos after a discussion in Slack.
Update: Attempt to fix #127262
Special notes for your reviewer:
[Update] Latest demo screencast, 14 April 2025:
Screencast.from.2025-04-14.10-02-45.webm
This PR replaces #123319; a new PR was needed due to a company transfer.
This PR also includes a merge from esotsal#9, so that proposals can be tested and reviewed in one common PR. Thanks @Chunxia202410 for the contributions.
Does this PR introduce a user-facing change?