add exponential backoff with cap and reset in reflector during retrying ListWatch #87795
Conversation
Hi @zhan849. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm not opposed to jittering this, though I'm skeptical that jittering by O(1 second) will make much of a difference at scale. I'd be more inclined to do something like period=1s, jitter factor more like 5-10, but would like feedback from @lavalamp / @smarterclayton / @wojtek-t
In this case I think exponential backoff makes more sense. This jitter still averages out to 1 qps per lister when apiserver is unhealthy, or constant load in the unhealthy state. What we actually want is decreasing load in the unhealthy state.
good point… I guess the question is over what period we would want the retry to reset to the min delay. If I recall correctly, this loop is exercised every time we list, which happens under normal circumstances occasionally. Do we have an existing helper that will 1) back off exponentially to a max delay when called frequently, then 2) reset to min delay after not being called for "a while"?
Thanks @lavalamp and @liggitt for the input - jittered exponential backoff is definitely what we need here. I checked that we don't currently have anything in staging, but I can try to implement something inside reflector leveraging the existing building blocks.
Will update the PR shortly.
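For reference, here is a minimal sketch of the kind of helper being asked for: the delay grows exponentially up to a cap while the helper is being used, and resets to the minimum after it has sat idle for a while. All names are made up for illustration - this is not an existing client-go/apimachinery API - and it omits the jitter and clock injection a real implementation would want.

```go
package backoffsketch

import "time"

// idleResettingBackoff grows the delay exponentially up to max on every call,
// and falls back to min if it has not been asked for a delay within resetAfter.
type idleResettingBackoff struct {
	min, max   time.Duration
	resetAfter time.Duration
	current    time.Duration
	lastCall   time.Time
}

func newIdleResettingBackoff(min, max, resetAfter time.Duration) *idleResettingBackoff {
	return &idleResettingBackoff{min: min, max: max, resetAfter: resetAfter, current: min}
}

// Next returns how long the caller should wait before the next retry.
func (b *idleResettingBackoff) Next() time.Duration {
	now := time.Now()
	if !b.lastCall.IsZero() && now.Sub(b.lastCall) > b.resetAfter {
		// Idle long enough: assume the connection recovered and start over.
		b.current = b.min
	}
	b.lastCall = now

	d := b.current
	if b.current < b.max {
		b.current *= 2
		if b.current > b.max {
			b.current = b.max
		}
	}
	return d
}
```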
// backoffManager manages jittered exponential back off with a cap and a reset period. Exponential back off will stop
// to increase at the given max, and timer will reset after not backing off for a while
type backoffManager struct {
This seems useful enough that it should be moved to a public place imho.
client-go/util/flowcontrol seems to be a good place
Probably also a separate PR as this will be universally useful.
Agree, let's get it merged here and move it in a second step.
thanks for the feedback! I will create a separate PR for backoff manager in client-go/util/flowcontrol
package, and get that merged first. There will be a second PR updating reflector
Structure looks good, have some implementation requests. Thanks!
@@ -161,11 +165,28 @@ var internalPackages = []string{"client-go/tools/cache/"}
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	klog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
	wait.Until(func() {
		defer func() {
			klog.V(2).Infof("Stopping reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
V(3) to match the above?
Actually, after a second thought, I changed both to v2 for debuggability - most open source Kubernetes provisioners set components to v2, and these logs can provide useful debug information about component startup.
// this will result in a 0.2sec, 0.4sec, 0.8sec, 1.6sec, 3.2sec, 6.4sec, 10sec, 10sec, ... each backoff is
// added with a 0.5 jitter. If the backoff is not triggered within 30sec, assume connection to api server
// is stable and we reset backoff
backoff: newBackoffManager(200*time.Millisecond, 10*time.Second, 30*time.Second, 2.0, 0.5),
I vote for at least 20s max wait (95% traffic reduction in broken state vs current 1s), and I'd go as high as 100s (99% reduction).
I think 100s is a bit too much, but I would go higher than 10s - maybe 30s as a somewhat round number?
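(For rough intuition, ignoring jitter: once a lister has hit the cap it retries roughly once per cap interval, so relative to the current 1 QPS per lister a 10s cap leaves ~0.1 QPS, 20s ~0.05 QPS (95% reduction), 30s ~0.033 QPS, and 100s ~0.01 QPS (99% reduction).)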
for {
	// check stopCh before each loop in case r.backoff.Backoff() gets selected before stopCh,
	// so we avoid one extra invocation
shouldn't the call to ListAndWatch return immediately in that case? I don't think this is necessary? The current code doesn't seem to feel a need to do this.
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L219 - List() is executed asynchronously, so it might still go through.
The current code has the same problem. Even this code still has a race, the channel could close immediately after we pass this line.
When context is wired in, there will be a more obvious way to make this correct.
I see. Will fix this in the next PR - going to keep this PR focused on getting the BackoffManager correct.
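For illustration only, here is the loop shape being discussed, written as a standalone helper rather than against the reflector's real fields (nextBackoff stands in for r.backoff.Backoff): check the stop channel before starting another iteration, then wait on either the stop channel or the backoff timer. As noted above, the small race right after the first check remains; wiring in a context would close it more cleanly.

```go
package backoffsketch

import "time"

// runWithBackoff runs work repeatedly, waiting on the backoff timer between
// iterations, and returns as soon as stop is closed.
func runWithBackoff(work func(), nextBackoff func() *time.Timer, stop <-chan struct{}) {
	for {
		// Check stop first so that a closed channel does not still trigger
		// one extra invocation of work (the point of the check in the diff above).
		select {
		case <-stop:
			return
		default:
		}

		work()

		// Wait for either shutdown or the next backoff tick.
		t := nextBackoff()
		select {
		case <-stop:
			t.Stop() // avoid leaking the timer on shutdown
			return
		case <-t.C:
		}
	}
}
```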
}
}

func (b *backoffManager) Backoff() *time.Timer {
For testability, I suggest you split this into two functions, one that computes the duration (can be private) and one that sets and returns the timer.
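A sketch of that split against the fields visible in this diff (lastBackoffStart, backoffResetDuration); the other field names and the nextBackoffDuration name are assumptions made up for illustration, and wait.Jitter is the existing apimachinery helper.

```go
// nextBackoffDuration computes the next delay without touching timers, so it
// can be unit-tested directly.
func (b *backoffManager) nextBackoffDuration() time.Duration {
	if time.Now().Sub(b.lastBackoffStart) > b.backoffResetDuration {
		// No backoff for a while: assume the connection is healthy again.
		b.duration = b.initialDuration
	}
	b.lastBackoffStart = time.Now()

	d := b.duration
	if b.jitter > 0 {
		d = wait.Jitter(d, b.jitter)
	}
	if b.duration < b.maxDuration {
		b.duration = time.Duration(float64(b.duration) * b.factor)
		if b.duration > b.maxDuration {
			b.duration = b.maxDuration
		}
	}
	return d
}

// Backoff stays trivial: it only wraps the computed duration in a timer.
func (b *backoffManager) Backoff() *time.Timer {
	return time.NewTimer(b.nextBackoffDuration())
}
```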
}

func (b *backoffManager) Backoff() *time.Timer {
	if time.Now().Sub(b.lastBackoffStart) > b.backoffResetDuration {
}
}

func TestBackoffManagerResetTimer(t *testing.T) {
Please test it all the way up to the max time and verify that it stops there.
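A sketch of such a test, leaning on the hypothetical nextBackoffDuration split suggested above and the newBackoffManager constructor from this PR's diff, with zero jitter so the durations are deterministic (the real test may need to account for jitter or use a fake clock).

```go
// TestBackoffManagerCapsAtMax walks the backoff all the way to the cap and
// verifies it stops growing there. Assumes newBackoffManager(initial, max,
// reset, factor, jitter) and the nextBackoffDuration helper sketched earlier.
func TestBackoffManagerCapsAtMax(t *testing.T) {
	b := newBackoffManager(200*time.Millisecond, 10*time.Second, 30*time.Second, 2.0, 0.0)

	var last time.Duration
	for i := 0; i < 20; i++ {
		d := b.nextBackoffDuration()
		if d > 10*time.Second {
			t.Fatalf("step %d: backoff %v exceeded the 10s cap", i, d)
		}
		if d < last {
			t.Fatalf("step %d: backoff %v decreased from %v before any reset", i, d, last)
		}
		last = d
	}
	if last != 10*time.Second {
		t.Errorf("expected backoff to settle at the 10s cap, got %v", last)
	}
}
```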
clock: &clock.RealClock{},
// this will result in a 0.1sec, 0.2sec, 0.4sec, 0.8sec, 1.6sec, 3.2sec, 6.4sec, 12.8sec, 25.6sec,
// 30sec, 30sec, ... backoff. With such backoff, we have ~95% traffic reduction (0.05 QPS) after 1min,
// and ~98% traffic reduction (0.02 QPS) after ~1.5min.
I guess I don't fully understand this math - after 1 min, we will be sending 1 request per 30s, so ~0.03 QPS.
The argument about traffic reduction was comparing to the current state (where we're sending a request every 1s), so I think this % reduction doesn't make much sense in the code comment.
Seems like I made a bad estimation, and the jitter would make the exact math even more complicated here. For the comment, I will keep it simple:
we used to make the call every 1sec; the goal is a ~98% traffic reduction (~0.02 QPS) by stopping at a 30sec backoff with 1.0 jitter (random between 30sec and 60sec).
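(Worked out: with a 30s cap and a 1.0 jitter factor, each capped wait is uniform in [30s, 60s), about 45s on average, i.e. roughly 1/45 ≈ 0.022 QPS per lister - about a 98% reduction from the old 1 QPS. Even the worst case of exactly 30s is ~0.033 QPS, still a ~97% reduction.)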
// We used to make the call every 1sec (1 QPS), the goal here is to achieve ~98% traffic reduction when
// API server is not healthy. With these parameters, backoff will stop at [30,60) sec interval which is
// 0.22 QPS.
backoffManager: wait.NewExponentialBackoffManager(100*time.Millisecond, 30*time.Second, 2*time.Minute, 2.0, 1.0, realClock),
hm, that 100ms initial (vs 1s) might make the first couple seconds worse than before-- what do you think about putting that at 800ms?
800ms can help in the API server crash case, but would make reconnecting less agile after a normal connection break (in production at Pinterest we are seeing that normal re-connects, i.e. connections closed by the API server, are more common at scale).
100ms is more likely to hurt in the API server crash case (e.g. OOM killed: when it gets restarted, all API calls can squeeze in), but since there is exponential backoff, the effect is just to (sort of) delay the recovery of the API server by tens of seconds to minutes.
But I'm OK with making it 800ms - [800ms, 1600ms) with jitter - which is no worse than the original behavior.
If your clients are reconnecting often enough that this 1s is a problem, something else is wrong, you have a watch cache configuration problem or etcd size problem or something.
I think 800ms is a safe improvement over 1s (that is, it's mostly compensated by the jitter you added).
Sounds good. Let's go with 800ms. I updated the PR.
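(With that change, the call in the diff above presumably becomes wait.NewExponentialBackoffManager(800*time.Millisecond, 30*time.Second, 2*time.Minute, 2.0, 1.0, realClock), so the first retry lands somewhere in [800ms, 1600ms) once the 1.0 jitter factor is applied.)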
Agree with Daniel - the current version looks fine to me. /ok-to-test
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wojtek-t, zhan849
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest
/retest
/retest
@@ -168,12 +172,13 @@ var internalPackages = []string{"client-go/tools/cache/"}
// objects and subsequent deltas.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	klog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
	wait.Until(func() {
	klog.V(2).Infof("Starting reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
Was it intentional that the log level got bumped up from 3 to 2?
I just updated to client-go from 1.21.1 and these messages started to pop up quite a lot in my client (an e2e test suite).
I don't think it really was - we should fix that and backport.
Would you be able to send a PR for it (happy to approve)?
See #102788
I guess that was my intention. In practice we found that reflector startup is an important debugging point, and it only prints once when the reflector is initially created, so it won't be too verbose.
@pohly if this log line is polluting your log file, does your test create a lot of reflectors?
Yes. I'm running an e2e.test suite where essentially each test is independent and therefore instantiates its own reflectors.
Would this log help debug then? I.e. IMO it makes sense for v2 log or even v1 log to include critical time points such as component start / finish etc. in general.
@wojtek-t thoughts?
Would this log help debug then?
No. For me it is just additional noise. That's why I investigated.
I.e. IMO it makes sense for v2 log or even v1 log to include critical time points such as component start / finish etc. in general.
But client-go as a library cannot make assumptions about what constitutes component start/finish. In particular one cannot assume that this code is only called at the start and end of a program run. E2E testing is one counter example, but there might very well be others.
A library that gets embedded in command line tools should be very cautious about logging at a level that is visible by default. I would not expect any output from client-go at V < 3 unless that output is relevant for the end-user of that command line tool.
I don't think it really was - we should fix that and backport.
Here's the backport: #102800
Would this log help debug then? I.e. IMO it makes sense for v2 log or even v1 log to include critical time points such as component start / finish etc. in general.
@wojtek-t thoughts?
Reflector starts may happen not only on component start/finish. Any time watch is not able to resume it may also happen. If you really want those logs, you can always configure verbosity per file. I don't think it should be v=2 by default.
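(For example, with klog's per-file override you could run with something like -v=2 --vmodule=reflector=3 once the messages are back at V(3), to surface them for reflector.go alone without raising global verbosity; the exact flag spelling assumes the standard klog flags.)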
kubernetes#87795 most likely unintentionally increased the log level of "Starting reflector" and "Stopping reflector", with the result that since Kubernetes 1.21 clients have printed that message by default. This is undesirable, we should use the original level 3 again.
kubernetes/kubernetes#87795 most likely unintentionally increased the log level of "Starting reflector" and "Stopping reflector", with the result that since Kubernetes 1.21 clients have printed that message by default. This is undesirable, we should use the original level 3 again. Kubernetes-commit: fd972934e4916879b04508686302659ce82cfa75
What type of PR is this?
/kind bug

What this PR does / why we need it:
Currently the reflector retries ListWatch with a fixed 1sec backoff, which can cause a global QPS spike during an API server crash (e.g. OOMKilled). This PR adds a jittered exponential backoff to the ListWatch retry; the backoff has a cap, and resets after not being backed off for a while. A condensed sketch of the resulting control flow appears at the end of this description.
Which issue(s) this PR fixes:
Fixes #87794
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @wojtek-t
/cc @liggitt @smarterclayton
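For readers skimming the thread, here is a condensed sketch of the control flow after this PR, pieced together from the diffs quoted above - not a verbatim copy of the merged reflector.go, and note that the log level ended up back at V(3) per the follow-up PRs referenced earlier.

```go
// Sketch only: Run retries ListAndWatch under the jittered exponential backoff
// manager instead of the old fixed 1s wait.Until loop.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	klog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
	wait.BackoffUntil(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			utilruntime.HandleError(err)
		}
	}, r.backoffManager, true, stopCh)
	klog.V(3).Infof("Stopping reflector %v (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
}
```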