Closed
Bug 823304
Opened 12 years ago
Closed 9 years ago
Fuzz backoff interval on failure
Categories
(Firefox Health Report Graveyard :: Client: Desktop, defect, P4)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gps, Unassigned)
References
Details
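Operational consideration: fuzz (randomize) the backoff interval applied after a failure, to help mitigate a thundering herd of clients retrying at the same moment.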
policy.jsm:989
Comment 1 (Reporter)•12 years ago
We may want to consider expanding the scope of this bug to cover broader backoff-related issues:
* An HTTP 500 is treated as a hard failure, causing an immediate 24h backoff.
* Are 15 minutes and 60 minutes the correct backoff intervals?
Comment 2•12 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> We may want to consider expanding scope of this to cover larger
> backoff-related issues:
>
> * 500 is a hard failure causing immediate 24h backoff.
I don't think there should be a hard failure this severe. If the server returns 503+Retry-After, we should respect that; otherwise, we should treat it like any other error.
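A minimal sketch of what "respect 503+Retry-After, otherwise treat it like any other error" could look like on the client. The helper names `retryAfterToMS` and `delayForResponse`, and the `fallbackMS` parameter, are hypothetical and are not taken from policy.jsm. The sketch assumes the header arrives as a string in one of the two forms HTTP allows for Retry-After: a number of seconds or an HTTP-date.

```javascript
// Hypothetical helper: convert a Retry-After header value into a delay in ms.
// Returns null when the header is absent or unparseable, so the caller can
// fall back to its normal error backoff.
function retryAfterToMS(headerValue, nowMS = Date.now()) {
  if (headerValue == null) {
    return null;
  }
  const value = String(headerValue).trim();
  // Form 1: delay in seconds, a non-negative integer.
  if (/^\d+$/.test(value)) {
    return parseInt(value, 10) * 1000;
  }
  // Form 2: an HTTP-date; wait until that moment (never a negative delay).
  const dateMS = Date.parse(value);
  if (Number.isNaN(dateMS)) {
    return null;
  }
  return Math.max(0, dateMS - nowMS);
}

// Hypothetical policy: honor 503 + Retry-After; every other failure,
// including a plain 500, gets the ordinary fallback backoff.
function delayForResponse(status, retryAfter, fallbackMS, nowMS = Date.now()) {
  if (status === 503) {
    const serverMS = retryAfterToMS(retryAfter, nowMS);
    if (serverMS !== null) {
      return serverMS;
    }
  }
  return fallbackMS;
}
```

Under these assumptions, a 503 without a usable Retry-After header falls through to the same backoff as any other error.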
> * Are 15 minutes and 60 minutes the correct backoff intervals.
I think we should use ~30 minutes as an initial base, combined with fuzzing and progressive backoff. If we merely spike and overload the infra, clients will slow down gradually until the system recovers. If there's a serious issue, I'd expect Ops to deal with it more explicitly, but our dual goals here are "don't DoS the infra" and "collect as much data as we can", so I think we want to be aggressive at first.
Here's what I'd want here:
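function computeBackoffMS(failureCount) {
  const base = 20 * 60 * 1000; // 20m
  const maxBI = 24 * 60 * 60 * 1000; // 24h
  // Linear backoff plus up to one base interval of random fuzz.
  const backoffMS = base * failureCount + Math.floor(Math.random() * base);
  return Math.min(backoffMS, maxBI);
}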
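To make the effect of the proposal above concrete, here is a self-contained sketch of the same formula together with the range of delays it can produce. The constants come from the snippet above; the function names `fuzzedBackoffMS` and `backoffBoundsMS` and the injectable `random` parameter are assumptions added for illustration and testability, not part of the proposal.

```javascript
const BASE_MS = 20 * 60 * 1000;             // 20 minutes, as proposed above
const MAX_BACKOFF_MS = 24 * 60 * 60 * 1000; // 24 hours, as proposed above

// Linear backoff plus a random fuzz in [0, BASE_MS), capped at 24 hours.
// `random` is injectable (an assumption) and must return a value in [0, 1).
function fuzzedBackoffMS(failureCount, random = Math.random) {
  const backoffMS = BASE_MS * failureCount + Math.floor(random() * BASE_MS);
  return Math.min(backoffMS, MAX_BACKOFF_MS);
}

// Smallest and largest delay (inclusive) possible for a given failure count.
function backoffBoundsMS(failureCount) {
  return {
    min: Math.min(BASE_MS * failureCount, MAX_BACKOFF_MS),
    max: Math.min(BASE_MS * (failureCount + 1) - 1, MAX_BACKOFF_MS),
  };
}
```

With these values, the first failure waits between 20 and just under 40 minutes (about 30 on average, matching the "~30 minutes" target), the second between 40 and just under 60, and the 24h cap is reached once failureCount hits 72.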
Comment 3•12 years ago
(In reply to Mike Connor [:mconnor] from comment #2)
> > * 500 is a hard failure causing immediate 24h backoff.
>
> I don't think there should be a hard failure this severe. If the server
> returns 503+Retry-After we should respect that, otherwise we should treat it
> like any other error.
I agree… if someone gets paged when a production service returns a 500. 500 means "someone screwed up", with consequences of unknown severity.
If nobody gets paged, then clients should act as if that 500 just brought down a cluster (which it might well have done), and should retreat to the nearest bar as fast as possible.
Note that this can be resolved by having a LB turn all 500s into 503s with long backoffs, of course.
Or phrased differently: sure, there shouldn't be a hard failure this severe. What happens when there is?
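The load balancer idea mentioned above (turning every 500 into a 503 with a long backoff) could be sketched roughly as follows. This is only an illustration: the response shape (`status`, `headers`), the function name `rewriteServerError`, and the 24-hour value are assumptions, and a real load balancer would express this in its own configuration language rather than in JavaScript.

```javascript
// Assumed "long backoff": 24 hours, expressed in seconds for Retry-After.
const LONG_BACKOFF_SECONDS = 24 * 60 * 60;

// Hypothetical rewrite step: a 500 becomes a 503 carrying an explicit
// Retry-After, so clients that honor 503+Retry-After back off for a long
// time. All other responses pass through unchanged.
function rewriteServerError(response) {
  if (response.status !== 500) {
    return response;
  }
  return {
    ...response,
    status: 503,
    headers: {
      ...response.headers,
      "Retry-After": String(LONG_BACKOFF_SECONDS),
    },
  };
}
```

The point of this arrangement is that the severity decision moves to the server side: the client only needs to honor Retry-After, and Ops can choose how hard clients should retreat.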
Updated•12 years ago
Priority: -- → P4
Updated (Reporter)•12 years ago
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Comment 4•9 years ago
Won't fix, because FHR has been removed: https://bugzilla.mozilla.org/show_bug.cgi?id=1209088
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Product: Firefox Health Report → Firefox Health Report Graveyard