We may want to consider expanding scope of this to cover larger backoff-related issues:

* 500 is a hard failure causing immediate 24h backoff.
* Are 15 minutes and 60 minutes the correct backoff intervals.

Mike Connor [:mconnor]

Comment 2

12 years ago

(In reply to Gregory Szorc [:gps] from comment #1)
> We may want to consider expanding scope of this to cover larger
> backoff-related issues:
>
> * 500 is a hard failure causing immediate 24h backoff.

I don't think there should be a hard failure this severe. If the server returns 503+Retry-After we should respect that, otherwise we should treat it like any other error.

> * Are 15 minutes and 60 minutes the correct backoff intervals.

I think we should use ~30 minutes as an initial base with a combination of fuzzing and progressive backoff. If we just spike and overload infra, we'll slow down gradually until the system recovers. If there's a serious issue I'd expect Ops to deal with it more explicitly, but our dual goals here are "don't DoS the infra" and "collect as much data as we can" so I think we want to be aggressive at first.

Here's what I'd want here:

let base = 20 * 60 * 1000; // 20m
let maxBI = 24 * 60 * 60 * 1000; // 24h
let backoffMS = base * failureCount + Math.floor(Math.random() * base);
return Math.min(backoffMS, maxBI);
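For readers who want to try the numbers, the snippet in comment 2 can be wrapped in a function roughly as follows. This is only a sketch: the function name `computeBackoffMs` and the injectable `random` parameter are hypothetical additions for testability, and it assumes `failureCount` is 1 on the first failure.

```javascript
// Constants taken from comment 2.
const BASE_MS = 20 * 60 * 1000;             // 20 minutes
const MAX_BACKOFF_MS = 24 * 60 * 60 * 1000; // 24 hours

// Hypothetical wrapper around the proposed formula.
// `random` defaults to Math.random and can be replaced in tests.
function computeBackoffMs(failureCount, random = Math.random) {
  // Fuzz: a random offset in [0, BASE_MS) so clients do not retry in lockstep.
  const fuzzMs = Math.floor(random() * BASE_MS);
  // Progressive part: grows linearly with the number of consecutive failures.
  const backoffMs = BASE_MS * failureCount + fuzzMs;
  // Never wait longer than the cap.
  return Math.min(backoffMs, MAX_BACKOFF_MS);
}
```

Under that assumption the first retry lands between 20 and 40 minutes after the failure, which averages out to the "~30 minutes" mentioned above, and the 24h cap is only reached after roughly 70 consecutive failures.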

Richard Newman [:rnewman]

Comment 3

12 years ago

(In reply to Mike Connor [:mconnor] from comment #2)
> > * 500 is a hard failure causing immediate 24h backoff.
>
> I don't think there should be a hard failure this severe. If the server
> returns 503+Retry-After we should respect that, otherwise we should treat it
> like any other error.

I agree… if someone gets paged when a production service returns a 500.

500 means "someone screwed up", with consequences of unknown severity. If nobody gets paged, then clients should act as if that 500 just brought down a cluster (which it might well have done), and should retreat to the nearest bar as fast as possible.

Note that this can be resolved by having a LB turn all 500s into 503s with long backoffs, of course.

Or phrased differently: sure, there shouldn't be a hard failure this severe. What happens when there is?
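The two positions in comments 2 and 3 can be laid side by side in code. The sketch below is illustrative only and is not the FHR client's implementation: `parseRetryAfterMs`, `serverErrorDelayMs`, and the `hard500` option are hypothetical names. It relies on the HTTP rule that `Retry-After` carries either a number of seconds or an HTTP date.

```javascript
const HARD_FAILURE_MS = 24 * 60 * 60 * 1000; // the 24h hard-failure backoff under discussion

// Convert a Retry-After header value to milliseconds, or null if absent/unparseable.
function parseRetryAfterMs(value, nowMs = Date.now()) {
  if (value == null) return null;
  const trimmed = String(value).trim();
  // Form 1: delay in seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Pick a delay after a server error.
// hard500 = true models comment 3 (treat an unexplained 500 as severe);
// hard500 = false models comment 2 (treat 500 like any other error).
function serverErrorDelayMs(status, retryAfter, defaultDelayMs,
                            { hard500 = false, nowMs = Date.now() } = {}) {
  if (status === 503) {
    // Both comments agree: respect 503 + Retry-After when the server sends it.
    const retryMs = parseRetryAfterMs(retryAfter, nowMs);
    if (retryMs !== null) return retryMs;
  }
  if (status === 500 && hard500) return HARD_FAILURE_MS;
  return defaultDelayMs;
}
```

With a load balancer rewriting 500s into 503s with a long `Retry-After`, as suggested above, the `hard500` branch would never be needed on the client.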

Richard Newman [:rnewman]

Updated

12 years ago

Priority: -- → P4

Gregory Szorc [:gps]

Reporter

Updated

12 years ago

Component: Metrics and Firefox Health Report → Client: Desktop

Product: Mozilla Services → Firefox Health Report

Thomas Huelbert

Comment 4

9 years ago

won't fix based on FHR removal - https://bugzilla.mozilla.org/show_bug.cgi?id=1209088

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

6 years ago

Product: Firefox Health Report → Firefox Health Report Graveyard

Fuzz backoff interval on failure

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P4)

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)
