Closed Bug 1140037 Opened 9 years ago Closed 9 years ago

Telemetry submission rate spikes every hour

Categories

(Toolkit :: Telemetry, defect)

39 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox38 --- unaffected
firefox39 + fixed
firefox40 + fixed
firefox41 --- fixed

People

(Reporter: mreid, Assigned: gfritzsche)

References

Details

Attachments

(6 files, 4 obsolete files)

(deleted), image/png
Details
(deleted), image/png
Details
(deleted), patch
Dexter
: review+
vladan
: feedback+
Details | Diff | Splinter Review
(deleted), patch
Dexter
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
vladan
: review+
Details | Diff | Splinter Review
Attached image nightly_submission_rates.png (deleted) —
There appears to be a spike just about every hour. Per discussion with Vladan on IRC, this may be due to the "daily" ping being sent at midnight local time.  If that's the case, we should fuzz the submission time a bit so submissions don't bunch up right on the hour.
Blocks: 1069869
Oops, I cut off the y-axis scale.  It is "# Requests per minute".
If the "daily" ping is indeed the culprit here, then we can expect to see these hourly pings from all browsers that are open at midnight local time. There is a spike for each timezone.

From the small amount of data we have so far, it appears that the biggest spike is about 7% of the users. I'm not sure how user behaviour differs by channel, but the current data is for nightly only.

Based on this, how much do we need to "fuzz" this submission time in order to keep the servers happy?

What impact will this have on anomaly detection / alerting?
Flags: needinfo?(whd)
(In reply to Mark Reid [:mreid] from comment #2)
> Based on this, how much do we need to "fuzz" this submission time in order
> to keep the servers happy?

I'm going to get a second opinion from people more experienced with ELB behavior about this, but the answer may be "a lot". From what I know ELBs are not designed to handle spikes in traffic (they scale up gradually) and requests will probably wind up 503ing at the ELB layer. Typically this is solved by requesting AWS "pre-warm" the ELBs for spikes, but in our case it is a series of rather predictable spikes that are expected every day (the usual cases are swapping in a new ELB and other "one-time" events, see https://aws.amazon.com/articles/1636185810492479#pre-warming).

> What impact will this have on anomaly detection / alerting?

Simple anomaly detection in stackdriver can filter out spikes easily by requiring a spike to last longer than a certain amount of time before alerting. I assume the same can be achieved in heka by picking a larger window for the standard anomaly detection algorithms. I'll be experimenting with that this week.
Flags: needinfo?(whd)
Depends on bug 1133536 for the scheduler implementation.
Depends on: 1133536
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
Would spreading the submissions out over 1 hour be enough?
Flags: needinfo?(whd)
Attached patch Fuzz ping submission times between 12AM and 1AM (obsolete) (deleted) — Splinter Review
Note that i found the logic in _isDailyPingDue() a little hard to follow, so i took the liberty of refactoring this a bit.

For ease of reasoning this now makes pings due between 12AM and 1AM, unless they are overdue (or coalesced with some other ping).
Attachment #8581839 - Flags: review?(vdjeric)
Comment on attachment 8581839 [details] [diff] [review]
Fuzz ping submission times between 12AM and 1AM

Review of attachment 8581839 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should separate the generation of midnight pings from the daily sending of pings in this bug. 
The daily pings should be generated close to midnight with some small tolerance, e.g. +/- 5 minutes.
The sending of pings can be spread out throughout the day, or at idle, or within an hour of midnight, etc

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +47,5 @@
>  
>  const ENVIRONMENT_CHANGE_LISTENER = "TelemetrySession::onEnvironmentChange";
>  
>  const SEC_IN_ONE_DAY  = 24 * 60 * 60;
>  const MS_IN_ONE_DAY   = SEC_IN_ONE_DAY * 1000;

Can you remove these 2 consts? i don't think we use SEC_IN_ONE_DAY or MS_IN_ONE_DAY since bug 1133536

@@ +482,5 @@
> +      this._log.trace("_isDailyPingDue - already sent one today");
> +      return false;
> +    }
> +
> +    if (!nearestMidnight) {

This logic is not correct. To determine if a ping is overdue, we need to 1) check that we did not send a ping for the previous midnight and 2) check that we didn't already send an overdue ping today.
Since the "tolerance" allows pings to be sent both earlier or later than midnight, it's not enough to check this._lastDailyPingTime >= todayDate.getTime() to satisfy condition #1.
Attachment #8581839 - Flags: review?(vdjeric)
Attached image fuzz.png (deleted) —
I've attached some graphs for various fuzzing intervals. The greatest spike (16:00 UTC) corresponds to a little more than half the traffic for the hour interval in which it is included. With an hour of fuzz we no more than double the amount of baseline traffic for the hour. Per http://aws.amazon.com/articles/1636185810492479 AWS recommends spikes no more than 50% over a five minute interval, so somewhere around two hours of fuzz is optimal. However, as we also have smaller hourly spikes prior to the biggest one, an hour of fuzz is probably fine.

If submission time and generation time end up being decoupled, these fuzzing considerations still apply.
Flags: needinfo?(whd)
I don't understand comment 8. These pings are going to hit at midnight each time zone, and so I don't understand why spreading the load over more than an hour would help. Note that current nightly is heavily skewed towards NA, so we can't really use that to extrapolate global load behavior.
Flags: needinfo?(whd)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> I think we should separate the generation of midnight pings from the daily
> sending of pings in this bug. 
> The daily pings should be generated close to midnight with some small
> tolerance, e.g. +/- 5 minutes.
> The sending of pings can be spread out throughout the day, or at idle, or
> within an hour of midnight, etc

Good point, i'm currently moving over the "daily due" logic changes to bug 1139751, which makes also makes it much clearer to review.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> I don't understand comment 8. These pings are going to hit at midnight each
> time zone, and so I don't understand why spreading the load over more than
> an hour would help. Note that current nightly is heavily skewed towards NA,
> so we can't really use that to extrapolate global load behavior.

We have a spike periodicity of 1 hour (not strictly true e.g. India), and generation periodicity of 24 hours. Currently we have generation spikes and submission spikes that occur at the same time, but are planning on fuzzing the submission spikes by the spike periodicity.

If all submission spikes are of equal size, fuzzing over the the spike periodicity is sufficient because each spike is then distributed over the hour, and spikes occur once an hour. Server load will therefore never spike.

Likewise, if we fuzz a ping's submission time over the generation periodicity (24h), there shouldn't be detectable submission spikes, regardless of the spike periodicity and the size of the generation spikes.

Since the submission spikes we observe are of varying size and we are considering fuzz intervals shorter than the generation periodicity, the change in load will be noticeable on the server side for the timezones where we have more users (see fuzz.png attached). As long as we spike no more than 50% over five minutes, ELBs should scale without issue.

The analysis in comment 8 was skewed to the NA population being the largest spike, preceded by a series of smaller spikes. In the "safest" analysis, we assume there are no previous spikes, and distribute the load for the largest generation spike across the interval required to make the base QPS stair-step increase be no more than 50%. Using nightly data, this would be roughly 2 hours.

At any rate, this can be solved on the server side by telling AWS to keep the ELBs warmed (see https://mana.mozilla.org/wiki/display/SVCOPS/AWS+Pre-warming+ELBs), so one hour of fuzz is fine.
Flags: needinfo?(whd)
Attached patch Fuzz ping submission times around midnight (obsolete) (deleted) — Splinter Review
Finally managed to nail down some test issues here - the removal of test_noServerPing() fixes stray pings showing up when we happen to trigger sending of persisted pings somehow.
See bug 1147436 for a real test for that, the current one wasnt actually really doing anything.

This patch is part of this stack: https://pastebin.mozilla.org/8827100
Attachment #8581839 - Attachment is obsolete: true
Attachment #8583136 - Flags: review?(vdjeric)
Comment on attachment 8583136 [details] [diff] [review]
Fuzz ping submission times around midnight

Review of attachment 8583136 [details] [diff] [review]:
-----------------------------------------------------------------

Wouldn't it be a natural fit to have the TelemetryScheduler also schedule the sending of pings?

::: toolkit/components/telemetry/TelemetryPing.jsm
@@ +46,5 @@
>  // Timeout after which we consider a ping submission failed.
>  const PING_SUBMIT_TIMEOUT_MS = 2 * 60 * 1000;
>  
> +// We treat pings before midnight as happening "at midnight" with this tolerance.
> +const MIDNIGHT_TOLERANCE_MS = 15 * 60 * 1000;

we shouldn't duplicate the same constant across JSMs, reference this constant from TelemetrySession or vice-versa

@@ +127,5 @@
>  
> +/**
> + * Takes a date and returns it trunctated to a date with daily precision.
> + */
> +function truncateToDays(date) {

we have 2 copies of this function already, maybe we should add a TimeDateUtils.jsm to telemetry/

@@ +436,5 @@
>          }
>        }, error => this._log.error("addPendingPing - Unable to add the pending ping", error));
>    },
>  
> +  _getNextPingSendTime: function(now) {

add comment headers

@@ +440,5 @@
> +  _getNextPingSendTime: function(now) {
> +    const todayDate = truncateToDays(now);
> +    const tomorrowDate = tomorrow(todayDate);
> +    const nextMidnightRangeStart = tomorrowDate.getTime() - MIDNIGHT_TOLERANCE_MS;
> +    const currentMidnightRangeEnd = todayDate.getTime() - MIDNIGHT_TOLERANCE_MS + Policy.midnightPingFuzzingDelay();

shouldn't midnight_tolerance_ms be added to  (instead of subtracted from) today.getTime() for currentMidnightRangeEnd?

@@ +450,5 @@
> +    if (now.getTime() >= nextMidnightRangeStart) {
> +      return nextMidnightRangeStart + Policy.midnightPingFuzzingDelay();
> +    }
> +
> +    return now.getTime();

if we're not close to last midnight or next midnight, we send immediately?
Attachment #8583136 - Flags: review?(vdjeric)
Flags: needinfo?(gfritzsche)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13)
> Comment on attachment 8583136 [details] [diff] [review]
> Fuzz ping submission times around midnight
> 
> Review of attachment 8583136 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Wouldn't it be a natural fit to have the TelemetryScheduler also schedule
> the sending of pings?

Maybe, i can open a non-urgent follow-up bug on making it a shared scheduler?

> ::: toolkit/components/telemetry/TelemetryPing.jsm
> @@ +46,5 @@
> >  // Timeout after which we consider a ping submission failed.
> >  const PING_SUBMIT_TIMEOUT_MS = 2 * 60 * 1000;
> >  
> > +// We treat pings before midnight as happening "at midnight" with this tolerance.
> > +const MIDNIGHT_TOLERANCE_MS = 15 * 60 * 1000;
> 
> we shouldn't duplicate the same constant across JSMs, reference this
> constant from TelemetrySession or vice-versa

Currently those are two independent logics that we may want to change independently too (unless/until we move to a shared scheduler).

> @@ +127,5 @@
> >  
> > +/**
> > + * Takes a date and returns it trunctated to a date with daily precision.
> > + */
> > +function truncateToDays(date) {
> 
> we have 2 copies of this function already, maybe we should add a
> TimeDateUtils.jsm to telemetry/

Follow-up? Could probably just live in toolkit/modules.
 
> @@ +440,5 @@
> > +  _getNextPingSendTime: function(now) {
> > +    const todayDate = truncateToDays(now);
> > +    const tomorrowDate = tomorrow(todayDate);
> > +    const nextMidnightRangeStart = tomorrowDate.getTime() - MIDNIGHT_TOLERANCE_MS;
> > +    const currentMidnightRangeEnd = todayDate.getTime() - MIDNIGHT_TOLERANCE_MS + Policy.midnightPingFuzzingDelay();
> 
> shouldn't midnight_tolerance_ms be added to  (instead of subtracted from)
> today.getTime() for currentMidnightRangeEnd?

With the changes here i treat this as the range for a |midnight|:
* start: midnight.getTime() - MIDNIGHT_TOLERANCE_MS
* end: start + fuzzingDelay

That's were the tolerance substraction comes from.

> @@ +450,5 @@
> > +    if (now.getTime() >= nextMidnightRangeStart) {
> > +      return nextMidnightRangeStart + Policy.midnightPingFuzzingDelay();
> > +    }
> > +
> > +    return now.getTime();
> 
> if we're not close to last midnight or next midnight, we send immediately?

Yes, same as before, we are just evening out the midnight spike now.
Flags: needinfo?(gfritzsche)
Attached patch Fuzz ping submission times around midnight (obsolete) (deleted) — Splinter Review
Fixed comments.
Attachment #8583136 - Attachment is obsolete: true
Attachment #8584468 - Flags: review?(vdjeric)
Comment on attachment 8584468 [details] [diff] [review]
Fuzz ping submission times around midnight

Review of attachment 8584468 [details] [diff] [review]:
-----------------------------------------------------------------

I don't like how TelemetryPing.send() has become a maybeSend(), the behaviour becomes too opaque. There's already a side-effect of that opacity: environment-changed pings and idle-daily* pings will only be sent at midnight.
There's also a lot of repetition with the scheduling code in TelemetrySession.jsm.

* The idle-daily ping itself can be removed, but i like having the idle-daily event for now, to help clear out persisted pings at random times. It helps decrease ping transmission latency and it's a fallback in case of a bug in our midnight sending code.

I'd prefer that:
- TelemetryPing.send() continue to always send a ping
- TelemetryScheduler continue to drive the sending of pings
- the 1hr midnight send period should begin after midnight + SCHEDULER_MIDNIGHT_TOLERANCE_MS
- we need protections against the computer going to sleep, e.g. users who put their computer to sleep every night at 22:30 until next morning. Right now, the only defense we have against this is a) idle-daily, and b) OVERDUE_PING_FILE_AGE which forces ping sends on startup but only after they're 2 weeks old!

Can you write a brief design doc that describes how Telemetry sending should behave wrt different ping types and different edge cases? If you plan to work on this over the weekend, I'll be online & helping with reviews/discussion. If not, I think it's ok to do this next week and uplift to Aurora.

::: toolkit/components/telemetry/TelemetryPing.jsm
@@ -432,5 @@
> -      this.sendPersistedPings(),
> -    ];
> -
> -    let promise = Promise.all(p);
> -    this._trackPendingPingTask(promise);

what was this method?
Attachment #8584468 - Flags: review?(vdjeric)
correction: OVERDUE_PING_FILE_AGE is 1 week, not 2
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16)
> I don't like how TelemetryPing.send() has become a maybeSend(), the
> behaviour becomes too opaque. There's already a side-effect of that opacity:
> environment-changed pings and idle-daily* pings will only be sent at
> midnight.

I don't see that, am i missing it? We should only reschedule pings that occur around midnight, not pings that happen at other times.

> There's also a lot of repetition with the scheduling code in
> TelemetrySession.jsm.
> 

> I'd prefer that:
> - TelemetryPing.send() continue to always send a ping
> - TelemetryScheduler continue to drive the sending of pings

Are you suggesting that TelemetrySession decides whether to do TelemetryPing.send(dailyPing) or TelemetryPing.addPendingPing(dailyPing)?
Side-note: We don't want to end up duplicating the persistance code in TelemetrySession either.

> - the 1hr midnight send period should begin after midnight +
> SCHEDULER_MIDNIGHT_TOLERANCE_MS

Why after SCHEDULER_MIDNIGHT_TOLERANCE_MS? Why not start it directly at midnight?

> - we need protections against the computer going to sleep, e.g. users who
> put their computer to sleep every night at 22:30 until next morning. Right
> now, the only defense we have against this is a) idle-daily, and b)
> OVERDUE_PING_FILE_AGE which forces ping sends on startup but only after
> they're 2 weeks old!

Is this something specific about this bug/patch or just a general concern you have that can move to another bug?

> Can you write a brief design doc that describes how Telemetry sending should
> behave wrt different ping types and different edge cases? 

I think we should clear up the above questions first? I'm not quite sure yet what you want to see here.
 
> ::: toolkit/components/telemetry/TelemetryPing.jsm
> @@ -432,5 @@
> > -      this.sendPersistedPings(),
> > -    ];
> > -
> > -    let promise = Promise.all(p);
> > -    this._trackPendingPingTask(promise);
> 
> what was this method?

This tracks pending async ping activity so we can wait for it on shutdown.
Flags: needinfo?(vdjeric)
Blocks: 1120356
No longer blocks: 1069869
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16)
> > I don't like how TelemetryPing.send() has become a maybeSend(), the
> > behaviour becomes too opaque. There's already a side-effect of that opacity:
> > environment-changed pings and idle-daily* pings will only be sent at
> > midnight.
> 
> I don't see that, am i missing it? We should only reschedule pings that
> occur around midnight, not pings that happen at other times.

Oops I mis-interpreted how _getNextPingSendTime() behaves outside midnight and how that affects the  nextPingSendTime > now.getTime() decision

> Are you suggesting that TelemetrySession decides whether to do
> TelemetryPing.send(dailyPing) or TelemetryPing.addPendingPing(dailyPing)?

Yes.

I think my bigger issue is that the relationship between TelemetrySession and TelemetryPing is fuzzy to me. 

Both modules are initialized via TelemetryStartup (suggesting they're peers), but TelemetryScheduler logic lives inside TelemetrySession and only TelemetrySession calls into TelemetryPing (suggesting that TelemetrySession drives TelemetryPing -- using it as a utility library for sending), but TelemetryPing is in charge of deciding whether Telemetry is even enabled and it decides whether pings should be sent based on time criteria (second-guessing TelemetryScheduler) and it's also in charge of restoring state from past sessions. 

So is Session in charge of recording the current session and driving most Telemetry behaviour, and TelemetryPing takes care of all ping operations and decisions? It seems like a bit of an odd model, but maybe I just don't understand it?

I think a half-page design doc would be very helpful. I think we're adding lots of new requirements and the current abstractions are becoming obsolete. This is a good time to take a step back and make sure the design is logical and can be extended for future requirements.

Personally, I'd like to see a design where one module is in charge of driving Telemetry behaviour, i.e. initializing overall Telemetry state + scheduling periodic actions including sending and generating different payload types. It should also handle one-off events like environment changes or first-run installs. The other modules would be passive helpers for the driving module. Error handling would happen in the driving module.

With that said, I actually don't have a specific design in mind right now, but I could come up with a proposal or we can talk about options on Vidyo.

> > - the 1hr midnight send period should begin after midnight +
> > SCHEDULER_MIDNIGHT_TOLERANCE_MS
> 
> Why after SCHEDULER_MIDNIGHT_TOLERANCE_MS? Why not start it directly at
> midnight?

It's really a minor nitpick, not a real issue, but it jumped out at me when reviewing the code. The issue is that the ping generation window and ping sending window overlap.

Most daily pings will be generated at 23:45, but the daily ping generation window is still [23:45, 00:15] so daily pings could be generated at other times. The sending window is [23:45, 0:45]. So if a ping is generated at 00:10 for example, that means its effective sending window is reduced to [0:10, 0:45] making the probability of sending a ping during each minute of the send window slightly non-uniform. It's not a real issue, more of a conceptual nitpick.

> > - we need protections against the computer going to sleep, e.g. users who
> > put their computer to sleep every night at 22:30 until next morning. Right
> > now, the only defense we have against this is a) idle-daily, and b)
> > OVERDUE_PING_FILE_AGE which forces ping sends on startup but only after
> > they're 2 weeks old!
> 
> Is this something specific about this bug/patch or just a general concern
> you have that can move to another bug?

It could be in another bug. However, I think we do need a design doc for how Telemetry behaves and it should address this issue.
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #19)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> > (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16)
> > > I don't like how TelemetryPing.send() has become a maybeSend(), the
> > > behaviour becomes too opaque. There's already a side-effect of that opacity:
> > > environment-changed pings and idle-daily* pings will only be sent at
> > > midnight.
> > 
> > I don't see that, am i missing it? We should only reschedule pings that
> > occur around midnight, not pings that happen at other times.
> 
> Oops I mis-interpreted how _getNextPingSendTime() behaves outside midnight
> and how that affects the  nextPingSendTime > now.getTime() decision
> 
> > Are you suggesting that TelemetrySession decides whether to do
> > TelemetryPing.send(dailyPing) or TelemetryPing.addPendingPing(dailyPing)?
> 
> Yes.
> 
> I think my bigger issue is that the relationship between TelemetrySession
> and TelemetryPing is fuzzy to me. 
> 
> Both modules are initialized via TelemetryStartup (suggesting they're
> peers), but TelemetryScheduler logic lives inside TelemetrySession and only
> TelemetrySession calls into TelemetryPing (suggesting that TelemetrySession
> drives TelemetryPing -- using it as a utility library for sending), but
> TelemetryPing is in charge of deciding whether Telemetry is even enabled and
> it decides whether pings should be sent based on time criteria
> (second-guessing TelemetryScheduler) and it's also in charge of restoring
> state from past sessions. 
> 
> So is Session in charge of recording the current session and driving most
> Telemetry behaviour, and TelemetryPing takes care of all ping operations and
> decisions? It seems like a bit of an odd model, but maybe I just don't
> understand it?

We already talked about that we will make TelemetryPing init TelemetrySession so we have a proper ordered relationship.
* TelemetryPing has become the "main"/general Telemetry JS module, handling ping storage
* TelemetrySession is one user of TelemetryPing and implementing specific ping types, the session data collection etc.

Not all things may be cleanly separated yet after we split TelemetryPing up, but TelemetrySession should not decide about when to send pings or not (e.g. if it uses TelemetryPing.addToPending(dailyPing), it cant know whether TelemetryPing might decide to submit pending pings 10 seconds later).

> Personally, I'd like to see a design where one module is in charge of
> driving Telemetry behaviour, i.e. initializing overall Telemetry state +
> scheduling periodic actions including sending and generating different
> payload types. It should also handle one-off events like environment changes
> or first-run installs. The other modules would be passive helpers for the
> driving module. Error handling would happen in the driving module.

I think we have a relatively clear separation of concerns now, but a common scheduler would blur this again.

> With that said, I actually don't have a specific design in mind right now,
> but I could come up with a proposal or we can talk about options on Vidyo.

We should talk, but we also need this bug finished soon. I'm happy to think about a better common scheduler strategy, but maybe we can do this separately and fix the server load issue now.

> > > - the 1hr midnight send period should begin after midnight +
> > > SCHEDULER_MIDNIGHT_TOLERANCE_MS
> > 
> > Why after SCHEDULER_MIDNIGHT_TOLERANCE_MS? Why not start it directly at
> > midnight?
> 
> It's really a minor nitpick, not a real issue, but it jumped out at me when
> reviewing the code. The issue is that the ping generation window and ping
> sending window overlap.
> 
> Most daily pings will be generated at 23:45, but the daily ping generation
> window is still [23:45, 00:15] so daily pings could be generated at other
> times. The sending window is [23:45, 0:45]. So if a ping is generated at
> 00:10 for example, that means its effective sending window is reduced to
> [0:10, 0:45] making the probability of sending a ping during each minute of
> the send window slightly non-uniform. It's not a real issue, more of a
> conceptual nitpick.

We are worried about submission spikes from midnight pings here.
I think as long as we even out that spike we should be good, i don't think we need to worry about uniformly distributing all the pings in that window.

> > > - we need protections against the computer going to sleep, e.g. users who
> > > put their computer to sleep every night at 22:30 until next morning. Right
> > > now, the only defense we have against this is a) idle-daily, and b)
> > > OVERDUE_PING_FILE_AGE which forces ping sends on startup but only after
> > > they're 2 weeks old!
> > 
> > Is this something specific about this bug/patch or just a general concern
> > you have that can move to another bug?
> 
> It could be in another bug. However, I think we do need a design doc for how
> Telemetry behaves and it should address this issue.

Yes, we haven't gotten to a proper design for Telemetry ping sending behavior yet.
I'm mostly worried about the big scope we are talking about in this bug that should land soon.
We talked about the issues here in a meeting today:
Vladan is planning to sketch out the design for the affected parts until next week, then we can see how that best maps to this and/or new bugs.
[Tracking Requested - why for this release]:

bsmedberg wants us to leave sending the unified pings on for beta if possible, so we need this fixed on 39.
Georg confirmed that this doesn't affect 38. Tracking 39+
This is mostly a rebase of the patch here. AFAIR we agreed on the approach pending the new design. Given that the new design still has a separate scheduler for ping sending, i am adopting this patch directly without further refactoring so uplift is feasible and low-risk.
Attachment #8598110 - Flags: review?(alessio.placitelli)
Attachment #8598110 - Flags: feedback?(vdjeric)
Attachment #8584468 - Attachment is obsolete: true
Vladan, if you have time please have a quick look whether you can live with this landing for now.
The refactor into TelemetrySend etc. is planned in bug 1156359, further ping send redesign per bug 1156712.
Comment on attachment 8598110 [details] [diff] [review]
(Fx 40) Fuzz daily telemetry ping submission times to avoid submission volume spikes around midnight

Review of attachment 8598110 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good!

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +128,5 @@
>  
>  /**
> + * Takes a date and returns it trunctated to a date with daily precision.
> + */
> +function truncateToDays(date) {

Due to the new design, shouldn't we have a module to own this kind of utility functions? I remember those being used in other places as well throughout Telemetry.

::: toolkit/components/telemetry/tests/unit/head.js
@@ +152,4 @@
>  
> +  for (let m of modules) {
> +    m.Policy.now = () => date;
> +  }

This looks much better now!
Attachment #8598110 - Flags: review?(alessio.placitelli) → review+
(In reply to Alessio Placitelli [:Dexter] from comment #27)
> ::: toolkit/components/telemetry/TelemetryController.jsm
> @@ +128,5 @@
> >  
> >  /**
> > + * Takes a date and returns it trunctated to a date with daily precision.
> > + */
> > +function truncateToDays(date) {
> 
> Due to the new design, shouldn't we have a module to own this kind of
> utility functions? I remember those being used in other places as well
> throughout Telemetry.

We don't have that module yet and i'd like to keep the patches for uplift and trunk mostly similar here.
This will be handled in bug 1156361 (or some other bug that touches those and doesn't require uplift).
Attachment #8598110 - Flags: feedback?(vdjeric) → feedback+
https://hg.mozilla.org/mozilla-central/rev/3a35fd270670
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Mark, can you check whether things actually improved?

I still see spikes here: https://pipeline-prototype-cep.prod.mozaws.net/#sandboxes/PrototypeSandbox-mreid_CountByChannel/outputs/PrototypeSandbox-mreid_CountByChannel.nightlyRequests.cbuf
... but i can't tell whether that's from older build ids or not or whether the volume at least dropped over the weekend.
Flags: needinfo?(mreid)
I set up a monitor to check builds before/after this fix landed:
https://pipeline-prototype-cep.prod.mozaws.net/#plugins/filters/PrototypeSandbox-mreid_CountByBuildId

It will take some time to fill in, but we should be able to tell once we've passed a few hour boundaries.
Flags: needinfo?(mreid)
Comparing this, it definitely improved things.
I still see one spike outstanding though, that is probably still spiking too fast, right?
Flags: needinfo?(mreid)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Keywords: leave-open
(In reply to Georg Fritzsche [:gfritzsche] from comment #33)
> Comparing this, it definitely improved things.
> I still see one spike outstanding though, that is probably still spiking too
> fast, right?

It is and we spotted the issue we had (not properly throttling submissions for fuzzing intervals <15min).
Flags: needinfo?(mreid)
Right, that spike still looks too high (~4x or 5x the normal peak).
Attached patch Fix ping fuzzing calculations (deleted) — Splinter Review
This fixes the issue and simplifies the fuzzing calculation a lot. To avoid duplicating even more date utilities, i introduced TelemetryUtils.jsm here as per the redesign doc.
Attachment #8601695 - Flags: review?(alessio.placitelli)
Attachment #8601695 - Flags: feedback?(vdjeric)
Attachment #8598112 - Attachment is obsolete: true
Blocks: 1156361
Comment on attachment 8601695 [details] [diff] [review]
Fix ping fuzzing calculations

Review of attachment 8601695 [details] [diff] [review]:
-----------------------------------------------------------------

It's r=me with the minor documentation details addressed.

Feel free to skip renaming |truncateToDays| and moving the timestamp |truncateToDays| to TelemetryUtils if you think it's not worth doing now.

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +536,3 @@
>      }
>  
>      return now.getTime();

Could you please document this third case as you did for the previous two? When are we supposed to hit this?

::: toolkit/components/telemetry/TelemetryUtils.jsm
@@ +11,5 @@
> +const {classes: Cc, interfaces: Ci, results: Cr, utils: Cu} = Components;
> +
> +this.TelemetryUtils = {
> +  /**
> +   * Takes a date and returns it trunctated to a date with daily precision.

Could you document the input arguments and the return value?

@@ +13,5 @@
> +this.TelemetryUtils = {
> +  /**
> +   * Takes a date and returns it trunctated to a date with daily precision.
> +   */
> +  truncateToDays: function(date) {

What do you think about renaming this to |truncateDateToDays|?

Not really important now, so feel free to skip this, but TelemetryEnvironment.js also defines a similar function which takes a timestamp as argument instead of a Date object. Maybe we should move that too while we're here (also removing it from head.js)?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +1634,1 @@
>    let request = yield gRequestIterator.next();

This test also defines a |truncateDateToDays| function. Maybe we could use the one from TelemetryUtils now?
Attachment #8601695 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8601695 [details] [diff] [review]
Fix ping fuzzing calculations

Review of attachment 8601695 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +54,5 @@
>  // Timeout after which we consider a ping submission failed.
>  const PING_SUBMIT_TIMEOUT_MS = 2 * 60 * 1000;
>  
>  // We treat pings before midnight as happening "at midnight" with this tolerance.
>  const MIDNIGHT_TOLERANCE_MS = 15 * 60 * 1000;

do we still need MIDNIGHT_TOLERANCE_MS for anything?

@@ +56,5 @@
>  
>  // We treat pings before midnight as happening "at midnight" with this tolerance.
>  const MIDNIGHT_TOLERANCE_MS = 15 * 60 * 1000;
> +// For midnight fuzzing we want to affect pings around midnight with this tolerance.
> +const MIDNIGHT_TOLERANCE_FUZZ_MS = 5 * 60 * 1000;

the naming is confusing, MIDNIGHT_FUZZING_INTERVAL_MS + MIDNIGHT_TOLERANCE_FUZZ_MS + MIDNIGHT_FUZZING_DELAY_MS

@@ +517,5 @@
>     *
>     * @param now Date The current time.
>     * @return Number The next time (ms from UNIX epoch) when we can send pings.
>     */
>    _getNextPingSendTime: function(now) {

I'm not 100% clear on the logic

1. First we check if the time is between 11pm and 1am. If it's not, we send immediately
2. If we confirmed the time is indeed between 11pm and 1am in step 1, we then check if the time is also 11:55pm or later. If it's not, we send immediately
3. Finally, if the time is between 11:55pm and 1am, we disallow sending before (midnight + fuzzing delay), which is a random time between 12am-1am (decided at startup)

Is that correct?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +454,5 @@
>        this._log.trace("_isDailyPingDue - already sent one today");
>        return false;
>      }
>  
> +    const nearestMidnight = TelemetryUtils.getNearestMidnight(nowDate, SCHEDULER_MIDNIGHT_TOLERANCE_MS);

If we're going to be calling these utility functions frequently, then TelemetryUtils is too long of a name.

How about calling the module Utils and putting it in resource://gre/modules/telemetry/Utils.jsm? 
Or if you think it's going to be used outside Telemetry, DateTime.jsm?
Attachment #8601695 - Flags: feedback?(vdjeric)
Flags: needinfo?(gfritzsche)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #40)
> @@ +517,5 @@
> >     *
> >     * @param now Date The current time.
> >     * @return Number The next time (ms from UNIX epoch) when we can send pings.
> >     */
> >    _getNextPingSendTime: function(now) {
> 
> I'm not 100% clear on the logic
> 
> 1. First we check if the time is between 11pm and 1am. If it's not, we send
> immediately
> 2. If we confirmed the time is indeed between 11pm and 1am in step 1, we
> then check if the time is also 11:55pm or later. If it's not, we send
> immediately
> 3. Finally, if the time is between 11:55pm and 1am, we disallow sending
> before (midnight + fuzzing delay), which is a random time between 12am-1am
> (decided at startup)
> 
> Is that correct?

Yes. The behavior is basically the same as intended before, only changing from (11:45pm to 0:45am) to (11:55pm to 1:00am).
This was hard to follow and had a bug before, so i tried to opt for the most straight-forward solution i could think of.

Any commenting that you think would help? Or do you think we need to improve something here?

> ::: toolkit/components/telemetry/TelemetrySession.jsm
> @@ +454,5 @@
> >        this._log.trace("_isDailyPingDue - already sent one today");
> >        return false;
> >      }
> >  
> > +    const nearestMidnight = TelemetryUtils.getNearestMidnight(nowDate, SCHEDULER_MIDNIGHT_TOLERANCE_MS);
> 
> If we're going to be calling these utility functions frequently, then
> TelemetryUtils is too long of a name.
> 
> How about calling the module Utils and putting it in
> resource://gre/modules/telemetry/Utils.jsm? 

Hm, we could do that or just alias it locally if needed (as say TU or Utils).
But i'm torn here. Utils.jsm is very generic if you see it in stack traces etc.

> Or if you think it's going to be used outside Telemetry, DateTime.jsm?

If there is a use-case outside Telemetry, i think we can always start moving date/time functions to DateTime.jsm or so later. Right now we want to collect all the shared helper functions there per the re-design doc.
Flags: needinfo?(gfritzsche)
Flags: needinfo?(vdjeric)
Comment on attachment 8602062 [details] [diff] [review]
(Fx 39) Fuzz daily telemetry ping submission times to avoid submission volume spikes around midnight

Approval Request Comment
[Feature/regressing bug #]: Telemetry unification.
[User impact if declined]: Telemetry server gets hit by big spikes in ping submission volume.
[Describe test coverage new/current, TreeHerder]: Automation, tracking actual via [0].
[Risks and why]: Low-risk of not fixing the issue, other impacts are unlikely.
[String/UUID change made/needed]: None.

Sadly we don’t have too much data yet in [0], but things look much better so far (no big spikes visible yet).

The risk here is that too many clients submit at the same time (local midnight).
To fix that i want to proceed with uplifting the patch here so it is in before beta.

In case we still see issues:
* mreid confirms that we will survive spikes and can migitate
* clients will retry sending later if the server was overloaded
* (would like to avoid) worst-case we can land a patch to disable the offending pings
Attachment #8602062 - Flags: approval-mozilla-aurora?
Comment on attachment 8602062 [details] [diff] [review]
(Fx 39) Fuzz daily telemetry ping submission times to avoid submission volume spikes around midnight

Approved for uplift to aurora. There is some risk but there are plans to mitigate potential issues.
Attachment #8602062 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Georg Fritzsche [:gfritzsche] from comment #42)
> Any commenting that you think would help? Or do you think we need to improve
> something here?

It's a little funky that you check for A < x < B and then check A + c < x, usually you would check A + c < x < B directly. I would just comment with the list of checks I wrote

> Hm, we could do that or just alias it locally if needed (as say TU or Utils).
> But i'm torn here. Utils.jsm is very generic if you see it in stack traces
> etc.

Aliasing locally is fine by me, as is making it resource://gre/modules/telemetry/Utils.jsm
Flags: needinfo?(vdjeric)
Depends on: 1164007
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #46)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #42)
> > Any commenting that you think would help? Or do you think we need to improve
> > something here?
> 
> It's a little funky that you check for A < x < B and then check A + c < x,
> usually you would check A + c < x < B directly. I would just comment with
> the list of checks I wrote

Ok.

> > Hm, we could do that or just alias it locally if needed (as say TU or Utils).
> > But i'm torn here. Utils.jsm is very generic if you see it in stack traces
> > etc.
> 
> Aliasing locally is fine by me, as is making it
> resource://gre/modules/telemetry/Utils.jsm

Alright, i'll leave it then.
I had bad experiences before with finding and quick-opening a utils.js, we also already have a few utils.jsm in the tree.
How about this?
Attachment #8607332 - Flags: review?(vdjeric)
Comment on attachment 8607332 [details] [diff] [review]
Improve midnight fuzzing readability and comments

Review of attachment 8607332 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +553,1 @@
>      const midnight = TelemetryUtils.getNearestMidnight(now, MIDNIGHT_FUZZING_INTERVAL_MS);

You're not aliasing TelemetryUtils locally after all?
Attachment #8607332 - Flags: review?(vdjeric) → review+
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #49)
> ::: toolkit/components/telemetry/TelemetryController.jsm
> @@ +553,1 @@
> >      const midnight = TelemetryUtils.getNearestMidnight(now, MIDNIGHT_FUZZING_INTERVAL_MS);
> 
> You're not aliasing TelemetryUtils locally after all?

I thought we were talking generally. I'll alias it in TelemetryController then.
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/a4ad694b4bd8
Status: ASSIGNED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla40 → mozilla41
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: