Bug 1449662 (Closed)
Opened 7 years ago; Closed 7 years ago
Investigate increase in number of tab switch spinners on March 21st, 2018 Nightly build
Categories: Firefox :: Tabbed Browser, defect, P2
Status: RESOLVED WORKSFORME
People: Reporter: mconley, Unassigned
References: Blocks 1 open bug
From this graph: https://mikeconley.github.io/bug1310250/
https://screenshots.firefox.com/sPq0nh9LqtosT1bg/mikeconley.github.io
I see a dip on March 13th in the number of users seeing tab switch spinners. This is because we enabled tab warming.
On March 21st, that dip effectively reverses itself despite the fact that warming is still enabled.
Here are the changes in that range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=025426a3e2e780d33bb666c78bfbaf22b3456c19&tochange=68464218a41a5350ffa926e1b5d77fde75bc4e01
None of these jump out at me as potential culprits, unfortunately. :/
Getting this on file to do due diligence to try to find out what happened here.
Comment 1 • 7 years ago
What's the source of https://mikeconley.github.io/bug1310250/ ? I know your changes to this don't quite line up, but does it for any reason reference the FX_TAB_SWITCH_REQUEST_TAB_WARMING_STATE probe? (Grasping at straws here)
Reporter
Comment 2 • 7 years ago
The graph is powered by an Airflow job here:
https://github.com/mozilla/python_mozetl/tree/master/mozetl/tab_spinner
The change to the probe landed on March 21st - I'm not sure if it made it into any of the March 21st Nightly builds (since we now spin it twice a day). If so, it's possible that my changeset window is too small and should factor in the second Nightly spin.
Comment 3 • 7 years ago
It's particularly strange that the 50ms - 99ms and 100ms - 199ms buckets remain unaffected.
Reporter
Comment 4 • 7 years ago
One thing I've noticed is that I seem to be able to get spinner histogram entries when selecting lazy-inserted browsers after a session restore. I don't know if that was happening before.
Maybe this _does_ have to do with the Telemetry stuff we did? Perhaps the lazy getTabState stuff is changing state in a funky way, increasing the probability of spinners? Looking...
Reporter
Comment 5 • 7 years ago
Interestingly, looking at http://arewesmoothyet.com/?mode=track&trackedStat=All%20Hangs, I see a dramatic shift in the Content ForcePaint hangs...
Presumably, that was because of bug 1443329 and co, which made our BHR data a lot more useful - but the shift appears to start on March 21st, which is the same day of the spike in the tab switch spinner graph. I find that suspicious - particularly because bug 1443329 landed outside of the window of commits I posted in comment 0.
Which brings me to this question: if bug 1443329 landed in a March 20th build instead of a March 21st build (which I must assume, given that bug 1443329 landed on March 20th and isn't in the commit range), how come the BHR dashboard is showing the shift starting on the 21st?
Any ideas, dthayer?
Flags: needinfo?(dothayer)
Reporter
Comment 6 • 7 years ago
(I'm reminded a bit of https://chuttenblog.wordpress.com/2016/11/29/privileged-to-be-a-mozillian/, though in this case, since I _believe_ we're using buildIDs as our X-axes, it's not a problem)
Comment 7 • 7 years ago
I'm fairly confident that the changes are due to bug 1443688 and bug 1443329, and that we just have a strange off-by-one error somewhere. Bug 1443688 landed on the 19th, and should result in the same number of hangs, but a higher representation of lower-duration hangs. This is what we see happening on the 20th in the graph (roughly - there's a spike up in total hangs which I'm not sure I understand). Then, bug 1443329 should change the total number of hangs and the category they fall into, and that's what we see happening on the 21st in the graph.
Because _both_ of these bugs line up with the graph one day after they should, I think they are responsible for the changes, and somehow our date reasoning is wrong. Might be as simple as a bug in how I build the list of date labels - I'll take a look. (And yeah, we are using buildIDs, so that shouldn't be the problem.)
Flags: needinfo?(dothayer)
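To make the off-by-one hypothesis concrete, here is a hypothetical sketch (none of this is the actual dashboard code; the function and variable names are invented for illustration) of how bucketing hangs by buildID date can drift from a separately generated list of x-axis date labels:

```python
from datetime import date, timedelta

def build_date(build_id: str) -> date:
    # Nightly buildIDs encode the build timestamp as YYYYMMDDhhmmss,
    # e.g. "20180321100326" is a March 21st, 2018 build.
    return date(int(build_id[:4]), int(build_id[4:6]), int(build_id[6:8]))

def date_labels(start: date, n_days: int) -> list:
    # Hypothetical label generator for the graph's x-axis. An off-by-one
    # here (e.g. range(1, n_days + 1), or starting a day late) would shift
    # every point one day later than its buildID actually says.
    return [start + timedelta(days=i) for i in range(n_days)]

# Hypothetical per-build hang counts, bucketed by buildID date.
hangs_per_build = {"20180320100000": 10, "20180321100000": 25}
buckets: dict = {}
for build_id, count in hangs_per_build.items():
    day = build_date(build_id)
    buckets[day] = buckets.get(day, 0) + count
```

If the bucket keys and the generated labels agree, the graph lines up with buildIDs; a mismatch between the two would produce exactly the one-day lag described above.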
Comment 8 • 7 years ago
Huh. It's not the dashboard, because the raw data shows the same pattern, and it's not the data collection, because there's a pretty straightforward path in there from application/buildId to the output. So, this is fairly mysterious :/
Updated • 7 years ago
Priority: -- → P2
Updated • 7 years ago
Blocks: async-tab-switcher
Reporter
Comment 9 • 7 years ago
While digging at this, something was brought up in #telemetry that I wanted to get down:
We normally ship two Nightlies per day. When a regression lands in the second Nightly of a day, its Telemetry signal has to compete with the first Nightly of that day, so any true signal usually only shows up the day after.
So I think we should extend the changeset range a bit to include the second Nightly from the day before:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=68464218a41a5350ffa926e1b5d77fde75bc4e01
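The widened window is just the same pushlog query with an earlier fromchange. As a small sketch (pushlog_url is an invented helper; the URL format and changeset hashes come from the links in this bug):

```python
def pushlog_url(fromchange: str, tochange: str) -> str:
    # hg.mozilla.org pushlog query, in the same form as the links above.
    base = "https://hg.mozilla.org/mozilla-central/pushloghtml"
    return f"{base}?fromchange={fromchange}&tochange={tochange}"

# Start the range at the previous day's second Nightly instead of the
# first build of the day, so both daily Nightly spins are covered.
url = pushlog_url(
    "bfb7edfd0436db388bb9e103b8ad817fc50bfdcf",
    "68464218a41a5350ffa926e1b5d77fde75bc4e01",
)
```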
Reporter
Comment 10 • 7 years ago
Just following up from a conversation I just had with dthayer - when you get a chance Doug, would you mind filing a bug and posting a patch to disable BHR for Content ForcePaint that I can review? We can then land it and see what impact it has on spinners this week.
Flags: needinfo?(dothayer)
Updated • 7 years ago
Flags: needinfo?(dothayer)
Reporter
Comment 11 • 7 years ago
Wow, amazing - I think we found our culprit:
https://screenshotscdn.firefoxusercontent.com/images/a9d05b84-1408-4a5a-9d2e-857c1fdd605d.png
I did not expect that. I'm reasonably certain this is the culprit because the regressions in the buckets in the Short Severity data are essentially reversed:
https://screenshotscdn.firefoxusercontent.com/images/4e67645d-8d7a-411b-b14e-924f521b9014.png
So I guess the question now is what to do.
A few choices come to mind:
1. We could ensure that only Nightly users with BHR enabled hit this. Our graphs would then always carry this little bit of extra deficit due to the BHR stuff, but release should never be impacted. This is probably the easiest thing to do, but it means noise in our data.
2. We figure out why BHR is causing this regression and fix it if possible. This is probably the harder thing to do.
3. We stop collecting BHR using Content ForcePaint, and instead find some way of annotating BHR stacks on the main thread any time a tab switch is in progress. This might move the problem instead of fixing it, but wanted to get the idea down.
Hey dthayer, can you think of any reason why bug 1443329 might be involved here? Could it be that the act of collecting a stack just has strangely consistent time costs, resulting in that weirdly specific bucketing of regressions? I guess I'm just looking for any insights you have here before I try making a call on how to proceed.
Blocks: 1443329
Flags: needinfo?(dothayer)
Comment 12 • 7 years ago
Honestly I'm at a loss here for what could be causing this. It seems like whatever it is would have to have at least a multi-millisecond impact, which is a staggering amount of time relative to what we're doing.
Since we were already using BHR before these changes, and should have been cutting stacks during this window anyway, I want to wait at least a day or two more to see if that graph stabilizes _below_ pre-bug-1443329 levels. Right now it's dipped under our baseline, suggesting BHR proper really is the culprit. However, if it stabilizes _at_ pre-bug-1443329 levels, that makes me think it's a little more likely that something we're doing in our usage of BHR is to blame.
The funny thing with this is that we're already collecting these stacks as regular content process BHR hangs. We could conceivably reduce the work we're adding by finding some way to reuse that, if that's really what's hurting us here.
Flags: needinfo?(dothayer)
Reporter
Comment 13 • 7 years ago
Alright, so we have a cause for the spike (which we've backed out), despite not having a super-firm grasp on why the regression took place.
So bug 1452766 took care of this by disabling tab switching BHR. I think we should attempt to migrate away from the special Content ForcePaint BHR thread monitor and use the annotation system instead to try to get the same information albeit in a (hopefully) less spin-tastic fashion. So let's keep bug 1452766 in the tree, and I'll file a new bug to migrate us over to a new BHR mechanism for tab switches.
Marking WFM since the investigation has concluded and the fix actually occurred in bug 1452766.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME