716945 - Figure out why Flash hang volume dropped very significantly in 10.0a1 trunk on 2011-10-29

Reporter

Description

•

13 years ago

Not sure what's the best place to file this in, as this is just about an investigation of something good that happened. It would be very good to know how we fixed significant problems for our users, so that we understand the mechanisms of Flash hangs better. Looking at https://crash-analysis.mozilla.com/rkaiser/firefox.10.0a1.flashsummary.html you'll see that on 2011-10-29 the volume of Flash hangs dropped significantly, in the following days to less than 50% of what it was before (on that day, we can assume that a lot of nightlies from the last day were still in use, so the full extent is only visible the day after, but the day the drop started is probably the Nightly that had the fix). This was not coupled with a drop in ADUs or so (we only saw that later when Nightly changed to 12.0a1, where volume drops again, of course). This trend continued with the switch of Aurora and now Beta from 9 to 10 - comparing https://crash-analysis.mozilla.com/rkaiser/firefox.10.0.flashsummary.html to when https://crash-analysis.mozilla.com/rkaiser/firefox.9.0.flashsummary.html tracked Beta has our Flash hangs dropping by almost a factor 10, which is incredibly significant (on Beta, we're usually seeing a lot more intense Flash usage than on Aurora and Nightly, so the effect getting stronger could be somewhat expected, but it's really a huge difference between 3000-3600 on 9 Beta to 300-450 on 10 Beta). The assumption is that something in http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2011-10-28&enddate=2011-10-29+04%3A00%3A00 probably caused this tremendous drop in Flash hang volume, but it would be really good to know what exactly we did to make things so much better for users - or have at least a really good educated guess on what it was.

Benjamin Smedberg

Comment 1

•

13 years ago

Was this seen on both Windows and Mac or just Windows? Are we sure that this is a real improvement, or perhaps we're just failing to submit hang reports? I've seen a lot of people seeing "Failed to submit crash report" from plugin reports... Was there any corresponding drop in plugin crash (non-hang) reports? I see that bug 665196 landed on the 27th, so I strongly suspect that the UI change is the root cause here. Either a bug is causing us not to submit the report, or people are misreading the UI and more of them are choosing not to submit a report.

Component: IPC → General

QA Contact: ipc → general

Robert Kaiser

Reporter

Comment 2

•

13 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #1) > Was this seen on both Windows and Mac or just Windows? I don't have that breakdown, would need to create it. > Are we sure that this is a real improvement, or perhaps we're just failing > to submit hang reports? That's surely a possibility. > Was there any corresponding drop in plugin crash (non-hang) reports? Actually, there was, and your explanation of bug 665196 sounds somewhat likely. Does that mean we should back this out for 10? We are getting late in Beta with it now...

Benjamin Smedberg

Comment 3

•

13 years ago

I could theorize that the stream we use to submit the report only stays alive as long as the current page is loaded, which means that we cancel the stream as soon as we reload. Now that we're submitting-and-reloading all at once, most users aren't submitting reports. So yeah, we could back out the patch. It's got l10n and UI impact, so it's not a terribly easy thing to back out of release channels.

Blocks: 665196

tracking-firefox10: --- → ?

Alex Keybl [:akeybl]

Updated

•

13 years ago

tracking-firefox10: ? → +

tracking-firefox11: --- → +

Felix Fung (:felix)

Assignee

Comment 4

•

13 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #3) > I could theorize that the stream we use to submit the report only stays > alive as long as the current page is loaded, which means that we cancel the > stream as soon as we reload. Now that we're submitting-and-reloading all at > once, most users aren't submitting reports. I'm looking into whether that's the case and whether there is a relatively simple fix...

Robert Kaiser

Reporter

Comment 5

•

13 years ago

OK, just as a data point, I can't make out any obvious hang signature pairs we're missing, but it's a bit hard to tell. I think we should for now assume that we're not missing any specific hangs but that we're just getting fewer reports overall (but keep in mind that this is an assumption). What's interesting is that while in 10.0a1 Nightly data, the drop in hangs and crashes was similar at about a factor 3 (but hard to tell exactly on the low level we have in Nightly), in beta data (9.0 vs. 10.0) OTOH, the drop is way higher in hangs than in crashes - while hangs are down almost by a factor of 10 (~10-15% of 9.0 values), crashes are down only slightly by at most 1/3 or 1/4 from what I can estimate from the numbers (~65-80% of 9.0 values).

Verdi [:verdi] Best to slack me

Comment 6

•

13 years ago

FWIW, out of about 6 flash crashes with the new UI I have yet to have a successful crash report submission. Like David, I get the failed submission message http://www.flickr.com/photos/djst/6635768441

[:Cww]

Comment 7

•

13 years ago

Talked this through with verdi: One additional scenario is that the new UI tells you to reload to fix the problem. However, if you don't click the reload button in the toolbar or error message and instead use the regular reload button then the crash report doesn't submit even though the UI seems to imply that it will.

Benjamin Smedberg

Comment 8

•

13 years ago

I have a backout prepared for Aurora, see try results here: https://tbpl.mozilla.org/?tree=Try&rev=0a073bc8c072 Felix/dolske, unless you have a real fix in-hand, I think we should go ahead and land this on Aurora, do you agree?

[Baboo]

Comment 9

•

13 years ago

Not sure if related but I had recently a Flash-related bug report that wouldn't submit: https://crash-stats.mozilla.com/report/index/bp-1a3aa2e0-a92b-473e-bf6a-fbc542120121

Robert Kaiser

Reporter

Comment 10

•

13 years ago

I posted some thinking I had on this on IRC, I guess this should make it here as well for preservation: <bsmedberg> so I think that somehow we're timing out or doing something else funky with the XHR doing the submission <ted> seems odd, since we're using the same code as about:crashes <KaiRo> ted: well, just thinking about the difference between about:crashes and the plugin crash reporter, in the former we navigate to a different URL, while in the latter we reload (the parent) - maybe the latter is more destructive than the former and we get destroyed before we can finish sending - after all, at navigation something stays around in the bfcache while reload doesn't do that, right? <ted> KaiRo: certainly possible <KaiRo> ted: I'm only theorizing without any detailed code knowledge of course I have enough overview knowledge to make educated guesses there, I hope, but I surely don't know anything of how the code itself works, so I leave it up to those who do to dig deeper into this (if they think it's valuable) or not (if it's clearly a wrong idea).

Benjamin Smedberg

Comment 11

•

13 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/2878c9b2f02e https://hg.mozilla.org/releases/mozilla-aurora/rev/03ff575be222 https://hg.mozilla.org/releases/mozilla-aurora/rev/18d1d5407dbe Fixed on Aurora. We decided not to fix this on Beta for 10. I have not pushed the backout to -central, so I'm marking tracking for 12 so that we either fix or perform the backout again on the next aurora.

status-firefox10: --- → wontfix

status-firefox11: --- → fixed

tracking-firefox12: --- → +

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 12

•

13 years ago

I think those aurora l10n changes are going to lead to problems, without additional outreach.

Axel Hecht [:Pike]

Comment 13

•

13 years ago

To quote from my previous answer in email: For the l10n impact, I suggest that we keep the the new string, add back the old string, and add a fallback to content that picks up the new string if the old one's not there. Something like <!ENTITY report.please "&report.checkbox;"> either before or after the l10n file, in pluginProblem.xml. If you get the order right, the double definition just gets dropped, and otherwise we end up with the slightly odd string, but still a localized one.

Benjamin Smedberg

Comment 14

•

13 years ago

I believed that advice was for the beta channel only; why can't we or localizers just retrieve the previous string from hg blame?

Axel Hecht [:Pike]

Comment 15

•

13 years ago

Aurora is string frozen, and the advice was for any string frozen tree. I don't see any reason to debate string freezes, surely not in this bug, so I'll just ignore the "why" question.

Axel Hecht [:Pike]

Comment 16

•

13 years ago

What happens about this on central? Like, backing this out again on aurora for 12 shouldn't happen.

Benjamin Smedberg

Comment 17

•

13 years ago

Backing this out again on Aurora is the plan unless Felix comes up with the correct fix.

Assignee: nobody → felix.the.cheshire.cat

Alex Keybl [:akeybl]

Comment 18

•

13 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #17) > Backing this out again on Aurora is the plan unless Felix comes up with the > correct fix. In order to avoid the l10n impact on Aurora, we should back out of m-c as well. We can re-land bug 665196 on FF13 once we have a fix in hand.

Robert Kaiser

Reporter

Updated

•

13 years ago

Blocks: 722394

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 19

•

13 years ago

Attached patch backout of bug 665196 (deleted) — Details — Splinter Review

This is a backout patch that matches the three changesets in comment 11, minus the change to nsExceptionHandler (which is harmless to leave as-is). Want to make sure I got this right before backing it out on trunk.

Attachment #592897 - Flags: feedback?(dolske)

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

13 years ago

Attachment #592897 - Flags: feedback?(dolske)

Robert Kaiser

Reporter

Comment 20

•

13 years ago

bsmedberg, gavin, akeybl, apparently we missed to do this on trunk before the uplift to aurora, can we get this backout done on aurora at least very early? What should we do about trunk?

Alex Keybl [:akeybl]

Comment 21

•

13 years ago

I believe we should land https://bugzilla.mozilla.org/show_bug.cgi?id=716945#c11 on Aurora and https://bugzilla.mozilla.org/show_bug.cgi?id=716945#c19 on m-c.

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 22

•

13 years ago

The patch in comment 19 is just a combined version of the changes in comment 14, minus an unrelated change that doesn't need to be backed out. They're the same from an l10n perspective - we should use comment 19's patch for any backouts.

Robert Kaiser

Reporter

Comment 23

•

13 years ago

So, can we please get the backout landed at least on Aurora 12 ASP? Maybe we can even land it on Nightly 13 so that we get better data there and can detect any regressions or improvements better, until we find a real fix to the problem?

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 24

•

13 years ago

Comment on attachment 592897 [details] [diff] [review] backout of bug 665196 [Approval Request Comment] String changes made by this patch: one string change

Attachment #592897 - Flags: approval-mozilla-aurora?

Alex Keybl [:akeybl]

Comment 25

•

13 years ago

Comment on attachment 592897 [details] [diff] [review] backout of bug 665196 [Triage Comment] Approved for Aurora 12 - thanks Gavin.

Attachment #592897 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Axel Hecht [:Pike]

Comment 26

•

13 years ago

This is still not backed out on Aurora, nor on central. Gavin, can you own that?

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 27

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/2087f423a1a4 https://hg.mozilla.org/releases/mozilla-aurora/rev/baeb3c39c477

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

13 years ago

status-firefox12: --- → fixed

Ed Morley [:emorley]

Comment 28

•

13 years ago

(In reply to Gavin Sharp (use gavin@gavinsharp.com for email) from comment #27) > https://hg.mozilla.org/integration/mozilla-inbound/rev/2087f423a1a4 https://hg.mozilla.org/mozilla-central/rev/2087f423a1a4

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla13

u279076

Comment 29

•

13 years ago

Is there anything here for QA to verify?

Whiteboard: [qa?]

Serge Gautherie (:sgautherie)

Updated

•

13 years ago

Blocks: 723970

Mark Banner (:standard8)

Updated

•

13 years ago

Blocks: 725898

u279076

Updated

•

13 years ago

Whiteboard: [qa?] → [qa-]