Closed Bug 996848 Opened 10 years ago Closed 10 years ago

Rewrite nsAppShell native event handling in light of what's been learned at bug 959281

Categories

(Core :: Widget: Cocoa, defect)

All
macOS
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla32

People

(Reporter: smichaud, Assigned: smichaud)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 4 obsolete files)

As the summary says, this bug is spun off from bug 959281.

We discovered there that sending too many native events to Gecko while in nsAppShell::ProcessNextNativeEvent() can slow down (i.e. starve) the processing of nsIRunnable events in nsThread::ProcessNextEvent().  This can also (probably) cause Gecko to skip some of these events, or to process them in the wrong order.

As best we can tell, neither of these problems arises if we only send native events to Gecko (only call -[NSApp sendEvent:]) from nsAppShell::ProcessNextNativeEvent() when aMayWait is true.

For more information see bug 959281 comment #98 and bug 959281 comment #100.

I've posted a "bandaid patch" at bug 959281 that fixes the bug (and allows the patch for bug 953435 to be relanded).  But I've learned a lot since I wrote (most of) the nsAppShell native event handling code, both by figuring out bug 959281 and by getting better at reverse engineering OS X's own event handling.  So I'd like to take the chance to substantially rewrite the native event handling code in nsAppShell.mm and nsAppShell.h.  I suspect I'll able to radically simplify it (and remove a lot of my own earlier patches).

Since my "radical simplification" is riskier than the bandaid patch for bug 959281, I'll pursue it separately.  It's likely that we'll be able to get the bandaid patch (together with the restored patch for bug 953435) into a release fairly quickly.  But my "radical simplification" patch will need to ride the trains, and to get onto them as early as possible.  I currently hope to land it soon after the 32 branch is created.
Assignee: nobody → smichaud
Status: NEW → ASSIGNED
Note that the patch for this bug should make the patch for bug 953435 redundant.

So the patch for bug 953435 should be backed out when this patch lands.
Attached patch "Radical simplification" patch (obsolete) (deleted) — Splinter Review
Here's my latest revision of the patch.

I've started a set of tryserver builds here:
https://tbpl.mozilla.org/?tree=Try&rev=66fbc7fada2c
> I've started a set of tryserver builds here:
> https://tbpl.mozilla.org/?tree=Try&rev=66fbc7fada2c

That set seems to have crapped out, and there seems to be something wrong with hg.mozilla.org (I get "abort: HTTP Error 500: Internal Server Error" when I try "hg pull -u").  I guess I'll try again tomorrow.
I've started another set of tryserver builds for this patch:
https://tbpl.mozilla.org/?tree=Try&rev=47fcb1ea0b71
Attachment #8407217 - Attachment is obsolete: true
The second set of tryserver builds also crapped out -- most of the tests failed with crashes on exit.

I have a hunch what the problem might be.  I'll revise my patch and do another set.  Then if the failures go away I'll post my revision.
My hunch was correct.  I needed to restore a very old hack that previous versions of the patch got rid of -- the sending of fake events to ensure the native event loop stays awake while FF is shutting down.

Here's the set of tryserver builds:
https://tbpl.mozilla.org/?tree=Try&rev=c58d9ee1895
Attachment #8407578 - Attachment is obsolete: true
Blocks: 1004648
Attached patch Rev2 patch updated to current trunk (obsolete) (deleted) — Splinter Review
Here's an update to my rev2 patch that removes my bandaid patch for bug 959281.  The patch for bug 953435 hasn't relanded yet (and at this point presumably never will) -- if it had, I'd also have to back it out.

I've started another tryserver build:
https://tbpl.mozilla.org/?tree=Try&rev=aec5fa8a4997
Attachment #8407885 - Attachment is obsolete: true
Attachment #8416731 - Flags: review?(spohl.mozilla.bugs)
Oops, I discovered a possible problem:  Two of the test failures in the tryserver build from comment #6 are repeated in the tryserver build from comment #7.

I'll look into this, but please do go ahead with your review, Stephen.
Blocks: 1006507
(Following up comment #8)

Neither test "failure" is really a failure.  In both cases the test, as written, expects a number of assertions on the Mac.  The "failure" is that, with this patch, there are fewer assertions than expected (in one case none at all).  So the SimpleTest.expectAssertions(n) statement in each test will need to be revised.  I'll post a patch that does this.  It can land at the same time as the patch for this bug.
Comment on attachment 8416731 [details] [diff] [review]
Rev2 patch updated to current trunk

Review of attachment 8416731 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me! I've also had this patch applied since yesterday and didn't run into any issues. I've verified that all dependent bugs of bug 996848 and bug 959281 now work correctly (the ones with a test case, anyway). I've verified this on both 10.9.2 and 10.6.8.

::: widget/cocoa/nsAppShell.mm
@@ +601,5 @@
>      }
>    } while (mRunningEventLoop);
>  
> +  if (eventProcessed) {
> +    moreEvents = 

nit: whitespace at end of line
Attachment #8416731 - Flags: review?(spohl.mozilla.bugs) → review+
Comment on attachment 8416731 [details] [diff] [review]
Rev2 patch updated to current trunk

Thanks!  I'll fix the whitespace problem.

But I won't land this until I've figured out what to do about the test "failures".  As you pointed out (on IRC), sometimes we actually *want* to see an assertion (if that's what we're testing for).  And it may also be possible that the tests in question behave differently in debug and non-debug builds (since the "failures" only happen with debug builds).
I've started a couple of tryserver builds with the relevant expectAssertions() statements commented out.

One contains just this change (doesn't have the patch for this bug):
https://tbpl.mozilla.org/?tree=Try&rev=5ea89c291763

One contains this change plus the patch for this bug:
https://tbpl.mozilla.org/?tree=Try&rev=c7f5229c8d4d

This should allow us to see which assertions don't happen with my patch with these tests.
Attached patch Fix rev3 (deleted) — Splinter Review
Along the way to figuring out the test failures, I found a bunch of dead code (made obsolete by my patch) that I hadn't yet removed -- the entire nsCocoaAppModalWindowList class in nsAppShell.h/mm.

This patch gets rid of it.  It also fixes the whitespace problem that Stephen noticed.
Attachment #8416731 - Attachment is obsolete: true
Attachment #8418348 - Flags: review?(spohl.mozilla.bugs)
Attached patch Fix test failures (deleted) — Splinter Review
The calls to SimpleTest.expectAssertions() in the failing tests were originally added by the patch for bug 404077.  So the assertions they "expect" are unwanted ones that need to be worked around, not ones the tests are *supposed* to trigger.

Also, as you can tell from the test results from comment #12, the "missing" assertions are from the dead code removed by my previous patch:

###!!! ASSERTION: PopCocoa() called without matching call to PushCocoa()!

So their absence is exactly what you'd expect.  This patch changes the tests to take that into account.
Attachment #8418359 - Flags: review?(spohl.mozilla.bugs)
Comment on attachment 8418348 [details] [diff] [review]
Fix rev3

This patch is amazing. Among all the other good it does, it also completely fixes bug 606937, from what I can tell, and improves the testcase there from 33.5 seconds to 3.8 seconds on my machine.
(In reply to comment #15)

That's good to hear, but a bit puzzling.

At bug 959281 I could easily detect delays in processing nsIRunnable "events", and the event starvation that caused it -- but only with the patch for bug 953435 landed (which was subsequently backed out).  The patch for bug 953435 made us spend (relatively) much more time in nsAppShell::ProcessNextNativeEvent() (and consequently in OnProcessNextEvent() called on observers from nsThread::ProcessNextEvent()).  But I didn't see this behavior without the patch for bug 953435.
I should run the testcase for bug 606937 in a profiler, with and without my patch for this bug.  Will the built-in Gecko Profiler do?  Is that what you've been using?
Just now I've been using Instruments, but the built-in Gecko profiler should work, too.

(In reply to Steven Michaud from comment #16)
> (In reply to comment #15)
> At bug 959281 I could easily detect delays in processing nsIRunnable
> "events", and the event starvation that caused it

But this delay came from different ordering of Gecko vs native events - Gecko events weren't getting processed because we were doing other processing things. But in bug 606937, there's nothing useful happening instead; there's only pure event loop overhead. And your patch in this bug seems to remove that overhead.

To be clear, I didn't apply your patches locally. What I compared was a regular nightly and your try build at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-c7f5229c8d4d/try-macosx64/firefox-32.0a1.en-US.mac.dmg . I may be confused about which patches I actually tested.
Just now I ran the testcase from bug 606937 with today's m-c nightly and the 10.8 opt tryserver build from the second tryserver run from comment #12.  They were both made today, using the same build configuration -- so they should be a good basis for comparison.  (I didn't run a profiler.  I ran the tests several times, but the results were pretty consistent.)

My tryserver build finished processing all the events a bit better than twice as fast as today's m-c nightly.  I tested on a MacPro (a big, beefy desktop with 24GB of RAM) running OS X 10.8.5.  Next I'll test on my Retina MacBook Pro.
Attachment #8418348 - Flags: review?(spohl.mozilla.bugs) → review+
Attachment #8418359 - Flags: review?(spohl.mozilla.bugs) → review+
(Following up comment #19)

I see a *huge* difference in performance with the bug 606937 testcase when I have the Gecko Profiler installed and running -- ten times or more (similar to what Markus reported).

I've also found the explanation:  Without my patch we spend *much* more time in nsAppShell::OnProcessNextEvent() -- ~75% as opposed to ~12%.  (After filtering on nsThread::ProcessNextEvent.)  So this is a case of nsIRunnable event starvation, very much like that of bug 959281.

The testcase for bug 606937 is surely an edge case.  I'm sure 95+% of web browsing doesn't trigger such bad behavior in builds without my patch.  But it's good to know that my patch can deal with this edge case, just as it can with the bug 959281 edge case and with all the weird hangs it fixes.

I'm sure the reason is ultimately that, with my patch, we do much less work in nsAppShell::ProcessNextNativeEvent() when aMayWait is false, and also never spin the native event loop.  Instead this work gets done in [NSApp run] (called from nsAppShell::Run()).  aMayWait is generally only true when a Gecko modal dialog is displayed.
A lot of these failures are in the very first two of the tests in the reftest-sanity test list.  I can't begin to understand how those could be connected to my patches, and I strongly suspect that they aren't.

But I'll keep digging to see what I can find.
Needless to say, I haven't been able to reproduce these failures locally, on either a (non-Retina) Mac Pro or a Retina MacBook Pro.
You can retrigger individual tests without doing a new push, by selecting the test on tbpl and clicking the blue plus in in the bottom part. I've done that for the 10.8 reftests on https://tbpl.mozilla.org/?tree=Try&rev=6592b4e1e9f8 a few times.
Oh, the 10.8 opt reftest on https://tbpl.mozilla.org/?tree=Try&rev=b7e6c834f047 already shows the failure.
(In reply to comment #28)

Yes, that would be more efficient.  But I was also looking for other possible intermittent failures.

As there don't appear to be any, my next tryserver build will just do the reftests (if I can figure out how to do that).  Regardless of result, I'll restart them five or six times, to see what happens.
More tryserver test runs, limited to reftests and to OS X 10.8 (where the failures have been occurring).  I'll repeat the reftests several times for each run.

https://tbpl.mozilla.org/?tree=Try&rev=89d36767696d
https://tbpl.mozilla.org/?tree=Try&rev=0ffdf702b05f

Frankly I'm hoping that the failures will disappear at some point :-)  When/if they do I'll resubmit my patches.

I'd appreciate advice on how to reproduce spurious reftest failures locally -- *any* spurious reftest failure.  Once I can do that, I may be able to figure out why they're happening (whether or not my patches make them happen more often).
Stephen and Markus have been digging into the reftest failures.  Markus suggested the proposed patch for bug 1005964 as a workaround.  I've started another set of tryserver tests:

https://tbpl.mozilla.org/?tree=Try&rev=b9c824438b06
(In reply to Steven Michaud from comment #31)
> I'd appreciate advice on how to reproduce spurious reftest failures locally
> -- *any* spurious reftest failure.

Total drive-by comment: have you tried using chaos mode to reproduce the failure locally? http://robert.ocallahan.org/2014/03/introducing-chaos-mode.html
I think mstange said that he tried chaos mode locally and wasn't able to reproduce. It can't hurt to try again though.
Stephen and Markus found that the reftest failures triggered by my patch happen when one image (of the pair being compared) has an overlay scrollbar displayed while the other doesn't.

Because whether (and when) an overlay scrollbar is displayed is inherently unpredictable, they should be turned off while reftests are running.  There's a bug (bug 986404, pointed out to me by Stephen) where a patch has been landed to "disable fade-out of overlay scrollbars during reftest".  But either that patch doesn't work properly when my patches for this bug are on the tree, or it doesn't go far enough.  I'd think overlay scrollbars should be disabled *completely*:  It's not just their fadeout that's unpredictable -- so is the timing of their appearance.
I've found another way to reproduce these reftest failures locally, without using chaos mode:

I have a MacPro, which by default (since it has a mouse connected) doesn't use overlay scrollbars.  But when I change the "Show scroll bars" setting from "Automatically based on mouse or trackpad" to "When scrolling", I can reproduce similar failures on my MacPro.  I don't yet know why I don't see them on my Retina MacBook Pro (where overlay scrollbars are enabled by default, since it has a trackpad and no mouse attached).

I don't see exactly the same failures as happen on the tryservers.  But I always see at least one case of a reftest failing because one image has an overlay scrollbar and the other doesn't.  I've just been testing with the reftest-sanity reftests:

./mach reftest layout/reftests/reftest-sanity/reftest.list

As Stephen kindly pointed out to me, you can load the encoded images from the logs into the Firefox URL bar.  Just be sure you include the "data:image/png;base64," at the beginning of each encoded image.
It's close to the end of my day, and I have a long weekend coming up.  I'll be working on this, but it's unlikely I'll be able to come up with a solution before sometime next week.
(Following up comment #32)

I can see from my local tests that this doesn't help.  The proposed patch for bug 1005964 doesn't fix the reftest failures.
(Following up comment #36)

> I don't yet know why I don't see them on my Retina MacBook Pro
> (where overlay scrollbars are enabled by default, since it has a
> trackpad and no mouse attached).

It was because I'd changed the system setting and forgotten about it.
Once I change "Show scroll bars" back to "Automatically based on input
device", I can also reproduce these reftest failures on my MacBook
Pro.  They also happen when it's set to "is scrolling".  

So I can (and always do) reproduce these failures locally when
Firefox's default setting is to use overlay scrollbars (because of how
the system "Show scroll bars" setting is set).
(In reply to Steven Michaud from comment #35)
> Stephen and Markus found that the reftest failures triggered by my patch
> happen when one image (of the pair being compared) has an overlay scrollbar
> displayed while the other doesn't.

We were actually seeing permanently visible scrollbars, not overlay scrollbars. I'm not sure what the system preferences on our 10.8 machines are set to. But if the system preferences say to display overlay scrollbars, we should never be seeing permanently visible scrollbars in the first place...
The 10.8 machines are set to always show scrollbars due to failures I believe.
Oh, good to know. I thought with bug 986404 fixed we might be using overlay scrollbars. But at least we're expected to see permanently visible scrollbars on our test machines, which makes these failures a little less daunting. Just got to figure out why they're visible, and why they're so tiny. Missing a reflow due to a resize would have been a great explanation, but that doesn't seem to be the case since the try builds from comment 32 still show these failures...
I'm definitely seeing overlay scrollbars in my local tests.

Do you have any idea, Timothy, why bug 986404 doesn't stop that from happening?
Not sure. What happens at http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp#2210 ? ie the value of LookAndFeel::GetInt(LookAndFeel::eIntID_UseOverlayScrollbars)
From memory bug 986404 shouldn't affect 10.8 on tbpl because we disable the overlay scrollbar via the system pref. It only affected b2g and 10.9 machines where the system pref allows overlay scrollbars.
> It only affected b2g and 10.9 machines where the system pref allows overlay scrollbars.

Thanks.  I'll look into it more carefully next week.

> What happens at http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp#2210?

I'll check next week.
Attached image about:blank with tiny scrollbars (deleted) —
(In reply to Steven Michaud from comment #36)
> I don't see exactly the same failures as happen on the tryservers.  But I
> always see at least one case of a reftest failing because one image has an
> overlay scrollbar and the other doesn't.

Those failures are definitely worth fixing (tracked by bug 947089) but unfortunately have nothing to do with this bug.

What Stephen and I were seeing was that about:blank had (non-overlay) scrollbars, instead of being completely blank. Many reftests are supposed to render nothing, i.e. should have all-white output. For that purpose they are compared to a snapshot of about:blank. Since that about:blank snapshot was *not* completely white, but instead showed tiny scrollbars in the top left corner, those tests would fail.

The first occurrence of this problem is very simple:
https://tbpl.mozilla.org/php/getParsedLog.php?id=39248811&tree=Try#error0
data:text/html,<body> is blank (as expected) while about:blank has scrollbars (see attachment).
Here's an observation: If I change my scrollbar system setting to "overlay scrollbars only", open a new tab, navigate to about:blank, and change the scrollbar system setting to "non-overlay scrollbars only", about:blank starts looking exactly like the bad snapshot: mostly white, but with tiny non-overlay scrollbars in the top left corner.

So my hypothesis for now is the following: Reftests on 10.8 start out with overlay scrollbars enabled, then something triggers a change of the system pref and we update our lookandfeel setting to disable overlay scrollbars. Depending on whether this switch happens before or after the reftest framework loaded about:blank, we either get a bad about:blank snapshot or a good one. And the patch in this bug may influence the ordering of the native event that notifies us of the system pref change vs the Gecko event that starts the reflow of the about:blank document.

I'm going to start a try build in order to test this hypothesis.
(In reply to Markus Stange [:mstange] from comment #48)
> Here's an observation: If I change my scrollbar system setting to "overlay
> scrollbars only", open a new tab, navigate to about:blank, and change the
> scrollbar system setting to "non-overlay scrollbars only", about:blank
> starts looking exactly like the bad snapshot: mostly white, but with tiny
> non-overlay scrollbars in the top left corner.

I filed bug 1008197 for this. I think that by reverting bug 994861 we should be able to land the patches here.
That try build looks good and bug 1008197 landed. Hopefully we are good to go now.
Agreed. Let's land this again.
Keywords: checkin-needed
I relanded the two changed sets that landed in comment 20 and 21. Hopefully there have been no changes since then.

https://hg.mozilla.org/integration/mozilla-inbound/rev/d8eb7fe67883
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac7542838773
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d8eb7fe67883
https://hg.mozilla.org/mozilla-central/rev/ac7542838773
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Depends on: 1008978
Blocks: 606937
Depends on: 1013852
Marking for verification in Firefox 32, based on discussion from bug 959281.
Keywords: verifyme
Reproduced with Nightly 2014-01-09 on Mac OS X 10.9.4.
Verified as fixed with Firefox 32 beta 8 (Build ID: 20140818191513) with STR from bug 959281 comment 113:  http://people.mozilla.org/~stmichaud/paperjs.org/examples/voronoi/ testcase works smoothly. Also, couldn't reproduce bug 1016200 on the same beta build.
Status: RESOLVED → VERIFIED
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: