Closed
Bug 996848
Opened 11 years ago
Closed 11 years ago
Rewrite nsAppShell native event handling in light of what's been learned at bug 959281
Categories
(Core :: Widget: Cocoa, defect)
Tracking
()
VERIFIED
FIXED
mozilla32
People
(Reporter: smichaud, Assigned: smichaud)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 4 obsolete files)
(deleted),
patch
|
spohl
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
spohl
:
review+
|
Details | Diff | Splinter Review |
(deleted),
image/png
|
Details |
As the summary says, this bug is spun off from bug 959281.
We discovered there that sending too many native events to Gecko while in nsAppShell::ProcessNextNativeEvent() can slow down (i.e. starve) the processing of nsIRunnable events in nsThread::ProcessNextEvent(). This can also (probably) cause Gecko to skip some of these events, or to process them in the wrong order.
As best we can tell, neither of these problems arises if we only send native events to Gecko (only call -[NSApp sendEvent:]) from nsAppShell::ProcessNextNativeEvent() when aMayWait is true.
For more information see bug 959281 comment #98 and bug 959281 comment #100.
I've posted a "bandaid patch" at bug 959281 that fixes the bug (and allows the patch for bug 953435 to be relanded). But I've learned a lot since I wrote (most of) the nsAppShell native event handling code, both by figuring out bug 959281 and by getting better at reverse engineering OS X's own event handling. So I'd like to take the chance to substantially rewrite the native event handling code in nsAppShell.mm and nsAppShell.h. I suspect I'll able to radically simplify it (and remove a lot of my own earlier patches).
Since my "radical simplification" is riskier than the bandaid patch for bug 959281, I'll pursue it separately. It's likely that we'll be able to get the bandaid patch (together with the restored patch for bug 953435) into a release fairly quickly. But my "radical simplification" patch will need to ride the trains, and to get onto them as early as possible. I currently hope to land it soon after the 32 branch is created.
Assignee | ||
Updated•11 years ago
|
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → smichaud
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•11 years ago
|
||
Note that the patch for this bug should make the patch for bug 953435 redundant.
So the patch for bug 953435 should be backed out when this patch lands.
Assignee | ||
Comment 2•11 years ago
|
||
Here's my latest revision of the patch.
I've started a set of tryserver builds here:
https://tbpl.mozilla.org/?tree=Try&rev=66fbc7fada2c
Assignee | ||
Comment 3•11 years ago
|
||
> I've started a set of tryserver builds here:
> https://tbpl.mozilla.org/?tree=Try&rev=66fbc7fada2c
That set seems to have crapped out, and there seems to be something wrong with hg.mozilla.org (I get "abort: HTTP Error 500: Internal Server Error" when I try "hg pull -u"). I guess I'll try again tomorrow.
Assignee | ||
Comment 4•11 years ago
|
||
I've started another set of tryserver builds for this patch:
https://tbpl.mozilla.org/?tree=Try&rev=47fcb1ea0b71
Attachment #8407217 -
Attachment is obsolete: true
Assignee | ||
Comment 5•11 years ago
|
||
The second set of tryserver builds also crapped out -- most of the tests failed with crashes on exit.
I have a hunch what the problem might be. I'll revise my patch and do another set. Then if the failures go away I'll post my revision.
Assignee | ||
Comment 6•11 years ago
|
||
My hunch was correct. I needed to restore a very old hack that previous versions of the patch got rid of -- the sending of fake events to ensure the native event loop stays awake while FF is shutting down.
Here's the set of tryserver builds:
https://tbpl.mozilla.org/?tree=Try&rev=c58d9ee1895
Attachment #8407578 -
Attachment is obsolete: true
Assignee | ||
Comment 7•11 years ago
|
||
Here's an update to my rev2 patch that removes my bandaid patch for bug 959281. The patch for bug 953435 hasn't relanded yet (and at this point presumably never will) -- if it had, I'd also have to back it out.
I've started another tryserver build:
https://tbpl.mozilla.org/?tree=Try&rev=aec5fa8a4997
Attachment #8407885 -
Attachment is obsolete: true
Assignee | ||
Updated•11 years ago
|
Attachment #8416731 -
Flags: review?(spohl.mozilla.bugs)
Assignee | ||
Comment 8•11 years ago
|
||
Oops, I discovered a possible problem: Two of the test failures in the tryserver build from comment #6 are repeated in the tryserver build from comment #7.
I'll look into this, but please do go ahead with your review, Stephen.
Assignee | ||
Comment 9•11 years ago
|
||
(Following up comment #8)
Neither test "failure" is really a failure. In both cases the test, as written, expects a number of assertions on the Mac. The "failure" is that, with this patch, there are fewer assertions than expected (in one case none at all). So the SimpleTest.expectAssertions(n) statement in each test will need to be revised. I'll post a patch that does this. It can land at the same time as the patch for this bug.
Comment 10•11 years ago
|
||
Comment on attachment 8416731 [details] [diff] [review]
Rev2 patch updated to current trunk
Review of attachment 8416731 [details] [diff] [review]:
-----------------------------------------------------------------
This looks good to me! I've also had this patch applied since yesterday and didn't run into any issues. I've verified that all dependent bugs of bug 996848 and bug 959281 now work correctly (the ones with a test case, anyway). I've verified this on both 10.9.2 and 10.6.8.
::: widget/cocoa/nsAppShell.mm
@@ +601,5 @@
> }
> } while (mRunningEventLoop);
>
> + if (eventProcessed) {
> + moreEvents =
nit: whitespace at end of line
Attachment #8416731 -
Flags: review?(spohl.mozilla.bugs) → review+
Assignee | ||
Comment 11•11 years ago
|
||
Comment on attachment 8416731 [details] [diff] [review]
Rev2 patch updated to current trunk
Thanks! I'll fix the whitespace problem.
But I won't land this until I've figured out what to do about the test "failures". As you pointed out (on IRC), sometimes we actually *want* to see an assertion (if that's what we're testing for). And it may also be possible that the tests in question behave differently in debug and non-debug builds (since the "failures" only happen with debug builds).
Assignee | ||
Comment 12•11 years ago
|
||
I've started a couple of tryserver builds with the relevant expectAssertions() statements commented out.
One contains just this change (doesn't have the patch for this bug):
https://tbpl.mozilla.org/?tree=Try&rev=5ea89c291763
One contains this change plus the patch for this bug:
https://tbpl.mozilla.org/?tree=Try&rev=c7f5229c8d4d
This should allow us to see which assertions don't happen with my patch with these tests.
Assignee | ||
Comment 13•11 years ago
|
||
Along the way to figuring out the test failures, I found a bunch of dead code (made obsolete by my patch) that I hadn't yet removed -- the entire nsCocoaAppModalWindowList class in nsAppShell.h/mm.
This patch gets rid of it. It also fixes the whitespace problem that Stephen noticed.
Attachment #8416731 -
Attachment is obsolete: true
Attachment #8418348 -
Flags: review?(spohl.mozilla.bugs)
Assignee | ||
Comment 14•11 years ago
|
||
The calls to SimpleTest.expectAssertions() in the failing tests were originally added by the patch for bug 404077. So the assertions they "expect" are unwanted ones that need to be worked around, not ones the tests are *supposed* to trigger.
Also, as you can tell from the test results from comment #12, the "missing" assertions are from the dead code removed by my previous patch:
###!!! ASSERTION: PopCocoa() called without matching call to PushCocoa()!
So their absence is exactly what you'd expect. This patch changes the tests to take that into account.
Attachment #8418359 -
Flags: review?(spohl.mozilla.bugs)
Comment 15•11 years ago
|
||
Comment on attachment 8418348 [details] [diff] [review]
Fix rev3
This patch is amazing. Among all the other good it does, it also completely fixes bug 606937, from what I can tell, and improves the testcase there from 33.5 seconds to 3.8 seconds on my machine.
Assignee | ||
Comment 16•11 years ago
|
||
(In reply to comment #15)
That's good to hear, but a bit puzzling.
At bug 959281 I could easily detect delays in processing nsIRunnable "events", and the event starvation that caused it -- but only with the patch for bug 953435 landed (which was subsequently backed out). The patch for bug 953435 made us spend (relatively) much more time in nsAppShell::ProcessNextNativeEvent() (and consequently in OnProcessNextEvent() called on observers from nsThread::ProcessNextEvent()). But I didn't see this behavior without the patch for bug 953435.
Assignee | ||
Comment 17•11 years ago
|
||
I should run the testcase for bug 606937 in a profiler, with and without my patch for this bug. Will the built-in Gecko Profiler do? Is that what you've been using?
Comment 18•11 years ago
|
||
Just now I've been using Instruments, but the built-in Gecko profiler should work, too.
(In reply to Steven Michaud from comment #16)
> (In reply to comment #15)
> At bug 959281 I could easily detect delays in processing nsIRunnable
> "events", and the event starvation that caused it
But this delay came from different ordering of Gecko vs native events - Gecko events weren't getting processed because we were doing other processing things. But in bug 606937, there's nothing useful happening instead; there's only pure event loop overhead. And your patch in this bug seems to remove that overhead.
To be clear, I didn't apply your patches locally. What I compared was a regular nightly and your try build at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-c7f5229c8d4d/try-macosx64/firefox-32.0a1.en-US.mac.dmg . I may be confused about which patches I actually tested.
Assignee | ||
Comment 19•11 years ago
|
||
Just now I ran the testcase from bug 606937 with today's m-c nightly and the 10.8 opt tryserver build from the second tryserver run from comment #12. They were both made today, using the same build configuration -- so they should be a good basis for comparison. (I didn't run a profiler. I ran the tests several times, but the results were pretty consistent.)
My tryserver build finished processing all the events a bit better than twice as fast as today's m-c nightly. I tested on a MacPro (a big, beefy desktop with 24GB of RAM) running OS X 10.8.5. Next I'll test on my Retina MacBook Pro.
Updated•11 years ago
|
Attachment #8418348 -
Flags: review?(spohl.mozilla.bugs) → review+
Updated•11 years ago
|
Attachment #8418359 -
Flags: review?(spohl.mozilla.bugs) → review+
Assignee | ||
Comment 20•11 years ago
|
||
Comment on attachment 8418348 [details] [diff] [review]
Fix rev3
Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4341602f99b0
Assignee | ||
Comment 21•11 years ago
|
||
Comment on attachment 8418359 [details] [diff] [review]
Fix test failures
Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bf9cc4b2aae6
Assignee | ||
Comment 22•11 years ago
|
||
(Following up comment #19)
I see a *huge* difference in performance with the bug 606937 testcase when I have the Gecko Profiler installed and running -- ten times or more (similar to what Markus reported).
I've also found the explanation: Without my patch we spend *much* more time in nsAppShell::OnProcessNextEvent() -- ~75% as opposed to ~12%. (After filtering on nsThread::ProcessNextEvent.) So this is a case of nsIRunnable event starvation, very much like that of bug 959281.
The testcase for bug 606937 is surely an edge case. I'm sure 95+% of web browsing doesn't trigger such bad behavior in builds without my patch. But it's good to know that my patch can deal with this edge case, just as it can with the bug 959281 edge case and with all the weird hangs it fixes.
I'm sure the reason is ultimately that, with my patch, we do much less work in nsAppShell::ProcessNextNativeEvent() when aMayWait is false, and also never spin the native event loop. Instead this work gets done in [NSApp run] (called from nsAppShell::Run()). aMayWait is generally only true when a Gecko modal dialog is displayed.
Comment 23•11 years ago
|
||
Backed out for reftest failures on OS X 10.8:
https://tbpl.mozilla.org/php/getParsedLog.php?id=39219300&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=39219796&tree=Mozilla-Inbound
Not a 100% fail rate, but >50% it seems. (A failure similar to this occurred in the second try link in comment 12).
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/a1c71b3ee0f2
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/85924e72f778
Comment 24•11 years ago
|
||
Assignee | ||
Comment 25•11 years ago
|
||
A lot of these failures are in the very first two of the tests in the reftest-sanity test list. I can't begin to understand how those could be connected to my patches, and I strongly suspect that they aren't.
But I'll keep digging to see what I can find.
Assignee | ||
Comment 26•11 years ago
|
||
Needless to say, I haven't been able to reproduce these failures locally, on either a (non-Retina) Mac Pro or a Retina MacBook Pro.
Assignee | ||
Comment 27•11 years ago
|
||
For what it's worth, I've started three more sets of tryserver builds, just to see what happens:
https://tbpl.mozilla.org/?tree=Try&rev=6592b4e1e9f8
https://tbpl.mozilla.org/?tree=Try&rev=e6da4a9157a0
https://tbpl.mozilla.org/?tree=Try&rev=b7e6c834f047
Comment 28•11 years ago
|
||
You can retrigger individual tests without doing a new push, by selecting the test on tbpl and clicking the blue plus in in the bottom part. I've done that for the 10.8 reftests on https://tbpl.mozilla.org/?tree=Try&rev=6592b4e1e9f8 a few times.
Comment 29•11 years ago
|
||
Oh, the 10.8 opt reftest on https://tbpl.mozilla.org/?tree=Try&rev=b7e6c834f047 already shows the failure.
Assignee | ||
Comment 30•11 years ago
|
||
(In reply to comment #28)
Yes, that would be more efficient. But I was also looking for other possible intermittent failures.
As there don't appear to be any, my next tryserver build will just do the reftests (if I can figure out how to do that). Regardless of result, I'll restart them five or six times, to see what happens.
Assignee | ||
Comment 31•11 years ago
|
||
More tryserver test runs, limited to reftests and to OS X 10.8 (where the failures have been occurring). I'll repeat the reftests several times for each run.
https://tbpl.mozilla.org/?tree=Try&rev=89d36767696d
https://tbpl.mozilla.org/?tree=Try&rev=0ffdf702b05f
Frankly I'm hoping that the failures will disappear at some point :-) When/if they do I'll resubmit my patches.
I'd appreciate advice on how to reproduce spurious reftest failures locally -- *any* spurious reftest failure. Once I can do that, I may be able to figure out why they're happening (whether or not my patches make them happen more often).
Assignee | ||
Comment 32•11 years ago
|
||
Stephen and Markus have been digging into the reftest failures. Markus suggested the proposed patch for bug 1005964 as a workaround. I've started another set of tryserver tests:
https://tbpl.mozilla.org/?tree=Try&rev=b9c824438b06
Comment 33•11 years ago
|
||
(In reply to Steven Michaud from comment #31)
> I'd appreciate advice on how to reproduce spurious reftest failures locally
> -- *any* spurious reftest failure.
Total drive-by comment: have you tried using chaos mode to reproduce the failure locally? http://robert.ocallahan.org/2014/03/introducing-chaos-mode.html
Comment 34•11 years ago
|
||
I think mstange said that he tried chaos mode locally and wasn't able to reproduce. It can't hurt to try again though.
Assignee | ||
Comment 35•11 years ago
|
||
Stephen and Markus found that the reftest failures triggered by my patch happen when one image (of the pair being compared) has an overlay scrollbar displayed while the other doesn't.
Because whether (and when) an overlay scrollbar is displayed is inherently unpredictable, they should be turned off while reftests are running. There's a bug (bug 986404, pointed out to me by Stephen) where a patch has been landed to "disable fade-out of overlay scrollbars during reftest". But either that patch doesn't work properly when my patches for this bug are on the tree, or it doesn't go far enough. I'd think overlay scrollbars should be disabled *completely*: It's not just their fadeout that's unpredictable -- so is the timing of their appearance.
Assignee | ||
Comment 36•11 years ago
|
||
I've found another way to reproduce these reftest failures locally, without using chaos mode:
I have a MacPro, which by default (since it has a mouse connected) doesn't use overlay scrollbars. But when I change the "Show scroll bars" setting from "Automatically based on mouse or trackpad" to "When scrolling", I can reproduce similar failures on my MacPro. I don't yet know why I don't see them on my Retina MacBook Pro (where overlay scrollbars are enabled by default, since it has a trackpad and no mouse attached).
I don't see exactly the same failures as happen on the tryservers. But I always see at least one case of a reftest failing because one image has an overlay scrollbar and the other doesn't. I've just been testing with the reftest-sanity reftests:
./mach reftest layout/reftests/reftest-sanity/reftest.list
As Stephen kindly pointed out to me, you can load the encoded images from the logs into the Firefox URL bar. Just be sure you include the "data:image/png;base64," at the beginning of each encoded image.
Assignee | ||
Comment 37•11 years ago
|
||
It's close to the end of my day, and I have a long weekend coming up. I'll be working on this, but it's unlikely I'll be able to come up with a solution before sometime next week.
Assignee | ||
Comment 38•11 years ago
|
||
(Following up comment #32)
I can see from my local tests that this doesn't help. The proposed patch for bug 1005964 doesn't fix the reftest failures.
Assignee | ||
Comment 39•11 years ago
|
||
(Following up comment #36)
> I don't yet know why I don't see them on my Retina MacBook Pro
> (where overlay scrollbars are enabled by default, since it has a
> trackpad and no mouse attached).
It was because I'd changed the system setting and forgotten about it.
Once I change "Show scroll bars" back to "Automatically based on input
device", I can also reproduce these reftest failures on my MacBook
Pro. They also happen when it's set to "is scrolling".
So I can (and always do) reproduce these failures locally when
Firefox's default setting is to use overlay scrollbars (because of how
the system "Show scroll bars" setting is set).
Comment 40•11 years ago
|
||
(In reply to Steven Michaud from comment #35)
> Stephen and Markus found that the reftest failures triggered by my patch
> happen when one image (of the pair being compared) has an overlay scrollbar
> displayed while the other doesn't.
We were actually seeing permanently visible scrollbars, not overlay scrollbars. I'm not sure what the system preferences on our 10.8 machines are set to. But if the system preferences say to display overlay scrollbars, we should never be seeing permanently visible scrollbars in the first place...
Comment 41•11 years ago
|
||
The 10.8 machines are set to always show scrollbars due to failures I believe.
Comment 42•11 years ago
|
||
Oh, good to know. I thought with bug 986404 fixed we might be using overlay scrollbars. But at least we're expected to see permanently visible scrollbars on our test machines, which makes these failures a little less daunting. Just got to figure out why they're visible, and why they're so tiny. Missing a reflow due to a resize would have been a great explanation, but that doesn't seem to be the case since the try builds from comment 32 still show these failures...
Assignee | ||
Comment 43•11 years ago
|
||
I'm definitely seeing overlay scrollbars in my local tests.
Do you have any idea, Timothy, why bug 986404 doesn't stop that from happening?
Comment 44•11 years ago
|
||
Not sure. What happens at http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp#2210 ? ie the value of LookAndFeel::GetInt(LookAndFeel::eIntID_UseOverlayScrollbars)
Comment 45•11 years ago
|
||
From memory bug 986404 shouldn't affect 10.8 on tbpl because we disable the overlay scrollbar via the system pref. It only affected b2g and 10.9 machines where the system pref allows overlay scrollbars.
Assignee | ||
Comment 46•11 years ago
|
||
> It only affected b2g and 10.9 machines where the system pref allows overlay scrollbars.
Thanks. I'll look into it more carefully next week.
> What happens at http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp#2210?
I'll check next week.
Comment 47•11 years ago
|
||
(In reply to Steven Michaud from comment #36)
> I don't see exactly the same failures as happen on the tryservers. But I
> always see at least one case of a reftest failing because one image has an
> overlay scrollbar and the other doesn't.
Those failures are definitely worth fixing (tracked by bug 947089) but unfortunately have nothing to do with this bug.
What Stephen and I were seeing was that about:blank had (non-overlay) scrollbars, instead of being completely blank. Many reftests are supposed to render nothing, i.e. should have all-white output. For that purpose they are compared to a snapshot of about:blank. Since that about:blank snapshot was *not* completely white, but instead showed tiny scrollbars in the top left corner, those tests would fail.
The first occurrence of this problem is very simple:
https://tbpl.mozilla.org/php/getParsedLog.php?id=39248811&tree=Try#error0
data:text/html,<body> is blank (as expected) while about:blank has scrollbars (see attachment).
Comment 48•11 years ago
|
||
Here's an observation: If I change my scrollbar system setting to "overlay scrollbars only", open a new tab, navigate to about:blank, and change the scrollbar system setting to "non-overlay scrollbars only", about:blank starts looking exactly like the bad snapshot: mostly white, but with tiny non-overlay scrollbars in the top left corner.
So my hypothesis for now is the following: Reftests on 10.8 start out with overlay scrollbars enabled, then something triggers a change of the system pref and we update our lookandfeel setting to disable overlay scrollbars. Depending on whether this switch happens before or after the reftest framework loaded about:blank, we either get a bad about:blank snapshot or a good one. And the patch in this bug may influence the ordering of the native event that notifies us of the system pref change vs the Gecko event that starts the reflow of the about:blank document.
I'm going to start a try build in order to test this hypothesis.
Comment 49•11 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #48)
> Here's an observation: If I change my scrollbar system setting to "overlay
> scrollbars only", open a new tab, navigate to about:blank, and change the
> scrollbar system setting to "non-overlay scrollbars only", about:blank
> starts looking exactly like the bad snapshot: mostly white, but with tiny
> non-overlay scrollbars in the top left corner.
I filed bug 1008197 for this. I think that by reverting bug 994861 we should be able to land the patches here.
Comment 50•11 years ago
|
||
Launched a try build with bug 994861 reverted:
https://tbpl.mozilla.org/?tree=Try&rev=0aab2b84af36
Comment 51•11 years ago
|
||
That try build looks good and bug 1008197 landed. Hopefully we are good to go now.
Comment 53•11 years ago
|
||
I relanded the two changed sets that landed in comment 20 and 21. Hopefully there have been no changes since then.
https://hg.mozilla.org/integration/mozilla-inbound/rev/d8eb7fe67883
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac7542838773
Keywords: checkin-needed
Comment 54•11 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/d8eb7fe67883
https://hg.mozilla.org/mozilla-central/rev/ac7542838773
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Comment 55•10 years ago
|
||
Marking for verification in Firefox 32, based on discussion from bug 959281.
Keywords: verifyme
Comment 56•10 years ago
|
||
Reproduced with Nightly 2014-01-09 on Mac OS X 10.9.4.
Verified as fixed with Firefox 32 beta 8 (Build ID: 20140818191513) with STR from bug 959281 comment 113: http://people.mozilla.org/~stmichaud/paperjs.org/examples/voronoi/ testcase works smoothly. Also, couldn't reproduce bug 1016200 on the same beta build.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Updated•8 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•