Open Bug 648353 Opened 14 years ago Updated 2 years ago

when processing a large volume of internal events we spend too much time checking for native events

Categories

(Core :: DOM: UI Events & Focus Handling, defect)

x86
macOS
defect


People

(Reporter: gal, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(5 files)

No description provided.
Attached file test case (deleted) —
This is something I see all the time with the mac event loop... I _think_ we do it to prevent starvation of user events. But yes, it really hurts us perf-wise.
Based on sysprof, the situation is a bit less bad on linux. "Only" 25% is under DoProcessNextNativeEvent, not 50% like on OSX.
Still very lousy if you ask me.
Can we use a timer to control the rate at which we check for native events?
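A minimal sketch of that rate-limiting idea, assuming a plain timestamp check rather than an actual nsITimer; the class and constant names below are illustrative only, not Gecko code:

#include <chrono>

// Illustrative throttle: only allow a native-event poll once every
// kNativeEventInterval. Everything here is a sketch, not the real appshell.
class NativeEventThrottle {
  using Clock = std::chrono::steady_clock;
  static constexpr auto kNativeEventInterval = std::chrono::milliseconds(20);
  Clock::time_point mLastNativeCheck = Clock::now();

public:
  // Returns true when enough time has passed to justify polling the
  // (comparatively expensive) native event queue again.
  bool ShouldCheckNativeEvents() {
    const auto now = Clock::now();
    if (now - mLastNativeCheck < kNativeEventInterval) {
      return false;  // skip the native poll this time around
    }
    mLastNativeCheck = now;
    return true;
  }
};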
#6 would also help with the couple percent we spend in polling the timer in AppShellBase.
Or could we just change nsBaseAppShell::OnProcessNextEvent a bit more, like removing the performance hint? I have a patch which seems to work well with the testcase on my Linux laptop (it cuts 33% of the processing time and the UI stays responsive). I need to run it through tryserver to see if it regresses Tp badly. I have no idea how well it works on other platforms.
Can you post the patch? I am curious what it looks like.
Attached patch wip (deleted) — Splinter Review
So the idea is to process native events a bit less often, but when we do process them, to process as many as possible. The current else {} block in nsBaseAppShell::OnProcessNextEvent, where DoProcessNextNativeEvent(PR_FALSE); is called just once, doesn't seem to behave too well. This is all very experimental.
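Roughly, the idea reads like the following sketch. This is not the actual wip patch: the counter, the ratio, and the stubbed DoProcessNextNativeEvent are all hypothetical.

// Stand-in for nsBaseAppShell::DoProcessNextNativeEvent(bool aMayWait),
// which returns true if a native event was processed. Stubbed for this sketch.
static bool DoProcessNextNativeEvent(bool /*aMayWait*/) {
  return false;
}

static void OnProcessNextEventSketch() {
  static unsigned sGeckoEventCount = 0;
  const unsigned kNativeEventRatio = 32;  // hypothetical: poll natively every N Gecko events

  if (++sGeckoEventCount % kNativeEventRatio != 0) {
    return;  // most of the time, stay on the Gecko event path
  }

  // When we do poll, drain whatever native events are pending instead of
  // handling just one, so user input does not pile up between polls.
  while (DoProcessNextNativeEvent(/*aMayWait=*/false)) {
  }
}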
To prevent starving either user events or web events, don't we want to process both in small chunks as opposed to big batches?
If we can't keep up with native events, I think we are kinda screwed anyway. No point in delaying them. No?
I seem to recall there being cases (on Windows?) where we can get as many native events as we ask for... Maybe during resize event loops? But I might be misremembering. It's been a while, and I was watching from the sidelines.
Resizes and native modal dialogs spin their own event loops. (I don't know how Gecko events get processed in that model.)
(In reply to comment #13)
> I seem to recall there being cases (on Windows?) where we can get as many
> native events as we ask for... Maybe during resize event loops? But I might
> be misremembering. It's been a while, and I was watching from the sidelines.

That was a bit different - in that case Windows internal modal dispatch loops were prioritizing the native msg we post that triggers gecko event processing, starving input events. (Windows and mozilla prioritize differently.) The result was locked up native ui actions like window resize.

http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsAppShell.cpp#275
(In reply to comment #11)
> To prevent starving either user events or web events, don't we want to process
> both in small chunks as opposed to big batches?

Well, the wip patch makes us process user events in smaller chunks, yet allows gecko events to get processed too.
The patch seems to work on tryserver, but it does trigger random oranges quite often. I mean oranges that have already been reported before. I would assume that in many cases the tests are buggy, or some other code has wrong assumptions.
For Windows builds, a little bit of general testing shows everything working as expected. The test case returns 375->450 in both the try build and the latest nightly.
No longer blocks: 648850
Attached patch simpler (deleted) — Splinter Review
Don't remove perfmode, since I suspect the removal caused some twin regression. Uploaded to tryserver.
(In reply to comment #17)
> The patch seems to work on tryserver, but it does trigger random oranges
> quite often. I mean oranges that have already been reported before. I would
> assume that in many cases the tests are buggy, or some other code has wrong
> assumptions.

I'd be very cautious with this patch then...
Yeah, need to probably fix those tests before landing this kind of change.
(In reply to comment #22)
> Yeah, need to probably fix those tests before landing this kind of change.

I meant, do we know that it only affects tests in a negative way? Could it also affect web content in a negative way, without us realizing?
My *guess* is that those (already without the patch randomly failing) tests have some bad assumptions about when timers and runnables run.
(In reply to comment #24)
> My *guess* is that those (already without the patch randomly failing) tests
> have some bad assumptions about when timers and runnables run.

That's interesting. Do you have a set of those tests? They may be among our flaky tests which fail intermittently...
I posted the patch to tryserver and this time I got the following failures. No new random failures.

Bug 608206 - Intermittent timeout in test_prompt.html | Test timed out.
* uses nsITimer

Bug 632102 - Intermittent test_bug629838.html | Got MozAfterPaint event 25 time(s).
* has bogus assumptions about how often events fire. There is a patch to fix the test.

Bug 645267 - Intermittent failure in layout/reftests/svg/as-image/img-and-image-1.html, svg-image-recursive-1a.svg, svg-image-recursive-1b.svg, svg-image-recursive-2a.svg and svg-image-recursive-2b.svg | image comparison (==)
* not sure about this. Happens quite often anyway.

Bug 637806 - Intermittent failures in test_titlebar.xul
* this one uses setTimeout

Bug 542928 - [Linux] browser_bug511456.js | application timed out after 330 seconds with no output
* common one

Bug 626998 - Intermittent failure in browser_bug553455.js | Test timed out
* There are patches for this

Bug 620721 - Intermittent test_timeupdate_small_files.html | Test timed out. followed by shutdown timeout
* relies on setTimeout
Patch looks great. I would say land it. If this patch does alter the behavior of some test, that's most likely due to a brittle test and should be fixed anyway.
(In reply to comment #27)
> Patch looks great. I would say land it. If this patch does alter the behavior
> of some test, that's most likely due to a brittle test and should be fixed
> anyway.

I don't agree; we shouldn't just land stuff which can cause tests to fail more frequently without fixing the issues before landing. Let's take a look at these:

(In reply to comment #26)
> I posted the patch to tryserver and this time I got the following failures.
> No new random failures.

> Bug 608206 - Intermittent timeout in test_prompt.html | Test timed out.
> * uses nsITimer

Can you please trigger this test multiple times and see how frequently it fails? The frequency should be ~0.05 times per test run (which is pretty low): <http://brasstacks.mozilla.com/orangefactor/?display=Bug&endday=2011-04-26&startday=2011-02-25&bugid=608206>

> Bug 632102 - Intermittent test_bug629838.html | Got MozAfterPaint event 25 time(s).
> * has bogus assumptions about how often events fire. There is a patch to fix the test.

Then this bug should depend on bug 632102. Can you try pushing this patch together with the patch on bug 632102 to see if that helps the situation?

> Bug 645267 - Intermittent failure in layout/reftests/svg/as-image/img-and-image-1.html, svg-image-recursive-1a.svg, svg-image-recursive-1b.svg, svg-image-recursive-2a.svg and svg-image-recursive-2b.svg | image comparison (==)
> * not sure about this. Happens quite often anyway.

Yeah, this one is really flaky, and I don't think your patch here can affect it in any way.

> Bug 637806 - Intermittent failures in test_titlebar.xul
> * this one uses setTimeout

I think we need to fix this bug before this can land.

> Bug 542928 - [Linux] browser_bug511456.js | application timed out after 330 seconds with no output
> * common one

Actually it's not that common: <http://brasstacks.mozilla.com/orangefactor/?display=Bug&endday=2011-04-26&startday=2011-02-25&bugid=542928&tree=all> How often does it happen for you on the try server?

> Bug 626998 - Intermittent failure in browser_bug553455.js | Test timed out
> * There are patches for this

It seems to have been fixed. If you update to tip, reapply the patch, and push it to try, this shouldn't happen again.

> Bug 620721 - Intermittent test_timeupdate_small_files.html | Test timed out. followed by shutdown timeout
> * relies on setTimeout

This one also seems to be close to getting fixed.
Depends on: 632102, 637806, 626998, 620721
I don't see how it makes sense to gate patches on shitty tests. If a test is clearly flaky and the patch is unrelated and merely causes the alignment of the sun and the moon to change enough for a test to fail more often, we don't win anything from not landing it. Such tests should be disabled until they are fixed.
(In reply to comment #29)
> I don't see how it makes sense to gate patches on shitty tests. If a test is
> clearly flaky and the patch is unrelated and merely causes the alignment of
> the sun and the moon to change enough for a test to fail more often, we don't
> win anything from not landing it. Such tests should be disabled until they
> are fixed.

Landing something which increases the frequency of intermittent failures is going to make the lives of everyone harder. There is a cost to landing such patches. Like I said in comment 28, a lot of these issues have a fix at hand, and for those which don't, we should advocate on the respective bugs for more attention. None of these bugs seem like big issues to me, but if for some reason we can't find anybody to work on some of them, we should disable the test with the blessing of the respective module owner.
Also, Olli, can you please do another try server test run from the tip so that we have more data on the frequency of the failures? Thanks!
Sorry, forgot to push to try. A new build is coming http://tbpl.mozilla.org/?tree=Try&rev=81bb5e879230
This time the failures were Bug 662800, Bug 636379, Bug 632117, and REFTEST TEST-UNEXPECTED-FAIL | file:///home/cltbld/talos-slave/test/build/reftest/tests/layout/reftests/bugs/491323-1.xul | image comparison (==)
Comment on attachment 525490 [details] [diff] [review]
simpler

gal, bz, what do you think of landing this? If this causes any problems, it would be easy to just back it out.
Attachment #525490 - Flags: review?(gal)
Attachment #525490 - Flags: review?(bzbarsky)
So I'm trying to understand this patch. If we get into nsBaseAppShell::OnProcessNextEvent and we recently processed a native event so we skip native event processing per this patch... what will make sure we process those native events eventually?
Comment on attachment 525490 [details] [diff] [review]
simpler

I uploaded this once again to tryserver. It seems like this is still causing too many random (usually known) failures.
Attachment #525490 - Flags: review?(gal)
Attachment #525490 - Flags: review?(bzbarsky)
(In reply to comment #37)
> So I'm trying to understand this patch.
>
> If we get into nsBaseAppShell::OnProcessNextEvent and we recently processed
> a native event so we skip native event processing per this patch... what
> will make sure we process those native events eventually?

Hmm, good question. But if there is such a problem, it exists also without the patch: if we're in perfmode, we can easily not process native events, and even in non-perfmode we may not process all the native events.

/me reads some more code...
Ah, fair. I sort of wonder what determines whether we're in perf mode or not...
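One possible safeguard for the starvation concern above (an assumption for illustration, not something this patch is known to do) would be a hard cap on how long the native poll may be skipped:

#include <chrono>

// Sketch only: whatever throttling is in place, force a native poll if more
// than kMaxSkipTime has elapsed since the last one, so native events cannot
// be starved indefinitely. Names and the 50 ms bound are hypothetical.
class NativePollGuard {
  using Clock = std::chrono::steady_clock;
  static constexpr auto kMaxSkipTime = std::chrono::milliseconds(50);
  Clock::time_point mLastNativePoll = Clock::now();

public:
  // Returns true if we must poll natively now, regardless of throttling.
  bool MustPollNative() const {
    return Clock::now() - mLastNativePoll >= kMaxSkipTime;
  }

  void NotePolled() { mLastNativePoll = Clock::now(); }
};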
Attached patch safer? (deleted) — Splinter Review
Uploaded this (perhaps a tiny bit safer) patch to tryserver.
Component: Event Handling → User events and focus handling
Severity: normal → S3