Bug 1300355 (Closed) - intermittent blank reftest test or reference screenshots on win7 PGO
Opened 8 years ago • Closed 6 years ago
Categories: Core :: Graphics, defect, P3
People
(Reporter: dbaron, Assigned: mattwoodrow)
References
(Depends on 2 open bugs, Blocks 18 open bugs)
Details
(Whiteboard: gfx-noted)
Attachments (3 files, 3 obsolete files)
patch (deleted) | dbaron: review+ | Details | Diff | Splinter Review
patch (deleted) | jmaher: review+ | Details | Diff | Splinter Review
patch (deleted) | dbaron: review+ | Details | Diff | Splinter Review
We've started seeing intermittent reftest failures with blank reftests (either the test or the reference is blank), sometimes with 1 failure in a run and sometimes three.
They all seem to be on Win7 PGO, and the failures are usually preceded by some sort of GFX error message (although the message varies).
Reporter
Updated•8 years ago
Reporter
Comment 1•8 years ago
The message seems to be one of:
INFO - [GFX1]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000
INFO - [GFX1]: Failed to create DrawTarget, Type: 5 Size: Size(800,1000)
Comment 2•8 years ago
Not sure whether the assertion stack will help anyone (the crash stack certainly won't), but bug 1300608 is "Failed to create DrawTarget" in a debug build, and thus an assertion.
Comment 3•8 years ago
We had some problems with skia content on Windows due to OOMing during the tests. We weren't GCing enough and that sometimes caused us to be unable to allocate memory to draw content. Terrence, any other ideas here?
Flags: needinfo?(terrence)
Updated•8 years ago
Flags: needinfo?(terrence) → needinfo?(continuation)
Comment 4•8 years ago
Is this happening for specific tests? Is there any way to, say, add some pauses of a few seconds here and there for these tests? Or maybe the directory could be split up, assuming that reftests have run-by-directory support, which I'm not sure they do.
Flags: needinfo?(continuation)
Comment 5•8 years ago
They aren't even chunked, much less run-by-dir, and while there's a little affinity for hitting w3c-css/submitted/variables/ and svg/filters/, it's just a little affinity based on short sample time.
Comment 6•8 years ago
(In reply to Mason Chang [:mchang] from comment #3)
> We had some problems with skia content on Windows due to OOMing during the
> tests. We weren't GCing enough and that sometimes caused us to be unable to
> allocate memory to draw content. Terrence, any other ideas here?
That seems plausible.
Comment 7•8 years ago
Any other ideas for being more aggressive with the gc here? Thanks.
Flags: needinfo?(continuation)
Updated•8 years ago
Whiteboard: gfx-noted
Updated•8 years ago
Comment 8•8 years ago
(In reply to Mason Chang [:mchang] from comment #7)
> Any other ideas for being more aggressive with the gc here? Thanks.
Not really, sorry. Splitting up debug windows reftests might help. They take like 90 minutes. That must really fragment the address space.
Flags: needinfo?(continuation)
Comment 9•8 years ago
This particular case, or set of cases, is not 90 minute debug runs, it's 25 minute PGO runs.
I've been resisting our knee-jerk calls to split up the suite rather than fix our inability to run it, because we are desperately short on Windows hardware, routinely having 8-24+ hour wait times for Windows tests on try. But since we run Windows 7 tests on AWS VMs, it would be a little less horrible. Looks like when it has load, this pool can take another job within around 2 minutes of finishing one, and the actual test running time of the 25-minute job is 20 minutes, so we'd only be burning an additional 14 minutes of wasted time per reftest run, if we were willing to accept the awkwardness of running one chunk for WinXP and Win8 but three chunks for Win7.
Comment 10•8 years ago
So it sounds like we can either do chunking or just GC more often here. Since bug 1299204, we've been doing a full GC sweep every 3000 reftests. Maybe we can make that more frequent, say every 1000 or even every 100 tests? Otherwise, are we left only with chunking?
I'm not sure what we can do on the graphics side. If we don't have memory to allocate a texture, then we can't display anything. We already try to clear out caches when we get a low memory notification, but I've been told that it's not really reliable.
Do we have a preference for chunking versus just GC more often?
Flags: needinfo?(philringnalda)
Flags: needinfo?(continuation)
Comment 11•8 years ago
I'm fine with increasing how frequently we do them or chunking, but I don't think it will help. I assume the basic issue here is that there's a directory that has a ton of graphically intensive tests that all allocate a lot of memory. You need to somehow do a forceGC + forceCC in the middle of this directory.
Flags: needinfo?(continuation)
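For context, a minimal sketch of what "forceGC + forceCC every N tests" could look like in chrome-privileged harness code such as reftest.jsm. The counter, threshold, and function name here are invented for illustration; this is not the patch that eventually landed.

    // Illustrative sketch only; the threshold and names are made up.
    const FORCE_GC_EVERY_N_TESTS = 1000;  // hypothetical: lower than the current 3000
    var gTestsSinceLastCollection = 0;

    function MaybeForceCollection() {
      if (++gTestsSinceLastCollection < FORCE_GC_EVERY_N_TESTS) {
        return;
      }
      gTestsSinceLastCollection = 0;
      Components.utils.forceGC();  // global garbage collection
      Components.utils.forceCC();  // cycle collection, to break C++/JS reference cycles
    }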
Comment 12•8 years ago
Chunking won't help because this directory will still all be in the same chunk, and forcing a GC/CC more frequently will only help if one of those happens to fall during this directory (and could easily change later if more tests are added).
Comment 13•8 years ago
Can we split up the directory?
Reporter
Comment 14•8 years ago
Are there actually large chunks of graphics memory that require a GC to be deleted, rather than being deleted when the presentation is destroyed? If that's the case, that sounds bad, and probably a regression.
Updated•8 years ago
Flags: needinfo?(philringnalda)
Comment 15•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) from comment #14)
> Are there actually large chunks of graphics memory that require a GC to be
> deleted, rather than being deleted when the presentation is destroyed? If
> that's the case, that sounds bad, and probably a regression.
Nical, do you know? My understanding is that lots of gfx memory is just held by RefPtr. When do textures get cleaned up?
Flags: needinfo?(nical.bugzilla)
Comment 16•8 years ago
Textures are generally reference counted, and the ones that are directly associated with DOM elements are held alive by the things they represent (images, canvases, video elements, etc.). There's also a lot of texture memory that belongs to the layer tree, which should go away when the presentation is destroyed (I think), except if it is held alive by a pool (tiling comes to mind, but it is not used on Windows). We have a concept of a memory-pressure event; I don't remember if we use it at all on desktop. If so, can we make sure it is fired after every few tests? I don't think it will make as much of a difference as running GC+CC, but who knows...
Flags: needinfo?(nical.bugzilla)
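For reference, the memory-pressure event mentioned above can be fired from chrome JS via the observer service; whether anything useful listens for it during desktop reftests is exactly the open question here, so treat this as a sketch rather than a proposed fix.

    // Sketch: ask caches (including some gfx pools) to trim themselves.
    Components.utils.import("resource://gre/modules/Services.jsm");
    // "heap-minimize" is the conventional payload requesting aggressive trimming.
    Services.obs.notifyObservers(null, "memory-pressure", "heap-minimize");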
Assignee
Comment 17•8 years ago
Do we actually know what the problem is here?
Have we run out of available memory (and if so, is it because of gfx)? Or have we just fragmented the heap to the point that the large allocations gfx tries to make fail?
Nick, is it possible to manually trigger an about:memory report (or any memory stats at all) when we fail an allocation during automated testing, in a way that goes somewhere we can access? stdout is the easiest for tbpl log parsing, but I'm sure we could get it written to a file on disk if we asked the right people.
Ideally these would be synchronous so we can know about the state we're in right now, especially since it could change quite quickly under reftest load.
Flags: needinfo?(n.nethercote)
Assignee
Comment 18•8 years ago
I did a try push with some diagnostics along with my patch for bug 1305326 which seems to make this more frequent.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b860233b71
When it fails:
[GFX1-]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000
Largest contiguous memory region: 5218304
Memory stats: Load: 52, Total physical: 3211976704, available: 1511645184, Total vmem: 2147352576, available: 449433600 available-extended: 0
Looks like we're trying to allocate a ~3.2MB buffer and the largest contiguous memory chunk is ~5.2MB. Seems like it should have succeeded, but it's not really surprising that it didn't. Total available memory is fine.
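(For reference on those numbers, assuming a 4-byte-per-pixel BGRA surface: 800 × 1000 × 4 = 3,200,000 bytes ≈ 3.2MB, which matches the failed buffer in the log, while the largest free contiguous region reported is 5,218,304 bytes ≈ 5.2MB.)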
Given that, I think just splitting the suite in half should work fine. There's no obvious huge peak of memory load, so we don't need to worry about splitting a specific directory.
Assignee
Comment 19•8 years ago
Oh, and the hang happens when the failing allocation is one that reports an error back to javascript. The reftest harness doesn't catch them, so it just stops trying to do things.
If the failing allocation is during drawing then we just get a blank image.
It might be worth trying to profile the allocations and see if we're churning memory more than we need to.
Does anyone know if we have tools for doing that?
Comment 20•8 years ago
> Nick, is it possible to manually trigger an about:memory report (or any
> memory stats at all) when we fail an allocation during automated testing
> that goes to somewhere we can access? stdout is the easiest for tbpl log
> parsing, but I'm sure we could get it written to a file on disk if we asked the
> right people.
You can use the methods in nsIMemoryInfoDumper/nsMemoryInfoDumper to save a memory report to disk. See nsThread::SaveMemoryReportNearOOM() (and the functions it calls) for an example of its use.
Flags: needinfo?(n.nethercote)
Comment 21•8 years ago
(In reply to Matt Woodrow (:mattwoodrow) from comment #17)
> stdout is the easiest for tbpl log parsing, but I'm sure we could get ask to a file on disk if we asked the right people.
In automation, the environment variable MOZ_UPLOAD_DIR is set to a directory you can store files to. There will be a link to the files saved to that directory in Treeherder.
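Putting comment 20 and comment 21 together, a rough sketch of what such a diagnostic could look like in chrome JS. The file name is arbitrary and the exact nsIMemoryInfoDumper signature may differ between Gecko versions, so treat this as an outline rather than a drop-in patch.

    // Sketch: write a memory report into MOZ_UPLOAD_DIR when an allocation fails.
    function DumpMemoryReportToUploadDir() {
      let env = Components.classes["@mozilla.org/process/environment;1"]
                          .getService(Components.interfaces.nsIEnvironment);
      let uploadDir = env.get("MOZ_UPLOAD_DIR");
      if (!uploadDir) {
        return;  // not running in automation
      }
      let dumper = Components.classes["@mozilla.org/memory-info-dumper;1"]
                             .getService(Components.interfaces.nsIMemoryInfoDumper);
      // (filename, finishDumping callback, finishDumping data, anonymize)
      dumper.dumpMemoryReportsToNamedFile(uploadDir + "/memory-report.json.gz",
                                          null, null, false);
    }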
Updated•8 years ago
Comment 22•8 years ago
The blank snapshot happened on Linux PGO too. Bug 1312965, bug 1313785, bug 1313279. Is it considered the same as this?
Comment 23•8 years ago
(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)
> The blank snapshot happened on Linux PGO too. Bug 1312965, bug 1313785, bug
> 1313279. Is it considered the same as this?
Maybe not.
I cannot find the gfx info log in these three Linux PGO test failures (see comment 1):
INFO - [GFX1]: Failed to create DrawTarget, Type: 5 Size: Size(800,1000)
Updated•8 years ago
Comment 24•8 years ago
Thanks, cjku. Filed bug 1314838.
Updated•8 years ago
Comment 25•8 years ago
Can we just mark the bugs with the same cause as duplicates? Or are we going to revisit them one by one in the future?
Comment 26•8 years ago
No, we can't, not unless someone fixes bug 1299274 first (which would then just have the effect of making this such a topfail that it would more quickly be identified as a "fix it, or lose the right to run the tests" situation).
Comment 27•8 years ago
Ah, I see. Thanks for your feedback. I'll keep watching the intermittent bugs and set up dependencies if they have the same cause.
Comment 28•8 years ago
It seems Win7 debug R1 has a high intermittent failure rate when reftests are added under layout/reftests/w3c-css/submitted. Both bug 1311244 comment 43 and bug 1316482 comment 1300355 were backouts due to symptoms similar to this bug. We need help to fix or work around the issue to prevent it from blocking feature development and bug fixing.
Comment 29•8 years ago
I wonder if we could add the ability to restart the browser instance while running reftests? Something like a "need-to-restart-browser" line in reftest.list, so the browser could restart at that point. That way we could force a browser restart before/after running a huge set of tests. Maybe this could be an alternative if chunking is not an option at present; see the sketch below.
We seem to have a '--run-by-dir' option in mochitest (
http://searchfox.org/mozilla-central/rev/225ab0637ed51b8b3f9f4ee2f9c339a37a65b626/testing/mochitest/mochitest_options.py#231); not sure if we could implement a similar function in reftest...
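Purely to illustrate the proposal above (nothing like this exists in the harness today), a hypothetical manifest-parsing branch might look like the following; the directive name, TYPE_RESTART_MARKER, and the surrounding structure are all invented.

    // Hypothetical sketch of handling a "need-to-restart-browser" line in reftest.list.
    const TYPE_RESTART_MARKER = "restart-marker";  // invented marker type

    function MaybeParseRestartDirective(line, testList) {
      if (line.trim() == "need-to-restart-browser") {
        // Queue a marker; the runner would shut down and relaunch the browser
        // when it reaches this entry, then continue with the remaining tests.
        testList.push({ type: TYPE_RESTART_MARKER });
        return true;
      }
      return false;  // not our directive; fall through to normal manifest parsing
    }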
Comment 30•8 years ago
Recent bug 1331371 hits the same failure; the only difference is the backend type.
03:00:37 INFO - Assertion failure: [GFX1]: Failed to create DrawTarget, Type: 3 Size: Size(800,1000), at c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-firefox\dist\include\mozilla/gfx/Logging.h:515
I will try an approach in bug 1331371 to see if it helps this one.
Updated•8 years ago
Comment 32•8 years ago
(In reply to Vincent Liu[:vliu] from comment #30)
> There is recent Bug 1331371 hit the same fail, the only difference is the
> backend Type.
>
> 03:00:37 INFO - Assertion failure: [GFX1]: Failed to create DrawTarget,
> Type: 3 Size: Size(800,1000), at
> c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-
> firefox\dist\include\mozilla/gfx/Logging.h:515
>
> I would try to approach on Bug 1331371 to see if it helps to this one.
In recent weeks I haven't been able to reproduce this issue, even though I tried setting up a Win7 VM locally and looking into it on try. Some bugs like bug 1316482 and bug 1311244 disabled many reftests because of this. Based on that, I tried enabling them and sent a try push. Please see [1] for the try result.
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d28066d93e0370b4b46fe8c3ee9e49812da58a0f
It seems that even with them enabled, the issue still does not happen. From the above results, I will file a bug to re-enable the reftests from bug 1316482 and bug 1311244 and observe what we see next.
Comment 33•8 years ago
(In reply to Vincent Liu[:vliu] from comment #32)
> (In reply to Vincent Liu[:vliu] from comment #30)
> > There is recent Bug 1331371 hit the same fail, the only difference is the
> > backend Type.
> >
> > 03:00:37 INFO - Assertion failure: [GFX1]: Failed to create DrawTarget,
> > Type: 3 Size: Size(800,1000), at
> > c:\builds\moz2_slave\autoland-w32-d-000000000000000\build\src\obj-
> > firefox\dist\include\mozilla/gfx/Logging.h:515
> >
> > I would try to approach on Bug 1331371 to see if it helps to this one.
>
> In recent weeks, I can't reproduce this issue even I tried to set up Win7 vm
> on local or looked into it on try. Some bugs like bug 1316482 and 1311244
> disabled many reftests because of this. Based on this, I tried to enabled
> them and sent a try. Please see [1] for the try result.
>
> [1]:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d28066d93e0370b4b46fe8c3ee9e49812da58a0f
>
> It seems that even I enabled them, the issue still not happens. From the
> above results, I will file a bug to enable those reftests in bug 1316482 and
> 1311244 and observe what we will see next.
From the comment history since 2017-01-27, the current status is that it is easier to reproduce in opt builds than in debug builds. Since bug 1316482 and bug 1311244 only disabled those reftests in debug builds, maybe that's why we see it more easily in opt. Based on this, I will hold off on filing a bug to re-enable those reftests in debug builds until we have a clearer view.
Comment 34•8 years ago
Hi Phil,
(In reply to Matt Woodrow (:mattwoodrow) from comment #18)
> I did a try push with some diagnostics along with my patch for bug 1305326
> which seems to make this more frequent.
>
> Given that, I think just splitting the suite in half should work fine.
> There's no obvious huge peak of memory load, so we don't need to worry about
> splitting a specific directory.
From tracking the comment history, I believe splitting the suite into more chunks is worth trying to see the result. Do you, or anyone, or an MDN link, know how to send a patch to try to do this? I would like to try it to see if it can actually solve this issue. Thanks.
Flags: needinfo?(philringnalda)
Comment 35•8 years ago
Windows, so jobs run by buildbot? My guess would be that you can't, but I wouldn't know, not something that's ever been anything to do with me. You might ask in #ateam during the US daytime.
Flags: needinfo?(philringnalda)
Comment 36•8 years ago
(In reply to Vincent Liu[:vliu] from comment #34)
> Hi Phil,
>
> (In reply to Matt Woodrow (:mattwoodrow) from comment #18)
> > I did a try push with some diagnostics along with my patch for bug 1305326
> > which seems to make this more frequent.
> >
>
> > Given that, I think just splitting the suite in half should work fine.
> > There's no obvious huge peak of memory load, so we don't need to worry about
> > splitting a specific directory.
>
> From tracking the comment history, I believe increasing the trunk splitting
> the suite worth to try to see the result. Do you or anyone or mdn link knows
> how to send patch to try to do this? I would like to do this to see if it
> can actually solve this issue. Really thanks.
Hi gbrown,
Is it possible I can do this on my own? Thanks.
Flags: needinfo?(gbrown)
Comment 37•8 years ago
You can't test buildbot-config changes on Try. I'd be willing to split the reftest jobs up on Ash and see what happens, though.
Comment 39•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #37)
> You can't test buildbot-config changes on Try. I'd be willing to split the
> reftest jobs up on Ash and see what happens, though.
I am not quite clear about the *Ash* you mentioned. Is it a platform on try where the issue can be reproduced? I have never seen this term on trychooser; maybe you can give some input about this? Thanks
Flags: needinfo?(ryanvm)
Comment 40•8 years ago
Ash is the project branch we've been using for e10s-related testing. I sync it roughly daily with mozilla-central. You can see the results on Treeherder at https://treeherder.mozilla.org/#/jobs?repo=ash.
Flags: needinfo?(ryanvm)
Comment 41•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #37)
> You can't test buildbot-config changes on Try. I'd be willing to split the
> reftest jobs up on Ash and see what happens, though.
From [1], I didn't see anything like an increased chunk count to split the reftest jobs up on Ash. Have you started doing that, or is there something I missed? Thanks.
[1]: https://treeherder.mozilla.org/#/jobs?repo=ash
Flags: needinfo?(ryanvm)
Comment 42•8 years ago
The attached buildbot-configs patch, once landed and put into production, will split up the Windows reftests into 2 chunks. I will then do a new push to Ash from mozilla-central so we can retrigger the runs and see how they do.
Note that Ash has a bit of a different configuration from mozilla-central in that it currently has 4 content processes enabled by default on it. That said, I can see these same failures happening there as well, so I don't think it'll have any meaningful impact on the results.
Jordan, the diff is a bit bigger than I'd hoped because I had to fight with all the logic related to Win7 AWS jobs and ended up moving and refactoring things a bit to make life easier. I've verified that the builder diff is sane, however. It also has the added benefit of turning off a bunch of non-e10s jobs that had been inadvertently enabled on Ash during all of that migration, so that's a nice plus :).
Flags: needinfo?(ryanvm)
Attachment #8835763 -
Flags: review?(jlund)
Comment 43•8 years ago
Comment on attachment 8835763 [details] [diff] [review]
refactor Ash test scheduling and split Windows reftests into two chunks
Review of attachment 8835763 [details] [diff] [review]:
-----------------------------------------------------------------
overall looks good. I don't know much about the win7_vm_gfx and win7_vm platforms so I'll trust in you that our infra is setup to handle this. I only have one nit with regards to some of the condition statements in-line below. resetting r? for now until new patch is up or you push back :)
::: mozilla-tests/config.py
@@ +3009,5 @@
> + if slave_platform not in BRANCHES['ash']['platforms'][platform]:
> + continue
> +
> + # Linux jobs are scheduled via in-tree Taskcluster configs, so no need for anything here.
> + if slave_platform in BRANCHES['ash']['platforms'][platform] and slave_platform in ('yosemite_r7'):
to help with some of the complexity, could we remove every `if slave_platform not in BRANCHES['ash']['platforms'][platform]` from every condition statement? I believe that condition is checked above at the beginning of the loop.
Also, since every single one of these platform blocks are for individual platform, could we say something like: `if slave_platform == "yosemite_r7":` rather than check for presence in a list?
Attachment #8835763 -
Flags: review?(jlund) → review-
Comment 44•8 years ago
Yeah, some of that logic was a carryover from another time when I was setting things for multiple platforms in those blocks. This patch makes the changes you requested and still produces a sane builder diff.
And yes, I've made sure to split the win7_ix/win7_vm/win7_vm_gfx jobs up in the same way they're scheduled in production as well.
Attachment #8835763 -
Attachment is obsolete: true
Attachment #8835804 -
Flags: review?(jlund)
Comment 45•8 years ago
Comment on attachment 8835804 [details] [diff] [review]
refactor Ash test scheduling and split Windows reftests into two chunks
Review of attachment 8835804 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm :)
Attachment #8835804 -
Flags: review?(jlund) → review+
Comment 46•8 years ago
Comment on attachment 8835804 [details] [diff] [review]
refactor Ash test scheduling and split Windows reftests into two chunks
https://hg.mozilla.org/build/buildbot-configs/rev/1d08e92f2992b9e64afda4537ccff41bf55daf48
Attachment #8835804 -
Flags: checkin+
Comment 47•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #42)
> Created attachment 8835763 [details] [diff] [review]
> refactor Ash test scheduling and split Windows reftests into two chunks
>
Thanks for your help in figuring this out. One question: as far as I know, we currently have two chunks for the Windows 7 debug build and one for the opt build. Does your patch apply to debug or opt? So far we have had the chance to see this failure on both debug and opt; is it possible to observe it on both? Thanks.
Flags: needinfo?(ryanvm)
Comment 48•8 years ago
From looking into the recent failures, they all happen in submitted/variables/*.html. To check whether the failures relate not to a flaw in the tests themselves but possibly to the test environment, I sent a try push that removed all reftests in layout/ except submitted/variables/*.html and the reftests from bug 1316482/bug 1311244. It turned out all green after many retriggered jobs. Please see [1] for reference.
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a009919c5d625d2ab6e5c19ae5674d62d27aa5dc
Comment 49•8 years ago
(In reply to Vincent Liu[:vliu] from comment #47)
> Thanks for your help to figure it out. One question, As I know currently we
> have two trucks on windows 7 debug build and one for opt build. does your
> patch work on debug or opt? Until now we have chance to see this fail on
> both debug and opt, is it possible we can observe it on both? Really thanks.
The Ash change went into production yesterday. Unfortunately, it doesn't appear to have helped. Take note of the red Ru1 jobs at the link below:
https://treeherder.mozilla.org/#/jobs?repo=ash&revision=c27dd6952505c5cb2733c0bd077f370f676c3536&filter-searchStr=win%20ref
Looks like adding more chunks isn't going to be a silver bullet here.
Flags: needinfo?(ryanvm)
Comment 50•8 years ago
This is basically permafailing on Aurora/Beta (and soon ESR52), where all Windows opt builds are PGO. Is this anybody's active priority at the moment, or do we need to resort to hiding Windows PGO reftests across all trees until someone has more time to investigate? Leaving them in a nearly-permafailing state isn't doing any useful testing for us.
Updated•8 years ago
Flags: needinfo?(bugs)
Reporter
Comment 51•8 years ago
Do we have a regression range for when this started permafailing? (Speaking of which, do we have a regression range for when it started in the first place?)
Reporter
Comment 52•8 years ago
Another question is whether this is really a continuation of bug 1167155 and it just started happening in this form after https://hg.mozilla.org/mozilla-central/rev/6f6ae4f2c283 ?
Comment 53•8 years ago
My recollection is that this was tied to the Skia switchover.
Per the email I'm going to be sending to release-drivers in a bit, I think this needs to block the 52 release.
tracking-firefox52:
--- → ?
Comment 54•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #53)
> Per the email I'm going to be sending to release-drivers in a bit, I think
> this needs to block the 52 release.
hmm, looking at Treeherder it's not quite as permafail as I'd thought it was, though it's still pretty far from great. We seem to hit this failure roughly once per push across one of the Win7 reftest suites.
Reporter
Comment 55•8 years ago
Another possibly-relevant changeset around when this started is https://hg.mozilla.org/mozilla-central/rev/9bbc7d520ccb from bug 1298345.
Reporter
Comment 56•8 years ago
More maybe-relevant changesets:
https://hg.mozilla.org/mozilla-central/rev/42031a1a09e8
https://hg.mozilla.org/mozilla-central/rev/1065b1b168df
Assignee
Comment 57•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-8 from comment #52)
> Another question is whether this is really a continuation of bug 1167155 and
> it just started happening in this form after
> https://hg.mozilla.org/mozilla-central/rev/6f6ae4f2c283 ?
I think it is, but it's hard to tell as the logs from that bug have expired.
Assignee
Comment 58•8 years ago
I had a play with a few ideas to try to reduce the memory usage:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c58b2496ad0683d7b8a275a477f870901a93d17c
Didn't do the right magic to get PGO builds unfortunately.
Ryan, how often do we expect it to fail on non-PGO builds?
Flags: needinfo?(ryanvm)
Comment 59•8 years ago
You'd have to look at Treeherder. Also, for PGO on Try:
https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build
Flags: needinfo?(ryanvm)
Comment 60•8 years ago
A try push [1] I sent last Thursday also shows an OOM causing the crash:
1. Before the crash, the largest contiguous memory region is about 5238784 bytes (~5.2MB).
2. The log message I added, "SkMallocPixelRef::NewUsing size=3200000, addr=00000000", shows that alloc() returned a null address.
[1]: https://archive.mozilla.org/pub/firefox/try-builds/vliu@mozilla.com-db7f562916f4bc86770d5ac158c7554720b939d3/try-win32/try_win7_ix_test-reftest-no-accel-bm119-tests1-windows-build1392.txt.gz
Not only this meta bug, but bug 1315037 also has an OOM issue. Based on this, is it possible that increasing the installed memory could improve this?
Flags: needinfo?(ryanvm)
Assignee
Comment 61•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae62e77ea73229e2564bfbea927de783cf7d7f8f
PGO mozconfig on a mozilla-beta tree. Can't reproduce the failure any more.
Assignee
Comment 62•8 years ago
(In reply to Vincent Liu[:vliu] from comment #60)
> Not only this meta, bug 1315037 also has OOM issue. Based on this, is it
> possible increasing installed memory can improve this?
It's more likely that we're running out of address space (3GB) rather than physical memory, since I assume we could use swap if it were the latter.
Comment 63•8 years ago
(In reply to Matt Woodrow (:mattwoodrow) from comment #62)
> (In reply to Vincent Liu[:vliu] from comment #60)
>
> > Not only this meta, bug 1315037 also has OOM issue. Based on this, is it
> > possible increasing installed memory can improve this?
>
> It's more likely that we're running out of address space (3GB) rather than
> physical memory, since I assume we could use swap if it were the latter.
The memory layout on 32-bit Windows gives the process 2GB of user address space. I think that's why I always see the address close to the upper bound of user space. Since the available memory is large enough, increasing the installed memory may not help here.
23:43:17 INFO - SkMallocPixelRef::NewUsing size=3200000, addr=7F800000
......
23:43:17 INFO - SkMallocPixelRef::NewUsing size=3200000, addr=7EC00000
23:43:17 INFO - REFTEST TEST-LOAD | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/svg/filters/svg-filter-chains/different-StrokePaint-filter-regions-ref.svg | 11992 / 15159 (79%)
23:43:17 INFO - SkMallocPixelRef::NewUsing size=3200000, addr=00000000
Flags: needinfo?(ryanvm)
Comment 64•8 years ago
(In reply to Matt Woodrow (:mattwoodrow) from comment #61)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=ae62e77ea73229e2564bfbea927de783cf7d7f8f
>
> PGO mozconfig on a mozilla-beta tree. Can't reproduce the failure any more.
Please tell me you're getting this reviewed somewhere? :)
Flags: needinfo?(matt.woodrow)
Assignee
Comment 65•8 years ago
Assignee: nobody → matt.woodrow
Flags: needinfo?(matt.woodrow)
Attachment #8840669 -
Flags: review?(dbaron)
Updated•8 years ago
Attachment #8835804 -
Attachment is obsolete: true
Assignee
Comment 66•8 years ago
Attachment #8840670 -
Flags: review?(dbaron)
Assignee
Comment 67•8 years ago
The canvas cache keeps around copies of reftest results for any urls that are repeated multiple times during the run.
This saves time (don't need to repaint them) at the cost of extra memory usage.
Apparently we really can't afford this extra memory on win7, so this disables it there.
It'll make the reftest run take longer to complete, but it shouldn't be as bad as splitting it into multiple runs.
Attachment #8840671 -
Flags: review?(jmaher)
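A sketch of the kind of gating described here, not the actual attachment: some flag short-circuits the cache so canvases are recycled immediately. UpdateCanvasCache and ReleaseCanvas are the real harness functions mentioned in this bug; the gNoCanvasCache flag and the platform/version check are placeholders (the real patch is described as being driven by prefs).

    // Sketch only: skip the reftest canvas cache on Win7-class machines.
    var gNoCanvasCache = Services.appinfo.OS == "WINNT" &&
                         Services.sysinfo.getProperty("version") == "6.1";  // placeholder check

    function UpdateCanvasCache(url, canvas) {
      if (gNoCanvasCache) {
        ReleaseCanvas(canvas);  // recycle immediately instead of keeping a copy alive
        return;
      }
      // ... existing cache bookkeeping for repeated URLs would remain here ...
    }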
Reporter
Comment 68•8 years ago
Comment on attachment 8840669 [details] [diff] [review]
Part 1: Increase frequency of GC/CC during reftests
>try: -b do -p win32 -u reftest[Windows 7],crashtest[Windows 7] -t none[Windows 7]
I'd suggest a better commit message is the patch description you used in bugzilla, "Increase frequency of GC/CC during reftests". :-)
Attachment #8840669 -
Flags: review?(dbaron) → review+
Comment 69•8 years ago
Comment on attachment 8840671 [details] [diff] [review]
Part 3: Disable reftest canvas cache for win 7.
Review of attachment 8840671 [details] [diff] [review]:
-----------------------------------------------------------------
thanks for the patch and explanation. As win7 is the only 32 bit we test on, are there no concerns on win8 x64?
Attachment #8840671 -
Flags: review?(jmaher) → review+
Reporter
Comment 70•8 years ago
Comment on attachment 8840670 [details] [diff] [review]
Part 2: Fix issues in reftest harness with canvas cache disabled
Could you explain what this is fixing? The AddURIUseCount and UpdateCanvasCache calls seem symmetric both before and after the patch.
Flags: needinfo?(matt.woodrow)
Reporter
Comment 71•8 years ago
I guess it means that we will call ReleaseCanvas() and thus recycle canvases that were drawn when we had prefs set -- but why not do that by just calling ReleaseCanvas directly as an alternative to UpdateCanvasCache?
(When the canvas cache is off, it seems better either to maintain the use counts (which are for the cache) the same as we do when the cache is on, or not to bother with them at all -- this seems to maintain them differently, which seems like unnecessary complexity to me.)
Updated•8 years ago
Assignee
Comment 72•8 years ago
Attachment #8840670 -
Attachment is obsolete: true
Attachment #8840670 -
Flags: review?(dbaron)
Flags: needinfo?(matt.woodrow)
Attachment #8841292 -
Flags: review?(dbaron)
Reporter
Updated•8 years ago
Attachment #8841292 -
Flags: review?(dbaron) → review+
Comment 74•8 years ago
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a519934c735b
Part 1: Increase frequency of GC/CC during reftests. r=dbaron
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bd400108a21
Part 2: Don't track use counts in the reftest harness when the cache is disabled. r=dbaron
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3437d0e66b7
Part 3: Disable reftest canvas cache for win 7. r=jmaher
Comment 75•8 years ago
Looks like this might have helped, but unfortunately, it doesn't appear to have been enough :(
https://treeherder.mozilla.org/logviewer.html#?job_id=81783793&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=81783190&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=81783846&repo=mozilla-inbound
Keywords: leave-open
Comment 76•8 years ago
bugherder
Reporter
Updated•8 years ago
Comment 77•8 years ago
Since the root cause here is memory fragmentation, can we try to fix that? I don't have experience with memory fragmentation profiling tools - njn, can massif or DMD be used to provide this info? In particular I'm wondering if putting certain allocations into an arena or something along those lines would help reduce the fragmentation. That would help both these reftests and Firefox performance in general. I recall recently that Bas did some profiling and discovered that not using arenas in some places was impacting our performance due to CPU cache misses.
Flags: needinfo?(n.nethercote)
Comment hidden (obsolete) |
Comment 79•8 years ago
I landed these patches on Beta and ESR52 to at least cut down on the 800000 pixel failures, even if it's not a full fix.
https://hg.mozilla.org/releases/mozilla-beta/rev/81deafbeaf6b
https://hg.mozilla.org/releases/mozilla-beta/rev/3b911d0ad6be
https://hg.mozilla.org/releases/mozilla-beta/rev/c16cd2b0733e
https://hg.mozilla.org/releases/mozilla-esr52/rev/2bb591035b83
https://hg.mozilla.org/releases/mozilla-esr52/rev/fb382c5c0cf0
https://hg.mozilla.org/releases/mozilla-esr52/rev/9a153f39053d
status-firefox55:
--- → affected
status-firefox-esr52:
--- → affected
Assignee
Comment 80•8 years ago
Hopefully it gets rid of the failures entirely on older branches; it seemed to on my try runs. It's believable that memory usage is worse on Nightly than it is there, but it's also believable that I just didn't do enough reruns to see the failures.
I like Kats' suggestions; applying some memory profiling tools is probably the best way to figure out what's causing the remaining fragmentation, and then we can try to figure out a solution.
Flags: needinfo?(matt.woodrow)
Comment 81•8 years ago
I don't know any tools aimed at diagnosing memory fragmentation, sorry. At a minimum you need to know where in memory all the blocks are, which is something memory profilers don't normally show. Beyond that you could take into account lifetimes of all the blocks. It's a hard problem in general.
Flags: needinfo?(n.nethercote)
Comment 82•8 years ago
After bug 1344991 landed, we'll be able to restart Firefox after a crash and continue with the rest of the tests. Maybe we can modify it a bit and restart from the intermittent test? But then we need to be able to identify the cause of that kind of failure reliably, and we need to think about how to report that on Treeherder.
Comment 83•8 years ago
That might make the consequences of this slightly less awful, but ultimately it won't do anything to reduce the overall failure rate of the test suite. Seems to me that something akin to run-by-dir mode for reftests (closing and restarting the browser between directories like we do for mochitests) is more likely to make a meaningful difference here if we're unable to fix this at the platform level.
Comment 84•8 years ago
++ for a run-by-dir mode, makes chunking more consistent
Comment 85•8 years ago
I think it's time to hide these across all trees or just turn them off if we can't prioritize coming up with a fix here. See Aurora below for example.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=win%20vm-gfx%20ref&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=win%20vm-gfx%20ref%20pgo&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&filter-searchStr=win%20vm-gfx%20ref%20pgo&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
This has been happening for months now, and this is not how Tier 1 test suites are supposed to be treated. If everyone's too busy to work on this, let's stop wasting AWS resources and turn them off.
Flags: needinfo?(dbaron)
Comment 86•8 years ago
The only reason I haven't hidden them is because they run on a precious and limited sort of AWS instance, and I know from experience that hiding them just means we would leave them running hidden and ignored and consuming limited resources for months longer. But I would completely support either shutting them off, or moving them back onto hardware to see whether we've also broken our ability to run them there.
Comment 87•8 years ago
I don't see any green jobs in the last 4 weeks on m-c, but inbound and autoland have plenty of green and much fewer failures.
We do have limited instances of the gfx machines, are we getting value from these jobs running? Can we find regressions, or do we just assume they always fail and ignore the fine details?
Comment 88•8 years ago
Looking through recent failures, it looks like this only happens on non-e10s runs. If that's not true, would appreciate seeing a counter-example.
Similarly, I see lots of win7-opt and win7-pgo examples, but no win7-debug. Is that consistently true?
If those observations are correct, do we feel that is consistent with our memory fragmentation theory?
Comment 89•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #83)
> Seems to me that something akin to run-by-dir mode for reftests (closing and
> restarting the browser between directories like we do for mochitests) is
> more likely to make a meaningful difference here if we're unable to fix this
> at the platform level.
Examples like
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32/1491151822/mozilla-central_win7_vm_gfx_test-reftest-bm139-tests1-windows-build309.txt.gz
make me wonder if that would be effective. The job was only 12% complete when we hit this failure.
On the other hand, the majority (based on a very small sample size!) of such failures seem to happen in the last 25% of the job...so maybe.
win7 opt/pgo reftests run in a single chunk. Has anyone even tried simply running in multiple chunks?
Comment 90•8 years ago
I think 8 chunks would be a good place to start :) Unfortunately this is still 100% via buildbot, which makes it harder to experiment with on try server.
Reporter
Comment 91•8 years ago
So I wasn't aware that things had gotten worse -- that information doesn't appear to have previously been in this bug. We really need a better way to get information about failures that happen across multiple tests into a single bug so that we can understand their severity correctly. This bug doesn't have any comments from OrangeFactor Robot, and neither do any of the other bugs feeding into it.
Given that it has gotten worse, it would be useful to get a regression range on when it got worse. But comment 87 is very disturbing -- why do we see different statistics on mozilla-central vs. the integration repositories that lead into it? We could also use a regression range on when it started. Getting any regression ranges here probably requires bisection on try, given that the starring is probably inaccurate. But does comment 87 suggest that maybe we won't get correct data on try?
If we do need to disable the reftest suite on Windows, I think we can't land changes to layout. So if you want to disable the suite, I'd ask that you also add a repository hook on all trees where it's disabled (and integration branches leading to those trees) that disallows pushes with changes in layout/.
Flags: needinfo?(dbaron)
Reporter
Comment 92•8 years ago
Jet, this probably needs a new assignee.
Comment 93•8 years ago
(In reply to Geoff Brown [:gbrown] from comment #89)
> win7 opt/pgo reftests run in a single chunk. Has anyone even tried simply
> running in multiple chunks?
We did in comment 49. Today we also tried increasing the number of chunks to 4 on Ash. It does in fact appear to help quite a bit, though we still ended up with a couple 800000 pixel failures on the w3c-css directory. It would also add roughly 6-7min *per chunk* of extra overhead to each push as well, so the costs from a resource utilization standpoint would be non-trivial.
Comment 94•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #91)
> If we do need to disable the reftest suite on Windows, I think we can't land
> changes to layout. So if you want to disable the suite, I'd ask that you
> also add a repository hook on all trees where it's disabled (and integration
> branches leading to those trees) that disallows pushes with changes in
> layout/.
Let's please not disable the reftest suite.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #92)
> Jet, this probably needs a new assignee.
Bas: can you help here? We're getting 32-bit Windows surface allocation failures for the canvas2d used for reftest screenshots (reports show SKIA & DIRECT2D1_1 failures.) We've tried various hacks to split up the tests and force GC between runs but it still happens.
Updated•8 years ago
Flags: needinfo?(bugs) → needinfo?(bas)
Comment 95•8 years ago
(In reply to Jet Villegas (:jet) from comment #94)
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #91)
> > If we do need to disable the reftest suite on Windows, I think we can't land
> > changes to layout. So if you want to disable the suite, I'd ask that you
> > also add a repository hook on all trees where it's disabled (and integration
> > branches leading to those trees) that disallows pushes with changes in
> > layout/.
>
> Let's please not disable the reftest suite.
>
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #92)
> > Jet, this probably needs a new assignee.
>
> Bas: can you help here? We're getting 32-bit Windows surface allocation
> failures for the canvas2d used for reftest screenshots (reports show SKIA &
> DIRECT2D1_1 failures.) We've tried various hacks to split up the tests and
> force GC between runs but it still happens.
If these failures happen both for Skia and D2D1, and especially if they manifest in 32-bit, it seems likely these are just OOMs? At least when D2D fails we should be reporting the failure codes for the surface creation in the gfx error log.
Flags: needinfo?(bas)
Updated•8 years ago
Whiteboard: gfx-noted → gfx-noted, [Memshrink]
Comment 96•8 years ago
MemShrink review is OK with a periodic browser restart. I'll look into getting that wired up.
Flags: needinfo?(bugs)
Whiteboard: gfx-noted, [Memshrink] → gfx-noted
Comment 97•8 years ago
Running the tests one directory at a time is a logical way to split things up. I filed bug 1353461 for that; it might not get done in the next week or so, though.
Comment hidden (Intermittent Failures Robot) |
Comment 99•7 years ago
FWIW, I've been seeing these fail on win7 opt VM without a GFX failure message, but with:
JavaScript error: chrome://reftest/content/reftest.jsm, line 1748: NS_ERROR_FAILURE:
The line number in reftest.jsm seems a bit bogus, though.
For example:
https://treeherder.mozilla.org/logviewer.html#?job_id=102242688&repo=mozilla-central&lineNumber=61669
Comment 100•7 years ago
If you look at the reftest.jsm file that ends up in the objdir
https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/_tests/reftest/reftest/chrome/reftest/content/reftest.jsm#1748
that line number points to
var image1 = gCanvas1.toDataURL();
which seems plausible.
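That line is consistent with the failure mode discussed earlier: toDataURL() has to allocate and encode a full copy of the canvas, so it can throw NS_ERROR_FAILURE under the same address-space pressure. Purely as an illustration (not a real reftest.jsm change), a harness-side guard could look like this; dump() is a stand-in for whatever logging the harness actually uses.

    // Illustrative guard around the snapshot call.
    var image1;
    try {
      image1 = gCanvas1.toDataURL();
    } catch (e) {
      dump("REFTEST ERROR | snapshot failed, likely OOM/address-space fragmentation: " + e + "\n");
      throw e;  // let the harness report the failure as usual
    }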
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
Priority: -- → P3
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
Flags: needinfo?(bugs)
Comment 112•6 years ago
The leave-open keyword is set and there has been no activity for 6 months.
:mattwoodrow, maybe it's time to close this bug?
Flags: needinfo?(matt.woodrow)
Assignee
Comment 113•6 years ago
We're not running Win7 PGO tests any more, so I don't think we'll work on this.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(matt.woodrow)
Resolution: --- → WONTFIX
Updated•6 years ago
Keywords: leave-open
Comment 114•6 years ago
We really are; it is win7-shippable now.
Updated•2 years ago