Leading guard pages for normal allocations
Categories
(Core :: Memory Allocator, enhancement, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox96 | --- | fixed |
People
(Reporter: gcp, Assigned: gcp, NeedInfo)
References
(Depends on 1 open bug)
Details
(Keywords: perf-alert, Whiteboard: [no-nag])
Attachments
(3 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/x-phabricator-request
|
Details |
Assignee | ||
Comment 1•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 2•5 years ago
|
||
Assignee | ||
Comment 3•5 years ago
|
||
Assignee | ||
Comment 4•5 years ago
|
||
Assignee | ||
Comment 5•5 years ago
|
||
Assignee | ||
Comment 6•5 years ago
|
||
Comment 8•5 years ago
|
||
bugherder |
Comment 9•5 years ago
|
||
Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.
Comment 10•5 years ago
|
||
Comment 11•5 years ago
|
||
Backed out changeset b1faa1af2967 (Bug 1546442) for causing a spike in Bug 1475812 and Bug 1471001.
Backout link: https://hg.mozilla.org/mozilla-central/rev/92748f5caf89eead6ae37e57d822359f256a08cf
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=b1faa1af2967cdf347a2e8a4f860168777a3fecf&selectedJob=258372478
Failure logs:
Assignee | ||
Comment 12•5 years ago
|
||
So changing our memory layout causes 2 tests that are intermittent failures to be slightly more intermittent failures?
I'm not even sure what there's actionable for me here. Certainly nothing in this patch.
Comment 13•5 years ago
|
||
Failure A:
[task 2019-07-25T20:33:15.241Z] 20:33:15 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance2__uniforms__incompatible-texture-type-for-sampler.html
[task 2019-07-25T20:33:15.478Z] 20:33:15 INFO - GECKO(1628) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 2064: SyntaxError: bad trailing UTF-8 byte 0xE5 doesn't match the pattern 0b10xxxxxx
[task 2019-07-25T20:33:15.478Z] 20:33:15 INFO - GECKO(1628) | JavaScript note: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 2064: the code units comprising this invalid code point were: 0xE5 0xE5
Failure B:
[task 2019-07-25T20:33:05.771Z] 20:33:05 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance2__textures__misc__npot-video-sizing.html
[task 2019-07-25T20:33:05.915Z] 20:33:05 INFO - GECKO(1632) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 3012: SyntaxError: bad trailing UTF-8 byte 0xE5 doesn't match the pattern 0b10xxxxxx
[task 2019-07-25T20:33:05.915Z] 20:33:05 INFO - GECKO(1632) | JavaScript note: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 3012: the code units comprising this invalid code point were: 0xE5 0xE5
:waldo added this UTF-8 decoding error.
Also, 0xe5e5, you say? "jemalloc freed memory":
https://searchfox.org/mozilla-central/rev/1dfd70469212ef2785d41827c5532c571c784227/js/src/gdb/mozilla/JSString.py#54
Comment 14•5 years ago
|
||
Sooooooooo, what's happening is ScriptLoadHandler
downloads bytes of scripts incrementally. Each increment of bytes, it uses a Decoder
to convert from a Necko-provided memory range, into a destination buffer containing validly-encoded data in ScriptLoadRequest
: UTF-16 if !StaticPrefs::dom_script_loader_external_scripts_utf8_parsing_enabled()
, UTF-8 otherwise. (ScriptLoadRequest::SetTextSource
makes this determination, and then processing each increment of data will do a runtime type-check of what kind of destination buffer was constructed.) The resulting code units are then, ultimately, passed directly to a JSAPI function that interprets them using the appropriate encoding.
Interpretation as UTF-16 can always understand the code units passed in, because JS uses WTF-16 and so even unpaired surrogates get interpreted as code points of identical value. If we got 0xE5E5 memory there -- Decoder
would only splat that if the code point were encoded in the downloaded data -- that'd just be a U+E5E5 PRIVATE USE CODEPOINT that would be permissible JS if it appeared in the right places. (I think it'd only be allowed inside string literals or comments, but there might be a valid context I'm forgetting.)
But JS doesn't know about UTF-8 at all -- it assumes it's been handed WTF-16 -- and so the way we do it, interpretation as UTF-8 that encounters invalid UTF-8 is just an immediate error. No attempt to imply replacement-character semantics or anything. Just, error. And 0xE5 0xE5 in succession is not valid UTF-8, so we error.
So Gecko for some reason is passing in what it thinks is valid UTF-8, but somehow it contains 0xE5 0xE5, and SpiderMonkey's tokenizer carefully recognizes this as invalid UTF-8 and so just reports a syntax error. This is good! It's safe! But presumably the 0xE5 comes from poison patterns, yeah, which means we have some GIGO happening.
Where? Good question. If this could be caught in rr
we would be absolutely hunky-dory. Perhaps I could do a browser build and give it a whirl or something, but it's gonna take some time (and also it's Friday night and so arguably this not how I should use my time...)...
Comment 15•5 years ago
|
||
Some of them are a slightly-garbled out-of-memory, but the "MEMORY STAT" lines don't paint a picture of a memory-starved system.
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258372496&repo=autoland&lineNumber=3529
[task 2019-07-25T20:33:14.380Z] 20:33:14 INFO - GECKO(1647) | MEMORY STAT | vsize 7667MB | residentFast 266MB | heapAllocated 67MB
[task 2019-07-25T20:33:14.380Z] 20:33:14 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__context__context-no-alpha-fbo-with-alpha.html | took 221ms
[task 2019-07-25T20:33:14.401Z] 20:33:14 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__context__context-release-upon-reload.html
[task 2019-07-25T20:33:14.764Z] 20:33:14 INFO - GECKO(1647) | JavaScript error: , line 0: uncaught exception: out of memory
[task 2019-07-25T20:33:14.806Z] 20:33:14 INFO - GECKO(1647) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/context/resources/context-release-upon-reload-child.html, line 54: ReferenceError: WebGLTestUtils is not defined
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258372482&repo=autoland&lineNumber=5496
[task 2019-07-25T20:34:42.913Z] 20:34:42 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-15.html
[task 2019-07-25T20:34:42.966Z] 20:34:42 INFO - GECKO(1649) | MEMORY STAT | vsize 7659MB | residentFast 300MB | heapAllocated 48MB
[task 2019-07-25T20:34:43.014Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-15.html | took 142ms
[task 2019-07-25T20:34:43.030Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-16.html
[task 2019-07-25T20:34:43.127Z] 20:34:43 INFO - GECKO(1649) | MEMORY STAT | vsize 7659MB | residentFast 300MB | heapAllocated 47MB
[task 2019-07-25T20:34:43.170Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-16.html | took 141ms
[task 2019-07-25T20:34:43.171Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-17.html
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - GECKO(1649) | MEMORY STAT | vsize 7654MB | residentFast 300MB | heapAllocated 41MB
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-17.html | took 144ms
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__null-uniform-location.html
[task 2019-07-25T20:35:20.196Z] 20:35:20 INFO - GECKO(1649) | JavaScript error: , line 0: uncaught exception: out of memory
[task 2019-07-25T20:35:21.213Z] 20:35:21 INFO - GECKO(1649) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/uniforms/null-uniform-location.html?webglVersion=2, line 44: ReferenceError: WebGLTestUtils is not defined
[task 2019-07-25T20:35:21.213Z] 20:35:21 INFO - TEST-INFO | started process screencapture
Comment 16•5 years ago
|
||
So my local Linux build of b1faa1af2967 (per comment 11) does not reproduce any UTF-8 encoding errors that manifest as test failures. Tried running it once standalone, a couple times, no dice. Tried with --run-until-failure --keep-open=false
and nothing failed in the default-30 runs that ensued. And jgilbert sez this was a Mac failure anyway, so maybe this is not too surprising an outcome. So much for rr
.
We know where the UTF-8 error happens, and it's in an out-of-the-way place. It would be easy to just MOZ_CRASH("boo-urns!");
there. That would give a stack trace...but only the mundane sort of stack trace you'd get for any <script src="...">
element being executed, so probably that isn't helpful.
We're inside JS so it's hard to get to Gecko-specific things, but it would be trivial to add temporary JSAPI to set some sort of callback thingy to let us inspect/do whatever we want to do, temporarily. What would be useful to inspect there? Printing out the memory of the script being executed is easy enough, but it seems likely to be <valid script prefix>E5E5E5E5...
basically which is not going to tell us much. Or maybe we shouldn't assume too much, and we should just do that.
I dunno what we want to examine once the crime's been committed, but it ought be very easy to examine whatever we want when we're in that state. If I supplied a patch that would just crash when we got into this error state, maybe someone on a Mac could run the test, wait for a failure that seems to be very-frequent, then poke around and see what can be discovered using a full-fledged debugger.
Volunteers, perhaps?
Comment 17•5 years ago
|
||
FWIW, this isn't making an existing intermittent that I know of worse, but rather seems to introduce a new intermittent on the mac+opt+webgl2-core job, which is not one test, but about 500 tests:
https://searchfox.org/mozilla-central/source/dom/canvas/test/webgl-conf/generated-mochitest.ini
Comment 18•5 years ago
|
||
Comment 19•5 years ago
|
||
I have a Mac I can try this on, on Monday.
Comment 20•5 years ago
|
||
I'm running out of time that I can volunteer for this.
I downloaded the affected+crashing build ( https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=258865090&revision=3f2e1e46f07cb2e6a1789c5258e8a9e8052d697e ) and tried running normal webgl2 conformance tests, but it wouldn't crash.
I tried downloading the .dmp file but WinDBG (which seems to be the go-to consumer, even across platforms?) wouldn't consume it, and I'm not about to wager another hour on the dozen step approach to trying to decant it into gdb based on Chromiums docs and hope it works on Gecko.
I waited 45 minutes for a local opt build, then another 50 for an opt build with tests enabled, and then ran into bug 1569383.
My final attempt tomorrow will be s/MOZ_ASSERT/MOZ_RELEASE_ASSERT/ and adding MOZ_CRASHes to try to get Try to spill the beans.
Assignee | ||
Comment 21•5 years ago
|
||
I wonder if a macOS ASAN build would have more luck finding the source of the (what looks like) UAF. We (used to?) support ASAN on macOS but don't run it in CI.
Comment 22•5 years ago
|
||
There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:gcp, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 23•5 years ago
|
||
Ohhh bots...
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 24•5 years ago
|
||
Assignee | ||
Comment 25•5 years ago
|
||
Looks like aggressively enabling PHC makes the problem go away. Sigh!
Comment 26•5 years ago
|
||
(In reply to Gian-Carlo Pascutto [:gcp] from comment #25)
Looks like aggressively enabling PHC makes the problem go away. Sigh!
It looks like it's still getting the problematic issue:
[task 2019-09-02T08:54:19.903Z] 08:54:19 INFO - GECKO(1650) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/canvas/to-data-url-test.html?webglVersion=2, line 42: ReferenceError: WebGLTestUtils is not defined
Assignee | ||
Comment 27•5 years ago
|
||
It looks like it's still getting the problematic issue:
Yes, intermittently.
That said, I finally managed to make some progress. I can make it hit the assertion, confirming that the bad units are 0xE5 (which isn't elsewhere visible in the log). So I think I can now start working my way up the callstack to see where the bad data is coming from.
[task 2019-09-06T15:20:23.511Z] 15:20:23 INFO - GECKO(1647) | Hit MOZ_CRASH(UTF-8 bad units: 0xE5 0xE5) at /builds/worker/workspace/build/src/js/src/frontend/TokenStream.cpp:1069
[task 2019-09-06T15:20:23.511Z] 15:20:23 INFO - GECKO(1647) | Hit MOZ_CRASH(UTF-8 encoding error unexpectedly detected -- have a breakpoint) at /builds/worker/workspace/build/src/js/src/frontend/TokenStream.cpp:1074
Assignee | ||
Comment 28•5 years ago
|
||
So I tracked this down a bit more:
[task 2019-09-12T09:55:41.337Z] 09:55:41 INFO - GECKO(1645) | [Child 1648, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/workspace/build/src/dom/script/ScriptLoader.cpp, line 1555
[task 2019-09-12T09:55:41.337Z] 09:55:41 INFO - GECKO(1645) | [Child 1648, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/workspace/build/src/dom/script/ScriptLoader.cpp, line 1555
[task 2019-09-12T09:55:41.351Z] 09:55:41 INFO - GECKO(1645) | Hit MOZ_CRASH(UTF-8 bad units: e5e5e5e5) at /builds/worker/workspace/build/src/js/src/frontend/BytecodeCompiler.cpp:926
The latter is CompileLazyFunctionImpl. I added some more booby traps in JSScript, but these were apparently not hit in this callpath. The check failing in ScriptLoader is also interesting. if the bad data is already there, that would cut out a large part of the search. But it seems the test already generates that warning by itself.
Assignee | ||
Comment 29•5 years ago
|
||
I'm a bit stuck here. I can make this bomb out about 50% of the time on try with guards detecting e5s5s5 memory, but never locally, which precludes using rr or so to find the offensive users and allocation: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6391d354d8a0db202ef80a86da8d4ef41cdd64f3
I digged through the code but at some point we hit the code that is lazily compiling script functions that weren't compiled earlier, and also deals with thigs like compressing then source (and uncompressing on-demand) to save memory. That does look like a potential rat nest from a lifetime perspective. I tried to bomb out on the source that is fed to the JS engine to see if it's in the network stack, but no luck there. I'm also not very sure where to try to bomb specifically.
Waldo, given the above try run, that hits added assertions checking for UAF markers, can you investigate, or give concrete advice how to narrow this down? It's pretty deep in the JS engine at this point.
Assignee | ||
Comment 30•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 31•3 years ago
|
||
Looks like the UAF in the JS engine is gone: https://treeherder.mozilla.org/jobs?repo=try&revision=5ff766ccd36460d2191d8c2b13d6302c92000510
I'll check perf but we may be able to land this now.
Assignee | ||
Comment 32•3 years ago
|
||
Assignee | ||
Comment 33•3 years ago
|
||
Comment 34•3 years ago
|
||
Hrm, some performance really did move around:
But why?! If you show the unconfident runs on the browsertime set there's some slowdowns, but I'd like to see more runs.
Assignee | ||
Comment 35•3 years ago
|
||
I'd like to see more runs.
The sherrifs killed the original set because of the resource usage :-/ Pretty hard to do performance measurement like this.
Assignee | ||
Comment 36•3 years ago
|
||
Hrm, some performance really did move around:
Most of those are speedups, which of course makes no sense.
Assignee | ||
Comment 37•3 years ago
|
||
With re-triggering more tests, I think the results are good enough, no real regressions.
Assignee | ||
Comment 38•3 years ago
|
||
Depends on D28516
Updated•3 years ago
|
Comment 39•3 years ago
|
||
Comment 40•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/a5a3d9d8a54e
https://hg.mozilla.org/mozilla-central/rev/b7ce86e48a0c
Comment 41•3 years ago
|
||
(In reply to Pulsebot from comment #39)
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5a3d9d8a54e
Leading guard pages for normal allocations. r=glandium
https://hg.mozilla.org/integration/autoland/rev/b7ce86e48a0c
Add test for leading guard pages for normal allocations. r=glandium,pbone
== Change summary for alert #32566 (as of Mon, 29 Nov 2021 14:32:41 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
55% | Base Content Resident Unique Memory | macosx1015-64-shippable-qr | 18,036,650.67 -> 8,126,464.00 | |
55% | Base Content Resident Unique Memory | macosx1015-64-shippable-qr | 17,904,213.33 -> 8,134,997.33 | |
21% | Resident Memory | macosx1015-64-shippable-qr | tp6 | 943,611,969.84 -> 746,658,068.58 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32566
Comment 42•3 years ago
|
||
What even are these memory "improvements"? Not complaining I guess, but, what??
Comment 43•3 years ago
|
||
I know right?
All I know is that measuring resident memory on MacOS is weird. It's already different with a few different madvise
options. Maybe this is related, but maybe it's also something else entirely because I don't get it.
Assignee | ||
Comment 44•3 years ago
|
||
Looking at the origin of those values, it's this code:
https://searchfox.org/mozilla-central/rev/e9cd2997be1071b9bb76fc14df0f01a2bd721c30/xpcom/base/nsMemoryReporterManager.cpp#459
Given that it's not a simple OS query, and the values look nonsensical, I'm going to file a bug to further investigate.
Assignee | ||
Updated•3 years ago
|
Comment 45•3 years ago
|
||
(In reply to Pulsebot from comment #39)
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5a3d9d8a54e
Leading guard pages for normal allocations. r=glandium
https://hg.mozilla.org/integration/autoland/rev/b7ce86e48a0c
Add test for leading guard pages for normal allocations. r=glandium,pbone
== Change summary for alert #32548 (as of Mon, 29 Nov 2021 07:23:48 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
18% | perf_reftest_singletons link-style-cache-1.html | macosx1015-64-shippable-qr | e10s stylo webrender | 354.05 -> 291.05 |
18% | perf_reftest_singletons link-style-cache-1.html | macosx1015-64-shippable-qr | e10s fission stylo webrender | 354.72 -> 292.01 |
16% | perf_reftest_singletons link-style-cache-1.html | macosx1015-64-shippable-qr | e10s stylo webrender | 355.88 -> 299.11 |
12% | perf_reftest_singletons inline-style-cache-1.html | macosx1015-64-shippable-qr | e10s fission stylo webrender | 1,083.12 -> 953.22 |
12% | perf_reftest_singletons inline-style-cache-1.html | macosx1015-64-shippable-qr | e10s stylo webrender | 1,058.36 -> 934.55 |
... | ... | ... | ... | ... |
4% | tsvgx | macosx1015-64-shippable-qr | e10s stylo webrender-sw | 140.17 -> 135.00 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32548
Comment 46•3 years ago
|
||
== Change summary for alert #32569 (as of Mon, 29 Nov 2021 15:18:23 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
3% | twitch loadtime | macosx1015-64-shippable-qr | warm webrender | 522.50 -> 506.12 |
3% | twitch dcf | macosx1015-64-shippable-qr | warm webrender | 447.00 -> 435.04 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32569
Updated•3 years ago
|
Description
•