Closed Bug 1546442 Opened 6 years ago Closed 3 years ago

Leading guard pages for normal allocations

Categories

(Core :: Memory Allocator, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox96 --- fixed

People

(Reporter: gcp, Assigned: gcp, NeedInfo)

References

(Depends on 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [no-nag])

Attachments

(3 files)

No description provided.
Depends on: 1446040
Assignee: nobody → gpascutto
Pushed by gpascutto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b1faa1af2967 Leading guard pages for normal allocations. r=glandium
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

Backout by malexandru@mozilla.com: https://hg.mozilla.org/mozilla-central/rev/92748f5caf89 Backed out changeset b1faa1af2967 for causing a spike in bug 1475812 and bug 1471001 a=backout

So changing our memory layout causes 2 tests that are intermittent failures to be slightly more intermittent failures?

I'm not even sure what there's actionable for me here. Certainly nothing in this patch.

Flags: needinfo?(gpascutto)

Failure A:

[task 2019-07-25T20:33:15.241Z] 20:33:15 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance2__uniforms__incompatible-texture-type-for-sampler.html
[task 2019-07-25T20:33:15.478Z] 20:33:15 INFO - GECKO(1628) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 2064: SyntaxError: bad trailing UTF-8 byte 0xE5 doesn't match the pattern 0b10xxxxxx
[task 2019-07-25T20:33:15.478Z] 20:33:15 INFO - GECKO(1628) | JavaScript note: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 2064: the code units comprising this invalid code point were: 0xE5 0xE5

Failure B:

[task 2019-07-25T20:33:05.771Z] 20:33:05 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance2__textures__misc__npot-video-sizing.html
[task 2019-07-25T20:33:05.915Z] 20:33:05 INFO - GECKO(1632) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 3012: SyntaxError: bad trailing UTF-8 byte 0xE5 doesn't match the pattern 0b10xxxxxx
[task 2019-07-25T20:33:05.915Z] 20:33:05 INFO - GECKO(1632) | JavaScript note: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js, line 3012: the code units comprising this invalid code point were: 0xE5 0xE5

:waldo added this UTF-8 decoding error.

Also, 0xe5e5, you say? "jemalloc freed memory":
https://searchfox.org/mozilla-central/rev/1dfd70469212ef2785d41827c5532c571c784227/js/src/gdb/mozilla/JSString.py#54

Flags: needinfo?(jwalden)

Sooooooooo, what's happening is ScriptLoadHandler downloads bytes of scripts incrementally. Each increment of bytes, it uses a Decoder to convert from a Necko-provided memory range, into a destination buffer containing validly-encoded data in ScriptLoadRequest: UTF-16 if !StaticPrefs::dom_script_loader_external_scripts_utf8_parsing_enabled(), UTF-8 otherwise. (ScriptLoadRequest::SetTextSource makes this determination, and then processing each increment of data will do a runtime type-check of what kind of destination buffer was constructed.) The resulting code units are then, ultimately, passed directly to a JSAPI function that interprets them using the appropriate encoding.

Interpretation as UTF-16 can always understand the code units passed in, because JS uses WTF-16 and so even unpaired surrogates get interpreted as code points of identical value. If we got 0xE5E5 memory there -- Decoder would only splat that if the code point were encoded in the downloaded data -- that'd just be a U+E5E5 PRIVATE USE CODEPOINT that would be permissible JS if it appeared in the right places. (I think it'd only be allowed inside string literals or comments, but there might be a valid context I'm forgetting.)

But JS doesn't know about UTF-8 at all -- it assumes it's been handed WTF-16 -- and so the way we do it, interpretation as UTF-8 that encounters invalid UTF-8 is just an immediate error. No attempt to imply replacement-character semantics or anything. Just, error. And 0xE5 0xE5 in succession is not valid UTF-8, so we error.

So Gecko for some reason is passing in what it thinks is valid UTF-8, but somehow it contains 0xE5 0xE5, and SpiderMonkey's tokenizer carefully recognizes this as invalid UTF-8 and so just reports a syntax error. This is good! It's safe! But presumably the 0xE5 comes from poison patterns, yeah, which means we have some GIGO happening.

Where? Good question. If this could be caught in rr we would be absolutely hunky-dory. Perhaps I could do a browser build and give it a whirl or something, but it's gonna take some time (and also it's Friday night and so arguably this not how I should use my time...)...

Some of them are a slightly-garbled out-of-memory, but the "MEMORY STAT" lines don't paint a picture of a memory-starved system.

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258372496&repo=autoland&lineNumber=3529

[task 2019-07-25T20:33:14.380Z] 20:33:14 INFO - GECKO(1647) | MEMORY STAT | vsize 7667MB | residentFast 266MB | heapAllocated 67MB
[task 2019-07-25T20:33:14.380Z] 20:33:14 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__context__context-no-alpha-fbo-with-alpha.html | took 221ms
[task 2019-07-25T20:33:14.401Z] 20:33:14 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__context__context-release-upon-reload.html
[task 2019-07-25T20:33:14.764Z] 20:33:14 INFO - GECKO(1647) | JavaScript error: , line 0: uncaught exception: out of memory
[task 2019-07-25T20:33:14.806Z] 20:33:14 INFO - GECKO(1647) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/context/resources/context-release-upon-reload-child.html, line 54: ReferenceError: WebGLTestUtils is not defined

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258372482&repo=autoland&lineNumber=5496

[task 2019-07-25T20:34:42.913Z] 20:34:42 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-15.html
[task 2019-07-25T20:34:42.966Z] 20:34:42 INFO - GECKO(1649) | MEMORY STAT | vsize 7659MB | residentFast 300MB | heapAllocated 48MB
[task 2019-07-25T20:34:43.014Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-15.html | took 142ms
[task 2019-07-25T20:34:43.030Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-16.html
[task 2019-07-25T20:34:43.127Z] 20:34:43 INFO - GECKO(1649) | MEMORY STAT | vsize 7659MB | residentFast 300MB | heapAllocated 47MB
[task 2019-07-25T20:34:43.170Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-16.html | took 141ms
[task 2019-07-25T20:34:43.171Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-17.html
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - GECKO(1649) | MEMORY STAT | vsize 7654MB | residentFast 300MB | heapAllocated 41MB
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__no-over-optimization-on-uniform-array-17.html | took 144ms
[task 2019-07-25T20:34:43.328Z] 20:34:43 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_2_conformance__uniforms__null-uniform-location.html
[task 2019-07-25T20:35:20.196Z] 20:35:20 INFO - GECKO(1649) | JavaScript error: , line 0: uncaught exception: out of memory
[task 2019-07-25T20:35:21.213Z] 20:35:21 INFO - GECKO(1649) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/uniforms/null-uniform-location.html?webglVersion=2, line 44: ReferenceError: WebGLTestUtils is not defined
[task 2019-07-25T20:35:21.213Z] 20:35:21 INFO - TEST-INFO | started process screencapture

Regressions: 1568807

So my local Linux build of b1faa1af2967 (per comment 11) does not reproduce any UTF-8 encoding errors that manifest as test failures. Tried running it once standalone, a couple times, no dice. Tried with --run-until-failure --keep-open=false and nothing failed in the default-30 runs that ensued. And jgilbert sez this was a Mac failure anyway, so maybe this is not too surprising an outcome. So much for rr.

We know where the UTF-8 error happens, and it's in an out-of-the-way place. It would be easy to just MOZ_CRASH("boo-urns!"); there. That would give a stack trace...but only the mundane sort of stack trace you'd get for any <script src="..."> element being executed, so probably that isn't helpful.

We're inside JS so it's hard to get to Gecko-specific things, but it would be trivial to add temporary JSAPI to set some sort of callback thingy to let us inspect/do whatever we want to do, temporarily. What would be useful to inspect there? Printing out the memory of the script being executed is easy enough, but it seems likely to be <valid script prefix>E5E5E5E5... basically which is not going to tell us much. Or maybe we shouldn't assume too much, and we should just do that.

I dunno what we want to examine once the crime's been committed, but it ought be very easy to examine whatever we want when we're in that state. If I supplied a patch that would just crash when we got into this error state, maybe someone on a Mac could run the test, wait for a failure that seems to be very-frequent, then poke around and see what can be discovered using a full-fledged debugger.

Volunteers, perhaps?

Flags: needinfo?(jwalden)

FWIW, this isn't making an existing intermittent that I know of worse, but rather seems to introduce a new intermittent on the mac+opt+webgl2-core job, which is not one test, but about 500 tests:
https://searchfox.org/mozilla-central/source/dom/canvas/test/webgl-conf/generated-mochitest.ini

I have a Mac I can try this on, on Monday.

I'm running out of time that I can volunteer for this.

I downloaded the affected+crashing build ( https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=258865090&revision=3f2e1e46f07cb2e6a1789c5258e8a9e8052d697e ) and tried running normal webgl2 conformance tests, but it wouldn't crash.

I tried downloading the .dmp file but WinDBG (which seems to be the go-to consumer, even across platforms?) wouldn't consume it, and I'm not about to wager another hour on the dozen step approach to trying to decant it into gdb based on Chromiums docs and hope it works on Gecko.

I waited 45 minutes for a local opt build, then another 50 for an opt build with tests enabled, and then ran into bug 1569383.

My final attempt tomorrow will be s/MOZ_ASSERT/MOZ_RELEASE_ASSERT/ and adding MOZ_CRASHes to try to get Try to spill the beans.

I wonder if a macOS ASAN build would have more luck finding the source of the (what looks like) UAF. We (used to?) support ASAN on macOS but don't run it in CI.

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:gcp, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(gpascutto)

Ohhh bots...

Flags: needinfo?(gpascutto)
Whiteboard: [no-nag]

(In reply to Gian-Carlo Pascutto [:gcp] from comment #25)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d49ef2aea484f7177bc1e097194e7728d3d4386&selectedJob=264321401

Looks like aggressively enabling PHC makes the problem go away. Sigh!

It looks like it's still getting the problematic issue:

[task 2019-09-02T08:54:19.903Z] 08:54:19     INFO - GECKO(1650) | JavaScript error: http://mochi.test:8888/tests/dom/canvas/test/webgl-conf/checkout/conformance/canvas/to-data-url-test.html?webglVersion=2, line 42: ReferenceError: WebGLTestUtils is not defined

It looks like it's still getting the problematic issue:

Yes, intermittently.

That said, I finally managed to make some progress. I can make it hit the assertion, confirming that the bad units are 0xE5 (which isn't elsewhere visible in the log). So I think I can now start working my way up the callstack to see where the bad data is coming from.

[task 2019-09-06T15:20:23.511Z] 15:20:23 INFO - GECKO(1647) | Hit MOZ_CRASH(UTF-8 bad units: 0xE5 0xE5) at /builds/worker/workspace/build/src/js/src/frontend/TokenStream.cpp:1069
[task 2019-09-06T15:20:23.511Z] 15:20:23 INFO - GECKO(1647) | Hit MOZ_CRASH(UTF-8 encoding error unexpectedly detected -- have a breakpoint) at /builds/worker/workspace/build/src/js/src/frontend/TokenStream.cpp:1074

So I tracked this down a bit more:

[task 2019-09-12T09:55:41.337Z] 09:55:41 INFO - GECKO(1645) | [Child 1648, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/workspace/build/src/dom/script/ScriptLoader.cpp, line 1555
[task 2019-09-12T09:55:41.337Z] 09:55:41 INFO - GECKO(1645) | [Child 1648, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/workspace/build/src/dom/script/ScriptLoader.cpp, line 1555
[task 2019-09-12T09:55:41.351Z] 09:55:41 INFO - GECKO(1645) | Hit MOZ_CRASH(UTF-8 bad units: e5e5e5e5) at /builds/worker/workspace/build/src/js/src/frontend/BytecodeCompiler.cpp:926

The latter is CompileLazyFunctionImpl. I added some more booby traps in JSScript, but these were apparently not hit in this callpath. The check failing in ScriptLoader is also interesting. if the bad data is already there, that would cut out a large part of the search. But it seems the test already generates that warning by itself.

I'm a bit stuck here. I can make this bomb out about 50% of the time on try with guards detecting e5s5s5 memory, but never locally, which precludes using rr or so to find the offensive users and allocation: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6391d354d8a0db202ef80a86da8d4ef41cdd64f3

I digged through the code but at some point we hit the code that is lazily compiling script functions that weren't compiled earlier, and also deals with thigs like compressing then source (and uncompressing on-demand) to save memory. That does look like a potential rat nest from a lifetime perspective. I tried to bomb out on the source that is fed to the JS engine to see if it's in the network stack, but no luck there. I'm also not very sure where to try to bomb specifically.

Waldo, given the above try run, that hits added assertions checking for UAF markers, can you investigate, or give concrete advice how to narrow this down? It's pretty deep in the JS engine at this point.

Flags: needinfo?(jwalden)
Severity: normal → S4
Priority: -- → P1

Looks like the UAF in the JS engine is gone: https://treeherder.mozilla.org/jobs?repo=try&revision=5ff766ccd36460d2191d8c2b13d6302c92000510

I'll check perf but we may be able to land this now.

I'd like to see more runs.

The sherrifs killed the original set because of the resource usage :-/ Pretty hard to do performance measurement like this.

Hrm, some performance really did move around:

Most of those are speedups, which of course makes no sense.

With re-triggering more tests, I think the results are good enough, no real regressions.

Attachment #9252280 - Attachment description: Bug 1546442 - Add test for leading guard pages for normal allocations. r?glandium → Bug 1546442 - Add test for leading guard pages for normal allocations. r?glandium,pbone
Pushed by gpascutto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a5a3d9d8a54e Leading guard pages for normal allocations. r=glandium https://hg.mozilla.org/integration/autoland/rev/b7ce86e48a0c Add test for leading guard pages for normal allocations. r=glandium,pbone
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

(In reply to Pulsebot from comment #39)

Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5a3d9d8a54e
Leading guard pages for normal allocations. r=glandium
https://hg.mozilla.org/integration/autoland/rev/b7ce86e48a0c
Add test for leading guard pages for normal allocations. r=glandium,pbone

== Change summary for alert #32566 (as of Mon, 29 Nov 2021 14:32:41 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
55% Base Content Resident Unique Memory macosx1015-64-shippable-qr 18,036,650.67 -> 8,126,464.00
55% Base Content Resident Unique Memory macosx1015-64-shippable-qr 17,904,213.33 -> 8,134,997.33
21% Resident Memory macosx1015-64-shippable-qr tp6 943,611,969.84 -> 746,658,068.58

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32566

What even are these memory "improvements"? Not complaining I guess, but, what??

I know right?

All I know is that measuring resident memory on MacOS is weird. It's already different with a few different madvise options. Maybe this is related, but maybe it's also something else entirely because I don't get it.

Looking at the origin of those values, it's this code:
https://searchfox.org/mozilla-central/rev/e9cd2997be1071b9bb76fc14df0f01a2bd721c30/xpcom/base/nsMemoryReporterManager.cpp#459

Given that it's not a simple OS query, and the values look nonsensical, I'm going to file a bug to further investigate.

(In reply to Pulsebot from comment #39)

Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5a3d9d8a54e
Leading guard pages for normal allocations. r=glandium
https://hg.mozilla.org/integration/autoland/rev/b7ce86e48a0c
Add test for leading guard pages for normal allocations. r=glandium,pbone

== Change summary for alert #32548 (as of Mon, 29 Nov 2021 07:23:48 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
18% perf_reftest_singletons link-style-cache-1.html macosx1015-64-shippable-qr e10s stylo webrender 354.05 -> 291.05
18% perf_reftest_singletons link-style-cache-1.html macosx1015-64-shippable-qr e10s fission stylo webrender 354.72 -> 292.01
16% perf_reftest_singletons link-style-cache-1.html macosx1015-64-shippable-qr e10s stylo webrender 355.88 -> 299.11
12% perf_reftest_singletons inline-style-cache-1.html macosx1015-64-shippable-qr e10s fission stylo webrender 1,083.12 -> 953.22
12% perf_reftest_singletons inline-style-cache-1.html macosx1015-64-shippable-qr e10s stylo webrender 1,058.36 -> 934.55
... ... ... ... ...
4% tsvgx macosx1015-64-shippable-qr e10s stylo webrender-sw 140.17 -> 135.00

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32548

== Change summary for alert #32569 (as of Mon, 29 Nov 2021 15:18:23 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
3% twitch loadtime macosx1015-64-shippable-qr warm webrender 522.50 -> 506.12
3% twitch dcf macosx1015-64-shippable-qr warm webrender 447.00 -> 435.04

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=32569

Regressions: 1743315
No longer regressions: 1743315
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: