Closed Bug 1276669 Opened 8 years ago Closed 8 years ago

Intermittent Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at xpcom/ds/nsAtomTable.cpp:546 (sometimes reported during crashtest 1113005.html)

Categories

(Core :: General, defect, P3)


Tracking


RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: cbook, Assigned: froydnj)

References

(Depends on 1 open bug)

Details

(Keywords: assertion, crash, intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(11 files, 1 obsolete file)

(deleted), patch; hsivonen: review+
(deleted), patch; bzbarsky: review+
(deleted), patch; hsivonen: review+
(deleted), patch; benjamin: review+
(deleted), patch; erahm: review+
(deleted), patch; bzbarsky: review+
(deleted), patch; erahm: review+
(deleted), patch; erahm: review+
(deleted), patch; erahm: review+
(deleted), patch; erahm: review+
(deleted), patch; erahm: review+
https://treeherder.mozilla.org/logviewer.html#?job_id=29017674&repo=mozilla-inbound#L84278

04:15:27 ERROR - REFTEST PROCESS-CRASH | Main app process exited normally | application crashed [@ XUL + 0x96620]
Crash dump filename: /var/folders/t0/q6j3ybpx29q21vrxg9z4zx0800000w/T/tmpv_8bdS.mozrunner/minidumps/5319246F-CA7C-40D5-8C22-ADF9C3EFB716.dmp
04:15:27 INFO - Operating system: Mac OS X 10.10.5 14F27
04:15:27 INFO - CPU: amd64, family 6 model 69 stepping 1, 4 CPUs
04:15:27 INFO - Crash reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
04:15:27 INFO - Crash address: 0x0
04:15:27 INFO - Process uptime: 440 seconds
04:15:27 INFO - Thread 0 (crashed)
04:15:27 INFO -  0  XUL + 0x96620
04:15:27 INFO -  1  XUL + 0x121258
04:15:27 INFO -  2  XUL + 0x38b6b41
04:15:27 INFO -  3  XUL + 0x66e4b3
04:15:27 INFO -  4  XUL + 0x38b7686
04:15:27 INFO -  5  plugin-container + 0x12cc
04:15:27 INFO -  6  plugin-container + 0xcd4
(register dumps and the remaining threads of the truncated minidump excerpt elided)
could this be a regression from bug 1275755 ?
Flags: needinfo?(nfroyd)
(In reply to Carsten Book [:Tomcat] from comment #1)
> could this be a regression from bug 1275755 ?

Looks like. How frequent is this?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #2)
> > could this be a regression from bug 1275755 ?
> Looks like. How frequent is this?

Not super frequent, but it does happen, I would say.
This looks really frequent on OSX debug AFAICT.
Flags: needinfo?(nfroyd)
Summary: Intermittent Main app process exited normally | application crashed [@ XUL + 0x96620] | 04:15:26 INFO - Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546 → Intermittent Main app process exited normally | application crashed [@ XUL + 0x96620] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546
In one particular testsuite, even. I would take a patch that would provide more insight into what's going on with some diagnostic messages and such. Want to learn to write more C++? =D
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #12)
> In one particular testsuite, even.
>
> I would take a patch that would provide more insight into what's going on
> with some diagnostic messages and such. Want to learn to write more C++? =D

Still happening, so I guess you've left me with no other choice.
Flags: needinfo?(nfroyd)
Let's try one that's actually passed local compilation: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a8da4d60aad
Even better: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da2f23ef8f8

This log: https://treeherder.mozilla.org/logviewer.html#?job_id=26204790&repo=try#L84757
indicates that our counter tracking the number of unused atoms is inconsistent with the number of atoms actually in the table:

07:03:11 INFO - [Child 1735] ###!!! ASSERTION: removed 10 unused 11
07:03:11 INFO - : 'false', file /builds/slave/try-m64-d-00000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp, line 472
07:03:11 INFO - #01: XRE_InitChildProcess (in XUL) + 2328
07:03:11 INFO - #02: 0x0009c560 (in XUL) + 112
07:03:11 INFO - #03: 0x0012651c (in XUL) + 1564
07:03:11 INFO - #04: XRE_TermEmbedding (in XUL) + 81
07:03:11 INFO - #05: 0x006699d3 (in XUL) + 19
07:03:11 INFO - #06: 0x0000000100001f0c (in plugin-container) + 236

But, at the same time, we removed all the known dynamic atoms from the table, and we (I think) haven't had any races with other threads releasing atoms while we're iterating over the table for the final GC. So I'm unsure what the problem actually is...
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Summary: Intermittent Main app process exited normally | application crashed [@ XUL + 0x96620] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546 → Intermittent Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546
This has increased in frequency recently; I would like to come back and revisit this bug. I see both Bobby Holley and :froydnj on the bug; those are the common names on nsAtomTable.cpp. While this happens on a variety of operating systems, it is particular to crashtests, e10s, debug builds, and primarily OS X. Looking over history on autoland, this occurs about 6% of the time. We went from ~15 instances/week to >25/week around December 4th. I would imagine we could push to try and bisect this down with 50 retriggers of crashtest, maybe breaking crashtests into a few chunks. Unfortunately OS X isn't on taskcluster, so to do this we would need to hack the manifests each time.
Here is me bisecting the ~3100 crashtests into 10 chunks and running them 50 times each: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=7183f38b4fc65b92faa953572f3a9f74f029cc92&tochange=46c1edf84a5a77b19ec0a65e9dc80653c211f64d Ideally we will have data pointing us at a chunk of ~310 tests; worst case, either a couple of chunks or no chunks.
I found the 3rd chunk to be the problem (10%+ of the time) :) Now to bisect the 312 tests down into 10 more chunks: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=46b96c8fd1430d02227c5fd8836e211186ed2b18&tochange=b04ea377414ae0b6a31026c7a9fa72b90854223e Just like last time, this should give us a smaller chunk, and we can then bisect the ~31 tests down to a smaller range.
Hmm, all smaller chunks of 31 passed for me, and I have verified I ran the same set of tests in the smaller chunks as in the original chunk of 311. I am not sure if this is timing related or just not running long enough. The directories that we run tests from are:
dom/svg/crashtests
dom/xbl/crashtests
dom/xml/crashtests
dom/xslt/crashtests
dom/xul/crashtests
dom/xul/templates/crashtests
editor/composer/crashtests
editor/libeditor/crashtests
editor/txmgr/tests/crashtests
gfx/tests/crashtests
trying a few more try pushes to split the 311 tests into 2 sets of 155 and see if this helps.
OK, I give up: splitting the 311 tests into 2 or 3 chunks does not reproduce, but running the same 311 tests again yields failures. So as it stands, I think there is a set of tests in the directories from comment 45 which depend upon each other (or some timing issue). I am not sure how to make this actionable other than running the tests in 12 chunks :) I will see if :froydnj has thoughts on this, given that we have narrowed the test case down by 90%.
Flags: needinfo?(nfroyd)
Summary: Intermittent Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546 → Intermittent 1113005.html, Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546
(In reply to Joel Maher ( :jmaher) from comment #47)
> I am not sure how to make this actionable other than running the tests in 12
> chunks :) I think I will see if :froydnj has thoughts on this given the
> fact that we have narrowed the test case down 90%.

On the supposition that it's timing related, can you simply eliminate some tests and/or directories from the try runs and narrow it down that way?

The peculiar thing is that comment 22 says that we're racing between threads, I think, and comment 45 suggests that virtually all of the tests are dealing with on-the-main-thread things.
Flags: needinfo?(nfroyd)
OK, so I've been thinking about this intermittent and doing some try runs, and I have a plausible theory as to how we end up with this assertion. Why it happens much more frequently on Mac, I don't know.

Try runs indicate that we're failing like this:

12:29:24 INFO - [Child 2266] ###!!! ASSERTION: removed 15 unused 16
12:29:24 INFO - : 'false', file /builds/slave/try-m64-d-00000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp, line 487
12:29:24 INFO - #01: NS_ShutdownAtomTable() [xpcom/glue/Mutex.h:169]
12:29:24 INFO - #02: mozilla::ShutdownXPCOM(nsIServiceManager*) [xpcom/build/XPCOMInit.cpp:1083]
12:29:24 INFO - #03: XRE_TermEmbedding [toolkit/xre/nsEmbedFunctions.cpp:217]
12:29:24 INFO - #04: mozilla::ipc::ScopedXREEmbed::Stop() [ipc/glue/ScopedXREEmbed.cpp:118]
12:29:24 INFO - #05: XRE_InitChildProcess [toolkit/xre/nsEmbedFunctions.cpp:761]
12:29:24 INFO - #06: content_process_main(int, char**) [ipc/contentproc/plugin-container.cpp:115]

That is, we've removed 15 atoms from the atom table at shutdown, but we think we have 16 unused atoms in the table, and so we fail. The numbers are not consistent between runs, but the difference is always more unused atoms than we have removed.

A plausible scenario follows. At some arbitrary time during the execution of the tests, we have:

T1: Starts to GC the atom table, for whatever reason:
http://searchfox.org/mozilla-central/source/xpcom/ds/nsAtomTable.cpp#403

T2: Release()'s a Dynamic atom:
http://searchfox.org/mozilla-central/source/xpcom/ds/nsAtomTable.cpp#453

After the reference count of the atom on T2 has been decremented (and so it is now zero and can be GC'd), but before it can execute any of the instructions involved in `++gUnusedAtomCount >= kAtomGCThreshold`, it is interrupted.

T1: Completes the GC of the atom table, GC'ing the atom from T2, and decrements gUnusedAtomCount.
T2: Increments gUnusedAtomCount, finds that it's less than kAtomGCThreshold, and continues.

We are now in a situation where gUnusedAtomCount is *not* an accurate count of the unused atoms still in the atom table. We're also doing very dodgy things, since we're executing the destructor of an object that has just been delete'd... it's a very good thing we don't touch any more object state after this check!

The only possibilities I see are deleting the assertion or making atom addref/release slower (i.e. Mutex acquisition to ensure we don't race with the GC). But I haven't thought about it very hard. bholley, WDYT?
Flags: needinfo?(bobbyholley)
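To make the interleaving in the scenario above concrete, here is a minimal sketch of the two paths involved. This is not the real nsAtomTable.cpp code: the class, counter, threshold value, and GC body are simplified stand-ins, and, as the follow-up comments note, this particular interleaving by itself only leaves the counter transiently inaccurate once T2's increment eventually lands.

#include <atomic>
#include <cstdint>

// Simplified stand-ins for the real atom-table machinery; the names and the
// threshold value are assumptions for illustration only.
static std::atomic<uint32_t> gUnusedAtomCount{0};
static const uint32_t kAtomGCThreshold = 10000;

void GCAtomTable();  // sketched below

struct DynamicAtom {
  std::atomic<uint32_t> mRefCnt{1};

  void Release() {  // T2's path
    if (--mRefCnt == 0) {
      // If T2 is preempted right here, a GC running on T1 can already sweep
      // this atom (its refcount is zero) and decrement gUnusedAtomCount for
      // it before the increment below has executed.
      if (++gUnusedAtomCount >= kAtomGCThreshold) {
        GCAtomTable();
      }
      // By this point the concurrent GC may already have deleted the object,
      // which is the "dodgy" part called out above: no further member state
      // may be touched here.
    }
  }
};

void GCAtomTable() {  // T1's path; the real code holds the atom-table lock
  uint32_t removed = 0;
  // (iterate the table, remove every atom whose refcount is zero, ++removed)
  gUnusedAtomCount -= removed;  // subtracts only what was actually removed
}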
(In reply to Nathan Froyd [:froydnj] from comment #50)
> T1: Starts to GC the atom table, for whatever reason:
> http://searchfox.org/mozilla-central/source/xpcom/ds/nsAtomTable.cpp#403
>
> T2: Release()'s a Dynamic atom:
> http://searchfox.org/mozilla-central/source/xpcom/ds/nsAtomTable.cpp#453
>
> After the reference count of the atom on T2 has been decremented (and so it
> is now zero and can be GC'd), but before it can execute any of the
> instructions involved in `++gUnusedAtomCount >= kAtomGCThreshold`, it is
> interrupted.
>
> T1: Completes the GC of the atom table, GC'ing the atom from T2, and
> decrements gUnusedAtomCount.
> T2: increments gUnusedAtomCount, finds that it's less than kAtomGCThreshold,
> and continues.

But T1 decrements gUnusedAtomCount by the number of atoms that were actually removed, right? So in the race you're describing, T1 is sort of "spending on credit" when it decrements gUnusedAtomCount, but T2 will deposit the balance shortly and equalize things out. This could lead to integer underflow, but as long as things wrap appropriately (do they?), it should all balance out, right? Or am I missing something?
Flags: needinfo?(bobbyholley) → needinfo?(nfroyd)
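A tiny worked example of that "spending on credit" accounting, assuming gUnusedAtomCount behaves like an unsigned 32-bit counter (so the temporary deficit just wraps and then unwraps):

#include <cstdint>

int main() {
  // One unused atom is already counted; a second atom's Release() has dropped
  // its refcount to zero but has not yet incremented the counter when the GC
  // sweeps both atoms.
  uint32_t gUnusedAtomCount = 1;
  gUnusedAtomCount -= 2;  // T1: GC removes 2 atoms; wraps to 0xFFFFFFFF ("credit")
  gUnusedAtomCount += 1;  // T2: the in-flight increment lands afterwards -> back to 0
  // The books balance once T2's deposit arrives, so this race by itself cannot
  // explain a count that ends up permanently too high.
  return static_cast<int>(gUnusedAtomCount);  // 0
}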
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #51)
> But T1 decrements gUnusedAtomCount by the number of atoms that were actually
> removed, right? So in the race you're describing, T1 is sort of "spending on
> credit" when it decrements gUnusedAtomCount, but T2 will deposit the balance
> shortly and equalize things out. This could lead to integer underflow, but
> as long as things wrap appropriately (do they), it should all balance out,
> right? Or am I missing something?

OK, that makes sense. But we're still ending up with an extra increment that's not reflected in the number of unused atoms in the atom table! Back to thinking hard, I guess...
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #50)
> That is, we've removed 15 atoms from the atom table at shutdown, but we
> think we have 16 unused atoms in the table, and so we fail. The numbers are
> not consistent between runs, but the difference is always more unused atoms
> than we have removed.

Hm. How many atoms are actually in the table? Does it match the unused count when we GC? It seems like the final GC should assert that the unused count is equal to the entries in the table. But I guess that runs up against the issue described in [1]. I think it might make sense to track down and fix those leaks, since it would allow us to have stronger invariants here and make it easier to reason about this bug.

[1] http://searchfox.org/mozilla-central/rev/8144fcc62054e278fe5db9666815cc13188c96fa/xpcom/ds/nsAtomTable.cpp#549
--> Adjusting bug summary a bit to clarify that this doesn't *actually* happen during crashtest 1113005.html, though Linux64 Debug test-runners report it as if it did. See bug 1329781 for more details (there's a logging bug here).
Summary: Intermittent 1113005.html, Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at /builds/slave/m-in-m64-d-0000000000000000000/build/src/xpcom/ds/nsAtomTable.cpp:546 → Intermittent Main app process exited normally | application crashed [@ NS_ShutdownAtomTable()] after Assertion failure: gUnusedAtomCount == 0, at xpcom/ds/nsAtomTable.cpp:546 (sometimes reported during crashtest 1113005.html)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> Hm. How many atoms are actually in the table? Does it match the unused count
> when we GC? It seems like the final GC should assert that the unused count
> is equal to the entries in the table.

That is what the current assertion (gUnusedAtomCount == 0) is essentially trying to do.

I have a more complete set of diagnostic patches:

https://hg.mozilla.org/try/rev/61f17a9fa0314f3a6516c8193cf8a4bd469c08da
https://hg.mozilla.org/try/rev/3ba431dc5a0b54b1ed1895eb5ac47d7b5cd61321

that separates the GC into normal-usage GC and shutdown GC. And we get the same flavor of assertion: at the final shutdown GC, the number of atoms removed from the table is not equal to the count of unused atoms that we had. And no, we don't appear to have missed any, or added things to the atom table during GC, or anything like that.
(In reply to Nathan Froyd [:froydnj] from comment #56)
> That is what the current assertion (gUnusedAtomCount == 0) is essentially
> trying to do.

Not quite. IIUC the current assertion is trying to check whether we GCed all the unused atoms. That's different than asserting that all the atoms are unused (i.e. no atoms are leaked), which per the link in comment 54 isn't yet assertable. I'm suggesting that we see whether we can make that assertable without too much work, because it would give us another data point and help us figure out whether gUnusedAtomCount is wrong or one of the mRefCnt members is wrong.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #57)
> Not quite. IIUC the current assertion is trying to check whether we GCed all
> the unused atoms. That's different than asserting that all the atoms are
> unused (i.e. no atoms are leaked), which per the link in comment 54 isn't
> yet assertable. I'm suggesting that we see whether we can make that
> assertable without too much work, because it would give us another data
> point and help us figure out whether gUnusedAtomCount is wrong or one of the
> mRefCnt members is wrong.

The diagnostic patches explicitly check that the number of removed atoms is equal to gUnusedAtomCount, which is not the case. Those patches also check that there are no dynamic atoms in the atom table after we've finished the shutdown GC. So we're not leaking any atoms.

Ah, idea: dynamic atoms can get transmuted to static atoms. Maybe what's happening is that we're creating a dynamic atom *very* early on and then releasing it so that it's effectively unused. That would increment gUnusedAtomCount, but probably not trigger a GC. Then we register all the static atoms, and effectively change that (unused) dynamic atom into a static atom... and that atom will never get released. That would account for the extra increment to gUnusedAtomCount.
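A toy model of the bookkeeping in that hypothesis; every name and number here is illustrative, the real accounting lives in nsAtomTable.cpp:

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t gUnusedAtomCount = 0;

  // Early startup: a dynamic atom is created and then released before the
  // static atom table has been populated. Its refcount hits zero, so it is
  // counted as unused, but the count is far below the GC threshold, so the
  // entry simply stays in the table.
  ++gUnusedAtomCount;

  // Static atom registration later finds that existing entry and transmutes
  // the unused dynamic atom into a static atom. Static atoms are never
  // released, so no removal will ever match the increment above.

  // Over the rest of the run, 15 other dynamic atoms become unused, and the
  // shutdown GC removes exactly those 15 from the table.
  gUnusedAtomCount += 15;
  gUnusedAtomCount -= 15;

  // The stale increment remains: gUnusedAtomCount is 1 instead of 0, which is
  // exactly the "removed 15 unused 16" shape of the assertion failure.
  std::printf("leftover gUnusedAtomCount = %u\n", gUnusedAtomCount);
  return 0;
}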
(In reply to Nathan Froyd [:froydnj] from comment #58)
> The diagnostic patches explicitly check that the number of removed atoms is
> equal to gUnusedAtomCount, which is not the case. Those patches also check
> that there are no dynamic atoms in the atom table after we've finished the
> shutdown GC. So we're not leaking any atoms.

Oh interesting! If that's actually the case, then we can strengthen the assertion referred to at [1], right?

> Ah, idea: dynamic atoms can get transmuted to static atoms. Maybe what's
> happening is that we're creating a dynamic atom *very* early on and then
> releasing it so that it's effectively unused. That would increment
> gUnusedAtomCount, but probably not trigger a GC. Then we register all the
> static atoms, and effectively change that (unused) dynamic atom into a
> static atom... and that atom will never get released. That would account for
> the extra increment to gUnusedAtomCount.

Interesting! But why on earth do we allow this sketchy transmutation? Is it a holdover from the binary component days? It seems like we could just reorder things so that the static atom table is initialized, prepopulated, and sealed very early in startup.

[1] http://searchfox.org/mozilla-central/rev/8144fcc62054e278fe5db9666815cc13188c96fa/xpcom/ds/nsAtomTable.cpp#549
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #59)
> Oh interesting! If that's actually the case, then we can strengthen the
> assertion referred to at [1] right?

Well, once the previous, weaker assertion stops intermittently failing, yes. :)

> Interesting! But why on earth do we allow this sketchy transmutation? Is it
> a holdover from the binary component days? It seems like we could just
> reorder things so that the static atom table is initialized, prepopulated,
> and sealed very early in startup.

I don't know why we allow this; the code doing this dates from hg@1:

https://hg.mozilla.org/integration/mozilla-inbound/annotate/5903b0f685fc/xpcom/ds/nsAtomTable.cpp

with just minimal refactoring changes over the years. I guess we could stick a MOZ_ASSERT there and then try to figure out who's doing that. I would dearly love to overhaul how we do atoms...

The good news is that handling this transmutation case apparently makes the intermittent go away:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e27bf84176b405879fceab7526cc58a2d35b26bb&selectedJob=67698606
(In reply to Nathan Froyd [:froydnj] from comment #60)
> with just minimal refactoring changes over the years. I guess we could
> stick a MOZ_ASSERT there and then try to figure out who's doing that. I
> would dearly love to overhaul how we do atoms...

I think it would actually be even easier than that. There are 9 calls to NS_RegisterStaticAtoms [1]. The 4 main ones live at the top of nsLayoutStatics::Initialize [2]. But because of the 5 other miscellaneous ones, we delay sealing the atom table until the _bottom_ of nsLayoutStatics::Initialize [3], after all sorts of (presumably dynamic-atom-consuming) code has been run.

So my bet is we could do the following simple steps (sketched after this list):
(1) Invoke nsTextServicesDocument::RegisterAtoms, nsHtml5Atoms::AddRefAtoms, nsHTMLTags::AddRefTable, NS_RegisterStaticAtoms(rdf_atoms), and NS_RegisterStaticAtoms(directory_atoms) at the top of nsLayoutStatics::Initialize.
(2) Move the NS_SealStaticAtomTable() call to immediately after that block.
(3) Initialize the dynamic atom table at the bottom of NS_SealStaticAtomTable(), and remove the lazy initialization.
(4) Remove the transmutation gunk.
(5) Assert that we're not leaking any atoms at shutdown.

WDYT?

[1] http://searchfox.org/mozilla-central/search?q=symbol:_Z22NS_RegisterStaticAtomsRAT__K12nsStaticAtom&redirect=false
[2] http://searchfox.org/mozilla-central/rev/225ab0637ed51b8b3f9f4ee2f9c339a37a65b626/layout/build/nsLayoutStatics.cpp#155
[3] http://searchfox.org/mozilla-central/rev/225ab0637ed51b8b3f9f4ee2f9c339a37a65b626/layout/build/nsLayoutStatics.cpp#270
Flags: needinfo?(nfroyd)
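A rough sketch of what steps (1) and (2) of that plan would look like. The call names are taken from the comment above, but the signatures, the surrounding code, and the exact set of "4 main" registrations are assumptions, not the patches that eventually landed:

// Sketch only: the real nsLayoutStatics::Initialize does much more, and the
// exact registration calls and their signatures may differ from this.
nsresult nsLayoutStatics::Initialize()
{
  // (1) Register every static atom table first, before anything that might
  //     atomize strings dynamically.
  nsGkAtoms::AddRefAtoms();                 // assumed to be one of the "4 main" calls
  nsTextServicesDocument::RegisterAtoms();
  nsHtml5Atoms::AddRefAtoms();
  nsHTMLTags::AddRefTable();
  NS_RegisterStaticAtoms(rdf_atoms);        // per step (1) above
  NS_RegisterStaticAtoms(directory_atoms);  // per step (1) above

  // (2) Seal immediately afterwards, so a dynamic atom can never exist for a
  //     name that will later be registered as a static atom.
  NS_SealStaticAtomTable();

  // ... the rest of layout/statics initialization runs after the seal ...
  return NS_OK;
}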
(In reply to Nathan Froyd [:froydnj] from comment #60)
> The good news is that handling this transmutation case apparently makes the
> intermittent go away:
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e27bf84176b405879fceab7526cc58a2d35b26bb&selectedJob=67698606

Also: \o/ \o/ \o/
That is a much easier idea than my "figure out what needs to be moved through repeated try runs". :) I'll look into doing that.
Flags: needinfo?(nfroyd)
Moving the html5 atoms out into their own initialization phase makes the initialization of atoms more explicit and avoids problems with trying to move modules around so their atoms get initialized in the correct place. As an aesthetic bonus, this change produces pleasing symmetry in nsHtml5Module::{Initialize,Release}Statics. Reviewed-by: Nathan Froyd <froydnj@gmail.com>
Attachment #8830339 - Flags: review?(hsivonen)
Moving this call closer to the other atom initializations will enable us to seal the static atom table sooner.
Attachment #8830340 - Flags: review?(bzbarsky)
We do this for much the same reason that we moved the nsHtml5Atoms initialization. Otherwise, the nsHTMLTags atoms are lazily initialized long after we've sealed the static atom table in nsLayoutStatics.
Attachment #8830341 - Flags: review?(hsivonen)
This is not the cleanest code ever, but we need to move all static atom initialization prior to NS_SealStaticAtomTable to avoid the possibility of dynamic atoms being transmuted into static atoms. We therefore need to open up an avenue for somebody else to initialize the atoms that RDF needs.
Attachment #8830343 - Flags: review?(l10n)
There are several XPCOM tests that purport to call NS_RegisterStaticAtoms. The tests located in the xpcom/tests/ directory are unused, so we might as well just remove them. The gtests do get run, but there's going to be no way to test NS_RegisterStaticAtoms once sealing the atom table actually means forbidding new additions. So we might as well remove the gtest too.
Attachment #8830344 - Flags: review?(erahm)
Attached patch part 6 - seal the static atom table sooner (deleted)
We do this to ensure that everybody has registered all the static atoms we'll care about, and to pave the way for asserting that nobody is trying to create any static atoms past this point in the next patch.
Attachment #8830345 - Flags: review?(bzbarsky)
This change seems like an obvious thing we should have been doing, but we weren't.
Attachment #8830346 - Flags: review?(erahm)
This comment makes things slightly more greppable.
Attachment #8830347 - Flags: review?(erahm)
Attached patch part 9 - forbid transmutation of dynamic atoms (obsolete) (deleted)
We can do this now that we've shuffled static atom initialization around appropriately.
Attachment #8830348 - Flags: review?(erahm)
Good riddance to some sketchy code.
Attachment #8830349 - Flags: review?(erahm)
This could have been done more simply, but the small amount of refactoring that takes place in this patch enables better error messages in the case where something does go wrong.
Attachment #8830350 - Flags: review?(erahm)
Comment on attachment 8830343 [details] [diff] [review]
part 4 - initialize RDF atoms in nsLayoutStatics

Review of attachment 8830343 [details] [diff] [review]:
-----------------------------------------------------------------

Flinging this over from Axel to Benjamin; this is out of my comfort zone.
Attachment #8830343 - Flags: review?(l10n) → review?(benjamin)
Comment on attachment 8830346 [details] [diff] [review]
part 7 - don't register static atoms after the table has been sealed

Review of attachment 8830346 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +601,1 @@
>   if (!gStaticAtomTable && !gStaticAtomTableSealed) {

There doesn't seem to be a whole lot of value in the gStaticAtomTableSealed check here now.
Comment on attachment 8830348 [details] [diff] [review]
part 9 - forbid transmutation of dynamic atoms

Review of attachment 8830348 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +619,5 @@
>   if (atom) {
> +    // Disallow creating a dynamic atom, and then later, while the
> +    // dynamic atom is still alive, registering that name atom as a
> +    // static atom. It causes subtle bugs, and we're programming in
> +    // C++ here, not Smalltalk.

Not sure there's any point in jabbing at Smalltalk here.
Comment on attachment 8830343 [details] [diff] [review]
part 4 - initialize RDF atoms in nsLayoutStatics

This makes me sad, both because RDF will be going away soon and because the layout module is a weird side-effecty way of doing this. But I can't think of a cleaner way, so r+, and I'll hopefully make this better when hacking on the startup sequence.
Attachment #8830343 - Flags: review?(benjamin) → review+
Comment on attachment 8830340 [details] [diff] [review] part 2 - move nsTextServicesDocument::RegisterAtoms call r=me
Attachment #8830340 - Flags: review?(bzbarsky) → review+
Comment on attachment 8830345 [details] [diff] [review] part 6 - seal the static atom table sooner r=me
Attachment #8830345 - Flags: review?(bzbarsky) → review+
Comment on attachment 8830344 [details] [diff] [review]
part 5 - don't try to test NS_RegisterStaticAtoms

Review of attachment 8830344 [details] [diff] [review]:
-----------------------------------------------------------------

It would be nice if we had a way to test the atom table directly rather than being limited to the global one, but I guess this is what we've got. It might make sense to file a follow-up to make the atom table unit testable.

::: xpcom/tests/gtest/TestAtoms.cpp
@@ -160,5 @@
>
>   EXPECT_TRUE(thirdDynamic);
>   EXPECT_EQ(NS_GetNumberOfAtoms(), count + 1);
> -
> -  NS_RegisterStaticAtoms(sAtoms_info);

Can you add a note here that we can't test adding atoms because the table has already been sealed by the time XPCOM initialization has completed?
Attachment #8830344 - Flags: review?(erahm) → review+
Comment on attachment 8830346 [details] [diff] [review]
part 7 - don't register static atoms after the table has been sealed

Review of attachment 8830346 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +595,5 @@
>  RegisterStaticAtoms(const nsStaticAtom* aAtoms, uint32_t aAtomCount)
>  {
>    MutexAutoLock lock(*gAtomTableLock);
> +
> +  MOZ_RELEASE_ASSERT(!gStaticAtomTableSealed, "too late!");

Perhaps: "Atom table has already been sealed!"
Attachment #8830346 - Flags: review?(erahm) → review+
Attachment #8830347 - Flags: review?(erahm) → review+
Comment on attachment 8830348 [details] [diff] [review]
part 9 - forbid transmutation of dynamic atoms

Review of attachment 8830348 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +617,5 @@
>
>   nsIAtom* atom = he->mAtom;
>   if (atom) {
> +    // Disallow creating a dynamic atom, and then later, while the
> +    // dynamic atom is still alive, registering that name atom as a

nit: "registering that same atom"

@@ +624,2 @@
>     if (!atom->IsStaticAtom()) {
> +      MOZ_CRASH("static atom registration should be pushed back!");

Is there some way we can log the atom string here? Crashing without that isn't super helpful. Maybe w/ a crash annotation.
Attachment #8830348 - Flags: review?(erahm) → feedback+
Comment on attachment 8830349 [details] [diff] [review] part 10 - remove dynamic->static atom transmutation code Review of attachment 8830349 [details] [diff] [review]: ----------------------------------------------------------------- \o/
Attachment #8830349 - Flags: review?(erahm) → review+
Attachment #8830350 - Flags: review?(erahm) → review+
Lovely, esp. part 10.
Comment on attachment 8830339 [details] [diff] [review]
part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics

Review of attachment 8830339 [details] [diff] [review]:
-----------------------------------------------------------------

r+, but I guess I should really put the HTML5 atoms into the normal atom bucket. I was advised to make them separate way back in 2009, IIRC, but the separation seems to just complicate things.
Attachment #8830339 - Flags: review?(hsivonen) → review+
(In reply to Henri Sivonen (:hsivonen) from comment #88)
> r+, but I guess I should really put the HTML5 atoms into the normal atom
> bucket. I was advised to make them separate way back in 2009, IIRC, but the
> separation seems to just complicate things.

It complicates things, and it would save a reasonable amount of space (tens of kilobytes) if the HTML parser would just use nsGkAtoms:: atoms instead. Is that feasible to do? I looked into it, but got a bit lost in having to modify the Java translator to make things go.
Flags: needinfo?(hsivonen)
Attached patch part 9 - forbid transmutation of dynamic atoms (deleted)
Now with more crash annotation.
Attachment #8830805 - Flags: review?(erahm)
Attachment #8830348 - Attachment is obsolete: true
Comment on attachment 8830805 [details] [diff] [review]
part 9 - forbid transmutation of dynamic atoms

Review of attachment 8830805 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +628,5 @@
> +      nsPrintfCString msg("static atom registration for %s should be pushed back",
> +                          name.get());
> +      static char sCrashReason[1024];
> +      memcpy(sCrashReason, msg.get(), XPCOM_MIN(sizeof(sCrashReason),
> +                                                size_t(msg.Length() + 1)));

We could probably just |snprintf| the buffer directly, but this way works fine too.
Attachment #8830805 - Flags: review?(erahm) → review+
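For reference, the |snprintf| variant suggested in the review note might look roughly like this. It is a sketch, not the landed code, and it assumes `name` is the same 8-bit string used with nsPrintfCString in the quoted hunk:

// Sketch only: format directly into the static crash-reason buffer instead of
// going through nsPrintfCString + memcpy. snprintf always NUL-terminates and
// truncates if the message exceeds the buffer.
static char sCrashReason[1024];
snprintf(sCrashReason, sizeof(sCrashReason),
         "static atom registration for %s should be pushed back",
         name.get());
MOZ_CRASH_ANNOTATE(sCrashReason);
MOZ_REALLY_CRASH();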
Comment on attachment 8830805 [details] [diff] [review]
part 9 - forbid transmutation of dynamic atoms

Review of attachment 8830805 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +630,5 @@
> +      static char sCrashReason[1024];
> +      memcpy(sCrashReason, msg.get(), XPCOM_MIN(sizeof(sCrashReason),
> +                                                size_t(msg.Length() + 1)));
> +      MOZ_CRASH_ANNOTATE(sCrashReason);
> +      MOZ_REALLY_CRASH();

Sorry, one more note: you can just MOZ_CRASH(sCrashReason), I believe.
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f0488821fbbe
part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics; r=hsivonen
https://hg.mozilla.org/integration/mozilla-inbound/rev/d8652bf04c7b
part 2 - move nsTextServicesDocument::RegisterAtoms call; r=bz
https://hg.mozilla.org/integration/mozilla-inbound/rev/b7c1ee47ca9a
part 3 - split out nsHTMLTags atom initialization; r=hsivonen
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6272736e0e6
part 4 - initialize RDF atoms in nsLayoutStatics; r=Pike
https://hg.mozilla.org/integration/mozilla-inbound/rev/ead7c6aa2342
part 5 - don't try to test NS_RegisterStaticAtoms; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/c7456484433f
part 6 - seal the static atom table sooner; r=bz
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a6fc9e02f9d
part 7 - don't register static atoms after the table has been sealed; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fe884bfae3a
part 8 - add a comment in NS_InitXPCOM2 to make static atom initialization less cryptic; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/2534a6de4a7a
part 9 - forbid transmutation of dynamic atoms; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d5695501775
part 10 - remove dynamic->static atom transmutation code; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ad4f181474c
part 11 - strengthen assertions for atom table shutdown GC; r=erahm
My hero <3
Assignee: nobody → nfroyd
Depends on: 1334558
Depends on: 1334771
Depends on: 1303637
Depends on: 1334834
Depends on: 1334874
(In reply to Nathan Froyd [:froydnj] from comment #89)
> (In reply to Henri Sivonen (:hsivonen) from comment #88)
> > r+, but I guess I should really put the HTML5 atoms into the normal atom
> > bucket. I was advised to make them separate way back in 2009, IIRC, but the
> > separation seems to just complicate things.
>
> It complicates things and it would save a reasonable amount of space (tens
> of kilobytes) if the HTML parser would just use nsGkAtoms:: atoms instead.
> Is that feasible to do?

Totally.

> I looked into it, but got a bit lost in having to
> modify the Java translator to make things go.

Let's see if I can get it done this week...
Flags: needinfo?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #97)
> (In reply to Nathan Froyd [:froydnj] from comment #89)
> > I looked into it, but got a bit lost in having to
> > modify the Java translator to make things go.
>
> Let's see if I can get it done this week...

That would be great, thank you!
RyanVM would dearly love to uplift these patches. I don't think there's much of a problem with uplifting them to 53, but given the TB fallout from these patches, I think uplifting them to 52 would be a bridge too far. For 52, I'd like to do something more like: https://hg.mozilla.org/try/rev/e27bf84176b405879fceab7526cc58a2d35b26bb and handle the problematic case directly, rather than architecturally redoing bits of how we do atoms.
I do think that if we uplift to 52 without the new assertion, Thunderbird would be OK by just moving to dynamic atoms and backporting that to 52. Whether we can have other callers of NS_RegisterStaticAtoms on 52 is less clear to me. On 53 we can't, of course, since we killed the XPCOM exports.
Shall we go ahead and proceed with the Aurora approvals on this?
Flags: needinfo?(nfroyd)
Sure.
Flags: needinfo?(nfroyd)
Comment on attachment 8830339 [details] [diff] [review]
part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics

This uplift request is for all the patches in this bug.

Approval Request Comment
[Feature/Bug causing the regression]: bug 1275755
[User impact if declined]: None. Sheriffs will love not having to star intermittent failures for this on Aurora, though.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Insofar as there haven't been any intermittent failures from this bug since the patches landed, yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: We're making the startup sequence of the browser more predictable. If there were problems, we'd have seen them via crashes at startup.
[String changes made/needed]: None.
Attachment #8830339 - Flags: approval-mozilla-aurora?
Blocks: 1335854
Comment on attachment 8830339 [details] [diff] [review]
part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics

Fixes an intermittent; let's uplift to Aurora 53.
Attachment #8830339 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8830339 [details] [diff] [review] part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics See comment 103.
Attachment #8830339 - Flags: approval-mozilla-beta?
Is this ok for TB in 52 after all?
Thanks for the consideration. We're still suffering from the fallout of this. We landed bug 1334558, bug 1334771, and bug 1334874 to fix our bustage, but bug 1337865 is still open. This affects debug builds only. I guess we would survive if this was uplifted to M-C, perhaps next week, since this week we're planning to build our TB 52 beta2 without this being uplifted.
Comment on attachment 8830339 [details] [diff] [review]
part 1 - make nsHtml5Atoms initialization explicit in nsLayoutStatics

Let's get these into beta 52 now.
Attachment #8830339 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Seems this needs rebasing for beta; for example, grafting hits:

grafting 387536:c922a29ad146 "Bug 1276669 - Part 3: Split out nsHTMLTags atom initialization. r=hsivonen, a=ritu"
merging layout/build/nsLayoutStatics.cpp
merging parser/htmlparser/nsHTMLTags.cpp
merging parser/htmlparser/nsHTMLTags.h
warning: conflicts while merging parser/htmlparser/nsHTMLTags.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use 'hg resolve' and 'hg graft --continue')
Flags: needinfo?(nfroyd)
(In reply to Carsten Book [:Tomcat] from comment #111)
> seems this need rebasing for beta since hitting as example
> warning: conflicts while merging parser/htmlparser/nsHTMLTags.cpp! (edit,
> then use 'hg resolve --mark')

I think RyanVM was flexing his C++ knowledge and has rebased the patches onto Beta?
Flags: needinfo?(nfroyd) → needinfo?(ryanvm)
Green on Try no less! \m/
Flags: needinfo?(ryanvm)
Whiteboard: [stockwell fixed]