Closed Bug 604756 Opened 14 years ago Closed 14 years ago

crash [@ JSString::flatten() ] [@ JSString::flatten ]

Categories

(Core :: JavaScript Engine, defect)

Platform: x86
OS: All
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED DUPLICATE of bug 608142
Tracking Status
blocking2.0 --- beta7+

People

(Reporter: scoobidiver, Assigned: dmandelin)

References

Details

(Keywords: crash, regression, reproducible)

Crash Data

Attachments

(3 files, 1 obsolete file)

Build: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101015 Firefox/4.0b8pre

It is a residual crash signature with a low daily rate of about 5 crashes/day. Starting with the b8pre/20101015 build, there is a spike in crashes; it has become the #6 top crasher. The regression range for the spike is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ad0a0be8be74&tochange=19cb42fa4554

Signature       JSString::flatten()
UUID            84334e51-5271-49ec-a2c4-8972b2101015
Time            2010-10-15 12:57:58.4697
Uptime          129
Last Crash      136 seconds before submission
Install Age     12337 seconds (3.4 hours) since version was first installed
Product         Firefox
Version         4.0b8pre
Build ID        20101015042126
Branch          2.0
OS              Windows NT
OS Version      6.1.7600
CPU             x86
CPU Info        GenuineIntel family 6 model 15 stepping 11
Crash Reason    EXCEPTION_ACCESS_VIOLATION_READ
Crash Address   0x0
App Notes       AdapterVendorID: 10de, AdapterDeviceID: 0e22

Frame  Module        Signature                        Source
0      mozjs.dll     JSString::flatten                js/src/jsstr.cpp:118
1      mozjs.dll     js_ConcatStrings                 js/src/jsstr.cpp:404
2      mozjs.dll     js::Interpret                    js/src/jsinterp.cpp:3588
3      mozjs.dll     js::RunScript                    js/src/jsinterp.cpp:638
4      mozjs.dll     js::Invoke                       js/src/jsinterp.cpp:747
5      mozjs.dll     js_fun_apply                     js/src/jsfun.cpp:2369
6      mozjs.dll     js::Interpret                    js/src/jsinterp.cpp:4714
7      mozjs.dll     js::RunScript                    js/src/jsinterp.cpp:638
8      mozjs.dll     js::Invoke                       js/src/jsinterp.cpp:747
9      mozjs.dll     js::ExternalInvoke               js/src/jsinterp.cpp:871
10     mozjs.dll     JS_CallFunctionValue             js/src/jsapi.cpp:4961
11     xul.dll       nsXPCWrappedJSClass::CallMethod  js/src/xpconnect/src/xpcwrappedjsclass.cpp:1692
12     xul.dll       nsXPCWrappedJS::CallMethod       js/src/xpconnect/src/xpcwrappedjs.cpp:571
13     xul.dll       PrepareAndDispatch               xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114
14     xul.dll       SharedStub                       xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141
15     xul.dll       nsThread::ProcessNextEvent       xpcom/threads/nsThread.cpp:547
16     xul.dll       nsThread::ThreadFunc             xpcom/threads/nsThread.cpp:263
17     nspr4.dll     _PR_NativeRunThread              nsprpub/pr/src/threads/combined/pruthr.c:426
18     nspr4.dll     pr_root                          nsprpub/pr/src/md/windows/w95thred.c:122
19     mozcrt19.dll  _callthreadstartex               obj-firefox/memory/jemalloc/crtsrc/threadex.c:348
20     mozcrt19.dll  _threadstartex                   obj-firefox/memory/jemalloc/crtsrc/threadex.c:326
21     kernel32.dll  BaseThreadInitThunk
22     ntdll.dll     __RtlUserThreadStart
23     ntdll.dll     _RtlUserThreadStart

More reports at:
http://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=exact&query=JSString%3A%3Aflatten%28%29&range_value=4&range_unit=weeks&hang_type=any&process_type=any&plugin_field=&plugin_query_type=&plugin_query=&do_query=1&admin=&signature=JSString%3A%3Aflatten%28%29
blocking2.0: --- → ?
It also happens on Mac OS X.
OS: Windows 7 → All
Summary: crash [@ JSString::flatten() ] → crash [@ JSString::flatten() ] [@ JSString::flatten ]
#10 and #57 ranked crash on the trunk yesterday. Might have been around on the trunk for a while, but something seems to have caused a bit of a spike in the last few days.

date: total crashes (count x build, ...)

JSString::flatten
20100927  -
20100928  -
20100929  2  (2 x 4.0b6 2010091407)
20100930  1  (1 x 4.0b6 2010091407)
20101001  -
20101002  -
20101003  2  (1 x 4.0b7pre 2010100303, 1 x 4.0b7pre 2010100203)
20101004  1  (1 x 4.0b7pre 2010100403)
20101005  3  (2 x 4.0b7pre 2010100403, 1 x 4.0b7pre 2010100503)
20101006  -
20101007  -
20101008  -
20101009  -
20101010  -
20101011  1  (1 x 4.0b8pre 2010101003)
20101012  -
20101013  -
20101014  2  (2 x 4.0b8pre 2010101403)
20101015  6  (6 x 4.0b8pre 2010101503)
20101016  9  (4 x 4.0b8pre 2010101603, 4 x 4.0b8pre 2010101503, 1 x 4.0b8pre 2010101602)

There was some similar higher activity on this signature back around the 1st of September:

20100901  4  (4 x 4.0b4pre 2010080703)
20100902  3  (2 x 4.0b4 2010081812, 1 x 4.0b4pre 2010080703)
20100903  5  (4 x 4.0b4pre 2010080703, 1 x 4.0b5pre 2010082406)
20100904  -
20100905  2  (2 x 4.0b4 2010081812)
20100906  2  (2 x 4.0b4 2010081812)
20100907  1  (1 x 4.0b4 2010081812)
20100908  3  (2 x 4.0b4 2010081812, 1 x 4.0b5 2010083107)
20100909  -
20100910  -

Then things tailed off.
Er, that's just the Mac track record. The Windows spike is much higher, going from around 1-10 crashes per day to near 50, maybe starting with 4.0b8pre build 2010101504:

20101012   2  (2 x 4.0b6 2010091408)
20101013   8  (7 x 4.0b6 2010091408, 1 x 4.0b8pre 2010101304)
20101014   9  (4 x 4.0b8pre 2010101322, 2 x 4.0b8pre 2010101404, 2 x 4.0b6 2010091408, 1 x 4.0b6 2010091323)
20101015  49  (38 x 4.0b8pre 2010101504, 8 x 4.0b6 2010091408, 2 x 4.0b8pre 2010101404, 1 x 4.0b4 2010081813)
20101016  53  (27 x 4.0b8pre 2010101604, 24 x 4.0b8pre 2010101504, 1 x 4.0b6 2010091408, 1 x 4.0b4 2010081813)
Just hit that crash (bp-ceb8464f-13e2-43e7-8f9b-5b0c92101018 - stack identical to the one quoted by Scoobidiver) when testing a new Adblock Plus feature which isn't checked in yet. The crash happened when I hit a button that closes a XUL window and reloads a page in the browser (http://www.heise.de/). However, doing exactly the same thing again no longer triggers the crash, so the issue is probably intermittent. Or it is related to one of the ads on that page, which change on each reload.
Hit this after restoring my Mac from sleep and opening a URL from another application (in this case, IRC in a terminal).
blocking2.0: ? → beta7+
Here is the crash:

void
JSString::flatten()
{
    JSString *topNode;
    jschar *chars;
    size_t capacity;
    JS_ASSERT(isRope());

    /*
     * This can be called from any string in the rope, so first traverse to the
     * top node.
     */
    topNode = this;
    while (topNode->isInteriorNode())
        topNode = topNode->interiorNodeParent();    <-- *** CRASH HERE

In the common case, it is a null-pointer dereference on topNode. The assembly shows that it occurs after |interiorNodeParent| has been called at least once. This makes sense, because we would have crashed on |topNode->isInteriorNode| otherwise.

So, it appears that at the point of the crash we have a string that is tagged as an INTERIOR_NODE, but has a NULL parent, which is presumably invalid. But it's hard to say whether the tag is wrong or the parent is wrong. If the tag is wrong, then the memory that stores the parent for interior nodes is also unioned with a 4-byte inline storage, a mutable flat string capacity, and a JSRopeBufferInfo for top nodes. Alan, do you know which of those, if any, are allowed to be NULL?

An easy first step is to add an assertion in this loop, and see if any of our existing test cases trip it.
(In reply to comment #6) > An easy first step is to add an assertion in this loop, and see if any of our > existing test cases trip it. Of course, that makes no sense, because any such test case would already be hitting this crash.
In terms of history, there are 3 notable events with this crash:

0. There was bug 585309, which got fixed around 8/8/2010.

1. The crash started again on 9/23. The earliest builds with the 'new' crash are from 9/21 (20100921041551) and occurred on 9/24. The last merge from TM before then was on 9/20:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e4fdb23efa33&tochange=e6cfaf0840cb

This happens to be the same time period as bug 601725, but I don't know that there is any relationship. Bug 601725 does relate to requests and/or threads, and if something of that kind went wrong in there, I could see that maybe it is corrupting string headers by sharing them when it shouldn't. Another thing that happened in that merge is that we started JM-compiling some new ops that we hadn't before, so it's possible we exposed a bug somewhere by doing that. That doesn't seem likely, though.

2. The crash got much more frequent starting with 10/15 builds. Compartments landed just before that, but the first m-c build with compartments was, I believe, on 10/14, so I think the big compartments landing was *not* the cause here. But there were 4 compartments-related landings on m-c on 10/14, that would have first run in 10/15 builds:

(Don't try to waive Xray wrapper for primitives)
http://hg.mozilla.org/mozilla-central/rev/4666d67cba75

(Don't create Xray wrappers for chrome objects in sandboxes)
http://hg.mozilla.org/mozilla-central/rev/ed24c21e6497

(Some test changes)
http://hg.mozilla.org/mozilla-central/rev/ca9083cab8ef
http://hg.mozilla.org/mozilla-central/rev/83a06cce6b14

Of these, the first seems the most likely to me, since it involves primitives.

---

Given the clear spike in 10/15 builds, if code inspection/thinking don't provide a ready answer, it seems like backing out these patches one by one (modulo not introducing orange) and seeing if the numbers go back down is a reasonable next step.
(In reply to comment #6)
> So, it appears at the point of the crash we have a string that is tagged as an
> INTERIOR_NODE, but has a NULL parent, which is presumably invalid. But it's
> hard to say whether the tag is wrong or the parent is wrong. If the tag is
> wrong, then the memory that stores the parent for interior nodes is also
> unioned with a 4-byte inline storage, a mutable flat string capacity, and a
> JSRopeBufferInfo for top nodes. Alan, do you know which of those, if any, are
> allowed to be NULL?

- Inline storage is only valid in static strings and short strings, and is rarely NULL, if ever.
- Capacity is usually 0, and that field is always NULL in dependent strings, so it's possible that we have a flat or dependent string mistagged as an interior node.
- mBufferWithInfo is almost never NULL. However, it's NULL in js_ConcatStrings just before a top node is converted to an interior node, after we free the buffer that it points to.

A few other thoughts:

I noticed that the caller of flatten is often js_ConcatStrings on line 404, which is the call to FinishConcat. A bunch of things get inlined, but I don't see where flatten() would be called.

In js_ConcatStrings, my code assumes that NewFinalizableGCThing will not flatten any existing strings (e.g. by calling chars() on them). If one of the arguments is flattened while calling NewFinalizableGCThing, then I think we would crash in flatten because a top node has a NULL buffer (which doesn't agree with any of the crash line numbers, I think).

There seem to be a number of broken invariants with strings here. For example http://crash-stats.mozilla.com/report/index/db1d4cc0-3766-430e-9a51-1e96b2101017 shows flatten calling itself, which should never happen. What's happening there is there's a node that has a top node as its left child, which should never happen either.

The change http://hg.mozilla.org/mozilla-central/rev/4666d67cba75 seems like it could have an effect on this kind of thing, since JS_WrapValue flattens its argument if it's a string.
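To make the aliasing described above concrete, here is a rough sketch of how those union arms could overlap. This is a sketch only; the field names are approximated from this discussion, not taken from the actual jsstr.h:

struct JSStringSketch {
    size_t mLengthAndFlags;          // type tag (e.g. INTERIOR_NODE) lives in the low bits
    union {
        char mInlineStorage[4];      // static/short strings: the 4-byte inline storage
        size_t mCapacity;            // mutable flat strings; always 0/NULL in dependent strings
        JSStringSketch *mParent;     // interior rope nodes: what interiorNodeParent() reads
        void *mBufferWithInfo;       // top rope nodes: points at a JSRopeBufferInfo
    } u;
};
// Because all four arms share the same bytes, a wrong type tag makes one
// field read as another: e.g. a dependent string's 0 capacity, mistagged
// as an interior node, reads back as a NULL parent.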
This is happening upon restarting after applying the latest nightly update, on Mac (20101019).
Keywords: reproducible
Alan, can you explain a bit why the JS_WrapValue code flattening is a problem here?
From conversation with Andreas: the JS_WrapValue patch vastly increased the number of flatten operations that we do, so it makes sense that that patch would spike the frequency of this crash.

At this point we have:

- This started around the same time as bug 601725.
- Invariants are being violated. (Interestingly, that is also happening in bug 601725.)

Adding more frequent assertions to check the invariants might help here.
Update on the history--I think comment 8 is invalid. For that comment, I was looking only at nightlies. For some reason we don't see this crash in nightlies before 9/24, but it was occurring in betas. In fact, it seems we had this crash in all betas that had ropes. So, there's a good chance this was a bug in the original code. At this point, I think we should figure out again what invariant violations are occurring, then audit the code for them, and add assertions to catch this as early as possible. We probably want them to be assertions in opt builds as well so we can get data from Socorro.
Looking at a summary of the top ten frames of all the stacks in the top 200 signatures, these signatures all seem to run through a lot of the same code as the stack in comment 0:

#9 JSString::flatten
#12 js_NewGCString(JSContext*)
#21 JSString::flatten
#22 js_ConcatStrings
#87 js_ConcatStrings(JSContext*, JSString*, JSString*)
#135 js_NewGCShortString(JSContext*)

Search for ConcatStrings in http://people.mozilla.com/crash_stacks/stack-summary-4.0b8pre.txt

Bugs like bug 605852 are appearing on these other signatures. Should we dup them against this one?
(In reply to comment #14) > bugs like bug 605852 are appearing on these other signatures. should we dup > against this one? Seems OK to me--we can easily reopen them if they prove to be different bugs.
Bug 605852 is definitely a dup of bug 605836 - crash [@ js_NewGCShortString(JSContext*) ] - but comments in 605836 suggest it may be related to bug 601725 - Firefox 4.0b7pre Crash Report [@ StartRequest ]. Maybe it adds too much complication to mix that chain of bugs in until we understand this one better.
Luke gave me the idea of looking more closely at the crash stacks. It turns out they cluster into 2 main kinds:

- Type 1: In these, the next stack signature after |flatten| is |ConcatStrings:404|. This line of code calls FinishConcat. I'm not even sure how that calls flatten, so I need to look at the minidumps more closely.

- Type 2: In these, we crash on the lines of ConcatStrings that call flatten directly:

    if (left->isInteriorNode())
        left->flatten();
    if (right->isInteriorNode())
        right->flatten();

The crash appears to be that in walking up to find the top node, we run into a null pointer first. This was the cause I noted earlier.

In crashes from the last few days, 2/3 (of a 30-crash sample I classified) were of type 1, and 1/3 were of type 2. Before the 10/15 spike, 1/6 (of a 12-crash sample) were of type 1, and 5/6 were of type 2. So there's a good chance there are 2 separate bugs here, where presumably the type 1 crashes are the reason this blocks.
Assignee: general → dmandelin
It turns out the stack traces are bogus in the type 1 crashes. The minidumps in MSVC show that we are really calling flatten from the same lines of code as in type 2 crashes.
Build identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101021 Firefox/4.0b8pre Crashed as well. Flash had crashed, I dismissed the dialog. After browsing some more, the app ended up crashing as well. http://crash-stats.mozilla.com/report/index/bp-f75000de-30bc-47a5-87d5-9dcba2101021
Attached patch Diagnostic patch (deleted) — Splinter Review
I want to land this temporarily on m-c to gather more data.
Attachment #485202 - Flags: review?(dvander)
Attachment #485202 - Flags: review?(dvander) → review?(lw)
Comment on attachment 485202 [details] [diff] [review]
Diagnostic patch

Excellent!

>+        // About to crash: record some black-box info.
>+        char blackbox[4 + sizeof(JSString) + 4];
>+        memcpy(blackbox, "AAAA", 4);
>+        memcpy(blackbox + 4 + sizeof(JSString), "AAAA", 4);
>+        memcpy(blackbox + 4, (char *) topNode, sizeof(JSString));

Since memcpy is known to the compiler, this whole sequence may get killed. 'volatile char blackbox' should do the trick, although to be really, really safe, you might additionally assign &blackbox[0] to a volatile global.
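A sketch of what the reviewer is suggesting (not the landed patch; the byte-wise loop replaces memcpy because memcpy cannot write through a volatile pointer without casting the qualifier away):

// Sketch of the suggested hardening: volatile stores plus an escaping
// global, so the compiler cannot treat the blackbox writes as dead.
static volatile char *gBlackboxSink;             // hypothetical volatile global

volatile char blackbox[4 + sizeof(JSString) + 4];
const char *src = (const char *) topNode;
for (size_t i = 0; i < 4; i++) {
    blackbox[i] = 'A';                           // leading marker
    blackbox[4 + sizeof(JSString) + i] = 'A';    // trailing marker
}
for (size_t i = 0; i < sizeof(JSString); i++)
    blackbox[4 + i] = src[i];                    // copy the string header
gBlackboxSink = &blackbox[0];                    // address escapes: extra insurance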
Attachment #485202 - Flags: review?(lw) → review+
Comment on attachment 485202 [details] [diff] [review] Diagnostic patch Requesting approval for this diagnostic patch that will help us fix the b7 blocker.
Attachment #485202 - Flags: approval2.0?
Attachment #485202 - Flags: approval2.0? → approval2.0+
Crash reports with the diagnostic patch are starting to come in. So far, 3 out of 4 crash at the same point with a crash address of 0x2. This suggests the bug is related to the traversal loop in flatten:

- Before the loop starts, the top node is converted to an interior node with a parent of 0x2 (with the diagnostic patch--it is NULL in the original code).
- The loop is supposed to walk over the tree postorder, left child first, converting each interior node to a flat dependent string.
- The loop finishes when it reaches a node with a parent of 0x2: this is supposed to be the original top node.
- After the loop finishes, the top node is converted to a flat mutable string.

So, how do we end up with interior nodes with a parent of 0x2 at the crash point? It seems like it can't happen: flatten() sets topNode's parent to 0x2, and at the end it converts topNode to a flat string, at a point that postdominates the point that sets 0x2.

- One theoretical possibility is that two threads call flatten on the same string simultaneously. Say T1 calls flatten and sets the 0x2 parent, then we switch to thread T2; T2 calls flatten on the same string and reaches the original top node while it still has a 0x2 parent, and then we crash. Can this happen? I don't know, but it seems like the best bet at this point. We could try to detect that by keeping a global variable that tells us if we are in flatten, and seeing if we have re-entered flatten when we crash.

I'll put recommendations in the next comment to set them off from these investigative notes.
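To make the two-thread hypothesis concrete, here is a sketch (not the actual jsstr.cpp code; setInteriorNodeParent is a hypothetical name) of how a second thread's walk-up could land on the 0x2 sentinel:

// Sketch of the hypothesized race; names and structure are approximate.
JSString *const SENTINEL = (JSString *) 0x2;   // diagnostic-patch parent of the top node

// T1, inside flatten(): marks the top node before the postorder loop.
topNode->setInteriorNodeParent(SENTINEL);      // hypothetical setter

// T2, entering flatten() on the same rope before T1 finishes:
JSString *node = str;
while (node->isInteriorNode())
    node = node->interiorNodeParent();         // eventually yields 0x2; the next
                                               // isInteriorNode() load then faults
                                               // at an address near 0x2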
Is it true that (almost) all crashes happen in chrome code? I saw only one stack with JaegerShot, and JM is disabled for chrome...
Recommendations:

1. Disable the ropes code for the b7 branch. It looks like a threading bug, so schedule risk is high. We can re-enable on trunk and continue investigating.

2. Next steps in investigation:

2.1. Try to show that we re-enter |flatten| when we crash. As noted above, we can keep a global "in-flatten" flag and assert if we are about to crash while re-entering (a sketch follows this comment). We could try to get more fine-grained by keeping track of which strings we are trying to flatten and asserting if it's the same one, but that is harder because in this context S1 and S2 are "the same" if they are members of the same rope.

2.2. Try preventing re-entrance of |flatten| with some locking. If that makes the crash go away, then we know the cause. This seems to give less information than 2.1, so it is probably worth trying only if it is easier or more reliable.

2.3. Add a |cx| parameter to |flatten|, then assert that we don't have a compartment mismatch in flatten. This seems to be a pretty direct way of detecting violations of our basic threading invariants. And if it does assert, then we know the exact point that is making a bad cross-compartment call. The disadvantage is that if we do this and don't get asserts, we still haven't ruled out all threading issues: we'd still have to do 2.1 or 2.2 for that.

2.4. The spike on 10/15 seems kind of interesting again. If we look at those changes, in particular the one that calls |flatten| more often when wrapping strings, we might be able to spot the issue.

I think 2.3 is the most promising right now. We know there are compartment mismatch bugs, and this could just be another one. Keep in mind that this crash predates compartments, so compartments probably didn't cause the bug; the issue would be more that compartments didn't stop it.
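The sketch promised in 2.1, assuming NSPR atomics (hypothetical code, not a patch; a real version would also want to fire in opt builds, per the earlier point about getting data from Socorro):

#include "pratom.h"   // PR_ATOMIC_INCREMENT / PR_ATOMIC_DECREMENT

static PRInt32 gFlattenDepth = 0;   // global "in-flatten" counter

void
JSString::flatten()
{
    // Fires if a second thread (or re-entrant call) is already flattening.
    if (PR_ATOMIC_INCREMENT(&gFlattenDepth) != 1)
        JS_Assert("re-entered JSString::flatten", __FILE__, __LINE__);
    /* ... existing flatten body ... */
    PR_ATOMIC_DECREMENT(&gFlattenDepth);
}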
(In reply to comment #24) > Is it true that (almost) all crashes happen in chrome code? I saw only one > stack with JaegerShot, and JM is disabled for chrome... Hey, great observation! I think that supports the idea that these crashes are threading-related, since content isn't supposed to use multithreading for the most part. It also means we could disable the ropes optimization only for contexts that are chrome (or that have methodjit off), which would hopefully nearly eliminate the crash while also not costing us any perf in content.
That's a very useful observation Jan. Awesome. Dave, we should look at extensions. Browser chrome doesn't multithread either.
Is JM enabled for workers?
Looks to me like JM is enabled for workers: http://mxr.mozilla.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#1241 and workers use nsIScriptContext's only implementation, nsJSContext. Bent can confirm. The string code grew up immutable, and thus safe cross-thread. I think we need to restore that even if it costs us on the thread-crossing boundaries (it should have zero cost on same-thread-only code and data). /be
Compartment transitions flatten and copy strings. We are thread-safe by design. This is a bug.
(In reply to comment #29) > Looks to me like JM is enabled for workers: No. Workers have their own contexts and the relevant options are here: http://mxr.mozilla.org/mozilla-central/source/dom/src/threads/nsDOMThreadService.cpp#1031 JM is not enabled for workers yet, though I'd love to try it out.
Alright, I am pretty sure this is a workers-related bug then. Let's whip up a stress test that sends strings back and forth, maybe.
The structured cloning stuff decomposes the string into a stream, so that seems fine. Bent, any other way a worker can touch a JS string?
Well, we do in a few places here: http://mxr.mozilla.org/mozilla-central/search?string=nsdependentjsstring&find=dom%2Fsrc%2Fthreads&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central ... but it doesn't look like we hang on to them or anything, we just convert to xpcom strings. The only other place where we use JSStrings is in setTimeout calls where a string expression is passed, here: http://mxr.mozilla.org/mozilla-central/source/dom/src/threads/nsDOMWorkerTimeout.cpp#155 The strings are then evaluated here: http://mxr.mozilla.org/mozilla-central/source/dom/src/threads/nsDOMWorkerTimeout.cpp#196
Oh, and those setTimeout strings may not be saved and evaluated on the same thread, but they're guaranteed to not run on two threads simultaneously.
We are flattening a string on 2 threads simultaneously, so this must be some sort of user data that leaks across. Ropes tend to be generated by user code, not within gecko.
ProxyAutoConfig was on a thread at one time...
(In reply to comment #25)
> Recommendations:
>
> 1. Disable the ropes code for the b7 branch. It looks like a threading bug, so
> schedule risk is high. We can re-enable on trunk and continue investigating.

If you do this, you shouldn't just revert to the pre-ropes string concat code, since that isn't compatible with our current design of strings (in particular, it reallocs buffers of flat strings, which isn't allowed anymore because dependent strings now keep a direct pointer to their chars). The right way to disable ropes, I think, is to just call flatten at the end of js_ConcatStrings (in particular, at the end of FinishConcat). That way, even though we make and flatten ropes during js_ConcatStrings, ropes are always trivial, and the outside world never sees them. The mutable/flatCapacity optimization in js_ConcatStrings makes it so we don't get quadratic append time when concatenating to the end of a string. On my machine, making this change causes about a 5% slowdown on SunSpider.

> 2.3. Add a |cx| parameter to |flatten|, then assert that we don't have a
> compartment mismatch in flatten. This seems to be a pretty direct way of
> detecting violations of our basic threading invariants. And if it does assert,
> then we know the exact point that is making a bad cross-compartment call. The
> disadvantage is that if we do this and don't get asserts, we still haven't
> ruled out all threading issues: we'd still have to do 2.1 or 2.2 for that.

Adding a cx parameter is probably hard, since JSString::chars doesn't take a cx, and chars is used all over the place (this is roughly the same reason that we eagerly allocate rope buffers instead of allocating them during flatten). Maybe a more reasonable approach would be to pass a cx from the most common callers and pass NULL everywhere else (and don't do the assert if cx is NULL).
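The 'trivial ropes' approach described above, in sketch form (assumed structure; the real js_ConcatStrings/FinishConcat signatures may differ in detail):

// Sketch: keep building ropes inside js_ConcatStrings, but flatten the
// result before it escapes, so the outside world never sees a real rope.
JSString *
js_ConcatStrings(JSContext *cx, JSString *left, JSString *right)
{
    JSString *result = FinishConcat(cx, left, right);   // builds the rope
    if (result && result->isRope())
        result->flatten();                              // rope never escapes
    return result;
}

Because the concat path itself is unchanged, the mutable/flatCapacity append optimization still applies inside, which is presumably why the measured cost is the ~5% SunSpider slowdown mentioned above rather than quadratic append behavior.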
(In reply to comment #38) > ProxyAutoConfig was on a thread at one time... At least in my case the crash happened without proxy autoconfig.
I disagree with disabling ropes. The bug here is that non-immutable strings are shared between threads. We happen to see ropes fall over harder than other forms of mutable strings, but disabling ropes is the wrong fix. We have to find the cross-thread string leak. We have a test case that shows a crash with workers while they are concat-ing strings. Let's chase that one down with replay on Monday.
Andreas meant bug 606705 by "We have a test case ...." in comment 41. /be
Dave, I only just noticed that the crash-if-not-on-the-string's-compartment's-thread check is only in js_ConcatStrings. Perhaps you could push it down into flatten()?
Blocks: 602157
Depends on: 606705
Over the weekend, almost all the crashes had an address of 0x2, so the analysis above has a good foundation. We should back out the diagnostic patch now. Do I need to get approval on that? Is that going to need 'a=' in the checkin comment?
(In reply to comment #41)
> I disagree with disabling ropes. The bug here is that non-immutable strings are
> shared between threads. We happen to see ropes fall over harder than other
> forms of mutable strings, but disabling ropes is the wrong fix. We have to find
> the cross thread string leak. We have a test case that shows a crash with
> workers while they are concat-ing strings. Lets chase that one down with replay
> on Monday.

Andreas, do you think:

(a) We should fix this bug eventually, but it is OK to switch off ropes for now to get this blocker out of the way, or
(b) The underlying thread string leak is severe enough that fixing that bug should block b7?
This patch should stop most of the immediate crashes. But Andreas wants to find the cross-thread string leakage first, so not requesting review for now.
Just following along, but I agree with Andreas: cross-thread string leakage is bad even without ropes, due to the mutable string optimization (which allows a mutable single-threaded string to become a dependent string pointing into a larger string concat result string). /be
Regarding recent discussion of next steps:

- If we want to fix this bug fast (i.e., in order to get beta 7 out the door), and we want to fix the underlying bug in threads/compartments, then I am the wrong assignee. I know little about compartments, so I'd have to spend a bunch of time just learning how things work.
- If we don't care to fix this fast, then let's just make it block beta 8 instead.
- If we don't care to fix the underlying bug immediately, then I think we can make the crash largely go away by restricting ropes to the main content thread.

Guidance from release drivers would be greatly appreciated.
Attachment #485846 - Flags: review?(lw)
Attachment #485846 - Flags: approval2.0?
Attachment #485846 - Flags: review?(lw) → review+
B7 blockers don't need approval.
Attachment #485846 - Flags: approval2.0?
Restrict patch landed: http://hg.mozilla.org/mozilla-central/rev/8dd996c91f40

- We'll have to check the crash stats for the next day or two to see if this did it.
- If this works, once b7 goes out, if we haven't figured out the underlying cause yet, we can enable ropes for content or DOM workers selectively to see if it is crashing in one of those areas or both.
So far, it doesn't look like the test patch worked. It's hard to say why. I'm pretty sure this method of 'disabling' ropes doesn't actually entirely prevent rope-related multithreading bugs, because two threads could concatenate strings S1 and S2 simultaneously, and then they could try to flatten their trees simultaneously, both of which contain both S1 and S2. Not sure what to do next. We could try to step on races harder, by really making it not access two mutable string data structures simultaneously. Or we could try to track down the leakage. Neither path necessarily gets us to a fix very quickly.
(In reply to comment #51)
> We could try to step on races harder, by really
> making it not access two mutable string data structures simultaneously.

What about adding asserts to rope manipulation code that the current thread matches the compartment thread? One does not need cx in string methods for that. It would be sufficient to add a JSThread into the compartment and then check, in ropes or any other string mutation code, that str->compartment()->thread->id == GetCurrentThreadId().
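Roughly what is being proposed, as a sketch (the compartment-to-thread link is hypothetical; the next comment points out that jscompartment.h has no such field):

// Sketch of the proposed diagnostic: JSCompartment::thread would need to
// be added, and js_CurrentThreadId() stands in for GetCurrentThreadId.
static void
AssertStringMutationThread(JSString *str)
{
    JSThread *t = str->compartment()->thread;   // hypothetical field
    JS_ASSERT(t && t->id == js_CurrentThreadId());
}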
(In reply to comment #52) > (In reply to comment #51) > > We could try to step on races harder, by really > > making it not access two mutable string data structures simultaneously. > > What about adding asserts to rope manipulation code that the current thread > matches the compartment thread? One do not need cx in string methods for that. > It would be sufficient to add JSThread into the compartment and then check in > ropes or any other string mutation code that str->compartment()->thread->is == > GetCurrentThreadId. That sounds like a good idea. But how does one go from a compartment to its thread? I don't see the name 'thread' anywhere in jscompartment.h. Also, what exactly would the thread of a compartment mean? I thought different threads could enter the same compartment, just not at the same time.
There is no notion of a compartment thread at the moment since threads enter compartments in an unbalanced manner due to cx->globalObject and SetFrameRegs.
I looked at the first 13 crashes that came in from the latest build. 5 of them are still crashing walking up the tree, mostly on address 0, but one on 0x1f. 5 more crash in the traversal loop, where str becomes 0x1ff or something bad like that. There's no way to tell which branch of the switch did it, so it could be just another case where we see a bad parent pointer, or maybe not.
We are in a tough spot here. We have pretty good evidence that the crash is caused by two threads calling |flatten| simultaneously, but that's all we know so far.

Here's what we think is happening. |str->flatten()| can be abstracted to this:

1. enter the function
2. walk to the root of str's concat tree
3. break invariants of the concat tree
4. do stuff
5. restore invariants of the concat tree

If we crash in thread T1, it could happen because of a schedule like this:

    T1        T2
              enter
              walk
              break
    enter
    walk      XXX
    walk      <- crashes because invariants are broken

Note that the diagnostic patch checked compartments of the arguments to js_ConcatStrings. That is, it would have crashed with a specific address if the compartment of either string argument did not match the compartment of the cx. No such crashes were observed, which tells us that js_ConcatStrings, at least, is usually being used correctly wrt compartments. It seems that some other thread, one that doesn't run scripts, is making the mismatched-compartment call.

We might like to learn more about this by setting things up so that we crash if a second thread enters |flatten| for a given concat tree while a thread is already in |flatten|. But this seems like it probably wouldn't tell us much that we don't already know. The reason is that the second thread is already crashing, so we wouldn't get any new stacks. And it's really the first thread that we probably want to know about, because our crash stacks have a lot of js_ConcatStrings, which appears to be using compartments correctly, as explained in the previous paragraph. So all it would really do is help confirm that we are crashing because of races. But we have other ways of doing that that may be easier.
(In reply to comment #56) > Note that the diagnostic patch checked compartments of the arguments to > js_ConcatStrings. That is, it would have crashed with a specific address if the > compartment of either string argument did not match the compartment of the cx. > No such crashes were observed, which tells us that js_ConcatStrings, at least, > is usually being used correctly wrt compartments. I don't think this is true. From the recent nightly crashes in js_ConcatStrings: http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&date=2010-10-26%2014%3A00%3A00&signature=js_ConcatStrings&version=Firefox%3A4.0b8pre it looks like when the diagnostic patch was there, there were a number of crashes at addresses 0xd0 and 0xd4, which were the two addresses used for the compartment checks.
(In reply to comment #57)
> (In reply to comment #56)
> > Note that the diagnostic patch checked compartments of the arguments to
> > js_ConcatStrings. That is, it would have crashed with a specific address if the
> > compartment of either string argument did not match the compartment of the cx.
> > No such crashes were observed, which tells us that js_ConcatStrings, at least,
> > is usually being used correctly wrt compartments.
>
> I don't think this is true. From the recent nightly crashes in
> js_ConcatStrings:
>
> http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&date=2010-10-26%2014%3A00%3A00&signature=js_ConcatStrings&version=Firefox%3A4.0b8pre
>
> it looks like when the diagnostic patch was there, there were a number of
> crashes at addresses 0xd0 and 0xd4, which were the two addresses used for the
> compartment checks.

That helps a lot! I forgot that we need to look with the other signature to find those. I looked at a bunch of them and they are pretty much just normal calls to stubs::Add, which makes me think there's just random crap that leaks strings across compartments, which is about what we expected. (We did also wonder about other threads just calling stuff, like DOM workers or minor browser threads.)

Here's the correlation report for js_ConcatStrings:

78% (25/32) vs. 27% (103/379) firebug@software.joehewitt.com (Firebug, https://addons.mozilla.org/addon/1843)
44% (14/32) vs.  6% (24/379) jsonview@brh.numbera.com (JSONView, https://addons.mozilla.org/addon/10869)
34% (11/32) vs.  5% (18/379) firediff@johnjbarton.com (Firediff, https://addons.mozilla.org/addon/13179)
34% (11/32) vs.  9% (34/379) {c45c406e-ab73-11d8-be73-000a95be3b12} (Web Developer, https://addons.mozilla.org/addon/60)
44% (14/32) vs. 19% (71/379) {3d7eb24f-2740-49df-8937-200b1cc08f8a} (Flashblock, https://addons.mozilla.org/addon/433)
31% (10/32) vs.  9% (33/379) {e4a8a97b-f2ed-450b-b12d-ee082ba24781} (Greasemonkey, https://addons.mozilla.org/addon/748)
25% (8/32) vs.  3% (13/379) netexport@getfirebug.com
25% (8/32) vs.  3% (13/379) eventbug@getfirebug.com
25% (8/32) vs.  3% (13/379) selectbug@getfirebug.com
25% (8/32) vs.  3% (13/379) firestarter@getfirebug.com
25% (8/32) vs.  4% (14/379) facebookBlocker@webgraph.com
25% (8/32) vs.  4% (15/379) rapportive@rapportive.com
25% (8/32) vs.  4% (16/379) {6AC85730-7D0F-4de0-B3FA-21142DD85326} (ColorZilla, https://addons.mozilla.org/addon/271)
25% (8/32) vs.  4% (16/379) foxmarks@kei.com (Xmarks (formerly Foxmarks), https://addons.mozilla.org/addon/2410)
25% (8/32) vs.  4% (17/379) {c151d79e-e61b-4a90-a887-5a46d38fba99} (Pearl Crescent Page Saver Basic, https://addons.mozilla.org/addon/10367)
25% (8/32) vs.  4% (17/379) firecookie@janodvarko.cz (Firecookie, https://addons.mozilla.org/addon/6683)
50% (16/32) vs. 30% (114/379) compatibility@addons.mozilla.org
22% (7/32) vs.  3% (12/379) forcetls@sid.stamm (Force-TLS, https://addons.mozilla.org/addon/12714)
25% (8/32) vs.  7% (28/379) {46551EC9-40F0-4e47-8E18-8E5CF550CFB8} (Stylish, https://addons.mozilla.org/addon/2108)
25% (8/32) vs.  9% (34/379) {DDC359D1-844A-42a7-9AA1-88A850A938A8} (DownThemAll!, https://addons.mozilla.org/addon/201)
16% (5/32) vs.  3% (10/379) chromebug@johnjbarton.com
13% (4/32) vs.  3% (11/379) grafxbot@mozilla.org
 9% (3/32) vs.  1% (5/379) speak.words@agadak.net
 9% (3/32) vs.  2% (7/379) es-es@dictionaries.addons.mozilla.org (Spanish (Spain) Dictionary, https://addons.mozilla.org/addon/3554)
 9% (3/32) vs.  2% (9/379) sendtophone@martinezdelizarrondo.com
50% (16/32) vs. 44% (166/379) {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d} (Adblock Plus, https://addons.mozilla.org/addon/1865)
 9% (3/32) vs.  3% (13/379) yslow@yahoo-inc.com (YSlow, https://addons.mozilla.org/addon/5369)
 9% (3/32) vs.  4% (14/379) en-US@dictionaries.addons.mozilla.org (United States English Dictionary, https://addons.mozilla.org/addon/3497)

It looks very possible that there are a variety of extensions, principally developer extensions like Firebug, that are leaking strings across compartments. I do see a few of these that have no extensions installed, but most of them have Firebug, and several others are common, so there's a good chance extensions are driving the volume up here.
So, what can we try next? I see a few possibilities:

1. Try to confirm that we are crashing because of races.

One way would be to put a mutex around flatten, so 2 threads can never enter it simultaneously. This actually isn't quite right, because you are allowed to call flatten only on a rope node, and flatten makes the nodes not be rope nodes. So the mutex would have to be acquired before deciding to call flatten, and released at the end. Another way would be to put a count of how many threads are currently flattening a given rope top node at a time in the string header. Then, if we are about to crash, we can encode that value into the address. If it is usually greater than 1, then we can conclude races cause the crash.

What would this buy us? If it says we are crashing due to races, not much, just confirmation of our guess. If it says we are not crashing due to races, then we need to start over.

2. Try to find the first thread that calls flatten. This is the thread we don't get to see now.

Bill and I had an idea how to do this. Strings would carry an 'in-flatten' flag, and a 'crash-when-done-with-flatten' flag. If a call enters flatten and the in-flatten flag is set, it would set the crash flag and then wait for 10 seconds. When exiting flatten, crash if the crash flag is set. This would give us a chance to see the call stack that presumably is not supposed to be able to call flatten. It's also possible that that first thread is showing in dumps we have already received. I checked a few and it wasn't there. I'm not sure if we should expect to see it--I don't know what happens with thread scheduling between a crash and crash report generation.

What would this buy us? It might point us directly at the answer, if the call stack was obviously calling from the wrong place, or doing some JSAPI from the wrong thread. Conversely, it might tell us very little, if both call stacks look fine, and the real problem is that a string was leaked to the wrong thread/compartment earlier in time. But maybe we could add a compartment check near the bad call into flatten, and then try to work back from there.

3. Try to find the cross-compartment call.

We could add a cx argument to flatten, push that back through the APIs, and then do a compartment check in flatten. It might touch many lines of code, because chars() also has no cx and calls flatten, and is called in many places. Also, there might be public APIs that transitively call flatten and don't take a cx. For those, we'd need to create a new version that does take a cx, and then use that in the browser. We'd have to provide some other check for extensions, maybe just check whether they are racing using one of the other techniques.

What would this buy us? Potentially, a lot. There are a lot of compartment checks in the code now, and they seem pretty productive for bugfinding (of course, in those cases there are STR so it's easier). Still, it seems like it could take us a lot closer to the problem if they hit. If we added these checks and they didn't hit, we'd have to rethink.

4. Try to find the "wrong thread" call.

It seems that the problem is that thread T2 calls flatten when it really shouldn't. So, can we check in flatten that we are on the right thread? It seems hard--it is perfectly OK for different threads to call into the same compartment at different times. I'm not sure there is any notion of "correct thread" that we can use here.

--

It seems best to start with #1 today, to find out if threads actually is the right idea at all.
After that, #2 seems better, because although #3 seems more direct, #2 seems much easier to implement. A crazy option would be to just make flatten thread-safe. We could use some kind of thin lock on the root node of the rope, and then make any second thread that enters wait, and then return once the first thread has returned. I'm not sure what the perf effects would be.
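Option 1 from the list above, in sketch form (assuming the old JS_LOCK_RUNTIME-style macros; note the lock must be taken before the isInteriorNode test, since flatten changes the answer):

// Sketch of the mutex approach, not the landed diagnostic patch: a
// runtime-wide lock serializes the decide-then-flatten sequence.
static void
FlattenOperandsSerialized(JSContext *cx, JSString *left, JSString *right)
{
    JS_LOCK_RUNTIME(cx->runtime);
    if (left->isInteriorNode())
        left->flatten();
    if (right->isInteriorNode())
        right->flatten();
    JS_UNLOCK_RUNTIME(cx->runtime);   // must release on every exit path
}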
See also comments 11-14 in bug 604818, which may be related. For at least one user, the problem went away with Firebug, consistent with my comment 58 here on Firebug and such possibly causing a lot of this crash. Andreas says Firebug doesn't create threads, so it shouldn't cause races. But we know it does cause compartment mismatches, at least, and it seems to be associated with crashes in flatten, js_ConcatStrings, and Decompile, which are all string-related crashes. I'm not sure what to make of that. I think I'll still try the mutex, just to see if it can tell us anything. But I'm starting to think that a lot of this is JSD-related and that grinding away much harder on this before getting the known JSD problems taken care of may be a waste of time.
Most of the jsd-related compartment mismatches should be fixed as of yesterday. Maybe we should resample over night with your patch?
(In reply to comment #61) > Most of the jsd-related compartment mismatches should be fixed as of yesterday. > Maybe we should resample over night with your patch? We could do that, but what information do you expect to get from it? Fixing the JSD compartment mismatches should bring down the total crash count, which seems like it may have been happening. Is there something else we can learn from that?
Attached patch Diagnostic patch to serialize flatten (obsolete) (deleted) — Splinter Review
This will test the hypothesis that concurrent calls to flatten cause the crash. It doesn't seem to affect SunSpider all that much either--maybe 20ms more than the last nightly I tested, but at least some of that is from PGO, so it seems pretty low impact.
Attachment #486253 - Flags: review?(gal)
Attachment #486253 - Flags: review?(gal) → review?(dvander)
Comment on attachment 486253 [details] [diff] [review] Diagnostic patch to serialize flatten r=me with UNLOCK_RUNTIME before the return
Attachment #486253 - Flags: review?(dvander) → review+
Attached patch v2 serializing diagnostic (deleted) — Splinter Review
Attachment #486253 - Attachment is obsolete: true
Attachment #486254 - Flags: review?(dvander)
Attachment #486254 - Flags: review?(dvander) → review+
Comment on attachment 486254 [details] [diff] [review] v2 serializing diagnostic Looks good.
Attachment #486254 - Flags: review?
Attachment #486254 - Flags: review? → review+
I hit this crash just now, and I don't have Firebug installed, FWIW: http://crash-stats.mozilla.com/report/index/bp-09af5fd4-398a-42df-a3bb-d4ac62101027 about the only interesting extension I have installed is Flashblock.
(In reply to comment #67) > I hit this crash just now, and I don't have Firebug installed, FWIW: > http://crash-stats.mozilla.com/report/index/bp-09af5fd4-398a-42df-a3bb-d4ac62101027 > > about the only interesting extension I have installed is Flashblock. Hmm, I'll run with Flashblock in a debug build. I bet I'll hit a compartments mismatch. Reading http://www.mozdev.org/source/browse/flashblock/source/content/flashblock/flashblock.xml?annotate=1.8.2.47#14 I see this: function nativeMethod(untrustedObject, methodName) and your stack has a call to apply on it.
Oh ho, I had forgotten all about that nonsense. Since the Flashblock XBL binding loads in web content, it uses that silliness to ensure that webpages can't kill DOM methods that it needs to function.
I just tried Flashblock with a TM tip debug build and didn't have any problems. But it looks like Ted's build is from 10-25.
(In reply to comment #54) > There is no notion of a compartment thread at the moment since threads enter > compartments in an unbalanced manner due to cx->globalObject and SetFrameRegs. For diagnostic purposes this should not be a problem. So my suggestion is to record the last thread that entered the compartment and then assert on that thread in string manipulation code. The false positives are not bad in this case as they may also give some insight into the problem.
Well, it looks like it's stopped crashing: there have been reports of this crash coming in all morning, but zero from the 10/27 build. Unfortunately, there seems to have been no Windows build this morning, so we don't have data there yet, but it looks like it has stopped on Mac/Linux. I tried SunSpider on an m-c tinderbox build, and my score was 232.5, which seems to show that there was no perf impact from the latest patch. (My last run, with a TM build of a few days ago, got 241. Chrome 8 was 218, so we are within 7% now on trunk even on my machine.) So, we could probably keep this as a mitigation patch if we wanted. Currently contemplating next steps.
I hit a slightly different manifestation of this today: http://crash-stats.mozilla.com/report/index/15cfddcb-c9d9-46a6-8eba-5b4152101028 [@ RtlpWaitOnCriticalSection | RtlpDeCommitFreeBlock | PR_Lock ] The source line in frame 4 looks wrong, but it looks like we're calling PR_Lock with a NULL PRLock.
Thanks Ted. This is pretty interesting. We inserted a lock into trunk to exclude concurrent access. You crashing with a NULL lock there looks really weird. Memory corruption? Bad string pointer? (We find the lock by finding the compartment for the string.)
philor's all-seeing-eye points out a crash @ JSString::flatten from yesterday: http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288241702.1288242307.16454.gz#err2 which occurred in the middle of a Sync test. I should note that I do use Sync, so it's plausible that these crashes are occurring during a background sync.
The profile that crashed for me has Sync set up as well.
The volume is way down with the mutex in place, but we still get a few. The most common one of what's left crashes at jsstr.cpp:136 [1] (reported as line 144 in both Socorro and VS, but it's really 136) getting the capacity of a rope top node's buffer. It is an NPE, so somehow we can get a rope top node with no buffer. That should be filed as a separate bug, though, if we want to study it, but I don't recommend doing that until the main one is solved. I'm wondering if it could be caused by a race in some other string-related function, because the mutex only stops races in flatten. [1] https://crash-stats.mozilla.com/report/index/a2bf9fa1-ab1f-4fa3-ba8f-c8a262101027
(In reply to comment #76) > The profile that crashed for me has Sync set up as well. I think Sync is a red herring. We're seeing this in pretty high numbers, higher than I believe is representative of sync users within our nightly audience.
Thanks beltzner. I also can't find any multithreading in sync.
philiKON: When did off-thread sync crypto land? The crypto stuff has plenty of string passing through ctypes.
This sends a function made on one thread to another thread, which is completely unsafe :-/ (and got even unsafer when jorendorff turned off multithreaded objects).

93     this.cbThread.dispatch(new Runner(this.callback, this.thisObj,
94                                       returnval, error),
95                            Ci.nsIThread.DISPATCH_NORMAL);
By "red herring" I meant "almost assuredly the actual culprit", obviously. Those idioms are right next to each other. /me hides in shame
Blocks: 608109
No longer blocks: 608109
Depends on: 608109
(In reply to comment #82)
> This sends a function made on one thread to another thread, which is completely
> unsafe :-/ (and got even unsafer when jorendorff turned off multithreaded
> objects).
>
> 93     this.cbThread.dispatch(new Runner(this.callback, this.thisObj,
> 94                                       returnval, error),
> 95                            Ci.nsIThread.DISPATCH_NORMAL);

Not only did we seem to support this, it looks just like "good JS". If it were in a single-threaded-by-design framework, e.g. nodejs, it would be fine. But it is not fine. Advising people not to do this risks whispering into the wind. We need stronger checks and migration help, or MT wrappers (bug 566951).

/be
(In reply to comment #83) > By "red herring" I meant "almost assuredly the actual culprit", obviously. > Those idioms are right next to each other. > > /me hides in shame No worries, it is not easy to call these by sparse sampling of crashes and add-on correlations. /be
I filed a bug to fix the crypto stuff. I will file a second bug to stop this at the API level. That will hopefully reveal what we have to fix and stop the crashes.
(In reply to comment #86) > I filed a bug to fix the crypto stuff. I will file a second bug to stop this at > the API level. That will hopefully reveal what we have to fix and stop the > crashes. It will only reveal what we can dynamically cover with testing. That's not enough for our addons and others' addons. It is definitely not enough for XULRunner apps (all the dark matter; Le Monde's daily production system? dunno, don't want to have to worry). So, bug 566951. /be
I didn't say it's enough. It's a step up from "crash randomly in far away places in the JS engine in rope code".
Blocks: 608142
No longer blocks: 608142
Depends on: 608142
We've backed out off-thread crypto so hopefully we should see the number of these crashes go down starting with tomorrow's nightlies.
Thanks Philipp.
Backed out the serialization patch so we can see more clearly the effect of backing out background crypto: http://hg.mozilla.org/mozilla-central/rev/16888138cd07
Looking good so far. We got several crash reports this morning from builds before the off-thread crypto backout, but none from after.
This is awesome. I will dup this bug against 608142, which disallows sending closures across threads.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Crash Signature: [@ JSString::flatten() ] [@ JSString::flatten ]