Closed
Bug 1296630
Opened 8 years ago
Closed 4 years ago
Crash in arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState
Categories
(Core :: General, defect)
People
(Reporter: marcia, Assigned: dbaron, NeedInfo)
References
Details
(Keywords: crash, regression, Whiteboard: [platform-rel-Intel])
Crash Data
Attachments
(1 file, 1 obsolete file)
This bug was filed from the Socorro interface and is
report bp-a7026ba5-3925-447f-8d58-6e3542160819.
=============================================================
Seen while looking at Beta 4 crash stats: this is a new signature in B4, currently sitting at #15: http://bit.ly/2bHyzcs
Not sure where to bucket it. There may be a few other related crashes, such as the one right after it in the list: arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate
Updated•8 years ago
status-firefox50:
--- → affected
Comment 1•8 years ago
[Tracking Requested - why for this release]:
the scope of this issue might be bigger. this signature seems to be part of an intel cpu specific crash pattern (mentioned in the 2016-08-18 channel meeting) that started in 49.0b4 and unfortunately seems to continue in beta5 judging from early crash data there.
starting in 49.0b4 there was a whole range of new signatures showing up beginning with "arena..." that were coming from "GenuineIntel family 6 model 61 stepping 4 | 4" & "GenuineIntel family 6 model 61 stepping 4 | 2" devices: http://bit.ly/2b5z1mp
all in all they make up ~8% of all crashes in 49.0b4 and seem to happen on windows 7 and above but predominantly (60%) on win8.1.
these were the changes landing in 49.0b4: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE
Crash Signature: [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState] → [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate]…
tracking-firefox49:
--- → ?
Keywords: regression
Reporter
Updated•8 years ago
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit]
[@ arena_dalloc_small | je_free | _moz_pixman_region32_fini]
Comment 2•8 years ago
Not sure who could dig into this one...
Flags: needinfo?(milan)
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)
Hard to tell if this is a single problem, or a bunch of problems, and what grouping is the correct one, but given that it started in beta, it probably is the same cause.
Jet would look at FrameLayerBuilder::BuildContainerLayerFor ones, I'll find somebody to look at LayerManagerComposite::PostProcessLayers, there are a few other display item/display list related.
Looking at the list of changes in 49b4 (comment 1), there are a few media related changes, so it's probably worth somebody looking at those. The only other one that stands out is bug 1291016, which is probably fine, but to a casual observer a new, uninitialized variable got introduced, so... (:heycam instead of :jfkthame who's not around right now).
Flags: needinfo?(milan)
Flags: needinfo?(cam)
Flags: needinfo?(ajones)
High volume group of possibly related crashes, new on beta 4, let's call this a blocker for 49.
tracking-firefox50:
--- → ?
Comment 5•8 years ago
I don't know anything about layout code, sorry.
Flags: needinfo?(continuation)
If this is an OOM condition then it is probably related to bug 1296453.
Comment 7•8 years ago
(In reply to Milan Sreckovic [:milan] from comment #3)
> The only other one that stands out is bug 1291016, which is probably fine, but to a
> casual observer a new, uninitialized variable got introduced, so... (:heycam
> instead of :jfkthame who's not around right now).
Following up over there.
Flags: needinfo?(cam)
Comment 9•8 years ago
the crash spike issue has disappeared again in 49.0b6.
in beta 5 crashes coming from "GenuineIntel family 6 model 61 stepping 4" devices generally made up 8.2% of the whole crashing volume, in beta 6 those are back to a "normal" level of 1.2% of all crashes.
we should probably wait and see how beta 7 is performing before considering this solved/untracking it though...
Comment 10•8 years ago
(In reply to David Bolter [:davidb] from comment #8)
> Daniel, and thoughts on this one?
Looks like the backtrace is in layers code, which I'm not super-familiar with. kats or mattwoodrow would perhaps be able to offer more useful opinions/thoughts than I can. (Comment 9 is encouraging, though; maybe this is fixed? I guess we'll see.)
(Side note, following up on comment 3 / comment 7: jfkthame says over in bug 1291016 that he doesn't think it's connected to this bug.)
Updated•8 years ago
Flags: needinfo?(dholbert)
Comment 11•8 years ago
This is a hashtable being torn down inside of ~ContainerState(). ContainerState owns two hash tables, and this could be either one:
> nsTHashtable<nsRefPtrHashKey<PaintedLayer>> mPaintedLayersAvailableForRecycling;
...and:
> nsDataHashtable<nsGenericHashKey<MaskLayerKey>, RefPtr<ImageLayer>>
> mRecycledMaskImageLayers;
https://dxr.mozilla.org/mozilla-central/rev/01748a2b1a463f24efd9cd8abad9ccfd76b037b8/layout/base/FrameLayerBuilder.cpp#1396-1423
We might be putting something bogus in one of those hashtables, and then crashing when the hashtable gets destroyed, or something... mstange & dvander have "hg blame" for each of those hashtable declarations, so one of them might be a good person to take a look at this, too, if we discover that it's not fixed as hoped in comment 9. [CC'ing them]
I don't see any playback commits in the regression range in comment 1.
Flags: needinfo?(ajones)
Comment 13•8 years ago
the crash level of this cpu family still looks normal in 49.0b7, so i think we can close this bug.
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox49:
affected → ---
status-firefox50:
affected → ---
tracking-firefox50:
? → ---
Resolution: --- → WORKSFORME
Comment 14•8 years ago
the issue is back again in 49.0b8. i don't understand what's going on :-(
Comment 15•8 years ago
Bug 1294193 is another bug where there's a strong correlation to "GenuineIntel family 6 model 61 stepping 4 | 4".
Comment 16•8 years ago
For "arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState":
(91.30% in signature vs 03.81% overall) address = 0xffffffffffffffff
(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(47.83% in signature vs 02.74% overall) build_id = 20160814184416
(43.48% in signature vs 08.58% overall) platform_pretty_version = Windows 8.1
(43.48% in signature vs 08.58% overall) platform_version = 6.3.9600
(26.09% in signature vs 03.73% overall) bios_manufacturer = Insyde
(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
We seem to be talking about a heap corruption scenario. I imagine this is some underlying problem that the changes on beta are tickling into higher frequency, rather than something actually caused by the changes between beta 3 and beta 4, or between beta 7 and beta 8.
The first set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE
The second set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b7_RELEASE&tochange=FIREFOX_49_0b8_RELEASE
Grasping at straws, there are audio-related things in both, but that's weak.
The release that got better (beta 6, see comment 9) contains a fix for bug 1293985, with the "...PLDHashTable::Iterator can't handle modifications while iterating...", so that's starting to look interesting.
Mats, thoughts on this bug? Since your patch in bug 1293985 correlates with things getting better (and then they got worse afterwards), and we're crashing in the PLDHashTable destructor, I thought you may have some insight.
Flags: needinfo?(mats)
Comment 20•8 years ago
Comment 21•8 years ago
The changes in bug 1292856 may also affect this one if Layer rendering is affected. Jamie: can you chime in on what that patch can change related to memory allocation?
Flags: needinfo?(bugs) → needinfo?(jnicol)
Comment 22•8 years ago
Any correlation with the changes in bug 1293985 seems coincidental to me.
The PLDHashTable object there is different from this one.
The correlations in comment 16 seem unusually strong, so it might be
worth finding hardware that matches:
>(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
>(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
and install:
>(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
and test with some of the URLs from crash-stats...
If it's at least semi-reproducible it might be possible to fix.
Flags: needinfo?(mats)
Comment 23•8 years ago
(In reply to Jet Villegas (:jet) from comment #21)
> The changes in bug 1292856 may also affect this one if Layer rendering is
> affected. Jamie: can you chime in on what that patch can change related to
> memory allocation?
That change will in some cases make us *avoid* making a large allocation.
Flags: needinfo?(jnicol)
Updated•8 years ago
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit]
[@ arena_dalloc_small | je_free | _moz_pixman_region32_fini] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit]
[@ arena_dalloc_small | je_free | _moz_pixman_region32_fini]
[@ arena_run_tree_insert | arena_dalloc_small | je_free | mozilla::UniquePtr<T>::reset ]
[@ arena_dalloc_small | je_free | nsTArray_bas…
Comment 24•8 years ago
After looking at the "cpu info" correlation for the signatures here, I'm leaning
towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
Many of the signatures in this bug have a 100% correlation to
"GenuineIntel family 6 model 61 stepping 4".
dbaron, what do you think about that theory? (given your experience with
the AMD bug)
Flags: needinfo?(dbaron)
Let's keep an eye on this next week and see what Intel says.
Assignee
Comment 26•8 years ago
(In reply to Mats Palmgren (:mats) from comment #24)
> After looking at the "cpu info" correlation for the signatures here, I'm
> leaning
> towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
> Many of signatures in this bug have a 100% correlation to
> "GenuineIntel family 6 model 61 stepping 4".
They all do.
But they also seem to have graphics hardware in common. (e.g., "adapter device id" is mostly 0x1616 with a bit of 0x1606 and a few stragglers; "adapter vendor id" is nearly always 0x8086, i.e. Intel, and the 0x1616 device is apparently Intel(R) HD Graphics 5500)
I think if you want to claim it's a hardware bug you need to make a much stronger case.
And even if it is, we should still be working to figure out how to fix it. There was obviously something that made it start happening, so we should figure out what that was and undo it if possible.
Flags: needinfo?(dbaron)
Assignee
Comment 27•8 years ago
Possibly of interest:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&product=Firefox&version=49.0b
which is the topcrash list for 49.0 betas for this CPU model only.
Comment 28•8 years ago
> But they also seem to have graphics hardware in common.
I assumed this was because this chip comes with a builtin GPU.
FWIW, here are a few matches I get from Google on the cpu info string:
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y71 CPU @ 1.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y51 CPU @ 1.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y31 CPU @ 0.90GHz [x86 Family 6 Model 61 Stepping 4]
Which are all Broadwell, with varying GPUs (source: http://www.cpu-world.com/ )
Assignee
Comment 29•8 years ago
I think the [@ sse2_blt] crashes looked interesting at first because they have a similar CPU pattern and predominance on the 49 betas, except they spiked in 49.0b7 and 49.0b8 rather than spiking in 49.0b4, b5, and b8 like many (I didn't check all) of the others.
Assignee
Updated•8 years ago
Assignee
Comment 30•8 years ago
A few other observations:
* a pretty big portion of the urls reported in crash-stats are either the facebook homepage or youtube videos. This makes it seem like there's a decent chance that these crashes usually or always occur during video playback
* a decent portion of the crash reports (half of the ones I sampled?) have the cubeb audio thread being one of the threads contending for the malloc lock at the time the main thread crashes in the allocator (at one of 2 different stacks). e.g., bp-a68d7849-7be0-4058-9697-842d82160903 thread 65, or bp-4d11b27f-9f69-4cd5-8256-1819d2160903 thread 119). I suspect this wouldn't be the case if audio weren't playing at the time of the crash. (I *suspect* the contention is because those threads keep running a little bit after the crash happens, and then crash reporting starts, on the main thread, and thus essentially keep running until they hit a lock that they need to acquire. I'm not really sure about this, though, i.e., about how long other threads would keep running when one thread crashes.)
Assignee
Comment 31•8 years ago
Though one other thought. I looked at the minidump for bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the instruction:
71B248FB 3B C1 cmp eax,ecx
I just don't see how that instruction can yield EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff, especially when eax and ecx are both 0x01500228.
Updated•8 years ago
platform-rel: --- → ?
Whiteboard: [platform-rel-Intel]
Comment 32•8 years ago
49 beta 10 still looks affected, judging from early crash data there.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #30)
> ... I suspect this
> wouldn't be the case if audio weren't playing at the time of the crash.
We have audio related patches landing in releases where we could observe a difference in the crash rates (good and bad), so it could be the changes in timing getting us in trouble.
Marking this as a blocker for 49 as it seems very high volume for 49.
Comment 35•8 years ago
The audio playback aspect of this makes me wonder if it's related to bug 1255737. I believe that the speculation there was related to drivers causing audio shutdown badness. https://hg.mozilla.org/releases/mozilla-beta/rev/ab7b68014a1e would have shipped in 49b4.
Comment 36•8 years ago
just to sum up again the impact from this bug that we have seen so far this beta cycle:
beta 1: unaffected
beta 2: unaffected
beta 3: unaffected
beta 4: affected
beta 5: affected
beta 6: unaffected
beta 7: unaffected
beta 8: affected
(beta 9 not released)
beta 10: affected
Comment 37•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought. I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1 cmp eax,ecx
>
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.
This reminds me of bug 1034706 comment 44, although that was AMD specific.
By the way, we had a similar situation with AMD for 48 Beta (see bug 1290419).
Some builds were affected, some were not. It might depend on the compiler.
Comment 38•8 years ago
(In reply to [:philipp] from comment #36)
> just to sum up again the impact from this bug that we have seen so far this
> beta cycle:
> beta 1: unaffected
> beta 2: unaffected
> beta 3: unaffected
> beta 4: affected
> beta 5: affected
> beta 6: unaffected
> beta 7: unaffected
> beta 8: affected
> (beta 9 not released)
> beta 10: affected
Well, that doesn't seem to match the bug 1255737 in/out pattern. We had AsyncShutdownBlocked (causing shutdown hangs in the field) in:
beta 1, 2, 3, 7, 8, 9, 10
We weren't blocking asyncshutdown in:
beta 4, 5, 6
Comment 39•8 years ago
Pasting a response from Adam Moloniewicz, Intel (with his permission):
"As this looks like a heap corruption – have you tried to run the app with Application Verifier engaged ? Especially with Memory and Heap options enabled. You could also use Intel inspector(Intel studio) to analyze memory/threading anomalies. Other ideas that come to my mind could be to force internal SW modules to use isolated heaps instead of using only the global one(which is usually the common case). I’m not aware of the internal architecture so it’s hard to come up with particular ideas but perhaps some custom C++ memory allocator would do the job. This way we could narrow down the root cause.
So far I don’t see any strong evidence that would indicate the graphics UMD modules are a culprit, though indeed, they utilize the process global heap. So not sure how to assist you. Is there any easy way to disable HW rendering acceleration so that the rendering would fall back to WARP renderer instead of HW?"
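For reference, the "isolated heaps" idea Adam mentions maps onto the plain Win32 private-heap API. A minimal sketch of what giving one suspect subsystem its own heap could look like (illustrative only, nothing Firefox-specific, and not a proposal for the actual allocator):

#include <windows.h>

// Sketch of an isolated (private) heap, per the suggestion above. All names
// here are made up for illustration; nothing below comes from the tree.
class IsolatedHeap {
 public:
  IsolatedHeap() : mHeap(HeapCreate(0, 0, 0)) {}   // growable private heap
  ~IsolatedHeap() { if (mHeap) HeapDestroy(mHeap); }

  void* Alloc(size_t aSize) {
    return mHeap ? HeapAlloc(mHeap, HEAP_ZERO_MEMORY, aSize) : nullptr;
  }
  void Free(void* aPtr) {
    if (mHeap && aPtr) HeapFree(mHeap, 0, aPtr);
  }

 private:
  HANDLE mHeap;  // corruption of other heaps can't land here, and vice versa
};

The value of such a split is purely diagnostic: if the corruption stops following a given subsystem once its allocations are isolated, that narrows down who is scribbling on the global heap.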
Useful query from Marco that shows RC1 does not look affected at the same volume. We can use this to check RC2 as well once we release it.
https://crash-stats.mozilla.com/search/?signature=%5Earena_dalloc_small%20%7C%20je_free%20%7C&product=Firefox&_sort=-date&_facets=signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version
Comment 41•8 years ago
Links for the tools I mentioned in comment 39:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff538115(v=vs.85).aspx
https://software.intel.com/en-us/intel-system-studio
Isolated heaps info:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366599(v=vs.85).aspx
Comment 42•8 years ago
Assigning to dbaron who agreed to own this exceptionally complex issue...
Assignee: nobody → dbaron
Updated•8 years ago
platform-rel: ? → +
Assignee
Comment 43•8 years ago
One other point of interest is that crashes on nightly first showed up in build 2016-06-18, although they weren't frequent enough to happen in every build following that:
https://crash-stats.mozilla.com/search/?date=%3E2016-05-01&cpu_info=GenuineIntel%20family%206%20model%2061%20stepping%204&release_channel=nightly&signature=free_impl&signature=je_free&product=Firefox&_sort=build_id&_sort=-date&_facets=signature&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=platform_pretty_version&_columns=install_time&_columns=url#crash-reports
Assignee
Comment 44•8 years ago
Assignee
Comment 45•8 years ago
I looked through a bunch of crashes with cpearce. One interesting point he noticed is that at least some of the ones on youtube were using VP9, which wouldn't go through DXVA, making DXVA a less likely culprit. (Another thing making DXVA less likely is that it should be writing to graphics memory.)
On the other hand, he's suspicious of cubeb's writing to audio buffers.
Assignee
Comment 46•8 years ago
I think the right CPUs include:
http://ark.intel.com/products/85213/Intel-Core-i5-5300U-Processor-3M-Cache-up-to-2_90-GHz
and maybe also:
http://ark.intel.com/products/85212/Intel-Core-i5-5200U-Processor-3M-Cache-up-to-2_70-GHz
based on the graphics device ID being 0x1616.
Assignee
Comment 47•8 years ago
And one other point is that the machines in question do seem to come from multiple manufacturers, given:
Rank Bios manufacturer Count %
1 Dell Inc. 4480 33.43 %
2 Insyde 2504 18.69 %
3 American Megatrends Inc. 2336 17.43 %
4 Hewlett-Packard 1792 13.37 %
5 LENOVO 1244 9.28 %
6 Insyde Corp. 721 5.38 %
7 INSYDE Corp. 129 0.96 %
8 Lenovo 91 0.68 %
9 TOSHIBA 37 0.28 %
10 Intel Corporation 30 0.22 %
from the query https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&product=Firefox&version=49.0b&_sort=-date&_facets=signature&_facets=bios_manufacturer&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=uptime&_columns=app_notes&_columns=graphics_critical_error#facet-bios_manufacturer
Comment 48•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought. I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1 cmp eax,ecx
>
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.
I see this in all the arena_dalloc_small crashes that I looked at.
The sse2_blt crashes are equally weird, crashing on a 'mov edi,ecx' instruction with a write access violation.
Neither of these instructions accesses memory, so an access violation sounds impossible.
Either the EIP value in the minidump is incorrect (but only for this CPU?), or we're hitting a CPU bug. I can't think of any other way this could be possible.
I've had a look at the errata for the 5th gen Intel CPUs; nothing stands out as being this, but there are a lot, so I could easily have missed it.
Assignee
Comment 49•8 years ago
More bizarre data from the minidump. I took a look at 9 minidumps (5 for one signature all from beta 10, 4 for a different signature all from beta 10, and also 1 from the first signature but for beta 8):
In all cases:
* The upper short of EAX, EBX, ECX, and EDI was always the same for a given minidump, but varied between minidumps (0080, 0050, 0080, 0060, 00e0, 0060, 00b0, 0100, 0120, 0080)
* The lower short of those 4 registers was always: AX: 0228, BX: 0220, CX: 0228, DI: 0040
* EBP was always 00000040
* EFLAGS was always either 00210246 or 00010246 (which is consistent with having executed a CMP between two equal values, which is allegedly the instruction we crash on)
* EDX and ESI looked like pointers to the same area of memory, though differing by a decent amount. Presumably heap pointers.
* ESP and EIP looked like pointers to other (different from each other and from EDX/ESI) areas of memory
* EIP always has the low short the same for a given build, although differing slightly between beta 8 and beta 10
ted, does something like this ring any bells?
Flags: needinfo?(ted)
Comment 50•8 years ago
(In reply to David Bolter [:davidb] from comment #41)
> Links for the tools I mentioned in comment 39:
> https://msdn.microsoft.com/en-us/library/windows/hardware/ff538115(v=vs.85).
> aspx
Likely you all have this installed already as part of the SDK
> https://software.intel.com/en-us/intel-system-studio
Available on all platforms, but starts at $699
> Isolated heaps info:
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa366599(v=vs.85).
> aspx
Comment 51•8 years ago
We should feed dbaron's analysis from comment 49 back to the intel guy as well (and comment 46 and comment 48, or a link to this). (Perhaps once ted weighs in)
(In reply to Randell Jesup [:jesup] from comment #51)
> We should feed dbaron's analysis from comment 49 back to the intel guy as
> well (and comment 46 and comment 48, or a link to this). (Perhaps once ted
> weighs in)
Joe from Intel is on this bug - however I don't believe that Adam is a user. I'll see if we can get him signed up via the ML.
Joe - any others that we should be adding here?
Flags: needinfo?(joseph.k.olivas)
I'm testing on one of these - any insight if there is a usage pattern? Right now, I'm doing videos and general browsing.
Comment 54•8 years ago
(In reply to Desigan Chinniah [:cyberdees] [:dees] [London - GMT] from comment #52)
> Joe - any others that we should be adding here?
I am following this bug closely and feeding back to some people internally here. I can be the main point of contact.
Flags: needinfo?(joseph.k.olivas)
Assignee
Comment 55•8 years ago
Inspired by bug 1300233 and with some help from Aryx and froydnj on IRC, I should point out that thanks to the combination of bug 1259782 and bug 1270664, Firefox 49 is the first release we're shipping with Visual Studio 2015 rather than 2013. This upgrade also involved using SSE instructions in generated code, since that can't be turned off in 2015.
So that seems like it could be related to the problems we're seeing here.
I'd still be interested to know more details about what sorts of problems were present in this CPU revision that may have been fixed in microcode updates.
Comment 57•8 years ago
All versions of Firefox 49 and newer are currently building on Visual Studio 2015 Update 2. VS2015u3 is out but not deployed (bug 1283203 tracks). Someone may want to comb the release notes for VS2015u3 to see if it fixes anything that could be related to this crash. Also, if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.
Comment 58•8 years ago
(In reply to Gregory Szorc [:gps] from comment #57)
> Someone may want to comb the release notes for VS2015u3 to see if it fixes
> anything that could be related to this crash.
The first item in VC++ fixes looks interesting:
https://www.visualstudio.com/news/releasenotes/vs2015-update3-vs#visualcpp
"We now check the access of a deleted trivial copy/move ctor. Without the check, we may incorrectly call the defaulted copy ctor (in which the implementation can be ill-formed) and cause potential runtime bad code generation."
Comment 59•8 years ago
(In reply to Gregory Szorc [:gps] from comment #57)
> if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.
The bug fix in comment 58 sounds like we should deploy this upgrade on m-c.
Comment 60•8 years ago
If you upgrade to VS2015u3 you might also want to apply KB3165756. Fixes at least one possible compiler bug:
https://msdn.microsoft.com/en-us/library/mt752379.aspx
>> Issue 3
>> Potential miscompilation of code-calling functions that resemble std::min/std::max on
>> floating point values.
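For context, the shape of code that issue describes is as small as this (purely illustrative; no claim that any particular Firefox function was affected):

// A helper "resembling std::min/std::max on floating point values" -- the
// pattern the KB says the unpatched compiler could miscompile.
static inline double MinOf(double a, double b) {
  return (a < b) ? a : b;
}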
Assignee
Comment 61•8 years ago
One other piece of data from the query in comment 44:
Rank E10s cohort Count %
1 disqualified 7587 72.98 %
2 control 5509 52.99 %
3 test 4853 46.68 %
4 addons 255 2.45 %
5 set2a 255 2.45 %
6 optedout 26 0.25 %
So it seems to happen both with and without e10s.
Assignee
Comment 62•8 years ago
Current status is that while I have a laptop with one of the CPU models in question, as does Milan, neither of us has been able to reproduce the crash.
It's possible that the crash is specific to microcode version. We're hoping to get that data added to crash-stats soon so we can tell. The one user who we've been able to contact has version 0x18, while Milan has 0x1D and I have 0x21.
I don't know how to boot with an older version of the microcode than the one that's used by default (which I believe comes from the BIOS). There are older (and newer) versions made available for use by Linux distros (0x18 is available as part of https://downloadcenter.intel.com/download/24661/Linux-Processor-Microcode-Data-File ), but I don't *think* those are usable with Windows in any way, unless I could somehow use part of the Linux boot process (e.g., grub) to load it and then boot into Windows. But I think the part of the Linux boot process that loads it happens later, via the kernel, based on reading the manual for iucode-tool(8).
And I'm not even sure whether switching to a different microcode version would help, or whether playing lots of youtube and other videos is really the right way to reproduce the crash.
Assignee
Comment 63•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) from comment #62)
> And I'm not even sure if switching to a different microcode version would
> help
Oh, but one reason to think it would is that the crashes don't occur on Windows 10, and I *believe* Windows 10 loads a more recent version of the microcode.
David, do you see the extra stuff in the app notes if you build with this patch?
Comment 65•8 years ago
Confirmed that attachment 8790490 [details] [diff] [review] adds CpuRevisionStatus in App Notes.
https://crash-stats.mozilla.com/report/index/eea965fd-bad2-4b7b-b888-24a372160913
Milan asked me to do a try build for Windows, so that we can pass the build to others to test.
With patch on m-c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3bdc76b4200
With patch on beta
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c6a66c5db3b
Comment 66•8 years ago
I am not sure if the following is related to this bug.
In gecko, hundreds of HTMLMediaElements and MediaDecoders can pile up even when the JS side uses only one media element at a time. This can be confirmed with the following URL from bug 1155000, which uses mp4 videos for testing.
http://people.mozilla.org/~kbrosnan/tmp/1155000/video-memory-test.html
I think our RC3 build has avoided this crash. Marking 49 as no longer blocked.
Comment 69•8 years ago
(In reply to David Bolter [:davidb] from comment #68)
> How is RC4 looking?
good so far in regards to this bug. there's no higher crash volume from systems with a "GenuineIntel family 6 model 61 stepping 4" cpu than normal.
Looking good. You can compare to beta 10 here, https://mozilla.github.io/stab-crashes/compare-betas.html?beta1=49.0b&beta2=49.0b99
Flags: needinfo?(lhenry)
Comment 71•8 years ago
(76.66% in signatures vs 02.28% overall) address = 0xffffffffffffffff
(70.46% in signatures vs 02.52% overall) adapter_device_id = 0x1616
(70.31% in signatures vs 02.52% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(97.85% in signatures vs 37.76% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(49.93% in signatures vs 08.79% overall) platform_version = 6.3.9600
(49.93% in signatures vs 08.79% overall) platform_pretty_version = Windows 8.1
(46.60% in signatures vs 05.83% overall) build_id = 20160829102229
(36.41% in signatures vs 05.47% overall) has dual GPUs = true
(33.01% in signatures vs 03.01% overall) Addon "Kaspersky Protection" = true
(95.49% in signatures vs 68.52% overall) adapter_vendor_id = 0x8086
(33.97% in signatures vs 10.23% overall) bios_manufacturer = Dell Inc.
(30.13% in signatures vs 08.52% overall) GFX_ERROR "Failed 2 buffer db="
(20.61% in signatures vs 03.20% overall) bios_manufacturer = Insyde
Where 'signatures' is every signature starting with 'arena_dalloc_small | je_free'.
Perhaps installing the "Kaspersky Protection" (light_plugin_ACF0E80077C511E59DED005056C00008@kaspersky.com)
addon might help in reproducing the issue?
Updated•8 years ago
Updated•8 years ago
Crash Signature: mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ] → mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ]
[@ arena_dalloc_small | je_free | nsT…
Updated•8 years ago
Crash Signature: mozilla::detail::RunnableMethodImpl<T>::`scalar deleting destructor'' ]
[@ arena_dalloc_small | je_free | nsCSSValue::DoReset ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayList… → nsCSSValue::DoReset ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ]
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report
Maybe we land this, even in the wrong place, as it should be easy to uplift and sounds like we have more instances of the problem showing up.
Attachment #8790490 -
Flags: review?(dvander)
Comment 74•8 years ago
|
||
We could make it an annotation (bug 1305120), so it would be easier to use with Socorro and SuperSearch.
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report
Review of attachment 8790490 [details] [diff] [review]:
-----------------------------------------------------------------
Is there any reason this can't be in gfxWindowsPlatform?
::: gfx/thebes/gfxPlatform.cpp
@@ +682,5 @@
> + }
> +
> + if (cpuUpdateRevision > 0) {
> + nsAutoCString revAndStatus;
> + revAndStatus.AppendPrintf("CpuRevisionStatus(0x%x:0x%x) ",
nit: can use nsPrintfCString here
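Roughly what that nit is asking for, sketched below. Only cpuUpdateRevision is visible in the quoted hunk, so the aAppNotes and cpuUpdateStatus names are placeholders, and this assumes #include "nsPrintfCString.h":

// Sketch of the suggestion, not the actual patch:
if (cpuUpdateRevision > 0) {
  aAppNotes.Append(nsPrintfCString("CpuRevisionStatus(0x%x:0x%x) ",
                                   cpuUpdateRevision, cpuUpdateStatus));
}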
Attachment #8790490 -
Flags: review?(dvander) → review+
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report
The patch in bug 1305120 (same code, different place) is probably more appropriate - separate annotation field and in a better file.
Attachment #8790490 -
Attachment is obsolete: true
Comment 77•8 years ago
Looks like in 50.0b3 we have another signature (jemalloc_crash) strongly correlated with cpu_info = GenuineIntel family 6 model 61 stepping 4.
Crash Signature: nsCSSValue::DoReset ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ] → nsCSSValue::DoReset ]
[@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ]
[@ jemalloc_crash ]
Updated•8 years ago
Comment 78•8 years ago
Interestingly, so far the 'jemalloc_crash' signature (strongly correlated with Intel CPUs) is gone in 50.0b4, at the same time as js::NativeObject::setSlotWithType (bug 1307285), which is strongly correlated with AMD CPUs.
Updated•8 years ago
Assignee
Comment 79•8 years ago
This query might be useful for finding crashes with microcode info:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&build_id=%3E%3D20161007000000&product=Firefox&date=%3E%3D2016-10-06T20%3A29%3A00.000Z&date=%3C2016-10-13T20%3A29%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=cpu_microcode_version#crash-reports
but so far none of the crashes in that query are crashes that are clearly associated with this bug.
Updated•8 years ago
Rank: 1
Comment 80•8 years ago
50.0b7 is affected by this bug again, these are the microcode facets:
https://crash-stats.mozilla.com/search/?signature=^arena&cpu_info=^GenuineIntel family 6 model 61 stepping 4&version=50.0b7&product=Firefox&process_type=browser&date=>2016-10-14&_sort=-date&_facets=signature&_facets=platform_pretty_version&_facets=cpu_info&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=platform_pretty_version&_columns=cpu_microcode_version#facet-cpu_microcode_version
Comment 81•8 years ago
linkified: http://bit.ly/2e92Qz4
Assignee
Comment 82•8 years ago
So based on comparing this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&version=50.0b7&product=Firefox&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the 50.0b7 crashes that are *mostly* this bug, with this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=OOM%20%7C%20small&product=Firefox&version=50.0b&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the OOM | small crashes on the affected CPU family, it seems reasonably clear that:
These microcode versions are affected:
0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19
These microcode versions are not affected:
0x1d, 0x1f, 0x21, 0x22
Comment 83•8 years ago
This would *seem* to imply that revs before 0x1d (and 0xe or above, roughly) are the ones affected.
Joe, any thoughts?
Flags: needinfo?(joseph.k.olivas)
Comment 84•8 years ago
Just so I understand what's going on:
OOM is used just to basically get what versions are out there (everyone hits OOM), while the other shows which ones are hitting this crash.
Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad-crash query, but at very low numbers.
I'll leave ni open for now.
Assignee
Comment 85•8 years ago
Yes, I was using small out-of-memory crashes to try to establish a baseline distribution of the microcode versions among our users. (The OOM crashes are smaller numbers than this crash, but I think big enough to get a usable baseline.)
Assignee
Comment 86•8 years ago
(In reply to Joe Olivas from comment #84)
> Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad crash, but very low
> numbers.
Yes -- I think the problem there is that the query isn't perfect. It's a query on signature substring, which catches some other crashes. Those looked like they had very different patterns of signatures than the ones this bug covers, so I didn't actually go through and check the minidumps to verify that they were not showing the patterns in comment 31 and comment 49.
Comment 87•8 years ago
until we have a proper fix in place for this problem, could we do something similar to what we did in the whole websense saga and have those installations with the known affected intel cpu+microcode versions identify themselves in the update ping?
so in case of an emergency (like a dot release on the firefox release channel that is affected by this bug, because we cannot first test it with a wider beta audience like we do with rc builds) we would at least be able to retroactively disable automatic updates just for those crashing configurations...
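for illustration, the predicate such an annotation would need could be as small as this (a sketch based on the revision lists in comment 82; nothing like it exists in the tree as of this bug):

#include <cstdint>

// Sketch only: flag Broadwell (family 6, model 61, stepping 4) machines running
// one of the microcode revisions that comment 82 lists as affected.
static bool IsKnownBadMicrocode(uint32_t family, uint32_t model,
                                uint32_t stepping, uint32_t microcodeRev) {
  if (family != 6 || model != 61 || stepping != 4) {
    return false;
  }
  switch (microcodeRev) {
    case 0x0e: case 0x11: case 0x12: case 0x13:
    case 0x16: case 0x18: case 0x19:
      return true;   // revisions correlated with these crashes (comment 82)
    default:
      return false;  // 0x1d, 0x1f, 0x21, 0x22 have not shown the crashes
  }
}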
Assignee
Updated•8 years ago
Assignee
Comment 88•8 years ago
I filed bug 1311515 on comment 87.
Updated•8 years ago
(In reply to David Baron :dbaron: ⌚️UTC+8 from comment #82)
>...
>
> These microcode versions are affected:
> 0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19
These are still the only ones showing.
>
> These microcode versions are not affected:
> 0x1d, 0x1f, 0x21, 0x22
These still haven't shown.
Too late to fix in 50.1.0 release
Comment 91•8 years ago
Are these signatures still showing up in newer releases?
Reporter
Comment 92•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #91)
> Are these signatures still showing up newer releases?
A cursory manual look did not show anything later than 50, but I will needinfo Marco to answer for sure.
Flags: needinfo?(mozillamarcia.knous) → needinfo?(mcastelluccio)
Comment 93•8 years ago
50, 51.0b up to 12, 52.0a2 and 53.0a1 are currently unaffected, but the signatures are build-dependent so they might reappear in the future.
Flags: needinfo?(mcastelluccio)
Updated•8 years ago
platform-rel: + → -
Comment 94•8 years ago
Closing because no crashes reported for 12 weeks.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 4 years ago
Resolution: --- → WORKSFORME