Closed
Bug 1281759
(amdbug)
Opened 8 years ago
Closed 4 years ago
Work around mysterious AMD JIT crashes
Categories
(Core :: JavaScript Engine: JIT, defect, P3)
Core
JavaScript Engine: JIT
Tracking
()
People
(Reporter: jandem, Assigned: jandem)
References
(Depends on 1 open bug, Blocks 3 open bugs)
Details
(Keywords: crash, Whiteboard: [platform-rel-AMD])
Attachments
(2 files)
(deleted),
patch
|
sunfish
:
review+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jandem
:
review+
ritu
:
approval-mozilla-aurora+
ritu
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
See the AMD crashes in bug 1034706 comment 44.
This still happens on 47. I want to add some nops to the Baseline stub where we're crashing (we can probably use cpuid to restrict it to these AMD CPUs) to see if that makes it go away. Info from AMD suggests it might help.
Assignee | ||
Comment 1•8 years ago
|
||
This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and model is 0-2.
According to AMD engineers, limiting the number of branches per cache line might help, so I'm hopeful this will work. If this patch doesn't help, we can try more or larger NOPs...
Attachment #8764587 -
Flags: review?(sunfish)
Comment 2•8 years ago
|
||
Comment on attachment 8764587 [details] [diff] [review]
Patch
Review of attachment 8764587 [details] [diff] [review]:
-----------------------------------------------------------------
Yikes. Patch looks good though.
Attachment #8764587 -
Flags: review?(sunfish) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae024662143f
Try to work around mysterious AMD crashes. r=sunfish
Comment 4•8 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(jdemooij)
Comment 5•8 years ago
|
||
> This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and
> model is 0-2.
Do we have confirmation from AMD that these models are the only ones affected? My analysis in bug 1034706 comment 63 suggests that these families account for only a fraction of suspicious JIT crashes on AMD CPUs.
Comment 6•8 years ago
|
||
Also, do you have an analysis plan to determine if this change helps in practice?
Assignee | ||
Comment 7•8 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Do we have confirmation from AMD that these models are the only ones
> affected? My analysis in bug 1034706 comment 63 suggests that these families
> account for only a fraction of suspicious JIT crashes on AMD CPUs.
No confirmation, but I haven't seen *these* crashes with other (AMD) CPUs. I can take another look at some of the other crashes though.
(In reply to Nicholas Nethercote [:njn] from comment #6)
> Also, do you have an analysis plan to determine if this change helps in
> practice?
I'll check if there's any difference on Nightly later today. Unfortunately not that many users are on Nightly and most probably don't use these CPUs, so we'll likely have to backport to see if it helps...
Updated•8 years ago
|
Assignee | ||
Comment 9•8 years ago
|
||
(In reply to Marco Castelluccio [:marco] (PTO until August 9) from comment #8)
> This might fix bug 1290388.
I don't think so; I moved it to the See Also list :)
No longer blocks: 1290388
Flags: needinfo?(jdemooij)
Assignee | ||
Comment 10•8 years ago
|
||
Comment on attachment 8764587 [details] [diff] [review]
Patch
This patch attempts to work around JIT code crashes on certain AMD CPUs. It inserts some NOP instructions on this particular CPU. It should be safe to backport.
Approval Request Comment
[Feature/regressing bug #]: -
[User impact if declined]: Potential crashes with certain AMD CPUs
[Describe test coverage new/current, TreeHerder]: On m-c and aurora for a while. I haven't seen these crashes on builds with this patch applied, but it's hard to say much because the crashes don't seem to affect all builds.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8764587 -
Flags: approval-mozilla-beta?
Comment 11•8 years ago
|
||
Comment on attachment 8764587 [details] [diff] [review]
Patch
Let's take it to try to fix the famous issue.
Should be in 49 beta 3
Attachment #8764587 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 12•8 years ago
|
||
bugherder uplift |
status-firefox49:
--- → fixed
Looks like we have some new crashes on beta 3 described in bug 1294963. If you think it's from this, can you back out your changes or submit a fix? Thanks!
Flags: needinfo?(jdemooij)
On 2nd look, this does seem to have caused the #1 topcrash on beta 3. Wes, can you back it out before the beta 5 build?
Assignee | ||
Comment 15•8 years ago
|
||
I'll take a closer look, but it's *extremely unlikely* this caused these crashes.
What makes us think this is related to bug 1294963? The AMD bug shows up in some builds but not others. Just because this patch fixes one instance of this bug by inserting NOP instructions, doesn't mean it caused a random bug in entirely unrelated code...
Assignee | ||
Comment 16•8 years ago
|
||
This workaround probably didn't work. I still see similar crashes like this one:
https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815
We did correctly identify the CPU and inserted NOP instructions, but apparently that was not enough to stop the crashes. I'll check the errata and conversations with AMD tomorrow to see if there's something I missed... I'll also check if there's any difference in frequency.
006a4530 83 f9 8c cmp ecx,0xffffff8c <- crash here
006a4533 0f 85 16 00 00 00 jne 0x006a454f
006a4539 0f 1f 40 00 nop DWORD PTR [eax+0x0]
006a453d 39 57 10 cmp DWORD PTR [edi+0x10],edx
006a4540 0f 85 09 00 00 00 jne 0x006a454f
006a4546 0f 1f 40 00 nop DWORD PTR [eax+0x0]
006a454a c3 ret
006a454b 0f 1f 40 00 nop DWORD PTR [eax+0x0]
006a454f 8b 7f 04 mov edi,DWORD PTR [edi+0x4]
006a4552 ff 27 jmp DWORD PTR [edi]
Flags: needinfo?(jdemooij)
Assignee | ||
Updated•8 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•8 years ago
|
Priority: -- → P3
Comment 17•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #16)
> This workaround probably didn't work. I still see similar crashes like this
> one:
>
> https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-
> e96e22160815
>
> We did correctly identify the CPU and inserted NOP instructions, but
> apparently that was not enough to stop the crashes. I'll check the errata
> and conversations with AMD tomorrow to see if there's something I missed...
> I'll also check if there's any difference in frequency.
>
> 006a4530 83 f9 8c cmp ecx,0xffffff8c <- crash here
> 006a4533 0f 85 16 00 00 00 jne 0x006a454f
> 006a4539 0f 1f 40 00 nop DWORD PTR [eax+0x0]
> 006a453d 39 57 10 cmp DWORD PTR [edi+0x10],edx
> 006a4540 0f 85 09 00 00 00 jne 0x006a454f
> 006a4546 0f 1f 40 00 nop DWORD PTR [eax+0x0]
> 006a454a c3 ret
> 006a454b 0f 1f 40 00 nop DWORD PTR [eax+0x0]
> 006a454f 8b 7f 04 mov edi,DWORD PTR [edi+0x4]
> 006a4552 ff 27 jmp DWORD PTR [edi]
Bug 772330 comment 55 mentions that the two "next instructions" could be affected. Should we try two nops instead of one long nop?
Updated•8 years ago
|
platform-rel: --- → ?
Whiteboard: [platform-rel-AMD]
Updated•8 years ago
|
platform-rel: ? → -
Comment 18•8 years ago
|
||
Talking to AMD, they suggested to maybe increase to 32 bytes of NOPs. Next try to squash those bugs.
Attachment #8805497 -
Flags: review?(jdemooij)
Assignee | ||
Comment 19•8 years ago
|
||
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size
Review of attachment 8805497 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks, let's see if this works :)
Attachment #8805497 -
Flags: review?(jdemooij) → review+
Comment 20•8 years ago
|
||
Pushed by hv1989@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d77c9965e9b2
Try to work around mysterious AMD crashes (take 2). r=jandem
Comment 21•8 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
status-firefox52:
--- → fixed
Resolution: --- → FIXED
Updated•8 years ago
|
Updated•8 years ago
|
Comment 22•8 years ago
|
||
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size
Approval Request Comment
[Feature/regressing bug #]:
/
[User impact if declined]:
This contains a possible fix for AMD Bobcat processors. We are not sure this is a definite fix, but we only see numbers high enough on beta to come to a conclusion if this fix works or doesn't work.
[Describe test coverage new/current, TreeHerder]:
Has been on nightly since last week. I also enabled this workaround on my own computer and this caused no problems.
[Risks and why]:
We are improving a work-around for AMD bobcat processors only. We already had a work-around uplifted, but that didn't seem to fix the crashes. Hopefully this new patch does that. We increased the NOP size from 4-bytes to 32-bytes on specific positions. We don't expect any regressions or issues even if the work-around doesn't work.
[String/UUID change made/needed]:
/
Attachment #8805497 -
Flags: approval-mozilla-beta?
Attachment #8805497 -
Flags: approval-mozilla-aurora?
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size
Let's stabilize this on Aurora51 before deciding whether to uplift to Beta50.
Attachment #8805497 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
status-firefox51:
--- → affected
Comment 24•8 years ago
|
||
bugherder uplift |
Hello Jan, Hannes, I don't think this is a release blocker. We are in code freeze mode and only taking fixes for critical new regressions, crashes, security issues. I would like to wontfix this for 50 unless you can provide me with data that this is really valuable based on Nightly52/Aurora51 data. Thanks!
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
Attachment #8805497 -
Flags: approval-mozilla-beta? → approval-mozilla-beta-
Hello DBaron, this is a late uplift in 50 RC. Do you think this is something worth taking a risk on and uplifting to 50 RC2? Or are we better off letting this stabilize on Beta51 (after merge day) for a few weeks and uplifting to planned 50.1.0 release? Appreciate your help.
Flags: needinfo?(dbaron)
Comment 27•8 years ago
|
||
The most important thing is to have this on Beta as soon as possible. Since we can only decide if this workaround fixed the issue or not when it hits Beta. We don't have enough people with this combination of hardware to meaningful trigger this on aurora/nightly.
As soon as this is on Beta we have to wait 2-3 weeks before we have insightful information about it.
Now the most important part now is getting information to drive the next decision.
1) This fixes the crash. In that case we should try to get this in a stable X.1 release.
2) This doesn't fix the crash. We will go back to drawing board to find another workaround.
Currently it doesn't make sense for us to force an uplift for a 50.1.0 release without having crash information on Beta.
As a result we should OR takes this on Beta50 OR wait 2 weeks and automatically have this on Beta51.
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
Comment 28•8 years ago
|
||
As a side-note:
The workaround only triggers for "AMD Ontario" processors and is very benign. As a result even if the work-around doesn't work, we will not back this patch out. But let the removal of the work-around just ride the train.
Comment 29•8 years ago
|
||
I don't have a strong opinion here. I'm generally pretty conservative on uplifts to RC, but if the patch is believed to be very safe I think it's worth considering, given that this was a significant number of crashes.
Flags: needinfo?(dbaron)
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size
I have given it some thought. I will take a leap of faith and take this uplift in 50 RC2. If things look worse, we back it out and build RC3. If things look same/better, we ship with it. If things look worse on release, we do a dot release with a better fix or back out this one. Let's hope this one sticks. Beta50+
Attachment #8805497 -
Flags: approval-mozilla-beta- → approval-mozilla-beta+
Updated•8 years ago
|
Comment 32•8 years ago
|
||
hi, rc2 still shows a particular crash pattern with some js related signatures (bug 1316022) that are correlated to the amd cpus: http://bit.ly/2fduzlh
Flags: needinfo?(jdemooij)
Comment 33•8 years ago
|
||
(In reply to [:philipp] from comment #32)
> hi, rc2 still shows a particular crash pattern with some js related
> signatures (bug 1316022) that are correlated to the amd cpus:
> http://bit.ly/2fduzlh
This bug is about the code generated by SpiderMonkey JIT compilers (as written in the bug title).
Bug 1316022 is dealing with stack trace of code generated by MSVC. If you got a report with no stack s trace then this might be related to this bug.
Flags: needinfo?(jdemooij)
Comment 34•8 years ago
|
||
(In reply to Nicolas B. Pierron [:nbp] from comment #33)
> (In reply to [:philipp] from comment #32)
> > hi, rc2 still shows a particular crash pattern with some js related
> > signatures (bug 1316022) that are correlated to the amd cpus:
> > http://bit.ly/2fduzlh
>
> This bug is about the code generated by SpiderMonkey JIT compilers (as
> written in the bug title).
>
> Bug 1316022 is dealing with stack trace of code generated by MSVC. If you
> got a report with no stack s trace then this might be related to this bug.
What do you mean by 'report with no stack trace'? Do you mean stack traces that contain code generated by the JIT?
If that's the case, we might have a few reports: https://crash-stats.mozilla.com/search/?build_id=20161104212021&signature=%3Djs%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3Djs%3A%3ANativeObject%3A%3AputProperty&signature=%3DAddOrChangeProperty&signature=%3DnsCOMPtr_base%3A%3A~nsCOMPtr_base%20%7C%20mozilla%3A%3Adom%3A%3AWriteOp%3A%3AInit&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AputProperty&signature=%3Djs%3A%3Ajit%3A%3ACallInfo%3A%3AsetImplicitlyUsedUnchecked&signature=~EnterIon&proto_signature=~EnterBaseline&proto_signature=~EnterBaselineMethod&product=Firefox&process_type=browser&date=%3E%3D2016-11-05T00%3A00%3A00.000Z&date=%3C2016-11-09T13%3A45%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_info&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature
I've limited the search to stack traces that contain EnterBaseline, EnterBaselineMethod or EnterIon.
Could you check if one of those fits the conditions?
Comment 35•8 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #34)
> (In reply to Nicolas B. Pierron [:nbp] from comment #33)
> > (In reply to [:philipp] from comment #32)
> > > hi, rc2 still shows a particular crash pattern with some js related
> > > signatures (bug 1316022) that are correlated to the amd cpus:
> > > http://bit.ly/2fduzlh
> >
> > This bug is about the code generated by SpiderMonkey JIT compilers (as
> > written in the bug title).
> >
> > Bug 1316022 is dealing with stack trace of code generated by MSVC. If you
> > got a report with no stack s trace then this might be related to this bug.
>
> What do you mean by 'report with no stack trace'? Do you mean stack traces
> that contain code generated by the JIT?
Yes, this means that the stack frames do not have much symbols, and the next valid frame should be something such as EnterIon / EnterBaseline.
> If that's the case, we might have a few reports: […]
>
> I've limited the search to stack traces that contain EnterBaseline,
> EnterBaselineMethod or EnterIon.
>
> Could you check if one of those fits the conditions?
I will try to explain the reasoning I have on one example:
@0x0 , This seems to appear when some valid stacks exists
js::NativeObject::addPropertyInternal | below them. Thus these might be called by the JIT but
xpc::JSXrayTraits::getPrototype ' not caused by it.
js::Shape::search<T>
js::AddTypePropertyId
AddOrChangeProperty
js::SetPropertyByDefining
mozglue.dll@0x44af ,
@0xffffff8b | These stack frames are in complete random-ish
js::jit::SetPropertyIC::update | order, and this might not correspond
@0x2e70d7 | to any valid call path which can be produced
@0xffffff84 | by either our code-base, or by code generated
intrinsic_OwnPropertyKeys | by the JIT compilers.
@0x75d212f |
@0x2516585f | The most likely source of this non-sense are
@0x168db13 | the remains of previous stack frames pushed
@0x24b676ff | on the stack.
@0x7310942 '
EnterBaseline ,
js::jit::EnterBaselineMethod |
Interpret |
js::RunScript | Valid call stack for entering JIT code.
js::InternalCallOrConstruct |
js::jit::InvokeFunction |
js::jit::TryAttachCallStub '
Thus, from my opinion, this precise stack frame does not seems to be a bug in our JIT compiler. If we were looking for a bug in our JIT compiler I would expect the stack frames to look something like:
@0x75d212f '
js::jit::SetPropertyIC::update | Random mess from the stack, even if some of the mess
@0x168db13 | can be attributed a valid name.
@0x24b676ff |
@0x7310942 '
EnterBaseline ,
js::jit::EnterBaselineMethod |
Interpret |
js::RunScript | Valid call stack for entering JIT code.
js::InternalCallOrConstruct |
js::jit::InvokeFunction |
js::jit::TryAttachCallStub '
Comment 36•8 years ago
|
||
Some of the reports with signature "jit | CORRUPT_CODE" on 50.1.0 seem to be failing in JIT compiled code, e.g. bp-a884f584-e3df-4167-84e9-4adf82161220:
@0x71abc33
@0x75087de
@0x406ecdf7
@0x3ce20942
EnterBaseline
js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
js::RunScript(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)
InternalCall
@0xffffff8b
(We have a spike of crashes related to the usual AMD CPUs in 50.1.0, some of them aren't happening in JIT code, but some of them might be. The bug is bug 1287087).
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Updated•7 years ago
|
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Comment 37•7 years ago
|
||
I think this bug is responsible for these crashes in 57beta: https://crash-stats.mozilla.org/signature/?product=Firefox&version=57.0b3&address=%3D0xffffffffffffffff&signature=EnterBaseline&date=%3E%3D2017-09-20T22%3A19%3A05.000Z&date=%3C2017-09-27T22%3A19%3A05.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
I see nonsense memory violations at addresses without memory access. They happen in stubs with the work-around added by this patch, but happen at beginning of stub before the NOPs.
Comment 38•7 years ago
|
||
https://crash-stats.mozilla.org/search/?cpu_arch=amd64&product=Firefox&version=57.0&date=%3E%3D2017-11-14T18%3A41%3A04.000Z&date=%3C2017-11-21T18%3A41%3A04.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
This crash is number #1 release crash on 64-bit.
Updated•7 years ago
|
Alias: amdbug
Updated•7 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•7 years ago
|
Blocks: win64-migration
status-firefox57:
--- → affected
status-firefox58:
--- → affected
status-firefox59:
--- → affected
Keywords: crash
Comment 39•7 years ago
|
||
Well, well, well. Look what we have here.
AMD's family 20, 14h, meaning Bobcat cpus.
50€ this is about what I extensively reported in bug 1335925, comment 1.
You are going to have a pretty bad time trying to 1. workaround the issue and 2. detect when to apply workaround in the first place.
Link again to the (likely) problem: http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63
For starters, I'm not sure how you'd manage to avoid a vague "set of internal timing conditions", if not by blindly pushing patches over and over again, waiting for wheel of luck to spin.
Secondly, don't expect microcode to mean anything at all.
This is about something else getting updated in the bios. AGESA maybe. Anyway, something that to the best of my knowledge is hard to figure out even after disassembling bios.
Proper way to do control, would be to check MSRs, but that's really not doable (bug 772330, comment 25).
A manual blacklist of 'crashy' unpatched bios revisions, if any, is the only thing that I see could *remotely* do something.
(even though that'd still be suboptimal for cases where you have a CPU with bug fixed in silicon, but also bug-unpatched bios)
Also worth mentioning that linux got a fix backed in the OS.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfc1168de949cd3e9ca18c3480b5085deff1ea7c
If you use anything newer than 4.14-rc7, you should be 100% safe.
On windows it seems like AMD itself used to take care of implementing similar solutions (AMDeFix in their chipset drivers), which I think could even be pushed by microsoft universally.. But for the moment, no luck at all.
Regards.
Comment 40•7 years ago
|
||
Thanks for the info! I've got my hands on some machines with the chip now. I wonder if we should consider disabling JITs on this processor so at least the browser doesn't crash all day =\
Comment 41•7 years ago
|
||
Afaik (bug 1298333) :marco should also have one.
Or better, the apu family should be the right one. But god knows if it's one of those fixed batches or not.
It shouldn't be all that hard to manually check MSRs on linux though.
Or run my test case in bug 1335925 under windows.
Comment 42•7 years ago
|
||
(In reply to Ted Campbell [:tcampbell] from comment #40)
> Thanks for the info! I've got my hands on some machines with the chip now. I
> wonder if we should consider disabling JITs on this processor so at least
> the browser doesn't crash all day =\
Ted: How common are these buggy AMD processors? Do these crashes only affect 64-bit Firefox? IIUC, EnterBaseline is a pretty generic crash signature so I don't know how to filter crash reports for this specific AMD issue.
jandem: Should we disable JITs for these AMD users? How much will that hurt performance?
Flags: needinfo?(jdemooij)
Comment 43•7 years ago
|
||
Afaik they are so common that you ended up blacklisting the *whole* AMD drivers, just in an attempt to workaround this bug.
Almost a ten of times during the years. Sometimes even nailing down the issue (can't give enough credits to dmajor), but basically always forgetting or passing over further specific actions.
If you check bug 1335925, you can see my test case was actually done with FF19, so it's by no means 64-bit only.
Just, as I was saying, the conditions to trigger the bug are pretty much a lottery.
Isn't there some amd engineer around the bugzilla that may have further background?
Comment 44•7 years ago
|
||
I'm not familiar with graphics drivers or the blacklist process, so I'll leave that discussion to the other bug.
Scope of this bug is reducing the EnterBaseline crash signature. We are currently seeing this on Firefox 57 64-bit quite a bit on these CPUs. We've applied mitigations in the past on the advice of AMD engineers, but it seems the combination of codegen tweaks and our 64-bit build in 57 has made this appear in larger than previously see.
Comment 45•7 years ago
|
||
Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on Firefox 57 before it crashed =\
Assignee | ||
Comment 46•7 years ago
|
||
Forwarding to Ted since he's working on this.
Flags: needinfo?(jdemooij) → needinfo?(tcampbell)
Comment 47•7 years ago
|
||
(In reply to Ted Campbell [:tcampbell] from comment #45)
> Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on
> Firefox 57 before it crashed =\
Could not reproduce on W7 on my K53U with unpatched bios.
Comment 48•7 years ago
|
||
It seems it was just luck it crashed so fast. The two machines were able to stay up for the hour I observed them. Will enable full crash dumps today and somehow script them to stop / stop firefox and load a page. I'd like to get a dump of what path is entering the TypeMonitor stub and determine if adding the NOPs to that end improves stability enough.
Also, as suggested by :jandem on IRC, we shouldn't have to go so far as disabling full JIT, but rather disable key IC stubs on affected machines.
Flags: needinfo?(tcampbell)
Comment 49•7 years ago
|
||
Looking at crash stats, these crashes are happening on up-to-date Windows 10 / 8.1 systems with catalyst video driver from windows update and cpu microcode updates applied.
Comment 50•7 years ago
|
||
No Windows 7?
Also, putting aside that at least the provided W10 driver was quite of a letdown last year, that makes sense.
If my digging through coreboot is right, the fix for erratum 688 pass through an AGESA update, which in turn requires a totally new bios release.
Or at least I see this correlation in the newer fixed one for my laptop.
I wonder if said AGESA release number couldn't be read from userspace?
Comment 51•7 years ago
|
||
There are Windows 7 crashes too.
I'm having trouble reproducing still. Based on mirh's observations that old microcode was really unstable, I've disabled Windows microcode updates on one of these and am still not hitting it. The AGESA update thing is interesting. I'll see if I can find any info on what these laptops have.
Current plan is still to try and capture a full process crash dump and investigate the JIT code paths preceding the crash. Mitigation will likely be adding more of the no-ops or disabling the unlucky jit stubs in order to reduce rate of crashes to something more acceptable.
Comment 52•7 years ago
|
||
(In reply to Ted Campbell [:tcampbell] from comment #51)
> There are Windows 7 crashes too.
Oh, ok, I was almost panicking to think the contrary.
Said this, again, for the fourth time: microcode means *nothing*.
0x05000029 (the most updated one) yes, does something, but it really fixes all but the errata we are interested in.
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/microcode_amd.bin.README#n26
AGESA version can usually be easily checked by opening bios image in hex editor and looking for "OntaroPI" string.
Or I think even AIDA64 should be able to detect it, though the magic involved misses me.
At least in my laptop, broken bios have v1.1.0.0, while fixed ones are 1.2.0.0, so that was the educate guess I made.
The last word to know whether you are affected by bug though (according to docs) would be running under linux
sudo setpci -s 18.4 0x164.l
sudo modprobe msr
sudo rdmsr --all 0xc0011021
And check if the lasts of these commands returns respectively 00000003 and 10008000.
Sad reproducing it on 57 is so difficult.
Comment 53•7 years ago
|
||
(In reply to mirh from comment #52)
> Said this, again, for the fourth time: microcode means *nothing*.
> 0x05000029 (the most updated one) yes, does something, but it really fixes
> all but the errata we are interested in.
> https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/
> microcode_amd.bin.README#n26
Ah, nice link. I now follow what you were saying.
> AGESA version can usually be easily checked by opening bios image in hex
> editor and looking for "OntaroPI" string.
> Or I think even AIDA64 should be able to detect it, though the magic
> involved misses me.
AGESA 1.1.0.0 is what both factory and "updated" firmware images show for these X130e machines.
> sudo setpci -s 18.4 0x164.l
>
> sudo modprobe msr
> sudo rdmsr --all 0xc0011021
>
> And check if the lasts of these commands returns respectively 00000003 and
> 10008000.
I get 00000007 and 10008000 on this one laptop, indicating it should already be fixed. I'll check the other laptop on Monday. (I'm don't have my notes here on which one triggered the bug).
Comment 54•7 years ago
|
||
(In reply to Ted Campbell [:tcampbell] from comment #53)
> I get 00000007 and 10008000 on this one laptop, indicating it should already
> be fixed. I'll check the other laptop on Monday. (I'm don't have my notes
> here on which one triggered the bug).
Loool. No **** then you wasn't experiencing the issue!!
I must correct though.
In addition to 00000003, even 00000001 signals a bug.
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
Comment 55•7 years ago
|
||
[Tracking Requested - why for this release]:
This 64-bit EnterBaseline JIT crash is 57.0 content process top crash #2.
tracking-firefox57:
--- → ?
Comment 56•7 years ago
|
||
Ted, Release Management plans to ship a 57.0.1 dot release this week. Do you think we should temporarily disable the JIT for users on these AMD CPUs in 57.0.1? You mentioned on Slack that some people are hitting this crash 10+ times a day.
Flags: needinfo?(tcampbell)
Comment 57•7 years ago
|
||
Ted says we shouldn't rush a fix for 57.0.1. This is sensitive code and, while we are receiving many crash reports, the actual number of affected users is relatively small. On Beta 58, for example, we have about 5000 reports from just 150 installs.
tracking-firefox57:
? → ---
Flags: needinfo?(tcampbell)
Comment 58•7 years ago
|
||
It is looking like only a fraction users of this processor are hitting the problem badly. We don't have a reliable test for identifying which are affected, so it is unwise to risk hurting perf of anyone using this processor. People who are affected should use 32-bit builds.
The second sample laptop I have does report (via privileged mode tests that mirh described above) as having bad silicon. I'm continuing to try and reproduce the issue more than once.
We should continue to track this issue, but fixes should be attempted in beta first.
Comment 60•7 years ago
|
||
Actually, rates in beta are much lower than I thought. 4000/4700 crashes in last week are a single user hitting a crash report issue that caused it to submit a single crash over and over again. Once correcting for that, the AMD bug is still the cause of 100 crashes a day on beta 58. These AMD crashes are still almost all 64-bit. We should still monitor this bug, but it isn't really #2 crash on beta.
Comment 61•7 years ago
|
||
Not sure we should rush anything across trains here.
Assignee | ||
Updated•7 years ago
|
Assignee: jdemooij → tcampbell
Comment 62•4 years ago
|
||
The TypeMonitor IC stubs have been a large source of where these crashes appear. The ubiquitous nature of them and the specific control flow pattern seems to really exacerbate the the CPU defect.
With Bug 1613592, we have disabled the use of these ICs, and with Bug 1673553 we have removed the code altogether. As a result, we see a significant (10x) reduction in these JIT crashes on AMD Bobcat machines starting with FF83. These crashes were about 1/3rd of JIT crashes across all desktop CPUs so, this will be apparent in those stats as well.
Assignee | ||
Comment 63•4 years ago
|
||
Closing given comment 62.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 4 years ago
Resolution: --- → FIXED
Updated•4 years ago
|
Assignee: nobody → jdemooij
Updated•4 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•