Closed Bug 1281759 (amdbug) Opened 8 years ago Closed 4 years ago

Work around mysterious AMD JIT crashes

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking


RESOLVED FIXED
mozilla50
Tracking Status
platform-rel --- -
firefox49 --- fixed
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix

People

(Reporter: jandem, Assigned: jandem)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Keywords: crash, Whiteboard: [platform-rel-AMD])

Attachments

(2 files)

See the AMD crashes in bug 1034706 comment 44. This still happens on 47. I want to add some nops to the Baseline stub where we're crashing (we can probably use cpuid to restrict it to these AMD CPUs) to see if that makes it go away. Info from AMD suggests it might help.
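(For illustration only: a minimal sketch, not the actual patch, of the kind of cpuid check that could restrict the workaround to these AMD CPUs; the restriction matches the family-20/model-0-2 check described in the patch comment below. The helper names are made up. Family 20 decimal is 14h, i.e. Bobcat.)

#include <cstring>
#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <cpuid.h>
#endif

static void ReadCpuid(unsigned leaf, unsigned regs[4]) {
#if defined(_MSC_VER)
  __cpuid(reinterpret_cast<int*>(regs), static_cast<int>(leaf));
#else
  __get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]);
#endif
}

static bool IsAffectedAmdCpu() {
  unsigned regs[4] = {0, 0, 0, 0};
  ReadCpuid(0, regs);                    // leaf 0: vendor string in EBX, EDX, ECX
  char vendor[13];
  memcpy(vendor + 0, &regs[1], 4);
  memcpy(vendor + 4, &regs[3], 4);
  memcpy(vendor + 8, &regs[2], 4);
  vendor[12] = '\0';
  if (strcmp(vendor, "AuthenticAMD") != 0)
    return false;

  ReadCpuid(1, regs);                    // leaf 1: family/model bits in EAX
  unsigned eax = regs[0];
  unsigned family = (eax >> 8) & 0xf;
  unsigned model = (eax >> 4) & 0xf;
  if (family == 0xf) {                   // AMD extended family/model encoding
    family += (eax >> 20) & 0xff;
    model |= ((eax >> 16) & 0xf) << 4;
  }
  return family == 0x14 && model <= 2;   // family 20 (14h, Bobcat), models 0-2
}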
No longer blocks: 1279590
Blocks: 1279590
Attached patch Patch (deleted) — — Splinter Review
This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and model is 0-2. According to AMD engineers, limiting the number of branches per cache line might help, so I'm hopeful this will work. If this patch doesn't help, we can try more or larger NOPs...
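(Again purely illustrative, not the actual MacroAssembler change: the idea is to append a 4-byte NOP, encoding 0F 1F 40 00 ("nop DWORD PTR [eax+0x0]"), after the branches in the stub on the affected CPUs, so that fewer branches share a cache line. The buffer type and helper names below are hypothetical; IsAffectedAmdCpu() is the cpuid sketch above.)

#include <cstdint>
#include <vector>

// 0F 1F 40 00 = "nop DWORD PTR [eax+0x0]", the same 4-byte NOP visible in the
// disassembly later in this bug.
static const uint8_t kFourByteNop[] = { 0x0f, 0x1f, 0x40, 0x00 };

// Append a 4-byte NOP to a (hypothetical) code buffer, but only on the
// affected AMD CPUs, so other hardware pays no code-size cost.
static void MaybePadAfterBranch(std::vector<uint8_t>& code) {
  if (!IsAffectedAmdCpu())  // cpuid check sketched earlier in this bug
    return;
  code.insert(code.end(), kFourByteNop, kFourByteNop + sizeof(kFourByteNop));
}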
Attachment #8764587 - Flags: review?(sunfish)
Comment on attachment 8764587 [details] [diff] [review]
Patch

Review of attachment 8764587 [details] [diff] [review]:
-----------------------------------------------------------------

Yikes. Patch looks good though.
Attachment #8764587 - Flags: review?(sunfish) → review+
Pushed by jandemooij@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/ae024662143f Try to work around mysterious AMD crashes. r=sunfish
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Flags: needinfo?(jdemooij)
> This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and
> model is 0-2.

Do we have confirmation from AMD that these models are the only ones affected? My analysis in bug 1034706 comment 63 suggests that these families account for only a fraction of suspicious JIT crashes on AMD CPUs.
Also, do you have an analysis plan to determine if this change helps in practice?
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Do we have confirmation from AMD that these models are the only ones
> affected? My analysis in bug 1034706 comment 63 suggests that these
> families account for only a fraction of suspicious JIT crashes on AMD CPUs.

No confirmation, but I haven't seen *these* crashes with other (AMD) CPUs. I can take another look at some of the other crashes though.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> Also, do you have an analysis plan to determine if this change helps in
> practice?

I'll check if there's any difference on Nightly later today. Unfortunately not that many users are on Nightly and most probably don't use these CPUs, so we'll likely have to backport to see if it helps...
This might fix bug 1290388.
Blocks: 1290388
(In reply to Marco Castelluccio [:marco] (PTO until August 9) from comment #8)
> This might fix bug 1290388.

I don't think so; I moved it to the See Also list :)
No longer blocks: 1290388
Flags: needinfo?(jdemooij)
Comment on attachment 8764587 [details] [diff] [review]
Patch

This patch attempts to work around JIT code crashes on certain AMD CPUs. It inserts some NOP instructions on this particular CPU. It should be safe to backport.

Approval Request Comment
[Feature/regressing bug #]: -
[User impact if declined]: Potential crashes with certain AMD CPUs
[Describe test coverage new/current, TreeHerder]: On m-c and aurora for a while. I haven't seen these crashes on builds with this patch applied, but it's hard to say much because the crashes don't seem to affect all builds.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8764587 - Flags: approval-mozilla-beta?
Comment on attachment 8764587 [details] [diff] [review]
Patch

Let's take it to try to fix the famous issue. Should be in 49 beta 3.
Attachment #8764587 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Looks like we have some new crashes on beta 3 described in bug 1294963. If you think it's from this, can you back out your changes or submit a fix? Thanks!
Flags: needinfo?(jdemooij)
On 2nd look, this does seem to have caused the #1 topcrash on beta 3. Wes, can you back it out before the beta 5 build?
I'll take a closer look, but it's *extremely unlikely* this caused these crashes. What makes us think this is related to bug 1294963? The AMD bug shows up in some builds but not others. Just because this patch fixes one instance of this bug by inserting NOP instructions, doesn't mean it caused a random bug in entirely unrelated code...
This workaround probably didn't work. I still see similar crashes like this one:

https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815

We did correctly identify the CPU and inserted NOP instructions, but apparently that was not enough to stop the crashes. I'll check the errata and conversations with AMD tomorrow to see if there's something I missed... I'll also check if there's any difference in frequency.

006a4530  83 f9 8c              cmp ecx,0xffffff8c       <- crash here
006a4533  0f 85 16 00 00 00     jne 0x006a454f
006a4539  0f 1f 40 00           nop DWORD PTR [eax+0x0]
006a453d  39 57 10              cmp DWORD PTR [edi+0x10],edx
006a4540  0f 85 09 00 00 00     jne 0x006a454f
006a4546  0f 1f 40 00           nop DWORD PTR [eax+0x0]
006a454a  c3                    ret
006a454b  0f 1f 40 00           nop DWORD PTR [eax+0x0]
006a454f  8b 7f 04              mov edi,DWORD PTR [edi+0x4]
006a4552  ff 27                 jmp DWORD PTR [edi]
Flags: needinfo?(jdemooij)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P3
(In reply to Jan de Mooij [:jandem] from comment #16)
> This workaround probably didn't work. I still see similar crashes like this
> one:
>
> https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815
>
> We did correctly identify the CPU and inserted NOP instructions, but
> apparently that was not enough to stop the crashes. I'll check the errata
> and conversations with AMD tomorrow to see if there's something I missed...
> I'll also check if there's any difference in frequency.
>
> 006a4530  83 f9 8c              cmp ecx,0xffffff8c       <- crash here
> 006a4533  0f 85 16 00 00 00     jne 0x006a454f
> 006a4539  0f 1f 40 00           nop DWORD PTR [eax+0x0]
> 006a453d  39 57 10              cmp DWORD PTR [edi+0x10],edx
> 006a4540  0f 85 09 00 00 00     jne 0x006a454f
> 006a4546  0f 1f 40 00           nop DWORD PTR [eax+0x0]
> 006a454a  c3                    ret
> 006a454b  0f 1f 40 00           nop DWORD PTR [eax+0x0]
> 006a454f  8b 7f 04              mov edi,DWORD PTR [edi+0x4]
> 006a4552  ff 27                 jmp DWORD PTR [edi]

Bug 772330 comment 55 mentions that the two "next instructions" could be affected. Should we try two NOPs instead of one long NOP?
platform-rel: --- → ?
Whiteboard: [platform-rel-AMD]
platform-rel: ? → -
Attached patch Increasing NOP size (deleted) — — Splinter Review
Talking to AMD, they suggested maybe increasing to 32 bytes of NOPs. Next try to squash those bugs.
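(For illustration only, under the same assumptions as the earlier sketches, not the code in the attached patch: 32 bytes of padding could be built from the standard 8-byte multi-byte NOP encoding 0F 1F 84 00 00 00 00 00. The helper and buffer names are hypothetical.)

#include <cstdint>
#include <vector>

// Recommended 8-byte multi-byte NOP: 0F 1F 84 00 00 00 00 00.
static const uint8_t kEightByteNop[] = {
  0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

// Emit 32 bytes of NOP padding (4 x 8-byte NOPs) on the affected CPUs only.
static void Emit32ByteNopPad(std::vector<uint8_t>& code) {
  if (!IsAffectedAmdCpu())  // same cpuid gate as before
    return;
  for (int i = 0; i < 4; i++)
    code.insert(code.end(), kEightByteNop, kEightByteNop + sizeof(kEightByteNop));
}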
Attachment #8805497 - Flags: review?(jdemooij)
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Review of attachment 8805497 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, let's see if this works :)
Attachment #8805497 - Flags: review?(jdemooij) → review+
Pushed by hv1989@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d77c9965e9b2 Try to work around mysterious AMD crashes (take 2). r=jandem
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Approval Request Comment
[Feature/regressing bug #]: /
[User impact if declined]: This contains a possible fix for AMD Bobcat processors. We are not sure this is a definite fix, but only beta has user numbers high enough to conclude whether it works.
[Describe test coverage new/current, TreeHerder]: Has been on nightly since last week. I also enabled this workaround on my own computer and it caused no problems.
[Risks and why]: We are improving a work-around for AMD Bobcat processors only. We already had a work-around uplifted, but that didn't seem to fix the crashes; hopefully this new patch does. We increased the NOP size from 4 bytes to 32 bytes at specific positions. We don't expect any regressions or issues even if the work-around doesn't work.
[String/UUID change made/needed]: /
Attachment #8805497 - Flags: approval-mozilla-beta?
Attachment #8805497 - Flags: approval-mozilla-aurora?
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Let's stabilize this on Aurora51 before deciding whether to uplift to Beta50.
Attachment #8805497 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Hello Jan, Hannes, I don't think this is a release blocker. We are in code freeze mode and only taking fixes for critical new regressions, crashes, security issues. I would like to wontfix this for 50 unless you can provide me with data that this is really valuable based on Nightly52/Aurora51 data. Thanks!
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
Attachment #8805497 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Hello DBaron, this is a late uplift in 50 RC. Do you think this is something worth taking a risk on and uplifting to 50 RC2? Or are we better off letting this stabilize on Beta51 (after merge day) for a few weeks and uplifting to planned 50.1.0 release? Appreciate your help.
Flags: needinfo?(dbaron)
The most important thing is to have this on Beta as soon as possible, since we can only decide whether this workaround fixed the issue once it hits Beta. We don't have enough people with this combination of hardware to meaningfully trigger this on aurora/nightly. As soon as this is on Beta we have to wait 2-3 weeks before we have insightful information about it.

The most important part now is getting the information to drive the next decision:
1) This fixes the crash. In that case we should try to get this in a stable X.1 release.
2) This doesn't fix the crash. We will go back to the drawing board to find another workaround.

Currently it doesn't make sense for us to force an uplift for a 50.1.0 release without having crash information on Beta. As a result we should either take this on Beta50, or wait 2 weeks and automatically have this on Beta51.
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
As a side-note: the workaround only triggers for "AMD Ontario" processors and is very benign. As a result, even if the work-around doesn't help, we will not back this patch out; we'll just let its removal ride the trains.
I don't have a strong opinion here. I'm generally pretty conservative on uplifts to RC, but if the patch is believed to be very safe I think it's worth considering, given that this was a significant number of crashes.
Flags: needinfo?(dbaron)
Depends on: 1315196
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

I have given it some thought. I will take a leap of faith and take this uplift in 50 RC2. If things look worse, we back it out and build RC3. If things look same/better, we ship with it. If things look worse on release, we do a dot release with a better fix or back out this one. Let's hope this one sticks. Beta50+
Attachment #8805497 - Flags: approval-mozilla-beta- → approval-mozilla-beta+
hi, rc2 still shows a particular crash pattern with some js related signatures (bug 1316022) that are correlated to the amd cpus: http://bit.ly/2fduzlh
Flags: needinfo?(jdemooij)
(In reply to [:philipp] from comment #32)
> hi, rc2 still shows a particular crash pattern with some js related
> signatures (bug 1316022) that are correlated to the amd cpus:
> http://bit.ly/2fduzlh

This bug is about the code generated by the SpiderMonkey JIT compilers (as written in the bug title).

Bug 1316022 is dealing with stack traces of code generated by MSVC. If you got a report with no stack trace then this might be related to this bug.
Flags: needinfo?(jdemooij)
(In reply to Nicolas B. Pierron [:nbp] from comment #33)
> (In reply to [:philipp] from comment #32)
> > hi, rc2 still shows a particular crash pattern with some js related
> > signatures (bug 1316022) that are correlated to the amd cpus:
> > http://bit.ly/2fduzlh
>
> This bug is about the code generated by the SpiderMonkey JIT compilers (as
> written in the bug title).
>
> Bug 1316022 is dealing with stack traces of code generated by MSVC. If you
> got a report with no stack trace then this might be related to this bug.

What do you mean by 'report with no stack trace'? Do you mean stack traces that contain code generated by the JIT?

If that's the case, we might have a few reports:
https://crash-stats.mozilla.com/search/?build_id=20161104212021&signature=%3Djs%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3Djs%3A%3ANativeObject%3A%3AputProperty&signature=%3DAddOrChangeProperty&signature=%3DnsCOMPtr_base%3A%3A~nsCOMPtr_base%20%7C%20mozilla%3A%3Adom%3A%3AWriteOp%3A%3AInit&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AputProperty&signature=%3Djs%3A%3Ajit%3A%3ACallInfo%3A%3AsetImplicitlyUsedUnchecked&signature=~EnterIon&proto_signature=~EnterBaseline&proto_signature=~EnterBaselineMethod&product=Firefox&process_type=browser&date=%3E%3D2016-11-05T00%3A00%3A00.000Z&date=%3C2016-11-09T13%3A45%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_info&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

I've limited the search to stack traces that contain EnterBaseline, EnterBaselineMethod or EnterIon.

Could you check if one of those fits the conditions?
(In reply to Marco Castelluccio [:marco] from comment #34)
> (In reply to Nicolas B. Pierron [:nbp] from comment #33)
> > (In reply to [:philipp] from comment #32)
> > > hi, rc2 still shows a particular crash pattern with some js related
> > > signatures (bug 1316022) that are correlated to the amd cpus:
> > > http://bit.ly/2fduzlh
> >
> > This bug is about the code generated by the SpiderMonkey JIT compilers (as
> > written in the bug title).
> >
> > Bug 1316022 is dealing with stack traces of code generated by MSVC. If you
> > got a report with no stack trace then this might be related to this bug.
>
> What do you mean by 'report with no stack trace'? Do you mean stack traces
> that contain code generated by the JIT?

Yes, this means that the stack frames do not have many symbols, and the next valid frame should be something such as EnterIon / EnterBaseline.

> If that's the case, we might have a few reports: […]
>
> I've limited the search to stack traces that contain EnterBaseline,
> EnterBaselineMethod or EnterIon.
>
> Could you check if one of those fits the conditions?

I will try to explain the reasoning I have on one example:

@0x0                                   ,  This seems to appear when some valid stacks exist
js::NativeObject::addPropertyInternal  |  below them. Thus these might be called by the JIT but
xpc::JSXrayTraits::getPrototype        '  not caused by it.
js::Shape::search<T>
js::AddTypePropertyId
AddOrChangeProperty
js::SetPropertyByDefining
mozglue.dll@0x44af                     ,
@0xffffff8b                            |  These stack frames are in complete random-ish
js::jit::SetPropertyIC::update         |  order, and this might not correspond
@0x2e70d7                              |  to any valid call path which can be produced
@0xffffff84                            |  by either our code-base, or by code generated
intrinsic_OwnPropertyKeys              |  by the JIT compilers.
@0x75d212f                             |
@0x2516585f                            |  The most likely source of this nonsense are
@0x168db13                             |  the remains of previous stack frames pushed
@0x24b676ff                            |  on the stack.
@0x7310942                             '
EnterBaseline                          ,
js::jit::EnterBaselineMethod           |
Interpret                              |
js::RunScript                          |  Valid call stack for entering JIT code.
js::InternalCallOrConstruct            |
js::jit::InvokeFunction                |
js::jit::TryAttachCallStub             '

Thus, in my opinion, this precise stack trace does not seem to be a bug in our JIT compiler.

If we were looking for a bug in our JIT compiler, I would expect the stack frames to look something like:

@0x75d212f                             ,
js::jit::SetPropertyIC::update         |  Random mess from the stack, even if some of the mess
@0x168db13                             |  can be attributed a valid name.
@0x24b676ff                            |
@0x7310942                             '
EnterBaseline                          ,
js::jit::EnterBaselineMethod           |
Interpret                              |
js::RunScript                          |  Valid call stack for entering JIT code.
js::InternalCallOrConstruct            |
js::jit::InvokeFunction                |
js::jit::TryAttachCallStub             '
Some of the reports with signature "jit | CORRUPT_CODE" on 50.1.0 seem to be failing in JIT compiled code, e.g. bp-a884f584-e3df-4167-84e9-4adf82161220:

@0x71abc33
@0x75087de
@0x406ecdf7
@0x3ce20942
EnterBaseline
js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
js::RunScript(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)
InternalCall
@0xffffff8b

(We have a spike of crashes related to the usual AMD CPUs in 50.1.0; some of them aren't happening in JIT code, but some of them might be. The bug is bug 1287087.)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Alias: amdbug
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Well, well, well. Look what we have here. AMD's family 20, 14h, meaning Bobcat CPUs. 50€ says this is about what I extensively reported in bug 1335925, comment 1.

You are going to have a pretty bad time trying to 1. work around the issue and 2. detect when to apply the workaround in the first place.

Link again to the (likely) problem: http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63

For starters, I'm not sure how you'd manage to avoid a vague "set of internal timing conditions", if not by blindly pushing patches over and over again, waiting for the wheel of luck to spin.

Secondly, don't expect microcode to mean anything at all. This is about something else getting updated in the BIOS. AGESA maybe. Anyway, something that to the best of my knowledge is hard to figure out even after disassembling the BIOS.

The proper way to check would be to read MSRs, but that's really not doable (bug 772330, comment 25). A manual blacklist of 'crashy' unpatched BIOS revisions, if any, is the only thing that I see could *remotely* do something (even though that'd still be suboptimal for cases where you have a CPU with the bug fixed in silicon, but a bug-unpatched BIOS).

Also worth mentioning that Linux got a fix baked into the OS:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfc1168de949cd3e9ca18c3480b5085deff1ea7c
If you use anything newer than 4.14-rc7, you should be 100% safe. On Windows it seems like AMD itself used to take care of implementing similar solutions (AMDeFix in their chipset drivers), which I think could even be pushed by Microsoft universally... but for the moment, no luck at all.

Regards.
Blocks: 1335925
Thanks for the info! I've got my hands on some machines with the chip now. I wonder if we should consider disabling JITs on this processor so at least the browser doesn't crash all day =\
Afaik (bug 1298333) :marco should also have one. Or better, the apu family should be the right one. But god knows if it's one of those fixed batches or not. It shouldn't be all that hard to manually check MSRs on linux though. Or run my test case in bug 1335925 under windows.
(In reply to Ted Campbell [:tcampbell] from comment #40)
> Thanks for the info! I've got my hands on some machines with the chip now. I
> wonder if we should consider disabling JITs on this processor so at least
> the browser doesn't crash all day =\

Ted: How common are these buggy AMD processors? Do these crashes only affect 64-bit Firefox? IIUC, EnterBaseline is a pretty generic crash signature so I don't know how to filter crash reports for this specific AMD issue.

jandem: Should we disable JITs for these AMD users? How much will that hurt performance?
Flags: needinfo?(jdemooij)
Afaik they are so common that you ended up blacklisting the *whole* set of AMD drivers, just in an attempt to work around this bug. Nearly ten times over the years. Sometimes even nailing down the issue (can't give enough credit to dmajor), but basically always forgetting or passing over further specific actions.

If you check bug 1335925, you can see my test case was actually done with FF19, so it's by no means 64-bit only. Just, as I was saying, the conditions to trigger the bug are pretty much a lottery.

Isn't there some AMD engineer around Bugzilla who may have further background?
I'm not familiar with graphics drivers or the blacklist process, so I'll leave that discussion to the other bug. The scope of this bug is reducing the EnterBaseline crash signature. We are currently seeing this quite a bit on Firefox 57 64-bit on these CPUs. We've applied mitigations in the past on the advice of AMD engineers, but it seems the combination of codegen tweaks and our 64-bit build in 57 has made this appear in larger numbers than previously seen.
Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on Firefox 57 before it crashed =\
Forwarding to Ted since he's working on this.
Flags: needinfo?(jdemooij) → needinfo?(tcampbell)
(In reply to Ted Campbell [:tcampbell] from comment #45)
> Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on
> Firefox 57 before it crashed =\

Could not reproduce on W7 on my K53U with unpatched BIOS.
It seems it was just luck that it crashed so fast. The two machines were able to stay up for the hour I observed them. I will enable full crash dumps today and somehow script them to start/stop Firefox and load a page. I'd like to get a dump of what path is entering the TypeMonitor stub and determine if adding the NOPs to that end improves stability enough.

Also, as suggested by :jandem on IRC, we shouldn't have to go so far as disabling the full JIT, but rather disable key IC stubs on affected machines.
Flags: needinfo?(tcampbell)
Looking at crash stats, these crashes are happening on up-to-date Windows 10 / 8.1 systems with catalyst video driver from windows update and cpu microcode updates applied.
No Windows 7?

Also, putting aside that at least the provided W10 driver was quite a letdown last year, that makes sense. If my digging through coreboot is right, the fix for erratum 688 comes through an AGESA update, which in turn requires a totally new BIOS release. Or at least I see this correlation in the newer, fixed one for my laptop.

I wonder if said AGESA release number couldn't be read from userspace?
There are Windows 7 crashes too. I'm still having trouble reproducing. Based on mirh's observations that old microcode was really unstable, I've disabled Windows microcode updates on one of these machines and am still not hitting it. The AGESA update thing is interesting; I'll see if I can find any info on what these laptops have.

The current plan is still to try to capture a full process crash dump and investigate the JIT code paths preceding the crash. Mitigation will likely be adding more of the no-ops or disabling the unlucky JIT stubs in order to reduce the rate of crashes to something more acceptable.
(In reply to Ted Campbell [:tcampbell] from comment #51)
> There are Windows 7 crashes too.

Oh, ok, I was almost panicking thinking the contrary.

As I said, again, for the fourth time: microcode means *nothing*. 0x05000029 (the most updated one) does do something, but it really fixes all but the erratum we are interested in:
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/microcode_amd.bin.README#n26

The AGESA version can usually be checked easily by opening the BIOS image in a hex editor and looking for the "OntaroPI" string. Or I think even AIDA64 should be able to detect it, though the magic involved escapes me. At least on my laptop, broken BIOSes have v1.1.0.0 while fixed ones have 1.2.0.0, so that was the educated guess I made.

The definitive way to know whether you are affected by the bug (according to the docs) would be to run, under Linux:

  sudo setpci -s 18.4 0x164.l

  sudo modprobe msr
  sudo rdmsr --all 0xc0011021

and check whether the setpci and rdmsr commands return respectively 00000003 and 10008000.

Sad that reproducing it on 57 is so difficult.
(In reply to mirh from comment #52)
> As I said, again, for the fourth time: microcode means *nothing*. 0x05000029
> (the most updated one) does do something, but it really fixes all but the
> erratum we are interested in:
> https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/microcode_amd.bin.README#n26

Ah, nice link. I now follow what you were saying.

> The AGESA version can usually be checked easily by opening the BIOS image in
> a hex editor and looking for the "OntaroPI" string. Or I think even AIDA64
> should be able to detect it, though the magic involved escapes me.

AGESA 1.1.0.0 is what both factory and "updated" firmware images show for these X130e machines.

>   sudo setpci -s 18.4 0x164.l
>
>   sudo modprobe msr
>   sudo rdmsr --all 0xc0011021
>
> and check whether the setpci and rdmsr commands return respectively 00000003
> and 10008000.

I get 00000007 and 10008000 on this one laptop, indicating it should already be fixed. I'll check the other laptop on Monday. (I don't have my notes here on which one triggered the bug.)
(In reply to Ted Campbell [:tcampbell] from comment #53)
> I get 00000007 and 10008000 on this one laptop, indicating it should already
> be fixed. I'll check the other laptop on Monday. (I don't have my notes here
> on which one triggered the bug.)

Loool. No **** then you weren't experiencing the issue!!

I must correct one thing though: in addition to 00000003, 00000001 also signals a buggy part.
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
[Tracking Requested - why for this release]: This 64-bit EnterBaseline JIT crash is 57.0 content process top crash #2.
Ted, Release Management plans to ship a 57.0.1 dot release this week. Do you think we should temporarily disable the JIT for users on these AMD CPUs in 57.0.1? You mentioned on Slack that some people are hitting this crash 10+ times a day.
Flags: needinfo?(tcampbell)
Ted says we shouldn't rush a fix for 57.0.1. This is sensitive code and, while we are receiving many crash reports, the actual number of affected users is relatively small. On Beta 58, for example, we have about 5000 reports from just 150 installs.
Flags: needinfo?(tcampbell)
It is looking like only a fraction of the users of this processor are hitting the problem badly. We don't have a reliable test for identifying which users are affected, so it is unwise to risk hurting the performance of everyone using this processor. People who are affected should use 32-bit builds.

The second sample laptop I have does report (via the privileged-mode tests that mirh described above) as having bad silicon. I'm continuing to try to reproduce the issue more than once.

We should continue to track this issue, but fixes should be attempted in beta first.
[Tracking Requested - why for this release]:
Actually, rates in beta are much lower than I thought. 4000 of the 4700 crashes in the last week are from a single user hitting a crash-reporter issue that caused it to submit a single crash over and over again.

Once correcting for that, the AMD bug is still the cause of about 100 crashes a day on beta 58. These AMD crashes are still almost all 64-bit. We should still monitor this bug, but it isn't really the #2 crash on beta.
Depends on: 1421445
Not sure we should rush anything across trains here.
Assignee: jdemooij → tcampbell
Assignee: tcampbell → nobody
Blocks: SadJit

The TypeMonitor IC stubs have been a large source of these crashes. Their ubiquity and their specific control-flow pattern seem to really exacerbate the CPU defect.

With Bug 1613592, we have disabled the use of these ICs, and with Bug 1673553 we have removed the code altogether. As a result, we see a significant (10x) reduction in these JIT crashes on AMD Bobcat machines starting with FF83. These crashes were about a third of JIT crashes across all desktop CPUs, so this will be apparent in those stats as well.

https://crash-stats.mozilla.org/signature/?release_channel=release&product=Firefox&cpu_info=~family%2020&signature=js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2020-09-07T13%3A49%3A00.000Z&date=%3C2020-12-07T13%3A49%3A00.000Z#graphs

Closing given comment 62.

Status: REOPENED → RESOLVED
Closed: 8 years ago → 4 years ago
Resolution: --- → FIXED
Assignee: nobody → jdemooij