Closed Bug 772330 Opened 12 years ago Closed 9 years ago

layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times

Categories

(Core :: Layout, defect)

x86
Windows 7
defect
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: dbaron, Assigned: away)

References

(Depends on 1 open bug, Blocks 5 open bugs)

Details

(Keywords: crash, topcrash-win)

Attachments

(6 files)

We've had lots of layout crashes associated with the AMD Radeon HD 6xxx series graphics drivers. I think they've all (or mostly) briefly spiked and then gone away. Odds are they have the same underlying cause. This is a meta-bug to track the problem. (One question of interest is whether they ever go away, or just keep moving signature constantly and stay around all the time.)
Keywords: crash
(In reply to David Baron [:dbaron] from comment #0)
> (One question of interest is whether they ever go away, or just keep moving
> signature constantly and stay around all the time.)
They go away, then come back later, sometimes two builds later, other times hundreds of builds later. They don't usually stay for more than one build.
How do you know that? Maybe most of the time they're spread between a large number of low-frequency signatures, and occasionally they concentrate on a single signature. Is there a way to verify that that's not happening? (Back when we generated CSV files, I could have, but I don't see those anymore.)
So another theory is that this driver is doing some sort of binary patching or hooking that was designed for a particular version of Firefox, but the check they do to make sure they have the right version relies on a very small amount of variable data, such that it has a significant false-positive rate. The result would be that they patch or hook certain Firefox versions with an appearance of randomness. If this is the case, it's only a matter of time before the pattern matches on a release build.
(In reply to David Baron [:dbaron] from comment #2)
> How do you know that? Maybe most of the time they're spread between a large
> number of low-frequency signatures
It impacts the crash ratio.

> and occasionally they concentrate on a single signature.
When this issue happens, there are about half a dozen crash signatures.

(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
It has already happened in Fx 11.0 (see bug 700288 comment 24).
Bug 768383 was an instance of this that showed up in FF14b9 and went away in FF14b10; I examined the minidump and it was an almost impossible crash (null deref after null-check with the intervening code being fairly well defined). I chalked it up to a weird PGO fluke, but it's also possible that the driver is overwriting a stack location or register. But if that were the case I'd expect to see the crashes spread out more. And there weren't any graphics calls nested in this stack frame, at least that I could see. I'm a bit stumped by this one. It's the sort of thing that I'd love to catch in record and replay but we probably can't have those graphics drivers in a VM anyway.
(In reply to Scoobidiver from comment #4)
> (In reply to David Baron [:dbaron] from comment #2)
> > How do you know that? Maybe most of the time they're spread between a large
> > number of low-frequency signatures
> It impacts the crash ratio.
Ah, ok, so I don't need to gather data from https://crash-analysis.mozilla.com/crash_analysis/
(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
10.0.6 ESR is affected!
Blocks: 837371
Blocks: 839270
It might be useful to try to figure out what's similar about the builds that are affected that isn't a characteristic of the unaffected builds. Does anybody happen to have a list of the affected builds?
So if we have any contacts at AMD, it might be worth asking them what regression might have been introduced on their end between (probably, though we don't have 100% confidence in these ranges):
version 8.17.10.1047 and 8.17.10.1052 of aticfx32.dll
version 8.17.10.310 and 8.17.10.318 of atidxx32.dll
version 8.14.1.6150 and 8.14.1.6160 of atiuxpag.dll
(I got these ranges from the correlations for bug 839270; the third is consistent with bug 714320 comment 26 from over a year ago.)
roc did investigation of another minidump in bug 839270 comment 22.
(In reply to David Baron [:dbaron] from comment #9)
> So if we have any contacts at AMD, it might be worth asking them what
oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
Summary of bug 839270 comment #22: We seem to do an unexpected jump forward by a short distance when we reach a specific point in our code, jumping into the middle of an instruction in another function. This doesn't always happen or the browser couldn't even start, but when it does happen it always happens in the same place in libxul for all the crash reports in that bug (even though those are different addresses since libxul is moved by ASLR). In bug 839270 the jump originates from a small leaf function which has clearly been compiled correctly and cannot be causing the jump itself. Whatever's causing this must be very subtle and is almost certainly unrelated to the Gecko code implicated by the crash stacks.

I have some contacts at AMD too. I'll try them.
I got minidumps for some of the other crash bugs.

Bug 700288 is similar to bug 839270 --- we're in a small leaf function (UnionRectEdges), and inexplicably jump to the middle of an instruction (in this case in the same function though). However, the address within libxul is different from (and nowhere near) the address for the crash in bug 839270.

Bug 714320 affects AddChild, like bug 839270, but I'm not sure what's going on there. See https://bugzilla.mozilla.org/show_bug.cgi?id=714320#c79.

Bug 722024 is like bug 700288. It looks like we're crashing in UnionRectEdges with an inexplicable jump forward past the end of the function, into int3 padding in that case.

In summary, the code address where we go wrong seems to vary between libxul builds (but is at the same location in libxul for all crashes, regardless of ASLR). I bet the varying impact of these crashes depends on exactly which function (if any) gets cursed.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> I have some contacts at AMD too. I'll try them.
Email sent.
One question that might be helpful to answer: do we ever see these crashes in more than one function for a given libxul build?
I *think* that we're seeing it in only one function per build, but one would probably need to look through all the dependent bugs and compare the builds where those happen.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> I *think* that we're seeing it in only one function per build, but one would
> probably need to look through all the dependent bugs and compare the builds
> where those happen.
Actually, scratch that. We have at least three different signatures for bug 839270 in 19.0b5 alone.
(In reply to David Baron [:dbaron] from comment #11)
> (In reply to David Baron [:dbaron] from comment #9)
> > So if we have any contacts at AMD, it might be worth asking them what
>
> oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
The people I know are the same people Robert emailed. Unfortunately I don't think we've heard back yet.
Blocks: 830531
Depends on: 845970
Blocks: 806071
Blocks: 854820
Blocks: 863714
Blocks: 865701
Given that the crashes tracked here are highly visible when they spike and are a continuous subject of tracking by stability and release management, I'll invoke the "bugs that spearhead investigation or fixes across a large collection of crashes" clause of https://wiki.mozilla.org/CrashKill/Topcrash on this meta tracking bug and add the topcrash keyword here. We should not use it on individual signatures, though, since we know those are per-build fluctuations anyhow.
Keywords: topcrash
I have a system with a Radeon HD 6310 (the iGPU of an AMD E-350) which is used daily as an HTPC. Bug 840161 blacklists window acceleration and D2D acceleration on this GPU due to this bug. FWIW, I haven't had any crashes with layers.acceleration.force-enabled=true, either with Fx22 (my main browser) or in nightly builds, which I update regularly. I also tried gfx.direct2d.force-enabled=true without crashes, but typically it's not on since it sometimes degrades performance. If I can help with testing in any way, please use me. My gfx about:support info is available at bug 840161 comment 15.
Blocks: 902349
Assignee: nobody → dmajor
TL;DR - We have a lot of observations but are far from a solution. Here's the story so far.

On 21.0b4, the bug manifests as a crash usually near xul!mozilla::dom::DocumentBinding::CreateInterfaceObjects. The specific instruction offset and the nature of the crash (access violation, invalid instruction, privileged instruction, etc.) can vary. I can not-very-reliably repro this on the netbook named "MOZILLA-RD6310" by opening up some youtube videos in one window, then opening another window with nbcnews.com and mousing around and reloading until it crashes. It can take anywhere from a minute to an hour or more.

After the crash, everything seems as if xul!nsStyleContext::AddChild+0x12 (xul+0x7d760) had been corrupted to contain an instruction reading "call CreateInterfaceObjects+0x20 (xul+0xa9b01)". There are several reasons for believing this. First, the top of the stack contains AddChild+0x17, as if a return address had been pushed during a call instruction (five bytes). Second, AddChild+0x12 is a valid instruction reachable in the original binary, but AddChild+0x17 is in the middle of an instruction and could never be a return address without corruption. Third, CreateInterfaceObjects+0x20 is also in the middle of an instruction, so it could not be a valid branch target in an unmodified binary. The affected locations are always offsets from xul.dll, so the absolute values change based on xul's base.

Here's where it gets suspicious: by the time we notice the crash, the memory at AddChild+0x12 appears to have its original values. So we can't definitively prove whether the bug is indeed the corruption described above, or some other badness that happens to have the same symptoms. It's possible that the driver is modifying the xul.dll memory (perhaps as a write-test) and quickly modifying it back to the original value. There are other possibilities like a hardware issue in the instruction fetch, but that seems less likely. Assuming that the driver is modifying memory, it would have to touch five bytes, more than it could typically do with regular 32-bit operations:
89 08 c3 83 c0 are the bytes at xul+0x7d760 originally.
e8 9c c3 02 00 are the bytes that would cause our theorized call.

Memory access breakpoints on the affected addresses don't trigger. Presumably that's because the driver accesses that physical memory via a different virtual-to-physical mapping (hardware breakpoints are based on virtual address). I tried dumping the driver's address mappings to see what other address it might be using, but there were so many mappings for that region that it's not practical to go chasing them all down.

Another complication is that the memory at CreateInterfaceObjects+0x20 changes each time you load Firefox. That memory just so happens to contain an absolute address of a global variable (sPrefCachesInited). The Windows loader patches up the address based on where xul.dll gets based each time. What this means is, if we execute CreateInterfaceObjects+0x20, occasionally it looks like an innocuous instruction, so we continue on to 0x21 and so on. Depending on the interpretation of that memory, we crash in different ways and at different offsets. Usually it's plus-twenty-something, but in a few cases I've seen it continue on for dozens of instructions and jmp far away to mozjs. Also, sometimes those instructions contain a "pop" so that AddChild+0x17 is no longer on our stack.

I've tried detouring AddChild in several places, adding instructions that verify AddChild+0x12 before executing them. If the verification were to fail then we'd have solid proof of memory corruption. Unfortunately, I haven't been able to hit the crash after doing this. Either my reading of those values interferes with the execution of the scenario, or I just haven't waited long enough on the unreliable repro; I can't really say. [Note: This detouring is not a fix that we can apply to source code; I can only do it in the debugger with after-the-fact knowledge of what function fails on this build.]

All of the above applies to 21.0b4 only. The crash is not machine-specific (the same functions are affected on our netbook and in various user crash dumps) but it is build-specific, since function layout changes with each compilation. I need to do more digging in the other bugs to see whether the victim is always xul+0x7d760, or at least some predictable location. If so, maybe we could play some tricks with the linker to avoid putting anything critical there.
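For illustration only (not part of the original analysis): the call arithmetic above can be double-checked with a few lines of C, using nothing but the bytes and offsets quoted in this comment. E8 is a near call with a 32-bit displacement relative to the next instruction, so e8 9c c3 02 00 at xul+0x7d760 resolves to xul+0x7d765 + 0x2c39c = xul+0xa9b01, i.e. CreateInterfaceObjects+0x20.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Bytes and offsets quoted in the comment above. */
    const uint8_t patched[5] = { 0xe8, 0x9c, 0xc3, 0x02, 0x00 };
    const uint32_t insn_rva = 0x7d760;                    /* AddChild+0x12 */

    if (patched[0] == 0xe8) {                             /* E8 = call rel32 */
        /* Displacement is little-endian, relative to the next instruction. */
        int32_t rel32 = (int32_t)((uint32_t)patched[1] |
                                  ((uint32_t)patched[2] << 8) |
                                  ((uint32_t)patched[3] << 16) |
                                  ((uint32_t)patched[4] << 24));
        uint32_t target_rva = insn_rva + 5 + (uint32_t)rel32;
        printf("call target: xul+0x%x\n", target_rva);    /* xul+0xa9b01 = CreateInterfaceObjects+0x20 */
    }
    return 0;
}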
I think this is a CPU bug. I don't say that lightly, because generally hardware is the last thing you should blame, but that's where the evidence is pointing. https://bugzilla.mozilla.org/show_bug.cgi?id=830531#c72

100% of 71760 crashes in bug 865701 occurred on the two CPU models affected by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2). Those models have combined CPU+GPU on the same chip, which would explain why this appeared to correlate with ATI drivers.

http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf
Erratum 688 is the only major bug that applies to both Models 1 and 2, and it just might be the issue that we're hitting. Our case of AddChild in bug 865701 meets the requirement of "after a not-taken branch that ends on the last byte of an aligned quad-word" and the "internal timing conditions" might explain the variability that we've seen. There is a workaround listed, but it requires BIOS authors to modify undocumented bits in the processor's instruction cache settings.

Our netbook is Family 20 Model 1, and I confirmed that PCI configuration register D18F4x164[2] = 0, indicating that this rev of the silicon does not have the fix for 688. I also confirmed that MSRC001_1021[14] = 0 and MSRC001_1021[3] = 0, indicating that my BIOS has not applied AMD's workaround.

Unfortunately, installing KB2818604 from Windows Update didn't stop the crashes. I don't have a good explanation. Maybe that patch was for something else on the errata sheet. But after using a kernel debugger to mimic AMD's BIOS workaround (don't try this at home), I don't crash anymore. Or at least I haven't crashed yet -- the repro is unreliable to begin with, so I want to give it a few more attempts.
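For illustration only: verifying AMD's suggested workaround boils down to reading MSR C001_1021 (IC_CFG) and testing bits 14 and 3, which requires ring 0. A rough kernel-mode C sketch follows; the helper name is made up, __readmsr is the MSVC kernel-mode intrinsic, and the D18F4x164[2] check for already-fixed silicon needs PCI config-space access and is not shown.

#include <intrin.h>

/* Sketch only: nonzero if the BIOS has applied the erratum 688 workaround
 * bits described in the revision guide. Must run in kernel mode. */
static int Erratum688WorkaroundApplied(void)
{
    unsigned __int64 ic_cfg = __readmsr(0xC0011021);  /* MSRC001_1021 (IC_CFG) */
    return ((ic_cfg >> 14) & 1) && ((ic_cfg >> 3) & 1);
}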
Wow. Your analysis is very impressive.
I don't suppose we can read those configuration registers and get them into crash dumps?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #24)
> I don't suppose we can read those configuration registers and get them into
> crash dumps?
MSRs and PCI config need kernel privilege. We would have to write a driver to read them.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #23)
> Wow. Your analysis is very impressive.
Agreed.

An interesting followup question: is there a way we could examine a binary to determine whether it would trigger this bug? (If we could, then we could reject builds that would trigger it, perhaps even during the build process.)
Summary: layout crashes with AMD Radeon HD 6xxx series, spiking at various times → layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times
(In reply to David Baron [:dbaron] (needinfo? me; away Aug 28 - Sep 3) from comment #26)
> An interesting followup question: is there a way we could examine a binary
> to determine whether it would trigger this bug? (If we could, then we could
> reject builds that would trigger it, perhaps even during the build process.)
I imagine that the bug depends at least as much on the runtime call patterns and control flow as on the static contents of the binary.
> 100% of 71760 crashes in bug 865701 occurred on the two CPU models affected
> by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2).

How did you collect this data? I know you asked me about this; I wasn't able to run that query yesterday, but I did run a query today which shows different results:

For the date period 2013-04-25 through 2013-05-04 with the 21.0b4 builds, I selected all crashes with the following signatures associated with bug 865701:

'mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)',
'JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)',
'JS_GetCompartmentPrincipals(JSCompartment*)',
'nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)',
'nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, nsChangeHint, nsRestyleHint, mozilla::css::RestyleTracker&, nsFrameManager::DesiredA11yNotifications, nsTArray<nsIContent*>&, TreeMatchConte...',

The AuthenticAMD processors you mention are certainly the most common, but there are other Intel and AMD processor models which experience the same crash signatures. I'll attach the data by CPU and by signature/CPU. I'll also run this for the Firefox 19.0 crash (bug 830531) because IIRC the distribution was different.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #28)
> How did you collect this data? I know you asked me about this; I wasn't able
> to run that query yesterday, but I did run a query today which shows
> different results:
My search only included DocumentBinding::CreateInterfaceObjects at the top of the stack.

I've spot-checked a few dozen reports from the other signatures you listed. In getNewType and JS_GetCompartmentPrincipals, reports from AMD family 20 all went through AddChild or CreateInterfaceObjects, and other CPUs didn't. There might be other crashes getting mixed into those signatures.

For ReparentStyleContext and ReResolveStyleContext, the stacks are quite scattered on both Intel and AMD processors. There may be several root causes there.

Maybe CreateInterfaceObjects was just by luck a good filter, in that no other crashes managed to sneak in. I'd be curious to see whether we can say the same about the 19.0 crash.
(In reply to David Major [:dmajor] from comment #31)
> I'd be curious to see whether we can say the same about the 19.0 crash.
From April 25 to April 30 (I admit they're not good dates for 19.0, but that's what I had handy), I see 562 hits for TlsGetValue in 19.0. 560 of those are AMD family 20, and my spot-checks all showed XPC_WN_Helper_NewResolve on the stack. The remaining two reports from other processors had different stacks.
Impressive analysis. Looking forward to this issue being handled when possible.
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.
Did you do those additional attempts?

Maybe we can supply a kernel module that applies this change? Extreme perhaps, but what else can we do? The maintenance service runs with administrator privileges so I assume we can do this.
Attached file amd-cpus-19.grouped.csv by CPU only (deleted) —
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #34)
> Did you do those additional attempts?
Yes. I gave it several attempts on Friday, and I let the news site self-refresh over the weekend. It hasn't hit the crash so far.

> Maybe we can supply a kernel module that applies this change? Extreme
> perhaps, but what else can we do? The maintenance service runs with
> administrator privileges so I assume we can do this.
The trouble with my debugger hack is that half of the time it hangs the machine. I'm not surprised -- it's probably pretty dangerous to mess with cache settings when the system is already running. I'm guessing that's why the document says it should be done during BIOS.
Comment on attachment 798856 [details]
amd-cpus-19.grouped.csv by CPU only

>AuthenticAMD family 20 model 2 stepping 0 | 2,294856
>AuthenticAMD family 20 model 1 stepping 0 | 2,4791
>AuthenticAMD family 20 model 1 stepping 0 | 1,163
>GenuineIntel family 6 model 23 stepping 10 | 2,55
>GenuineIntel family 6 model 15 stepping 13 | 2,36
>AuthenticAMD family 20 model 2 stepping 0 | 1,26
>GenuineIntel family 6 model 28 stepping 2 | 2,24
>GenuineIntel family 6 model 42 stepping 7 | 4,22

Given the fast drop-off after the "AuthenticAMD family 20" CPUs, the others might be crashes that just happen to be in the same function/signature but are unrelated to this specific issue.

BTW, any idea what those numbers after the pipe actually are?
I believe those are the number of cores.
Here's the equivalent data for 21.0b4 by graphics vendor instead of by CPU:

0x1002 (AMD),108430
0x0000 (unknown/bad data),306
0x10de (nvidia),195
0x8086 (intel),175
0x1039 (SIS),5
0x5333 (S3),5
0x1106 (VIA),5
0x300b (?),1
Depends on: 921569
Depends on: 921609
Keywords: topcrash → topcrash-win
Depends on: 945439
Blocks: 1011075
Blocks: 1131831
36 rc1 has this defect. We built a second rc before going live.
Depends on: 1155836
38b2 & 38b5 were affected too.
Blocks: 1160317
38.0b9 was also impacted.
Adding a dependency on bug 1156135. We may need to detect this CPU/BIOS combination and alert the user at runtime.
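For illustration, a rough user-mode sketch of the CPU half of such a detection (this is not the actual patch for bug 1156135; the function name is invented, and checking whether the BIOS applied the MSR workaround would still need kernel privilege, as noted earlier):

#include <intrin.h>
#include <string.h>

/* Sketch only: true if running on AuthenticAMD Family 20 (0x14), Model 1 or 2. */
static int IsAffectedAmdFamily20Cpu(void)
{
    int regs[4];
    char vendor[13] = { 0 };
    unsigned eax, baseFamily, extFamily, baseModel, extModel, family, model;

    __cpuid(regs, 0);
    memcpy(vendor + 0, &regs[1], 4);   /* EBX */
    memcpy(vendor + 4, &regs[3], 4);   /* EDX */
    memcpy(vendor + 8, &regs[2], 4);   /* ECX */
    if (strcmp(vendor, "AuthenticAMD") != 0)
        return 0;

    __cpuid(regs, 1);
    eax        = (unsigned)regs[0];
    baseFamily = (eax >> 8) & 0xF;
    extFamily  = (eax >> 20) & 0xFF;
    baseModel  = (eax >> 4) & 0xF;
    extModel   = (eax >> 16) & 0xF;
    family     = (baseFamily == 0xF) ? baseFamily + extFamily : baseFamily;
    model      = (baseFamily == 0xF) ? ((extModel << 4) | baseModel) : baseModel;

    return family == 0x14 && (model == 1 || model == 2);
}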
Blocks: 945439
Depends on: 1156135
No longer depends on: 945439
Bug 1155836 attempted to fix one of the major places where this happens.
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #50)
> Bug 1155836 attempted to fix one of the major places where this happens.
And FWIW, I think we have not seen it since then. Doesn't mean we can declare victory but at least it looks like the frequency of those issues has decreased over what we saw in the 38.0 beta cycle.
I'm going to call this fixed by bug 1155836.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
David, you just rock! I am really impressed by your work!
(In reply to dmajor (away) from comment #22)
> http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf
The URL for this is now: http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The full text of Erratum 688 is:

688 Processor May Cause Unpredictable Program Behavior Under Highly Specific Branch Conditions

Description
Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the branch status when a taken branch occurs where the first or second instruction after the branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that it appears the processor skips, and does not execute, one or more instructions. The new updated rIP due to this erratum may not be at an instruction boundary.

Potential Effect on System
Unpredictable program behavior, possibly leading to a program error or system error. It is also possible that the processor may hang or recognize an exception (for example, a #GP or #UD exception), however AMD has not observed this effect.

Suggested Workaround
BIOS should set MSRC001_1021[14] = 1b and MSRC001_1021[3] = 1b. This workaround is required only when bit 2 of Fixed Errata Status Register (D18F4x164[2]) = 0b.

Fix Planned
Yes
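For illustration, the alignment condition in the erratum ("ends on the last byte of an aligned quad-word") is easy to express in code; a real static scan of a build, as asked about in comment 26, would additionally need a disassembler to identify the not-taken branch and the indirect call/jump that follows it. This helper is hypothetical, not an existing tool:

#include <stdint.h>

/* Sketch only: does an instruction starting at insn_rva with length insn_len
 * end on the last byte of an aligned quad-word, i.e. at an address whose low
 * three bits are all 1? */
static int EndsOnLastByteOfAlignedQword(uint32_t insn_rva, uint32_t insn_len)
{
    return ((insn_rva + insn_len - 1) & 7) == 7;
}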
Blocks: 1266626
So after debugging bug 1266626 which appears to be a form of this crash in build 3 for 46.0 (which we're not using; it's in crash stats as 46.0b99 from its use on the beta channel, though), I thought I'd look to see if we'd shipped other forms of this bug in release recently. I did this by doing crash-stats queries with: cpu_info=%5EAuthenticAMD+family+20+model+1&cpu_info=%5EAuthenticAMD+family+20+model+2 tacked on to see if anything interesting popped out. So far the only interesting thing that I've found is that it appears we shipped a form of this bug that crashes in nsFrame::DisplayBorderBackgroundOutline in 43.0.2 and 43.0.3 (and also, older, 37.0.2).
Depends on: 1269028
Blocks: 964351
No longer blocks: 964351
Blocks: 1262282
Blocks: 1270226
Blocks: 1269028
No longer depends on: 1269028
In 47.0b8 this showed up again as crashes in mozilla::FramePropertyTable::GetInternal.
In 47.0b3 we had crashes in ValueToNameOrSymbolId and js::ValueToId<T>.
Blocks: 1277450
nsCSSOffsetState::InitOffsets seems like another variant of this signature, based on the 7-6 Nightly.
Blocks: 1312270
Blocks: 1331253
Blocks: 1316022
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.
Which I guess makes sense, considering the recommended workaround is to set the required disabling bits in the IC_CFG MSR *only* after having checked the Fixed Errata Status Register in PCI configuration space -- something I'm not sure you can ask the CPU alone to do, and definitely not in just a few lines of assembly.

KB2818604 is just a DLL with microcode for all AMD CPUs, updated to Q1 2013 (when the latest 0x5000029 and 0x5000119 revisions were released for the Bobcat ON-B0 and ON-C0 steppings, respectively). According to this, they both only bring a fix for erratum 784:
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/commit/microcode_amd.bin.README?id=9b4f1804855407f5ba2ce58ef428dfba226f3652

Your kernel debugging trickery was also easily reproduced with msr-tools in this interesting thread: https://patchwork.kernel.org/patch/9390769/
And they didn't seem to have any kind of instability.
Depends on: 1335925

Mirh, your patchwork link is dead. Did the content just move or is it gone?

Flags: needinfo?(mirh)