Bug 772330 (Closed)
Opened 12 years ago · Closed 9 years ago
layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times
Categories: Core :: Layout, defect
Status: RESOLVED FIXED
People: Reporter: dbaron; Assignee: dmajor (away)
References: Depends on 1 open bug, Blocks 5 open bugs
Keywords: crash, topcrash-win
Attachments: 6 files
We've had lots of layout crashes associated with the AMD Radeon HD 6xxx series graphics drivers. I think they've all (or mostly) briefly spiked and then gone away. Odds are they have the same underlying cause.
This is a meta-bug to track the problem.
(One question of interest is whether they ever go away, or just keep moving signature constantly and stay around all the time.)
Comment 1 • 12 years ago
(In reply to David Baron [:dbaron] from comment #0)
> (One question of interest is whether they ever go away, or just keep moving
> signature constantly and stay around all the time.)
They go away, then come back later, sometimes two builds later, other times hundreds of builds later. They usually don't stay for more than one build.
Comment 2 (Reporter) • 12 years ago
How do you know that? Maybe most of the time they're spread between a large number of low-frequency signatures, and occasionally they concentrate on a single signature. Is there a way to verify that that's not happening? (Back when we generated CSV files, I could have checked this myself, but I don't see those anymore.)
Comment 3 (Reporter) • 12 years ago
So another theory is that this driver is doing some sort of binary patching or hooking that was designed for a particular version of Firefox, but the check it does to make sure it has the right version relies on a very small amount of variable data, giving it a significant false-positive rate. The result would be binary patching or hooking on certain Firefox versions with an appearance of randomness. If this is the case, it's only a matter of time before the pattern matches on a release build.
Comment 4 • 12 years ago
(In reply to David Baron [:dbaron] from comment #2)
> How do you know that? Maybe most of the time they're spread between a large
> number of low-frequency signatures
It impacts the crash ratio.
> and occasionally they concentrate on a single signature.
When this issue happens, there are about a half dozen crash signatures.
(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
It has already happened in Fx 11.0 (see bug 700288 comment 24).
Comment 5 • 12 years ago
Bug 768383 was an instance of this that showed up in FF14b9 and went away in FF14b10; I examined the minidump and it was an almost impossible crash (null deref after null-check with the intervening code being fairly well defined). I chalked it up to a weird PGO fluke, but it's also possible that the driver is overwriting a stack location or register. But if that were the case I'd expect to see the crashes spread out more. And there weren't any graphics calls nested in this stack frame, at least that I could see.
I'm a bit stumped by this one. It's the sort of thing that I'd love to catch in record and replay but we probably can't have those graphics drivers in a VM anyway.
Comment 6 (Reporter) • 12 years ago
(In reply to Scoobidiver from comment #4)
> (In reply to David Baron [:dbaron] from comment #2)
> > How do you know that? Maybe most of the time they're spread between a large
> > number of low-frequency signatures
> It impacts the crash ratio.
Ah, ok, so I don't need to gather data from https://crash-analysis.mozilla.com/crash_analysis/
Comment 7 • 12 years ago
(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
10.0.6 ESR is affected!
Comment 8 (Reporter) • 12 years ago
It might be useful to try to figure out what's similar about the builds that are affected that isn't a characteristic of the unaffected builds. Does anybody happen to have a list of the affected builds?
Comment 9 (Reporter) • 12 years ago
So if we have any contacts at AMD, it might be worth asking them what regression might have been introduced on their end between (probably, though we don't have 100% confidence in these ranges):
version 8.17.10.1047 and 8.17.10.1052 of aticfx32.dll
version 8.17.10.310 and 8.17.10.318 of atidxx32.dll
version 8.14.1.6150 and 8.14.1.6160 of atiuxpag.dll
(I got these ranges from the correlations for bug 839270; the third is consistent with bug 714320 comment 26 from over a year ago.)
Comment 10 (Reporter) • 12 years ago
roc did investigation of another minidump in bug 839270 comment 22.
Comment 11 (Reporter) • 12 years ago
(In reply to David Baron [:dbaron] from comment #9)
> So if we have any contacts at AMD, it might be worth asking them what
oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
Comment 12 • 12 years ago
Summary of bug 839270 comment #22: We seem to do an unexpected jump forward by a short distance when we reach a specific point in our code, jumping into the middle of an instruction in another function. This doesn't always happen (or the browser couldn't even start), but when it does happen it always happens at the same place in libxul for all the crash reports in that bug (even though those are different absolute addresses, since libxul is moved by ASLR). In bug 839270 the jump originates from a small leaf function which has clearly been compiled correctly and cannot be causing the jump itself.
Whatever's causing this must be very subtle and is almost certainly unrelated to the Gecko code implicated by the crash stacks.
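To make "jumping into the middle of an instruction" concrete, here's a small sketch using the capstone disassembler's Python bindings (an assumed tool, not something used in this bug), borrowing the five bytes quoted in comment 21 below: the same bytes decode as completely different instructions when execution lands one byte into them.

from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Bytes at AddChild+0x12 from comment 21: mov [eax],ecx ; ret ; (start of add eax, imm8)
code = bytes.fromhex("8908c383c0")
md = Cs(CS_ARCH_X86, CS_MODE_32)
for start in (0, 1):
    print(f"decoding from byte +{start}:")
    for insn in md.disasm(code[start:], 0x7d760 + start):
        print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")
    # from +0: mov dword ptr [eax], ecx / ret
    # from +1: or bl, al -- an unrelated instruction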
I have some contacts at AMD too. I'll try them.
Comment 13 • 12 years ago
I got minidumps for some of the other crash bugs.
Bug 700288 is similar to bug 839270 --- we're in a small leaf function (UnionRectEdges), and inexplicably jump to the middle of an instruction (in this case in the same function, though). However, the address within libxul is different from (and nowhere near) the address for the crash in bug 839270.
Bug 714320 affects AddChild, like bug 839270, but I'm not sure what's going on there. See https://bugzilla.mozilla.org/show_bug.cgi?id=714320#c79.
Bug 722024 is like bug 700288. It looks like we're crashing in UnionRectEdges with an inexplicable jump forward past the end of the function, into int3 padding in that case.
In summary, the code address where we go wrong seems to vary between libxul builds (but is at the same location in libxul for all regardless of ASLR). I bet the varying impact of these crashes depends on exactly which function (if any) gets cursed.
Comment 14 • 12 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> I have some contacts at AMD too. I'll try them.
Email sent.
Comment 15 • 12 years ago
One question that might be helpful to answer: do we ever see these crashes in more than one function for a given libxul build?
Comment 16 • 12 years ago
I *think* that we're seeing it in only one function per build, but one would probably need to look through all the dependent bugs and compare the builds where those happen.
Comment 17 • 12 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> I *think* that we're seeing it in only one function per build, but one would
> probably need to look through all the dependent bugs and compare the builds
> where those happen.
Actually, scratch that. We have at least three different signatures for bug 839270 in 19.0b5 alone.
Comment 18 • 12 years ago
(In reply to David Baron [:dbaron] from comment #11)
> (In reply to David Baron [:dbaron] from comment #9)
> > So if we have any contacts at AMD, it might be worth asking them what
>
> oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
The people I know are the same people Robert emailed. Unfortunately I don't think we've heard back yet.
Comment 19 • 12 years ago
Given that the crashes tracked here are highly visible when they explode and are a continuous subject of tracking by stability and release management, I'll invoke the "bugs that spearhead investigation or fixes across a large collection of crashes" clause of https://wiki.mozilla.org/CrashKill/Topcrash on this meta tracker bug and add the topcrash keyword here. We should not use it on individual signatures, though, as we know those are per-build fluctuations anyhow.
Keywords: topcrash
Comment 20 • 11 years ago
I have a system with a Radeon HD 6310 (the iGPU of an AMD E-350) which is used daily as an HTPC.
Bug 840161 blacklists window acceleration and D2D acceleration on this GPU due to this bug.
FWIW, I haven't had any crashes with layers.acceleration.force-enabled=true, either on Fx 22 (my main browser) or in the nightly builds I update regularly. I also tried gfx.direct2d.force-enabled=true without crashes, but I typically leave it off since it sometimes degrades performance.
If I can help test in any way, please ask. My gfx about:support info is available at bug 840161 comment 15.
Updated • 11 years ago
Assignee: nobody → dmajor
Comment 21 (Assignee) • 11 years ago
TL;DR - We have a lot of observations but are far from a solution. Here's the story so far.
On 21.0b4, the bug manifests as a crash usually near xul!mozilla::dom::DocumentBinding::CreateInterfaceObjects. The specific instruction offset and the nature of the crash (access violation, invalid instruction, privileged instruction, etc.) can vary.
I can not-very-reliably repro this on the netbook named "MOZILLA-RD6310" by opening up some youtube videos in one window, then opening another window with nbcnews.com and mousing around and reloading until it crashes. It can take anywhere from a minute to an hour or more.
After the crash, everything seems as if xul!nsStyleContext::AddChild+0x12 (xul+0x7d760) had been corrupted to contain an instruction reading "call CreateInterfaceObjects+0x20 (xul+0xa9b01)". There are several reasons for believing this. First, the top of the stack contains AddChild+0x17, as if a return address had been pushed during a call instruction (five bytes). Second, AddChild+0x12 is a valid instruction reachable in the original binary, but AddChild+0x17 is in the middle of an instruction and could never be a return address without corruption. Third, CreateInterfaceObjects+0x20 is also in the middle of an instruction, so it could not be a valid branch target in an unmodified binary. The affected locations are always offsets from xul.dll, so the absolute values change based on xul's base.
Here's where it gets suspicious: by the time we notice the crash, the memory at AddChild+0x12 appears to have its original values. So we can't definitively prove whether the bug is indeed the corruption described above, or some other badness that happens to have the same symptoms. It's possible that the driver is modifying the xul.dll memory (perhaps as a write-test) and quickly modifying it back to the original value. There are other possibilities like a hardware issue in the instruction fetch, but that seems less likely.
Assuming that the driver is modifying memory, it would have to touch five bytes, more than it could typically do with regular 32-bit operations:
89 08 c3 83 c0 are the bytes at xul+0x7d760 originally.
e8 9c c3 02 00 are the bytes that would cause our theorized call.
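As a consistency check on this theory (not part of the original investigation): an x86 near call, opcode E8, encodes a 32-bit displacement relative to the end of the 5-byte instruction, and the offsets above reproduce the observed bytes exactly. A minimal sketch:

import struct

src = 0x7d760           # AddChild+0x12, offset within xul.dll
dst = 0xa9b01           # CreateInterfaceObjects+0x20
disp = dst - (src + 5)  # rel32 is relative to the *next* instruction

print((b"\xe8" + struct.pack("<i", disp)).hex(" "))  # -> "e8 9c c3 02 00"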
Memory access breakpoints on the affected addresses don't trigger. Presumably that's because the driver accesses that physical memory via a different virtual-to-physical mapping (hardware breakpoints are based on virtual address). I tried dumping the driver's address mappings to see what other address it might be using, but there were so many mappings for that region that it's not practical to go chasing them all down.
Another complication is that the memory at CreateInterfaceObjects+0x20 changes each time you load Firefox. That memory just so happens to contain the absolute address of a global variable (sPrefCachesInited). The Windows loader patches up the address based on where xul.dll gets based each time. What this means is, if we execute CreateInterfaceObjects+0x20, occasionally it looks like an innocuous instruction, so we continue on to +0x21 and so on. Depending on the interpretation of that memory, we crash in different ways and at different offsets. Usually it's plus-twenty-something, but in a few cases I've seen it continue on for dozens of instructions and jmp far away into mozjs. Also, sometimes those instructions contain a "pop", so that AddChild+0x17 is no longer on our stack.
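For illustration of that loader behavior (the RVA below is made up): the operand dword holds xul's base plus the RVA of sPrefCachesInited, so ASLR changes its bytes on every launch.

rva = 0x123456  # hypothetical RVA of sPrefCachesInited, for illustration only
for base in (0x6b240000, 0x0f3a0000, 0x71c80000):  # example ASLR bases for xul.dll
    print(f"base {base:#010x} -> operand bytes {(base + rva).to_bytes(4, 'little').hex(' ')}")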
I've tried detouring AddChild in several places, adding instructions that verify AddChild+0x12 before executing them. If the verification were to fail then we'd have solid proof of memory corruption. Unfortunately, I haven't been able to hit the crash after doing this. Either my reading of those values interferes with the execution of the scenario, or I just haven't waited long enough on the unreliable repro, can't really say. [Note: This detouring is not a fix that we can apply to source code; I can only do it in the debugger with after-the-fact knowledge of what function fails on this build]
All of the above applies to 21.0b4 only. The crash is not machine-specific (same functions affected on our netbook and various user crash dumps) but it is build-specific, since function layout changes with each compilation. I need to do more digging in the other bugs to see whether the victim is always xul+0x7d760, or at least some predictable location. If so, maybe we could play some tricks with the linker to avoid putting anything critical there.
Comment 22 (Assignee) • 11 years ago
I think this is a CPU bug. I don't say that lightly, because generally hardware is the last thing you should blame, but that's where the evidence is pointing.
https://bugzilla.mozilla.org/show_bug.cgi?id=830531#c72
100% of 71760 crashes in bug 865701 occurred on the two CPU models affected by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2). Those models have the CPU and GPU combined on the same chip, which would explain why this appeared to correlate with ATI drivers.
http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf
Erratum 688 is the only major bug that applies to both Models 1 and 2, and it just might be the issue that we're hitting. Our case of AddChild in bug 865701 meets the requirement of "after a not-taken branch that ends on the last byte of an aligned quad-word" and the "internal timing conditions" might explain the variability that we've seen. There is a workaround listed, but it requires BIOS authors to modify undocumented bits in the processor's instruction cache settings.
Our netbook is Family 20 Model 1, and I confirmed that PCI configuration register D18F4x164[2] = 0, indicating that this rev of the silicon does not have the fix for 688. I also confirmed that MSRC001_1021[14] = 0 and MSRC001_1021[3] = 0, indicating that my BIOS has not applied AMD's workaround.
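For reference, here's a sketch of reading those same two registers on Linux rather than through a Windows kernel debugger. The route via the msr module and sysfs PCI config space is an assumption of this example, as is the 0000:00:18.4 address (the northbridge's function 4 on a single-socket Family 14h system); it needs root and `modprobe msr`.

import struct

def read_msr(cpu, reg):
    # /dev/cpu/N/msr is indexed by MSR number
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

def read_pci_dword(dev, offset):
    with open(f"/sys/bus/pci/devices/{dev}/config", "rb") as f:
        f.seek(offset)
        return struct.unpack("<I", f.read(4))[0]

ic_cfg = read_msr(0, 0xC0011021)                # MSRC001_1021
errata = read_pci_dword("0000:00:18.4", 0x164)  # D18F4x164

print("silicon fix for 688 present:", bool(errata & (1 << 2)))
print("BIOS workaround bits 14/3 set:", bool(ic_cfg & (1 << 14)), bool(ic_cfg & (1 << 3)))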
Unfortunately, installing KB2818604 from Windows Update didn't stop the crashes. I don't have a good explanation. Maybe that patch was for something else on the errata sheet. But after using a kernel debugger to mimic AMD's BIOS workaround (don't try this at home), I don't crash anymore. Or at least I haven't crashed yet -- the repro is unreliable to begin with, so I want to give it a few more attempts.
Comment 23 • 11 years ago
Wow. Your analysis is very impressive.
Comment 24 • 11 years ago
I don't suppose we can read those configuration registers and get them into crash dumps?
Comment 25 (Assignee) • 11 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #24)
> I don't suppose we can read those configuration registers and get them into
> crash dumps?
MSRs and PCI config need kernel privilege. We would have to write a driver to read them.
Comment 26 (Reporter) • 11 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #23)
> Wow. Your analysis is very impressive.
Agreed.
An interesting followup question: is there a way we could examine a binary to determine whether it would trigger this bug? (If we could, then we could reject builds that would trigger it, perhaps even during the build process.)
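One rough way to approach that statically (a heuristic sketch using the capstone disassembler, which this bug never actually used; as comment 27 notes below, runtime conditions matter too) is to scan a binary for the shape described by erratum 688, quoted in full in comment 55: a branch whose last byte ends an aligned quad-word, followed within two instructions by an indirect call or jump.

from capstone import Cs, CS_ARCH_X86, CS_MODE_32
from capstone.x86 import X86_GRP_JUMP, X86_GRP_CALL, X86_OP_MEM, X86_OP_REG

def scan_for_688_candidates(code: bytes, base: int):
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    md.detail = True
    insns = list(md.disasm(code, base))
    for i, insn in enumerate(insns):
        # branch whose last byte sits on the last byte of an aligned quad-word
        if not insn.group(X86_GRP_JUMP) or (insn.address + insn.size - 1) % 8 != 7:
            continue
        # first or second following instruction is an indirect call or jump
        for nxt in insns[i + 1 : i + 3]:
            if (nxt.group(X86_GRP_CALL) or nxt.group(X86_GRP_JUMP)) and \
               nxt.operands and nxt.operands[0].type in (X86_OP_REG, X86_OP_MEM):
                print(f"candidate: {insn.address:#x} {insn.mnemonic} {insn.op_str}")
                break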
Updated • 11 years ago
Summary: layout crashes with AMD Radeon HD 6xxx series, spiking at various times → layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times
Comment 27 (Assignee) • 11 years ago
(In reply to David Baron [:dbaron] (needinfo? me; away Aug 28 - Sep 3) from comment #26)
> An interesting followup question: is there a way we could examine a binary
> to determine whether it would trigger this bug? (If we could, then we could
> reject builds that would trigger it, perhaps even during the build process.)
I imagine that the bug depends at least as much on the runtime call patterns and control flow as on the static contents of the binary.
Comment 28 • 11 years ago
> 100% of 71760 crashes in bug 865701 occurred on the two CPU models affected
> by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2).
How did you collect this data? I know you asked me about this; I wasn't able to run that query yesterday, but I did run a query today which shows different results:
For the date period 2013-04-25 through 2013-05-04 with the 21.0b4 builds, I selected all crashes with the following signatures associated with bug 865701:
'mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)',
'JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)',
'JS_GetCompartmentPrincipals(JSCompartment*)',
'nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)',
'nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, nsChangeHint, nsRestyleHint, mozilla::css::RestyleTracker&, nsFrameManager::DesiredA11yNotifications, nsTArray<nsIContent*>&, TreeMatchConte...',
The AuthenticAMD processors you mention are certainly the most common, but there are other Intel and AMD processor models which experience the same crash signatures. I'll attach the data by CPU and by signature/CPU. I'll also run this for the Firefox 19.0 crash (bug 830531) because IIRC the distribution was different.
Comment 31 (Assignee) • 11 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #28)
> How did you collect this data? I know you asked me about this I wasn't able
> to run that query yesterday, but I did run a query today which shows
> different results:
My search only included DocumentBinding::CreateInterfaceObjects at the top of the stack. I've spot-checked a few dozen reports from the other signatures you listed. In getNewType and JS_GetCompartmentPrincipals, reports from AMD family 20 all went through AddChild or CreateInterfaceObjects, and reports from other CPUs didn't. There might be other crashes getting mixed into those signatures. For ReparentStyleContext and ReResolveStyleContext, the stacks are quite scattered on both Intel and AMD processors; there may be several root causes there. Maybe CreateInterfaceObjects was just by luck a good filter, in that no other crashes managed to sneak in.
I'd be curious to see whether we can say the same about the 19.0 crash.
Comment 32 (Assignee) • 11 years ago
(In reply to David Major [:dmajor] from comment #31)
> I'd be curious to see whether we can say the same about the 19.0 crash.
From April 25 to April 30 (I admit they're not good dates for 19.0, but that's what I had handy), I see 562 hits for TlsGetValue in 19.0. 560 of those are AMD family 20, and my spot-checks all showed XPC_WN_Helper_NewResolve on the stack. The remaining two reports from other processors had different stacks.
Comment 33 • 11 years ago
Impressive analysis. Looking forward to this issue being handled when possible.
Comment 34 • 11 years ago
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.
Did you do those additional attempts?
Maybe we can supply a kernel module that applies this change? Extreme perhaps, but what else can we do? The maintenance service runs with administrator privileges so I assume we can do this.
Comment 37 (Assignee) • 11 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #34)
> Did you do those additional attempts?
Yes. I gave it several attempts on Friday, and I let the news site self-refresh over the weekend. It hasn't hit the crash so far.
> Maybe we can supply a kernel module that applies this change? Extreme
> perhaps, but what else can we do? The maintenance service runs with
> administrator privileges so I assume we can do this.
The trouble with my debugger hack is that half of the time it hangs the machine. I'm not surprised -- it's probably pretty dangerous to mess with cache settings when the system is already running. I'm guessing that's why the document says it should be done during BIOS.
Comment 38 • 11 years ago
Comment on attachment 798856 [details]
amd-cpus-19.grouped.csv by CPU only
>AuthenticAMD family 20 model 2 stepping 0 | 2,294856
>AuthenticAMD family 20 model 1 stepping 0 | 2,4791
>AuthenticAMD family 20 model 1 stepping 0 | 1,163
>GenuineIntel family 6 model 23 stepping 10 | 2,55
>GenuineIntel family 6 model 15 stepping 13 | 2,36
>AuthenticAMD family 20 model 2 stepping 0 | 1,26
>GenuineIntel family 6 model 28 stepping 2 | 2,24
>GenuineIntel family 6 model 42 stepping 7 | 4,22
Given the fast drop-off after the "AuthenticAMD family 20" CPUs, the others might be crashes that just happen to be in the same function/signature but are unrelated to this specific issue.
BTW, any idea what those numbers after the pipe actually are?
Comment 39 • 11 years ago
I believe those are the number of cores.
Comment 41 • 11 years ago
Here's the equivalent data for 21.0b4 by graphics vendor instead of by CPU:
0x1002 (AMD),108430
0x0000 (unknown/bad data),306
0x10de (nvidia),195
0x8086 (intel),175
0x1039 (SIS),5
0x5333 (S3),5
0x1106 (VIA),5
0x300b (?),1
Updated • 11 years ago
Keywords: topcrash → topcrash-win
Comment 43 • 10 years ago
36.0 rc1 had this defect; we built a second rc before going live.
Comment 44 • 10 years ago
https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=7&range_unit=days&date=2015-03-03&signature=nsIFrame%3A%3AStylePosition%28%29&version=Firefox%3A38.0a2 is another instance of this on the 2015-03-01 Dev Edition build.
Comment 45 • 10 years ago
https://crash-stats.mozilla.com/report/list?signature=nsStyleContext%3A%3ADoGetStylePosition%28bool%29 seems to be a 38.0b2 instance of this crash.
Comment 46 • 10 years ago
38b2 & 38b5 were affected too.
Comment 47 • 10 years ago
In 38.0b8, we also have that crash with this signature: https://crash-stats.mozilla.com/report/list?signature=nsDisplayItem%3A%3AZIndex%28%29
Comment 48 • 10 years ago
38.0b9 was also impacted.
Comment 49 • 10 years ago
Adding a dependency on bug 1156135. We may need to detect this CPU/BIOS combination and alert the user at runtime.
Comment 50 (Reporter) • 9 years ago
Bug 1155836 attempted to fix one of the major places where this happens.
Comment 52 • 9 years ago
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #50)
> Bug 1155836 attempted to fix one of the major places where this happens.
And FWIW, I think we have not seen it since then. That doesn't mean we can declare victory, but at least the frequency of those issues looks lower than what we saw in the 38.0 beta cycle.
Comment 53 (Assignee) • 9 years ago
I'm going to call this fixed by bug 1155836.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 54 • 9 years ago
David, you just rock! I am really impressed by your work!
Comment 55 (Reporter) • 9 years ago
(In reply to dmajor (away) from comment #22)
> http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf
The URL for this is now:
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf
The full text of Erratum 688 is:
688 Processor May Cause Unpredictable Program Behavior Under Highly Specific Branch Conditions

Description
Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the branch status when a taken branch occurs where the first or second instruction after the branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that it appears the processor skips, and does not execute, one or more instructions. The new updated rIP due to this erratum may not be at an instruction boundary.

Potential Effect on System
Unpredictable program behavior, possibly leading to a program error or system error. It is also possible that the processor may hang or recognize an exception (for example, a #GP or #UD exception); however, AMD has not observed this effect.

Suggested Workaround
BIOS should set MSRC001_1021[14] = 1b and MSRC001_1021[3] = 1b. This workaround is required only when bit 2 of the Fixed Errata Status Register (D18F4x164[2]) = 0b.

Fix Planned
Yes
Comment 56 (Reporter) • 9 years ago
So after debugging bug 1266626, which appears to be a form of this crash in build 3 of 46.0 (which we're not shipping; it shows up in crash-stats as 46.0b99 from its use on the beta channel), I thought I'd look to see whether we'd shipped other forms of this bug in release recently.
I did this by doing crash-stats queries with:
cpu_info=%5EAuthenticAMD+family+20+model+1&cpu_info=%5EAuthenticAMD+family+20+model+2
tacked on to see if anything interesting popped out.
So far the only interesting thing I've found is that it appears we shipped a form of this bug that crashes in nsFrame::DisplayBorderBackgroundOutline in 43.0.2 and 43.0.3 (and also, further back, in 37.0.2).
Comment 57 (Reporter) • 8 years ago
In 47.0b8 this showed up again as crashes in mozilla::FramePropertyTable::GetInternal.
Comment 58 (Reporter) • 8 years ago
In 47.0b3 we had crashes in ValueToNameOrSymbolId and js::ValueToId<T>.
Comment 59 • 8 years ago
nsCSSOffsetState::InitOffsets seems like another variant of this signature, based on the 7-6 Nightly.
Comment 60 • 8 years ago
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.
Which I guess makes sense, considering the recommended workaround is to set the required disable bits in the IC_CFG MSR *only* after checking the Fixed Errata Status Register in PCI configuration space.
That's not something I'm sure you can ask the CPU alone to do, and definitely not in just a few lines of assembly.
KB2818604 is just a DLL with microcode for all AMD CPUs updated to Q1 2013 (when the latest 0x5000029 and 0x5000119 revisions were released for the Bobcat ON-B0 and ON-C0 steppings, respectively).
According to this, they both only bring a fix for erratum 784:
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/commit/microcode_amd.bin.README?id=9b4f1804855407f5ba2ce58ef428dfba226f3652
Your kernel-debugging trickery was also easily reproduced with msr-tools in this interesting thread: https://patchwork.kernel.org/patch/9390769/
And they didn't seem to have any kind of instability.
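For the record, the msr-tools-style version of the workaround amounts to something like the following sketch, with the same assumptions as the read-side sketch in comment 22 (Linux, root, `modprobe msr`, northbridge at 0000:00:18.4). The erratum guide intends the BIOS to do this, so treat it accordingly.

import os, struct

IC_CFG = 0xC0011021

def rdmsr(cpu, reg):
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

def wrmsr(cpu, reg, val):
    with open(f"/dev/cpu/{cpu}/msr", "wb") as f:
        f.seek(reg)
        f.write(struct.pack("<Q", val))

with open("/sys/bus/pci/devices/0000:00:18.4/config", "rb") as f:
    f.seek(0x164)
    fixed_in_silicon = struct.unpack("<I", f.read(4))[0] & (1 << 2)

if not fixed_in_silicon:
    for cpu in range(os.cpu_count()):  # IC_CFG is a per-core MSR
        wrmsr(cpu, IC_CFG, rdmsr(cpu, IC_CFG) | (1 << 14) | (1 << 3))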
Comment 61 • 4 years ago
Mirh, your patchwork link is dead. Did the content just move or is it gone?
Flags: needinfo?(mirh)
Comment 62 • 4 years ago
It's this: https://lore.kernel.org/lkml/b4ad7273efbb0c60a6c93ae68f82a44e@openmailbox.org/t/
And the other broken link is now https://salsa.debian.org/hmh/amd64-microcode/-/commit/9b4f1804855407f5ba2ce58ef428dfba226f3652#a0f3027e734957eafcd0affdd79602d4be0bd68d
But as I eventually pointed out in bug 1281759, comment 39, microcode has nothing to do with the problem.
Flags: needinfo?(mirh)