Closed
Bug 784244
Opened 12 years ago
Closed 12 years ago
Intermittent "snow" artifacts in content
Categories
(Core :: Graphics: Layers, defect)
Tracking
()
People
(Reporter: cjones, Assigned: cjones)
References
Details
(Whiteboard: [WebAPI:P0][LOE:S])
Attachments
(2 files, 3 obsolete files)
(deleted),
image/png
|
Details | |
(deleted),
patch
|
roc
:
review+
|
Details | Diff | Splinter Review |
This happens intermittently on different content. I suspect it's related to intermittent errors I see about locking gralloc buffers.
Assignee | ||
Comment 1•12 years ago
|
||
While working on bug 778261, I happened to notice that we still have genlock disabled even though our kernels should support it now. That's likely the cause of these artifacts. Testing now.
Assignee | ||
Comment 2•12 years ago
|
||
I still see these snow artifacts after reverting the genlock disable patch. (And maybe a new type of color artifact.) I'm not sure I backed out everything I need to, but will resume tomorrow.
Comment 3•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #2)
> I still see these snow artifacts after reverting the genlock disable patch.
> (And maybe a new type of color artifact.) I'm not sure I backed out
> everything I need to, but will resume tomorrow.
Shouldn't matter - our copy of genlock comes from the vendor.
Assignee | ||
Comment 4•12 years ago
|
||
The patch (which I haven't read) is to the qcom gralloc HAL code, so I assume it disables attempting to talk to the driver.
Comment 5•12 years ago
|
||
This is reproducible on the lock screen. Dragging the handle and release it would always produce a ghost image in the opposite direction.
blocking-basecamp: --- → ?
Updated•12 years ago
|
Whiteboard: [WebAPI:P0]
Updated•12 years ago
|
Assignee: nobody → jones.chris.g
Assignee | ||
Updated•12 years ago
|
Whiteboard: [WebAPI:P0] → [WebAPI:P0][LOE:S]
Assignee | ||
Comment 7•12 years ago
|
||
This seems to have gotten worse lately. Not sure why. May need to bump priority.
This is especially obvious in the map view of the maps app. It also seems especially bad in the "Scanning ..." UI in the gallery.
Assignee | ||
Comment 8•12 years ago
|
||
Hm, bug 780330 (which was an innocuous and correct optimization) may have magnified the underlying problem here.
Assignee | ||
Comment 9•12 years ago
|
||
Tried a new Nexus S build, and I do see snow there. The snow isn't correlated with any error messages in logcat.
So the finger is pointed at gecko, but based on the kind of artifacts we see, it sure looks like a cache coherency problem somewhere ...
Assignee | ||
Comment 10•12 years ago
|
||
Adding code to gecko that does an (apparently non-standard; following example in gralloc.so) cacheflush() on gralloc buffers before unlocking them doesn't affect the snow.
AutoOpenSurface::~AutoOpenSurface()
{
if (mSurface) {
gfxImageSurface* img = static_cast<gfxImageSurface*>(mSurface.get());
cacheflush(intptr_t(img->Data()),
intptr_t(img->Data()) + img->GetDataSize(),
0/*?*/);
So the problem doesn't seem to be CPU-cache<-->RAM coherency, unless I'm doing something wrong here. That's the most likely hypothesis so this result is a bit worrisome.
I see there's a cache-flushing pmem IOCTL too, which is going to be quite a pain to test. But I assume that would eliminate the hypothesis of RAM<-->GPU-cache coherency problems.
Comment 11•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #7)
> This is especially obvious in the map view of the maps app. It also seems
> especially bad in the "Scanning ..." UI in the gallery.
And the calculator app
Comment 12•12 years ago
|
||
IIRC, we should make sure we know exactly which caches a given function is flushing. On the ARM, L2 cache is normally virtually tagged and L1 is normally physically tagged. So often times, the cache flushing code only does L2 flushes. This is sufficient to ensure coherency between processes, but not sufficient to ensure coherency between HW.
Comment 13•12 years ago
|
||
Alright - shot me. I got L1 and L2 exactly reversed on my previous comment (I'm going to blame it on trying to write something technical before my first coffee)
Assignee | ||
Comment 14•12 years ago
|
||
Good point, need to check that --- cacheflush() is a redheaded step child in the kernel API, and bionic makes it worse.
Comment 15•12 years ago
|
||
If I did not get completely lost in the kernel sources, the handling of the cacheflush syscall starts at [1] and the respective assembly code for flushing the caches on ARMv7 is located at [2]. I haven't yet decoded what the assembly does.
[1] http://lxr.free-electrons.com/source/arch/arm/kernel/traps.c?v=3.0#L501
[2] http://lxr.free-electrons.com/source/arch/arm/mm/cache-v7.S#L166
Comment 16•12 years ago
|
||
So yeah, those functions only flush the L1 cache.
The outer_xxx functions (http://lxr.linux.no/#linux+v3.0.8/arch/arm/include/asm/outercache.h) are the ones which deal with the L2 cache. They're setup here:
http://lxr.linux.no/#linux+v3.0.8/arch/arm/mm/cache-l2x0.c#L341
I see some references over here:
http://lxr.linux.no/#linux+v3.0.8/drivers/staging/msm/memory.c#L96
but I'm not sure if any of that code is in our kernel.
Assignee | ||
Comment 17•12 years ago
|
||
OK, good to know. I'll try an experiment with a manual cache flusher. (That'll affect timing, but at least would be evidence.)
Comment 18•12 years ago
|
||
Marked as P1 given it's a blocker.
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P1][LOE:S]
Comment 19•12 years ago
|
||
Chris,
This was WebAPI:P0 because it's not a WebAPI bug. The WebAPI team prioritized only its own bugs, because we didn't want to impose on anyone else's priorities.
(FWIW, most to all of the bugs that we prioritized are blockers, so the fact that a WebAPI bug is a blocker doesn't make it an automatic WebAPI:P1.)
Like I'd said in another bug, if you want to have a separate, parallel set of priorities, you're welcome to create your own whiteboard tag. Or alternatively, I doubt anyone would mind if you used Bugzilla's built-in priorities for your own prioritization scheme.
Whiteboard: [WebAPI:P1][LOE:S] → [WebAPI:P0][LOE:S]
Assignee | ||
Comment 20•12 years ago
|
||
The "unit" of the snow is a 16-pixel strip. That means they represent either 32 or 64 bytes being read by some processor.
The arm v7a cache line size is 64 bytes, so this is fairly suspicious.
Assignee | ||
Comment 21•12 years ago
|
||
It's also possible that the GPU is finishing a composition with a buffer that's (supposed to be) read-locked, the renderer write-locks it for repaint and starts scribbling on it, but somehow the synchronization isn't working as it's supposed to.
Assignee | ||
Comment 22•12 years ago
|
||
If you slowly pan left/right in the ocean in gmaps, there are also patterns that seem correlated to the gmaps tile size. Might be a clue.
Assignee | ||
Comment 23•12 years ago
|
||
I'm about 90% sure this is a L2 coherency issue now.
The pipeline here basically looks like
content compositor
--------- ------------
begin txn
draw
(A)
forward txn
recv txn
schedule composite
...
begin txn composite
(B)
draw
[etc.] [etc.]
I tried 4 experiments
1. put a poor man's cacheflush (scribble 4MB RAM) at (A)
-> reliably did away with snow
2. poor man's cacheflush at (B)
-> didn't fix the snow; ameliorated but didn't remove in gmaps
3. usleep(50000) at (A)
-> not much change
4. usleep(50000) at (B)
-> similar result as (2); less amelioration
If synchronization were the problem, I would have expected (2) or (4) to mostly make this go away. Didn't happen.
Looks like the pixels aren't making it to RAM before the GPU starts reading them.
Comment 24•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #20)
> Created attachment 661085 [details]
> Screen capture
>
> The "unit" of the snow is a 16-pixel strip. That means they represent
> either 32 or 64 bytes being read by some processor.
>
> The arm v7a cache line size is 64 bytes, so this is fairly suspicious.
So, it actually seems to depend on the processor. The Cortex-A9 has 32-byte cache lines. It looks like the Cortex-A8 has 64-byte cache lines, and I seem to recall that the otoro is an A8, so that would make your statement correct for L1 cache.
The PL310 only supports 32-byte cache lines.
http://lxr.linux.no/#linux+v3.5.3/arch/arm/mm/cache-l2x0.c#L29
(and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246h/Beifajbj.html 0b00 is the only value listed for cache line length)
Assignee | ||
Comment 25•12 years ago
|
||
Looking through the vendor source, we should be ending up here [1]
ioctl(fd, PMEM_CLEAN_INV_CACHES, &pmem_addr)
after a lock(READ|WRITE). Now to see what that's supposed to do ...
[1] https://www.codeaurora.org/gitweb/quic/la/?p=platform/hardware/qcom/display.git;a=blob;f=libgralloc/pmemalloc.cpp;h=01628ecfa2d5ce062a284b07d4a3c2adfb98873c;hb=refs/heads/ics_chocolate#l100
Assignee | ||
Comment 26•12 years ago
|
||
So that bounces into either arch/arm/mach-msm/memory.c or drivers/staging/memory.c. I assume the former. As far as I can tell, it's doing the right thing. (I double checked and our kernel is configured with CONFIG_OUTER_CACHE=y.)
This one's getting scary. Will check the assumption of comment 25.
Assignee | ||
Comment 27•12 years ago
|
||
I instrumented gralloc.so and there are no errors from genlocking/genunlocking or the pmem flush. So userspace (on the renderer side) is doing all the right things.
Afraid we might need to ask for help on this one.
Assignee | ||
Comment 28•12 years ago
|
||
I doubt it matters much, but I noticed while testing that the artifacts in settings are 1/2 the length of those in gmaps. So that means gmaps is using an r5g6b5 surface and the strip is 32 bytes.
Comment 29•12 years ago
|
||
So then the question becomes which virtual address is being used? A virtual address is only valid within exactly one process.
Is the virtual address being allocated, used, cache-flushed all from the same process?
Comment 30•12 years ago
|
||
Our /proc/config.gz says CONFIG_ARCH_MSM_CORTEX_A5
ARM says that the Cortex A5 has 32-byte cache lines:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0434b/BABCFDAH.html
Comment 31•12 years ago
|
||
It also bothers me that dmesg shows:
<6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled
<6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled
<6>[01-01 00:00:00.000] [0: swapper]l2x0: 8 ways, CACHE_ID 0x410054c7, AUX_CTRL 0x00540000, Cache size: 262144 B
So why does L310 cache controller enabled show up twice? and the next line only shows up once?
Assignee | ||
Comment 32•12 years ago
|
||
Are folks seeing this in GB builds too?
Assignee | ||
Updated•12 years ago
|
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]
Comment 33•12 years ago
|
||
Is this really LOE:S?
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:?][caf:help]
Comment 34•12 years ago
|
||
chris: I sent my GB otoro back to MV, so I can't check for you.
Assignee | ||
Comment 35•12 years ago
|
||
Yes.
Whiteboard: [WebAPI:P0][LOE:?][caf:help] → [WebAPI:P0][LOE:S][caf:help]
Assignee | ||
Comment 36•12 years ago
|
||
There's one more hypothesis we should eliminate on the gecko side. I assumed it wasn't possible that we were failing to unlock the buffers, because we previously got errors on attempted reflexive write locks. But I realized there's a crazy way we might be skipping unlock ...
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]
Assignee | ||
Comment 37•12 years ago
|
||
I directly tried patching the crazy thing, and then wrote a small patch and python script to balance lock/unlock. No effect from former, and all good on the latter. Gecko is indeed doing what it's supposed to be doing.
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]
Comment 38•12 years ago
|
||
After constructing a trivial app that demonstrates the issue I found that while I see "snow" when it runs OOP, there is no snow when blacklisted. A hint? Next will look into the call flow differences between these two cases now from the POV of the bits beneath gecko
Comment 39•12 years ago
|
||
Although globally disabling OOP in settings is not a panacea so maybe c38 will get us nowhere...
Assignee | ||
Comment 40•12 years ago
|
||
I've been seeing snow in in-process content (like settings before this week), load indicator browser, popup windows ...
Assignee | ||
Comment 41•12 years ago
|
||
Let me try again, with English enabled.
It's quite easy to reproduce snow in in-process content. For example,
- hold down power button
- tap-down on "Power Off" in popup menu, hold finger down
- slide finger off "Power Off"
Comment 42•12 years ago
|
||
Not that we didn't know this already but it's safe to say that this is 100% a ARM-->RAM caching issue, as snow is completely gone if |pmem[id].cached = 0| is forced in the kernel pmem driver.
Ref: https://www.codeaurora.org/gitweb/quic/la/?p=kernel/msm.git;a=blob;f=drivers/misc/pmem.c;h=2f48d716d6a77286168793c378a5e635a76284a8;hb=ics_strawberry#l2644
Assignee | ||
Comment 43•12 years ago
|
||
Ball's back in our court --- it's indeed a cache coherency issue, but it's caused by us writing to the buffer after unlocking for write.
Which is ... disturbing.
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]
Assignee | ||
Comment 44•12 years ago
|
||
Would love to see this upstreamed in some form.
Attachment #661102 -
Attachment is obsolete: true
Assignee | ||
Comment 45•12 years ago
|
||
HAH! You guys are going to love this one.
Comment 46•12 years ago
|
||
So are you just going to leave us hanging...
Assignee | ||
Comment 47•12 years ago
|
||
I'm a little embarrassed because I thought it was something cute, but my first guess was wrong. It's actually just a run-of-the-mill subtle logic bug.
Assignee | ||
Comment 48•12 years ago
|
||
I think that debugging patch has a bug; it was still kicking in after this fix.
Anyway, it's a hot day in hell, snow. Not my finest hour but a lot of good came out of this.
Attachment #663544 -
Attachment is obsolete: true
Assignee | ||
Comment 49•12 years ago
|
||
Comment on attachment 663586 [details] [diff] [review]
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces
There's a bug here that I fixed locally.
The goal now is figure out why bug 792663 bounced on inbound.
Attachment #663586 -
Attachment is obsolete: true
Assignee | ||
Comment 50•12 years ago
|
||
To resummarize the problem here, we always must use the buffer tracker to unmap buffers used by BasicThebesLayer, because the buffers can be mapped from several different locations. What was happening here is that we were using an AutoOpenSurface to do the front->back sync that *wasn't* tracked by buffer tracker. The sequence of events in goes something like this
- transaction 1
* thebes layer T has no buffers
* enter BasicShadowableThebesLayer::Paint
. put BufferTracker on the stack
. ThebesLayerBuffer::BeginPaint() causes buffer B1 to be created; BufferTracker B1->lock()s it
. paint to B1
. exit Paint(); BufferTracker goes out of scope, B1->unlock()
* send B1 to the compositor, get it back as the read-only front buffer
- transaction 2
* thebes layer T has no back buffer, and readOnlyFront = B1
* enter BasicShadowableThebesLayer::Paint
. put BufferTracker on the stack
. enter SyncFrontBufferToBackBuffer
o no back buffer, so allocate back buffer = B2
o put AutoOpenSurface(B2) on the stack, B2->lock()
o *** SET ThebesLayerBuffer.mBuffer = B2 IN PREP FOR COPY ***
o copy front->back
o exit Sync(); AutoOpenSurface(B2) goes out of scope, *** B2->unlock() ***
. ThebesLayerBuffer::BeginPaint() sees it already has mBuffer; doesn't ask mBufferProvider to map it
. paint to B2 *** outside the lock ***
. exit Paint(); BufferTracker goes out of scope; doesn't do anything because it didn't see B2 being created
Whups.
Assignee | ||
Comment 51•12 years ago
|
||
Comment on attachment 663613 [details] [diff] [review]
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces
Nick, you probably understand this code as well as anyone now :). I want to get this landed asap, so will try to find someone to steal while you enjoy your weekend ;).
Attachment #663613 -
Flags: review?(ncameron)
Assignee | ||
Comment 52•12 years ago
|
||
I don't like to nag about review requests, but this bug is probably the most-complained-about issue with b2g at the moment, so everyone really wants to get it off our plates.
Assignee | ||
Updated•12 years ago
|
Attachment #663613 -
Flags: review?(ncameron) → review?(roc)
Attachment #663613 -
Flags: review?(roc) → review+
Assignee | ||
Comment 53•12 years ago
|
||
Comment 54•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Comment 55•12 years ago
|
||
Verified on 9/30 Otoro build.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Comment 56•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #52)
> I don't like to nag about review requests, but this bug is probably the
> most-complained-about issue with b2g at the moment, so everyone really wants
> to get it off our plates.
Sorry Chris, I was on vacation and away from the internets last week.
Thanks for the review roc!
Assignee | ||
Comment 57•12 years ago
|
||
(In reply to Nick Cameron [:nrc] from comment #56)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #52)
> > I don't like to nag about review requests, but this bug is probably the
> > most-complained-about issue with b2g at the moment, so everyone really wants
> > to get it off our plates.
>
> Sorry Chris, I was on vacation and away from the internets last week.
No worries at all! roc told me you were on PTO right after I wrote this comment.
You need to log in
before you can comment on or make changes to this bug.
Description
•