Closed Bug 784244 Opened 12 years ago Closed 12 years ago

Intermittent "snow" artifacts in content

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla18

Project Flags:

blocking-basecamp

People

(Reporter: cjones, Assigned: cjones)

References

Details

(Whiteboard: [WebAPI:P0][LOE:S])

Attachments

(2 files, 3 obsolete files)

Screen capture 12 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), image/png		Details
Some experiments 12 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
mprotect() the mappings in libgralloc when DEBUG_LOCKING 12 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces 12 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces 12 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	roc : review+	Details \| Diff \| Splinter Review

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Description

•

12 years ago

This happens intermittently on different content. I suspect it's related to intermittent errors I see about locking gralloc buffers.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 1

•

12 years ago

While working on bug 778261, I happened to notice that we still have genlock disabled even though our kernels should support it now. That's likely the cause of these artifacts. Testing now.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 2

•

12 years ago

I still see these snow artifacts after reverting the genlock disable patch. (And maybe a new type of color artifact.) I'm not sure I backed out everything I need to, but will resume tomorrow.

Michael Wu [:mwu]

Comment 3

•

12 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #2) > I still see these snow artifacts after reverting the genlock disable patch. > (And maybe a new type of color artifact.) I'm not sure I backed out > everything I need to, but will resume tomorrow. Shouldn't matter - our copy of genlock comes from the vendor.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 4

•

12 years ago

The patch (which I haven't read) is to the qcom gralloc HAL code, so I assume it disables attempting to talk to the driver.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 5

•

12 years ago

This is reproducible on the lock screen. Dragging the handle and release it would always produce a ghost image in the opposite direction.

blocking-basecamp: --- → ?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 6

•

12 years ago

I think that's a separate bug.

blocking-basecamp: ? → +

Andrew Overholt [:overholt]

Updated

•

12 years ago

Whiteboard: [WebAPI:P0]

Andrew Overholt [:overholt]

Updated

•

12 years ago

Assignee: nobody → jones.chris.g

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

12 years ago

Whiteboard: [WebAPI:P0] → [WebAPI:P0][LOE:S]

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 7

•

12 years ago

This seems to have gotten worse lately. Not sure why. May need to bump priority. This is especially obvious in the map view of the maps app. It also seems especially bad in the "Scanning ..." UI in the gallery.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 8

•

12 years ago

Hm, bug 780330 (which was an innocuous and correct optimization) may have magnified the underlying problem here.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 9

•

12 years ago

Tried a new Nexus S build, and I do see snow there. The snow isn't correlated with any error messages in logcat. So the finger is pointed at gecko, but based on the kind of artifacts we see, it sure looks like a cache coherency problem somewhere ...

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 10

•

12 years ago

Adding code to gecko that does an (apparently non-standard; following example in gralloc.so) cacheflush() on gralloc buffers before unlocking them doesn't affect the snow. AutoOpenSurface::~AutoOpenSurface() { if (mSurface) { gfxImageSurface* img = static_cast<gfxImageSurface*>(mSurface.get()); cacheflush(intptr_t(img->Data()), intptr_t(img->Data()) + img->GetDataSize(), 0/*?*/); So the problem doesn't seem to be CPU-cache<-->RAM coherency, unless I'm doing something wrong here. That's the most likely hypothesis so this result is a bit worrisome. I see there's a cache-flushing pmem IOCTL too, which is going to be quite a pain to test. But I assume that would eliminate the hypothesis of RAM<-->GPU-cache coherency problems.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 11

•

12 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #7) > This is especially obvious in the map view of the maps app. It also seems > especially bad in the "Scanning ..." UI in the gallery. And the calculator app

Dave Hylands [:dhylands]

Comment 12

•

12 years ago

IIRC, we should make sure we know exactly which caches a given function is flushing. On the ARM, L2 cache is normally virtually tagged and L1 is normally physically tagged. So often times, the cache flushing code only does L2 flushes. This is sufficient to ensure coherency between processes, but not sufficient to ensure coherency between HW.

Dave Hylands [:dhylands]

Comment 13

•

12 years ago

Alright - shot me. I got L1 and L2 exactly reversed on my previous comment (I'm going to blame it on trying to write something technical before my first coffee)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 14

•

12 years ago

Good point, need to check that --- cacheflush() is a redheaded step child in the kernel API, and bionic makes it worse.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 15

•

12 years ago

If I did not get completely lost in the kernel sources, the handling of the cacheflush syscall starts at [1] and the respective assembly code for flushing the caches on ARMv7 is located at [2]. I haven't yet decoded what the assembly does. [1] http://lxr.free-electrons.com/source/arch/arm/kernel/traps.c?v=3.0#L501 [2] http://lxr.free-electrons.com/source/arch/arm/mm/cache-v7.S#L166

Dave Hylands [:dhylands]

Comment 16

•

12 years ago

So yeah, those functions only flush the L1 cache. The outer_xxx functions (http://lxr.linux.no/#linux+v3.0.8/arch/arm/include/asm/outercache.h) are the ones which deal with the L2 cache. They're setup here: http://lxr.linux.no/#linux+v3.0.8/arch/arm/mm/cache-l2x0.c#L341 I see some references over here: http://lxr.linux.no/#linux+v3.0.8/drivers/staging/msm/memory.c#L96 but I'm not sure if any of that code is in our kernel.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 17

•

12 years ago

OK, good to know. I'll try an experiment with a manual cache flusher. (That'll affect timing, but at least would be evidence.)

Chris Lee [:clee]

Comment 18

•

12 years ago

Marked as P1 given it's a blocker.

Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P1][LOE:S]

Justin Lebar (not reading bugmail)

Comment 19

•

12 years ago

Chris, This was WebAPI:P0 because it's not a WebAPI bug. The WebAPI team prioritized only its own bugs, because we didn't want to impose on anyone else's priorities. (FWIW, most to all of the bugs that we prioritized are blockers, so the fact that a WebAPI bug is a blocker doesn't make it an automatic WebAPI:P1.) Like I'd said in another bug, if you want to have a separate, parallel set of priorities, you're welcome to create your own whiteboard tag. Or alternatively, I doubt anyone would mind if you used Bugzilla's built-in priorities for your own prioritization scheme.

Whiteboard: [WebAPI:P1][LOE:S] → [WebAPI:P0][LOE:S]

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 20

•

12 years ago

Attached image Screen capture (deleted) — Details

The "unit" of the snow is a 16-pixel strip. That means they represent either 32 or 64 bytes being read by some processor. The arm v7a cache line size is 64 bytes, so this is fairly suspicious.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 21

•

12 years ago

It's also possible that the GPU is finishing a composition with a buffer that's (supposed to be) read-locked, the renderer write-locks it for repaint and starts scribbling on it, but somehow the synchronization isn't working as it's supposed to.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 22

•

12 years ago

If you slowly pan left/right in the ocean in gmaps, there are also patterns that seem correlated to the gmaps tile size. Might be a clue.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 23

•

12 years ago

Attached patch Some experiments (obsolete) (deleted) — Details — Splinter Review

I'm about 90% sure this is a L2 coherency issue now. The pipeline here basically looks like content compositor --------- ------------ begin txn draw (A) forward txn recv txn schedule composite ... begin txn composite (B) draw [etc.] [etc.] I tried 4 experiments 1. put a poor man's cacheflush (scribble 4MB RAM) at (A) -> reliably did away with snow 2. poor man's cacheflush at (B) -> didn't fix the snow; ameliorated but didn't remove in gmaps 3. usleep(50000) at (A) -> not much change 4. usleep(50000) at (B) -> similar result as (2); less amelioration If synchronization were the problem, I would have expected (2) or (4) to mostly make this go away. Didn't happen. Looks like the pixels aren't making it to RAM before the GPU starts reading them.

Dave Hylands [:dhylands]

Comment 24

•

12 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #20) > Created attachment 661085 [details] > Screen capture > > The "unit" of the snow is a 16-pixel strip. That means they represent > either 32 or 64 bytes being read by some processor. > > The arm v7a cache line size is 64 bytes, so this is fairly suspicious. So, it actually seems to depend on the processor. The Cortex-A9 has 32-byte cache lines. It looks like the Cortex-A8 has 64-byte cache lines, and I seem to recall that the otoro is an A8, so that would make your statement correct for L1 cache. The PL310 only supports 32-byte cache lines. http://lxr.linux.no/#linux+v3.5.3/arch/arm/mm/cache-l2x0.c#L29 (and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246h/Beifajbj.html 0b00 is the only value listed for cache line length)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 25

•

12 years ago

Looking through the vendor source, we should be ending up here [1] ioctl(fd, PMEM_CLEAN_INV_CACHES, &pmem_addr) after a lock(READ|WRITE). Now to see what that's supposed to do ... [1] https://www.codeaurora.org/gitweb/quic/la/?p=platform/hardware/qcom/display.git;a=blob;f=libgralloc/pmemalloc.cpp;h=01628ecfa2d5ce062a284b07d4a3c2adfb98873c;hb=refs/heads/ics_chocolate#l100

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 26

•

12 years ago

So that bounces into either arch/arm/mach-msm/memory.c or drivers/staging/memory.c. I assume the former. As far as I can tell, it's doing the right thing. (I double checked and our kernel is configured with CONFIG_OUTER_CACHE=y.) This one's getting scary. Will check the assumption of comment 25.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 27

•

12 years ago

I instrumented gralloc.so and there are no errors from genlocking/genunlocking or the pmem flush. So userspace (on the renderer side) is doing all the right things. Afraid we might need to ask for help on this one.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 28

•

12 years ago

I doubt it matters much, but I noticed while testing that the artifacts in settings are 1/2 the length of those in gmaps. So that means gmaps is using an r5g6b5 surface and the strip is 32 bytes.

Dave Hylands [:dhylands]

Comment 29

•

12 years ago

So then the question becomes which virtual address is being used? A virtual address is only valid within exactly one process. Is the virtual address being allocated, used, cache-flushed all from the same process?

Dave Hylands [:dhylands]

Comment 30

•

12 years ago

Our /proc/config.gz says CONFIG_ARCH_MSM_CORTEX_A5 ARM says that the Cortex A5 has 32-byte cache lines: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0434b/BABCFDAH.html

Dave Hylands [:dhylands]

Comment 31

•

12 years ago

It also bothers me that dmesg shows: <6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled <6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled <6>[01-01 00:00:00.000] [0: swapper]l2x0: 8 ways, CACHE_ID 0x410054c7, AUX_CTRL 0x00540000, Cache size: 262144 B So why does L310 cache controller enabled show up twice? and the next line only shows up once?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 32

•

12 years ago

Are folks seeing this in GB builds too?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

12 years ago

Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]

Justin Lebar (not reading bugmail)

Comment 33

•

12 years ago

Is this really LOE:S?

Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:?][caf:help]

David Flanagan [:djf]

Comment 34

•

12 years ago

chris: I sent my GB otoro back to MV, so I can't check for you.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 35

•

12 years ago

Yes.

Whiteboard: [WebAPI:P0][LOE:?][caf:help] → [WebAPI:P0][LOE:S][caf:help]

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 36

•

12 years ago

There's one more hypothesis we should eliminate on the gecko side. I assumed it wasn't possible that we were failing to unlock the buffers, because we previously got errors on attempted reflexive write locks. But I realized there's a crazy way we might be skipping unlock ...

Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 37

•

12 years ago

I directly tried patching the crazy thing, and then wrote a small patch and python script to balance lock/unlock. No effect from former, and all good on the latter. Gecko is indeed doing what it's supposed to be doing.

Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]

Michael Vines [:m1] [:evilmachines]

Comment 38

•

12 years ago

After constructing a trivial app that demonstrates the issue I found that while I see "snow" when it runs OOP, there is no snow when blacklisted. A hint? Next will look into the call flow differences between these two cases now from the POV of the bits beneath gecko

Michael Vines [:m1] [:evilmachines]

Comment 39

•

12 years ago

Although globally disabling OOP in settings is not a panacea so maybe c38 will get us nowhere...

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 40

•

12 years ago

I've been seeing snow in in-process content (like settings before this week), load indicator browser, popup windows ...

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 41

•

12 years ago

Let me try again, with English enabled. It's quite easy to reproduce snow in in-process content. For example, - hold down power button - tap-down on "Power Off" in popup menu, hold finger down - slide finger off "Power Off"

Michael Vines [:m1] [:evilmachines]

Comment 42

•

12 years ago

Not that we didn't know this already but it's safe to say that this is 100% a ARM-->RAM caching issue, as snow is completely gone if |pmem[id].cached = 0| is forced in the kernel pmem driver. Ref: https://www.codeaurora.org/gitweb/quic/la/?p=kernel/msm.git;a=blob;f=drivers/misc/pmem.c;h=2f48d716d6a77286168793c378a5e635a76284a8;hb=ics_strawberry#l2644

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 43

•

12 years ago

Ball's back in our court --- it's indeed a cache coherency issue, but it's caused by us writing to the buffer after unlocking for write. Which is ... disturbing.

Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 44

•

12 years ago

Attached patch mprotect() the mappings in libgralloc when DEBUG_LOCKING (obsolete) (deleted) — Details — Splinter Review

Would love to see this upstreamed in some form.

Attachment #661102 - Attachment is obsolete: true

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 45

•

12 years ago

HAH! You guys are going to love this one.

Dave Hylands [:dhylands]

Comment 46

•

12 years ago

So are you just going to leave us hanging...

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 47

•

12 years ago

I'm a little embarrassed because I thought it was something cute, but my first guess was wrong. It's actually just a run-of-the-mill subtle logic bug.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

12 years ago

Depends on: 792663

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 48

•

12 years ago

Attached patch We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces (obsolete) (deleted) — Details — Splinter Review

I think that debugging patch has a bug; it was still kicking in after this fix. Anyway, it's a hot day in hell, snow. Not my finest hour but a lot of good came out of this.

Attachment #663544 - Attachment is obsolete: true

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 49

•

12 years ago

Comment on attachment 663586 [details] [diff] [review] We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces There's a bug here that I fixed locally. The goal now is figure out why bug 792663 bounced on inbound.

Attachment #663586 - Attachment is obsolete: true

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 50

•

12 years ago

Attached patch We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces (deleted) — Details — Splinter Review

To resummarize the problem here, we always must use the buffer tracker to unmap buffers used by BasicThebesLayer, because the buffers can be mapped from several different locations. What was happening here is that we were using an AutoOpenSurface to do the front->back sync that *wasn't* tracked by buffer tracker. The sequence of events in goes something like this - transaction 1 * thebes layer T has no buffers * enter BasicShadowableThebesLayer::Paint . put BufferTracker on the stack . ThebesLayerBuffer::BeginPaint() causes buffer B1 to be created; BufferTracker B1->lock()s it . paint to B1 . exit Paint(); BufferTracker goes out of scope, B1->unlock() * send B1 to the compositor, get it back as the read-only front buffer - transaction 2 * thebes layer T has no back buffer, and readOnlyFront = B1 * enter BasicShadowableThebesLayer::Paint . put BufferTracker on the stack . enter SyncFrontBufferToBackBuffer o no back buffer, so allocate back buffer = B2 o put AutoOpenSurface(B2) on the stack, B2->lock() o *** SET ThebesLayerBuffer.mBuffer = B2 IN PREP FOR COPY *** o copy front->back o exit Sync(); AutoOpenSurface(B2) goes out of scope, *** B2->unlock() *** . ThebesLayerBuffer::BeginPaint() sees it already has mBuffer; doesn't ask mBufferProvider to map it . paint to B2 *** outside the lock *** . exit Paint(); BufferTracker goes out of scope; doesn't do anything because it didn't see B2 being created Whups.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 51

•

12 years ago

Comment on attachment 663613 [details] [diff] [review] We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces Nick, you probably understand this code as well as anyone now :). I want to get this landed asap, so will try to find someone to steal while you enjoy your weekend ;).

Attachment #663613 - Flags: review?(ncameron)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 52

•

12 years ago

I don't like to nag about review requests, but this bug is probably the most-complained-about issue with b2g at the moment, so everyone really wants to get it off our plates.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

12 years ago

Attachment #663613 - Flags: review?(ncameron) → review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

12 years ago

Attachment #663613 - Flags: review?(roc) → review+

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 53

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/beed670216f6

Ed Morley [:emorley]

Comment 54

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/beed670216f6

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla18

Jason Smith [:jsmith]

Updated

•

12 years ago

Keywords: verifyme

QA Contact: jsmith

Jason Smith [:jsmith]

Comment 55

•

12 years ago

Verified on 9/30 Otoro build.

Status: RESOLVED → VERIFIED

Keywords: verifyme

Nick Cameron [:nrc]

Comment 56

•

12 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #52) > I don't like to nag about review requests, but this bug is probably the > most-complained-about issue with b2g at the moment, so everyone really wants > to get it off our plates. Sorry Chris, I was on vacation and away from the internets last week. Thanks for the review roc!

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 57

•

12 years ago

(In reply to Nick Cameron [:nrc] from comment #56) > (In reply to Chris Jones [:cjones] [:warhammer] from comment #52) > > I don't like to nag about review requests, but this bug is probably the > > most-complained-about issue with b2g at the moment, so everyone really wants > > to get it off our plates. > > Sorry Chris, I was on vacation and away from the internets last week. No worries at all! roc told me you were on PTO right after I wrote this comment.

You need to log in before you can comment on or make changes to this bug.