1210465 - Issues with test_basic_pan.html

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Description

•

9 years ago

Investigation bug for all the various intermittent failures with test_basic_pan.html which seem to have increased recently. I suspect a few of them have the same underlying cause.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 1

•

9 years ago

Try push with some logging enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e35818ef4683. My plan is to retrigger test_basic_pan until it hits one or more of these failures and use the logging to find out roughly what's going on, and dig in from there.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Updated

•

9 years ago

Depends on: 1183144

Chris Manchester (limited bugmail, email directly)

Comment 2

•

9 years ago

If I understand what's intended, the mach command to push to try should be an effective workaround for bug 1183144: running |./mach try --tag apz -p emulator| will run the relevant tests in a single chunk (it will appear as M(1)).

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 3

•

9 years ago

Indeed, the try push I have above seems to provide the results that I need. Removing dependency on that bug. The failure from bug 1208200 showed up in my retriggers and from the log it looks like the touch events are synthesized just fine, but the main thread timeout is hit before the div gets layerized, and so the touch events end up scrolling the root document's APZ instead. This should be avoidable by increasing the content response timeout. In the log at [1] the timeout is installed at 20:19:29.244 and the response arrives at 20:19:47.784 which is 18.54 seconds later. This exceeds the 15 second timeout that we use in the testing environment, causing the problem. That being said, I'm not sure what was happening between 20:19:30.124 and 20:19:40.604 since that's a 10 second gap with no output in logcat. It could be that there's some underlying bug that caused the main thread to stop pumping events. [1] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/6801cf4314f00cbd908cf01f3a4422cf93d4b6aaff8fd676fa785a2b4f4f89b9006b51217c69630ac9b79e1b740564de8592fa80ea0b95878bf7edf5afe8e32b

No longer depends on: 1183144

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 4

•

9 years ago

In the log at [2] the same thing happens for the iframe, and there it takes just under 50 seconds for the content response to get back! [2] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/b977c422e5823bb2f36a3a113ca8d193d79896935fa55317f127992242ddfdbbd063ecf54913a8fdd730a30aea5ff1496b894ca7a1ae4974454ff38fd9f3ab36

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 5

•

9 years ago

Bug 1203901 also showed up in the retriggers, e.g. log at [3]. The reason it times out is because the touch events are delivered to the root APZ in the root process, so no scroll occurs and presumably no APZ:TransformEnd notification is dispatched in the child process. If we look at the hit-testing tree dump we can see that the child process layers (l=3) haven't shown up in the tree yet. This is evident if you compare it to the equivalent tree dump in the logs [1] or [2] from my previous comments. So likely there's a race here where the "onload" script runs before the compositor actually has the layers from the page that got loaded. We can probably fix this by doing a flushApzRepaints or some other equivalent wait-for-paint magic before synthesizing the drag. [3] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/4630d8afe7fe3fb6b013d8a0fd2ecac92b376ee85b89b41e0e9a3380e9b9681b3802170c2ecac94705661b1c24b40070ca96e81b40be5ecdd9ef4ac27d9bb992

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 6

•

9 years ago

I suspect both bug 1208973 and bug 1209444 are due to improper multithreaded access of mQueuedTask in TaskThrottler. These crashes both started after the recent changes to TaskThrottler. There's a log at [4] with such a crash but it doesn't help that much. [4] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/01abdc9cfd862deb17f186b59763e6e16763976d17d8c8335448b49b63481eb6ac86086e9fea490bf0b086e3bf585852c2f32c143cbee2e16819842d72e262fb

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 7

•

9 years ago

I did another try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a8b377ccfc0 with what I hope is a fix for bug 1203901 and more logging for the other issues.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 8

•

9 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4) > In the log at [2] the same thing happens for the iframe, and there it takes > just under 50 seconds for the content response to get back! Log at [5] has more logging, it shows that the APZ deals with the touchstart at around 18:28:34.691 and the TabChild doesn't see it until 18:29:02.152. That's 28 seconds just to get the event to the child process. It doesn't seem to get stalled anywhere in particular, so I think this is just SLOW. Admittedly there's a bunch of other stuff happening (the drag event generates a lot of touch events) so if this is running on a slow machine that might explain it. [5] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/5f672c2d6bb4f9204651bdb3ef46468bee3dee02dd77daf224a58cea113e2e2c0af11697a013a5a03d382810feb05aa0700694457b633de44f51dde5d8555f9f

Botond Ballo [:botond]

Comment 9

•

9 years ago

Wow - and I thought 15 seconds was an eternity to begin with!

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 10

•

9 years ago

Yeah :( The new try push has a few "test timed out" as well - for some of them there's a new issue with the drag turning into a long-press. We can fix that by increasing the long-press threshold timeout. And there's still an issue with the layer tree not being correct at the time the touchstart comes in, so I guess my patch didn't really work. It might need an APZ flush as well.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 11

•

9 years ago

Round 3: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e401d673a13d

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 12

•

9 years ago

The retriggers on this are showing a couple of failures. One is the same "layer tree not updated" problem. I fixed it for helper_div_pan and helper_iframe_pan but forgot to do it for helper_basic_pan. The crash from bug 1208973 also showed up and as I expected the log [6] indicates that it's probably a concurrency error. Specifically the task 0x46872780 gets run twice in the same throttler, which should never happen. I suspect that the second of those tasks ends up sending a corrupted framemetrics which then causes an assertion failure on the child send when it tries to unpack it. Adding some locking around this should fix it. [6] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/ecf36972bf0a5fddc3432da6e9d3fe799eb019e521df230fceb0133f10e9c10ddfe7f9f2e6dc63b8d88a96eb64330821cfb8f51edec74368a8b4ee9dbe438153

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 13

•

9 years ago

Looks like round 4 is a winner: https://treeherder.mozilla.org/#/jobs?repo=try&revision=917d12e6eebb&group_state=expanded

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Assignee

Comment 14

•

9 years ago

My investigation is complete; I have attached patches to 3 of the 5 dependent bugs. The other two are probably dupes.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Bugzilla

Issues with test_basic_pan.html

Categories

(Core :: Panning and Zooming, defect)

Tracking

()

People

(Reporter: kats, Assigned: kats)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14