Closed Bug 1282638 Opened 8 years ago Closed 4 years ago

Intermittent gfx/layers/apz/test/mochitest/test_group_touchevents.html | Test timed out.

Categories

(Core :: Panning and Zooming, defect, P5)

VERIFIED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [gfx-noted][stockwell fixed:backout])

Weird. It looks like all the subtests finished running but the top-level test didn't finish()? Not really sure what happened there.
Whiteboard: [gfx-noted]
Looks like the android failures went away but now we have failures on OS X.
Assignee: nobody → bugmail
Try push with logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a2f0e74c5929&group_state=expanded At least some of the failures seem to be because helper_touch_action ends up triggering a fling, which is unexpected. Not only does it throw off the scroll position, but future touch events also get ignored because of the fast motion. There are other failures too, so I'll spin off another bug to fix this issue.
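To illustrate why a synthesized drag can end in an unexpected fling, here is a minimal Python sketch, assuming a fling starts when the touch is still moving fast at the moment it is released. It is not APZ's actual velocity tracking; `release_velocity`, `would_fling`, and the 0.5 px/ms threshold are hypothetical names and values chosen only for illustration.

```python
def release_velocity(samples):
    """Estimate speed (px/ms) at release from the last two touch samples.

    `samples` is a list of (time_ms, y_px) tuples, oldest first.
    This two-point estimate is an assumption made for the sketch.
    """
    if len(samples) < 2:
        return 0.0
    (t0, y0), (t1, y1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("touch samples must have increasing timestamps")
    return abs(y1 - y0) / (t1 - t0)


def would_fling(samples, threshold_px_per_ms=0.5):
    """Return True if the release speed reaches the (hypothetical) threshold."""
    return release_velocity(samples) >= threshold_px_per_ms


# A drag released while still moving quickly.
fast_release = [(0, 100), (10, 50)]
# The same drag, followed by a stationary sample before the touch is lifted.
paused_release = [(0, 100), (10, 50), (500, 50)]

print(would_fling(fast_release), would_fling(paused_release))
```

Under this model, ending the drag with a stationary sample before the release keeps the release speed at zero, which is one way a test might avoid triggering a fling.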
I put a patch on bug 1297408. Next try run with more logging at https://treeherder.mozilla.org/#/jobs?repo=try&revision=58443a139413&selectedJob=26227850 seems to indicate that during helper_long_tap, the APZ that gets long-tapped upon is destroyed during the long-tap. Not sure why yet.
Seems like maybe waitUntilApzStable returns too soon, before the previous page is unloaded, and so the tap events end up going to the wrong page. That's a little concerning.
Based on additional logging [1], I think the problem is that waitUntilApzStable does a waitForAllPaints instead of a waitForAllPaintsFlushed. For some reason on OS X we can run through that waitForAllPaints call without having done a paint and without a paint being pending, and so the test goes on to run on the previous page's layer tree. This naturally doesn't work and usually results in the timeouts. I have a new try push [2] which changes that call to flush the paints to see if it fixes the issue. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2f9d5411f9e&selectedJob=26520124 [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5fed88798269
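The hypothesis in this comment can be illustrated with a toy model. This is a sketch in Python, not Gecko code: `ToyWindow`, `wait_for_all_paints`, and `wait_for_all_paints_flushed` are hypothetical stand-ins, and the only behavior modeled is the one described above (a wait that returns immediately when no paint is pending, versus a wait that forces a paint first).

```python
class ToyWindow:
    """Hypothetical stand-in for a test window; not a Gecko API."""

    def __init__(self):
        self.loaded_page = "previous.html"   # page the content side has loaded
        self.painted_page = "previous.html"  # page the layer tree currently shows
        self.paint_pending = False

    def navigate(self, page):
        # Assumption for the sketch: the new page is loaded,
        # but no paint has been scheduled for it yet.
        self.loaded_page = page

    def paint(self):
        self.painted_page = self.loaded_page
        self.paint_pending = False


def wait_for_all_paints(win):
    # Only drains paints that are already pending; returns at once otherwise.
    if win.paint_pending:
        win.paint()


def wait_for_all_paints_flushed(win):
    # Forces a paint before returning, so the layer tree is up to date.
    win.paint()


stale = ToyWindow()
stale.navigate("helper_tap.html")
wait_for_all_paints(stale)

fresh = ToyWindow()
fresh.navigate("helper_tap.html")
wait_for_all_paints_flushed(fresh)

print(stale.painted_page, fresh.painted_page)
```

In the first case the wait returns while the layer tree still shows the previous page, which is the situation the comment suspects the test of running in.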
So that seemed to help fix the helper_tap timeouts, but I see the exact same issue on helper_long_tap (which I was seeing before as well). So I suspect that really the patch did nothing. I pored over the logs a bit more and as far as I can tell, the compositor is in fact getting the layers update (and the screenshot indicates that the correct page is visible) but maybe the APZ tree rebuild is getting skipped? I don't see any APZC NLU calls with aIsFirstPaint=1 for the subtest that fails. Continuing investigation along those lines...
What I'm finding is that normally when subtests inside test_group_touchevents load, the window they spawn in first loads about:blank and then loads the subtest file. This is normal and expected, and it means that the compositor gets two aIsFirstPaint transactions (one for about:blank and the other for the page). In the bad cases it seems like about:blank doesn't load, it goes straight to the subtest file, and that may or may not be related to the problem. The latest log I have is from this job: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4caf56adfe12&selectedJob=26682790
Priority: P5 → P3
I'm not really sure where to go from here, and there are higher-volume intermittent failures that I can look at, so I'm unassigning this one.
Assignee: bugmail → nobody
Looks like this is basically permafail on Win8 e10s at the moment. I'll try to bisect when that happened. https://treeherder.mozilla.org/logviewer.html#?job_id=29204519&repo=try
Oh, that answer is obvious I guess. It was only re-enabled on Windows a week ago in bug 1291381. Kats, can you please take a look? Win8 M-e10s(4) is available on Try.
Flags: needinfo?(bugmail)
First try push with logging seems to show [1] that it's hanging during loading of helper_bug1162771.html. Not really surprising since that's what the screenshot was showing as well. I'll do more try pushes with more logging to figure out what's going on. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7fdd4cb15c554e976116f1fbf9a1f37f58453a1&selectedJob=29211204
Flags: needinfo?(bugmail)
Assignee: nobody → bugmail
After many try pushes with successively more logging, it looks like we do call the injectTouchEvent Windows API with the touchstart event on helper_bug1162771.html, but we never get the WM_TOUCH. Similar previous calls work fine. The only notable thing about this touchstart event that I can see is that it has a lower x-coordinate (x=16) than the previous ones. So my guess is that Windows 8 has some sort of edge swipe gesture detector that's holding on to the touchstart to see if it's an edge gesture. I'm running a try push with a higher x-coordinate to see if it helps.
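The workaround being tried here (moving the touch point away from the window edge) can be sketched as a small helper. This is a Python illustration under stated assumptions, not the actual test change: `inset_from_edges` is a hypothetical name, and the 50 px margin is an arbitrary illustrative value, since the comment does not say which coordinate was used or how wide any edge-gesture zone is.

```python
def inset_from_edges(x, y, width, height, margin=50):
    """Move a touch point at least `margin` px away from every window edge.

    `margin` is a hypothetical value chosen for illustration only.
    """
    if width < 2 * margin or height < 2 * margin:
        raise ValueError("window is too small for the requested margin")
    safe_x = min(max(x, margin), width - margin)
    safe_y = min(max(y, margin), height - margin)
    return safe_x, safe_y


# The touchstart at x=16 from the comment, in a hypothetical 800x600 window.
print(inset_from_edges(16, 100, 800, 600))
```

Points already inside the margin are returned unchanged, so only touches near an edge would be adjusted.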
That seems to have worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa2d935b69374f1135532da15f7a6f04645aba47&selectedJob=29455183 I'll spin out a new bug with the fix, since it's probably a separate issue from the intermittent this bug was originally tracking.
Split the fix into bug 1311406, throwing this back into the unassigned pool.
Assignee: bugmail → nobody
The failures reported in the previous comment were introduced with bug 1376519 and stopped when it was backed out.
Whiteboard: [gfx-noted] → [gfx-noted][stockwell fixed:backout]
Bulk priority update of open intermittent test failure bugs. P3 => P5 https://bugzilla.mozilla.org/show_bug.cgi?id=1381960
Priority: P3 → P5
From a quick look at the log, it looks like touch event injection is failing. All the failures are on Windows, so probably the releng or taskcluster folks updated something in the OS that caused this to start failing. If they can't roll it back, I can probably update the tests to use our "synthetic touch events" rather than OS-injected touch events, but the test will be less representative of real-world scenarios, so I would prefer to avoid that.
some months ago i added a registry key to disable touch events (bug 1382988, comment 4). that patch never actually worked due to a syntax error that quietly logged a failure. recently, whilst cleaning up these syntax errors to reduce noise in the logs, the registry hack was corrected: https://github.com/mozilla-releng/OpenCloudConfig/commit/801ef77f468b7e6bc5778a7e231f196af17fee65 https://github.com/mozilla-releng/OpenCloudConfig/commit/51ecff2f17159a1ec9d13242438b7402a0d908b1 i assume that when the registry hack to disable touch events started working, it broke these tests. i've removed the registry hack today (https://github.com/mozilla-releng/OpenCloudConfig/commit/614792e280811e80422a5732c6304d348757e9e3) and will retrigger tests when the rebuilt amis have propagated to see if they go green. the ami rebuild is here: https://tools.taskcluster.net/groups/NY_vYQDuQZGZji0QY49cvQ
Flags: needinfo?(rthijssen)
Thanks Rob! Since this intermittent failure was happening even before the permafail I'm not sure it's correct to actually mark this bug FIXED but we can leave it for now and reopen it if we continue to see intermittent instances of this failure.
Flags: needinfo?(bugmail)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → INCOMPLETE