Closed Bug 1561535 Opened 5 years ago Closed 5 years ago

Investigate max runtime errors after microsoft.com re-recording

Categories

(Testing :: Raptor, task, P1)

Version 3
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alexandru.irimovici, Assigned: alexandru.irimovici)

References

Details

Bugbug thinks this bug is a task, but please change it back in case of error.

Type: defect → task

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #2)

(In reply to Alexandru Irimovici from comment #0)

Investigate max runtime errors that we get here:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=253270045&revision=65cc93e84d6e8203b62d2d6706a3a54446b5c8c3

The PR with the changes is: https://phabricator.services.mozilla.com/D35644

I believe the correct patch is https://phabricator.services.mozilla.com/D35759

You are right :) I will edit my comment

Priority: -- → P1

Try push with the errors: https://treeherder.mozilla.org/#/jobs?repo=try&revision=41bbda6e2e12ebd8cb05d34e025fb244939373bd

Similar try push with other sites tested, which is all green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65cc93e84d6e8203b62d2d6706a3a54446b5c8c3

For the failing test we have 2 sites (apple and microsoft).
The jobs fail intermittently and, from what I see in the logs, they go well for the first site (apple); when it fails, it happens for the microsoft recording at about the 12th-16th page cycle. It just hangs there for ~25 min, until the task times out.
Log sample: (https://taskcluster-artifacts.net/TxCDVjthScSotADqvnv_-A/0/public/logs/live_backing.log)
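Not part of the original comment, but as a rough way to locate where a run like the one above stalls, here is a minimal sketch that scans a downloaded live_backing.log for the largest gap between consecutive timestamped lines. The file name and the HH:MM:SS timestamp format are assumptions based on the log excerpt later in this bug.

```python
# Hypothetical helper (not part of Raptor): find the largest gap between
# consecutive timestamped lines in a downloaded live_backing.log, to spot
# where the harness stopped making progress before the taskcluster abort.
import re
from datetime import datetime, timedelta

TS = re.compile(r"^(\d{2}:\d{2}:\d{2})\s")  # assumes HH:MM:SS at line start

def find_stall(path, threshold=timedelta(minutes=2)):
    prev_time, prev_line = None, None
    worst = (timedelta(0), None, None)
    with open(path, errors="replace") as fh:
        for line in fh:
            m = TS.match(line)
            if not m:
                continue
            t = datetime.strptime(m.group(1), "%H:%M:%S")
            if prev_time is not None and t >= prev_time:
                gap = t - prev_time
                if gap > worst[0]:
                    worst = (gap, prev_line.rstrip(), line.rstrip())
            prev_time, prev_line = t, line
    gap, before, after = worst
    if before and gap >= threshold:
        print(f"stalled for {gap} between:\n  {before}\n  {after}")
    else:
        print("no stall above threshold found")

find_stall("live_backing.log")
```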

I was not able to reproduce it locally, but I recorded the microsoft.com site again and I'm going to switch to the new recording.
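For reference, a minimal sketch of how a new capture could be sanity-checked before switching to it, assuming the recording is a mitmproxy flow file (as used by Raptor/mozproxy playback) and that mitmproxy is installed; the file name here is hypothetical.

```python
# Hypothetical sanity check: list the recorded request URLs and status codes
# in a mitmproxy flow file so the microsoft.com capture can be eyeballed
# before switching the test to it.
from mitmproxy import io, http

def list_recorded_urls(path):
    with open(path, "rb") as fh:
        reader = io.FlowReader(fh)
        for flow in reader.stream():
            if isinstance(flow, http.HTTPFlow):
                status = flow.response.status_code if flow.response else "-"
                print(status, flow.request.pretty_url)

list_recorded_urls("microsoft-com.mp")
```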

Robert, did you notice this kind of behavior before?

Flags: needinfo?(rwood)

(In reply to Alexandru Irimovici from comment #4)

they go well for the first site (apple); when it fails, it happens for the microsoft recording at about the 12th-16th page cycle. It just hangs there for ~25 min, until the task times out.

Robert, did you notice this kind of behavior before?

Sites working well and then timing out in later page-cycles? Yes, before the general fix to the intermittents (Bug 1559798). Hopefully re-recording again will fix it?

Flags: needinfo?(rwood)

(In reply to Robert Wood [:rwood] from comment #5)

(In reply to Alexandru Irimovici from comment #4)

they go well for the first site (apple); when it fails, it happens for the microsoft recording at about the 12th-16th page cycle. It just hangs there for ~25 min, until the task times out.

Robert, did you notice this kind of behavior before?

Sites working well and then timing out in later page-cycles? Yes, before the general fix to the intermittents (Bug 1559798). Hopefully re-recording again will fix it?

The new recording still has the same issue (timing out in later page-cycles): https://treeherder.mozilla.org/#/jobs?repo=try&revision=39283025f4280b49c707ce30146fcd09a56f89fd

I'm still not able to reproduce it locally. Robert, can you help me with some advice for this? :)

Flags: needinfo?(rwood)

(In reply to Alexandru Irimovici from comment #6)

I'm still not able to reproduce it locally. Robert, can you help me with some advice for this? :)

Hmmm, no, I don't know what the issue is here - you and :bebe are much more familiar with recordings etc. than I am. :bebe, please have a look, thanks!

Flags: needinfo?(rwood) → needinfo?(fstrugariu)
Blocks: 1559938

I don't think having more time would solve these intermittents. Almost every time the job succeeds, it does so in ~6 minutes. When it fails, it fails suddenly at around the 14th page cycle, after running normally, and it just blocks the taskcluster task for the rest of the remaining time (example from the logs below - notice the timestamps).

13:29:08     INFO -  raptor-control-server Info: received webext_status: begin pagecycle 14
13:29:08     INFO -  PID 1414 | console.log: "[raptor-runnerjs] begin pagecycle 14"
13:29:08     INFO -  PID 1414 | console.log: "[raptor-runnerjs] posting to control server"
13:29:08     INFO -  PID 1414 | console.log: "[raptor-runnerjs] begin pagecycle 14"
13:29:08     INFO -  PID 1414 | console.log: "[raptor-runnerjs] post success"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] update tab: 1"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] posting to control server"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] update tab: 1"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] test tab updated: 1"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] posting to control server"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] test tab updated: 1"
13:29:09     INFO -  raptor-control-server Info: received webext_status: update tab: 1
13:29:09     INFO -  raptor-control-server Info: received webext_status: test tab updated: 1
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] post success"
13:29:09     INFO -  PID 1414 | console.log: "[raptor-runnerjs] post success"
[taskcluster:error] Aborting task...
[taskcluster 2019-06-28T13:54:32.698Z] === Task Finished ===
[taskcluster 2019-06-28T13:54:32.699Z] Task Duration: 30m0.009884531s
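The point above is that extra max-run-time would only delay the abort, because the harness simply stops making progress mid page-cycle. As an illustration only (not Raptor's actual code), a per-page-cycle watchdog along these lines would fail fast with a diagnostic instead of blocking until taskcluster kills the task; the helper names and the 120s timeout are assumptions.

```python
# Minimal sketch, not Raptor's implementation: run each page cycle under a
# watchdog so a hung cycle raises immediately instead of blocking the task
# until its 30-minute max-run-time is hit.
import threading

class PageCycleTimeout(Exception):
    pass

def run_with_watchdog(run_cycle, cycle_number, timeout_seconds=120):
    result = {}

    def target():
        result["value"] = run_cycle(cycle_number)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        # The cycle is stuck (e.g. the page never reports its measurements);
        # surface that right away rather than hanging silently.
        raise PageCycleTimeout(
            f"page cycle {cycle_number} exceeded {timeout_seconds}s"
        )
    return result.get("value")
```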

Bebe and Bob Clary confirmed that the failures are not related to hardware issues, as they occur on different machines.

There is still this mystery: the test fails intermittently in exactly the same spot, on the same page cycle (14th for OS X and 16th for Linux), not terminating the test after the failure and letting the taskcluster task time out.
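To back up the "same page cycle" observation, a small hypothetical helper (not part of this bug's tooling) could tally the last page cycle reached in a set of downloaded failing logs per platform; the log file paths and the "begin pagecycle N" marker are taken from the excerpt above.

```python
# Hypothetical check: report the last "begin pagecycle N" reached in each
# failing live_backing.log, grouped by platform, to confirm the pattern.
import re
from collections import Counter

CYCLE = re.compile(r"begin pagecycle (\d+)")

def last_cycle(path):
    last = None
    with open(path, errors="replace") as fh:
        for line in fh:
            m = CYCLE.search(line)
            if m:
                last = int(m.group(1))
    return last

def tally(paths_by_platform):
    # paths_by_platform: e.g. {"osx": ["osx-1.log", ...], "linux": [...]}
    for platform, paths in paths_by_platform.items():
        counts = Counter(last_cycle(p) for p in paths)
        print(platform, dict(counts))
```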

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(fstrugariu)
Resolution: --- → FIXED