Closed Bug 967223 Opened 11 years ago Closed 11 years ago

Intermittent Gaia unit test "timed out after 1760 seconds of no output" due to a hang while running

Categories

(Firefox OS Graveyard :: Gaia::TestAgent, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: jgriffin)

Attachments

(1 file)

(We really need a component for Gaia unit tests). We've seen failures like these before and just retriggered. Looks like they're intermittently hanging mid-test. Note that it's 13:28 at the time it's finally killed.
https://tbpl.mozilla.org/php/getParsedLog.php?id=34021800&tree=B2g-Inbound
b2g_ubuntu64_vm b2g-inbound opt test gaia-unit on 2014-02-03 12:57:52 PST for push c2da8d1505fe
slave: tst-linux64-spot-447
13:08:26 INFO - gaia-unit-tests TEST-START | calendar/test/unit/calc_test.js | #daysBetween
13:08:26 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween same day
13:08:26 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween include time
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween exclude time
13:08:27 INFO - gaia-unit-tests TEST-END | calendar/test/unit/calc_test.js | #daysBetween
13:08:27 INFO - gaia-unit-tests TEST-START | calendar/test/unit/calc_test.js | #getWeekEndDate
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #getWeekEndDate when given middle
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #getWeekEndDate when given start
command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1827.358479
========= Finished '/tools/buildbot/bin/python scripts/scripts/gaia_unit.py ...' failed (results: 2, elapsed: 30 mins, 27 secs) (at 2014-02-03 13:28:27.475131) =========
We actually have a component. I have also a semi-ready patch for bug 892048, and I wonder if that could fix this as well. Even if the trigger cause is not the same, maybe it would make it more reliable.
Component: Gaia → Gaia::TestAgent
(In reply to Julien Wajsberg [:julienw] from comment #23)
> We actually have a component.
>
> I have also a semi-ready patch for bug 892048, and I wonder if that could
> fix this as well. Even if the trigger cause is not the same, maybe it would
> make it more reliable.
No activity in that bug for 3 months? Can we please bump the priority then? You can see how frequently this occurs on TBPL.
Actually, there's just no visible activity ;) No real activity for about 3 weeks. And 3 weeks ago I landed the patch that made it possible to run them at all, so... :) I definitely want to finish this patch once I'm done with my 1.3+ bugs.
Depends on: 969590
Summary: Intermittent Gaia unit test "command timed out: 1200 seconds without output, attempting to kill" due to a hang while running → Intermittent Gaia unit test "timed out after 1760 seconds of no output" due to a hang while running
It's actually currently expected that we don't send any output until the very last suite: bug 907621. And this is not really easy to fix, although I'd like to find something.
(In reply to Julien Wajsberg [:julienw] from comment #37)
> It's actually currently expected that we don't send any output until the
> very last suite: bug 907621.
>
> And this is not really easy to fix, although I'd like to find something.
We don't run tests this way in TBPL; here, we run tests one-at-a-time, and there is output for each test.
Oh right, forgot this.
Blocks: 966070
No longer blocks: 966070
Blocks: 945981
Bug 892048 landed February 17th. Also, I think you made some backend changes recently. Can you say here when you made them, so that we can assess whether all of this fixed the issue?
Depends on: 892048
From the dates of the occurrences of this bug, it's quite likely that most were related to some changes of the AWS node type that is used to run these tests; that change was reverted in bug 969590.
This happened a lot last night; do you know if it's somewhat expected on your side? Do you see similar issues in other jobs? Also, in the errors from comment 48 and comment 49 it looks like it's doing nothing (and especially not waiting for the test-agent), because a suite had just finished. The errors from comment 46 and 47 stopped at about the same area too. I wonder if we don't have an issue in Firefox here :/
I'd guess this is a crash, but that our crash detection isn't catching it correctly. I'll take a look at it later.
Assignee: nobody → jgriffin
Status: NEW → ASSIGNED
Each time we get a message from the JS harness, we start a 120s timer; if we hit that timer, we assume the test is hung/crashed, perform crash detection, and abort the run. This should put an end to these mozharness timeouts.
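For readers following along, here is a minimal sketch of the timer logic described above. It assumes Python and uses hypothetical names (OutputWatchdog, message_received, on_timeout); it is not the code from the pull request, and the crash detection and abort steps are left to whatever callback the caller supplies.

```python
import threading


class OutputWatchdog:
    """Fire a callback if no harness message arrives within `timeout` seconds.

    Hypothetical helper for illustration only; names are assumptions.
    """

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout          # seconds of silence tolerated (e.g. 120)
        self.on_timeout = on_timeout    # e.g. run crash detection, then abort
        self._timer = None
        self._lock = threading.Lock()

    def message_received(self):
        """Call for every message from the JS harness; restarts the countdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_timeout)
            self._timer.daemon = True   # never keep the process alive on exit
            self._timer.start()

    def stop(self):
        """Cancel the countdown, e.g. once the run has finished normally."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

In this sketch a runner would create OutputWatchdog(120, handler), call message_received() from its message handler, and call stop() when the run completes, so only a stretch of silence longer than the timeout triggers the handler.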
Attachment #8387035 - Flags: review?(ahalberstadt)
Comment on attachment 8387035
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/16968
Lgtm. Like I mentioned in the pull request, feel free to ignore my comment if it doesn't make sense.
Attachment #8387035 - Flags: review?(ahalberstadt) → review+
Landing without a killAndGetStack implementation for now; will handle in a follow-up if needed: https://github.com/mozilla-b2g/gaia/commit/a737fd38a0e8b3c678fdeb623a6ddeb9d3190817
No occurrences in 5 days; I'm optimistically calling this fixed.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
So this will catch crashes, report crashes, and restart the tests when this happens?
Yes, yes, and no. After a crash, the crash is reported and the test run is aborted. This is consistent with how our other harnesses work; we could look at resuming the tests in a subsequent patch, if it seems that it would be useful.
Ok, so if it's a crash and not a "real" timeout, I assume it shows differently in TBPL. Thanks !
Yep, they'll look very different in TBPL.
Jonathan, do you have any idea how to fix this? It seems to have come back recently...
Flags: needinfo?(jgriffin)
The most recent ones were bug 1023001, which is fixed now.
Flags: needinfo?(jgriffin)