967223 - Intermittent Gaia unit test "timed out after 1760 seconds of no output" due to a hang while running

Reporter

Description

•

11 years ago

(We really need a component for Gaia unit tests).

We've seen failures like these before and just retriggered. Looks like they're intermittently hanging mid-test. Note that it's 13:28 at the time it's finally killed.

https://tbpl.mozilla.org/php/getParsedLog.php?id=34021800&tree=B2g-Inbound
b2g_ubuntu64_vm b2g-inbound opt test gaia-unit on 2014-02-03 12:57:52 PST for push c2da8d1505fe
slave: tst-linux64-spot-447

13:08:26 INFO - gaia-unit-tests TEST-START | calendar/test/unit/calc_test.js | #daysBetween
13:08:26 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween same day
13:08:26 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween include time
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #daysBetween exclude time
13:08:27 INFO - gaia-unit-tests TEST-END | calendar/test/unit/calc_test.js | #daysBetween
13:08:27 INFO - gaia-unit-tests TEST-START | calendar/test/unit/calc_test.js | #getWeekEndDate
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #getWeekEndDate when given middle
13:08:27 INFO - gaia-unit-tests TEST-PASS | calendar/test/unit/calc_test.js | calendar/calc #getWeekEndDate when given start

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1827.358479
========= Finished '/tools/buildbot/bin/python scripts/scripts/gaia_unit.py ...' failed (results: 2, elapsed: 30 mins, 27 secs) (at 2014-02-03 13:28:27.475131) =========
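For readers unfamiliar with this failure mode, the "seconds without output" kill in the log above can be approximated as follows. This is only a sketch of the general technique, not buildbot's or mozharness's actual code; the function name `run_with_output_timeout` and its `idle_timeout` parameter are made up for illustration.

```python
import queue
import subprocess
import threading


def run_with_output_timeout(cmd, idle_timeout):
    """Run cmd and kill it if it prints nothing for idle_timeout seconds.

    Hypothetical helper for illustration only. Returns (lines, exit_code),
    where exit_code is None if the process was killed for being silent.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    while True:
        try:
            # The idle clock restarts every time a line arrives.
            line = lines.get(timeout=idle_timeout)
        except queue.Empty:
            # No output for idle_timeout seconds: treat the job as hung.
            proc.kill()
            proc.wait()
            return output, None
        if line is None:
            return output, proc.wait()
        output.append(line.rstrip("\n"))
```

A timeout of this kind only measures silence, so it cannot tell a hang from a crash that went undetected; it just kills the job once the idle limit is reached.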

Julien Wajsberg [:julienw]

Comment 23

•

11 years ago

We actually have a component. I also have a semi-ready patch for bug 892048, and I wonder if that could fix this as well. Even if the triggering cause is not the same, maybe it would make the run more reliable.

Component: Gaia → Gaia::TestAgent

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 26

•

11 years ago

(In reply to Julien Wajsberg [:julienw] from comment #23)
> We actually have a component.
>
> I have also a semi-ready patch for bug 892048, and I wonder if that could
> fix this as well. Even if the trigger cause is not the same, maybe it would
> make it more reliable.

No activity in that bug for 3 months? Can we please bump the priority then? You can see how frequently this occurs on TBPL.

Julien Wajsberg [:julienw]

Comment 27

•

11 years ago

Actually, it's no visible activity ;) There has been no real activity for only about 3 weeks, and 3 weeks ago I landed the patch that made it possible to run them at all, so... :) I definitely want to finish this patch once I'm done with my 1.3+ bugs.

Jonathan Griffin (:jgriffin)

Assignee

Updated

•

11 years ago

Depends on: 969590

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 36

•

11 years ago

1760s now... https://tbpl.mozilla.org/php/getParsedLog.php?id=34499477&tree=Mozilla-Inbound

Summary: Intermittent Gaia unit test "command timed out: 1200 seconds without output, attempting to kill" due to a hang while running → Intermittent Gaia unit test "timed out after 1760 seconds of no output" due to a hang while running

Julien Wajsberg [:julienw]

Comment 37

•

11 years ago

It's actually currently expected that we don't send any output until the very last suite: bug 907621. And this is not really easy to fix, although I'd like to find something.

Jonathan Griffin (:jgriffin)

Assignee

Comment 38

•

11 years ago

(In reply to Julien Wajsberg [:julienw] from comment #37)
> It's actually currently expected that we don't send any output until the
> very last suite: bug 907621.
>
> And this is not really easy to fix, although I'd like to find something.

We don't run tests this way in TBPL; here, we run tests one-at-a-time, and there is output for each test.

Julien Wajsberg [:julienw]

Comment 39

•

11 years ago

Oh right, forgot this.

Rail Aliiev [:rail]

Updated

•

11 years ago

Blocks: 966070

Rail Aliiev [:rail]

Updated

•

11 years ago

No longer blocks: 966070

Rail Aliiev [:rail]

Updated

•

11 years ago

Blocks: 945981

Julien Wajsberg [:julienw]

Comment 41

•

11 years ago

Bug 892048 landed on February 17th. Also, I think you made some backend changes recently. Can you say here when you made them, so that we can assess whether all of this fixed the issue?

Julien Wajsberg [:julienw]

Updated

•

11 years ago

Depends on: 892048

Jonathan Griffin (:jgriffin)

Assignee

Comment 42

•

11 years ago

From the dates of the occurrences of this bug, it's quite likely that most were related to a change in the AWS node type used to run these tests; that change was reverted in bug 969590.

Julien Wajsberg [:julienw]

Comment 51

•

11 years ago

This happened a lot last night. Do you know if it's something somewhat expected on your side? Do you see similar issues in other jobs? Also, in the errors from comment 48 and comment 49, it looks like it's doing nothing (and especially not waiting for the test-agent), because a suite had just finished. The errors from comments 46 and 47 stopped in about the same area too. I wonder whether we have an issue in Firefox here :/

Jonathan Griffin (:jgriffin)

Assignee

Comment 55

•

11 years ago

I'd guess this is a crash that our crash detection isn't catching correctly. I'll take a look at it later.

Jonathan Griffin (:jgriffin)

Assignee

Updated

•

11 years ago

Assignee: nobody → jgriffin

Jonathan Griffin (:jgriffin)

Assignee

Updated

•

11 years ago

Status: NEW → ASSIGNED

Jonathan Griffin (:jgriffin)

Assignee

Comment 59

•

11 years ago

Attached file: Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/16968 (deleted)

Each time we get a message from the JS harness, we start a 120s timer; if we hit that timer, we assume the test is hung/crashed, perform crash detection, and abort the run. This should put an end to these mozharness timeouts.
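The approach described above is a resettable watchdog timer. A minimal sketch of the idea follows; it is not the code from the pull request, and the class name `HarnessWatchdog`, its methods, and the `on_timeout` callback are hypothetical names chosen for illustration.

```python
import threading


class HarnessWatchdog:
    """Calls on_timeout if no harness message arrives within `timeout` seconds.

    Hypothetical illustration only; not the implementation that landed.
    """

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self.fired = threading.Event()
        self._timer = None
        self._lock = threading.Lock()

    def _expire(self):
        # The real harness would perform crash detection and abort the run here.
        self.on_timeout()
        self.fired.set()

    def message_received(self):
        """Call for every message from the JS harness; restarts the countdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self._expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        """Cancel any pending countdown, e.g. once the run has finished."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

With a 120s timeout, a harness that keeps sending messages never trips the timer, while one that goes silent is detected long before the 1200s or 1760s limits quoted earlier in this bug.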

Attachment #8387035 - Flags: review?(ahalberstadt)

Andrew Halberstadt [:ahal]

Comment 60

•

11 years ago

Comment on attachment 8387035
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/16968

Lgtm. Like I mentioned in the pull request, feel free to ignore my comment if it doesn't make sense.

Attachment #8387035 - Flags: review?(ahalberstadt) → review+

Jonathan Griffin (:jgriffin)

Assignee

Comment 63

•

11 years ago

Landing without a killAndGetStack implementation for now; will handle in a follow-up if needed: https://github.com/mozilla-b2g/gaia/commit/a737fd38a0e8b3c678fdeb623a6ddeb9d3190817

Jonathan Griffin (:jgriffin)

Assignee

Comment 64

•

11 years ago

No occurrences in 5 days; I'm optimistically calling this fixed.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Julien Wajsberg [:julienw]

Comment 65

•

11 years ago

So this will catch crashes, report crashes, and restart the tests when this happens?

Jonathan Griffin (:jgriffin)

Assignee

Comment 66

•

11 years ago

Yes, yes, and no. After a crash, the crash is reported and the test run is aborted. This is consistent with how our other harnesses work; we could look at resuming the tests in a subsequent patch, if it seems that it would be useful.
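The "report, then abort" behaviour described above can be pictured with the sketch below. It only illustrates the control flow and is not the harness code; `HarnessCrash`, `run_until_crash`, `run_one`, and `report` are hypothetical names.

```python
class HarnessCrash(Exception):
    """Hypothetical signal that the browser under test crashed or hung."""


def run_until_crash(tests, run_one, report):
    """Run tests in order; on the first crash, report it and abort the run.

    Returns (completed_tests, finished_cleanly). The remaining tests are
    not resumed after a crash.
    """
    completed = []
    for name in tests:
        try:
            run_one(name)
        except HarnessCrash as exc:
            report(name, exc)         # the crash is reported...
            return completed, False   # ...and the run is aborted
        completed.append(name)
    return completed, True
```

Resuming after a crash would mean restarting the browser and continuing the loop instead of returning, which is the possible follow-up mentioned above.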

Julien Wajsberg [:julienw]

Comment 67

•

11 years ago

OK, so if it's a crash and not a "real" timeout, I assume it shows up differently in TBPL. Thanks!

Jonathan Griffin (:jgriffin)

Assignee

Comment 68

•

11 years ago

Yep, they'll look very different in TBPL.

Julien Wajsberg [:julienw]

Comment 78

•

10 years ago

Jonathan, do you have any idea how to fix this? It seems to be coming back these days...

Flags: needinfo?(jgriffin)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 79

•

10 years ago

The most recent ones were bug 1023001, which is fixed now.

Flags: needinfo?(jgriffin)

Julien Wajsberg [:julienw]

Comment 80

•

10 years ago

okay

Phil Ringnalda (:philor)

Updated

•

9 years ago

Keywords: intermittent-failure