Closed Bug 609536 Opened 14 years ago Closed 12 years ago

Intermittent "failed to cleanup" errors in Android Tegra 250 talos runs

Categories

(Testing :: Talos, defect, P2)

ARM
Android

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1])

I really thought I'd seen this filed, but perhaps not. From the last 24 hours just on tracemonkey:

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288839527.1288840848.909.gz
Android Tegra 250 tracemonkey talos remote-tsvg on 2010/11/03 19:58:47
(the actual cleanup step and everything before running apparently goes fine)
reconnecting socket
RETURN:s: tegra-001
tegra-001: Started Wed, 03 Nov 2010 19:58:57
Running test tsvg: Started Wed, 03 Nov 2010 19:58:57
reconnecting socket
pushing directory: /tmp/tmps1XwvE/profile to /mnt/sdcard/tests/profile
Failed tsvg: Stopped Wed, 03 Nov 2010 20:20:19
FAIL: Busted: tsvg
FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288817651.1288818947.3593.gz
Android Tegra 250 tracemonkey talos remote-ts on 2010/11/03 13:54:11
RETURN:s: tegra-001
tegra-001: Started Wed, 03 Nov 2010 13:54:22
Running test ts: Started Wed, 03 Nov 2010 13:54:22
reconnecting socket
pushing directory: /tmp/tmpqdfL6N/profile to /mnt/sdcard/tests/profile
Failed ts: Stopped Wed, 03 Nov 2010 14:15:43
FAIL: Busted: ts
FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288788265.1288789571.892.gz
Android Tegra 250 tracemonkey talos remote-tsvg on 2010/11/03 05:44:25
RETURN:s: tegra-001
tegra-001: Started Wed, 03 Nov 2010 05:44:36
Running test tsvg: Started Wed, 03 Nov 2010 05:44:36
reconnecting socket
pushing directory: /tmp/tmp8uNzn1/profile to /mnt/sdcard/tests/profile
Failed tsvg: Stopped Wed, 03 Nov 2010 06:05:58
FAIL: Busted: tsvg
FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288750487.1288751796.32357.gz
Android Tegra 250 tracemonkey talos remote-ts on 2010/11/02 19:14:47
RETURN:s: tegra-001
tegra-001: Started Tue, 02 Nov 2010 19:15:11
Running test ts: Started Tue, 02 Nov 2010 19:15:11
reconnecting socket
pushing directory: /tmp/tmpxh630N/profile to /mnt/sdcard/tests/profile
Failed ts: Stopped Tue, 02 Nov 2010 19:36:33
FAIL: Busted: ts
FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288746288.1288747598.13948.gz
Android Tegra 250 tracemonkey talos remote-tdhtml on 2010/11/02 18:04:48
RETURN:s: tegra-001
tegra-001: Started Tue, 02 Nov 2010 18:05:07
Running test tdhtml: Started Tue, 02 Nov 2010 18:05:07
reconnecting socket
pushing directory: /tmp/tmpnt98Et/profile to /mnt/sdcard/tests/profile
Failed tdhtml: Stopped Tue, 02 Nov 2010 18:26:29
FAIL: Busted: tdhtml
FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288846409.1288847710.25962.gz Android Tegra 250 tracemonkey talos remote-tsvg on 2010/11/03 21:53:29 s: tegra-001 FAIL: Busted: tsvg FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288873499.1288874795.19838.gz Android Tegra 250 tracemonkey talos remote-tdhtml on 2010/11/04 05:24:59 s: tegra-001 FAIL: Busted: tdhtml FAIL: failed to cleanup
Whiteboard: [orange] → [mobile_unittests][orange]
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289245776.1289245958.1383.gz Android Tegra 250 tracemonkey talos remote-ts on 2010/11/08 11:49:36 s: tegra-002 FAIL: Busted: ts FAIL: failed to cleanup
We need to find out whether this is constant or intermittent. If it's device/sdcard corruption that won't fix itself, we need to take these tegras out of production (preferably automatically). If it's intermittent, we need the scripts to retry and/or be more fault tolerant.
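If we go the retry route, here is a minimal sketch of what a retry wrapper around the cleanup step could look like. The cleanup_device callable and the attempt/delay numbers are hypothetical; this is not the actual Talos or clientproxy code.

import time

def retry(action, attempts=3, delay=30):
    # Run `action` up to `attempts` times, sleeping `delay` seconds between
    # tries; re-raise the last error if every attempt fails.
    last_exc = None
    for i in range(attempts):
        try:
            return action()
        except Exception as exc:  # device errors come in many flavours
            last_exc = exc
            print("cleanup attempt %d/%d failed: %s" % (i + 1, attempts, exc))
            time.sleep(delay)
    raise last_exc

# Hypothetical usage:
# retry(lambda: cleanup_device("tegra-001"), attempts=3, delay=60)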
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1294664143.1294666458.17331.gz Android Tegra 250 tracemonkey talos remote-tpan on 2011/01/10 04:55:43 s: tegra-005 FAIL: Busted: tpan FAIL: failed to cleanup
Hoping this will be fixed by bug 618363.
Depends on: 618363
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1294705968.1294707549.11207.gz Android Tegra 250 tracemonkey talos remote-tsspider on 2011/01/10 16:32:48 s: tegra-002 FAIL: Busted: tsspider FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1294708029.1294710817.26354.gz Android Tegra 250 tracemonkey talos remote-ts on 2011/01/10 17:07:09 s: tegra-004 FAIL: Busted: ts FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1294712085.1294715339.15239.gz Android Tegra 250 tracemonkey talos remote-ts on 2011/01/10 18:14:45 s: tegra-002 FAIL: Busted: ts FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1295048358.1295050755.10490.gz Android Tegra 250 tracemonkey talos remote-tdhtml on 2011/01/14 15:39:18 s: tegra-002 FAIL: Busted: tdhtml FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1295059369.1295062031.1084.gz
Android Tegra 250 tracemonkey talos remote-tpan on 2011/01/14 18:42:49 s: tegra-012 FAIL: Busted: tpan FAIL: failed to cleanup

My guess is that this happens on the Mobile tree quite a bit, too, but that nobody ever looks at anything on that tree. I asked in #mobile about a build failure that happens multiple times a day, and nobody seemed to have the slightest idea that it ever happens at all.
I think this may be fixed by bug 625874.
Depends on: 625874
Bear and I are still hitting this in staging. I'm wondering if remotePerfConfigurator.py is having issues, or is still running something on the device when the tests start.
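A pre-flight check along these lines would tell us whether something is still running on the device when the tests start. This is only a sketch: it assumes a devicemanager-style object exposing processExist() and killProcess(), which may not match the exact API the harness uses.

def ensure_clean_device(dm, procs=("org.mozilla.fennec", "xpcshell")):
    # Kill any leftover browser/xpcshell instance before pushing the profile.
    for name in procs:
        pid = dm.processExist(name)  # assumed to return a pid or None
        if pid:
            print("stale process %s (pid %s) found, killing it" % (name, pid))
            dm.killProcess(name)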
Ignore comment 17 for now -- we're apparently not running against the latest dm.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1296773075.1296775260.22751.gz Android Tegra 250 tracemonkey talos remote-tsspider on 2011/02/03 14:44:35 s: tegra-015 FAIL: Busted: tsspider FAIL: failed to cleanup
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1296780271.1296782453.23332.gz
Android Tegra 250 tracemonkey talos remote-tpan on 2011/02/03 16:44:31 s: tegra-013 FAIL: Busted: tpan FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1296780271.1296783838.28928.gz
Android Tegra 250 tracemonkey talos remote-ts on 2011/02/03 16:44:31 s: tegra-028 FAIL: Busted: ts FAIL: failed to cleanup
I know you've been missing me.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1298555958.1298560387.2377.gz
Android Tegra 250 tracemonkey talos remote-tpan on 2011/02/24 05:59:18 s: tegra-015 FAIL: Busted: tpan FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1298526327.1298530868.29364.gz
Android Tegra 250 tracemonkey talos remote-tsspider on 2011/02/23 21:45:27 s: tegra-016 FAIL: Busted: tsspider FAIL: failed to cleanup

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1298526327.1298529977.25996.gz
Android Tegra 250 tracemonkey talos remote-tsvg on 2011/02/23 21:45:27 s: tegra-023 FAIL: Busted: tsvg FAIL: failed to cleanup
New shooter: http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303237728.1303239727.15897.gz Android Tegra 250 mozilla-central talos remote-tp4m on 2011/04/19 11:28:48 s: tegra-038
Summary: Frequent "failed to clean up" errors in Android Tegra 250 talos runs → Frequent "failed to cleanup" errors in Android Tegra 250 talos runs
clint/jmaher: any ETA on when someone can look at this? It's causing oranges on production tegras :-(
In looking at the last few days' worth of logs, there are a few issues here. In general it looks like various tests are failing to run [randomly] such that they don't even produce any output.

Off the top of my head, I see this as one of a few possible errors:
- fennec is broken (unlikely, but possible)
- fennec failed to install properly
- the tegra is hung (http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303912636.1303914392.24405.gz) - no screen info is displayed, so the test never started
- the test caused fennec to crash/hang (http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303877475.1303880837.23337.gz) - the test starts running, but stops part way through. I always suspect out of memory here, and this is usually seen on tpan/tzoom tests.
(In reply to comment #52)
> Off the top of my head, I see this as one of a few possible errors:
> - fennec is broken (unlikely, but possible)
> - fennec failed to install properly
> - the tegra is hung
> (http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303912636.1303914392.24405.gz) -
> no screen info is displayed, so the test never started
> - the test caused fennec to crash/hang
> (http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303877475.1303880837.23337.gz) -
> test starts running, but stops part way through. I always suspect out of
> memory here, and this is usually seen on tpan/tzoom tests.

This could be the handful of xpcshell processes I'm finding still, and we will know once the new pidfile bug is fixed (they are not being killed even when the parent is completely killed).
(In reply to comment #53)
> This could be the handful of xpcshell processes I'm finding still, and we will
> know once the new pidfile bug is fixed (they are not being killed even when the
> parent is completely killed).

Bear - good to know. When will this pidfile change land in production, so we can see if it really does solve this problem?
Depends on: 654116
(In reply to comment #54)
> Bear - good to know. When will this pidfile change land in production, so we
> can see if it really does solve this problem?

It is being tracked as bug 654116 - I'll know more once joel makes the change and I can stage it. The staging test won't take more than a couple of hours to check, but could go overnight, as that is when most of the xpcshell zombies come out.
(In reply to comment #55)
> It is being tracked as bug 654116 - I'll know more once joel makes the
> change and I can stage it. The staging test won't take more than a couple
> of hours to check, but could go overnight, as that is when most of the
> xpcshell zombies come out.

Any follow-up here? The pidfile change has landed and we're still getting reports of this.
I did a search of these on Saturday and found that all of them had the same issue. The main processes (where the new pidfile points) had all been killed by clientproxy as they should have been, but the spawned xre process was still running. The process that we spawn did not kill its child processes, or they were started as standalones (or something). Do we need to get that process to generate a pidfile, or can runtestsremote be set up to kill them when it's being shut down?
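One generic Unix-side way to make sure spawned children die with the harness (a sketch, not the actual runtestsremote/clientproxy change) is to start the child in its own process group and kill the whole group on shutdown:

import os
import signal
import subprocess

def spawn_in_group(cmd):
    # Put the child (and anything it spawns) into a new process group.
    return subprocess.Popen(cmd, preexec_fn=os.setpgrp)

def kill_group(proc, sig=signal.SIGTERM):
    # Signal the whole group so spawned xre/xpcshell children go down too.
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except OSError:
        pass  # already gone

Anything the child starts without detaching into its own session gets caught by the group kill; a process that daemonizes itself would still escape and would need its own pidfile.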
Priority: -- → P2
We generated a pid file for the Python process that manages the test harness, as well as for the xpcshell process. With these two pid files, I don't see why we would be failing on the zombie xpcshell process. The problem here is that Fennec is crashing in the middle of the test run. You can look at the log and see that we usually make it through a few iterations of this. I can easily reproduce this locally, but have not been able to figure out a fix. I think the fix needs to be on the Fennec side of the fence.
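For reference, pidfile handling of the sort described here usually boils down to something like the following; the helper names are illustrative, not the actual harness code:

import os

def write_pidfile(path):
    with open(path, "w") as f:
        f.write(str(os.getpid()))

def read_pidfile(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return None

def pid_is_alive(pid):
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
        return True
    except OSError:
        return False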
This is a bit confusing because I see a few problems here:

1) getInfo.html is not returning information. This fails after 1300+ seconds, which indicates that we are timing out and killing the process. This is because the process is launched but never terminated. Likewise, we don't find browser_output.txt on the device.

2) I see other errors where we are running a test (usually fennecmark | tp4m) and we stop part way through the test (maybe 2-5 iterations instead of 10). Given the long duration of the test, I would assume that the browser is hung and we are killing it at the timeout.

These are symptoms of either the browser hanging or the webserver not serving files. Most likely it is the browser hanging. Other related issues could be that we require the profile and extensions to be available and compatible.
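A rough sketch of the watchdog behaviour described in 1): poll for the output file and give up after the timeout, at which point the harness would kill the browser and report the run as busted. The fileExists() call and the file path are assumptions about a devicemanager-style API; the 1300-second figure comes from the comment above.

import time

def wait_for_output(dm, remote_file, timeout=1300, poll=30):
    # Return True as soon as the test output shows up, False on timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if dm.fileExists(remote_file):  # assumed devicemanager-style call
            return True
        time.sleep(poll)
    return False

# Hypothetical usage:
# if not wait_for_output(dm, "/mnt/sdcard/tests/browser_output.txt"):
#     treat the browser as hung: kill it and mark the run busted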
Depends on: 662936
OK, even I had to get bored with the copy-pasting eventually. Just assume this still happens constantly.
Whiteboard: [mobile_unittests][orange] → [mobile_unittests][orange][android_tier_1]
Interestingly, still happens constantly on m-a and m-b, but has gone away on the trunk (perhaps in favor of one of the other errors, perhaps really gone away).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Frequent "failed to cleanup" errors in Android Tegra 250 talos runs → Intermittent "failed to cleanup" errors in Android Tegra 250 talos runs
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [mobile_unittests][orange][android_tier_1] → [mobile_unittests][android_tier_1]