Closed
Bug 1244446
Opened 9 years ago
Closed 8 years ago
Intermittent failure to running rmtree on Linux, "WARNING: Command failed. See logs for output.", "# umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/tmp/ccache", etc.
Categories
(Firefox Build System :: General, defect)
Tracking
(firefox50 fixed)
RESOLVED
FIXED
mozilla50
Tracking | Status | |
---|---|---|
firefox50 | --- | fixed |
People
(Reporter: aryx, Assigned: glandium)
References
Details
(Keywords: intermittent-failure)
Attachments
(2 files)
https://treeherder.mozilla.org/logviewer.html#?job_id=20771996&repo=mozilla-inbound Also seen similar behavior yesterday, can't find a bug for it. 01:43:12 ERROR - WARNING: Command failed. See logs for output. 01:43:12 ERROR - # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/sys 01:43:12 ERROR - WARNING: Command failed. See logs for output. 01:43:12 ERROR - # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/proc 01:43:12 ERROR - Traceback (most recent call last): 01:43:12 ERROR - File "/usr/sbin/mock_mozilla", line 862, in <module> 01:43:12 ERROR - main(retParams) 01:43:12 ERROR - File "/usr/sbin/mock_mozilla", line 740, in main 01:43:12 ERROR - sys.exit(chroot.shell(options, cmd)) 01:43:12 ERROR - File "/usr/lib/python2.6/site-packages/mock_mozilla/backend.py", line 659, in shell 01:43:12 ERROR - self._setupDev() 01:43:12 ERROR - File "<peak.util.decorators.rewrap wrapping mock_mozilla.backend._setupDev at 0x013F3230>", line 3, in _setupDev 01:43:12 ERROR - File "/usr/lib/python2.6/site-packages/mock_mozilla/trace_decorator.py", line 70, in trace 01:43:12 ERROR - result = func(*args, **kw) 01:43:12 ERROR - File "/usr/lib/python2.6/site-packages/mock_mozilla/backend.py", line 386, in _setupDev 01:43:12 ERROR - mock_mozilla.util.rmtree(self.makeChrootPath("dev"), selinux=self.selinux) 01:43:12 ERROR - File "<peak.util.decorators.rewrap wrapping mock_mozilla.util.rmtree at 0x013E0938>", line 3, in rmtree 01:43:12 ERROR - File "/usr/lib/python2.6/site-packages/mock_mozilla/trace_decorator.py", line 70, in trace 01:43:12 ERROR - result = func(*args, **kw) 01:43:12 ERROR - File "/usr/lib/python2.6/site-packages/mock_mozilla/util.py", line 87, in rmtree 01:43:12 ERROR - shutil.rmtree(path, *args, **kargs) 01:43:12 ERROR - File "/usr/lib64/python2.6/shutil.py", line 212, in rmtree 01:43:12 ERROR - rmtree(fullname, ignore_errors, onerror) 01:43:12 ERROR - File "/usr/lib64/python2.6/shutil.py", line 221, in rmtree 01:43:12 ERROR - onerror(os.rmdir, path, sys.exc_info()) 01:43:12 ERROR - File "/usr/lib64/python2.6/shutil.py", line 219, in rmtree 01:43:12 ERROR - os.rmdir(path) 01:43:12 ERROR - OSError: [Errno 16] Device or resource busy: '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm' 01:43:12 ERROR - Return code: 1 01:43:12 WARNING - setting return code to 2 01:43:12 FATAL - Halting on failure while running ['mock_mozilla', '-r', 'mozilla-centos6-x86_64', '-q', '--cwd', '/builds/slave/m-in-l64-000000000000000000000/build/src/obj-firefox', '--unpriv', '--shell', '/usr/bin/env MOZ_AUTOMATION=1 "LESSOPEN=|/usr/bin/lesspipe.sh %s" TMOUT=86400 LOGNAME=cltbld USER=cltbld PATH=/tools/buildbot/bin:/usr/local/bin:/usr/lib64/ccache:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/tools/git/bin:/tools/python27/bin:/tools/python27-mercurial/bin:/home/cltbld/bin SYMBOL_SERVER_USER=ffxbld DISPLAY=:2 CCACHE_UMASK=002 LANG=en_US.UTF-8 CCACHE_HASHDIR= TERM=linux SHELL=/bin/bash MOZ_SIGNING_SERVERS=gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing4.srv.releng.scl3.mozilla.com:9110,gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing5.srv.releng.scl3.mozilla.com:9110,gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing6.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing6.srv.releng.scl3.mozilla.com:9110,dmgv2:mac-v2-signing7.srv.releng.scl3.mozilla.com:9110 SHLVL=1 G_BROKEN_FILENAMES=1 HISTSIZE=1000 SYMBOL_SERVER_PATH=/mnt/netapp/breakpad/symbols_ffx/ CCACHE_DIR=/builds/ccache HG_SHARE_BASE_DIR=/builds/hg-shared SYMBOL_SERVER_HOST=symbolpush.mozilla.org MAIL=/var/spool/mail/cltbld MOZ_BUILD_DATE=20160130013032 TINDERBOX_OUTPUT=1 "MOZ_SIGN_CMD=python /builds/slave/m-in-l64-000000000000000000000/build/tools/release/signing/signtool.py --cachedir /builds/slave/m-in-l64-000000000000000000000/build/signing_cache -t /builds/slave/m-in-l64-000000000000000000000/token -n /builds/slave/m-in-l64-000000000000000000000/nonce -c /builds/slave/m-in-l64-000000000000000000000/build/tools/release/signing/host.cert -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing4.srv.releng.scl3.mozilla.com:9110 -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing5.srv.releng.scl3.mozilla.com:9110 -H gpg:sha2signcode:osslsigncode:signcode:mar:jar:b2gmar:emevoucher:signing6.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing6.srv.releng.scl3.mozilla.com:9110 -H dmgv2:mac-v2-signing7.srv.releng.scl3.mozilla.com:9110" LC_ALL=C TOOLTOOL_HOME=/builds _=/tools/buildbot/bin/python LD_LIBRARY_PATH=/tools/gcc-4.3.3/installed/lib64 MOZ_SYMBOLS_EXTRA_BUILDID=linux64-mozilla-inbound HOSTNAME=bld-linux64-spot-1052.build.releng.usw2.mozilla.com SYMBOL_SERVER_SSH_KEY=/home/mock_mozilla/.ssh/ffxbld_rsa MOZBUILD_STATE_PATH=/builds/slave/m-in-l64-000000000000000000000/.mozbuild HISTCONTROL=ignoredups POST_SYMBOL_UPLOAD_CMD=/usr/local/bin/post-symbol-upload.py PWD=/builds/slave/m-in-l64-000000000000000000000 PROPERTIES_FILE=/builds/slave/m-in-l64-000000000000000000000/buildprops.json MOZ_CRASHREPORTER_NO_REPORT=1 CCACHE_COMPRESS=1 TOOLTOOL_CACHE=/builds/tooltool_cache python2.7 /builds/slave/m-in-l64-000000000000000000000/build/src/mach python /builds/slave/m-in-l64-000000000000000000000/build/src/config/printconfigsetting.py /builds/slave/m-in-l64-000000000000000000000/build/src/obj-firefox/dist/bin/application.ini App SourceStamp'] 01:43:12 FATAL - Running post_fatal callback... 01:43:12 FATAL - Exiting 2
Updated•8 years ago
|
Summary: Intermittent failure to running rmtree on Linux → Intermittent failure to running rmtree on Linux, "WARNING: Command failed. See logs for output.", "# umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/tmp/ccache", etc.
Comment hidden (Intermittent Failures Robot) |
Comment 2•8 years ago
|
||
catlee, coop: this seems to be more frequent (had 4 instances also today so far), can we get someone from releng to look at this ?
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
Comment 3•8 years ago
|
||
cc-ing the buildduty people so they can help. Based on the graph, it seems like this bug started re-occurring between March 7 and 8: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&startday=2016-01-01&tree=all We should start by looking for infra changes around that time frame. e.g. is this related to the new clobberer system :garbas is deploying? I'm not sure on the timing there. Also, the log in comment #0 is for a linux64 opt build, but are they all opt builds?
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
Comment 4•8 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #3) > Also, the log in comment #0 is for a linux64 opt build, but are they all opt > builds? https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&startday=2016-01-01&tree=all says opt and debug
Comment 5•8 years ago
|
||
If an infra failure doesn't have a bug which is suggested by treeherder, which this one didn't until the morning of the 8th, then you really don't know anything at all about its frequency. We star hundreds, perhaps thousands, of infra failures as just "infra" without using a bug number, every day. I restarred fx-team back to March 1st, so https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1244446&entireHistory=true&tree=trunk is a little more accurate now.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 14•8 years ago
|
||
Did some investigation here. Here's what I've found up until now: Faulty run: https://treeherder.mozilla.org/logviewer.html#?job_id=25983380&repo=mozilla-inbound Successful run: https://treeherder.mozilla.org/logviewer.html#?job_id=25993102&repo=mozilla-inbound Taking a look at the logs, it can be noticed that the failure from the first case is due to the fact that "umount -n" fails. 15:23:55 INFO - WARNING: Command failed. See logs for output. 15:23:55 INFO - # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm 15:23:55 INFO - WARNING: Command failed. See logs for output. 15:23:55 INFO - # umount -n /builds/mock_mozilla/mozilla-centos6-x86_64/root/builds/slave It then forcibly unmounts the files in question: 15:23:55 INFO - WARNING: Forcibly unmounting '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm' from chroot. 15:23:55 INFO - WARNING: Forcibly unmounting '/builds/mock_mozilla/mozilla-centos6-x86_64/root/builds/slave' from chroot. 15:23:55 INFO - Return code: 0 Even though it returns code 0, it seems that the issue is not solved at this point, as we have the following error below: 15:23:58 ERROR - OSError: [Errno 16] Device or resource busy: '/builds/mock_mozilla/mozilla-centos6-x86_64/root/dev/shm' While doing some research in order to better understand mounting/unmounting filesytems, I was wondering if we could use a lazy unmount ("umount -l") instead of forcing it.
Flags: needinfo?(coop)
Comment 15•8 years ago
|
||
vlad, alin: you might want to look here https://wiki.mozilla.org/ReleaseEngineering/Applications/Mock and at the source code to see if there is a retry to backoff and umount again
Comment hidden (Intermittent Failures Robot) |
Comment 17•8 years ago
|
||
Changing the component, as we had little luck investigating the issue. Maybe some other folks will have a better idea on how to solve this.
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Comment 18•8 years ago
|
||
Could we add some code to run lsof when we get into this state? Having the lsof output in the log will help us figure out what's holding the mount open.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 21•8 years ago
|
||
I did a try push that adds some debugging info here: https://hg.mozilla.org/try/rev/d6b582c03c066f30356a8bc43b94d834ccd7507c And according to the log[1], it looks like maybe the sccache server is still running. Maybe this is preventing unmounting the chroot dirs? http://archive.mozilla.org/pub/firefox/try-builds/catlee@mozilla.com-d6b582c03c066f30356a8bc43b94d834ccd7507c/try-linux64/try-linux64-bm78-try1-build9554.txt.gz
Flags: needinfo?(ted)
Flags: needinfo?(mh+mozilla)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 23•8 years ago
|
||
The sccache server is supposed to be killed by the postflight in build/sccache.mk, but that only happens after a successful build. But yes, running processes can very much prevent unmounting things. That's why these days it's better to use container-like things than chroots, (or at the very least cgroups).
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(mh+mozilla)
Comment 24•8 years ago
|
||
can we have mach clean up here? we're not going to be able to get off of mock (chroot) in the short term
Assignee | ||
Comment 25•8 years ago
|
||
You're better off using cgroups (Or "simply" lsof for that matter). mach can't know for sure all the things that may or may not require being killed. Today it's sccache. Tomorrow it might be something else. If the glibc in the mock environment was more recent, until bug 1274497 it would have been adb too.
Comment hidden (Intermittent Failures Robot) |
Comment 27•8 years ago
|
||
You probably already know, but mock can use systemd-nspawn containers since https://bugzilla.redhat.com/show_bug.cgi?id=1132762 - this probably doesn't help on Centos 6, though.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 31•8 years ago
|
||
this is spiking and now a frequent failure :(
Updated•8 years ago
|
Component: General Automation → Build Config
Product: Release Engineering → Core
QA Contact: catlee
Comment 32•8 years ago
|
||
I'm pretty sure this is due to the sccache not being killed properly. I think mach is the best place to fix this, since by the time mozharness gets back control, the mock environment is already in an inconsistent state.
Assignee | ||
Comment 33•8 years ago
|
||
Considering actual logs and why this is happening, I'm mystified that this didn't start happening earlier. way earlier.
Assignee: nobody → mh+mozilla
Assignee | ||
Comment 34•8 years ago
|
||
At this point, the only remaining part of the postflight for OSX universal builds dates back to bug 834228. Since then, many things have changed, one of them being that automation build steps have dependencies that can be expressed through make dependencies. While this is not directly related to bug 1244446, fixing this bug gets more complicated if postflight needs to happen before some of the automation build steps. Review commit: https://reviewboard.mozilla.org/r/56736/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/56736/
Attachment #8758482 -
Flags: review?(gps)
Attachment #8758483 -
Flags: review?(gps)
Assignee | ||
Comment 35•8 years ago
|
||
Killing the sccache background daemon is part of postflight_all, but in the current setup, postflight_all happens at the end of a "normal" build, but we run automation build steps after that. What happens then is that more compilations happen (gtests), which start sccache again, but there's nothing to kill sccache again once this is all done. Now that the OSX universal builds postflight is gone, it is not necessary for postflight_all to happen before the automation build steps. So ensure postflight_all scripts happen last. The downside of this change is that this now prevents sccache.log from being uploaded, but we should probably send processed data to the graph server instead. Review commit: https://reviewboard.mozilla.org/r/56738/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/56738/
Comment hidden (Intermittent Failures Robot) |
Comment 37•8 years ago
|
||
Comment on attachment 8758482 [details] MozReview Request: Bug 1244446 - Remove OSX universal builds postflight. r?gps https://reviewboard.mozilla.org/r/56736/#review53450 I'm not convinced the deleted `rm -rf` and `ln -s` in `flight.mk` aren't actually needed. But I would think this would break pretty clearly if it is. So I'll r+ and let automation be the arbiter of correct.
Attachment #8758482 -
Flags: review?(gps) → review+
Comment 38•8 years ago
|
||
Comment on attachment 8758483 [details] MozReview Request: Bug 1244446 - Run automation build steps before postflight_all. r?gps https://reviewboard.mozilla.org/r/56738/#review53460 You may want to check with ted to make sure he's ok losing sccache logs. Actually, you may want to run this series by him since he knows OS X universal builds a bit better than me. But otherwise LGTM.
Attachment #8758483 -
Flags: review?(gps) → review+
Assignee | ||
Updated•8 years ago
|
Attachment #8758482 -
Flags: review?(ted)
Assignee | ||
Updated•8 years ago
|
Attachment #8758483 -
Flags: review?(ted)
Assignee | ||
Comment 39•8 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #37) > Comment on attachment 8758482 [details] > MozReview Request: Bug 1244446 - Remove OSX universal builds postflight. > r?gps > > https://reviewboard.mozilla.org/r/56736/#review53450 > > I'm not convinced the deleted `rm -rf` and `ln -s` in `flight.mk` aren't > actually needed. But I would think this would break pretty clearly if it is. > So I'll r+ and let automation be the arbiter of correct. Try seems to agree they're not necessary, but I wouldn't be surprised if that breaks some random job that is not on try.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Attachment #8758482 -
Flags: review?(ted) → review+
Comment 42•8 years ago
|
||
Comment on attachment 8758482 [details] MozReview Request: Bug 1244446 - Remove OSX universal builds postflight. r?gps https://reviewboard.mozilla.org/r/56736/#review54292 Nice! Maybe we can get rid of client.mk soon...
Comment 43•8 years ago
|
||
Comment on attachment 8758483 [details] MozReview Request: Bug 1244446 - Run automation build steps before postflight_all. r?gps https://reviewboard.mozilla.org/r/56738/#review54294 Sad to put more logic in client.mk, but that logic in the mach command wasn't exactly fantastic.
Attachment #8758483 -
Flags: review?(ted) → review+
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 47•8 years ago
|
||
Pushed by mh@glandium.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/e89afb1745a1 Remove OSX universal builds postflight. r=gps,r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/1d2c46c8c7c7 Run automation build steps before postflight_all. r=gps,r=ted
Comment 48•8 years ago
|
||
sorry had to back this out since somehow win8x64 pgo failed after this push like https://treeherder.mozilla.org/logviewer.html#?job_id=29541401&repo=mozilla-inbound
Flags: needinfo?(mh+mozilla)
Comment 49•8 years ago
|
||
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/7ab897482e06 Backed out changeset 1d2c46c8c7c7 https://hg.mozilla.org/integration/mozilla-inbound/rev/ed9a7b145c3c Backed out changeset e89afb1745a1 for windows pgo bustage
Comment 50•8 years ago
|
||
Pushed by mh@glandium.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/e1bd07ec0b5b Remove OSX universal builds postflight. r=gps,r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/aa8cdba912d3 Run automation build steps before postflight_all. r=gps,r=ted
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(mh+mozilla)
Comment 52•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e1bd07ec0b5b https://hg.mozilla.org/mozilla-central/rev/aa8cdba912d3
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Updated•8 years ago
|
Flags: needinfo?(ted)
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Product: Core → Firefox Build System
You need to log in
before you can comment on or make changes to this bug.
Description
•