Closed
Bug 1501113
Opened 6 years ago
Closed 6 years ago
race condition: concurrent partials generation with a shared cache
Categories
(Release Engineering :: Release Automation: Updates, enhancement)
Release Engineering
Release Automation: Updates
Tracking
(firefox-esr60 fixed, firefox64 fixed, firefox65 fixed)
RESOLVED
FIXED
People
(Reporter: mozilla, Assigned: sfraser)
Details
Attachments
(2 files)
This caused UV 1/12 to fail in both linux64 and win64 in Fx 64.0b3. To fix, we reran partials on down; otherwise it would have required a build 2.
Reporter | ||
Comment 1•6 years ago
|
||
This is the list of tasks I had to rerun. I accidentally reran linux64 be partials, so I added those.
Reporter | ||
Comment 2•6 years ago
|
||
We could serialize this by awaiting these tasks [1] rather than appending them to an asyncio.gather() call.
Alternately, we could somehow make the cache unshared or somehow only update it with complete files (are we hitting mid-download files, or mid-generation, or ?).
The two UV tasks that showed the bustage are
+ update-verify win64 1/12 YETuIdBuQpuvb1lXUCbHxg
+ update-verify linux64 1/12 EtMs1b3vT1i7hhZ6kbtsgA
The two partials tasks that created the bustage are
+ win64 be partials C9mYTX4kTWe5lX8y8-NnLw
+ linux64 ar partials EkOgMi-NSXSHWjDqCV65Jg
[1] https://searchfox.org/mozilla-central/source/taskcluster/docker/funsize-update-generator/scripts/funsize.py#428
Comment 3•6 years ago
|
||
Reruns worked OK, except for Balrog bug 1501167 requiring a manual fix, and bug 1501131 to purge CDNs.
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #2)
> We could serialize this by awaiting these tasks [1] rather than appending
> them to an asyncio.gather() call.
> Alternately, we could somehow make the cache unshared or somehow only update
> it with complete files (are we hitting mid-download files, or
> mid-generation, or ?).
>
Do we know which file inside the mar was created using an incorrect cached file? I didn't think S3 had race conditions like that, but if it does we should definitely reorganise.
I'm not clear at this point how the wrong size ended up in balrog, as funsize just uses os.path.getsize() on the finished file, and puts that in the manifest. I've checked the logs for all partials tasks in that group for 34192776 and none of them logged it, at least on the first runs.
Reporter | ||
Comment 5•6 years ago
|
||
I don't think it's an s3 race condition; it looks like a working directory race condition in the docker container.
`be` appears to be diffing against msvcp140.dll at the same time as bg is patching it?
$ grep msvcp140.dll bg.log
...
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: diffing "msvcp140.dll"
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: '/tmp/tmp0ibcyk7k/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: /tmp/tmp0ibcyk7k/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 18:59:53,478 - DEBUG - target-63.0b12.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: diffing "msvcp140.dll"
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmp6sngxofn/to.work/msvcp140.dll.patch.xz'
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: Successful retrieved /tmp/tmp6sngxofn/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 18:59:54,788 - DEBUG - target-63.0b13.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmpgcwzn26c/to.work/msvcp140.dll.patch.xz'
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: Successful retrieved /tmp/tmpgcwzn26c/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 19:00:16,904 - DEBUG - target-63.0b14.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
$ grep msvcp140.dll be.log
...
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0' -> '/tmp/tmpv5lxp3fz/to.work/msvcp140.dll.patch.xz'
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: Successful retrieved /tmp/tmpv5lxp3fz/to.work/msvcp140.dll.patch.xz from local cache.
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: file "msvcp140.dll" found in funsize, diffing skipped
2018-10-22 19:00:04,296 - DEBUG - target-63.0b14.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: '/tmp/tmp51f3_u08/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: /tmp/tmp51f3_u08/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 19:00:04,429 - DEBUG - target-63.0b12.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: diffing "msvcp140.dll"
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: '/tmp/tmpxqgi7nda/to.work/msvcp140.dll.patch.xz' -> '/tmp/fs-cache/830840918d9c262c4763b1a4364554333bf790c05f7baec994bb38d5210d203001654f1a50a0ae9b54e8fe6bee9b920a5f76b428ec36c3a126ce83d3999e50ec/2bb5ada10764e4ea527e1f8e706ceba8f3fd25704f494e1b900b8c9a24a954a1ee069ebcff8899d0e1bec92af2025e0a58b4b7745f72b1313ba27b93b26be5e0'
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: /tmp/tmpxqgi7nda/to.work/msvcp140.dll.patch.xz saved on local cache.
2018-10-22 19:00:04,564 - DEBUG - target-63.0b13.partial.mar: patch "msvcp140.dll.patch" "msvcp140.dll"
Reporter | ||
Comment 6•6 years ago
|
||
More context, since I wasn't the one that found this:
<•nthomas> This is what I get trying to unpack win64 be partial manually https://irccloud.mozilla.com/pastebin/PU37Q6FV/
15:01 N<•nthomas> msvcp140.dll definitely changes between 63.0b14 and 64.0b3, fwiw
15:03 A<•aki> i looked at the be partial task for win64 -- the partial task reruns, but the sha for the partial artifact matches betwen run1 and the signing task's cot log
15:04 N<•nthomas> yeah, the first run appeared to abort early enough that it was still downloading and hadn't generated yet
15:09 N<•nthomas> do we generate the partials in parallel ?
15:10 A<•aki> the partials tasks all queue and are claimed in parallel, if that's what you mean
15:10 N<•nthomas> actually meant within each task
15:11 N<•nthomas> this is from grepping the partial generation logs https://irccloud.mozilla.com/pastebin/dCuCU2Au/
15:11 N<•nthomas> bg does what I'd blindly expect, diff and cache the first partial, then use the cache for the subsequent too
15:11 N<•nthomas> be finds something in the cache the first time, then generates diffs the subsequent two
15:12 A<•aki> looks like yes, we await multiple concurrent partial generation coroutines in taskcluster/docker/funsize-update-generator/scripts/funsize.py
15:13 N<•nthomas> and all the log timestamps are very similar in be
15:13 N<•nthomas> raaaaaaaaaaaaace
Assignee | ||
Comment 7•6 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #5)
> I don't think it's an s3 race condition; it looks like a working directory
> race condition in the docker container.
> `be` appears to be diffing against msvcp140.dll at the same time as bg is
> patching it?
If they're separate tasks they should be on separate instances, and only sharing the S3 cache. A race within an instance is definitely there, though. A quick fix would be to turn off the local cache, so we can try that. S3 will get used anyway.
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → sfraser
Reporter | ||
Comment 8•6 years ago
|
||
Yeah, aiui this is a single funsize.py run in a single container, with multiple partials run concurrently via asyncio.
Comment 9•6 years ago
|
||
I agree. We're not using the S3 caches for releases so it's a local cache at /tmp/fs-cache/. It's hard to trust the timestamps for be in comment #5 as something was found in the cache before it was written, but the two writes are only 140ms apart so it seems very likely there was a write collision.
Assignee | ||
Comment 10•6 years ago
|
||
Comment 11•6 years ago
|
||
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/eaf542a7bdb1
Disable local partials caching to work around races r=aki
Comment 12•6 years ago
|
||
bugherder |
Updated•6 years ago
|
Whiteboard: [checkin-needed-beta]
Comment 13•6 years ago
|
||
bugherder uplift |
status-firefox64:
--- → fixed
Updated•6 years ago
|
Whiteboard: [checkin-needed-beta]
Comment 14•6 years ago
|
||
bugherder uplift |
status-firefox-esr60:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•