Closed Bug 1248299 Opened 9 years ago Closed 9 years ago

Frequent antivirus job failures with IncompleteRead after several retries

Categories

(Release Engineering :: Release Automation: Other, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

References

Details

Eg: for 45.0b5 we had errors in three separate runs 09:29:14 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/update/mac/cy/firefox-45.0b5.complete.mar to cache/update/mac/cy/firefox-45.0b5.complete.mar 09:29:14 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/update/mac/cy/firefox-45.0b5.complete.mar>> with args: (u'cache/update/mac/cy/firefox-45.0b5.complete.mar',), kwargs: {}, attempt #1 09:32:37 INFO - ... attempt #2 09:33:16 INFO - ... attempt #3 09:33:32 INFO - ... attempt #4 09:33:56 INFO - ... attempt #5 09:33:56 INFO - retry: Giving up ... 10:23:28 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/update/win32/ml/firefox-45.0b5.complete.mar to cache/update/win32/ml/firefox-45.0b5.complete.mar 10:23:28 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/update/win32/ml/firefox-45.0b5.complete.mar>> with args: (u'cache/update/win32/ml/firefox-45.0b5.complete.mar',), kwargs: {}, attempt #1 10:26:50 INFO - .... attempt #2 10:28:40 INFO - ... attempt #3 10:28:54 INFO - ... attempt #4 10:29:15 INFO - ... attempt #5 10:29:15 INFO - retry: Giving up ... 10:51:22 INFO - Downloading pub/firefox/candidates/45.0b5-candidates/build1/win32/fr/Firefox Setup 45.0b5.exe to cache/win32/fr/Firefox Setup 45.0b5.exe 10:51:22 INFO - retry: Calling <bound method Key.get_contents_to_filename of <Key: net-mozaws-prod-delivery-firefox,pub/firefox/candidates/45.0b5-candidates/build1/win32/fr/Firefox Setup 45.0b5.exe>> with args: (u'cache/win32/fr/Firefox Setup 45.0b5.exe',), kwargs: {}, attempt #1 10:52:47 INFO - ... attempt #2 10:52:54 INFO - ... attempt #3 10:53:03 INFO - ... attempt #4 10:53:16 INFO - ... attempt #5 10:53:16 INFO - retry: Giving up ... We've also had issues on other releases, although usually one retry is sufficient. We probably have to retry harder.
The explicit parameters on the retry call are at http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/scripts/release/antivirus.py#l156 namely sleeptime=30, max_sleeptime=150 on the call, and defaults are attempts=5, sleepscale=1.5, jitter=1 I was just going to increase the number of attempts, but .... The retry code is in http://hg.mozilla.org/mozilla-central/file/default/python/redo/redo/__init__.py, so we should be waiting sleeptime, sleeptime * sleep_scale, sleeptime * sleep_scale**2, etc, with plus/minus 1 second jitter on each iteration. So roughly speaking 30s, 45s, etc. This is definitely not happening between attempts 3, 4, and 5. Ben, are you aware of any problems using redo with multiprocessing ? It might be that as we finish other downloads that free workers are picking retreies, maybe *waves hands wildly*.
Flags: needinfo?(bhearsum)
(In reply to Nick Thomas [:nthomas] from comment #1) > The explicit parameters on the retry call are at > > http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/ > scripts/release/antivirus.py#l156 > namely > sleeptime=30, max_sleeptime=150 > on the call, and defaults are > attempts=5, sleepscale=1.5, jitter=1 > > I was just going to increase the number of attempts, but .... > > The retry code is in > http://hg.mozilla.org/mozilla-central/file/default/python/redo/redo/__init__. > py, so we should be waiting sleeptime, sleeptime * sleep_scale, sleeptime * > sleep_scale**2, etc, with plus/minus 1 second jitter on each iteration. So > roughly speaking 30s, 45s, etc. This is definitely not happening between > attempts 3, 4, and 5. > > Ben, are you aware of any problems using redo with multiprocessing ? It > might be that as we finish other downloads that free workers are picking > retreies, maybe *waves hands wildly*. I'm not aware of any issues, but that doesn't mean much! It looks like to me like sleeptime is all over the place...we've got: 1->2: a few minutes 2->3: 39s 3->4: 16s 4->5: 24s 1->2: a few minutes 2->3: 110s 3->4: 14s 4->5: 21s 1->2: 85s 2->3: 7s 3->4: 9s 4->5: 13s So sleeptime is all over the place, sometimes lower than the original, and sometimes lower than the previous attempt. That's definitely funky. I also noticed that despite other debug messages being logged, the "sleeping for...." ones are not. This happens after retrier yields, which makes me wonder if there's something funky about doing work in a function after yielding, maybe only when using multiprocessing?
Flags: needinfo?(bhearsum)
I just realized that a ThreadPool is being used here ("multiprocessing" made me think many processes), which makes me wonder if there's a GIL related issue where waiting threads are unable to do anything while another one downloads. That might explain having many retries, not sure if it explains the funky logging or sleep times. I tried to reproduce in a smaller test case and couldn't, but I'm not sure if time.sleep() is equivalent to waiting on a read(), or if I missed other key things in the real case: https://gist.github.com/bhearsum/690ec4c880db2b45bfe2
Depends on: 1234289
so I think part of the confusion is that m-c antivirus retry timeout was bumped by me but I failed to uplift it immediately to beta: http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/scripts/release/antivirus.py#l156 http://hg.mozilla.org/releases/mozilla-beta/file/default/testing/mozharness/scripts/release/antivirus.py#l158 does your findings line up with what is on beta right now?
Fresh data on this for 38.7.0esr on three successive runs so far: 1st run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/4/steps/run_script/logs/stdio) 21:55:14 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/linux-x86_64/ca/firefox-38.3.0esr-38.7.0esr.partial.mar to cache/update/linux-x86_64/ca/firefox-38.3.0esr-38.7.0esr.partial.mar 21:57:05 INFO - .... attempt #2 21:57:50 INFO - ..... attempt #3 21:58:58 INFO - ..... attempt #4 22:00:40 INFO - ..... attempt #5 22:00:40 INFO - retry: Giving up .... 2nd run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/5/steps/run_script/logs/stdio): 23:00:53 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/win32/da/firefox-38.6.1esr-38.7.0esr.partial.mar to cache/update/win32/da/firefox-38.6.1esr-38.7.0esr.partial.mar 23:02:43 INFO - .... attempt #2 23:03:28 INFO - .... attempt #3 23:04:36 INFO - .... attempt #4 23:06:16 INFO - .... attempt #5 23:06:16 INFO - retry: Giving up ... (in the same file also) 23:01:47 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/win32/gl/firefox-38.7.0esr.complete.mar to cache/update/win32/gl/firefox-38.7.0esr.complete.mar 23:03:37 INFO - .... attempt #2 23:04:22 INFO - .... attempt #3 23:05:29 INFO - .... attempt #4 3rd run (via http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/release-mozilla-esr38-firefox_antivirus/builds/6/steps/run_script/logs/stdio): 23:36:04 INFO - Downloading pub/firefox/candidates/38.7.0esr-candidates/build1/update/linux-x86_64/km/firefox-38.7.0esr.complete.mar to cache/update/linux-x86_64/km/firefox-38.7.0esr.complete.mar 23:37:56 INFO - .... attempt #2 23:38:42 INFO - .... attempt #3 23:39:49 INFO - .... attempt #4 23:41:31 INFO - .... attempt #5 23:41:31 INFO - retry: Giving up ... This proves two things: 1) jlund's patch uplifted to beta and all the way to esr38 is working smooth. There is a gap of at least 30/45 seconds between retries so I assume that earlier in the thread the problem was just an async between the m-c -> m-b uplift and the time of the builds as jlund suggested. 2) on the other hand - should we retry even harder on this? ... usually one retrigger suffices but in this particular case I'm retriggering for the forth time.
The times still seem a bit off (not increasing by ~50% each time) but I'm not sure how to make any progress here. When I was setting this up there was definitely a speedup when using a threadpool over sequential downloads, so the GIL (comment #3) can't be hard blocking.
Assignee: nthomas → nobody
Going to WFM this on the basis that we haven't had issues recently, and release promotion has reimplemented this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.