Closed Bug 884115 Opened 11 years ago Closed 11 years ago

Add timeouts to mozharness' urllib2.urlopen() requests

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jyeo)

Details

(Whiteboard: [mozharness])

Attachments

(2 files)

Python 2.6 and higher support specifying a timeout for urllib2.urlopen(), which determines how long a socket waits for a response before timing out and raising socket.timeout.

We should add timeouts to avoid mozharness hangs, which currently result in a generic buildbot timeout (which requires opening the full log before starring). Timeouts would also let us make the most of the retry wrapper that we already use to catch the urllib2.HTTPError & urllib2.URLError failure modes.
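For illustration, a minimal sketch of the idea, not the actual patch: pass a timeout to urlopen() and retry on socket.timeout as well as URLError (HTTPError is a subclass of URLError). The helper name fetch_with_timeout, its parameters, and the injectable opener argument are assumptions made up for this example; mozharness has its own retry wrapper.

```python
import socket
import time

try:  # Python 2, as used by mozharness at the time
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 equivalent
    from urllib.request import urlopen
    from urllib.error import URLError


def fetch_with_timeout(url, timeout=30, attempts=3, sleeptime=0,
                       opener=urlopen):
    """Hypothetical helper: fetch url, retrying on network failures.

    `opener` defaults to urlopen and is only injectable so the retry
    logic can be exercised without touching the network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Without timeout=..., a stalled server can block forever.
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except (URLError, socket.timeout) as e:
            # HTTPError is a URLError subclass, so it is retried too.
            last_error = e
            if attempt < attempts:
                time.sleep(sleeptime)
    # Every attempt failed: re-raise the last failure for the caller.
    raise last_error
```

With a timeout in place, a stalled download fails quickly with a specific exception that the retry logic can act on, rather than hanging until buildbot kills the job.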
Attached patch Patch v1 (deleted)
Adds timeouts to mozharness' urllib2.urlopen() calls, makes us auto retry for the socket.timeout failure mode during _download_file(), and cleans up mouse_and_screen_resolution.py::wfetch()'s retry handling.
Attachment #763896 - Flags: review?(aki)
Note: this requires Python 2.6 or higher. Is that present on all the machines on which we run mozharness?
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Note requires Python 2.6 or higher - is this present on all the machines for
> which we run mozharness?

I seem to remember we were lagging on a 2.5.1 install on some platform when we first rolled out mozharness unittests.  With the attempts to standardize on Python 2.7[.3?], that may no longer be the case.  We should probably verify.
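One way to verify on a given machine is to check the interpreter version directly. This is only a sketch; the function name supports_urlopen_timeout is made up for illustration and is not part of mozharness.

```python
import sys


def supports_urlopen_timeout(version_info=sys.version_info):
    """Return True if this Python accepts urlopen(..., timeout=...).

    The timeout argument was added to urllib2.urlopen() in Python 2.6.
    """
    return tuple(version_info[:2]) >= (2, 6)


if __name__ == "__main__":
    print("Python %s: timeout supported: %s"
          % (sys.version.split()[0], supports_urlopen_timeout()))
```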
Comment on attachment 763896 [details] [diff] [review]
Patch v1

Might be worth baking a bit on Cedar (default) before merging to every other branch (production branch).
Attachment #763896 - Flags: review?(aki) → review+
(In reply to Aki Sasaki [:aki] from comment #3)
> I seem to remember we were lagging on a 2.5.1 install on some platform when
> we first rolled out mozharness unittests.  With the attempts to standardize
> on Python 2.7[.3?], that may no longer be the case.  We should probably
> verify.

A number of bug 724191's dependants have been fixed. Inspecting a bunch of logs shows a few places still using older versions, but they only seem to be talos machines, so this shouldn't affect mozharness for now :-)
(In reply to Aki Sasaki [:aki] from comment #4)
> Might be worth baking a bit on Cedar (default) before merging to every other
> branch (production branch).

Agreed. Pushed to cedar to generate a recent set of builds, then pushed this patch:
https://hg.mozilla.org/build/mozharness/rev/7a39cb9045f9

Have requested a dep set of builds after that.
(In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> Have requested a dep set of builds after that.

In fact just did another push rather than requesting more builds via buildapi to make it easier to distinguish them on TBPL.
Attachment #763896 - Flags: checked-in+
In production.
Thank you :-)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We seem to not be catching socket.error somewhere.

14:05:46     INFO - #####
14:05:46     INFO - ##### Running download-and-extract step.
14:05:46     INFO - #####
14:05:46     INFO - mkdir: C:\slave\test\build
14:05:46     INFO - Downloading http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1371671157/firefox-24.0a1.en-US.win32.tests.zip to C:\slave\test\build\firefox-24.0a1.en-US.win32.tests.zip
14:05:46     INFO - retry: Calling <bound method DesktopUnittest._download_file of <__main__.DesktopUnittest object at 0x00E66EB0>> with args: ('http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1371671157/firefox-24.0a1.en-US.win32.tests.zip', 'C:\\slave\\test\\build\\firefox-24.0a1.en-US.win32.tests.zip'), kwargs: {}, attempt #1
Traceback (most recent call last):
  File "scripts/scripts/desktop_unittest.py", line 374, in <module>
    desktop_unittest.run()
  File "C:\slave\test\scripts\mozharness\base\script.py", line 821, in run
    self._possibly_run_method(method_name, error_if_missing=True)
  File "C:\slave\test\scripts\mozharness\base\script.py", line 780, in _possibly_run_method
    return getattr(self, method_name)()
  File "scripts/scripts/desktop_unittest.py", line 275, in download_and_extract
    super(DesktopUnittest, self).download_and_extract(target_unzip_dirs=target_unzip_dirs)
  File "C:\slave\test\scripts\mozharness\mozilla\testing\testbase.py", line 237, in download_and_extract
    self._download_test_zip()
  File "C:\slave\test\scripts\mozharness\mozilla\testing\testbase.py", line 162, in _download_test_zip
    error_level=FATAL)
  File "C:\slave\test\scripts\mozharness\base\script.py", line 228, in download_file
    error_level=error_level,
  File "C:\slave\test\scripts\mozharness\base\script.py", line 440, in retry
    status = action(*args, **kwargs)
  File "C:\slave\test\scripts\mozharness\base\script.py", line 177, in _download_file
    block = f.read(1024 ** 2)
  File "c:\mozilla-build\python27\lib\socket.py", line 380, in read
    data = self._sock.recv(left)
  File "c:\mozilla-build\python27\lib\httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "c:\mozilla-build\python27\lib\socket.py", line 380, in read
    data = self._sock.recv(left)
socket.error: [Errno 10035] A non-blocking socket operation could not be completed immediately
program finished with exit code 1
elapsedTime=47.716000
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ed, are you still working on this?
Flags: needinfo?(emorley)
Hi, sorry, I was on PTO on Friday and away over the weekend.

I had a quick dig around at the time of the backout, but struggled to find anything useful in the Python docs. I'd like to come back to this in the not so distant future if it's not been fixed by someone else since - but not sure on the timeframe - so happy for someone else to take it in the meantime :-)
Flags: needinfo?(emorley)
I applied edmorley's patch and ran the following experiment: I disconnected the network cable on a Windows machine and on a Linux machine during the download_and_extract step, and found that the winapi and libc implement sockets differently (obviously).

Disconnecting the cable on a Windows machine causes mozharness to die because of an uncaught socket.error. Disconnecting the cable on a Linux machine causes the timeout counter in socket.py to start counting and then raise the socket.timeout exception. I think we should try catching socket.error as the last case.

I will try pushing this to ash-mozharness just in case errno 10035 happens again.
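A rough sketch of what "catching socket.error as the last case" could look like for a block-wise download. The names read_in_blocks and RetryableDownloadError are hypothetical and only for illustration; the real change lives in mozharness' _download_file(). Since socket.timeout is a subclass of socket.error (Python 2.6+), the more specific handlers have to come first.

```python
import socket

try:  # Python 2, as used by mozharness at the time
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 equivalent
    from urllib.request import urlopen
    from urllib.error import URLError


class RetryableDownloadError(Exception):
    """Hypothetical marker exception that a retry wrapper could catch."""


def read_in_blocks(url, timeout=30, block_size=1024 ** 2, opener=urlopen):
    """Read url in blocks, mapping network failures to one exception type."""
    chunks = []
    try:
        f = opener(url, timeout=timeout)
        try:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                chunks.append(block)
        finally:
            f.close()
    except URLError as e:
        # Covers HTTPError as well, which subclasses URLError.
        raise RetryableDownloadError("URL error: %s" % e)
    except socket.timeout as e:
        # What Linux raised in the cable-pull experiment.
        raise RetryableDownloadError("timed out: %s" % e)
    except socket.error as e:
        # Last case: e.g. errno 10035 (WSAEWOULDBLOCK) seen on Windows.
        raise RetryableDownloadError("socket error: %s" % e)
    return b"".join(chunks)
```

Mapping all three failure modes to a single exception type means the retry wrapper only needs to know about one thing to retry on, whichever platform the failure came from.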
Assignee: emorley → yshun
Attachment #783176 - Flags: review?(aki)
Thank you for picking this up! :-)
Attachment #783176 - Flags: review?(aki) → review+
Seems to be working on Ash. I don't see any socket.error :-/ https://tbpl.mozilla.org/?tree=Ash
Good to land on mozharness now? :-)
Attachment #783176 - Flags: checked-in+
In production
I don't see any issues related to this on our production mozharness jobs. FIXED. :)
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → Mozharness