Closed Bug 1353509 Opened 8 years ago Closed 7 years ago

Autophone - autophone process is hanging intermittently

Categories

(Testing Graveyard :: Autophone, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(10 files)

After recovering from the network outage for the devices and restarting, autophone-3 is hanging on status and other control messages. I have attempted to restart it several times and it has continued to lock up. I have force-stopped it by killing the processes and clearing the buildcache in an attempt to clear the problem.
I finally deleted the pulse queues and restarted. It appears to be working now but I don't understand the cause or why this would fix the hang. I enabled debug output and it appeared to be hanging during/after the get_build_data: 2017-04-04 13:35:21,843|17879|Thread-5|root|DEBUG|get_build_data: {'changeset': u'https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e', 'repo': u'autoland', 'date': datetime.datetime(2017, 4, 3, 14, 50, 14, tzinfo=<UTC>), 'build_type': 'opt', 'sdk': u'api-15', 'nightly': True, 'url': 'https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk', 'abi': u'arm', 'builder_type': 'taskcluster', 'id': '20170403145014', 'platform': u'android-api-15', 'changeset_dirs': ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], 'revision': u'0b45ef5f7702377bded5a67c6af0a713eccfc41e'} 2017-04-04 13:35:21,865|17879|Thread-5|root|DEBUG|BuildMetadata: url: https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk, directory: builds/aHR0cHM6Ly9xdWV1ZS50YXNrY2x1c3Rlci5uZXQvdjEvdGFzay9ZZ2V5YUVPV1NxQzVLa1B4aTBTZ3lBL3J1bnMvMC9hcnRpZmFjdHMvcHVibGljL2J1aWxkL3RhcmdldC5hcGs=, tree: autoland, buildid: 20170403145014, revision: 0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset: https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset_dirs: ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], app_name: org.mozilla.fennec, version: 55.0a1, build_type: opt, treeherder_url: https://treeherder.mozilla.org, abi: arm, sdk: api-15, nightly: True, platform: android-api-15, builder_type: taskcluster
autophone-1 and autophone-2's pulse queues were deleted before the servers could process the outstanding messages which accumulated during the outage. autophone-3's pulse queue was not deleted, however, since it hadn't grown as fast as the others. Perhaps that is the root cause, as autophone-3 appears to have been working properly since its queue was deleted. Marking fixed for now until proven otherwise.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1352892
This is not resolved. Overnight, each server has encountered periods where it stopped consuming the pulse queues. Fortunately, they recovered but this issue is ongoing.
Severity: normal → blocker
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Summary: Autophone - autophone-3 is hanging on startup → Autophone - autophone process is hanging intermittently
Couldn't figure out an easy way to capture the message summaries other than taking a screenshot.
Unfortunately, we updated the test manifests at the same time as the office network maintenance, and I think this was conflated with the network change. I believe we are experiencing a deadlock due to the changes in the test manifests and device assignments. Removing bug 1349904.
No longer blocks: 1352892
Bug 1353760 made the autophone production queues unbounded. This allowed us to keep processing even though the total number of messages temporarily exceeded 20,000. All of the production queues have now been processed and are back to normal values. Therefore, I believe we can say that the pure processing of pulse messages may be slow, but it does not deadlock on its own. I did not issue any ap commands to the processes during this time. There was still the previous situation where an ap command, which contacts the autophone process and sends/receives messages from the server process, would appear to hang. This may be due to a blocking HTTP request to one of our other services such as hg, Treeherder, or Taskcluster, and I just didn't wait long enough for it to resolve. I'm investigating that possibility. One other thing: we are using the default pyamqp transport, which uses the pure-Python amqp library. librabbitmq is reportedly a high-performance transport written in C. Perhaps I could improve performance by installing librabbitmq; it will be used if available, falling back to pyamqp if it is not.
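For reference, the transport selection in Kombu roughly works like the sketch below (the broker URL, exchange, and queue names are placeholders, not Autophone's actual configuration): with the generic 'amqp' transport, Kombu prefers librabbitmq when it is importable and falls back to the pure-Python py-amqp otherwise.

    # Sketch only: placeholder broker URL, exchange and queue names.
    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/taskcluster-queue/v1/task-completed', type='topic')
    queue = Queue('queue/autophone/jobs', exchange=exchange, routing_key='#',
                  durable=True)

    # 'amqp://' lets Kombu choose librabbitmq if installed; 'pyamqp://' forces
    # the pure-Python transport.
    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        # Should report kombu.transport.librabbitmq when the C transport is
        # selected, kombu.transport.pyamqp otherwise.
        print(type(conn.transport).__module__)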
Attached file hang.log (deleted) —
This is happening again on autophone-3. It began around 4:15 AM PDT this morning, for some unknown reason, when we stopped consuming pulse messages. We are not in danger of losing our queues now that they are unbounded. Looking at the tail of the logs, it appears there is a deadlock with the workers attempting to send status messages to the parent process.
The parent process is constantly consuming about 5% CPU while the children are mostly idle. I killed the parent process and restarted.
Attached patch bug-1353509-v1.patch (deleted) — Splinter Review
Attachment #8855892 - Flags: review?(jmaher)
This patch just removes some platforms which we no longer support or don't test. This reduces the number of bindings we have for the exchanges/queues, but is not meant to fix the underlying issue. Might as well get it in now while we continue to investigate.
Comment on attachment 8855892 [details] [diff] [review] bug-1353509-v1.patch Review of attachment 8855892 [details] [diff] [review]: ----------------------------------------------------------------- win/win!
Attachment #8855892 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/55d214bb3bfade38dc1b11cf1c136cffd62aa030 Leaving open for the underlying issue. Deployed 2017-04-08 17:45.
This has been behaving much better since the removal of some of the platforms, but we started having issues again today with the queues growing overly large. They recovered temporarily but the queues remained too large. autophone-1 was effectively hung and I rebooted it, but autophone-2 and autophone-3 were responding to ap commands though their queues were also too large for comfort.
Actually, the change to the bindings had no effect, since the queues are now durable and the bindings are fixed until I delete them. This may also be associated with being DoSed via try. I'm investigating.
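To illustrate why (a sketch with placeholder names; the real broker URL, exchange, and queue names differ): once a durable queue and its binding are declared on the broker, they persist there until we explicitly unbind or delete them, so removing a platform from our configuration doesn't touch the existing binding.

    # Sketch only: placeholder broker URL, exchange, queue and routing key.
    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/taskcluster-queue/v1/task-completed', type='topic')

    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        queue = Queue('queue/autophone/jobs', exchange=exchange,
                      routing_key='old.platform.#', durable=True)(conn.default_channel)
        queue.declare()  # the queue and its binding now live on the broker

        # Dropping the routing key from our config changes nothing server-side;
        # the old binding stays until we unbind it ...
        queue.unbind_from(exchange, routing_key='old.platform.#')
        # ... or delete the queue outright:
        # queue.delete()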
Attached patch bug-1353509.patch (deleted) — Splinter Review
Attachment #8860013 - Flags: review?(jmaher)
Comment on attachment 8860013 [details] [diff] [review] bug-1353509.patch Review of attachment 8860013 [details] [diff] [review]: ----------------------------------------------------------------- nice to see durable queues being used
Attachment #8860013 - Flags: review?(jmaher) → review+
autophone-2 began hanging and accumulating pulse messages this morning. It appears it began after:

2017-04-20 10:00:02,007|957|PulseMonitorThread|root|ERROR|Unable to open https://treeherder.mozilla.org/api/jobdetail/?repository=autoland&job_id=91626182
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/utils.py", line 53, in get_remote_text
    conn = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
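One mitigation to consider (a sketch, not the current utils.get_remote_text; the retry count, delay, and timeout values are illustrative): give the request an explicit timeout and retry transient URLErrors a couple of times so a single SSL EOF doesn't wedge the PulseMonitorThread.

    # Sketch only: a retrying variant of a remote-text fetch; not the shipped code.
    import time
    import urllib2

    def get_remote_text_with_retry(url, attempts=3, timeout=30, delay=5):
        req = urllib2.Request(url)
        for attempt in range(attempts):
            try:
                conn = urllib2.urlopen(req, timeout=timeout)
                try:
                    return conn.read()
                finally:
                    conn.close()
            except urllib2.URLError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)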
Now autophone-3 is hung up with repeated 2017-04-21 08:05:51,226|920|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=93298011 missing keys: set(['config_file', 'chunk', 'build_url', 'builder_type']) job: {u'submit_timestamp': 1492775164, u'build_system_type': u'taskcluster', u'machine_name': u'i-0492d2009f4e5d6e6', u'job_group_symbol': u'tc-M', u'logs': [{u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/logs/live_backing.log', u'name': u'builds-4h'}, {u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/test_info/browser-chrome-chunked_errorsummary.log', u'name': u'errorsummary_json'}], u'job_group_name': u'Mochitests executed by TaskCluster', u'platform_option': u'opt', u'job_type_description': u'', u'result_set_id': 195226, u'build_platform_id': 215, u'result': u'success', u'machine_platform_os': u'-', u'push_id': 195226, u'ref_data_name': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'machine_platform_architecture': u'-', u'job_type_id': 33664, u'build_platform': u'windows7-32-vm', u'running_eta': 752, u'job_type_name': u'test-windows7-32-vm/opt-mochitest-browser-chrome-2', u'id': 93298011, u'platform': u'windows7-32-vm', u'state': u'completed', u'end_timestamp': 1492776487, u'build_os': u'-', u'who': u'fliu@mozilla.com', u'taskcluster_metadata': {u'task_id': u'MwddINb-R9iguE0-lyjp1w', u'retry_id': 0}, u'option_collection_hash': u'102210fe594ee9b33d82058545b1ed14f4c8206e', u'failure_classification_id': 1, u'autoclassify_status': u'crossreferenced', u'job_type_symbol': u'bc2', u'reason': u'scheduled', u'last_modified': u'2017-04-21T12:08:08.122351', u'tier': 2, u'job_coalesced_to_guid': None, u'job_guid': u'33075d20-d6fe-47d8-a0b8-4d3e9728e9d7/0', u'start_timestamp': 1492775644, u'build_architecture': u'-', u'job_group_description': u'', u'signature': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'job_group_id': 257, u'resource_uri': u'/api/project/try/jobs/93298011/'} which isn't even an autophone job. https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f7336ed7ed2eee0d144cb9ac762e92cfc2c2726
All three job action queues are growing this morning. Looking at the logs, we are getting lots of:

2017-04-26 05:00:55,498|921|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=94342282 missing keys: set([

These are for jobs which do not belong to us, and we appear not to be processing them quickly enough. Perhaps we don't need to log as much here.
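The patch below presumably amounts to something along the lines of this sketch (the function and variable names mirror the log output, not the actual code): demote the per-job warning to debug so jobs that aren't ours stop flooding the log.

    # Sketch only: illustrative names; the real get_treeherder_privatebuild_info differs.
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    REQUIRED_KEYS = set(['config_file', 'chunk', 'build_url', 'builder_type'])

    def check_privatebuild_info(url, job):
        """Return True if the job detail has the keys Autophone needs."""
        missing = REQUIRED_KEYS - set(job)
        if missing:
            # Previously logged at WARNING; at DEBUG, non-Autophone jobs no
            # longer flood the production logs.
            logger.debug('get_treeherder_privatebuild_info: %s missing keys: %s',
                         url, missing)
            return False
        return True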
Make this a debug-only message.
Attachment #8861917 - Flags: review?(jmaher)
Comment on attachment 8861917 [details] [diff] [review] bug-1353509-privatebuild-v1.patch Review of attachment 8861917 [details] [diff] [review]: ----------------------------------------------------------------- great
Attachment #8861917 - Flags: review?(jmaher) → review+
autophone-1 hung this morning. It appeared to be downloading and installing builds on the workers. The log messages weren't informative. I killed the process and restarted.
This autophone-1 hang has persisted across reboots and a selective delete of a single job. I've enabled DEBUG logging and started with a fresh jobs.sqlite database (saving the old version for later resubmissions). Hopefully we'll catch up with pulse before too long.
This appears to be related to queuing up the Treeherder messages in the database.

2017-05-04 12:17:44,220|8990|PulseMonitorThread|root|DEBUG|new_job: worker phoneid pixel-05
2017-05-04 12:17:44,290|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder.queue_request: {'endpoint_base': u'jobs', 'data': [<thclient.client.TreeherderJob object at 0x7f041009f7d0>]}
2017-05-04 12:17:44,291|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder shared_lock.acquire

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L130 should log the lock request, acquire the lock, insert the row into the treeherder table, then log the lock release and release the lock. We never see the release message, which leads me to believe that the insertion into the treeherder table is hanging. In the meantime, anyone who wants to communicate with Treeherder will do the same, which will block since the shared lock is never released. Sounds like sqlite is the deadlock to me. In the meantime, I have deleted the autophone-1 pulse queues and restarted. If necessary, we'll have to resubmit the missing jobs.
We have been using a blocking shared lock to serialize access across threads and processes for the Pulse messages and Treeherder submissions. One assumption has been that the Kombu consumer did not call the connection consumer's callback functions until connection.drain_events was called. This is not always the case, however: the callbacks can be called before the drain_events call is made. If the callback was called after the lock was acquired in the listen method but before the lock was released and the drain_events method was called, then handle_message would be called with the shared lock already locked, which violates the assumption. This would result in a deadlock if the message resulted in a call to a function which also attempted to acquire the shared lock. This patch attempts to prevent the deadlock by unlocking the shared lock upon entering handle_message. If it is not locked, no foul, no error. However, if it is locked, we relock after completing the handling of the message to keep our locking assumptions in listen true. I do have another patch which adds a new shared multiprocessing Lock for serializing access to the jobs database, but I don't think it is absolutely necessary and would like to try this one first.
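In rough terms, the idea is something like the sketch below (the class and method names follow the description above, not the patch itself): release the shared lock on entry to handle_message if it is held, then re-acquire it after the message has been handled.

    # Sketch only: mirrors the locking idea described above, not the actual patch.
    import multiprocessing

    class PulseListener(object):
        def __init__(self, shared_lock):
            self.shared_lock = shared_lock

        def handle_message(self, data, message):
            relock = False
            try:
                try:
                    self.shared_lock.release()   # listen() may already hold the lock
                    relock = True                # remember to restore it afterwards
                except ValueError:
                    pass                         # not locked: no foul, no error
                self.process_message(data, message)
            finally:
                if relock:
                    self.shared_lock.acquire()   # keep listen()'s assumptions true

        def process_message(self, data, message):
            print('handling %r' % (data,))

    if __name__ == '__main__':
        listener = PulseListener(multiprocessing.Lock())
        listener.handle_message({'build_url': 'http://example.com/target.apk'}, None)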
Attachment #8864751 - Flags: review?(jmaher)
Comment on attachment 8864751 [details] [diff] [review] bug-1353509-pulse-listen-locking.patch Review of attachment 8864751 [details] [diff] [review]: ----------------------------------------------------------------- rs=me!
Attachment #8864751 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/2c5298003ccfd0c120e18244cc9e1461d025de7f deployed 2017-05-06 ~11:25. Marking fixed in the hope that it is true this time. If not, I'll open a new bug.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
autophone-2 hung this morning...

2017-05-09 09:27:09,477 nexus-4-9 INFO <2017-05-09T16:27:09+00:00> nexus-4-9 (WORKING): logcontrol: flush log
2017-05-09 09:27:09,477 nexus-4-9 INFO Waiting for log_flushed_event
2017-05-09 09:27:09,477 nexus-4-9 INFO Got log_flushed_event
2017-05-09 09:34:43,256 root WARNING check_for_unrecoverable_errors: Purging hung phone pixel-02
2017-05-09 17:51:42,256 nexus-6p-11 ERROR Exception while checking path /sdcard/tests
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/worker.py", line 652, in _check_path
    self.dm.mkdir(d, parents=True, root=True)
  File "/mozilla/autophone/autophone/adb.py", line 1684, in mkdir
    root=root):
  File "/mozilla/autophone/autophone/adb.py", line 1104, in shell_bool
    raise ADBTimeoutError("%s" % adb_process)
ADBTimeoutError: args: adb -s 84B7N16421000012 wait-for-device shell su 0 mkdir -p /sdcard/tests/autopho

check_for_unrecoverable_errors: Purging hung phone pixel-02 might be the hint.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
autophone-1 hung last night:

2017-05-15 20:19:24,777 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-1
autophone-2 2017-05-16 19:05:47,025 pixel-02 ERROR TalosTestJob 20170517010617 opt remote-tsvg Exception tearing down job Traceback (most recent call last): File "/mozilla/autophone/autophone/phonetest.py", line 878, in teardown_job tests=[self]) File "/mozilla/autophone/autophone/autophonetreeherder.py", line 442, in submit_complete self.queue_request(machine, project, tjc) File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request self.shared_lock.release() ValueError: semaphore or lock released too many times 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets 53 ERROR Automation Error: Exception caught while running tests 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 365, in run_test_harness 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets retVal = mochitest.runTests(options) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.runMochitests(options, testsToRun) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = self.doTests(options, testsToRun) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return status, self.lastTestSeen 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets AttributeError: 'MochiRemote' object has no attribute 'lastTestSeen' 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web server 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Failed to stop web server on http://10.252.73.231:38852/server/shutdown 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets with closing(urllib2.urlopen(self.shutdownURL)) as c: 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return opener.open(url, data, timeout) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 429, in open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets response = self._open(req, data) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 447, in _open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets '_open', req) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = func(*args) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File 
"/usr/lib/python2.7/urllib2.py", line 1228, in http_open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.do_open(httplib.HTTPConnection, req) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets raise URLError(err) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets URLError: <urlopen error [Errno 111] Connection refused> 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web socket server 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping ssltunnel 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 393, in <module> 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets sys.exit(main()) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 389, in main 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return run_test_harness(parser, options) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 371, in run_test_harness 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets mochitest.cleanup(options) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 67, in cleanup 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets MochitestDesktop.cleanup(self, options) 2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets os.remove(self.manifest) 2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets OSError: [Errno 2] No such file or directory: '/tmp/tmpc59Co3.mozrunner/test s.manifest'
Attached file 2017-05-18 09:50 hang (deleted) —
Not purging but related to rebooting a disconnected device?
2017-05-22 18:00:27,737 pixel-07 ERROR UnitTestJob mozilla-beta 20170522120019 debug api-25 android-api-15 mochitest-dom-browser-element Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 904, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 466, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Started Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261154) =========
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Finished Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261475) =========
2017-05-22 18:14:50,697 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-2

Maybe this is a lock contention issue. Do we screw up by releasing too many times and then later try to acquire too many times? And what's up with the purging? I'll look into this more now.
In AutophoneTreeherder:

    def queue_request(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('AutophoneTreeherder.queue_request: %s', job_collection.__dict__)
        logger.debug('AutophoneTreeherder shared_lock.acquire')
        self.shared_lock.acquire()
        try:
            self.jobs.new_treeherder_job(machine, project, job_collection)
        finally:
            logger.debug('AutophoneTreeherder shared_lock.release')
            self.shared_lock.release()

In Jobs:

    def new_treeherder_job(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('jobs.new_treeherder_job: %s %s %s',
                     machine, project, job_collection.__dict__)
        attempts = 0
        now = datetime.datetime.utcnow().isoformat()
        conn = self._conn()
        job_cursor = self._execute_sql(
            conn,
            'insert into treeherder values (?, ?, ?, ?, ?, ?)',
            values=(None, attempts, now, machine, project, job_collection.to_json()))
        job_cursor.close()
        self._commit_connection(conn)
        self._close_connection(conn)

There is no way these could get out of sync, so it must be the case that someone else is releasing the shared lock without first obtaining it. Must be a thread in the Main process.
Apart from bug 1367199, I've been looking into this in a variety of ways. One approach was to realize that simply having the shared lock as an attribute which is "passed" to the worker sub-process created via invoking the PhoneWorkerSubProcess.run method probably isn't correct. I changed the run method to take a shared_lock argument and changed the Process call to include args=(shared_lock,) so that we use the "correct" method of sharing an object via shared memory. With various other changes, I managed to create deadlocks accidentally and also managed to pretty much lock up my GNOME session while Autophone was running. This leads me to believe that the previous method of just reusing the PhoneWorkerSubProcess instance's shared_lock attribute didn't do anything to serialize access, as the locks probably weren't shared at all. I've been investigating exactly why I think I need the shared lock, and it seems to me that really all I care about is inserting rows into the jobs, tests and treeherder tables in the jobs.sqlite database. But sqlite3 already handles locking the database, which leads me to wonder if I really need the shared lock at all. That is what I am investigating now. Additional thoughts welcome.
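For illustration, the "correct" sharing mechanism mentioned above looks roughly like the sketch below (the worker function and phone ids are placeholders, not Autophone's classes): the lock is handed to each child explicitly through Process(args=...), so every child works with the same underlying semaphore.

    # Sketch only: placeholder worker function and phone ids.
    import multiprocessing

    def worker_run(shared_lock, phoneid):
        # Each child receives the same underlying lock object via args=(...).
        with shared_lock:
            print('%s holds the shared lock' % phoneid)

    if __name__ == '__main__':
        shared_lock = multiprocessing.Lock()
        workers = [multiprocessing.Process(target=worker_run,
                                           args=(shared_lock, 'pixel-%02d' % i))
                   for i in range(1, 4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()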
Attachment #8870901 - Flags: review?(jmaher)
Attachment #8870902 - Flags: review?(jmaher)
Attachment #8870903 - Flags: review?(jmaher)
Comment on attachment 8870901 [details] [diff] [review] bug-1353509-shared-lock-v1.patch Review of attachment 8870901 [details] [diff] [review]: ----------------------------------------------------------------- bummer to remove the shared locks
Attachment #8870901 - Flags: review?(jmaher) → review+
Comment on attachment 8870902 [details] [diff] [review] bug-1353509-purge-logging-v1.patch Review of attachment 8870902 [details] [diff] [review]: ----------------------------------------------------------------- simple!
Attachment #8870902 - Flags: review?(jmaher) → review+
Attachment #8870903 - Flags: review?(jmaher) → review+
Seems to be resolved. I've retriggered jobs and had no issues.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard