Closed Bug 1353509 Opened 8 years ago Closed 7 years ago

Autophone - autophone process is hanging intermittently

Categories

(Testing Graveyard :: Autophone, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(10 files)

After recovering from the network outage for the devices and restarting, autophone-3 is hanging on status and other control messages. I have attempted to restart it several times and it has continued to lock up. I have force-stopped it by killing the processes and clearing the buildcache in an attempt to clear the problem.
I finally deleted the pulse queues and restarted. It appears to be working now but I don't understand the cause or why this would fix the hang. I enabled debug output and it appeared to be hanging during/after the get_build_data: 2017-04-04 13:35:21,843|17879|Thread-5|root|DEBUG|get_build_data: {'changeset': u'https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e', 'repo': u'autoland', 'date': datetime.datetime(2017, 4, 3, 14, 50, 14, tzinfo=<UTC>), 'build_type': 'opt', 'sdk': u'api-15', 'nightly': True, 'url': 'https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk', 'abi': u'arm', 'builder_type': 'taskcluster', 'id': '20170403145014', 'platform': u'android-api-15', 'changeset_dirs': ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], 'revision': u'0b45ef5f7702377bded5a67c6af0a713eccfc41e'} 2017-04-04 13:35:21,865|17879|Thread-5|root|DEBUG|BuildMetadata: url: https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk, directory: builds/aHR0cHM6Ly9xdWV1ZS50YXNrY2x1c3Rlci5uZXQvdjEvdGFzay9ZZ2V5YUVPV1NxQzVLa1B4aTBTZ3lBL3J1bnMvMC9hcnRpZmFjdHMvcHVibGljL2J1aWxkL3RhcmdldC5hcGs=, tree: autoland, buildid: 20170403145014, revision: 0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset: https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset_dirs: ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], app_name: org.mozilla.fennec, version: 55.0a1, build_type: opt, treeherder_url: https://treeherder.mozilla.org, abi: arm, sdk: api-15, nightly: True, platform: android-api-15, builder_type: taskcluster
autophone-1 and autophone-2's pulse queues were deleted before the servers could process the outstanding messages which accumulated during the outage. autophone-3's pulse queue was not deleted, however, since it hadn't grown as fast as the others. Perhaps that is the root cause, as autophone-3 appears to have been working properly since its queue was deleted. Marking fixed for now until proven otherwise.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1352892
This is not resolved. Overnight, each server has encountered periods where it stopped consuming the pulse queues. Fortunately, they recovered but this issue is ongoing.
Severity: normal → blocker
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Summary: Autophone - autophone-3 is hanging on startup → Autophone - autophone process is hanging intermittently
Couldn't figure out an easy way to capture the message summaries other than taking a screenshot.
Unfortunately, we updated the test manifests at the same time as the office network maintenance, and I think this was conflated with the network change. I believe we are experiencing a deadlock due to the changes in the test manifests and device assignments. Removing bug 1349904.
No longer blocks: 1352892
Bug 1353760 made the autophone production queues unbounded. This allowed us to keep processing even though the total number of messages temporarily exceeded 20,000. All of the production queues have now been processed and are back to normal values. Therefore, I believe we can say that the pure processing of pulse messages may be slow, but it does not deadlock on its own. I did not issue any ap commands to the processes during this time. There was still the previous situation where an ap command, which contacts the autophone process and sends/receives messages from the server process, would appear to hang. This may be due to a blocking HTTP request to one of our other services such as hg, Treeherder, or Taskcluster, and I just didn't wait long enough for it to resolve. I'm investigating that possibility. One other thing: we are using the default pyamqp transport, which uses the pure-Python amqp library. librabbitmq is reportedly a high-performance transport written in C. Perhaps I could improve performance by installing librabbitmq; it will be used if available, falling back to pyamqp if it is not.
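For reference, the transport selection in Kombu roughly works like the sketch below (the broker URL, exchange, and queue names are placeholders, not Autophone's actual configuration): with the generic 'amqp' transport, Kombu prefers librabbitmq when it is importable and falls back to the pure-Python py-amqp otherwise.

    # Sketch only: placeholder broker URL, exchange and queue names.
    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/taskcluster-queue/v1/task-completed', type='topic')
    queue = Queue('queue/autophone/jobs', exchange=exchange, routing_key='#',
                  durable=True)

    # 'amqp://' lets Kombu choose librabbitmq if installed; 'pyamqp://' forces
    # the pure-Python transport.
    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        # Should report kombu.transport.librabbitmq when the C transport is
        # selected, kombu.transport.pyamqp otherwise.
        print(type(conn.transport).__module__)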
Attached file hang.log (deleted) —
This is happening again on autophone-3. It began around 4:15 AM PDT this morning, for some unknown reason, when we stopped consuming pulse messages. We are not in danger of losing our queues now that they are unbounded. Looking at the tail of the logs, it appears there is a deadlock with the workers attempting to send status messages to the parent process.
The parent process is constantly consuming about 5% CPU while the children are mostly idle. I killed the parent process and restarted.
Attached patch bug-1353509-v1.patch (deleted) — Splinter Review
Attachment #8855892 - Flags: review?(jmaher)
This patch just removes some platforms which we no longer support or don't test. This reduces the number of bindings we have for the exchanges/queues, but is not meant to fix the underlying issue. Might as well get it in now while we continue to investigate.
Comment on attachment 8855892 [details] [diff] [review] bug-1353509-v1.patch Review of attachment 8855892 [details] [diff] [review]: ----------------------------------------------------------------- win/win!
Attachment #8855892 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/55d214bb3bfade38dc1b11cf1c136cffd62aa030 Leaving open for the underlying issue. Deployed 2017-04-08 17:45.
This has been behaving much better since the removal of some of the platforms, but we started having issues again today with the queues growing overly large. They recovered temporarily but the queues remained too large. autophone-1 was effectively hung and I rebooted it, but autophone-2 and autophone-3 were responding to ap commands though their queues were also too large for comfort.
Actually, the change to the bindings had no effect, since the queues are now durable and the bindings are fixed until I delete them. This may also be associated with being DoSed via try. I'm investigating.
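To illustrate why (a sketch with placeholder names; the real broker URL, exchange, and queue names differ): once a durable queue and its binding are declared on the broker, they persist there until we explicitly unbind or delete them, so removing a platform from our configuration doesn't touch the existing binding.

    # Sketch only: placeholder broker URL, exchange, queue and routing key.
    from kombu import Connection, Exchange, Queue

    exchange = Exchange('exchange/taskcluster-queue/v1/task-completed', type='topic')

    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        queue = Queue('queue/autophone/jobs', exchange=exchange,
                      routing_key='old.platform.#', durable=True)(conn.default_channel)
        queue.declare()  # the queue and its binding now live on the broker

        # Dropping the routing key from our config changes nothing server-side;
        # the old binding stays until we unbind it ...
        queue.unbind_from(exchange, routing_key='old.platform.#')
        # ... or delete the queue outright:
        # queue.delete()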
Attached patch bug-1353509.patch (deleted) — Splinter Review
Attachment #8860013 - Flags: review?(jmaher)
Comment on attachment 8860013 [details] [diff] [review] bug-1353509.patch Review of attachment 8860013 [details] [diff] [review]: ----------------------------------------------------------------- nice to see durable queues being used
Attachment #8860013 - Flags: review?(jmaher) → review+
autophone-2 began hanging and accumulating pulse messages this morning. It appears it began after:

2017-04-20 10:00:02,007|957|PulseMonitorThread|root|ERROR|Unable to open https://treeherder.mozilla.org/api/jobdetail/?repository=autoland&job_id=91626182
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/utils.py", line 53, in get_remote_text
    conn = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
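One mitigation to consider (a sketch, not the current utils.get_remote_text; the retry count, delay, and timeout values are illustrative): give the request an explicit timeout and retry transient URLErrors a couple of times so a single SSL EOF doesn't wedge the PulseMonitorThread.

    # Sketch only: a retrying variant of a remote-text fetch; not the shipped code.
    import time
    import urllib2

    def get_remote_text_with_retry(url, attempts=3, timeout=30, delay=5):
        req = urllib2.Request(url)
        for attempt in range(attempts):
            try:
                conn = urllib2.urlopen(req, timeout=timeout)
                try:
                    return conn.read()
                finally:
                    conn.close()
            except urllib2.URLError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)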
Now autophone-3 is hung up with repeated 2017-04-21 08:05:51,226|920|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=93298011 missing keys: set(['config_file', 'chunk', 'build_url', 'builder_type']) job: {u'submit_timestamp': 1492775164, u'build_system_type': u'taskcluster', u'machine_name': u'i-0492d2009f4e5d6e6', u'job_group_symbol': u'tc-M', u'logs': [{u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/logs/live_backing.log', u'name': u'builds-4h'}, {u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/test_info/browser-chrome-chunked_errorsummary.log', u'name': u'errorsummary_json'}], u'job_group_name': u'Mochitests executed by TaskCluster', u'platform_option': u'opt', u'job_type_description': u'', u'result_set_id': 195226, u'build_platform_id': 215, u'result': u'success', u'machine_platform_os': u'-', u'push_id': 195226, u'ref_data_name': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'machine_platform_architecture': u'-', u'job_type_id': 33664, u'build_platform': u'windows7-32-vm', u'running_eta': 752, u'job_type_name': u'test-windows7-32-vm/opt-mochitest-browser-chrome-2', u'id': 93298011, u'platform': u'windows7-32-vm', u'state': u'completed', u'end_timestamp': 1492776487, u'build_os': u'-', u'who': u'fliu@mozilla.com', u'taskcluster_metadata': {u'task_id': u'MwddINb-R9iguE0-lyjp1w', u'retry_id': 0}, u'option_collection_hash': u'102210fe594ee9b33d82058545b1ed14f4c8206e', u'failure_classification_id': 1, u'autoclassify_status': u'crossreferenced', u'job_type_symbol': u'bc2', u'reason': u'scheduled', u'last_modified': u'2017-04-21T12:08:08.122351', u'tier': 2, u'job_coalesced_to_guid': None, u'job_guid': u'33075d20-d6fe-47d8-a0b8-4d3e9728e9d7/0', u'start_timestamp': 1492775644, u'build_architecture': u'-', u'job_group_description': u'', u'signature': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'job_group_id': 257, u'resource_uri': u'/api/project/try/jobs/93298011/'} which isn't even an autophone job. https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f7336ed7ed2eee0d144cb9ac762e92cfc2c2726
All three job action queues are growing this morning. Looking at the logs, we are getting lots of:

2017-04-26 05:00:55,498|921|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=94342282 missing keys: set([

These are for jobs which do not belong to us, and we appear not to be processing them quickly enough. Perhaps we don't need to log as much here.
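The patch below presumably amounts to something along the lines of this sketch (the function and variable names mirror the log output, not the actual code): demote the per-job warning to debug so jobs that aren't ours stop flooding the log.

    # Sketch only: illustrative names; the real get_treeherder_privatebuild_info differs.
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    REQUIRED_KEYS = set(['config_file', 'chunk', 'build_url', 'builder_type'])

    def check_privatebuild_info(url, job):
        """Return True if the job detail has the keys Autophone needs."""
        missing = REQUIRED_KEYS - set(job)
        if missing:
            # Previously logged at WARNING; at DEBUG, non-Autophone jobs no
            # longer flood the production logs.
            logger.debug('get_treeherder_privatebuild_info: %s missing keys: %s',
                         url, missing)
            return False
        return True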
Make this a debug-only message.
Attachment #8861917 - Flags: review?(jmaher)
Comment on attachment 8861917 [details] [diff] [review] bug-1353509-privatebuild-v1.patch Review of attachment 8861917 [details] [diff] [review]: ----------------------------------------------------------------- great
Attachment #8861917 - Flags: review?(jmaher) → review+
autophone-1 hung this morning. It appeared to be downloading and installing builds on the workers. The log messages weren't informative. I killed the process and restarted.
This autophone-1 hang has persisted across reboots and a selective delete of a single job. I've enabled DEBUG logging and started with a fresh jobs.sqlite database (saving the old version for later resubmissions). Hopefully we'll catch up with pulse before too long.
This appears to be related to queuing up the Treeherder messages in the database.

2017-05-04 12:17:44,220|8990|PulseMonitorThread|root|DEBUG|new_job: worker phoneid pixel-05
2017-05-04 12:17:44,290|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder.queue_request: {'endpoint_base': u'jobs', 'data': [<thclient.client.TreeherderJob object at 0x7f041009f7d0>]}
2017-05-04 12:17:44,291|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder shared_lock.acquire

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L130 should log the lock request, acquire the lock, insert the row into the treeherder table, then log the lock release and release the lock. We never see the release message, which leads me to believe that the insertion into the treeherder table is hanging. In the meantime, anyone who wants to communicate with Treeherder will do the same, which will block since the shared lock is never released. Sounds like sqlite is the deadlock to me. In the meantime, I have deleted the autophone-1 pulse queues and restarted. If necessary, we'll have to resubmit the missing jobs.
We have been using a blocking shared lock to serialize access across threads and processes for the Pulse messages and Treeherder submissions. One assumption has been that the Kombu consumer did not call the connection consumer's callback functions until connection.drain_events was called. This is not always the case, however: the callbacks can be called before the drain_events call is made. If the callback was called after the lock was acquired in the listen method but before the lock was released and the drain_events method was called, then handle_message would be called with the shared lock already locked, which violates the assumption. This would result in a deadlock if the message resulted in a call to a function which also attempted to acquire the shared lock. This patch attempts to prevent the deadlock by unlocking the shared lock upon entering handle_message. If it is not locked, no foul, no error. However, if it is locked, we relock after completing the handling of the message to keep our locking assumptions in listen true. I do have another patch which adds a new shared multiprocessing Lock for serializing access to the jobs database, but I don't think it is absolutely necessary and would like to try this one first.
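In rough terms, the idea is something like the sketch below (the class and method names follow the description above, not the patch itself): release the shared lock on entry to handle_message if it is held, then re-acquire it after the message has been handled.

    # Sketch only: mirrors the locking idea described above, not the actual patch.
    import multiprocessing

    class PulseListener(object):
        def __init__(self, shared_lock):
            self.shared_lock = shared_lock

        def handle_message(self, data, message):
            relock = False
            try:
                try:
                    self.shared_lock.release()   # listen() may already hold the lock
                    relock = True                # remember to restore it afterwards
                except ValueError:
                    pass                         # not locked: no foul, no error
                self.process_message(data, message)
            finally:
                if relock:
                    self.shared_lock.acquire()   # keep listen()'s assumptions true

        def process_message(self, data, message):
            print('handling %r' % (data,))

    if __name__ == '__main__':
        listener = PulseListener(multiprocessing.Lock())
        listener.handle_message({'build_url': 'http://example.com/target.apk'}, None)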
Attachment #8864751 - Flags: review?(jmaher)
Comment on attachment 8864751 [details] [diff] [review] bug-1353509-pulse-listen-locking.patch Review of attachment 8864751 [details] [diff] [review]: ----------------------------------------------------------------- rs=me!
Attachment #8864751 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/2c5298003ccfd0c120e18244cc9e1461d025de7f deployed 2017-05-06 ~11:25. Marking fixed in the hope that it is true this time. If not, I'll open a new bug.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
autophone-2 hung this morning...

2017-05-09 09:27:09,477 nexus-4-9 INFO <2017-05-09T16:27:09+00:00> nexus-4-9 (WORKING): logcontrol: flush log
2017-05-09 09:27:09,477 nexus-4-9 INFO Waiting for log_flushed_event
2017-05-09 09:27:09,477 nexus-4-9 INFO Got log_flushed_event
2017-05-09 09:34:43,256 root WARNING check_for_unrecoverable_errors: Purging hung phone pixel-02
2017-05-09 17:51:42,256 nexus-6p-11 ERROR Exception while checking path /sdcard/tests
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/worker.py", line 652, in _check_path
    self.dm.mkdir(d, parents=True, root=True)
  File "/mozilla/autophone/autophone/adb.py", line 1684, in mkdir
    root=root):
  File "/mozilla/autophone/autophone/adb.py", line 1104, in shell_bool
    raise ADBTimeoutError("%s" % adb_process)
ADBTimeoutError: args: adb -s 84B7N16421000012 wait-for-device shell su 0 mkdir -p /sdcard/tests/autopho

check_for_unrecoverable_errors: Purging hung phone pixel-02 might be the hint.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
autophone-1 hung last night:

2017-05-15 20:19:24,777 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-1
autophone-2 2017-05-16 19:05:47,025 pixel-02 ERROR TalosTestJob 20170517010617 opt remote-tsvg Exception tearing down job Traceback (most recent call last): File "/mozilla/autophone/autophone/phonetest.py", line 878, in teardown_job tests=[self]) File "/mozilla/autophone/autophone/autophonetreeherder.py", line 442, in submit_complete self.queue_request(machine, project, tjc) File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request self.shared_lock.release() ValueError: semaphore or lock released too many times 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets 53 ERROR Automation Error: Exception caught while running tests 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 365, in run_test_harness 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets retVal = mochitest.runTests(options) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.runMochitests(options, testsToRun) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = self.doTests(options, testsToRun) 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return status, self.lastTestSeen 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets AttributeError: 'MochiRemote' object has no attribute 'lastTestSeen' 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web server 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Failed to stop web server on http://10.252.73.231:38852/server/shutdown 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets with closing(urllib2.urlopen(self.shutdownURL)) as c: 2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return opener.open(url, data, timeout) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 429, in open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets response = self._open(req, data) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 447, in _open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets '_open', req) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = func(*args) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File 
"/usr/lib/python2.7/urllib2.py", line 1228, in http_open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.do_open(httplib.HTTPConnection, req) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets raise URLError(err) 2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets URLError: <urlopen error [Errno 111] Connection refused> 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web socket server 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping ssltunnel 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last): 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 393, in <module> 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets sys.exit(main()) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 389, in main 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return run_test_harness(parser, options) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 371, in run_test_harness 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets mochitest.cleanup(options) 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 67, in cleanup 2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets MochitestDesktop.cleanup(self, options) 2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets os.remove(self.manifest) 2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets OSError: [Errno 2] No such file or directory: '/tmp/tmpc59Co3.mozrunner/test s.manifest'
Attached file 2017-05-18 09:50 hang (deleted) —
Not purging but related to rebooting a disconnected device?
2017-05-22 18:00:27,737 pixel-07 ERROR UnitTestJob mozilla-beta 20170522120019 debug api-25 android-api-15 mochitest-dom-browser-element Exception tearing down job
Traceback (most recent call last):
  File "/mozilla/autophone/autophone/phonetest.py", line 904, in teardown_job
    tests=[self])
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 466, in submit_complete
    self.queue_request(machine, project, tjc)
  File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
    self.shared_lock.release()
ValueError: semaphore or lock released too many times
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Started Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261154) =========
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Finished Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261475) =========
2017-05-22 18:14:50,697 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-2

Maybe this is a lock contention issue. Do we screw up by releasing too many times and then later try to acquire too many times? And what's up with the purging? I'll look into this more now.
In AutophoneTreeherder:

    def queue_request(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('AutophoneTreeherder.queue_request: %s', job_collection.__dict__)
        logger.debug('AutophoneTreeherder shared_lock.acquire')
        self.shared_lock.acquire()
        try:
            self.jobs.new_treeherder_job(machine, project, job_collection)
        finally:
            logger.debug('AutophoneTreeherder shared_lock.release')
            self.shared_lock.release()

In Jobs:

    def new_treeherder_job(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('jobs.new_treeherder_job: %s %s %s',
                     machine, project, job_collection.__dict__)
        attempts = 0
        now = datetime.datetime.utcnow().isoformat()
        conn = self._conn()
        job_cursor = self._execute_sql(
            conn,
            'insert into treeherder values (?, ?, ?, ?, ?, ?)',
            values=(None, attempts, now, machine, project, job_collection.to_json()))
        job_cursor.close()
        self._commit_connection(conn)
        self._close_connection(conn)

There is no way these could get out of sync, so it must be the case that someone else is releasing the shared lock without first obtaining it. Must be a thread in the Main process.
Apart from bug 1367199, I've been looking into this in a variety of ways. One approach was to realize that simply having the shared lock as an attribute which is "passed" to the worker sub-process created via invoking the PhoneWorkerSubProcess.run method probably isn't correct. I changed the run method to take a shared_lock argument and changed the Process call to include args=(shared_lock,) so that we use the "correct" method of sharing an object via shared memory. With various other changes, I managed to create deadlocks accidentally and also managed to pretty much lock up my GNOME session while Autophone was running. This leads me to believe that the previous method of just reusing the PhoneWorkerSubProcess instance's shared_lock attribute didn't do anything to serialize access, as the locks probably weren't shared at all. I've been investigating exactly why I think I need the shared lock, and it seems to me that really all I care about is inserting rows into the jobs, tests and treeherder tables in the jobs.sqlite database. But sqlite3 already handles locking the database, which leads me to wonder if I really need the shared lock at all. That is what I am investigating now. Additional thoughts welcome.
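For illustration, the "correct" sharing mechanism mentioned above looks roughly like the sketch below (the worker function and phone ids are placeholders, not Autophone's classes): the lock is handed to each child explicitly through Process(args=...), so every child works with the same underlying semaphore.

    # Sketch only: placeholder worker function and phone ids.
    import multiprocessing

    def worker_run(shared_lock, phoneid):
        # Each child receives the same underlying lock object via args=(...).
        with shared_lock:
            print('%s holds the shared lock' % phoneid)

    if __name__ == '__main__':
        shared_lock = multiprocessing.Lock()
        workers = [multiprocessing.Process(target=worker_run,
                                           args=(shared_lock, 'pixel-%02d' % i))
                   for i in range(1, 4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()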
Attachment #8870901 - Flags: review?(jmaher)
Attachment #8870902 - Flags: review?(jmaher)
Attachment #8870903 - Flags: review?(jmaher)
Comment on attachment 8870901 [details] [diff] [review] bug-1353509-shared-lock-v1.patch Review of attachment 8870901 [details] [diff] [review]: ----------------------------------------------------------------- bummer to remove the shared locks
Attachment #8870901 - Flags: review?(jmaher) → review+
Comment on attachment 8870902 [details] [diff] [review] bug-1353509-purge-logging-v1.patch Review of attachment 8870902 [details] [diff] [review]: ----------------------------------------------------------------- simple!
Attachment #8870902 - Flags: review?(jmaher) → review+
Attachment #8870903 - Flags: review?(jmaher) → review+
Seems to be resolved. I've retriggered jobs and had no issues.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard