Closed Bug 1353509 (opened 8 years ago, closed 7 years ago)
Autophone - autophone process is hanging intermittently
Categories: Testing Graveyard :: Autophone, defect, P1
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: bc, Assigned: bc
Attachments (10 files, all deleted): 1 image/png, 2 text/plain, and 7 patches, each reviewed by jmaher (review+).
After recovering from the network outage for the devices and restarting, autophone-3 is hanging on status and other control messages. I have attempted to restart it several times and it has continued to lock up. I have force-stopped it by killing the processes and clearing the build cache in an attempt to clear the problem.
Comment 1 • 8 years ago (Assignee)
I finally deleted the pulse queues and restarted. It appears to be working now, but I don't understand the cause or why this would fix the hang. I enabled debug output and it appeared to be hanging during/after get_build_data:
2017-04-04 13:35:21,843|17879|Thread-5|root|DEBUG|get_build_data: {'changeset': u'https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e', 'repo': u'autoland', 'date': datetime.datetime(2017, 4, 3, 14, 50, 14, tzinfo=<UTC>), 'build_type': 'opt', 'sdk': u'api-15', 'nightly': True, 'url': 'https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk', 'abi': u'arm', 'builder_type': 'taskcluster', 'id': '20170403145014', 'platform': u'android-api-15', 'changeset_dirs': ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], 'revision': u'0b45ef5f7702377bded5a67c6af0a713eccfc41e'}
2017-04-04 13:35:21,865|17879|Thread-5|root|DEBUG|BuildMetadata: url: https://queue.taskcluster.net/v1/task/YgeyaEOWSqC5KkPxi0SgyA/runs/0/artifacts/public/build/target.apk, directory: builds/aHR0cHM6Ly9xdWV1ZS50YXNrY2x1c3Rlci5uZXQvdjEvdGFzay9ZZ2V5YUVPV1NxQzVLa1B4aTBTZ3lBL3J1bnMvMC9hcnRpZmFjdHMvcHVibGljL2J1aWxkL3RhcmdldC5hcGs=, tree: autoland, buildid: 20170403145014, revision: 0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset: https://hg.mozilla.org/integration/autoland/rev/0b45ef5f7702377bded5a67c6af0a713eccfc41e, changeset_dirs: ['gfx/src', 'gfx/thebes', 'layout/base', 'modules/libpref/init', 'browser/components/preferences/in-content/tests', 'toolkit/mozapps/preferences', 'browser/components/preferences/in-content-old/tests'], app_name: org.mozilla.fennec, version: 55.0a1, build_type: opt, treeherder_url: https://treeherder.mozilla.org, abi: arm, sdk: api-15, nightly: True, platform: android-api-15, builder_type: taskcluster
Comment 2 • 8 years ago (Assignee)
autophone-1 and autophone-2's pulse queues were deleted before the servers could process the outstanding messages that accumulated during the outage. autophone-3's pulse queue was not deleted, however, since it hadn't grown as fast as the others. Perhaps that is the root cause, since autophone-3 appears to be working properly now that its queue has been deleted. Marking fixed for now until proven otherwise.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 3 • 8 years ago (Assignee)
This is not resolved. Overnight, each server has encountered periods where it stopped consuming the pulse queues. Fortunately, they recovered but this issue is ongoing.
Severity: normal → blocker
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Summary: Autophone - autophone-3 is hanging on startup → Autophone - autophone process is hanging intermittently
Comment 4 • 8 years ago (Assignee)
Couldn't figure out an easy way to capture the message summaries other than screenshotting them.
Comment 5 • 8 years ago (Assignee)
Unfortunately, we updated the test manifests at the same time as the office network maintenance, and I think this was conflated with the network change. I believe we are experiencing a deadlock due to the changes in the test manifests and device assignments. Removing bug 1349904.
No longer blocks: 1352892
Comment 6 • 8 years ago (Assignee)
Bug 1353760 made the autophone production queues unbounded. This allowed us to keep processing even though the total number of messages temporarily exceeded 20,000. All of the production queues have now been processed and are back to normal values. Therefore, I believe we can say that the pure processing of pulse messages may be slow, but it does not deadlock on its own. I did not issue any ap commands to the processes during this time.
There was still the previous situation where an ap command, which contacts the autophone process and sends/receives messages from the server process, would appear to hang. This may be due to a blocking HTTP request to one of our other services such as hg, Treeherder, or Taskcluster, and I just didn't wait long enough for it to resolve. I'm investigating that possibility.
One other thing: we are using the default pyamqp transport, which uses the pure-Python library amqp. librabbitmq is supposedly a high-performance transport written in C. Perhaps I could improve performance by installing librabbitmq; it will be used if available, falling back to pyamqp if not.
Comment 7 • 8 years ago (Assignee)
Forgot to include this link: http://docs.celeryproject.org/projects/kombu/en/latest/userguide/connections.html#amqp-transports
Comment 8 • 8 years ago (Assignee)
This is happening again on autophone-3. It began around 4:15 AM PDT this morning, for some unknown reason, when we stopped consuming pulse messages. We are not in danger of losing our queue now that they are unbounded. Looking at the tail of the logs, it appears there is a deadlock with the workers attempting to send status messages to the parent process.
Comment 9 • 8 years ago (Assignee)
The parent process is constantly consuming about 5% CPU while the children are pretty much idle. I killed the parent process and restarted.
Comment 10 • 8 years ago (Assignee)
Attachment #8855892 - Flags: review?(jmaher)
Comment 11 • 8 years ago (Assignee)
This patch just removes some platforms which we no longer support or don't test. This reduces the number of bindings we have for the exchanges/queues, but is not meant to fix the underlying issue. Might as well get it in now while we continue to investigate.
Comment 12 • 8 years ago
Comment on attachment 8855892 [details] [diff] [review]
bug-1353509-v1.patch
Review of attachment 8855892 [details] [diff] [review]:
-----------------------------------------------------------------
win/win!
Attachment #8855892 - Flags: review?(jmaher) → review+
Comment 13 • 8 years ago (Assignee)
https://github.com/mozilla/autophone/commit/55d214bb3bfade38dc1b11cf1c136cffd62aa030
Leaving open for the underlying issue.
Deployed 2017-04-08 17:45.
Blocks: autophone-deployments
Comment 14 • 8 years ago (Assignee)
This has been behaving much better since the removal of some of the platforms, but we started having issues again today with the queues overly large. They recovered temporarily, but the queues remained too large. autophone-1 was effectively hung and I rebooted it, but autophone-2 and autophone-3 were responding to ap commands, though their queues were also too large for comfort.
Comment 15 • 8 years ago (Assignee)
Actually, the change to the bindings had no effect, since the queues are now durable and the bindings are fixed until I delete them.
This may also be associated with being DoS'd via try. I'm investigating.
Comment 16 • 8 years ago (Assignee)
Attachment #8860013 - Flags: review?(jmaher)
Comment 17 • 8 years ago
Comment on attachment 8860013 [details] [diff] [review]
bug-1353509.patch
Review of attachment 8860013 [details] [diff] [review]:
-----------------------------------------------------------------
nice to see durable queues being used
Attachment #8860013 - Flags: review?(jmaher) → review+
Comment 18 • 8 years ago (Assignee)
Comment 19 • 8 years ago (Assignee)
autophone-2 began hanging and accumulating pulse messages this morning. It appears it began after:
2017-04-20 10:00:02,007|957|PulseMonitorThread|root|ERROR|Unable to open https://treeherder.mozilla.org/api/jobdetail/?repository=autoland&job_id=91626182
Traceback (most recent call last):
File "/mozilla/autophone/autophone/utils.py", line 53, in get_remote_text
conn = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
context=self._context)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
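A single blocking or failing HTTPS fetch like this can stall the PulseMonitorThread. A common mitigation is to bound each attempt with a timeout and retry transient URLErrors. Here is a minimal Python 3 sketch using a hypothetical fetch_with_retry helper (this is not the actual utils.get_remote_text, and the stub opener below stands in for a real network call):

```python
import urllib.error

def fetch_with_retry(open_url, url, retries=3):
    """Retry transient URLErrors; re-raise after the last attempt."""
    last_err = None
    for _ in range(retries):
        try:
            return open_url(url)
        except urllib.error.URLError as e:
            last_err = e
    raise last_err

# Offline demonstration: a stub opener that fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("EOF occurred in violation of protocol")
    return "ok"

result = fetch_with_retry(flaky, "https://treeherder.mozilla.org/api/jobdetail/")
print(result)  # ok
```

In production the opener would be something like `lambda u: urllib.request.urlopen(u, timeout=30).read()`, so that no single request can hang the monitor thread forever.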
Comment 20 • 8 years ago (Assignee)
Comment 21 • 8 years ago (Assignee)
Now autophone-3 is hung with this repeated warning:
2017-04-21 08:05:51,226|920|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=93298011 missing keys: set(['config_file', 'chunk', 'build_url', 'builder_type']) job: {u'submit_timestamp': 1492775164, u'build_system_type': u'taskcluster', u'machine_name': u'i-0492d2009f4e5d6e6', u'job_group_symbol': u'tc-M', u'logs': [{u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/logs/live_backing.log', u'name': u'builds-4h'}, {u'url': u'https://queue.taskcluster.net/v1/task/MwddINb-R9iguE0-lyjp1w/runs/0/artifacts/public/test_info/browser-chrome-chunked_errorsummary.log', u'name': u'errorsummary_json'}], u'job_group_name': u'Mochitests executed by TaskCluster', u'platform_option': u'opt', u'job_type_description': u'', u'result_set_id': 195226, u'build_platform_id': 215, u'result': u'success', u'machine_platform_os': u'-', u'push_id': 195226, u'ref_data_name': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'machine_platform_architecture': u'-', u'job_type_id': 33664, u'build_platform': u'windows7-32-vm', u'running_eta': 752, u'job_type_name': u'test-windows7-32-vm/opt-mochitest-browser-chrome-2', u'id': 93298011, u'platform': u'windows7-32-vm', u'state': u'completed', u'end_timestamp': 1492776487, u'build_os': u'-', u'who': u'fliu@mozilla.com', u'taskcluster_metadata': {u'task_id': u'MwddINb-R9iguE0-lyjp1w', u'retry_id': 0}, u'option_collection_hash': u'102210fe594ee9b33d82058545b1ed14f4c8206e', u'failure_classification_id': 1, u'autoclassify_status': u'crossreferenced', u'job_type_symbol': u'bc2', u'reason': u'scheduled', u'last_modified': u'2017-04-21T12:08:08.122351', u'tier': 2, u'job_coalesced_to_guid': None, u'job_guid': u'33075d20-d6fe-47d8-a0b8-4d3e9728e9d7/0', u'start_timestamp': 1492775644, u'build_architecture': u'-', u'job_group_description': u'', u'signature': u'94825cee3fb45f5a487cd2dfc8c0f2cadfc93a40', u'job_group_id': 257, 
u'resource_uri': u'/api/project/try/jobs/93298011/'}
which isn't even an autophone job.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f7336ed7ed2eee0d144cb9ac762e92cfc2c2726
Comment 22 • 8 years ago (Assignee)
All three job action queues are growing this morning. Looking at the logs, we are getting lots of
2017-04-26 05:00:55,498|921|PulseMonitorThread|root|WARNING|get_treeherder_privatebuild_info: https://treeherder.mozilla.org/api/jobdetail/?repository=try&job_id=94342282 missing keys: set([
2
These are for jobs which do not belong to us, and we appear not to be processing them quickly enough. Perhaps we don't need to log as much here.
Comment 23 • 8 years ago (Assignee)
Make this a debug-only message.
Attachment #8861917 - Flags: review?(jmaher)
Comment 24 • 8 years ago
Comment on attachment 8861917 [details] [diff] [review]
bug-1353509-privatebuild-v1.patch
Review of attachment 8861917 [details] [diff] [review]:
-----------------------------------------------------------------
great
Attachment #8861917 - Flags: review?(jmaher) → review+
Comment 25 • 8 years ago (Assignee)
https://github.com/mozilla/autophone/commit/cc5c38e9725d79124aca6bad26dfeb927a0edf0f
Will deploy in a few minutes. Leaving open.
Comment 26 • 7 years ago (Assignee)
autophone-1 hung this morning. It appeared to be downloading/installing builds on the workers. The log messages weren't informative. I killed the process and restarted.
Comment 27 • 7 years ago (Assignee)
This autophone-1 hang has persisted across reboots and a selective delete of a single job. I've enabled DEBUG logging and started with a fresh jobs.sqlite database (saving the old version for later resubmissions). Hopefully we'll catch up with pulse before too long.
Comment 28 • 7 years ago (Assignee)
This appears to be related to queuing up the treeherder messages in the database.
2017-05-04 12:17:44,220|8990|PulseMonitorThread|root|DEBUG|new_job: worker phoneid pixel-05
2017-05-04 12:17:44,290|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder.queue_request: {'endpoint_base': u'jobs', 'data': [<thclient.client.TreeherderJob object at 0x7f041009f7d0>]}
2017-05-04 12:17:44,291|8990|PulseMonitorThread|root|DEBUG|AutophoneTreeherder shared_lock.acquire
https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L130 should log the lock request, acquire the lock, insert the row into the treeherder table, then log the lock release and release the lock.
We never see the release request, which leads me to believe that the insertion into the treeherder table is hanging. In the meantime, anyone who wants to communicate with Treeherder will do the same, which will block since the shared lock is never released. Sounds like SQLite is the deadlock to me.
In the meantime, I have deleted the autophone-1 pulse queues and restarted. If necessary, we'll have to resubmit the missing jobs.
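The self-deadlock described above, where a callback re-acquires a non-reentrant lock its caller already holds, can be reproduced in a few lines. This is a minimal sketch, not Autophone code; the timeout exists only so the demonstration terminates:

```python
import threading

shared_lock = threading.Lock()  # non-reentrant, like multiprocessing.Lock

def handle_message():
    # The listener already holds shared_lock, so this acquire can
    # never succeed: without the timeout it would block forever.
    acquired = shared_lock.acquire(timeout=1)
    print("second acquire succeeded:", acquired)
    if acquired:
        shared_lock.release()

shared_lock.acquire()   # the listener takes the shared lock...
handle_message()        # ...then the callback tries to take it again
shared_lock.release()
```

This prints `second acquire succeeded: False`; in the real process, with no timeout, that second acquire is the hang.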
Comment 29 • 7 years ago (Assignee)
We have been using a blocking shared lock to serialize access across threads and processes for the Pulse messages and Treeherder submissions.

One assumption has been that the Kombu consumer did not call the connection consumer's callback functions until connection.drain_events was called. This is not always the case, however: the callbacks can be called before the drain_events call is made.

If the callback was called after the lock was set in the listen method but before the lock was released and the drain_events method was called, then handle_message would be called with the shared lock already locked, which violates the assumption. This would result in a deadlock if the message resulted in a call to a function which also attempted to lock the shared lock.

This patch attempts to prevent the deadlock by unlocking the shared lock upon entering handle_message. If it is not locked, no foul, no error. However, if it is locked, we relock after completing the handling of the message to keep our locking assumptions in listen true.
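That approach can be sketched as follows. This is an illustration of the idea under assumed names, not the actual patch; it relies on multiprocessing.Lock raising ValueError when released without being held, which is what makes the "no foul, no error" probe possible:

```python
import multiprocessing as mp

def process_message(msg, log):
    log.append(msg)  # stand-in for the real message handling

def handle_message(shared_lock, msg, log):
    try:
        shared_lock.release()      # undo the listener's acquire, if any
        relock = True
    except ValueError:             # lock was not held: no foul, no error
        relock = False
    try:
        process_message(msg, log)  # may itself take shared_lock safely now
    finally:
        if relock:
            shared_lock.acquire()  # restore listen()'s locking invariant

lock = mp.Lock()
log = []
handle_message(lock, "early", log)   # callback fired before listen() locked
lock.acquire()                       # listen() holds the lock...
handle_message(lock, "normal", log)  # ...when the callback fires
lock.release()
print(log)  # ['early', 'normal']
```

Either way the handler runs, and the lock is left in the same state the caller expects.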
I do have another patch which adds a new shared multiprocessing Lock for serializing access to the jobs database, but I don't think it is absolutely necessary and would like to try this one first.
Attachment #8864751 - Flags: review?(jmaher)
Comment 30 • 7 years ago
Comment on attachment 8864751 [details] [diff] [review]
bug-1353509-pulse-listen-locking.patch
Review of attachment 8864751 [details] [diff] [review]:
-----------------------------------------------------------------
rs=me!
Attachment #8864751 - Flags: review?(jmaher) → review+
Comment 31 • 7 years ago (Assignee)
https://github.com/mozilla/autophone/commit/2c5298003ccfd0c120e18244cc9e1461d025de7f
Deployed 2017-05-06 ~11:25.
Marking fixed with the hope that it is true this time. If not, I'll open a new bug.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Comment 32 • 7 years ago (Assignee)
autophone-2 hung this morning...
2017-05-09 09:27:09,477 nexus-4-9 INFO <2017-05-09T16:27:09+00:00> nexus-4-9 (WORKING): logcontrol: flush log
2017-05-09 09:27:09,477 nexus-4-9 INFO Waiting for log_flushed_event
2017-05-09 09:27:09,477 nexus-4-9 INFO Got log_flushed_event
2017-05-09 09:34:43,256 root WARNING check_for_unrecoverable_errors: Purging hung phone pixel-02
2017-05-09 17:51:42,256 nexus-6p-11 ERROR Exception while checking path /sdcard/tests
Traceback (most recent call last):
File "/mozilla/autophone/autophone/worker.py", line 652, in _check_path
self.dm.mkdir(d, parents=True, root=True)
File "/mozilla/autophone/autophone/adb.py", line 1684, in mkdir
root=root):
File "/mozilla/autophone/autophone/adb.py", line 1104, in shell_bool
raise ADBTimeoutError("%s" % adb_process)
ADBTimeoutError: args: adb -s 84B7N16421000012 wait-for-device shell su 0 mkdir -p /sdcard/tests/autopho
The "check_for_unrecoverable_errors: Purging hung phone pixel-02" message might be the hint.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 33 • 7 years ago (Assignee)
autophone-1 hung last night:
2017-05-15 20:19:24,777 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-1
Comment 34 • 7 years ago (Assignee)
autophone-2
2017-05-16 19:05:47,025 pixel-02 ERROR TalosTestJob 20170517010617 opt remote-tsvg Exception tearing down job
Traceback (most recent call last):
File "/mozilla/autophone/autophone/phonetest.py", line 878, in teardown_job
tests=[self])
File "/mozilla/autophone/autophone/autophonetreeherder.py", line 442, in submit_complete
self.queue_request(machine, project, tjc)
File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
self.shared_lock.release()
ValueError: semaphore or lock released too many times
2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets 53 ERROR Automation Error: Exception caught while running tests
2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,884 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 365, in run_test_harness
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets retVal = mochitest.runTests(options)
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.runMochitests(options, testsToRun)
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = self.doTests(options, testsToRun)
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return status, self.lastTestSeen
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets AttributeError: 'MochiRemote' object has no attribute 'lastTestSeen'
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web server
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Failed to stop web server on http://10.252.73.231:38852/server/shutdown
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets with closing(urllib2.urlopen(self.shutdownURL)) as c:
2017-05-15 21:27:32,885 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return opener.open(url, data, timeout)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 429, in open
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets response = self._open(req, data)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 447, in _open
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets '_open', req)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets result = func(*args)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return self.do_open(httplib.HTTPConnection, req)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets raise URLError(err)
2017-05-15 21:27:32,886 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets URLError: <urlopen error [Errno 111] Connection refused>
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping web socket server
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Stopping ssltunnel
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets Traceback (most recent call last):
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 393, in <module>
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets sys.exit(main())
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 389, in main
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets return run_test_harness(parser, options)
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 371, in run_test_harness
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets mochitest.cleanup(options)
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets File "mochitest/runtestsremote.py", line 67, in cleanup
2017-05-15 21:27:32,887 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets MochitestDesktop.cleanup(self, options)
2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets os.remove(self.manifest)
2017-05-15 21:27:32,888 nexus-5-6 INFO UnitTestJob 20170516031557 debug mochitest-toolkit-widgets OSError: [Errno 2] No such file or directory: '/tmp/tmpc59Co3.mozrunner/tests.manifest'
Comment 35 • 7 years ago (Assignee)
Not purging but related to rebooting a disconnected device?
Comment 36 • 7 years ago (Assignee)
2017-05-22 18:00:27,737 pixel-07 ERROR UnitTestJob mozilla-beta 20170522120019 debug api-25 android-api-15 mochitest-dom-browser-element Exception tearing down job
Traceback (most recent call last):
File "/mozilla/autophone/autophone/phonetest.py", line 904, in teardown_job
tests=[self])
File "/mozilla/autophone/autophone/autophonetreeherder.py", line 466, in submit_complete
self.queue_request(machine, project, tjc)
File "/mozilla/autophone/autophone/autophonetreeherder.py", line 133, in queue_request
self.shared_lock.release()
ValueError: semaphore or lock released too many times
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Started Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261154) =========
2017-05-22 18:01:21,261 nexus-6p-2 INFO ========= Finished Submitting Log (results: 0, elapsed: 0 secs) (at 2017-05-22 18:01:21.261475) =========
2017-05-22 18:14:50,697 root WARNING check_for_unrecoverable_errors: Purging hung phone nexus-6p-2
Maybe this is a lock contention issue: we screw up by releasing too many times, but later try to acquire too many times? And what's up with the purging? I'll look into this more now.
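For reference, the "released too many times" ValueError in the traceback is exactly what multiprocessing.Lock raises when release() is called without a matching acquire(), so it points at an unbalanced release rather than ordinary contention. A quick illustration (plain Python, not Autophone code):

```python
import multiprocessing as mp

lock = mp.Lock()
lock.acquire()
lock.release()       # balanced: fine
try:
    lock.release()   # unbalanced: no matching acquire
    unbalanced_ok = True
except ValueError as e:
    unbalanced_ok = False
    print(e)         # the "released too many times" error
```

So some path in queue_request, or some other holder of the shared lock, must be releasing without having acquired.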
Comment 37 • 7 years ago (Assignee)
In AutophoneTreeherder:

    def queue_request(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('AutophoneTreeherder.queue_request: %s', job_collection.__dict__)
        logger.debug('AutophoneTreeherder shared_lock.acquire')
        self.shared_lock.acquire()
        try:
            self.jobs.new_treeherder_job(machine, project, job_collection)
        finally:
            logger.debug('AutophoneTreeherder shared_lock.release')
            self.shared_lock.release()
In Jobs:

    def new_treeherder_job(self, machine, project, job_collection):
        logger = utils.getLogger()
        logger.debug('jobs.new_treeherder_job: %s %s %s',
                     machine, project, job_collection.__dict__)
        attempts = 0
        now = datetime.datetime.utcnow().isoformat()
        conn = self._conn()
        job_cursor = self._execute_sql(
            conn,
            'insert into treeherder values (?, ?, ?, ?, ?, ?)',
            values=(None, attempts, now, machine, project, job_collection.to_json()))
        job_cursor.close()
        self._commit_connection(conn)
        self._close_connection(conn)
There is no way these could get out of sync, so it must be the case that someone else is releasing the shared lock without first obtaining it. It must be a thread in the main process.
Comment 38 • 7 years ago (Assignee)
Apart from bug 1367199, I've been looking into this in a variety of ways. One approach was to realize that simply having the shared lock as an attribute which is "passed" to the worker subprocess created by invoking the PhoneWorkerSubProcess.run method probably isn't correct. I changed the run method to take a shared_lock argument and changed the Process call to include args=(shared_lock,), so that we use the "correct" method of sharing an object via shared memory. With various other changes, I managed to create deadlocks accidentally, and also managed to pretty much lock up my GNOME session while Autophone was running. This leads me to believe that the previous method of just reusing the PhoneWorkerSubProcess instance's shared_lock attribute didn't do anything to serialize access, as the locks probably weren't shared at all.
I've been investigating exactly why I think I need the shared lock, and it seems to me that really all I cared about was inserting rows into the jobs, tests, and treeherder tables in the jobs.sqlite database. But sqlite3 already handles locking the database, which leads me to wonder if I really need the shared lock at all.
That is what I am investigating now.
Additional thoughts welcome.
Comment 39 • 7 years ago (Assignee)
Attachment #8870901 - Flags: review?(jmaher)
Comment 40 • 7 years ago (Assignee)
Attachment #8870902 - Flags: review?(jmaher)
Comment 41 • 7 years ago (Assignee)
Attachment #8870903 - Flags: review?(jmaher)
Comment 42 • 7 years ago
Comment on attachment 8870901 [details] [diff] [review]
bug-1353509-shared-lock-v1.patch
Review of attachment 8870901 [details] [diff] [review]:
-----------------------------------------------------------------
bummer to remove the shared locks
Attachment #8870901 - Flags: review?(jmaher) → review+
Comment 43 • 7 years ago
Comment on attachment 8870902 [details] [diff] [review]
bug-1353509-purge-logging-v1.patch
Review of attachment 8870902 [details] [diff] [review]:
-----------------------------------------------------------------
simple!
Attachment #8870902 - Flags: review?(jmaher) → review+
Updated • 7 years ago
Attachment #8870903 - Flags: review?(jmaher) → review+
Comment 44 • 7 years ago (Assignee)
Comment 45 • 7 years ago (Assignee)
Seems to be resolved. I've retriggered jobs and had no issues.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated • 3 years ago
Product: Testing → Testing Graveyard