Closed
Bug 1275344
Opened 8 years ago
Closed 8 years ago
Autophone - download builds from the TaskCluster index
Categories
(Testing Graveyard :: Autophone, defect)
Testing Graveyard
Autophone
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bc, Assigned: bc)
References
Details
Attachments
(6 files)
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
bc
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
bc
:
review+
|
Details | Diff | Splinter Review |
As build jobs move from buildbot to TaskCluster, the builds will no longer be available from ftp|archive.mozilla.org. trigger_runs.py will need to be updated to use the TC index instead of scraping the archive site. The pulse notification of builds may not require any modification since I would expect them to include the new locations for builds, but I haven't checked that yet.
Assignee | ||
Comment 1•8 years ago
|
||
This adds monitoring for trunk builds for exchange/taskcluster-queue/v1/task-completed and begins the transition from getting metadata from the build urls. With the exception of a minor tweak to trigger_runs.py to support passing the abi/sdk to the autophone process, this does not handle the case where trigger_runs.py downloads taskcluster builds since I am still working on that patch. It also tweaks some of the more verbose logging statements to reduce the log spew.
I had planned to finish the work for supporting triggering taskcluster builds, then consolidate the taskcluster related code into a file/class but am pushing this work at its current stage since the android debug trunk buildbot builds have been turned off. I'll complete the refactoring in the next patch.
We are ok if this doesn't get reviewed until Monday since the weekend is slow and I still have backfilling to do. It should be easy enough to backfill the debug unit tests on mozilla-central even if we miss them this weekend.
Attachment #8773911 -
Flags: review?(gbrown)
Comment 2•8 years ago
|
||
Comment on attachment 8773911 [details] [diff] [review]
bug-1275344-part-1-pulse-v1.patch
Review of attachment 8773911 [details] [diff] [review]:
-----------------------------------------------------------------
::: autophonepulsemonitor.py
@@ +30,5 @@
> normalized messages in order to obtain the check-in comment for a
> build. The comment is used to determine if a try build has
> requested Autophone testing.
>
> + Taskcluster builds are detected by listing to the
s/listing/listening/
@@ +247,5 @@
> return self.listen_thread.is_alive()
>
> def listen(self):
> + if self.verbose:
> + logger.debug('AutophonePulseMonitor: start shared_lock.acquire')
Thanks - I think I'll like that better.
@@ +519,5 @@
> + response = tc_queue.listArtifacts(task_id, run_id, {
> + 'continuationToken': response['continuationToken']})
> +
> + if 'buildid' not in build_metadata or 'build_type' not in build_metadata:
> + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: '
Does this happen often/normally? Should it be a warning? (just checking)
@@ +526,5 @@
> + return
> +
> + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata))
> +
> + if build_metadata['appName'] != 'fennec':
case sensitivity?
@@ +530,5 @@
> + if build_metadata['appName'] != 'fennec':
> + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> + return
> + if build_metadata['branch'] not in self.trees:
> + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))
maybe add more to the logging here (and above, below) to differentiate the various early return cases
Attachment #8773911 -
Flags: review?(gbrown) → review+
Assignee | ||
Comment 3•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #2)
> Comment on attachment 8773911 [details] [diff] [review]
> bug-1275344-part-1-pulse-v1.patch
>
> Review of attachment 8773911 [details] [diff] [review]:
> -----------------------------------------------------------------
>
>
> @@ +519,5 @@
> > + response = tc_queue.listArtifacts(task_id, run_id, {
> > + 'continuationToken': response['continuationToken']})
> > +
> > + if 'buildid' not in build_metadata or 'build_type' not in build_metadata:
> > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: '
>
> Does this happen often/normally? Should it be a warning? (just checking)
>
Seems pretty common in testing. Seems the tasks are for lint checking, or in other cases an error has occurred. I don't think it should be a warning, but I wanted to get examples of the situation in the debug log.
examples:
https://queue.taskcluster.net/v1/task/Qx2GiTUvS3qwGIkEi9swXw/runs/0/artifacts
https://queue.taskcluster.net/v1/task/fw10xa4cRuG68uZ-SXXyVw/runs/0/artifacts
> @@ +526,5 @@
> > + return
> > +
> > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata))
> > +
> > + if build_metadata['appName'] != 'fennec':
>
> case sensitivity?
>
buildbot's messages return in the payload, build, properties:
[
"appName",
"Fennec",
"SetProperty Step"
],
but mozinfo has appname "fennec"
https://public-artifacts.taskcluster.net/UAA7utzeRHq2CpvH7S54CQ/0/public/build/target.mozinfo.json
The appName stuff in the handle_taskcompleted just matches the buildbot property name since I want the builds to be handled by the same build_callback. I'll clean up the camelCase when we drop buildbot support in November.
> @@ +530,5 @@
> > + if build_metadata['appName'] != 'fennec':
> > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> > + return
> > + if build_metadata['branch'] not in self.trees:
> > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))
>
> maybe add more to the logging here (and above, below) to differentiate the
> various early return cases
Not sure what you mean here. I'll go ahead with the patch and we can fix this in the part 2 when we add taskcluster builds to trigger.
Comment 4•8 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #3)
> > @@ +530,5 @@
> > > + if build_metadata['appName'] != 'fennec':
> > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> > > + return
> > > + if build_metadata['branch'] not in self.trees:
> > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))
> >
> > maybe add more to the logging here (and above, below) to differentiate the
> > various early return cases
>
> Not sure what you mean here. I'll go ahead with the patch and we can fix
> this in the part 2 when we add taskcluster builds to trigger.
Sorry for the lack of clarity. I was just trying to say that there are several different conditions that can trigger a log message of 'handle_taskcompleted: task_id: %s, run_id: %s: skip %s'; it will be easier to debug some conditions if each message is unique.
Assignee | ||
Comment 5•8 years ago
|
||
Ok, so instead of
logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
we could do
logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip appName %s' % (task_id, run_id, build_metadata['appName']))
etc?
Comment 6•8 years ago
|
||
Yes.
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → bob
Assignee | ||
Comment 7•8 years ago
|
||
v1 with tweaks to logging as discussed. for application of following patches. carrying forward r+.
Attachment #8778408 -
Flags: review+
Assignee | ||
Comment 8•8 years ago
|
||
Simple patch which basically renames the BuildLocation stuff to prior to adding the Taskcluster bits. Also removes inbound-archive since that is 404.
Attachment #8778409 -
Flags: review?(gbrown)
Assignee | ||
Comment 9•8 years ago
|
||
Adding Taskcluster support was more problematic than I thought and the patch exemplifies that fact.
Previously we had relied on the meta data contained in the ftp build urls. This is no longer available using taskcluster build urls so we need to obtain it elsewhere and need to make pass it around so we don't have to keep looking it up. This necessitated changes to the jobs table in the jobs database as well as other changes to the treeherder submission code.
utils.get_build_data() was expanded to look at the additional json files if they are available and to collect additional information such as abi, sdk, build type, nightly status, platform, the full changeset url and the revision id as well as whether the build was produced by buildbot or taskcluster.
Taskcluster also forced a change to use UTC build ids. Tracking the builder_type is necessary for example since we need to be able to tell which timezone the original build id was in: for buildbot it is Pacific time; for taskcluster it is UTC. This also made it necessary to make the code timezone aware.
During the transition from buildbot to taskcluster, the code prefers buildbot builds or taskcluster tier 1 builds and will ignore taskcluster builds which are not tier 1. If during a transition period where both a being produced, we will get "duplicate' builds from both but only for the short period where they are both running.
When reviewing, I suggest this order:
utils.py
build_dates.py
builds.py
trigger_runs.py
buildserver.py
jobs.py
autophonepulsemonitor.py
autophonetreeherder.py
phonetest.py
autophone.py
worker.py
tests/*.py
Attachment #8778410 -
Flags: review?(gbrown)
Assignee | ||
Comment 10•8 years ago
|
||
Phonedash will now use UTC. Note that no changes are required to the database which uses System time UTC but if you have a local database and do not have it set to use UTC you will need to add default-time-zone='+00:00' to your [mysqld] section in /etc/my.cnf and restart mysqld/mariadbd or whatever you have.
Attachment #8778412 -
Flags: review?(gbrown)
Assignee | ||
Comment 11•8 years ago
|
||
example runs on staging.
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=be293e09860e&tochange=0f66d9a52316
https://treeherder.allizom.org/#/jobs?repo=fx-team&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=c60886d5c003&tochange=c60886d5c003
Note the debug builds are tier 1 taskcluster and have Build: Build: unknown arm-unknown-api15 android in the jobinfo panel. I'll need to deal with that some day but left it for later.
Updated•8 years ago
|
Attachment #8778409 -
Flags: review?(gbrown) → review+
Comment 12•8 years ago
|
||
Comment on attachment 8778412 [details] [diff] [review]
bug-1275344-phonedash-buildids-in-utc-v1.patch
Review of attachment 8778412 [details] [diff] [review]:
-----------------------------------------------------------------
I don't know phonedash well, but this looks sane and I think you know what you are doing.
Attachment #8778412 -
Flags: review?(gbrown) → review+
Comment 13•8 years ago
|
||
Comment on attachment 8778410 [details] [diff] [review]
bug-1275344-part-3-taskcluster-v1.patch
Review of attachment 8778410 [details] [diff] [review]:
-----------------------------------------------------------------
Epic. Finding all the date changes must have been painful.
I'm not sure I fully appreciate all that's going on here, but it seems to work and I don't see a single thing wrong with it.
Attachment #8778410 -
Flags: review?(gbrown) → review+
Assignee | ||
Comment 14•8 years ago
|
||
https://github.com/mozilla/autophone/commit/07ff8deb73558990d31e459d245b05eb3d1689a9
https://github.com/mozilla/autophone/commit/473e6380fbb614db003d1ef2ab3779371418980f
https://github.com/mozilla/autophone/commit/7145da02367b342031d10d94b64147afb7bea763
Blocks: autophone-deployments
Status: NEW → ASSIGNED
Assignee | ||
Comment 15•8 years ago
|
||
https://github.com/mozilla/phonedash/commit/dccd2bbdfaa3ffb2b7b9880b5e340f8b3b037d48
deployed 2016-08-10 ~00:00 PDT
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 16•8 years ago
|
||
The production Ubuntu servers running Python 2.7.6 show the following error after starting:
Traceback (most recent call last):
File "autophone.py", line 1503, in <module>
exit_code = main(options)
File "autophone.py", line 1208, in main
autophone.run()
File "autophone.py", line 219, in run
self.worker_msg_loop()
File "autophone.py", line 340, in worker_msg_loop
self.check_for_unrecoverable_errors()
File "autophone.py", line 329, in check_for_unrecoverable_errors
elapsed = datetime.datetime.now(tz=pytz.utc) - worker.last_status_msg.timestamp
TypeError: can't subtract offset-naive and offset-aware datetimes
which does not occur on my local Fedora 23 system with Python 2.7.11
Assignee | ||
Comment 17•8 years ago
|
||
This replaces the utcnow in the worker status with now(tz=pytz.utc).
https://github.com/mozilla/autophone/commit/7d54bd34a5fe18068277d8965c9c4f0e39dc7253
Hopefully there aren't other timebombs waiting to be found.
Attachment #8779657 -
Flags: review+
Updated•3 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•