Closed Bug 1275344 Opened 8 years ago Closed 8 years ago

Autophone - download builds from the TaskCluster index

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(6 files)

As build jobs move from buildbot to TaskCluster, the builds will no longer be available from ftp|archive.mozilla.org. trigger_runs.py will need to be updated to use the TC index instead of scraping the archive site. The pulse notification of builds may not require any modification since I would expect them to include the new locations for builds, but I haven't checked that yet.
Depends on: 1118394
This adds monitoring for trunk builds for exchange/taskcluster-queue/v1/task-completed and begins the transition from getting metadata from the build urls. With the exception of a minor tweak to trigger_runs.py to support passing the abi/sdk to the autophone process, this does not handle the case where trigger_runs.py downloads taskcluster builds since I am still working on that patch. It also tweaks some of the more verbose logging statements to reduce the log spew. I had planned to finish the work for supporting triggering taskcluster builds, then consolidate the taskcluster related code into a file/class but am pushing this work at its current stage since the android debug trunk buildbot builds have been turned off. I'll complete the refactoring in the next patch. We are ok if this doesn't get reviewed until Monday since the weekend is slow and I still have backfilling to do. It should be easy enough to backfill the debug unit tests on mozilla-central even if we miss them this weekend.
Attachment #8773911 - Flags: review?(gbrown)
Comment on attachment 8773911 [details] [diff] [review] bug-1275344-part-1-pulse-v1.patch Review of attachment 8773911 [details] [diff] [review]: ----------------------------------------------------------------- ::: autophonepulsemonitor.py @@ +30,5 @@ > normalized messages in order to obtain the check-in comment for a > build. The comment is used to determine if a try build has > requested Autophone testing. > > + Taskcluster builds are detected by listing to the s/listing/listening/ @@ +247,5 @@ > return self.listen_thread.is_alive() > > def listen(self): > + if self.verbose: > + logger.debug('AutophonePulseMonitor: start shared_lock.acquire') Thanks - I think I'll like that better. @@ +519,5 @@ > + response = tc_queue.listArtifacts(task_id, run_id, { > + 'continuationToken': response['continuationToken']}) > + > + if 'buildid' not in build_metadata or 'build_type' not in build_metadata: > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: ' Does this happen often/normally? Should it be a warning? (just checking) @@ +526,5 @@ > + return > + > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata)) > + > + if build_metadata['appName'] != 'fennec': case sensitivity? @@ +530,5 @@ > + if build_metadata['appName'] != 'fennec': > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > + return > + if build_metadata['branch'] not in self.trees: > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) maybe add more to the logging here (and above, below) to differentiate the various early return cases
Attachment #8773911 - Flags: review?(gbrown) → review+
(In reply to Geoff Brown [:gbrown] from comment #2) > Comment on attachment 8773911 [details] [diff] [review] > bug-1275344-part-1-pulse-v1.patch > > Review of attachment 8773911 [details] [diff] [review]: > ----------------------------------------------------------------- > > > @@ +519,5 @@ > > + response = tc_queue.listArtifacts(task_id, run_id, { > > + 'continuationToken': response['continuationToken']}) > > + > > + if 'buildid' not in build_metadata or 'build_type' not in build_metadata: > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: ' > > Does this happen often/normally? Should it be a warning? (just checking) > Seems pretty common in testing. Seems the tasks are for lint checking, or in other cases an error has occurred. I don't think it should be a warning, but I wanted to get examples of the situation in the debug log. examples: https://queue.taskcluster.net/v1/task/Qx2GiTUvS3qwGIkEi9swXw/runs/0/artifacts https://queue.taskcluster.net/v1/task/fw10xa4cRuG68uZ-SXXyVw/runs/0/artifacts > @@ +526,5 @@ > > + return > > + > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata)) > > + > > + if build_metadata['appName'] != 'fennec': > > case sensitivity? > buildbot's messages return in the payload, build, properties: [ "appName", "Fennec", "SetProperty Step" ], but mozinfo has appname "fennec" https://public-artifacts.taskcluster.net/UAA7utzeRHq2CpvH7S54CQ/0/public/build/target.mozinfo.json The appName stuff in the handle_taskcompleted just matches the buildbot property name since I want the builds to be handled by the same build_callback. I'll clean up the camelCase when we drop buildbot support in November. > @@ +530,5 @@ > > + if build_metadata['appName'] != 'fennec': > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > > + return > > + if build_metadata['branch'] not in self.trees: > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) > > maybe add more to the logging here (and above, below) to differentiate the > various early return cases Not sure what you mean here. I'll go ahead with the patch and we can fix this in the part 2 when we add taskcluster builds to trigger.
(In reply to Bob Clary [:bc:] from comment #3) > > @@ +530,5 @@ > > > + if build_metadata['appName'] != 'fennec': > > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > > > + return > > > + if build_metadata['branch'] not in self.trees: > > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) > > > > maybe add more to the logging here (and above, below) to differentiate the > > various early return cases > > Not sure what you mean here. I'll go ahead with the patch and we can fix > this in the part 2 when we add taskcluster builds to trigger. Sorry for the lack of clarity. I was just trying to say that there are several different conditions that can trigger a log message of 'handle_taskcompleted: task_id: %s, run_id: %s: skip %s'; it will be easier to debug some conditions if each message is unique.
Ok, so instead of logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) we could do logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip appName %s' % (task_id, run_id, build_metadata['appName'])) etc?
Yes.
Assignee: nobody → bob
v1 with tweaks to logging as discussed. for application of following patches. carrying forward r+.
Attachment #8778408 - Flags: review+
Attached patch bug-1275344-part-2-ftp-v1.patch (deleted) — Splinter Review
Simple patch which basically renames the BuildLocation stuff to prior to adding the Taskcluster bits. Also removes inbound-archive since that is 404.
Attachment #8778409 - Flags: review?(gbrown)
Adding Taskcluster support was more problematic than I thought and the patch exemplifies that fact. Previously we had relied on the meta data contained in the ftp build urls. This is no longer available using taskcluster build urls so we need to obtain it elsewhere and need to make pass it around so we don't have to keep looking it up. This necessitated changes to the jobs table in the jobs database as well as other changes to the treeherder submission code. utils.get_build_data() was expanded to look at the additional json files if they are available and to collect additional information such as abi, sdk, build type, nightly status, platform, the full changeset url and the revision id as well as whether the build was produced by buildbot or taskcluster. Taskcluster also forced a change to use UTC build ids. Tracking the builder_type is necessary for example since we need to be able to tell which timezone the original build id was in: for buildbot it is Pacific time; for taskcluster it is UTC. This also made it necessary to make the code timezone aware. During the transition from buildbot to taskcluster, the code prefers buildbot builds or taskcluster tier 1 builds and will ignore taskcluster builds which are not tier 1. If during a transition period where both a being produced, we will get "duplicate' builds from both but only for the short period where they are both running. When reviewing, I suggest this order: utils.py build_dates.py builds.py trigger_runs.py buildserver.py jobs.py autophonepulsemonitor.py autophonetreeherder.py phonetest.py autophone.py worker.py tests/*.py
Attachment #8778410 - Flags: review?(gbrown)
Phonedash will now use UTC. Note that no changes are required to the database which uses System time UTC but if you have a local database and do not have it set to use UTC you will need to add default-time-zone='+00:00' to your [mysqld] section in /etc/my.cnf and restart mysqld/mariadbd or whatever you have.
Attachment #8778412 - Flags: review?(gbrown)
Attachment #8778409 - Flags: review?(gbrown) → review+
Comment on attachment 8778412 [details] [diff] [review] bug-1275344-phonedash-buildids-in-utc-v1.patch Review of attachment 8778412 [details] [diff] [review]: ----------------------------------------------------------------- I don't know phonedash well, but this looks sane and I think you know what you are doing.
Attachment #8778412 - Flags: review?(gbrown) → review+
Comment on attachment 8778410 [details] [diff] [review] bug-1275344-part-3-taskcluster-v1.patch Review of attachment 8778410 [details] [diff] [review]: ----------------------------------------------------------------- Epic. Finding all the date changes must have been painful. I'm not sure I fully appreciate all that's going on here, but it seems to work and I don't see a single thing wrong with it.
Attachment #8778410 - Flags: review?(gbrown) → review+
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
The production Ubuntu servers running Python 2.7.6 show the following error after starting: Traceback (most recent call last): File "autophone.py", line 1503, in <module> exit_code = main(options) File "autophone.py", line 1208, in main autophone.run() File "autophone.py", line 219, in run self.worker_msg_loop() File "autophone.py", line 340, in worker_msg_loop self.check_for_unrecoverable_errors() File "autophone.py", line 329, in check_for_unrecoverable_errors elapsed = datetime.datetime.now(tz=pytz.utc) - worker.last_status_msg.timestamp TypeError: can't subtract offset-naive and offset-aware datetimes which does not occur on my local Fedora 23 system with Python 2.7.11
This replaces the utcnow in the worker status with now(tz=pytz.utc). https://github.com/mozilla/autophone/commit/7d54bd34a5fe18068277d8965c9c4f0e39dc7253 Hopefully there aren't other timebombs waiting to be found.
Attachment #8779657 - Flags: review+
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: