Closed Bug 1187395 Opened 9 years ago Closed 9 years ago

HTTP 400 errors when mozilla-taskcluster POSTs to the jobs endpoint on treeherder staging

Categories

(Tree Management :: Treeherder: API, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: emorley)

References

Details

Starting around 15:12 GMT, mozilla-taskcluster started reporting issues when posting to the /jobs endpoint for all projects it's monitoring. The endpoint returns a 400 status code. This causes jobs not to appear on treeherder staging, but as a side effect mozilla-taskcluster might also not be posting jobs to prod until the stage pushes occur (still investigating that).
From the logs:

cannot POST /api/project/try/jobs/?oauth_body_hash=<body hash>&oauth_consumer_key=<consumer key>&oauth_nonce=<nonce>&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1437758671&oauth_token=&oauth_version=1.0&user=<user>&oauth_signature=<signature> (400)

Error messages with the original tokens and such in them can be found on papertrail [1]. Please see me for access if needed.

[1] https://papertrailapp.com/systems/mozilla-taskcluster-staging/events?r=561287715187109903-561288307397668880
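(For context, the failing request is a two-legged OAuth 1.0 POST signed with HMAC-SHA1, with the signature carried in the query string. The following is a minimal Python sketch of that request shape only, using requests + requests_oauthlib rather than the node client mozilla-taskcluster actually uses; the host, credentials and payload are placeholders, and it omits the oauth_body_hash and user parameters visible in the log line above.)

import requests
from oauthlib.oauth1 import SIGNATURE_TYPE_QUERY
from requests_oauthlib import OAuth1

# Placeholders - illustrative only, not real credentials or payload shape.
project = "try"
consumer_key = "<consumer key>"
consumer_secret = "<consumer secret>"

# Two-legged OAuth 1.0: no resource-owner token, HMAC-SHA1 signature carried
# in the query string, matching the oauth_* parameters in the failing POST.
auth = OAuth1(
    client_key=consumer_key,
    client_secret=consumer_secret,
    signature_method="HMAC-SHA1",
    signature_type=SIGNATURE_TYPE_QUERY,
)

payload = [{"project": project, "job": {}}]  # trimmed for brevity

resp = requests.post(
    "https://treeherder.allizom.org/api/project/%s/jobs/" % project,
    json=payload,
    auth=auth,
)
print(resp.status_code, resp.text)  # a 400 here is the failure reported above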
Seems like next steps are:
1) Find out when the 400s on stage started (using papertrail)
2) See if those same 400s are present in the prod taskcluster logs - if they aren't, then the "jobs not appearing on prod" issue is likely just a side-effect, as per comment 0 - and should be fixed on the TC side.

From IRC:
<garndt> emorley: do you know of anything changing recently that would have caused 400 errors to be returned when attempting to post to TH?
<emorley> garndt: prod's last deploy was on the 17th July, so nothing has changed there since then. There was a new treeherder-client release on pypi, but I guess you don't use that
<garndt> yea, we're using the node client here
<emorley> garndt: in which case I don't believe anything has changed on our side (so long as we're talking about prod, not stage)
<garndt> oh, i suppose all of these are staging so far
<emorley> garndt: stage deploys are listed on https://rpm.newrelic.com/accounts/677903/applications/5585473/deployments
<emorley> (if you don't have access, PM me an email address and I'll send an invite)
<emorley> garndt: which API endpoint is it?
<emorley> garndt: and when did it start?
<emorley> garndt: we were about to do a prod push, so this kind of blocks that if it's something we've changed
<garndt> /jobs
<emorley> garndt: as in, if you'd sent the message 5 mins later, we'd have probably already pushed to prod, and presumably TC would be broken there
<emorley> :-)
<garndt> I'm trying to figure out when this started happening, but I'm noticing jobs not appearing on prod TH from taskcluster
<garndt> looks maybe ~2hrs ago
<garndt> emorley: looks like our first error to staging was at 10:12am CDT...so 2 hours ago
<emorley> garndt: so we did a stage push 2 hours ago (last before that was ~20 hours ago), which deployed: https://github.com/mozilla/treeherder/compare/798df37...2c57b1d
<emorley> garndt: so if this did start 2 hours ago, and is stage-only, then https://github.com/mozilla/treeherder/commit/e5484eed066cfa249759902871524186ac66f533 looks to blame
<emorley> garndt: if you're seeing prod+stage, then it's not something on our side afaict
<garndt> well, so far I'm seeing jobs not appearing on stage or prod, but I'm wondering if because this isn't posting to staging that it's blocking posting to prod
<emorley> ordinarily I'd be more than happy to stay up until the early hours fixing this, I just have to leave by a certain time today to get a lift with someone, then I'm away from wifi, camping in a field for much of the weekend
<emorley> (typical timing)
<wlach> garndt: I don't think you should depend on stage working
<wlach> garndt: that's pretty crazy
<emorley> agreed
<wlach> the whole idea is that it's an experimental environment
<emorley> I imagine it's unintentional though
<garndt> wlach: I'm looking through the code that was written by someone else no longer here to yell at :) I'm not sure what's going on on this side...
Priority: -- → P1
Update: I'm not sure about the 403 errors on staging, but the errors posting to prod were a side effect of a different push that changed the credentials used on prod. That was fixed thanks to emorley and wlach. mozilla-taskcluster was changed to look at prod, but that did not fix the underlying issue of why treeherder might be returning 403s now. If I recall correctly, the time it started happening was around the time some changes were pushed to staging.
So there were several problems:

1) Taskcluster submission to treeherder prod appears to break if submission to stage fails (ie presumably it submits to stage first and only then submits to prod). This is very fragile, and needs to be fixed on the taskcluster side.
2) Something (believed to be bug 1185520; now backed out) broke taskcluster submission to stage (ie comment 0 / comment 1 here) - which, due to #1, also broke submission to prod. Stage being broken is unfortunate, but is to be expected from time to time, and shows stage as a process is "working". So this in itself is not a major problem.
3) Talos e10s scheduling/buildername changes were made before the necessary changes (bug 1168360) were made to treeherder. For now people must make sure not to do this (this isn't the first time) - particularly until we have the approve/deny mechanism of bug 1042077 + regexes that are less prone to false positives.
4) Due to #3, there was a keenness to deploy the talos e10s changes asap, on Friday night. However due to #1, we did not want to push master to prod, since it would have broken taskcluster even more. As a compromise, I suggested deploying a branch to prod which was just "current prod + two cherry-picked talos e10s buildbot.py commits", since in theory it should have been safe. Unfortunately, due to the code+schema changes that needed to be made in a specific order (bug 1185030 - specifically the DB changes in bug 1185030 comment 8), the credentials got overwritten when this cherry-picked branch was pushed. I fixed this in bug 1185030 comment 9 on Friday.

Things we can do better next time:
a) [taskcluster] Not have taskcluster prod break if treeherder stage is not working (avoids #1; see the sketch after the next steps below for one possible approach).
b) [taskcluster] Use a proper client (eg the nodejs client), which the treeherder team maintains, to reduce the impact of breaking API changes (avoids #2).
c) [treeherder] Take over ownership of the nodejs client (assuming taskcluster starts using it again), move it into our repo, and have its tests run on Travis (avoids #2).
d) [treeherder] Remind people that they _must_ make regex changes to Treeherder for buildername changes *before* they go into production, or else expect that they'll have to wait a while until the changes are deployed (avoids #3).
e) [treeherder] Switch to using the Django ORM (bug 1178641), so we can use DB schema migrations (avoids #4).
f) [treeherder] Not feel pressured to deploy on a Friday, when people haven't made regex changes in time (avoids #4).
g) [treeherder] Deploy Treeherder more frequently. We hadn't deployed for 7 days - which (i) means there are more changes backed up that people want landed, and (ii) means that it's harder to keep track of which more "dangerous" changes (eg those that require schema coordination) are not yet deployed (avoids #4).

(In reply to Greg Arndt [:garndt] from comment #4)
> Update: I'm not sure about the 403 errors on staging, but the errors posting
> to prod were a side effect of a different push that changed the credentials
> used on prod. That was fixed thanks to emorley and wlach.

This problem was only added later on, when we were trying to cherry-pick some other changes because master wasn't working with taskcluster. Taskcluster's behaviour is still broken wrt prod (see #1 above).

--

Next steps:
* Now that master/stage is working (due to the backout of bug 1185520), undo the changes in bug 1185030 comment 9 and deploy master to prod.
* Figure out why the landing in bug 1185520 broke TC, and reland it with fixes.
* Ensure the taskcluster team make stage and prod environments more independent.
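To make point (a) and that last next-step concrete, here's a minimal sketch assuming a hypothetical post_jobs() helper that performs the signed POST shown earlier in this bug (the real mozilla-taskcluster submitter is node code, and these names and hosts are illustrative only): each environment is tried independently, so a failure on one - eg stage returning 400 - is logged without blocking the others.

import logging

import requests

log = logging.getLogger(__name__)

# Both hosts are listed explicitly; neither submission depends on the other.
ENVIRONMENTS = {
    "prod": "https://treeherder.mozilla.org",
    "stage": "https://treeherder.allizom.org",
}


def post_jobs(base_url, project, payload):
    # Stand-in for the real OAuth-signed POST to /api/project/<project>/jobs/
    # (see the earlier sketch); raises on any non-2xx response.
    resp = requests.post("%s/api/project/%s/jobs/" % (base_url, project), json=payload)
    resp.raise_for_status()


def submit_everywhere(project, payload):
    # A failure against one environment (eg stage returning 400) is logged
    # and never prevents submission to the remaining environments.
    for name, base_url in ENVIRONMENTS.items():
        try:
            post_jobs(base_url, project, payload)
        except requests.RequestException:
            log.exception("Posting jobs for %s to %s (%s) failed", project, name, base_url)

The key design point is simply that each environment's call sits inside its own try/except, so stage can be broken (or dropped entirely) without affecting submission to prod.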
Assignee: nobody → emorley
Blocks: 1185520
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Errors when posting to treeherder staging from mozilla-taskcluster → HTTP 400 errors when mozilla-taskcluster POSTs to the jobs endpoint on treeherder staging
Depends on: 1188388
Depends on: 1188398
(In reply to Ed Morley [:emorley] from comment #5)
> a) [taskcluster] Not have taskcluster prod break if treeherder stage is not
> working (avoids #1).

Filed bug 1188388.

> b) [taskcluster] Use a proper client (eg the nodejs client)

Filed bug 1188398.

> c) [treeherder] Take over ownership of the nodejs client

Filed bug 1188396.

> Next steps:
> * Now that master/stage is working (due to the backout of bug 1185520), undo
> the changes in bug 1185030 comment 9 and deploy master to prod.

Done; bug 1185030 comment 10.
Depends on: 1191276