Closed Bug 1400860 Opened 7 years ago Closed 7 years ago

investigate jump in 400s from the public app on August 23rd

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Unassigned)

Details

Nick noticed that error responses from the public app jumped from ~5k/hour to anywhere from ~10-30k/hour. https://github.com/mozilla/balrog/pull/373, https://github.com/mozilla/balrog/pull/372, and https://github.com/mozilla/balrog/pull/377 all landed in production around that time. Curiously, there was some discussion in #377 about potentially increasing 400s. Datadog link for those with access: https://app.datadoghq.com/dash/156924/balrog-web-aus5mozillaorg?live=false&page=0&is_auto=false&from_ts=1503387484944&to_ts=1503686959047&tile_size=m&fullscreen=false
Looks like it's just 400s that spiked, actually.
Summary: investigate jump in 400s/500s from the public app on August 23rd → investigate jump in 400s from the public app on August 23rd
URLs like https://aus5.mozilla.org/update/2/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/update.xml are definitely getting 400 instead of 404 now, both through curl and Firefox, so I think we may want to back out #377. It sounds like it worked well in all local testing, but had different behaviour with nginx in the mix...
I thought we agreed that this URL generated 400s because it had invalid percent escapes, and using e.g. https://aus5.mozilla.org/update/2/%25PRODUCT%25/%25VERSION%25/%25BUILD_ID%25/%25BUILD_TARGET%25/%25LOCALE%25/%25CHANNEL%25/%25OS_VERSION%25/update.xml worked fine. I also thought we agreed that the 400s were happening at the nginx level and changing our application's behavior wouldn't affect this one way or the other. Why would nginx allow a request like this through just because we changed our application's behavior? I'm happy to roll back #377, but I don't really understand why that would fix the problem.
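For illustration, here is a minimal Python sketch (standard library only, not Balrog code) of the escaping difference discussed above: a literal "%PRODUCT%" placeholder contains "%PR", which is not a valid percent escape, while escaping the percent signs themselves produces the "%25PRODUCT%25" form that works fine.

    from urllib.parse import quote, unquote

    # A literal "%PRODUCT%" placeholder is not a valid percent-encoded path
    # segment: "%PR" would require "PR" to be a hex byte, which it is not,
    # so a strict URL parser rejects the request before it reaches the app.
    raw_placeholder = "%PRODUCT%"

    # Escaping the percent signs themselves yields a well-formed segment.
    escaped = quote(raw_placeholder, safe="")
    print(escaped)           # %25PRODUCT%25

    # Decoding the escaped form recovers the original placeholder string.
    print(unquote(escaped))  # %PRODUCT%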
(In reply to Ethan Glasser-Camp (:glasserc) from comment #3)

> I thought we agreed that this URL generated 400s because it had invalid
> percent escapes, and using e.g.
> https://aus5.mozilla.org/update/2/%25PRODUCT%25/%25VERSION%25/%25BUILD_ID%25/%25BUILD_TARGET%25/%25LOCALE%25/%25CHANNEL%25/%25OS_VERSION%25/update.xml
> worked fine. I also thought we agreed that the 400s were happening at the
> nginx level and changing our application's behavior wouldn't affect this one
> way or the other.

Yep, and it still looks that way to me based on:

curl -L https://aus5.mozilla.org/update/2/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/update.xml
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx</center>
</body>
</html>

...and the fact that I can't find any requests with strings like "%PRODUCT%" in the ELB logs (a sketch of that kind of log scan is below).

> Why would nginx allow a request like this through just
> because we changed our application's behavior? I'm happy to roll back #377,
> but I don't really understand why that would fix the problem.

I have no idea, to be honest, and #377 could be totally unrelated to this spike. It's tough to get more information without playing with different code with nginx in the mix.

Relud - I don't suppose you have any insight into this, do you? Either why URLs like https://aus5.mozilla.org/update/2/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/update.xml would generate 400s, or whether there were any infra changes on August 23rd that would otherwise cause an increase in 400s.
Flags: needinfo?(dthorn)
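For reference, a minimal sketch of the kind of ELB log scan mentioned above, assuming gzipped access logs have been synced to a local directory; the glob pattern and log layout are assumptions, not the actual Balrog infrastructure.

    import glob
    import gzip

    # Hypothetical local copy of ELB access logs; the path is an assumption.
    LOG_GLOB = "elb-logs/*.log.gz"

    # Count log lines whose request still contains the literal "%PRODUCT%"
    # placeholder, to confirm whether such requests ever reach the ELB.
    matches = 0
    for path in glob.glob(LOG_GLOB):
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                if "%PRODUCT%" in line:
                    matches += 1
    print("requests containing %PRODUCT%:", matches)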
If it's nginx that started returning 400s for some URLs that it wasn't before, that's super suspicious, and I'm not aware of any changes in nginx that could have caused that. Have we confirmed, by looking through the ELB logs, that those types of URLs changed to responding with 400? The nginx 400s look like a red herring to me, but I'm not sure.

I strongly suspect that this is a direct result of #373. I was monitoring / once a minute, and at the time of the change I got paged. I'm not monitoring that endpoint anymore, but maybe there are things trying to crawl it that now hit 400s (a sketch of such a probe is below).
Flags: needinfo?(dthorn)
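As an aside, a minimal sketch of the kind of once-a-minute probe described above; the endpoint, interval, and alerting behaviour here are assumptions, not the actual monitoring setup.

    import time
    import urllib.error
    import urllib.request

    # Assumed endpoint; the real monitor's target and alerting differ.
    URL = "https://aus5.mozilla.org/"

    while True:
        status = None
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                status = resp.getcode()
        except urllib.error.HTTPError as err:
            status = err.code
        except urllib.error.URLError as err:
            print(f"ALERT: {URL} unreachable: {err.reason}")
        if status is not None and status >= 400:
            print(f"ALERT: {URL} returned {status}")
        time.sleep(60)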
I don't think we're going to investigate this any further - the spike went away.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Product: Release Engineering → Release Engineering Graveyard