Closed Bug 1622267 Opened 5 years ago Closed 3 years ago

update-verify intermittently fails: "expected buildID 20200313021419 does not match actual 20200310192828"

Categories

(Release Engineering :: Release Automation: Updates, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jlorenzo, Unassigned)

References

Details

(Keywords: intermittent-failure)

I haven't been on releaseduty for quite a long time, so this bug may be a duplicate.

I saw this on today's 75.0b3 alongside Bug 1622266:

[task 2020-03-13T05:54:52.421Z] TEST-UNEXPECTED-FAIL: expected buildID 20200313021419 does not match actual 20200310192828

Source: https://firefox-ci-tc.services.mozilla.com/tasks/IILbAXDDSDyltJk2TpRCng/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FIILbAXDDSDyltJk2TpRCng%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log#L81544

It's not the same error message as bug 1622266, but a rerun fixed it too. At the moment, I'm guessing it's a Balrog race condition, but I'm not sure at all.
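For context, a minimal sketch of the kind of comparison behind this failure message, assuming only that update-verify fetches update.xml and compares its buildID attribute against the expected one; the real tooling is shell-based and differs in detail:

    # Hypothetical sketch of the check that produces a failure like the one above.
    import urllib.request
    import xml.etree.ElementTree as ET

    def check_buildid(update_xml_url, expected_buildid):
        """Fetch an update.xml and compare its buildID against the expected one."""
        with urllib.request.urlopen(update_xml_url) as resp:
            root = ET.fromstring(resp.read())
        update = root.find("update")
        if update is None:
            raise RuntimeError("no <update> element in response")
        actual = update.get("buildID")
        if actual != expected_buildid:
            print(f"TEST-UNEXPECTED-FAIL: expected buildID {expected_buildid} "
                  f"does not match actual {actual}")
            return False
        return True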

Buildhub says 20200310192828 is 75.0b2, so this looks like AUS caching. The request is:

[task 2020-03-13T05:42:25.010Z] --2020-03-13 05:42:25--  https://aus5.mozilla.org/update/3/Firefox/61.0/20180604143143/Linux_x86-gcc3/pl/aurora-localtest/default/default/default/update.xml?force=1
[task 2020-03-13T05:42:25.013Z] Resolving aus5.mozilla.org (aus5.mozilla.org)... 52.85.148.63, 52.85.148.16, 52.85.148.127, ...
[task 2020-03-13T05:42:25.013Z] Connecting to aus5.mozilla.org (aus5.mozilla.org)|52.85.148.63|:443... connected.
[task 2020-03-13T05:42:34.081Z] HTTP request sent, awaiting response... 
[task 2020-03-13T05:42:34.081Z]   HTTP/1.1 200 OK
[task 2020-03-13T05:42:34.081Z]   Content-Type: text/xml; charset=utf-8
[task 2020-03-13T05:42:34.081Z]   Content-Length: 671
[task 2020-03-13T05:42:34.081Z]   Connection: keep-alive
[task 2020-03-13T05:42:34.081Z]   Server: nginx/1.17.1
[task 2020-03-13T05:42:34.081Z]   Date: Wed, 11 Mar 2020 05:42:39 GMT
[task 2020-03-13T05:42:34.081Z]   Cache-Control: public, max-age=90
[task 2020-03-13T05:42:34.081Z]   Strict-Transport-Security: max-age=31536000;
[task 2020-03-13T05:42:34.081Z]   X-Content-Type-Options: nosniff
[task 2020-03-13T05:42:34.081Z]   Content-Security-Policy: default-src 'none'; frame-ancestors 'none'
[task 2020-03-13T05:42:34.081Z]   X-Proxy-Cache-Status: MISS
[task 2020-03-13T05:42:34.081Z]   Via: 1.1 google, 1.1 47c0295005ec7d8570406951491004c3.cloudfront.net (CloudFront)
[task 2020-03-13T05:42:34.081Z]   Alt-Svc: clear
[task 2020-03-13T05:42:34.081Z]   Vary: Accept-Encoding
[task 2020-03-13T05:42:34.081Z]   X-Cache: RefreshHit from cloudfront
[task 2020-03-13T05:42:34.081Z]   X-Amz-Cf-Pop: IAD89-C3
[task 2020-03-13T05:42:34.081Z]   X-Amz-Cf-Id: wZB0FdvYMEydlK72C255U4IptOS_Hw7YtQuUETSCLBXwE2k80FTTwg==
[task 2020-03-13T05:42:34.081Z]   Age: 172795
[task 2020-03-13T05:42:34.081Z] Length: 671 [text/xml]
[task 2020-03-13T05:42:34.081Z] Saving to: ‘update.xml’
[task 2020-03-13T05:42:34.081Z] 
[task 2020-03-13T05:42:34.081Z]      0K                                                      100%  265M=0s
[task 2020-03-13T05:42:34.081Z] 
[task 2020-03-13T05:42:34.081Z] 2020-03-13 05:42:34 (265 MB/s) - ‘update.xml’ saved [671/671]
[task 2020-03-13T05:42:34.081Z] 
[task 2020-03-13T05:42:34.105Z] Got this response:
[task 2020-03-13T05:42:34.106Z] <?xml version="1.0"?>
[task 2020-03-13T05:42:34.106Z] <updates>
[task 2020-03-13T05:42:34.106Z]     <update actions="showURL" appVersion="75.0" buildID="20200310192828" detailsURL="https://www.mozilla.org/pl/firefox/75.0/releasenotes/" displayVersion="75.0 Beta 2" openURL="https://www.mozilla.org/pl/firefox/75.0a2/whatsnew/?oldversion=%OLD_VERSION%" type="minor">
[task 2020-03-13T05:42:34.106Z]         <patch type="complete" URL="http://archive.mozilla.org/pub/devedition/candidates/75.0b2-candidates/build1/update/linux-i686/pl/firefox-75.0b2.complete.mar" hashFunction="sha512" hashValue="3d15f02d66aaf0959a0fe147ceeec6f85f7f2c8bcbf9d8dbc8db1f63273d04b19003dc96100ea26bdd340082ef6fd44bf478780a88cade5fff787a8c65067a9a" size="58854877"/>
[task 2020-03-13T05:42:34.106Z]     </update>
[task 2020-03-13T05:42:34.108Z] </updates>

So CloudFront had stale data and it asked Balrog? This was the first request for the URL, in this log anyway.

The Balrog submission task was UkEQSlQYRBW4LQwXaH4LGQ and completed at 2020-03-13T04:50:52.084Z, i.e. 50 minutes before the request. Any ideas, bhearsum?
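As an aside, a minimal sketch of how one might flag a response like the one above as past its advertised lifetime, given that the Age header (172795 seconds) far exceeds the 90-second max-age; the header values are copied from the log above:

    import re

    # The two freshness-related headers from the response above.
    headers = {
        "Cache-Control": "public, max-age=90",
        "Age": "172795",
    }

    def is_stale(headers):
        """Return True if the cached object's Age exceeds its max-age."""
        match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        max_age = int(match.group(1)) if match else 0
        age = int(headers.get("Age", "0"))
        return age > max_age

    print(is_stale(headers))  # True: Age is roughly two days, max-age is 90s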

Flags: needinfo?(bhearsum)

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #1)

Buildhub says 20200310192828 is 75.0b2, so this looks like AUS caching. The request is:

[task 2020-03-13T05:42:25.010Z] --2020-03-13 05:42:25--  https://aus5.mozilla.org/update/3/Firefox/61.0/20180604143143/Linux_x86-gcc3/pl/aurora-localtest/default/default/default/update.xml?force=1

So CloudFront had stale data and it asked Balrog? This was the first request for the URL, in this log anyway.

The Balrog submission task was UkEQSlQYRBW4LQwXaH4LGQ and completed at 2020-03-13T04:50:52.084Z, i.e. 50 minutes before the request. Any ideas, bhearsum?

It looks like "RefreshHit from Cloudfront" means "the request was served from the cache of the edge location", but that it went back to the origin first to check if it was stale (from https://forums.aws.amazon.com/thread.jspa?threadID=31021) - so this does seem like Balrog was serving the wrong contents. I'll dig into that side of things more tomorrow.

Jeremy, can you confirm my interpretation of the CloudFront behaviour, and how long CloudFront caches for?

Flags: needinfo?(bhearsum) → needinfo?(oremj)

So, here are the potential caches in play, in order from edge to app:

  • CloudFront, which should respect the Cache-Control header, which sets a 90-second max-age. We also set a default TTL of 300s, but that shouldn't apply here.
  • Nginx also looks like it should be respecting the 90s Cache-Control.
  • uwsgi does no caching itself.
  • In the app, we have two in-memory caches:
    ** "blob_version", which caches for 60s.
    ** "blob", which caches for 1 hour, unless the "data_version" of the cached blob is less than what we found in the "blob_version" cache.

Putting this together, we should have a maximum cache time of 2.25 minutes, which is well below the window here.
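To make the invalidation rule above concrete, here is a minimal sketch of such a two-tier cache; the class and method names are invented rather than Balrog's actual code, and it tracks only a single blob for brevity:

    import time

    class TwoTierBlobCache:
        """Illustrative sketch: a 60s blob_version entry guards a 1h blob entry."""

        def __init__(self, fetch_data_version, fetch_blob,
                     version_ttl=60, blob_ttl=3600):
            self.fetch_data_version = fetch_data_version  # cheap DB lookup
            self.fetch_blob = fetch_blob                   # full blob DB lookup
            self.version_ttl = version_ttl
            self.blob_ttl = blob_ttl
            self._version = None  # (data_version, stored_at)
            self._blob = None     # (blob, data_version, stored_at)

        def get(self, name):
            now = time.time()
            # Refresh the cheap data_version entry every 60 seconds.
            if self._version is None or now - self._version[1] > self.version_ttl:
                self._version = (self.fetch_data_version(name), now)
            current_version = self._version[0]
            # Reuse the cached blob for up to an hour, unless its data_version
            # is behind what the blob_version cache reports.
            if (self._blob is None
                    or now - self._blob[2] > self.blob_ttl
                    or self._blob[1] < current_version):
                self._blob = (self.fetch_blob(name), current_version, now)
            return self._blob[0]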

Based on the headers, I'm pretty certain that neither Cloudfront nor nginx served a cached request (X-Cache: RefreshHit from cloudfront indicates that Cloudfront didn't, and X-Proxy-Cache-Status: MISS indicates that nginx didn't).

If the "blob" cache wasn't invalidated for some reason there's a possibility that we could've served out of date information in that time window. That's a little tough to believe because the in-memory cache affects all requests for that particular process - so I would expect to see a number of failures, not just one.

Other things of note:

  • The request took a while (about 9 seconds between connecting and receiving the response):
    [task 2020-03-13T05:42:25.013Z] Connecting to aus5.mozilla.org (aus5.mozilla.org)|52.85.148.63|:443... connected.
    [task 2020-03-13T05:42:34.081Z] HTTP request sent, awaiting response...

Looking in Stackdriver, it looks like at least one uwsgi process restarted around this time. I can't think of any reason why this would cause us to get older data -- if anything, it should force new data.

  • I can't even find a record of this request in Stackdriver. I see other "aurora-localtest" requests, but not one for this particular platform+locale. It's possible that I'm searching poorly, but if it truly didn't hit the backend, my earlier read of the caching could be wrong, or something even weirder could be happening here.

I just noticed bug 1565490 was already filed. It's just a placeholder for general update-verify issues. So I'm marking this bug as blocking the placeholder.

Blocks: 1565490

Everything Ben has mentioned about caching looks accurate to me. Do you need any more input on this bug at this time?

Flags: needinfo?(oremj)

(In reply to Jeremy Orem [:oremj] from comment #5)

Everything Ben has mentioned about caching looks accurate to me. Do you need any more input on this bug at this time?

Unless you have other theories, I think that's it - thanks!

I dug into BigQuery a bit and confirmed that the requests in question did indeed make it to the backend -- so that rules out nginx/cloudfront being an issue here.

At this point, my strongest theory is the aforementioned one:

If the "blob" cache wasn't invalidated for some reason there's a possibility that we could've served out of date information in that time window. That's a little tough to believe because the in-memory cache affects all requests for that particular process - so I would expect to see a number of failures, not just one.

I still don't see how this is possible without the side effect of other requests also failing, but perhaps between all the nodes and processes in play it's possible we only got one? If this is what happened, it seems likely that it's caused by some sort of race condition in the cache update/invalidation.

A couple of things I can see as ways forward:

  • If we want to debug this further we should probably add some logging around the blob and blob_version cache update/invalidation (see the sketch after this list) and see if we can hit it again.
  • Shorten the blob cache time. This will make us re-fetch blobs from the database more often, but probably(?) won't cause a significant performance change given the other caching layers.
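A rough sketch of the kind of logging the first bullet suggests; the helper names are placeholders rather than Balrog's actual cache API:

    import functools
    import logging

    log = logging.getLogger("balrog.cache")

    def log_cache_event(event):
        """Decorator recording which blob and data_version a cache operation saw."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(name, data_version, *args, **kwargs):
                log.info("%s: blob=%s data_version=%s", event, name, data_version)
                return func(name, data_version, *args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical helpers; the real update/invalidation hooks live in Balrog.
    @log_cache_event("blob-cache-update")
    def update_blob_cache(name, data_version, blob):
        ...

    @log_cache_event("blob-cache-invalidate")
    def invalidate_blob_cache(name, data_version):
        ...

Logging every update and invalidation with the data_version it saw would let a stale-response window be correlated against the cache transitions on whichever node served it.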

This is a two-year-old bug; we should probably split update-verify (UV) into two tests.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE