Closed Bug 1558709 Opened 5 years ago Closed 5 years ago

Recent uptick in errors for security-state/intermediates

Categories

(Firefox :: Remote Settings Client, defect, P1)

67 Branch
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: glasserc, Unassigned)

References

Details

:mostlygeek reported an uptick in remote-settings errors starting on the night of 6/10. This seems to mostly be concentrated in the security-state bucket. If you have Datadog access, you can see it at https://app.datadoghq.com/dashboard/piq-rtw-fk9?screenId=piq-rtw-fk9&from_ts=1559694939993&live=true&tile_size=m&to_ts=1560299739993&tpl_var_source_subtype=intermediates&tpl_var_source_type=security-state

The errors seem to mostly be parse_error.

:jcj reports that around the same timeframe, his team had converted the security-state/intermediates records from using one kind of base64 encoding to another one. On the hypothesis that this was related somehow, we asked him to roll back the change. He did, and perhaps the errors have slowed, although it's hard to say.

I was able to produce parse_error while syncing security-state/intermediates using Remote Settings Dev Tools on the last Firefox build on my laptop, which is based off of commit 6daf8919d66e75e141ca2bc8ec5defd07cf75bf3 (a week ago). (I'm not 100% sure that it's exactly the same as the errors we are seeing in telemetry.) To reproduce this, I did:

  • Start local build
  • Open about:remotesettings. It seemed to open by default pointing at stage.
  • security-state/intermediates may not be listed; if not, wait a few minutes and refresh the page
  • "Clear ALL local data"
  • "Force sync" security-state/intermediates
  • "Poll server"
  • "Force sync" security-state/intermediates
  • An error is produced on the left-hand panel and an exception is produced in the browser toolbox.

WebExtensions: Error: "Response from server unparseable (HTTP 403; SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data): <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
This distribution is not configured to allow the HTTP request method that was used for this request. The distribution supports only cachable requests.

<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: l-y285J4zbG_y0VgasTcvOJOCWCTBh05NQSakkaYLEqu4gHxj3IG2w==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>"
processResponse resource://services-common/kinto-http-client.js:2785
request resource://services-common/kinto-http-client.js:2864
execute resource://services-common/kinto-http-client.js:809
_batchRequests resource://services-common/kinto-http-client.js:705
_batchRequests resource://services-common/kinto-http-client.js:700
pMap resource://services-common/kinto-http-client.js:3192
pMap resource://services-common/kinto-http-client.js:3190
_batchRequests resource://services-common/kinto-http-client.js:700
batch resource://services-common/kinto-http-client.js:759
wrappedMethod resource://services-common/kinto-http-client.js:3380
batch resource://services-common/kinto-http-client.js:2478
pushChanges resource://services-common/kinto-offline-client.js:2463
sync resource://services-common/kinto-offline-client.js:2654
maybeSync resource://services-settings/RemoteSettingsClient.jsm:348
devtools mozilla.com.xpi!/experiments/remotesettings/api.js:188
result resource://gre/modules/ExtensionParent.jsm:998
withPendingBrowser resource://gre/modules/ExtensionParent.jsm:656
result resource://gre/modules/ExtensionParent.jsm:997
withTiming resource://gre/modules/ExtensionParent.jsm:964
call resource://gre/modules/ExtensionParent.jsm:996
receiveMessage resource://gre/modules/ExtensionParent.jsm:887
init resource://gre/modules/ExtensionParent.jsm:835
<anonymous> resource://gre/modules/ExtensionParent.jsm:1095
<anonymous> resource://gre/modules/Extension.jsm:77
loadBootstrapScope resource://gre/modules/addons/XPIProvider.jsm:1788
callBootstrapMethod resource://gre/modules/addons/XPIProvider.jsm:1687
startup resource://gre/modules/addons/XPIProvider.jsm:1841
startup resource://gre/modules/addons/XPIProvider.jsm:2325
callProvider resource://gre/modules/AddonManager.jsm:193
_startProvider resource://gre/modules/AddonManager.jsm:568
startup resource://gre/modules/AddonManager.jsm:723
startup resource://gre/modules/AddonManager.jsm:2787
observe resource://gre/modules/addonManager.js:71

In other words, the kinto sync is failing because the client thinks it has local changes and is trying to push those changes to the server, which Cloudfront doesn't allow. Indeed, when I look at the security state records using the browser toolbox storage inspector, I see that some are marked "updated". I eventually tracked this down to https://github.com/Kinto/kinto.js/issues/989.

If this is the cause of the recent spate of errors, why did it suddenly happen starting the 10th? If not, what is the real cause?

Depends on: 1558751
Depends on: 1558812
Depends on: 1404934

Thanks Ethan for having pointed out the root cause.

If this is the cause of the recent spate of errors, why did it suddenly happen starting the 10th? If not, what is the real cause?

My understanding is that the packaged JSON dump was up to date. The clients would load it on first sync, download the attachments, update the local records, toggling their sync status. But since the collection timestamp remained the same as the one on the server, no sync was issued. When some changes were published then the clients started to initiate bidirectional sync.

According to history, the records weren't modified from "2019-02-08T18:16:42.660429" to "2019-06-05T23:28:25.410216".

$ http https://settings-writer.prod.mozaws.net/v1/buckets/security-state-staging/history\?collection_id\=intermediates\&resource_name=collection\&action=update\&target.data.status\=to-review\&_limit\=20 Authorization:"Bearer JDFG.....uxr7mh0q" | jq .data[].date

"2019-06-11T19:31:51.539647"
"2019-06-11T19:31:51.253833"
"2019-06-11T10:35:09.948983"
"2019-06-11T00:00:07.728075"
"2019-06-11T00:00:07.719695"
"2019-06-10T23:58:58.181559"
"2019-06-10T23:58:57.883300"
"2019-06-07T00:38:18.380138"
"2019-06-07T00:38:18.006103"
"2019-06-05T23:30:37.460272"
"2019-06-05T23:28:25.798052"
"2019-06-05T23:28:25.410216"
"2019-02-08T18:16:42.660429"
"2019-02-08T18:16:42.654521"
"2019-02-08T18:15:32.758524"
"2019-02-08T18:15:32.372410"

And in mozilla-central the JSON dump was introduced in April (Bug 1541841).
So this makes sense.

We found a fix for the code, and we'll request uplift for 68. (any hope for the next dot release of 67?).

But that doesn't solve the problem of users local data, where records are marked as «to be pushed». We could ship a small routine along to fix that, or alternatively once the patch has shipped we could also bump the timestamp on the server for each record and they will be overwritten in local DBs (because we use a «server wins» strategy) and their status switched back to «synced».

Note: even if I don't explain the bump of sign errors, they should actually fix the local data since the local db is flushed and whole collection pulled.

The priority flag is not set for this bug.
:leplatrem, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mathieu)
Flags: needinfo?(mathieu)
Priority: -- → P1
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.