Open Bug 1729400 Opened 3 years ago Updated 2 years ago

Notify users when security-impactful remote-settings aren't updating correctly

Categories

(Firefox :: Remote Settings Client, enhancement)

People

(Reporter: dveditz, Unassigned, NeedInfo)

References

(Depends on 1 open bug)

Details

Many bits of Firefox important for user security are updated through remote-settings rather than through client updates. These include certificate settings, addon blocklists, system addons, and others. Currently there is no way for a user to know if these checks are in a healthy state or if they have been failing for a while.

In comparison, when background update checks fail enough times the user is eventually prompted to do a manual update. Although failing remote-setting updates may not be as easy to understand or fix (there could be remote network blockage, for example) we should still surface this to users in some way so they can ask for help.

The least objectionable and easiest place to add this information would be to the about:support page. Not the most helpful since it won't alert anyone unless they're already hunting around for a problem, but it would be a start.

A second step might be adding a yellow or red badge to the hamburger menu icon (like the green "update available" one) that when opened has a brief item about "security settings are out of date", which when clicked could take you to the appropriate section on the about:support page. At that point, though, I'm not sure what course of action the user could take. The problem might be local (malware blocking the connection) or it could be remote (an intermediary blocking the connection, our data center being dead, or our certificates having expired).

Remote-settings could also have a dedicated about:page if that's easier (but it probably isn't).

Depends on: 1732056

Hi Daniel,
I agree that this would be very helpful.
As a very first step, we would need to keep a trail of the last synchronization statuses (Bug 1732056).
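
To make "a trail of the last synchronization statuses" concrete, here is a minimal sketch of a bounded history. It is only an illustration under stated assumptions, not the design from Bug 1732056: the `SyncHistory` class, its method names, and the `"success"` status string are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SyncHistory:
    """Hypothetical bounded trail of (timestamp, status) sync outcomes."""

    def __init__(self, max_entries: int = 10) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries: Deque[Tuple[float, str]] = deque(maxlen=max_entries)

    def record(self, timestamp: float, status: str) -> None:
        """Append one sync outcome, e.g. "success" or "network_error"."""
        self._entries.append((timestamp, status))

    def last_success(self) -> Optional[float]:
        """Timestamp of the most recent successful sync still in the trail."""
        for timestamp, status in reversed(self._entries):
            if status == "success":
                return timestamp
        return None

    def consecutive_failures(self) -> int:
        """Number of failed syncs since the most recent success."""
        count = 0
        for _, status in reversed(self._entries):
            if status == "success":
                break
            count += 1
        return count
```

A page such as about:support could then report `last_success()` and `consecutive_failures()` without needing any other state.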

Adding information in about:support seems reasonable in terms of complexity.

As for the integration with the frontend, this would require some planning with other teams since we have zero experience in that area.

With some redesign, the Remote Settings Devtools webextension may be migrated to an official about page. That's also an idea!

Daniel, we now have Bug 1732056 in release, and we can observe some broken clients.

I created Bug 1768174 for the about:support part.

For the frontend part, does your team have resources/knowledge to implement such a thing?

Flags: needinfo?(dveditz)
Depends on: 1768174

No, sorry, we don't. We're definitely gecko-platform folks. Ethan's team does, but is primarily focused on Privacy features. The install team implemented a notification when a certain number of background update checks fail that might be a useful example. mconley or gijs might also have ideas for how to approach this.

Flags: needinfo?(dveditz)

(In reply to Daniel Veditz [:dveditz] from comment #3)

> No, sorry, we don't. We're definitely gecko-platform folks. Ethan's team does, but is primarily focused on Privacy features. The install team implemented a notification when a certain number of background update checks fail that might be a useful example. mconley or gijs might also have ideas for how to approach this.

I think the main question is what the remediation would be. As you said, for updates that is just "install a newer version manually". For remote settings, it's much less clear.

Mathieu, do you have ideas here? Do we already use updated dumps from a successfully-updated-Firefox if they are newer than the last-successfully-updated timestamp on the in-profile collection data? If so at least we could suggest updating their browser if they're 4 weeks or more out of date. Otherwise, I don't quite know what to suggest...

I expect product/UX will not be eager to add UI that basically shouts at the user that something is wrong, without a way forward for them to address it.

Flags: needinfo?(mathieu)

> I think the main question is what the remediation would be.

I agree, warning without providing a remediation solution would probably bring us more issues.

> Do we already use updated dumps from a successfully-updated-Firefox if they are newer than the last-successfully-updated timestamp on the in-profile collection data?

Yes, but it has only been enabled for all collections since Firefox 102 (Bug 1718083). We do overwrite the local data if the dump is newer than the local DB.
Do all security-critical collections ship dumps?
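
The rule described here (prefer the packaged dump when it is newer than the local database) can be sketched as follows. This is a minimal illustration, not the Remote Settings client code: `CollectionState`, `should_load_dump`, and the millisecond timestamps are hypothetical, and treating an empty local database as "older than any dump" is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectionState:
    """Hypothetical view of one collection; not a real Remote Settings type."""
    name: str
    local_timestamp: Optional[int]  # last-modified of in-profile data (ms), None if empty
    dump_timestamp: Optional[int]   # last-modified of the packaged dump (ms), None if no dump


def should_load_dump(state: CollectionState) -> bool:
    """Return True when the packaged dump should overwrite the local data."""
    if state.dump_timestamp is None:
        return False  # this collection ships no dump
    if state.local_timestamp is None:
        return True   # assumption: with nothing local, the dump is the best data available
    return state.dump_timestamp > state.local_timestamp
```

The first branch is where the open question matters: a collection that ships no dump gets no benefit from an application update.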

If the synchronization is behind because of IndexedDB issues, I'm exploring ideas, like deleting the database and restarting from scratch (Bug 1658597). In that case, this would happen in the background and wouldn't really require user intervention.

If the profile is completely broken, I would imagine there are already user warnings in place? I would be in favor of letting Remote Settings follow the strategy of the rest of the components (e.g. Sync?).

Flags: needinfo?(mathieu)

(In reply to Mathieu Leplatre [:leplatrem] from comment #5)

> Yes, but it has only been enabled for all collections since Firefox 102 (Bug 1718083). We do overwrite the local data if the dump is newer than the local DB.
> Do all security-critical collections ship dumps?

I don't know the answer for this but that seems like something we should file a separate bug for. :-)

> If the synchronization is behind because of IndexedDB issues, I'm exploring ideas, like deleting the database and restarting from scratch (Bug 1658597). In that case, this would happen in the background and wouldn't really require user intervention.

This makes sense to me, and I suspect is the main way forward here.

The other reason I can think of is network failures - but honestly I'm not sure what to do about that. We could try making RS more resilient to intermittent network by having network-related sync failures schedule a re-attempt as soon as there's a network change event. But that too wouldn't require user messaging or actions by the user. Telling the user "hey your network isn't letting us check for updates, maybe fix that" probably isn't super helpful... but if the re-scheduling feature seems valuable to you, perhaps let's file a follow-up for that?

Could we try to find out how frequent this kind of issue is by adding some telemetry? (Perhaps we already have telemetry?)

Besides network and IDB failures, would a broken system clock cause sync failures, perhaps because TLS will fail as certificates won't be valid? And could we inform the user of that? (Perhaps orthogonal to RS as it'll cause "normal" browsing to fail too...)

> If the profile is completely broken, I would imagine there are already user warnings in place? I would be in favor of letting Remote Settings follow the strategy of the rest of the components (e.g. Sync?).

I'm not sure what you mean here by "completely broken" :-)
Can you elaborate?

Dan: is there something we're missing here? It feels to me like exposing "how out of date are my Firefox remote settings info bits" on about:support could be useful, but I'm less sure what we'd tell the user if we were proactively messaging them...

Flags: needinfo?(mathieu)
Flags: needinfo?(dveditz)

I think you're right, unfortunately. If someone's having problems then having that info in about:support could be useful when they ask for help. But there are so many possible failures that there's not a lot we could tell them proactively other than "This stuff isn't updating... talk to SUMO folks", and that's not all that satisfactory.

If all the collections are failing it could be a hostile network blocking our server. That might be something worth warning about, except: if the hostile network isn't blocking app updates then those will refresh (most of?) the collections, and if the hostile network is blocking updates then we already have a warning about that. If some are failing and not others it might be a local database problem, but I'm not sure there's much good we can tell the user. The usual SUMO advice for that is a Firefox Refresh which can be disruptive to users with add-ons and other customization.
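
The heuristic in the paragraph above (everything failing suggests a network or server problem, a partial failure suggests a local database problem) can be written down as a small sketch. The function name and the returned labels are hypothetical, and this is a coarse guess for support purposes rather than a real diagnosis.

```python
from typing import Dict


def classify_failures(sync_ok: Dict[str, bool]) -> str:
    """Map per-collection sync results to a coarse, hypothetical diagnosis.

    `sync_ok` maps a collection name to whether its last sync succeeded.
    """
    if not sync_ok:
        return "unknown"  # no data to reason about
    failed = [name for name, ok in sync_ok.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(sync_ok):
        # Every collection fails: likely something between us and the server.
        return "possible-network-or-server-problem"
    # Only some collections fail: more likely a problem with local storage.
    return "possible-local-database-problem"
```

Even with such a label in hand, the thread's concern stands: none of these outcomes maps to an obvious action the user can take.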

Maybe bug 1768174 is the best we can do.

Flags: needinfo?(dveditz)

> if the re-scheduling feature seems valuable to you, perhaps let's file a follow-up for that?

We have the 24h timer, which will keep on retrying...

> Could we try to find out how frequent this kind of issue is by adding some telemetry? (Perhaps we already have telemetry?)

We already have some specific Telemetry status reported in a number of situations (network, server, indexeddb, etc.). (error distributions, sync statuses)

> Besides network and IDB failures, would a broken system clock cause sync failures, perhaps because TLS will fail as certificates won't be valid?

In theory, we get the clock skew from the server and use it when verifying the cert validity. We have opened bugs around that matter though (Bug 1549961 and Bug 1551266).
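
As an illustration of the idea (correct the local clock by the skew observed against the server before checking a certificate's validity window), here is a minimal sketch. It assumes the server's current time is already available as a `datetime`; how the client actually obtains it is not described in this thread, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def estimate_skew(server_now: datetime, local_now: datetime) -> timedelta:
    """Positive when the local clock is behind the server clock."""
    return server_now - local_now


def is_within_validity(
    not_before: datetime,
    not_after: datetime,
    local_now: datetime,
    skew: timedelta,
) -> bool:
    """Check a validity window against the skew-corrected local time."""
    corrected_now = local_now + skew
    return not_before <= corrected_now <= not_after


# Example: a machine whose clock was reset to the year 2000.
not_before = datetime(2021, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2023, 1, 1, tzinfo=timezone.utc)
local_now = datetime(2000, 1, 1, tzinfo=timezone.utc)
server_now = datetime(2022, 6, 1, tzinfo=timezone.utc)

skew = estimate_skew(server_now, local_now)
valid_with_skew = is_within_validity(not_before, not_after, local_now, skew)
valid_without_skew = is_within_validity(not_before, not_after, local_now, timedelta(0))
```

With the correction applied the certificate is accepted; with the raw local clock it would be rejected as not yet valid.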

> I'm not sure what you mean here by "completely broken" :-) Can you elaborate?

I was thinking something like profile folder not writable, quota exceeded, hard disk is full, etc.

Also, if synchronization is consistently failing, should we take the initiative to reset critical preferences (see Bug 1769669)?

Flags: needinfo?(mathieu)

(In reply to Mathieu Leplatre [:leplatrem] from comment #8)

> > if the re-scheduling feature seems valuable to you, perhaps let's file a follow-up for that?
>
> We have the 24h timer, which will keep on retrying...

Sure, but it seems like if it failed in the last 24h, we should retry when we get network-changed events so that we don't depend on the combination of:

  • firefox is running
  • firefox is connected to a network
  • that network allows us to reach the server

as much as we do now. (I realize that we'll retry on startup if Firefox wasn't running when the timer would have expired, but that still leaves the other 2 things, and I'd be curious if e.g. for captive portals, we are liable to end up in a situation where the user doesn't have network on startup, but has it shortly afterwards, which would break a lot! This might also affect app updates!)
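
The proposal above (keep the regular timer, but also retry on a network change when the last sync failed) could look roughly like this. It is a minimal sketch with hypothetical names: `SyncRetrier`, `on_timer`, and `on_network_changed` are not existing Firefox APIs, and `sync` stands in for whatever performs a synchronization and reports success.

```python
from typing import Callable


class SyncRetrier:
    """Hypothetical helper: retry a failed sync as soon as the network changes."""

    def __init__(self, sync: Callable[[], bool]) -> None:
        self._sync = sync          # returns True on success, False on failure
        self.last_sync_ok = True   # assume healthy until a sync fails

    def on_timer(self) -> None:
        """Called by the regular (e.g. 24h) timer; always attempts a sync."""
        self.last_sync_ok = self._sync()

    def on_network_changed(self) -> None:
        """Called on a network-changed event; only retries after a failure."""
        if not self.last_sync_ok:
            self.last_sync_ok = self._sync()
```

Because the extra attempt only happens after a failure, clients that sync successfully see no additional traffic; a real implementation would probably also want to rate-limit retries on networks that change frequently.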

> > Could we try to find out how frequent this kind of issue is by adding some telemetry? (Perhaps we already have telemetry?)
>
> We already have some specific Telemetry status reported in a number of situations (network, server, indexeddb, etc.). (error distributions, sync statuses)

The sync status seems to indicate about 2% of syncs fail. That seems worth trying to do something about - maybe. But if we just try more often then the error rate might actually go up! Do we have telemetry for how out-of-date collections are / "time since last successful sync"? Would be interesting to know where the long tail of that graph starts, as it were... Like, are users often 2-3 days out of date, or maybe even a week or more?
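
To show what a "time since last successful sync" measurement might look like, here is a small sketch that buckets ages into a histogram. The bucket boundaries and labels are arbitrary choices for the example, not an existing telemetry probe.

```python
from collections import Counter
from typing import Iterable

DAY_SECONDS = 24 * 60 * 60

# (exclusive upper bound in days, label); boundaries are illustrative only.
BUCKETS = [(1, "<1 day"), (3, "1-3 days"), (7, "3-7 days"), (28, "1-4 weeks")]


def bucket_for(age_seconds: float) -> str:
    """Return the label of the bucket that a sync age falls into."""
    age_days = age_seconds / DAY_SECONDS
    for upper_days, label in BUCKETS:
        if age_days < upper_days:
            return label
    return "4+ weeks"


def staleness_histogram(ages_seconds: Iterable[float]) -> Counter:
    """Count how many clients fall into each staleness bucket."""
    return Counter(bucket_for(age) for age in ages_seconds)
```

The counts in the "3-7 days" bucket and beyond would show where the long tail starts.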

> > I'm not sure what you mean here by "completely broken" :-) Can you elaborate?
>
> I was thinking something like profile folder not writable, quota exceeded, hard disk is full, etc.

Good question. I think we don't deal well with this generally, but we probably should. I don't know how best to do so though...

> Also, if synchronization is consistently failing, should we take the initiative to reset critical preferences (see Bug 1769669)?

I think the approach in that bug (not allowing prefs to mess with it that much) is probably better. Resetting it without breaking automated tests might be tricky, and I'm always apprehensive about automatically overwriting users' settings...

Flags: needinfo?(mathieu)