Closed Bug 1070230 Opened 10 years ago Closed 9 years ago

Loop URL generation after server outage and recovery can take several minutes to recover -- should we inform the user more than we do?

Categories

(Hello (Loop) :: Client, defect, P4)

x86_64
Linux
defect
Points:
2

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1203138
backlog backlog+

People

(Reporter: whd, Unassigned)

References

Details

(Whiteboard: [error][investigation])

From 12:45PM to 2:22PM PDT today there was a loop server outage due to caching tier issues. During this outage, "Sorry, we were unable to retrieve a call url." was displayed in the loop desktop client when clicking the "Invite someone to talk" button (expected). After the caching issues were resolved, however, I was still unable to generate new call URLs from the browser that saw the error during the outage. I experienced this on two different browsers running Nightly. I restarted one of them and was able to once again generate call URLs from that one, so it looks like a browser restart is currently needed to get Loop working again after an outage.
Added a couple of people who opened the original server-side bug. I would hate to think this is an artifact of that issue...
Whiteboard: [qa+]
+ Loop-Server devs just in case...
Is there some way we can simulate the caching issue to test/debug this?
We need to know exactly what was done on the server side. I've seen a reference to the production database being "flushed", but what was that exactly? Old URLs, UUIDs, everything? It would also help to know what the effects were: was the server just not returning anything, or was it returning particular error messages?
bobm did the cleaning. To the best of my knowledge, only some call-urls were dropped. We are planning to move to a bigger database cluster today.
It's not normal that we need a browser restart to get things working. This could be due to one of two things: either the Hawk sessions were dropped, or the simple push URLs were dropped. In general, we're spending today finding out what happened and what went wrong, and defining failure modes and escalation paths for the service. These are long overdue and we're missing them now.
James, can you tell what you mean by "data being flushed"?
We want to verify that the client recovers reasonably well after a server outage.
backlog: --- → Fx36?
:alexis, sorry for the long delay. That comment had to do with our Redis DB issues in Stage and Prod. This bug is (apparently) an artifact of that issue. :deanw or :bobm, what can we do from the Ops side to address comment 8?
Flags: qe-verify+
Flags: needinfo?(dwilson)
Flags: needinfo?(bobm)
We can terminate the Redis instance in stage, see if the client still works without one running, and then turn it back on to confirm that it recovers.
Flags: needinfo?(dwilson)
OK. So we just need to schedule this work.
Flags: needinfo?(bobm)
We need to make sure we're ok for going out with a hard launch in Fx35.
backlog: Fx36? → Fx35+
Priority: -- → P1
Whiteboard: [investigation]
The current operation will attempt to re-register with the push server after the websocket times out on the client side. The retry cycle runs in a back-off loop which starts at one minute and ends up at a maximum retry interval of five minutes. So, it can take up to five minutes for the client to attempt a retry after the push server is available again. Bug 1028869 will add a 30-minute ping to the push server monitoring logic for the case where the websocket does not time out but the push server is not responsive. (There is a set of xpcshell tests for these functions.) What is missing is any visibility for the user that the service is not available. Also, there is no manual means of triggering a connection retry short of restarting the browser. What would be the best mechanism for adding connection visibility to the Loop view? We currently have the red icon when the URL cannot be generated, but nothing for when the loop server connection is lost after that point.
Flags: needinfo?(sfranks)
Flags: needinfo?(dhenein)
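A minimal sketch of the back-off behaviour described in comment 13, assuming a hypothetical registerWithPushServer() callback that returns a Promise; the real MozLoopPushHandler uses different names and internals.

// Sketch only: retry registration with the push server, starting one minute
// after a failure and backing off to a maximum of five minutes between tries.
// registerWithPushServer is an assumed stand-in, not the actual client API.
const RETRY_START_MS = 60 * 1000;     // first retry after one minute
const RETRY_MAX_MS = 5 * 60 * 1000;   // back-off caps at five minutes

function scheduleReconnect(registerWithPushServer, delayMs = RETRY_START_MS) {
  setTimeout(() => {
    registerWithPushServer().then(
      () => console.log("re-registered with the push server"),
      () => {
        // Still unreachable: double the delay, but never exceed the cap, so a
        // recovered server may go unnoticed for up to five minutes.
        scheduleReconnect(registerWithPushServer,
                          Math.min(delayMs * 2, RETRY_MAX_MS));
      }
    );
  }, delayMs);
}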
So the UX of "Sorry, we were unable to retrieve a call url." is fine when things are down, and there's a bug that will make the client get back into a good state every 30 minutes, at least. But if a user tried before everything is in a happy state, we'd like the user to be able to manually trigger an action from the client front-end to try to reconnect. We'd need UX to connect that manual trigger to:
1. For the specific error, tell the user to try an action (click here or something) that makes the client front-end trigger a manual reconnect to the server.
2. It will either fix it and they get a URL (not sure if we need words or if just working is enough), or we need an error telling the user that the service is still down and to please try again later.
Blocks: 1082944, 1014931
(In reply to sescalante from comment #14)
> So the UX of "Sorry, we were unable to retrieve a call url." is fine when
> things are down, and there's a bug that will make the client get back into
> a good state every 30 minutes, at least.

You will only see this error message if the loop server is not reachable when the Hello icon is triggered. The use case being discussed here also deals with not being reachable for incoming calls because the connection with the push server does not exist. That is, you have given out a call URL but you are actually unreachable, and the Hello client is trying to re-acquire a connection with the push server.
Maybe I'm missing something here, but the design for the error state (red !) was meant to encapsulate any and all errors that make the service unusable. Clicking on the icon (opening the panel) should trigger an attempt to reconnect if the current state is disconnected. The less we expose to the user, and the fewer manual steps they have to take, the better. Does this answer the question you had?
Flags: needinfo?(dhenein)
Flags: needinfo?(sfranks)
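A rough illustration of the behaviour Darrin describes, with reconnection attempted whenever the panel opens rather than via an extra user action; pushHandler, isConnected() and reconnect() are assumed names, not the actual Loop client interface.

// Illustrative only: on panel open, retry the connection if we are currently
// disconnected, keeping the red "!" error state until the retry succeeds.
function onPanelOpen(pushHandler, view) {
  if (pushHandler.isConnected()) {
    return;
  }
  view.showErrorState();
  pushHandler.reconnect().then(
    () => view.clearErrorState(),
    () => view.showErrorState()   // still down; leave the error visible
  );
}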
Hi Darrin -- The issue is actually that if the server has been offline for a while, the client backs off from trying to contact the server. So even after the server comes back, it could take several minutes to reconnect unless the user intervenes. During these several minutes the user would be unavailable for incoming calls. Do we want to alert or inform the user more than we already do? The answer may be "no, we're good for now and we'll wait to see if we get feedback on this." (Paul -- let us know if I missed any part of your argument in my summary.) Moving this to Fx37? in case there is something we want to do in the near term. There is additional work to test that we're OK recovering from typical server errors, but that's a separate investigation with QA.
backlog: Fx35+ → Fx37?
Flags: needinfo?(dhenein)
Summary: Loop URL generation fails after server outage and recovery → Loop URL generation after server outage and recovery can take several minutes to recover -- should we inform the user more than we do?
My sense is that the error state icon should effectively communicate to the user that something is wrong/not working, and make sure we retry a connection each time the panel opens. Maybe a new string for the error bar which communicates that this is a server issue and not their fault, and that it may be a longer than usual outage?
Flags: needinfo?(dhenein)
Flags: firefox-backlog+
Points: --- → 2
Flags: qe-verify+ → qe-verify-
Was this scenario handled by one of the bugs you have up for review? Was there any user notification / action to reset the server connection as part of the bugs you were working on?
backlog: Fx37? → Fx38+
Flags: needinfo?(pkerr)
The bugs that I have up for review focus on the client attempting to re-connect to the PushServer if a connection is lost. This occurs when the websocket closes or the PushServer does not respond to a periodic ping message. Along with the re-connection, the client can now deal with the PushServer assigning new PushURLs to the client by registering these new URLs with the LoopServer. All this happens in the background. As of now, the client PushHandler does not signal to the client code that the PushServer is unreachable. Any problem with the connection with the LoopServer will not be detected unless the user opens the panel or tries an action. That is, there is no periodic ping with the LoopServer. But, this PushHandler connection state is more important because it is via this path that the client will be notified of incoming calls, when rooms are created, room membership changes, etc. If the user takes an action they will immediately discover there is a problem (like getting no dial tone on a phone).
Flags: needinfo?(pkerr)
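A simplified sketch of the background flow Paul describes: after a reconnect, any push URLs newly assigned by the PushServer are re-registered with the LoopServer. The object shape and the registerPushURL() call are assumptions for illustration, not the actual client code.

// Assumed sketch: re-register newly assigned push URLs with the LoopServer so
// incoming-call and room notifications can reach this client again. Note that
// nothing here surfaces the earlier outage to the UI, which is the gap this
// bug is about.
async function onPushServerReconnected(newPushURLs, loopServerClient) {
  for (const [channelId, pushURL] of Object.entries(newPushURLs)) {
    await loopServerClient.registerPushURL(channelId, pushURL);
  }
}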
Interested in the information you guys have provided. I had a disconnection from the server plus additional issues, and I've been trying to figure it out. Email me if anyone wants more details; maybe it's a missing link, but that's just a hunch.
Organize bug 1100969 and bug 1070230 into a meta bug and prioritize all.
Flags: needinfo?(sescalante)
Priority: P1 → P2
Blocks: 1100969
Flags: needinfo?(sescalante)
This bug may be fixed, but we want to go through our code paths and verify that we surface errors to the user. The next issue is whether the UX is clear enough for what we share.
Flags: needinfo?(sescalante)
Rank: 29
Flags: needinfo?(sescalante)
backlog: Fx38+ → backlog+
Whiteboard: [investigation] → [error][investigation]
We believe this is resolved; we're definitely adding retries wherever possible. It has a lower potential of happening than the other issues we're addressing. We'll re-open if we determine there's a bigger issue.
Status: NEW → RESOLVED
Closed: 9 years ago
Rank: 29 → 49
Priority: P2 → P4
Resolution: --- → WONTFIX
(In reply to :shell escalante from comment #24)
> We believe this is resolved; we're definitely adding retries wherever
> possible. It has a lower potential of happening than the other issues we're
> addressing. We'll re-open if we determine there's a bigger issue.

Nope, we haven't resolved this. We still don't notify the user at all if the service (e.g. the link to the push servers) is down.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Blocks: 1203138
Given the extended discussion, I've created bug 1203138 to give us a fresh place that re-summarises the issues and hopefully we can drive it through to completion.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → DUPLICATE