Closed Bug 1070230 Opened 10 years ago Closed 9 years ago

Loop URL generation after server outage and recovery can take several minutes to recover -- should we inform the user more than we do?

Categories

(Hello (Loop) :: Client, defect, P4)

x86_64
Linux
defect
Points:
2

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1203138
backlog backlog+

People

(Reporter: whd, Unassigned)

References

Details

(Whiteboard: [error][investigation])

From 12:45PM to 2:22PM PDT today there was a loop server outage due to caching tier issues. During this outage, "Sorry, we were unable to retrieve a call url." was displayed in the loop desktop client when clicking the "Invite someone to talk" button (expected). After the caching issues were resolved, however, I was still unable to generate new call URLs from the browser that saw the error during the outage. I experienced this on two different browsers running Nightly. I restarted one of them and was able to once again generate call URLs from that one, so it looks like a browser restart is currently needed to get Loop working again after an outage.
Added a couple of people who opened the original server-side bug. I would hate to think this is an artifact of that issue...
Whiteboard: [qa+]
+ Loop-Server devs just in case...
Is there some way we can simulate the caching issue to test/debug this?
We need to know exactly what was done on the server side. I've seen a reference to the production database being "flushed", but what was that exactly? Old URLs, UUIDs, everything? It would also help to know what the effects were: was the server just not returning anything, or was it returning particular error messages?
bobm did the cleaning. To the best of my knowledge, only some call-urls were dropped. We are planning to move to a bigger database cluster today.
It's not normal that we need a browser restart to get things working. This could be due to one of two things: either the Hawk sessions were dropped, or the simple push URLs were dropped. In general, we're spending today finding out what happened and what went wrong, and defining failure modes and escalation paths for the service. These are long overdue and we're missing them now.
James, can you tell what you mean by "data being flushed"?
We want to verify that the client recovers reasonably well after a server outage.
backlog: --- → Fx36?
:alexis, sorry for the long delay. That comment had to do with our Redis DB issues in Stage and Prod. This bug is (apparently) an artifact of that issue. :deanw or :bobm, what can we do from the Ops side to address comment 8?
Flags: qe-verify+
Flags: needinfo?(dwilson)
Flags: needinfo?(bobm)
We can terminate the Redis instance in stage, see if the client still works without one running, and then turn it back on to confirm that it recovers.
Flags: needinfo?(dwilson)
OK. So we just need to schedule this work.
Flags: needinfo?(bobm)
We need to make sure we're ok for going out with a hard launch in Fx35.
backlog: Fx36? → Fx35+
Priority: -- → P1
Whiteboard: [investigation]
The current operation will attempt to re-register with the push server after the websocket times out on the client side. The retry cycle runs in a back-off loop which starts at one minute and ends up at a maximum retry interval of five minutes. So, it can take up to five minutes for the client to attempt a retry after the push server is available again. Bug 1028869 will add a 30-minute ping to the push server monitoring logic for the case where the websocket does not time out but the push server is not responsive. (There is a set of xpcshell tests for these functions.) What is missing is any visibility for the user that the service is not available. Also, there is no manual means of triggering a connection retry short of restarting the browser. What would be the best mechanism for adding connection visibility to the Loop view? We currently have the red icon when the URL cannot be generated, but nothing for when the loop server connection is lost after that point.
Flags: needinfo?(sfranks)
Flags: needinfo?(dhenein)
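A minimal sketch of the back-off behaviour described in comment 13, assuming a hypothetical registerWithPushServer() callback that returns a Promise; the real MozLoopPushHandler uses different names and internals.

// Sketch only: retry registration with the push server, starting one minute
// after a failure and backing off to a maximum of five minutes between tries.
// registerWithPushServer is an assumed stand-in, not the actual client API.
const RETRY_START_MS = 60 * 1000;     // first retry after one minute
const RETRY_MAX_MS = 5 * 60 * 1000;   // back-off caps at five minutes

function scheduleReconnect(registerWithPushServer, delayMs = RETRY_START_MS) {
  setTimeout(() => {
    registerWithPushServer().then(
      () => console.log("re-registered with the push server"),
      () => {
        // Still unreachable: double the delay, but never exceed the cap, so a
        // recovered server may go unnoticed for up to five minutes.
        scheduleReconnect(registerWithPushServer,
                          Math.min(delayMs * 2, RETRY_MAX_MS));
      }
    );
  }, delayMs);
}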
So the UX of "Sorry, we were unable to retrieve a call url." is fine when things are down, and there's a bug that will make the client get back into a good state every 30 minutes, at least. But if a user tried before everything is in a happy state, we'd like the user to be able to manually trigger an action from the client front-end to try to reconnect. We'd need UX to connect that manual trigger to:
1. For the specific error, tell the user to try an action (click here or something) that makes the client front-end trigger a manual reconnect to the server.
2. It will either fix it and they get a URL (not sure if we need words or if just working is enough), or we need an error telling the user that the service is still down and to please try again later.
Blocks: 1082944, 1014931
(In reply to sescalante from comment #14)
> So the UX of "Sorry, we were unable to retrieve a call url." is fine when
> things are down, and there's a bug that will make the client get back into
> a good state every 30 minutes, at least.

You will only see this error message if the loop server is not reachable when the Hello icon is triggered. The use case being discussed here also deals with not being reachable for incoming calls because the connection with the push server does not exist. That is, you have given out a call URL but you are actually unreachable, and the Hello client is trying to re-acquire a connection with the push server.
Maybe I'm missing something here, but the design for the error state (red !) was meant to encapsulate any and all errors that make the service unusable. Clicking on the icon (opening the panel) should trigger an attempt to reconnect if the current state is disconnected. The less we expose to the user, and the fewer manual steps they have to take, the better. Does this answer the question you had?
Flags: needinfo?(dhenein)
Flags: needinfo?(sfranks)
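A rough illustration of the behaviour Darrin describes, with reconnection attempted whenever the panel opens rather than via an extra user action; pushHandler, isConnected() and reconnect() are assumed names, not the actual Loop client interface.

// Illustrative only: on panel open, retry the connection if we are currently
// disconnected, keeping the red "!" error state until the retry succeeds.
function onPanelOpen(pushHandler, view) {
  if (pushHandler.isConnected()) {
    return;
  }
  view.showErrorState();
  pushHandler.reconnect().then(
    () => view.clearErrorState(),
    () => view.showErrorState()   // still down; leave the error visible
  );
}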
Hi Darrin -- The issue is actually that if the server has been offline for a while, the client backs off from trying to contact the server. So even after the server comes back, it could take several minutes to reconnect unless the user intervenes. During these several minutes the user would be unavailable for incoming calls. Do we want to alert or inform the user more than we already do? The answer may be "no, we're good for now and we'll wait to see if we get feedback on this." (Paul -- let us know if I missed any part of your argument in my summary.) Moving this to Fx37? in case there is something we want to do in the near term. There is additional work to test that we're OK recovering from typical server errors, but that's a separate investigation with QA.
backlog: Fx35+ → Fx37?
Flags: needinfo?(dhenein)
Summary: Loop URL generation fails after server outage and recovery → Loop URL generation after server outage and recovery can take several minutes to recover -- should we inform the user more than we do?
My sense is that the error state icon should effectively communicate to the user that something is wrong/not working, and make sure we retry a connection each time the panel opens. Maybe a new string for the error bar which communicates that this is a server issue and not their fault, and that it may be a longer than usual outage?
Flags: needinfo?(dhenein)
Flags: firefox-backlog+
Points: --- → 2
Flags: qe-verify+ → qe-verify-
Was this scenario handled by one of the bugs you have up for review? Was there any user notification / action to reset the server connection as part of the bugs you were working on?
backlog: Fx37? → Fx38+
Flags: needinfo?(pkerr)
The bugs that I have up for review focus on the client attempting to re-connect to the PushServer if a connection is lost. This occurs when the websocket closes or the PushServer does not respond to a periodic ping message. Along with the re-connection, the client can now deal with the PushServer assigning new PushURLs to the client by registering these new URLs with the LoopServer. All this happens in the background. As of now, the client PushHandler does not signal to the client code that the PushServer is unreachable. Any problem with the connection with the LoopServer will not be detected unless the user opens the panel or tries an action. That is, there is no periodic ping with the LoopServer. But, this PushHandler connection state is more important because it is via this path that the client will be notified of incoming calls, when rooms are created, room membership changes, etc. If the user takes an action they will immediately discover there is a problem (like getting no dial tone on a phone).
Flags: needinfo?(pkerr)
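A simplified sketch of the background flow Paul describes: after a reconnect, any push URLs newly assigned by the PushServer are re-registered with the LoopServer. The object shape and the registerPushURL() call are assumptions for illustration, not the actual client code.

// Assumed sketch: re-register newly assigned push URLs with the LoopServer so
// incoming-call and room notifications can reach this client again. Note that
// nothing here surfaces the earlier outage to the UI, which is the gap this
// bug is about.
async function onPushServerReconnected(newPushURLs, loopServerClient) {
  for (const [channelId, pushURL] of Object.entries(newPushURLs)) {
    await loopServerClient.registerPushURL(channelId, pushURL);
  }
}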
Interested in the information you guys have provided. I had a disconnection from the server plus additional issues, and I've been trying to figure it out. Email me if anyone wants more details; maybe it's a missing link, but that's just a hunch.
Organize bug 1100969 and bug 1070230 into a meta bug and prioritize all.
Flags: needinfo?(sescalante)
Priority: P1 → P2
Blocks: 1100969
Flags: needinfo?(sescalante)
This bug may be fixed, but we want to go through our code paths and verify that we surface errors to the user. The next issue is whether the UX is clear enough for what we share.
Flags: needinfo?(sescalante)
Rank: 29
Flags: needinfo?(sescalante)
backlog: Fx38+ → backlog+
Whiteboard: [investigation] → [error][investigation]
We believe this is resolved; we're definitely adding retries wherever possible. It has a lower potential of happening than the other issues we're addressing. We'll re-open if we determine there's a bigger issue.
Status: NEW → RESOLVED
Closed: 9 years ago
Rank: 29 → 49
Priority: P2 → P4
Resolution: --- → WONTFIX
(In reply to :shell escalante from comment #24)
> We believe this is resolved; we're definitely adding retries wherever
> possible. It has a lower potential of happening than the other issues we're
> addressing. We'll re-open if we determine there's a bigger issue.

Nope, we haven't resolved this. We still don't notify the user at all if the service (e.g. the link to the push servers) is down.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Blocks: 1203138
Given the extended discussion, I've created bug 1203138 to give us a fresh place that re-summarises the issues and hopefully we can drive it through to completion.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → DUPLICATE