Closed Bug 1080752 Opened 10 years ago Closed 10 years ago

Push notifications do not work when the phone is idle for more than 30 minutes

Categories

(Core :: DOM: Notifications, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
blocking-b2g 2.1?
Tracking Status
firefox37 --- wontfix
firefox38 --- wontfix
firefox39 --- fixed
b2g-v2.1 --- fixed
b2g-v2.1S --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: vkrishnamoorthy, Assigned: nsm)

References

Details

(Keywords: verifyme, Whiteboard: [dependency: marketplace])

Attachments

(1 file, 2 obsolete files)

Symptomatic issues of this problem has been reported by applications using push in various bugs such as Bug 1003639, Bug 1034300 and Bug 1040394. The root cause seems to be that Push Notifications are not delivered to the phone when the phone is in idle state (left idle for more than 30 mins). This has discouraged applications that have real time need from using SimplePush.
The server side is delivering content to connected devices (currently averaging about 0.004 seconds) The problem is that the device is dropping connection to the server and not reconnecting. There may be several reasons for this, including the CPU being aggressive about app shutdown to save battery, long reconnection times, or other things I'm not qualified to talk about or even guess. Another possible option might be to look at how Android does this, possibly at a layer deep inside of the OS. Android maintains a constant socket connection, reconnecting back between network switches. (Actually, they keep several, as do a few other apps like twitter and some mail apps.)
Yeah, we should definitely not be requiring the CPU to remain active in order to deliver push notifications. We should talk to our chipset vendor partners to see what they recommend. As well as look at what Android does of course.
I am setting 2.2? as without this issue fixed, I doubt RTC apps will use the simple push service.
blocking-b2g: --- → 2.2?
Not sure how related this might be, but Ian and I have been talking with Victor Chen. He pointed us toward http://mxr.mozilla.org/gaia/source/apps/system/js/wifi.js#160 It's notable that on L#142, there's a 5m timeout that releases the CPU lock. I'm not sure, but I believe that would cause the device to go into a hard sleep after that time, which would drop the radio and wifi connections, correct? From reading various docs like https://developer.android.com/training/efficient-downloads/efficient-network-access.html it appears that there really isn't a lot of magic involved in keeping things alive, and that often, the mechanism of choice is "not to", but instead use something like polling using calls that are as close to "the metal" (the kernel level drivers for radios and battery) as possible. This would mean that there would be an appreciable delay between submission for an event and the device actually acting on the requested event. (so, for instance, things like Find My Device may require 90seconds+ in order to act. As I understand, however, the Push code should have a timer that does wake and attempts to reconnect. There do seem to be a lot of variables that need to be checked in order for the radio to come back up. It may be that one of the various conditions is either failing to be met or is preventing the radio from properly re-initializing.
This seems like it completely breaks push notifications. If people use polling instead of push that will severely impact battery usage.
Blocks: 1093617
(In reply to Jonas Sicking (:sicking) from comment #5) > This seems like it completely breaks push notifications. If people use > polling instead of push that will severely impact battery usage. Agreed. See for example bug 1003639
I'm not so sure. Holding an active socket open for a connection to a server does not allow the CPU to go idle. Even low power mode draws from the battery since it's a constant drain, and I believe that any network activity (for instance a connection ping) spins the CPU back into high draw, followed by low draw. Having a device actively poll on a several minute cycle would reduce this impact since the device could actually go to full idle if possible. (A bit like how hybrids save gas by turning off the engine at stop lights.) Granted, this does come at a penalty that mobile devices aren't "Always on and instantly accessible". It might be worth balancing that desired goal with what the device limitations are.
What devices? nsm, are we setting an alarm to wakeup periodically?
Flags: needinfo?(nsm.nikhil)
:dougt I would rather have this solved from the server end. I've talked to Ben about what it would take to adaptively alter the keep alive interval across different networks. That's going to be a far better solution than anything client-driven. Fixing this on the client is harder: there is less information about what a good interval is. The server sees a lot more networks and devices.
(In reply to Doug Turner (:dougt) from comment #8) > What devices? > > nsm, are we setting an alarm to wakeup periodically? Yes we have always done that. After :willyaranda landed adaptive ping, we do it adaptively too. But if the processor goes back to sleep once the alarm wakes the push code up, before we can set up the socket, then the connection isn't going to be reestablished. We could hold a wake-lock while waiting for a reply from the push server, with a short timeout. What is frustrating is that there is no sane way to test this with logging etc. without having the phone plugged in (which nullifies everything) AFAIK. If anyone knows better, I'd be happy to try it.
Flags: needinfo?(nsm.nikhil)
nsm, can we test it server side? if the server does not see a connection request from the device, can we infer that the AP goes to sleep before establishing connection?
I'm not sure how easy it is to observe a specific device on the server 'live'.
Even using an alarm to periodically wake up and then reconnect to the server sounds problematic. It means that you couldn't use Push API to implement a VoIP app like Loop for example. Since the push message could take multiple minutes to be delivered, which is not how long you expect to wait when placing a phone call. Is there no way to wake the device up on an incoming network package?
(In reply to Jonas Sicking (:sicking) from comment #13) > Even using an alarm to periodically wake up and then reconnect to the server > sounds problematic. It means that you couldn't use Push API to implement a > VoIP app like Loop for example. Since the push message could take multiple > minutes to be delivered, which is not how long you expect to wait when > placing a phone call. Even android doesn't allow this kind of behaviour when the phone is in low power mode or similar. I'm not sure if there is any alternative. > > Is there no way to wake the device up on an incoming network package? Using UDP/SMS, both of which require various agreements/deployments with carriers for widespread use. I don't think there is a way via TCP.
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #14) > (In reply to Jonas Sicking (:sicking) from comment #13) > > Even using an alarm to periodically wake up and then reconnect to the server > > sounds problematic. It means that you couldn't use Push API to implement a > > VoIP app like Loop for example. Since the push message could take multiple > > minutes to be delivered, which is not how long you expect to wait when > > placing a phone call. > > Even android doesn't allow this kind of behaviour when the phone is in low > power mode or similar. I'm not sure if there is any alternative. I thought GCM had a way to send ntf to idle devices using the delayWhileIdle flag? app servers can set this flag to false http://developer.android.com/google/gcm/adv.html
It's hard to tell what proprietary method Google uses inside of the Play app which handles notification, but I have not seen anything magic that they are doing.TCP handling is a reasonably new tech built on top of existing radio protocols. SMS uses extra, previously unused data, which is tightly controlled by the carriers. It may be possible to examine the various chipsets that are handling the radio portion, but I don't know if we have access to this, nor do I know if we can convince carriers to include whatever additional protocols or driving hardware. That aside, we need to work with what we know the devices can do. These devices consume battery when the cpu is active. The devices and core OS try their respective damndest to optimize battery. Talking with some folks who work on GCM recently, they stated that polling was preferred for devices. There are things we can do and learn to make that efficient, like code as close to the actual hardware as possible, minimize the out of idle instruction load, and revert back to sleep as soon as possible. It also means that less active devices may not be instantly accessible. I'd be willing to believe that there's similar delay for android devices that have been idle (although they've obviously solved the poll problem at the cost of some battery. It's also notable that opening up lots of connections guarantees high CPU and minimal battery, so creating more channels is a very bad idea.
First off, I agree that we should definitely rule out any solution which forces the CPU to stay "more awake" than it otherwise would. So I agree we should not spend time on discussing such solutions. But Google has to live with the same hardware and radio constraints as we do. So using proprietary protocols doesn't really gain them any significant capabilities. The only advantage that I could see that Google might have is that they might have information about different strategies that can be used with different carriers. But there's no reason we could start building such infrastructure too. I.e. we could check with the SIM card to see what carrier the user is using, and then use different signaling mechanisms for different carriers. It would be very sad if push didn't let us implement things like VoIP clients. Has anyone talked to Telefonica about this given that they are a carrier and that they are quite interested in Loop?
Telefonica proposed allowing us to use their proprietary UDP system to wake devices on their network, however they have not yet provided an endpoint we can use to do so. It us also not known if other carriers offer a similar system.
Sounds great. Vishy, could you help get Nikhil and JR hooked up with the right people at Telefonica?
My understanding is that an open TCP connection can send packets toward the client at any time and they will be delivered if the middleboxes haven't forgotten. The fact that middleboxes can forget sessions is the real problem we are seeing. The exact timeout is anywhere from 30s to 30m, but once that time passes without TCP activity, the connection turns into a zombie. Activity from either peer keeps the connection alive. That is, you don't have to maintain a timer on the client, you can maintain the timer on the server. As I note in comment 9, running timers on the server offers a lot of advantages for tuning of the interval. The catch is that the client will have to be paged to deliver the TCP packet. You might imagine a world where sending a TCP keepalive (a packet containing sequence numbers that have already been acknowledged, which might keep bindings alive, but not trigger delivery) would work, but that would be pure fantasy.
If we send TCP keepalive's this frequently, will it cause the phone to do additional work that could severely reduce its battery life as the other popular FxOS chat app has been responsible for?
You want to tune the interval so that it just shorter than the lowest timeout across the set of middleboxes that are on-path. Fail to send keepalives and you can't send pushes.
It's worth trying out to see what the effects are. I've added a ticket for pushgo (https://github.com/mozilla-services/pushgo/issues/112) to implement TCP server-side keep-alives as a configurable option so we can at least do some testing to see how frequent they might be needed and what effects they might have on battery life.
Whiteboard: [dependency: marketplace-partners]
feature-b2g: --- → 2.2?
Whiteboard: [dependency: marketplace-partners] → [dependency: marketplace]
I am going to try holding a wakelock while intiating the connection. Assigning to myself for now.
Assignee: nobody → nsm.nikhil
Attached patch wakelock (obsolete) (deleted) — Splinter Review
This patch holds a wakelock while the websocket is being setup. Unfortunately I am in a location right now where my phone has neither data nor wifi, so I haven't been able to test it. There should be adb messages about acquiring and releasing the wakelock once you enable the pref "services.push.debug". vishy, could you assign someone on this?
Flags: needinfo?(vkrishnamoorthy)
ni: Ian bicking, Ian, would someone in your team have cycles for this?
Flags: needinfo?(vkrishnamoorthy) → needinfo?(ianb)
I'm not sure what the next task here is. QA? At this specific moment I don't have anyone that can work on this (but with two days to Thanksgiving there's not many potential cycles).
Flags: needinfo?(ianb)
Are you still working on this? Do we plan to finish it in Gecko 37? Just want to know if we will have this one in Firefox OS 2.2. Thanks.
Flags: needinfo?(nsm.nikhil)
It's unknown what the next task here is. There's been several changes since the last QA verification on this. We might need to have the server initiate websocket ping frames (the latest version sends TCP keep-alive's, but that might not be enough?). Once we know whether the latest change was sufficient, we could prop up a test server that sends full websocket ping frame's regularly to see if that is enough to keep the connection alive.
To clarify, comment 25, I'd like someone from QA to verify that the patch helps, after which if it is insufficient, services can be looped in to try server side pings.
Flags: needinfo?(nsm.nikhil)
:benbangert, I think that we need to see full ping frames. The cost is almost the same as TCP keepalives, but they don't get swallowed by intermediaries in the same way.
Ok, I've added an issue to include server-sent websocket ping frames for our 1.5 release which is scheduled for mid-Jan. https://github.com/mozilla-services/pushgo/issues/174
Thanks. Let me remove it from 2.2 scope here.
feature-b2g: 2.2? → ---
I applied the patch; wifi has to be setup in the FTU. There's something busted with the settings, I think? Will verify tomorrow.
Bug 1118272 seems to confirm that a wakelock may be required.
Just a note, if someone wants to test we do have a pushgo that sends pings every 5 min in dev. The websocket url for our dev env is: wss://push.dev.mozaws.net/
this was posted to dev-gaia today and seems like it may be related Subject: Simple Push enpoints die after some time https://groups.google.com/forum/#!searchin/mozilla.dev.gaia/enpoints/mozilla.dev.gaia/YyZ4_cvKltY/Y4lxw9QQdY0J
I have been testing and having email conversations back and forth with the Devs and Richard. [Note: all have been kind enough to help me ramp up in the technology and testing] We found that when the network connection gets cut (ie via Sim or wifi) then an error gets produced via the patch. I was suggested to make some other modifications to stop the error message from appearing. Another thing I saw was that when placing the USB cable in, I got a lot of lmk, including the Simple Push app on the phone. This is possibly to bug 1122119. My concern was that the notification may have gotten lost if the push portion got killed. I was informed that the PushService runs in the parent process so that should not be an issue. One thing I'm not sure of is which ID the server is pinging back for. If the client registers more than once, it shows in the log register/logcat. I don't want to leave this part to assuming the last registration; though it may be the case as it only gives one pong per ping.
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #38) > > One thing I'm not sure of is which ID the server is pinging back for. > If the client registers more than once, it shows in the log register/logcat. > > I don't want to leave this part to assuming the last registration; though it > may be the case as it only gives one pong per ping. The pong is per device, not per individual registration endpoint. On the server, it is sent out after n minutes of idle time (no communication from client or server). The server currently sends it unrequested to the client in an effort to keep the connection from collapsing. If this is not the case, the server would benefit from not sending it.
Ah. Thanks for the information.
As far as I understand I've tested the patch and I can't seem to reproduce the issue at this moment in time. If you are asking me to test how to break connection in other ways than the suggested methods, this might be a different story and I would have to do further investigations.
As JR noted in comment #39, we've added server-sent pongs to the next server release. It's currently deployed on `dev`, and will be rolled to `stage` this week. The pongs are sent every 5 minutes by default; we'll want to keep an eye out for battery usage issues, and adjust the interval accordingly.
triage: blocker per its user impact.
blocking-b2g: 2.2? → 2.2+
Did this get onto stage? If I understand correctly once server-sent pongs hit stage, we need to test this again?
Flags: needinfo?(kcambridge)
This is on stage, but disabled until we can verify it doesn't negatively impact battery life. Do you know when QA will have cycles to test this, Larissa? The server-sent pong interval is a config setting, so it's easy to adjust on our end...we just didn't want to roll it through to production if our 1.5 release was ready before this could be verified.
Flags: needinfo?(kcambridge)
Attached file MozReview Request: bz://1080752/nsm (obsolete) (deleted) —
/r/5047 - Bug 1080752 - Hold wakelock when attempting to connect to push server. r=dougt Pull down this commit: hg pull review -r 34a2b42ec837f55d4b09933e4bc842bd11ecfa45
Attachment #8574902 - Flags: review?(dougt)
Attachment #8574902 - Flags: review?(dougt)
Comment on attachment 8574902 [details] MozReview Request: bz://1080752/nsm https://reviewboard.mozilla.org/r/5045/#review4075 ::: dom/push/PushService.jsm (Diff revision 1) > - return; > + .createInstance(Ci.nsIWebSocketChannel); remove this, right? ::: dom/push/PushService.jsm (Diff revision 1) > + this._acquireWakeLock(); If asyncOpen() returns a failure code, there is no guarantee that onstop (or onstart) will be called. https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/websocket/nsIWebSocketChannel.idl#133 I think you'll want to test the result of asyncOpen before aquiring the lock. ::: dom/push/PushService.jsm (Diff revision 1) > + // Allow the same time for socket setup as we do for requests after the setup. not sure if this is the optimal value. it seems that you might need a value a bit longer than the requestTimeout. Recall that the timer running is slack and might fire prematurely. Is this good enough as is?
Comment on attachment 8574902 [details] MozReview Request: bz://1080752/nsm /r/5047 - Bug 1080752 - Hold wakelock when attempting to connect to push server. r=dougt Pull down this commit: hg pull review -r e3e2de27477266e097350a823c26c3c5f8368ec0
Attachment #8574902 - Flags: review?(dougt)
Attachment #8574902 - Flags: review?(dougt) → review+
Comment on attachment 8574902 [details] MozReview Request: bz://1080752/nsm https://reviewboard.mozilla.org/r/5045/#review4227 Ship It!
Comment on attachment 8528524 [details] [diff] [review] wakelock the one on reviewboard is current
Attachment #8528524 - Attachment is obsolete: true
Attachment #8528524 - Flags: review?(dougt)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Please request b2g37 approval on this when you get a chance.
Flags: needinfo?(nsm.nikhil)
Comment on attachment 8574902 [details] MozReview Request: bz://1080752/nsm NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. [Approval Request Comment] Bug caused by (feature/regressing bug #): https://bugzilla.mozilla.org/show_bug.cgi?id=822712 User impact if declined: Users stop receiving push notifications after a while. Testing completed: yes Risk to taking this patch (and alternatives if risky): low. we hold a wakelock open but have a timer that forcibly releases it after a few seconds. String or UUID changes made by this patch: none
Flags: needinfo?(nsm.nikhil)
Attachment #8574902 - Flags: approval-mozilla-b2g37?
Keywords: verifyme
Attachment #8574902 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
[Blocking Requested - why for this release]:
Blocks: 1122624
blocking-b2g: 2.2+ → 2.1?
We need this patch to land in 2.1 as it's causing issues with push notification. See bug 1122624 which is a 2.1 blocker. ie this blocks a blocker.
Nikhil, could you ask for uplift to 2.1 please? This blocks a blocker. I found that the patch applies cleanly to 2.1 Josh Cheng will look into approving for 2.1/2.1S
Flags: needinfo?(nsm.nikhil)
Flags: needinfo?(jocheng)
I'm not sure what flag to use for 2.1. Isn't approval for 37 equivalent?
Flags: needinfo?(nsm.nikhil) → needinfo?(nhirata.bugzilla)
37 is for 2.2. mozilla-b2g34 would be equivalent for 2.1 Thanks!
Flags: needinfo?(nhirata.bugzilla) → needinfo?(nsm.nikhil)
Comment on attachment 8574902 [details] MozReview Request: bz://1080752/nsm NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. [Approval Request Comment] Bug caused by (feature/regressing bug #): Please see comment 54 for all fields. User impact if declined: Testing completed: Risk to taking this patch (and alternatives if risky): String or UUID changes made by this patch:
Flags: needinfo?(nsm.nikhil)
Attachment #8574902 - Flags: approval-mozilla-b2g34?
Flags: needinfo?(jocheng)
Attachment #8574902 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: