Closed
Bug 1080752
Opened 10 years ago
Closed 10 years ago
Push notifications do not work when the phone is idle for more than 30 minutes
Categories
(Core :: DOM: Notifications, defect)
Tracking
()
People
(Reporter: vkrishnamoorthy, Assigned: nsm)
References
Details
(Keywords: verifyme, Whiteboard: [dependency: marketplace])
Attachments
(1 file, 2 obsolete files)
Symptomatic issues of this problem has been reported by applications using push in various bugs such as Bug 1003639, Bug 1034300 and Bug 1040394.
The root cause seems to be that Push Notifications are not delivered to the phone when the phone is in idle state (left idle for more than 30 mins).
This has discouraged applications that have real time need from using SimplePush.
Comment 1•10 years ago
|
||
The server side is delivering content to connected devices (currently averaging about 0.004 seconds) The problem is that the device is dropping connection to the server and not reconnecting.
There may be several reasons for this, including the CPU being aggressive about app shutdown to save battery, long reconnection times, or other things I'm not qualified to talk about or even guess.
Another possible option might be to look at how Android does this, possibly at a layer deep inside of the OS. Android maintains a constant socket connection, reconnecting back between network switches. (Actually, they keep several, as do a few other apps like twitter and some mail apps.)
Yeah, we should definitely not be requiring the CPU to remain active in order to deliver push notifications. We should talk to our chipset vendor partners to see what they recommend. As well as look at what Android does of course.
Reporter | ||
Comment 3•10 years ago
|
||
I am setting 2.2? as without this issue fixed, I doubt RTC apps will use the simple push service.
blocking-b2g: --- → 2.2?
Comment 4•10 years ago
|
||
Not sure how related this might be, but Ian and I have been talking with Victor Chen. He pointed us toward
http://mxr.mozilla.org/gaia/source/apps/system/js/wifi.js#160
It's notable that on L#142, there's a 5m timeout that releases the CPU lock. I'm not sure, but I believe that would cause the device to go into a hard sleep after that time, which would drop the radio and wifi connections, correct?
From reading various docs like
https://developer.android.com/training/efficient-downloads/efficient-network-access.html
it appears that there really isn't a lot of magic involved in keeping things alive, and that often, the mechanism of choice is "not to", but instead use something like polling using calls that are as close to "the metal" (the kernel level drivers for radios and battery) as possible. This would mean that there would be an appreciable delay between submission for an event and the device actually acting on the requested event. (so, for instance, things like Find My Device may require 90seconds+ in order to act.
As I understand, however, the Push code should have a timer that does wake and attempts to reconnect. There do seem to be a lot of variables that need to be checked in order for the radio to come back up. It may be that one of the various conditions is either failing to be met or is preventing the radio from properly re-initializing.
This seems like it completely breaks push notifications. If people use polling instead of push that will severely impact battery usage.
Blocks: 1093617
Comment 6•10 years ago
|
||
(In reply to Jonas Sicking (:sicking) from comment #5)
> This seems like it completely breaks push notifications. If people use
> polling instead of push that will severely impact battery usage.
Agreed. See for example bug 1003639
Comment 7•10 years ago
|
||
I'm not so sure.
Holding an active socket open for a connection to a server does not allow the CPU to go idle. Even low power mode draws from the battery since it's a constant drain, and I believe that any network activity (for instance a connection ping) spins the CPU back into high draw, followed by low draw.
Having a device actively poll on a several minute cycle would reduce this impact since the device could actually go to full idle if possible. (A bit like how hybrids save gas by turning off the engine at stop lights.)
Granted, this does come at a penalty that mobile devices aren't "Always on and instantly accessible". It might be worth balancing that desired goal with what the device limitations are.
Comment 8•10 years ago
|
||
What devices?
nsm, are we setting an alarm to wakeup periodically?
Flags: needinfo?(nsm.nikhil)
Comment 9•10 years ago
|
||
:dougt I would rather have this solved from the server end. I've talked to Ben about what it would take to adaptively alter the keep alive interval across different networks. That's going to be a far better solution than anything client-driven.
Fixing this on the client is harder: there is less information about what a good interval is. The server sees a lot more networks and devices.
Assignee | ||
Comment 10•10 years ago
|
||
(In reply to Doug Turner (:dougt) from comment #8)
> What devices?
>
> nsm, are we setting an alarm to wakeup periodically?
Yes we have always done that. After :willyaranda landed adaptive ping, we do it adaptively too. But if the processor goes back to sleep once the alarm wakes the push code up, before we can set up the socket, then the connection isn't going to be reestablished. We could hold a wake-lock while waiting for a reply from the push server, with a short timeout. What is frustrating is that there is no sane way to test this with logging etc. without having the phone plugged in (which nullifies everything) AFAIK. If anyone knows better, I'd be happy to try it.
Flags: needinfo?(nsm.nikhil)
Reporter | ||
Comment 11•10 years ago
|
||
nsm, can we test it server side? if the server does not see a connection request from the device, can we infer that the AP goes to sleep before establishing connection?
Assignee | ||
Comment 12•10 years ago
|
||
I'm not sure how easy it is to observe a specific device on the server 'live'.
Even using an alarm to periodically wake up and then reconnect to the server sounds problematic. It means that you couldn't use Push API to implement a VoIP app like Loop for example. Since the push message could take multiple minutes to be delivered, which is not how long you expect to wait when placing a phone call.
Is there no way to wake the device up on an incoming network package?
Assignee | ||
Comment 14•10 years ago
|
||
(In reply to Jonas Sicking (:sicking) from comment #13)
> Even using an alarm to periodically wake up and then reconnect to the server
> sounds problematic. It means that you couldn't use Push API to implement a
> VoIP app like Loop for example. Since the push message could take multiple
> minutes to be delivered, which is not how long you expect to wait when
> placing a phone call.
Even android doesn't allow this kind of behaviour when the phone is in low power mode or similar. I'm not sure if there is any alternative.
>
> Is there no way to wake the device up on an incoming network package?
Using UDP/SMS, both of which require various agreements/deployments with carriers for widespread use. I don't think there is a way via TCP.
Reporter | ||
Comment 15•10 years ago
|
||
(In reply to Nikhil Marathe [:nsm] (needinfo? please) from comment #14)
> (In reply to Jonas Sicking (:sicking) from comment #13)
> > Even using an alarm to periodically wake up and then reconnect to the server
> > sounds problematic. It means that you couldn't use Push API to implement a
> > VoIP app like Loop for example. Since the push message could take multiple
> > minutes to be delivered, which is not how long you expect to wait when
> > placing a phone call.
>
> Even android doesn't allow this kind of behaviour when the phone is in low
> power mode or similar. I'm not sure if there is any alternative.
I thought GCM had a way to send ntf to idle devices using the delayWhileIdle flag? app servers can set this flag to false http://developer.android.com/google/gcm/adv.html
Comment 16•10 years ago
|
||
It's hard to tell what proprietary method Google uses inside of the Play app which handles notification, but I have not seen anything magic that they are doing.TCP handling is a reasonably new tech built on top of existing radio protocols. SMS uses extra, previously unused data, which is tightly controlled by the carriers. It may be possible to examine the various chipsets that are handling the radio portion, but I don't know if we have access to this, nor do I know if we can convince carriers to include whatever additional protocols or driving hardware.
That aside, we need to work with what we know the devices can do. These devices consume battery when the cpu is active. The devices and core OS try their respective damndest to optimize battery.
Talking with some folks who work on GCM recently, they stated that polling was preferred for devices. There are things we can do and learn to make that efficient, like code as close to the actual hardware as possible, minimize the out of idle instruction load, and revert back to sleep as soon as possible.
It also means that less active devices may not be instantly accessible. I'd be willing to believe that there's similar delay for android devices that have been idle (although they've obviously solved the poll problem at the cost of some battery.
It's also notable that opening up lots of connections guarantees high CPU and minimal battery, so creating more channels is a very bad idea.
First off, I agree that we should definitely rule out any solution which forces the CPU to stay "more awake" than it otherwise would. So I agree we should not spend time on discussing such solutions.
But Google has to live with the same hardware and radio constraints as we do. So using proprietary protocols doesn't really gain them any significant capabilities.
The only advantage that I could see that Google might have is that they might have information about different strategies that can be used with different carriers. But there's no reason we could start building such infrastructure too. I.e. we could check with the SIM card to see what carrier the user is using, and then use different signaling mechanisms for different carriers.
It would be very sad if push didn't let us implement things like VoIP clients. Has anyone talked to Telefonica about this given that they are a carrier and that they are quite interested in Loop?
Comment 18•10 years ago
|
||
Telefonica proposed allowing us to use their proprietary UDP system to wake devices on their network, however they have not yet provided an endpoint we can use to do so. It us also not known if other carriers offer a similar system.
Sounds great.
Vishy, could you help get Nikhil and JR hooked up with the right people at Telefonica?
Comment 20•10 years ago
|
||
My understanding is that an open TCP connection can send packets toward the client at any time and they will be delivered if the middleboxes haven't forgotten. The fact that middleboxes can forget sessions is the real problem we are seeing. The exact timeout is anywhere from 30s to 30m, but once that time passes without TCP activity, the connection turns into a zombie.
Activity from either peer keeps the connection alive. That is, you don't have to maintain a timer on the client, you can maintain the timer on the server. As I note in comment 9, running timers on the server offers a lot of advantages for tuning of the interval. The catch is that the client will have to be paged to deliver the TCP packet.
You might imagine a world where sending a TCP keepalive (a packet containing sequence numbers that have already been acknowledged, which might keep bindings alive, but not trigger delivery) would work, but that would be pure fantasy.
Comment 21•10 years ago
|
||
If we send TCP keepalive's this frequently, will it cause the phone to do additional work that could severely reduce its battery life as the other popular FxOS chat app has been responsible for?
Comment 22•10 years ago
|
||
You want to tune the interval so that it just shorter than the lowest timeout across the set of middleboxes that are on-path. Fail to send keepalives and you can't send pushes.
Comment 23•10 years ago
|
||
It's worth trying out to see what the effects are. I've added a ticket for pushgo (https://github.com/mozilla-services/pushgo/issues/112) to implement TCP server-side keep-alives as a configurable option so we can at least do some testing to see how frequent they might be needed and what effects they might have on battery life.
Updated•10 years ago
|
Whiteboard: [dependency: marketplace-partners]
Updated•10 years ago
|
feature-b2g: --- → 2.2?
Updated•10 years ago
|
Whiteboard: [dependency: marketplace-partners] → [dependency: marketplace]
Assignee | ||
Comment 24•10 years ago
|
||
I am going to try holding a wakelock while intiating the connection. Assigning to myself for now.
Assignee: nobody → nsm.nikhil
Assignee | ||
Comment 25•10 years ago
|
||
This patch holds a wakelock while the websocket is being setup. Unfortunately I am in a location right now where my phone has neither data nor wifi, so I haven't been able to test it. There should be adb messages about acquiring and releasing the wakelock once you enable the pref "services.push.debug". vishy, could you assign someone on this?
Flags: needinfo?(vkrishnamoorthy)
Reporter | ||
Comment 26•10 years ago
|
||
ni: Ian bicking,
Ian, would someone in your team have cycles for this?
Flags: needinfo?(vkrishnamoorthy) → needinfo?(ianb)
Comment 27•10 years ago
|
||
I'm not sure what the next task here is. QA? At this specific moment I don't have anyone that can work on this (but with two days to Thanksgiving there's not many potential cycles).
Flags: needinfo?(ianb)
Comment 28•10 years ago
|
||
Are you still working on this? Do we plan to finish it in Gecko 37? Just want to know if we will have this one in Firefox OS 2.2.
Thanks.
Flags: needinfo?(nsm.nikhil)
Comment 29•10 years ago
|
||
It's unknown what the next task here is. There's been several changes since the last QA verification on this. We might need to have the server initiate websocket ping frames (the latest version sends TCP keep-alive's, but that might not be enough?). Once we know whether the latest change was sufficient, we could prop up a test server that sends full websocket ping frame's regularly to see if that is enough to keep the connection alive.
Assignee | ||
Comment 30•10 years ago
|
||
To clarify, comment 25, I'd like someone from QA to verify that the patch helps, after which if it is insufficient, services can be looped in to try server side pings.
Flags: needinfo?(nsm.nikhil)
Comment 31•10 years ago
|
||
:benbangert, I think that we need to see full ping frames. The cost is almost the same as TCP keepalives, but they don't get swallowed by intermediaries in the same way.
Comment 32•10 years ago
|
||
Ok, I've added an issue to include server-sent websocket ping frames for our 1.5 release which is scheduled for mid-Jan.
https://github.com/mozilla-services/pushgo/issues/174
I applied the patch; wifi has to be setup in the FTU. There's something busted with the settings, I think? Will verify tomorrow.
Assignee | ||
Comment 35•10 years ago
|
||
Bug 1118272 seems to confirm that a wakelock may be required.
Comment 36•10 years ago
|
||
Just a note, if someone wants to test we do have a pushgo that sends pings every 5 min in dev. The websocket url for our dev env is: wss://push.dev.mozaws.net/
Comment 37•10 years ago
|
||
this was posted to dev-gaia today and seems like it may be related
Subject: Simple Push enpoints die after some time
https://groups.google.com/forum/#!searchin/mozilla.dev.gaia/enpoints/mozilla.dev.gaia/YyZ4_cvKltY/Y4lxw9QQdY0J
I have been testing and having email conversations back and forth with the Devs and Richard. [Note: all have been kind enough to help me ramp up in the technology and testing]
We found that when the network connection gets cut (ie via Sim or wifi) then an error gets produced via the patch. I was suggested to make some other modifications to stop the error message from appearing.
Another thing I saw was that when placing the USB cable in, I got a lot of lmk, including the Simple Push app on the phone. This is possibly to bug 1122119. My concern was that the notification may have gotten lost if the push portion got killed. I was informed that the PushService runs in the parent process so that should not be an issue.
One thing I'm not sure of is which ID the server is pinging back for.
If the client registers more than once, it shows in the log register/logcat.
I don't want to leave this part to assuming the last registration; though it may be the case as it only gives one pong per ping.
Comment 39•10 years ago
|
||
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #38)
>
> One thing I'm not sure of is which ID the server is pinging back for.
> If the client registers more than once, it shows in the log register/logcat.
>
> I don't want to leave this part to assuming the last registration; though it
> may be the case as it only gives one pong per ping.
The pong is per device, not per individual registration endpoint. On the server, it is sent out after n minutes of idle time (no communication from client or server). The server currently sends it unrequested to the client in an effort to keep the connection from collapsing. If this is not the case, the server would benefit from not sending it.
Ah. Thanks for the information.
As far as I understand I've tested the patch and I can't seem to reproduce the issue at this moment in time.
If you are asking me to test how to break connection in other ways than the suggested methods, this might be a different story and I would have to do further investigations.
Comment 42•10 years ago
|
||
As JR noted in comment #39, we've added server-sent pongs to the next server release. It's currently deployed on `dev`, and will be rolled to `stage` this week. The pongs are sent every 5 minutes by default; we'll want to keep an eye out for battery usage issues, and adjust the interval accordingly.
Comment 44•10 years ago
|
||
Did this get onto stage? If I understand correctly once server-sent pongs hit stage, we need to test this again?
Flags: needinfo?(kcambridge)
Comment 45•10 years ago
|
||
This is on stage, but disabled until we can verify it doesn't negatively impact battery life. Do you know when QA will have cycles to test this, Larissa?
The server-sent pong interval is a config setting, so it's easy to adjust on our end...we just didn't want to roll it through to production if our 1.5 release was ready before this could be verified.
Flags: needinfo?(kcambridge)
Assignee | ||
Updated•10 years ago
|
Attachment #8528524 -
Flags: review?(dougt)
Assignee | ||
Comment 46•10 years ago
|
||
/r/5047 - Bug 1080752 - Hold wakelock when attempting to connect to push server. r=dougt
Pull down this commit:
hg pull review -r 34a2b42ec837f55d4b09933e4bc842bd11ecfa45
Attachment #8574902 -
Flags: review?(dougt)
Updated•10 years ago
|
Attachment #8574902 -
Flags: review?(dougt)
Comment 47•10 years ago
|
||
Comment on attachment 8574902 [details]
MozReview Request: bz://1080752/nsm
https://reviewboard.mozilla.org/r/5045/#review4075
::: dom/push/PushService.jsm
(Diff revision 1)
> - return;
> + .createInstance(Ci.nsIWebSocketChannel);
remove this, right?
::: dom/push/PushService.jsm
(Diff revision 1)
> + this._acquireWakeLock();
If asyncOpen() returns a failure code, there is no guarantee that onstop (or onstart) will be called.
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/websocket/nsIWebSocketChannel.idl#133
I think you'll want to test the result of asyncOpen before aquiring the lock.
::: dom/push/PushService.jsm
(Diff revision 1)
> + // Allow the same time for socket setup as we do for requests after the setup.
not sure if this is the optimal value. it seems that you might need a value a bit longer than the requestTimeout. Recall that the timer running is slack and might fire prematurely. Is this good enough as is?
Assignee | ||
Comment 48•10 years ago
|
||
Comment on attachment 8574902 [details]
MozReview Request: bz://1080752/nsm
/r/5047 - Bug 1080752 - Hold wakelock when attempting to connect to push server. r=dougt
Pull down this commit:
hg pull review -r e3e2de27477266e097350a823c26c3c5f8368ec0
Attachment #8574902 -
Flags: review?(dougt)
Updated•10 years ago
|
Attachment #8574902 -
Flags: review?(dougt) → review+
Comment 49•10 years ago
|
||
Comment on attachment 8574902 [details]
MozReview Request: bz://1080752/nsm
https://reviewboard.mozilla.org/r/5045/#review4227
Ship It!
Assignee | ||
Comment 50•10 years ago
|
||
Comment on attachment 8528524 [details] [diff] [review]
wakelock
the one on reviewboard is current
Attachment #8528524 -
Attachment is obsolete: true
Attachment #8528524 -
Flags: review?(dougt)
Assignee | ||
Comment 51•10 years ago
|
||
Comment 52•10 years ago
|
||
Status: NEW → RESOLVED
Closed: 10 years ago
status-firefox39:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Comment 53•10 years ago
|
||
Please request b2g37 approval on this when you get a chance.
status-b2g-v2.2:
--- → affected
status-b2g-master:
--- → fixed
status-firefox37:
--- → wontfix
status-firefox38:
--- → wontfix
Flags: needinfo?(nsm.nikhil)
Assignee | ||
Comment 54•10 years ago
|
||
Comment on attachment 8574902 [details]
MozReview Request: bz://1080752/nsm
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): https://bugzilla.mozilla.org/show_bug.cgi?id=822712
User impact if declined: Users stop receiving push notifications after a while.
Testing completed: yes
Risk to taking this patch (and alternatives if risky): low. we hold a wakelock open but have a timer that forcibly releases it after a few seconds.
String or UUID changes made by this patch: none
Flags: needinfo?(nsm.nikhil)
Attachment #8574902 -
Flags: approval-mozilla-b2g37?
Updated•10 years ago
|
Attachment #8574902 -
Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Comment 55•10 years ago
|
||
[Blocking Requested - why for this release]:
Blocks: 1122624
blocking-b2g: 2.2+ → 2.1?
We need this patch to land in 2.1 as it's causing issues with push notification. See bug 1122624 which is a 2.1 blocker. ie this blocks a blocker.
Nikhil, could you ask for uplift to 2.1 please? This blocks a blocker. I found that the patch applies cleanly to 2.1
Josh Cheng will look into approving for 2.1/2.1S
Flags: needinfo?(nsm.nikhil)
Flags: needinfo?(jocheng)
Assignee | ||
Comment 59•10 years ago
|
||
I'm not sure what flag to use for 2.1. Isn't approval for 37 equivalent?
Flags: needinfo?(nsm.nikhil) → needinfo?(nhirata.bugzilla)
37 is for 2.2. mozilla-b2g34 would be equivalent for 2.1
Thanks!
status-b2g-v2.1:
--- → affected
Flags: needinfo?(nhirata.bugzilla) → needinfo?(nsm.nikhil)
Assignee | ||
Comment 61•10 years ago
|
||
Comment on attachment 8574902 [details]
MozReview Request: bz://1080752/nsm
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Please see comment 54 for all fields.
User impact if declined:
Testing completed:
Risk to taking this patch (and alternatives if risky):
String or UUID changes made by this patch:
Flags: needinfo?(nsm.nikhil)
Attachment #8574902 -
Flags: approval-mozilla-b2g34?
Updated•10 years ago
|
Flags: needinfo?(jocheng)
Attachment #8574902 -
Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Comment 62•10 years ago
|
||
Comment 63•10 years ago
|
||
status-b2g-v2.1S:
--- → fixed
Assignee | ||
Comment 64•9 years ago
|
||
Attachment #8574902 -
Attachment is obsolete: true
Attachment #8618399 -
Flags: review+
Assignee | ||
Comment 65•9 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•