Closed Bug 803846 Opened 12 years ago Closed 12 years ago

[B2G] Fix PR_IntervalNow so it's monotonic (or otherwise audit all callers so as to ensure that non-monotonicity doesn't mess us up)

Categories

(Core :: General, defect)

All
Linux
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 757593
blocking-basecamp +

People

(Reporter: justin.lebar+bug, Unassigned)

References

Details

In bug 803039 comment 7, Shelly pointed out that PR_IntervalNow uses a non-monotonic clock on Linux. If this is correct (looks like it to me [1], although I don't claim to understand this code, so please correct us if we're wrong), then we probably should audit all callers of PR_IntervalNow to ensure that they function more-or-less correctly when time skips forward or backward. If the caller will not function correctly, we can change it to use TimeStamp. Alternatively, we could fix NSPR, but I'm not holding my breath for B2G (the fix would need to make it into Aurora to make B2G v1). In theory, this is a problem on Linux desktop as well as on B2G, but in practice, B2G receives time updates from cell towers, so large, abrupt time changes are much more likely to occur on B2G than on desktop. [1] http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/md/_unixos.h#484
blocking-basecamp: --- → ?
Hmmm. It looks like NSS relies on the monotonicity of PR_IntervalNow. :-/ Ted, Brian, what do you think we should do here?
First, I admit I am not quite sure what the non-monotonicity of PRIntervalTime really means. My (far-from-perfect) understanding from discussion with Honza and mcmanus is that it is more nuanced than it appears. Also, we use PR_IntervalTime instead of mozilla::TimeStamp in security/manager/ssl to avoid performance problems caused by mozilla::TimeStamp that mcmanus warned us about. There are a lot of uses in nsprpub/ and netwerk/, but before I dive into them, I'd like to hear from wtc about what he thinks about this bug. Regarding NSS: The thing I would be most concerned about is the code in sdb_updateCache, sdb_openDBLocal, and sdb_init in securityt/nss/lib/softoken. It would be good to hear from rrelyea and wtc about this. A lot of the uses of PR_IntervalTime seem to relate to performance measurement and/or debugging. Perhaps it would be good to: typedef PRIntervalTime NSSDebugTimestamp; typedef PRIntervalTime NSSPerfTimestamp; #define NSS_DebugNow() ((NSSDebugTimestamp) PR_IntervalNow()) #define NSS_PerfNow() ((NSSPerfTimestamp) PR_IntervalNow()) and use these symbols to distinguish these cases in the code. There are other uses of PR_IntervalNow for smart card insertion/removal. I don't understand this code enough to know what to do here. Regardless, it's not critical for B2G since there's no smartcard support in B2G. For B2G v1, we don't need to be concerned about security/lib/dtls* because B2G doesn't enable WebRTC and WebRTC (AFAIK) is the only user of DTLS. Still, it would be good to hear from ekr about what, if anything, should change in the DTLS code. I am less concerned about the stuff in security/nss/cmd because those are the command-line programs. Regarding security/manager (PSM): PRIntevalTime is used for the timeout logic for OCSP requests; this can easily be changed to mozilla::TimeStamp, if mcmanus and Honza agree. It is also used for the timeout logic for insecure TLS -> SSL 3.0 fallback, which probably should be removed (see bug 754356); if it can't be moved, it can also easily be updated to use mozilla::TimeStamp, but note that the review comments for that code explicitly asked for mozilla::TimeStamp to be avoided.
The correct solution is to fix NSPR. PRIntervalTime is specified to be monotonically increasing. If it is not, it is a bug of the implementation. The current implementation is based on gettimeofday(), which was probably the only thing available before. It should be replaced by a better implementation that meets the monotonic requirement. I remember there is already an NSPR bug report for this.
Awesome. The only thing that PR_IntervalNow() is used for in DTLS is for timeout and retransmission. If I understand the scenario, what can happen is that if I do: x = PR_IntervalNow() y = PR_IntervalNow() Then I can get x > y? If so, I think that this is likely to just cause premature timer expiry, but I haven't really researched it.
> A lot of the uses of PR_IntervalTime seem to relate to performance measurement and/or debugging. > Perhaps it would be good to: Perhaps, but depending on the frequency of updates to the B2G system clock, if we used a fast-but-broken implementation for the perf-measurement case, that could introduce an unacceptable amount of noise into our perf measurements. > but note that the review comments for that code explicitly asked for mozilla::TimeStamp to be > avoided. I'd like to understand why. In particular, if we modify PR_IntervalNow to be more like TimeStamp, we'll need to avoid re-creating whatever performance problems there are with TimeStamp. > The correct solution is to fix NSPR. Is this something we could get into Firefox Aurora before the branch uplift on 11/19? I think this is just a matter of passing CLOCK_MONOTONIC to the existing gettimeofday call (guarded by operating-system checks and probably a runtime check as well; see xpcom/ds/TimeStamp_posix.cpp; particularly TimeStamp::Startup). But cjones wrote this code, so maybe he has some input here.
> If I understand the scenario, what can happen is that if I do: > > x = PR_IntervalNow() > y = PR_IntervalNow() > > Then I can get x > y? Yes, and also if those two commands were separated by T seconds, you could get y - x >> T (if time skipped forward). But that case seems less likely to cause problems than time skipping backwards.
Summary: [B2G] Audit all callers of PR_IntervalNow → [B2G] Fix PR_IntervalNow so it's monotonic (or otherwise audit all callers so as to ensure that non-monotonicity doesn't mess us up)
(In reply to Justin Lebar [:jlebar] from comment #6) > > If I understand the scenario, what can happen is that if I do: > > > > x = PR_IntervalNow() > > y = PR_IntervalNow() > > > > Then I can get x > y? > > Yes, and also if those two commands were separated by T seconds, you could > get y - x >> T (if time skipped forward). But that case seems less likely > to cause problems than time skipping backwards. I think both of these will behave the same way with DTLS: spuriously aggressive retransmit timers and connection failures due to premature timeout.
All Gecko code should be using TimeStamp.
bug on pr_interval and montonicity bug 757593 when I measured a year ago there was at least a 10x performance gap between pr_interval and timestamp on windows. but even that only matters if its called frequently - the raw numbers weren't that big. (I don't remember the specifics). I also know some timestamp code for windows landed recently that purported to improve things.
Yes, that's because TimeStamp was guaranteeing monotonicity ;).
(In reply to Chris Jones [:cjones] [:warhammer] from comment #8) > All Gecko code should be using TimeStamp. Okay, but I don't feel like that gives us much guidance here. I think the questions we need to answer are: 1) Does the non-monotonicity of PR_IntervalTime mess us up in B2G v1, given that we'll be resetting the system time relatively often? 2) If so, do we want to fix this by fixing NSPR or by replacing all instances of PR_IntervalNow with TimeStamp? 3) If we decide to s/PR_IntervalNow/TimeStamp/ in Gecko, can we declare that all the non-Gecko users (just NSS, I guess) won't behave too badly if PR_IntervalNow goes backwards?
The other option here might be just to locally patch for b2g pr_intervaltime with the trivial approach in 757593 .. other platforms don't have to worry as much about the clock being warped around all the time and can continue to wait for nspr to patch the issue and b2g shouldn't have to worry about being linked with another version. fwiw I also glanced (not particularly rigorously) at all the http uses of pr_interval and nothing jumped out at me as a serious problem. The biggest potential problem would be a false-positive on a timeout causing a working tcp session to be torn down, but it wouldn't impact new ones or anything like that. So it would be a little hiccup. It could potentially disable pipelines on a per-host basis for a little while as well. Not great, but not the end of the world either.
(In reply to Patrick McManus [:mcmanus] from comment #12) > The other option here might be just to locally patch for b2g pr_intervaltime > with the trivial approach in 757593 .. other platforms don't have to worry > as much about the clock being warped around all the time and can continue to > wait for nspr to patch the issue and b2g shouldn't have to worry about being > linked with another version. Android must have the same issue, right? If not, do they smooth over the time reported to applications in some way, similar to the way that Google's leap seconds workaround for NTP did? I suggest we dupe this to bug 757593. That bug has a link to mcmanus's patch and wtc also has a patch for NSPR that he hasn't uploaded yet, and he's already done measurements of the resolution and latency of his patch.
I'm fine duping to bug 757593 if we think it's reasonable to resolve that bug on Aurora. If it's not, then we need a different solution for B2G v1.
Is there reason to believe that this bug would affect B2G worse than it has affected Fennec?
I have no idea how either b2g or fennec adjusts its clock, but I'll just throw out there that if its through normal ntp operation that won't interfere with a monotonic gettimeofday as the kernel just [de/a]ccelerates the wall clock rate for a little while to make the adjustment. Using settimeofday() will screw with you :)
(In reply to Jonas Sicking (:sicking) from comment #15) > Is there reason to believe that this bug would affect B2G worse than it has > affected Fennec? * We run lots of time-sensitive services on B2G that we don't run on Fennec (e.g. alarms). * The B2G process lives for much longer than Fennec. * Android may not adjust its system time the same was a B2G does.
> but I'll just throw out there that if its through normal ntp operation that won't interfere with a > monotonic gettimeofday as the kernel just [de/a]ccelerates the wall clock rate for a little while to > make the adjustment. (I doubt NTP is going to do a smooth transition when you need to change the clock backward a few hours, as happens when you're traveling. But that's an edge case...) In B2G we set the time by poking /dev/alarm. I don't understand the semantics of this, but I doubt it's a smooth transition. In hal/gonk/GonkHal.cpp, we do: fd = open("/dev/alarm", O_RDWR); ioctl(fd, ANDROID_ALARM_SET_RTC, &now);
(In reply to Justin Lebar [:jlebar] from comment #18) > > but I'll just throw out there that if its through normal ntp operation that won't interfere with a > > monotonic gettimeofday as the kernel just [de/a]ccelerates the wall clock rate for a little while to > > make the adjustment. > > (I doubt NTP is going to do a smooth transition when you need to change the > clock backward a few hours, as happens when you're traveling. But that's an > edge case...) Isn't gettimeofday() in absolute time? I.e., it shouldn't need to move by big chunks when you travel.
> Isn't gettimeofday() in absolute time? I.e., it shouldn't need to move by big chunks > when you travel. Ah, indeed. Right, okay.
(In reply to Justin Lebar [:jlebar] from comment #11) > (In reply to Chris Jones [:cjones] [:warhammer] from comment #8) > > All Gecko code should be using TimeStamp. > > Okay, but I don't feel like that gives us much guidance here. I think the > questions we need to answer are: > > 1) Does the non-monotonicity of PR_IntervalTime mess us up in B2G v1, given > that we'll be resetting the system time relatively often? > Yes. > 2) If so, do we want to fix this by fixing NSPR or by replacing all > instances of PR_IntervalNow with TimeStamp? > I doubt that we can fix NSPR given that there are probably applications that rely on non-monoticity by this point. > 3) If we decide to s/PR_IntervalNow/TimeStamp/ in Gecko, can we declare that > all the non-Gecko users (just NSS, I guess) won't behave too badly if > PR_IntervalNow goes backwards? I can't answer that, but yeah we should find out.
Is an s/PR_IntervalNow/TimeStamp/-style fix something we can really take on aurora? If this bug is really stopping us from shipping, and a s/PR_IntervalNow/TimeStamp/ fix is the only realistic solution, should we be worried about this forcing us to branch from the normal release trains (or go back to m-c)
It depends on what code might be affected and how seriously it would be affected. We need to answer those questions first.
My name is Wan-Teh Chang. I am the NSPR maintainer. It is fine to change PR_IntervalNow() to be monotonically increasing because that is the specification. I will do this work in NSPR bug 757593. I can fix the NSPR bug (for iOS, Mac OS X, Linux, and Android) and push a new NSPR tag to mozilla-central this week or next week, if everything goes well (I need help with code review and Android testing). If you have any concern about this plan, please comment in NSPR bug 757593. Thanks.
(In reply to Justin Lebar [:jlebar] from comment #18) > In B2G we set the time by poking /dev/alarm. I don't understand the > semantics of this, but I doubt it's a smooth transition. In > hal/gonk/GonkHal.cpp, we do: > > fd = open("/dev/alarm", O_RDWR); > ioctl(fd, ANDROID_ALARM_SET_RTC, &now); Not saying the above is right or wrong. Just noting that I read that "With Linux, NTP normally uses settimeofday() for large corrections (over half a second). The adjtimex() inteface allows for small clock frequency changes (slewing). This can be done in a few different ways, see the man page for adjtimex." IMO, it seems best to try to minimize adverse effects of setting the time from the tower by doing something similar. But, I don't know all (any) of the factors that went into the above implementation choice.
Let's either fix this independent of bug 757593 or decide if we can take that change on Aurora.
blocking-basecamp: ? → +
Please note that changing timezone doesn't affect PR_IntervalNow. At least if I understand things correctly. So traveling around the world with your device and updating the time as you arrive in different places won't have any effect. The only thing that will affect us here is if the user adjusts the actual time on the device. This seems like a more rare occurrence and is usually done just a few minutes forward or back. Probably the most common scenario here is when switching between automatically set time, and manually set time. That said, it's definitely true that Gecko will be running a lot longer on B2G than on other platforms, which does make this bug more likely to affect B2G. I'm still not convinced that it'd affect B2G enough that I'd take high-risk patches (if that's the only solution we can find) though.
> I'm still not convinced that it'd affect B2G enough that I'd take high-risk patches FWIW I am also not convinced that there is a large effect. Comment 21 is merely an assertion, which makes it hard to judge how large the effect might be.
What are we going to do here? Do we need more information to make a decision?
Flags: needinfo?(justin.lebar+bug)
(In reply to Andrew Overholt [:overholt] from comment #29) > What are we going to do here? Do we need more information to make a > decision? I think we need an owner here.
Flags: needinfo?(justin.lebar+bug)
If this is an actual B2G blocker, we really need to get on fixing this, particularly since pulling an NSPR release is a scary thing to do for desktop builds late in the desktop cycle. Who can own this? ted? bsmith? An alternative would be for someone to show that this isn't actually a release blocker. Right now we only have an assertion that this is problematic, but no evidence afaict.
Flags: needinfo?(ted)
I don't have any expertise here, so I'm not sure why you'd want me to own this. If you decide that patching NSPR is the right solution I'm happy to review and land patches, as wtc said in comment 3 that fixing NSPR is okay.
Flags: needinfo?(ted)
jlebar: I can own this bug, but ideally someone else should push an NSPR update to Aurora. I have not done that before and don't want to miss the deadline because of my unfamiliarity with the process. I unsubscribed from this bug on 2012-10-22 (to reduce my Bugzilla mail) and have been doing the work in NSPR bug 757593. I didn't forget the 11/19 deadline you mentioned in comment 5. I just pushed a new NSPR update NSPR_4_9_4_BETA1 to mozilla-central today (see bug 757593 comment 33). I can create an NSPR_4_9_4_RTM tag when you are happy with the new NSPR code. Note: Compared with TimeStamp, the only inherent limitation of PRIntervalTime is that it will wrap around after some period due to its 32-bit size. After I realized this, I adjusted my patch for NSPR bug 757593 to maximize the period for PRIntervalTime to wrap around.
> I unsubscribed from this bug on 2012-10-22 (to reduce my Bugzilla mail) > and have been doing the work in NSPR bug 757593. And I was not watching bug 757593, so there we go. :) Looks like we can dupe this bug there and figure out the rest in that bug. Thanks a lot for fixing this!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.