Closed Bug 1271991 Opened 8 years ago Closed 8 years ago

Track recording state in Telemetry.cpp to avoid excessive locking in CanRecord{Base,Extended}

Categories

(Toolkit :: Telemetry, defect, P4)

defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: jseward, Assigned: gmoore, Mentored)

References

(Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

Attachments

(1 file, 5 obsolete files)

This is an indirect followup to bug 1261052. Per comment 14 there: When we add the API-level locking in TelemetryHistograms, we'd block all those CanRecordBase() & CanRecordExtended() checks on locking that is specific to histograms. I think we should probably track the recording state here as well so we can avoid that. Happy to take this to a follow-up bug.
Priority: -- → P3
Whiteboard: [measurement:client]
Priority: P3 → P4
The involved functions here are: CanRecordBase() CanRecordExtended() TelemetryImpl::GetCanRecordBase() TelemetryImpl::SetCanRecordBase() TelemetryImpl::GetCanRecordExtended() TelemetryImpl::SetCanRecordExtended() We want the getters to not call into (and wait for) TelemetryHistogram.cpp. To achieve this, the setters should store the recording state locally. Julian, could this just be stored in an atomic here?
Flags: needinfo?(jseward)
Gregory, would you be interested in taking this?
Flags: needinfo?(olucafont6)
Mentor: gfritzsche
Hi Georg, yeah, I am definitely interested in working on this, it looks interesting. Thanks for asking, I appreciate it. I will look into this and let you know if I have any questions. Also, if there is any more information that would be useful, please let me know. Thanks again.
Flags: needinfo?(olucafont6)
Assignee: nobody → olucafont6
Attached patch Created initial version of patch. (obsolete) (deleted) — Splinter Review
I created a patch with most of the changes that I think we would need to make. It's probably not finished yet though, and I have some questions about it. 1. We are supposed to track the recording state locally in Telemetry.cpp, but where exactly should we store it? Right now it's just a global variable in mozilla::Telemetry (in Telemetry.h), because I wasn't sure where else to put it. Should we make it a private static variable of TelemetryImpl? And then just call TelemetryImpl::GetCanRecord*() from CanRecord*(), instead of TelemetryHistogram::CanRecord*()? 2. How should we initialize the value of gCanRecord* (or whatever it will be called)? Right now I just initialize it to false, because that's how the other similar variables seem to be initialized. I'm not sure if there's a better, more correct way to do it though. Thanks.
Flags: needinfo?(gfritzsche)
Comment on attachment 8849237 [details] [diff] [review] Created initial version of patch. Review of attachment 8849237 [details] [diff] [review]: ----------------------------------------------------------------- This looks close, although we should keep the state private. Julian, do you think the atomic use here is ok? ::: toolkit/components/telemetry/Telemetry.h @@ +44,5 @@ > Microsecond > }; > > +Atomic<bool> gCanRecordBase(false); > +Atomic<bool> gCanRecordExtended(false); We shouldn't expose these in the header. How about we just make them members of the TelemetryImpl class?
Attachment #8849237 - Flags: feedback?(jseward)
(In reply to Georg Fritzsche [:gfritzsche] [away March 24 - April 2] from comment #5) > This looks close, although we should keep the state private. > Julian, do you think the atomic use here is ok? From irc with Georg, the following seems to be the case: (1) It is OK that gCanRecordBase and gCanRecordExtended can change state independantly of any other data. In particular there is no need to coordinate with changes to the mutex-protected data in the "protected" data collections (TelemetryHistogram, TelemetryScalar, ?) (2) Because these two booleans are used only for assignment and comparison, the only benefit from the Atomic<> use is memory fences, meaning: initially: gCanRecord = false T1: .. X .. ; gCanRecord = true T2: while (gCanRecord != true) { /* do nothing */ }; .. Y .. it guarantees that all the stores done by T1 inside X are visible to T2 when it starts Y. But they don't give you anything else. So, given (1), I think the atomic use here is OK.
Flags: needinfo?(jseward)
Attachment #8849237 - Flags: feedback?(jseward) → feedback+
Heads-up: i'll be away until Apr 3rd, :Dexter or :chutten could review instead.
Flags: needinfo?(gfritzsche)
(In reply to Gregory Moore [:gmoore] from comment #4) > 1. We are supposed to track the recording state locally in Telemetry.cpp, > but where exactly should we store it? Right now it's just a global variable > in mozilla::Telemetry (in Telemetry.h), because I wasn't sure where else to > put it. Should we make it a private static variable of TelemetryImpl? And > then just call TelemetryImpl::GetCanRecord*() from CanRecord*(), instead of > TelemetryHistogram::CanRecord*()? Exactly! > 2. How should we initialize the value of gCanRecord* (or whatever it will be > called)? Right now I just initialize it to false, because that's how the > other similar variables seem to be initialized. I'm not sure if there's a > better, more correct way to do it though. This is fine, just keep with the existing behavior in TelemetryHistogram etc.
Great, thanks for the information everyone. I have made the changes but I ran into an issue (or at least an area that could probably be improved on). I didn't realize this earlier but it seems that TelemetryImpl is actually a singleton class. I thought we could just make mCanRecord* private static variables, but GetCanRecord*() is not a static function, so we need the instance of TelemetryImpl to call it. For now in CanRecord*(), I used do_GetService(), but it's kind of clunky and I didn't see any other functions doing that. This seems like we could improve it (let me know if you have any suggestions). One thought would be to make CanRecord*() friend functions of TelemetryImpl, so then they could directly access sTelemetry and just call the functions from there. But that seems like overkill (and would probably ruin the point of the singleton), and it's not done anywhere else. Chris, would you mind reviewing on this bug as well? It's less reading than bug 1120409 :) Thanks.
Attachment #8849237 - Attachment is obsolete: true
Flags: needinfo?(chutten)
Comment on attachment 8850794 [details] [diff] [review] Made mCanRecord* private members of TelemetryImpl. Review of attachment 8850794 [details] [diff] [review]: ----------------------------------------------------------------- I think DoStackCapture() might be the commonly-accepted way of doing this sort of thing...
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #10) > I think DoStackCapture() might be the commonly-accepted way of doing this > sort of thing... Oh okay, I see. I did something similar and updated the patch. It seems kind of strange to have two different functions that do the same thing (but one is static), but I guess the non-static versions can be called from JavaScript. Thanks.
Attachment #8850794 - Attachment is obsolete: true
Attachment #8851264 - Flags: review?(chutten)
Comment on attachment 8851264 [details] [diff] [review] Added static functions to TelemetryImpl for retrieving recording state. Review of attachment 8851264 [details] [diff] [review]: ----------------------------------------------------------------- Yup, it's weird. Likely it's weird more because of how we've structured the code than limitations on what methods can be exposed to JS... but that doesn't make it any less odd. Thanks for ploughing through this!
Attachment #8851264 - Flags: review?(chutten) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/810aeb2601d6 Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}(). r=chutten
Keywords: checkin-needed
Backed out for bustage in Telemetry.cpp: https://hg.mozilla.org/integration/mozilla-inbound/rev/d2a17c05c419ff462cf32bcc9f541fccfa89ecf5 Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=43a3a3c4f4514f1db3caa41a45dda5b71b2e86f0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=86738331&repo=mozilla-inbound 13:20:18 INFO - /builds/slave/m-in-m64-000000000000000000000/build/src/toolkit/components/telemetry/Telemetry.cpp:1252:5: error: field 'mFailedLockCount' will be initialized after field 'mCanRecordBase' [-Werror,-Wreorder] 13:20:18 INFO - , mFailedLockCount(0) 13:20:18 INFO - ^ 13:20:18 INFO - 1 error generated.
Flags: needinfo?(olucafont6)
Oh whoops, didn't see that. I've changed the order of initialization (assuming that's the only thing that's causing the problem). Thanks.
Attachment #8851264 - Attachment is obsolete: true
Flags: needinfo?(olucafont6)
Attachment #8851773 - Flags: review?(chutten)
Comment on attachment 8851773 [details] [diff] [review] Fixed initialization order error that caused bug 1351058. Review of attachment 8851773 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +1245,5 @@ > > TelemetryImpl::TelemetryImpl() > : mHashMutex("Telemetry::mHashMutex") > , mHangReportsMutex("Telemetry::mHangReportsMutex") > , mThreadHangStatsMutex("Telemetry::mThreadHangStatsMutex") This is declared after mCanRecordExtended. Doesn't -Wreorder complain about it too?
Attachment #8851773 - Flags: review?(chutten)
Attached patch Fixed initialization order (again). (obsolete) (deleted) — Splinter Review
Oh, yeah I guess it probably would. I wasn't actually sure what that warning meant, so I thought only the order of mFailedLockCount mattered. On Windows it doesn't seem to report the -Wreorder warning either, at least with the default build settings. Hopefully this patch will work. Thanks for the help with this bug!
Attachment #8851773 - Attachment is obsolete: true
Attachment #8852106 - Flags: review?(chutten)
Comment on attachment 8852106 [details] [diff] [review] Fixed initialization order (again). Review of attachment 8852106 [details] [diff] [review]: ----------------------------------------------------------------- Seems legit.
Attachment #8852106 - Flags: review?(chutten) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1138b6fe86e1 Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}(). r=chutten
Keywords: checkin-needed
... that's a lot of orange :S On the plus side, these failures should be reproducible locally. :gmoore, do you need any assistance running tests? It ought to be as simple as running ./mach test toolkit/components/telemetry
(In reply to Chris H-C :chutten from comment #22) > On the plus side, these failures should be reproducible locally. :gmoore, do > you need any assistance running tests? It ought to be as simple as running > ./mach test toolkit/components/telemetry Alright, thanks. Yeah, I ran the tests and had the same errors as shown on the try run. I think I discovered the bug though. In the constructor for TelemetryImpl, we just set mCanRecordBase and mCanRecordExtended to false, but we actually have the correct values available in CreateTelemetryInstance(): https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#2334-2342 And we use those values to call the various InitializeGlobalState() functions, which change the underlying values that TelemetryImpl::mCanRecord{Base, Extended} are representing. So they could be getting set to true, but the TelemetryImpl booleans could possibly never know. It might be better to add two arguments to the TelemetryImpl constructor, but for now I just set the values in InitializeGlobalState() and that seemed to make the tests pass (on my local build at least). One test kept timing out (test_TelemetrySession.js), but I believe it would always do that on my local build. So I think this patch should fix the problem.
Attachment #8852106 - Attachment is obsolete: true
Flags: needinfo?(olucafont6)
Attachment #8852743 - Flags: review?(chutten)
Comment on attachment 8852743 [details] [diff] [review] Set mCanRecord{Base, Extended} in CreateTelemetryInstance(). Review of attachment 8852743 [details] [diff] [review]: ----------------------------------------------------------------- I've placed it up on try to see if this takes care of things in taskcluster: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7d47225aeaf7da56231f627ca674becd8171736
try's not the happiest about the patch. Are any of the failing tests ones that you see failing locally?
(In reply to Chris H-C :chutten from comment #25) > try's not the happiest about the patch. Are any of the failing tests ones > that you see failing locally? I tried running several of the failing tests locally, but I didn't get any of the failures that the try run did. I guess that makes sense though since it only seems to be failing on Linux. Also, looking at the try run, a lot of the failures say: > application crashed [@ mozilla::Atomic<bool, (mozilla::MemoryOrdering)2u, void>::operator bool] So it seems like the problem might be with our usage of the Atomic<bool> class. I'm not sure if there's something were doing incorrectly. Also, not sure if I am just reading the try run incorrectly, but for the file test_eme_initDataTypes.html, it has a bunch of lines like this: > TEST-UNEXPECTED-FAIL | dom/media/test/test_eme_initDataTypes.html | 'One keyId' expected to fail. However, in the actual file, and on my local test run (though on Windows), it says it expects the opposite: > 10 INFO TEST-PASS | dom/media/test/test_eme_initDataTypes.html | 'One keyId' expected to pass This is only one of the tests, but I just thought that was strange. Thanks.
I wonder if the lifetime of sTelemetry is to blame, since the crashing stack is just asking if it can record extended. I have a Linux box that I can try things on tomorrow. I'll leave the r? up to remind me. Likely the EME tests are expected to fail on Linux due to platform reasons, and expected to succeed on Windows.
I tried reproducing the crash locally under debugger, but though I can find plenty of JS and GFX crashes on browser/components/contextualidentity/test/browser/browser_eme.js, I cannot get the Atomic<bool> crash to happen under inspection. I'm probably just doing it wrong, so as a last-ditch effort I threw some if (!sTelemetry) { return false; } into CanRecordBase and CanRecordExtended and lo and behold, it appears to have worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3a62fe6af4649e23d834a3e0f72c8ae4960b7ed I'm going to throw it up for a broader try run (multiple platforms, additional tests). In the meantime, :gmoore, are you okay with those changes being added?
Flags: needinfo?(olucafont6)
(In reply to Chris H-C :chutten from comment #28) > I'm probably just doing it wrong, so as a last-ditch effort I threw some if > (!sTelemetry) { return false; } into CanRecordBase and CanRecordExtended and > lo and behold, it appears to have worked: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=a3a62fe6af4649e23d834a3e0f72c8ae4960b7ed > > I'm going to throw it up for a broader try run (multiple platforms, > additional tests). In the meantime, :gmoore, are you okay with those changes > being added? Oh awesome, that's great. Yeah, that's totally fine to add those changes. That makes sense that it would cause problems, since sTelemetry is not necessarily initialized when we call CanRecord*(). Thanks for finding that! Sorry for the late reply by the way, and thanks for all the help with this bug. Hopefully this will be the final patch we need. :)
Flags: needinfo?(olucafont6)
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/a848eff50b88 Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}().
Comment on attachment 8852743 [details] [diff] [review] Set mCanRecord{Base, Extended} in CreateTelemetryInstance(). Review of attachment 8852743 [details] [diff] [review]: ----------------------------------------------------------------- I've put an updated patch, guarding against !sTelemetry, up against try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6041725d16abcd3aae965ecd63c1dac6ed272c9 With luck, that ought to be the final form.
Attachment #8852743 - Flags: review?(chutten) → review-
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: