Closed
Bug 1271991
Opened 8 years ago
Closed 8 years ago
Track recording state in Telemetry.cpp to avoid excessive locking in CanRecord{Base,Extended}
Categories
(Toolkit :: Telemetry, defect, P4)
Toolkit
Telemetry
Tracking
()
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox55 | --- | fixed |
People
(Reporter: jseward, Assigned: gmoore, Mentored)
References
(Blocks 1 open bug)
Details
(Whiteboard: [measurement:client])
Attachments
(1 file, 5 obsolete files)
(deleted),
patch
|
chutten
:
review-
|
Details | Diff | Splinter Review |
This is an indirect followup to bug 1261052. Per comment 14 there:
When we add the API-level locking in TelemetryHistograms, we'd block
all those CanRecordBase() & CanRecordExtended() checks on locking that
is specific to histograms. I think we should probably track the
recording state here as well so we can avoid that. Happy to take this
to a follow-up bug.
Updated•8 years ago
|
Priority: -- → P3
Whiteboard: [measurement:client]
Updated•8 years ago
|
Priority: P3 → P4
Comment 1•8 years ago
|
||
The involved functions here are:
CanRecordBase()
CanRecordExtended()
TelemetryImpl::GetCanRecordBase()
TelemetryImpl::SetCanRecordBase()
TelemetryImpl::GetCanRecordExtended()
TelemetryImpl::SetCanRecordExtended()
We want the getters to not call into (and wait for) TelemetryHistogram.cpp.
To achieve this, the setters should store the recording state locally.
Julian, could this just be stored in an atomic here?
Flags: needinfo?(jseward)
Comment 2•8 years ago
|
||
Gregory, would you be interested in taking this?
Flags: needinfo?(olucafont6)
Updated•8 years ago
|
Mentor: gfritzsche
Assignee | ||
Comment 3•8 years ago
|
||
Hi Georg, yeah, I am definitely interested in working on this, it looks interesting. Thanks for asking, I appreciate it.
I will look into this and let you know if I have any questions. Also, if there is any more information that would be useful, please let me know. Thanks again.
Flags: needinfo?(olucafont6)
Updated•8 years ago
|
Assignee: nobody → olucafont6
Assignee | ||
Comment 4•8 years ago
|
||
I created a patch with most of the changes that I think we would need to make. It's probably not finished yet though, and I have some questions about it.
1. We are supposed to track the recording state locally in Telemetry.cpp, but where exactly should we store it? Right now it's just a global variable in mozilla::Telemetry (in Telemetry.h), because I wasn't sure where else to put it. Should we make it a private static variable of TelemetryImpl? And then just call TelemetryImpl::GetCanRecord*() from CanRecord*(), instead of TelemetryHistogram::CanRecord*()?
2. How should we initialize the value of gCanRecord* (or whatever it will be called)? Right now I just initialize it to false, because that's how the other similar variables seem to be initialized. I'm not sure if there's a better, more correct way to do it though.
Thanks.
Flags: needinfo?(gfritzsche)
Comment 5•8 years ago
|
||
Comment on attachment 8849237 [details] [diff] [review]
Created initial version of patch.
Review of attachment 8849237 [details] [diff] [review]:
-----------------------------------------------------------------
This looks close, although we should keep the state private.
Julian, do you think the atomic use here is ok?
::: toolkit/components/telemetry/Telemetry.h
@@ +44,5 @@
> Microsecond
> };
>
> +Atomic<bool> gCanRecordBase(false);
> +Atomic<bool> gCanRecordExtended(false);
We shouldn't expose these in the header.
How about we just make them members of the TelemetryImpl class?
Attachment #8849237 -
Flags: feedback?(jseward)
Reporter | ||
Comment 6•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away March 24 - April 2] from comment #5)
> This looks close, although we should keep the state private.
> Julian, do you think the atomic use here is ok?
From irc with Georg, the following seems to be the case:
(1) It is OK that gCanRecordBase and gCanRecordExtended can change state
independantly of any other data. In particular there is no need to
coordinate with changes to the mutex-protected data in the "protected"
data collections (TelemetryHistogram, TelemetryScalar, ?)
(2) Because these two booleans are used only for assignment and comparison,
the only benefit from the Atomic<> use is memory fences, meaning:
initially: gCanRecord = false
T1: .. X .. ; gCanRecord = true
T2: while (gCanRecord != true) { /* do nothing */ }; .. Y ..
it guarantees that all the stores done by T1 inside X are visible to T2
when it starts Y. But they don't give you anything else.
So, given (1), I think the atomic use here is OK.
Flags: needinfo?(jseward)
Reporter | ||
Updated•8 years ago
|
Attachment #8849237 -
Flags: feedback?(jseward) → feedback+
Comment 7•8 years ago
|
||
Heads-up: i'll be away until Apr 3rd, :Dexter or :chutten could review instead.
Flags: needinfo?(gfritzsche)
Comment 8•8 years ago
|
||
(In reply to Gregory Moore [:gmoore] from comment #4)
> 1. We are supposed to track the recording state locally in Telemetry.cpp,
> but where exactly should we store it? Right now it's just a global variable
> in mozilla::Telemetry (in Telemetry.h), because I wasn't sure where else to
> put it. Should we make it a private static variable of TelemetryImpl? And
> then just call TelemetryImpl::GetCanRecord*() from CanRecord*(), instead of
> TelemetryHistogram::CanRecord*()?
Exactly!
> 2. How should we initialize the value of gCanRecord* (or whatever it will be
> called)? Right now I just initialize it to false, because that's how the
> other similar variables seem to be initialized. I'm not sure if there's a
> better, more correct way to do it though.
This is fine, just keep with the existing behavior in TelemetryHistogram etc.
Assignee | ||
Comment 9•8 years ago
|
||
Great, thanks for the information everyone. I have made the changes but I ran into an issue (or at least an area that could probably be improved on).
I didn't realize this earlier but it seems that TelemetryImpl is actually a singleton class. I thought we could just make mCanRecord* private static variables, but GetCanRecord*() is not a static function, so we need the instance of TelemetryImpl to call it. For now in CanRecord*(), I used do_GetService(), but it's kind of clunky and I didn't see any other functions doing that. This seems like we could improve it (let me know if you have any suggestions).
One thought would be to make CanRecord*() friend functions of TelemetryImpl, so then they could directly access sTelemetry and just call the functions from there. But that seems like overkill (and would probably ruin the point of the singleton), and it's not done anywhere else.
Chris, would you mind reviewing on this bug as well? It's less reading than bug 1120409 :) Thanks.
Attachment #8849237 -
Attachment is obsolete: true
Flags: needinfo?(chutten)
Comment 10•8 years ago
|
||
Comment on attachment 8850794 [details] [diff] [review]
Made mCanRecord* private members of TelemetryImpl.
Review of attachment 8850794 [details] [diff] [review]:
-----------------------------------------------------------------
I think DoStackCapture() might be the commonly-accepted way of doing this sort of thing...
Updated•8 years ago
|
Flags: needinfo?(chutten)
Assignee | ||
Comment 11•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #10)
> I think DoStackCapture() might be the commonly-accepted way of doing this
> sort of thing...
Oh okay, I see. I did something similar and updated the patch. It seems kind of strange to have two different functions that do the same thing (but one is static), but I guess the non-static versions can be called from JavaScript. Thanks.
Attachment #8850794 -
Attachment is obsolete: true
Attachment #8851264 -
Flags: review?(chutten)
Comment 12•8 years ago
|
||
Comment on attachment 8851264 [details] [diff] [review]
Added static functions to TelemetryImpl for retrieving recording state.
Review of attachment 8851264 [details] [diff] [review]:
-----------------------------------------------------------------
Yup, it's weird. Likely it's weird more because of how we've structured the code than limitations on what methods can be exposed to JS... but that doesn't make it any less odd.
Thanks for ploughing through this!
Attachment #8851264 -
Flags: review?(chutten) → review+
Updated•8 years ago
|
Keywords: checkin-needed
Comment 13•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/810aeb2601d6
Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}(). r=chutten
Keywords: checkin-needed
Comment 14•8 years ago
|
||
Backed out for bustage in Telemetry.cpp:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d2a17c05c419ff462cf32bcc9f541fccfa89ecf5
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=43a3a3c4f4514f1db3caa41a45dda5b71b2e86f0&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=86738331&repo=mozilla-inbound
13:20:18 INFO - /builds/slave/m-in-m64-000000000000000000000/build/src/toolkit/components/telemetry/Telemetry.cpp:1252:5: error: field 'mFailedLockCount' will be initialized after field 'mCanRecordBase' [-Werror,-Wreorder]
13:20:18 INFO - , mFailedLockCount(0)
13:20:18 INFO - ^
13:20:18 INFO - 1 error generated.
Flags: needinfo?(olucafont6)
Assignee | ||
Comment 15•8 years ago
|
||
Oh whoops, didn't see that. I've changed the order of initialization (assuming that's the only thing that's causing the problem). Thanks.
Attachment #8851264 -
Attachment is obsolete: true
Flags: needinfo?(olucafont6)
Attachment #8851773 -
Flags: review?(chutten)
Comment 17•8 years ago
|
||
Comment on attachment 8851773 [details] [diff] [review]
Fixed initialization order error that caused bug 1351058.
Review of attachment 8851773 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/Telemetry.cpp
@@ +1245,5 @@
>
> TelemetryImpl::TelemetryImpl()
> : mHashMutex("Telemetry::mHashMutex")
> , mHangReportsMutex("Telemetry::mHangReportsMutex")
> , mThreadHangStatsMutex("Telemetry::mThreadHangStatsMutex")
This is declared after mCanRecordExtended. Doesn't -Wreorder complain about it too?
Attachment #8851773 -
Flags: review?(chutten)
Assignee | ||
Comment 18•8 years ago
|
||
Oh, yeah I guess it probably would. I wasn't actually sure what that warning meant, so I thought only the order of mFailedLockCount mattered. On Windows it doesn't seem to report the -Wreorder warning either, at least with the default build settings. Hopefully this patch will work. Thanks for the help with this bug!
Attachment #8851773 -
Attachment is obsolete: true
Attachment #8852106 -
Flags: review?(chutten)
Comment 19•8 years ago
|
||
Comment on attachment 8852106 [details] [diff] [review]
Fixed initialization order (again).
Review of attachment 8852106 [details] [diff] [review]:
-----------------------------------------------------------------
Seems legit.
Attachment #8852106 -
Flags: review?(chutten) → review+
Updated•8 years ago
|
Keywords: checkin-needed
Comment 20•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1138b6fe86e1
Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}(). r=chutten
Keywords: checkin-needed
Comment 21•8 years ago
|
||
Backed out for failing various EME, CDM and Telemtry tests:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d2d1f1a8aca16264de48bcd9a38cdf74b76633e8
Push with failures (at least the browser-chrome and mda): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1138b6fe86e164767b141265b8c30c0c8c769a63&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Another push with failures (new: GTest, xpcshell): https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=92fe8848d8b733fa9614024083dc53f2a733bf6a&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Flags: needinfo?(olucafont6)
Comment 22•8 years ago
|
||
... that's a lot of orange :S
On the plus side, these failures should be reproducible locally. :gmoore, do you need any assistance running tests? It ought to be as simple as running ./mach test toolkit/components/telemetry
Assignee | ||
Comment 23•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #22)
> On the plus side, these failures should be reproducible locally. :gmoore, do
> you need any assistance running tests? It ought to be as simple as running
> ./mach test toolkit/components/telemetry
Alright, thanks. Yeah, I ran the tests and had the same errors as shown on the try run. I think I discovered the bug though.
In the constructor for TelemetryImpl, we just set mCanRecordBase and mCanRecordExtended to false, but we actually have the correct values available in CreateTelemetryInstance():
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#2334-2342
And we use those values to call the various InitializeGlobalState() functions, which change the underlying values that TelemetryImpl::mCanRecord{Base, Extended} are representing. So they could be getting set to true, but the TelemetryImpl booleans could possibly never know.
It might be better to add two arguments to the TelemetryImpl constructor, but for now I just set the values in InitializeGlobalState() and that seemed to make the tests pass (on my local build at least). One test kept timing out (test_TelemetrySession.js), but I believe it would always do that on my local build. So I think this patch should fix the problem.
Attachment #8852106 -
Attachment is obsolete: true
Flags: needinfo?(olucafont6)
Attachment #8852743 -
Flags: review?(chutten)
Comment 24•8 years ago
|
||
Comment on attachment 8852743 [details] [diff] [review]
Set mCanRecord{Base, Extended} in CreateTelemetryInstance().
Review of attachment 8852743 [details] [diff] [review]:
-----------------------------------------------------------------
I've placed it up on try to see if this takes care of things in taskcluster: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7d47225aeaf7da56231f627ca674becd8171736
Comment 25•8 years ago
|
||
try's not the happiest about the patch. Are any of the failing tests ones that you see failing locally?
Assignee | ||
Comment 26•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #25)
> try's not the happiest about the patch. Are any of the failing tests ones
> that you see failing locally?
I tried running several of the failing tests locally, but I didn't get any of the failures that the try run did. I guess that makes sense though since it only seems to be failing on Linux. Also, looking at the try run, a lot of the failures say:
> application crashed [@ mozilla::Atomic<bool, (mozilla::MemoryOrdering)2u, void>::operator bool]
So it seems like the problem might be with our usage of the Atomic<bool> class. I'm not sure if there's something were doing incorrectly.
Also, not sure if I am just reading the try run incorrectly, but for the file test_eme_initDataTypes.html, it has a bunch of lines like this:
> TEST-UNEXPECTED-FAIL | dom/media/test/test_eme_initDataTypes.html | 'One keyId' expected to fail.
However, in the actual file, and on my local test run (though on Windows), it says it expects the opposite:
> 10 INFO TEST-PASS | dom/media/test/test_eme_initDataTypes.html | 'One keyId' expected to pass
This is only one of the tests, but I just thought that was strange. Thanks.
Comment 27•8 years ago
|
||
I wonder if the lifetime of sTelemetry is to blame, since the crashing stack is just asking if it can record extended. I have a Linux box that I can try things on tomorrow. I'll leave the r? up to remind me.
Likely the EME tests are expected to fail on Linux due to platform reasons, and expected to succeed on Windows.
Comment 28•8 years ago
|
||
I tried reproducing the crash locally under debugger, but though I can find plenty of JS and GFX crashes on browser/components/contextualidentity/test/browser/browser_eme.js, I cannot get the Atomic<bool> crash to happen under inspection.
I'm probably just doing it wrong, so as a last-ditch effort I threw some if (!sTelemetry) { return false; } into CanRecordBase and CanRecordExtended and lo and behold, it appears to have worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3a62fe6af4649e23d834a3e0f72c8ae4960b7ed
I'm going to throw it up for a broader try run (multiple platforms, additional tests). In the meantime, :gmoore, are you okay with those changes being added?
Updated•8 years ago
|
Flags: needinfo?(olucafont6)
Assignee | ||
Comment 29•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #28)
> I'm probably just doing it wrong, so as a last-ditch effort I threw some if
> (!sTelemetry) { return false; } into CanRecordBase and CanRecordExtended and
> lo and behold, it appears to have worked:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=a3a62fe6af4649e23d834a3e0f72c8ae4960b7ed
>
> I'm going to throw it up for a broader try run (multiple platforms,
> additional tests). In the meantime, :gmoore, are you okay with those changes
> being added?
Oh awesome, that's great. Yeah, that's totally fine to add those changes. That makes sense that it would cause problems, since sTelemetry is not necessarily initialized when we call CanRecord*(). Thanks for finding that!
Sorry for the late reply by the way, and thanks for all the help with this bug. Hopefully this will be the final patch we need. :)
Flags: needinfo?(olucafont6)
Comment 30•8 years ago
|
||
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a848eff50b88
Store recording state locally in Telemetry.cpp to avoid excessive locking in CanRecord{Base, Extended}().
Comment 31•8 years ago
|
||
Comment on attachment 8852743 [details] [diff] [review]
Set mCanRecord{Base, Extended} in CreateTelemetryInstance().
Review of attachment 8852743 [details] [diff] [review]:
-----------------------------------------------------------------
I've put an updated patch, guarding against !sTelemetry, up against try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6041725d16abcd3aae965ecd63c1dac6ed272c9
With luck, that ought to be the final form.
Attachment #8852743 -
Flags: review?(chutten) → review-
Comment 32•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in
before you can comment on or make changes to this bug.
Description
•