Closed
Bug 1030487
Opened 10 years ago
Closed 10 years ago
Don't report main-thread I/O to Telemetry unless it lasts more than 50ms
Categories
(Core :: General, defect)
Core
General
Tracking
()
RESOLVED
FIXED
mozilla34
People
(Reporter: vladan, Assigned: rafael.kourdis, Mentored)
References
Details
(Whiteboard: [good first bug][lang=c++])
Attachments
(1 file, 5 obsolete files)
(deleted),
patch
|
bugzilla
:
review+
|
Details | Diff | Splinter Review |
BACKGROUND:
Firefox currently has a single-process architecture where most webpage, UI, and extension JavaScript is execute on a single thread (the "main thread"). Therefore, any long-running operation on the main thread makes Firefox unresponsive. I/O operations in particular have unbounded execution times, so it is very undesirable to do I/O on the main thread.
We want to collect information about all the main thread I/O currently happening in Firefox, so we intercept any I/O operations done on the main thread and report it to the "Telemetry" performance reporting system.
You can see the current results here: http://telemetry.mozilla.org/mainthreadio/
THE PROBLEM:
We currently report *ALL* main-thread I/O to Telemetry, but we really only want to report main-thread I/O that actually causes performance problems. For example, we don't want to inform Telemetry that a file was read on the main thread, if that file was first read on a background thread -- the file will have already been cached in the disk cache so the main thread read is not a performance problem.
Therefore we should only record occurences of main-thread I/O if they took more than 50ms to execute.
CODE:
You can add the check for duration here: http://hg.mozilla.org/mozilla-central/file/3961ab6093b4/xpcom/build/MainThreadIOLogger.cpp#l201
Comment 1•10 years ago
|
||
Does 50ms come from anywhere in particular? Would mapping it to 60fps response time of 16ms make sense?
Reporter | ||
Comment 2•10 years ago
|
||
It's arbitrary. It's based on an estimate of how long a browser response to a user click could be delayed without the user noticing. We use the same threshold for slowSQL (http://telemetry.mozilla.org/slowsql)
Reporter | ||
Comment 3•10 years ago
|
||
And yes 16ms would make sense too
Assignee | ||
Comment 4•10 years ago
|
||
Hey Guys!
I find this a nice first bug for a beginner like me :)
Can I work on it? I already have something ready, I can upload it tomorrow.
Thanks
Assignee | ||
Comment 5•10 years ago
|
||
Assignee | ||
Comment 6•10 years ago
|
||
Hello,
Since there haven't been any replies, I've decided to upload the patch anyway.
I do believe the patch is correct, however, as this is the first time I've worked with
Mozilla code, I'm not entirely sure.
I really look forward to your feedback!
Thanks.
Assignee | ||
Updated•10 years ago
|
Attachment #8466741 -
Flags: review?(vdjeric)
Updated•10 years ago
|
Mentor: vdjeric
Whiteboard: [good first bug][lang=c++][mentor=vladan] → [good first bug][lang=c++]
Updated•10 years ago
|
Assignee: nobody → rafael.kourdis
Status: NEW → ASSIGNED
Comment 7•10 years ago
|
||
Hi Rafel - just saw your comments in #developers - if it helps the "PTO" in the display name of Vladan Djeric (the person from whom you've requested review) means "personal time off" and is a Mozilla-ism (or USA-ism, not sure which) for "on holiday" - and so means Vladan is on holiday 4-5th August. This use of "PTO" catches people out all the time, so it's not just you :-)
Thank you for the patch!
Assignee | ||
Comment 8•10 years ago
|
||
Oh, wouldn't have guessed that, for sure :)
Thanks for your reply, looking forward for the review!
Reporter | ||
Comment 9•10 years ago
|
||
Comment on attachment 8466741 [details] [diff] [review]
Patch Version 1
Review of attachment 8466741 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, just a few small comments. Fix it up and I'll do a final review :)
See here for our full coding style guide: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#General_C.2FC.2B.2B_Practices
::: xpcom/build/MainThreadIOLogger.cpp
@@ +199,5 @@
> return;
> }
> +
> + if(TimeStamp::Now() - aObservation.Start() < TimeDuration::FromMilliseconds(50))
> + {
- Is aObservation.mEnd set at this point? If it is, "aObservation.End() - aObservation.Start()" might be faster and more accurate than "TimeStamp::Now() - aObservation.Start()"
- Define the 50 millisecond threshold as a static constant. For example: http://hg.mozilla.org/mozilla-central/file/de4289ba35d6/toolkit/components/telemetry/Telemetry.cpp#l2489
- Add a space between the "if" and the "("
- the curly brace "{" should be on the same line as the "if"
@@ +201,5 @@
> +
> + if(TimeStamp::Now() - aObservation.Start() < TimeDuration::FromMilliseconds(50))
> + {
> + //Don't report I/O operation to Telemetry if there haven't
> + //passed 50 ms.
// Only report slow main-thread I/O to Telemetry (at least 50 ms)
Attachment #8466741 -
Flags: review?(vdjeric)
Assignee | ||
Comment 10•10 years ago
|
||
Thanks a lot for the review :) I made the changes, and will upload the patch in a while.
However, I am not 100% sure if mEnd is set when MainThreadIOLoggerImpl::Observe() is called.
Here is what I've found so far :
MainThreadIOLoggerImpl::Observe() is called by CallObservers() in IOInterposer.cpp.
It (as I understand it) created a vector of pointers to observer objects and at line 157 ((*i)->Observe(aObservation);) observes them.
Now, PerThreadData::CallObservers() is only called by IOInterposer::Report() at IOInterposer.cpp. (line 534), and Report() is called by 3 functions.
First of all, the destructor of IOThreadAutoTimer (~IOThreadAutoTimer(), line 102, TelemetryVFS.cpp).
The destructor creates a new Observation, and feeds an End value (Timestamp::Now()), so, no problem here.
Secondly, Report() is called at IOInterposer.cpp, line 437, from inside the Observation class, and IS assigned an end time before Report().
Report() is also called at IOInterposer.cpp, line 590, where a NextStageObservation object is constructed and then gets reported.
I can see from the constructor of NextStageObservation, that the aShouldReport flag of Observation is set to false, and Report() is not going to be called at IOInterposeObserver::Observation::Report() if the flag isn't set.
Please correct me if I'm wrong.
All these areas suggest that mEnd is set, when Observe() is called, however, if it's possible, I'd like a confirmation from you guys (I'm not experienced at all :) ).
Thanks for your time!
Assignee | ||
Comment 11•10 years ago
|
||
Attachment #8469191 -
Flags: review?(vdjeric)
Reporter | ||
Updated•10 years ago
|
Attachment #8466741 -
Attachment is obsolete: true
Reporter | ||
Comment 12•10 years ago
|
||
Comment on attachment 8469191 [details] [diff] [review]
Patch Version 2
Review of attachment 8469191 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, just a few tiny changes. If you're ok with my suggestions, I can make the changes myself and land your patch in the repository
Did you build and test this patch by the way?
::: xpcom/build/MainThreadIOLogger.cpp
@@ +187,5 @@
> PR_Close(fd);
> }
>
> +const double TelemetryReportThreshold = 50;
> +// Milliseconds
Move the comment above the declaration, and use TimeDuration units here to avoid repeating the conversion in MainThreadIOLoggerImpl::Observe
// Threshold for reporting slow main-thread I/O (50 milliseconds)
const TimeDuration kTelemetryReportThreshold = TimeDuration::FromMilliseconds(50);
@@ +201,5 @@
> // The writer thread isn't running. Don't enqueue any more data.
> return;
> }
> +
> + if (aObservation.End() - aObservation.Start() < TimeDuration::FromMilliseconds(TelemetryReportThreshold)){
this line is too long, but it turns out there's a Duration() method, so
if (aObservation.Duration() < kTelemetryReportThreshold) {
..
}
Attachment #8469191 -
Flags: review?(vdjeric) → review+
Reporter | ||
Comment 13•10 years ago
|
||
Aaron, will this patch prevent the profiler or automated tests from seeing main-thread I/O taking less than 50ms?
Flags: needinfo?(aklotz)
Comment 14•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #13)
> Aaron, will this patch prevent the profiler or automated tests from seeing
> main-thread I/O taking less than 50ms?
We should be making this change to TelemetryIOInterposeObserver instead of MainThreadIOLogger.
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#414
I think I mistakenly pointed out the wrong location when we were discussing the requirements for this bug. The good news is that the additional code should translate easily to the other class. Sorry about that. :-(
Flags: needinfo?(aklotz)
Assignee | ||
Comment 15•10 years ago
|
||
Hello,
Thanks for the valuable feedback!
No problem, I'll make the changes and will resubmit the patch.
(It might take a while, building firefox takes at least 1 hour on my computer.)
Vladan, yes, I build it successfully and run normally, however, I wasn't sure about when the function kicks in, and any printf's that I used, didn't show up...
(Could you help me with that?)
Thanks :)
Assignee | ||
Comment 16•10 years ago
|
||
Submitting fixed review based on Vladan's and Aaron's feedback.
Attachment #8469191 -
Attachment is obsolete: true
Attachment #8470149 -
Flags: review?(vdjeric)
Assignee | ||
Comment 17•10 years ago
|
||
Keep in mind, although it does compile and run normally, I haven't tested it (See comment 15).
Thanks and sorry for being late :)
Reporter | ||
Comment 18•10 years ago
|
||
- You don't need to rebuild the entire Firefox source when you make small changes like these in *.cpp files.
You can just type "mach build xpcom/build/" or "mach build toolkit/components/telemetry" from the top of your source tree. This should only take ~5 minutes at maximum.
- printfs should work as expected.. you could also set breakpoints. How are you launching Firefox after you build it?
Reporter | ||
Comment 19•10 years ago
|
||
Comment on attachment 8470149 [details] [diff] [review]
Bug1030487v3.patch
Looks good, please test it and then we can land it
Attachment #8470149 -
Flags: review?(vdjeric) → review+
Assignee | ||
Comment 20•10 years ago
|
||
(Answering comment 18) :
I use ./mach run to launch FF.
I knew about not building the whole app, but wasn't sure about when it can be done...
------------------------
I'm now rebuilding everything after hg pull -u... I just had a problem where FF just hung (Can't see the output, as it's the terminal has a history limit).
I'm pretty sure it's not the patch (removing it didn't fix the problem).
HOWEVER, please wait for the build to finish before pushing the code... I'll leave a comment with the results.
Thanks
Assignee | ||
Comment 21•10 years ago
|
||
Heads up, actually, there is something wrong...
It's the patch that caused the crashes....
More info :
The program crashes with a signal 11.
Output---------------------------
...
UNKNOWN [/home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x050F60D1]
UNKNOWN [/home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x04CD44BF]
UNKNOWN 0x7f42d3b59943
Sleeping for 300 seconds.
Type 'gdb /home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/firefox 3832' to attach your debugger to this thread.
---------------------------------
GDB OUTPUT----------------------
(process:3886): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
nsStringStats
=> mAllocCount: 18
=> mReallocCount: 12
=> mFreeCount: 17 -- LEAKED 1 !!!
=> mShareCount: 16
=> mAdoptCount: 1
=> mAdoptFreeCount: 1
=> Process ID: 3886, Thread ID: 140737353918336
---------------------------------
I will post any updates here.
Assignee | ||
Comment 22•10 years ago
|
||
Yes, printf's do, in fact work (the function wasn't called previously, that's why I didn't see any output).
I think I've found the problem.
Printing aOb.DUration().ToMilliseconds prints a really large number (140734225164320) and causes a crash.
I believe mEnd isn't set (comment 10 was about MainThreadIOLoggerImpl::Observe()).
Any ideas?
Assignee | ||
Comment 23•10 years ago
|
||
Yep, mEnd is the problem...
(Ignore the large number, it was a humiliating mistake from my side, I used %lld to print a double :P)
Assertion failure: !IsNull() (Cannot compute with a null value), at ../../dist/include/mozilla/TimeStamp.h:304
From TimeStamp.h:
TimeDuration operator-(const TimeStamp& aOther) const
{
MOZ_ASSERT(!IsNull(), "Cannot compute with a null value");
MOZ_ASSERT(!aOther.IsNull(), "Cannot compute with aOther null value");
Duration() does mEnd - mStart, so mEnd is the problem here...
Could we possibly check if mEnd is empty before the check?
It's kinda late here... I'll see your feedback tomorrow.
Also, sorry for all these replies, I should have used IRC.
Thanks
Assignee | ||
Comment 24•10 years ago
|
||
Hello,
Seems like checking for aOb.End().IsNull(), does work.
Will test and upload this in a while.
Thanks
Assignee | ||
Comment 25•10 years ago
|
||
Attachment #8470149 -
Attachment is obsolete: true
Attachment #8470437 -
Flags: review?(vdjeric)
Reporter | ||
Comment 26•10 years ago
|
||
The patch looks good, but let's clear up this assert failure first before landing it.
Aaron, how can Observation::End() return null at this point in the code?
Flags: needinfo?(aklotz)
Comment 27•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #26)
> The patch looks good, but let's clear up this assert failure first before
> landing it.
>
> Aaron, how can Observation::End() return null at this point in the code?
We would know more definitively if provided a full call stack for the assertion.
I think the only place that this can happen is in NextStageObservation. I think that instead of checking for null mEnd, we should add a line after http://mxr.mozilla.org/mozilla-central/source/xpcom/build/IOInterposer.cpp#379 to set mEnd = mStart. This observation is supposed to be instantaneous, so I think that makes more sense anyway.
Flags: needinfo?(aklotz)
Reporter | ||
Comment 28•10 years ago
|
||
Rafael, can you update your patch with the change Aaron suggested and we'll get it landed?
Flags: needinfo?(rafael.kourdis)
Assignee | ||
Comment 29•10 years ago
|
||
I'm very sorry I didn't respond earlier. I was facing a problem with my computer and had to reinstall my OS.
Aaron, yes, now that I'm thinking about it I certainly should have provided a stack trace. I'm sorry about that, too :)
Thanks.
Attachment #8470437 -
Attachment is obsolete: true
Attachment #8470437 -
Flags: review?(vdjeric)
Attachment #8472853 -
Flags: review?(vdjeric)
Flags: needinfo?(rafael.kourdis)
Reporter | ||
Comment 30•10 years ago
|
||
Comment on attachment 8472853 [details] [diff] [review]
Bug1030487v5.patch
Review of attachment 8472853 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/Telemetry.cpp
@@ +419,5 @@
> // We only report main-thread I/O
> if (!IsMainThread()) {
> return;
> }
> +
remove the trailing whitespace
@@ +422,5 @@
> }
> +
> + if (aOb.Duration() < kTelemetryReportThreshold) {
> + return;
> + }
put this if-block after the check for OpNextStage below.. otherwise we won't handle the startup/normal/shutdown stage transition
Attachment #8472853 -
Flags: review?(vdjeric) → review+
Assignee | ||
Comment 31•10 years ago
|
||
I've fixed the issues as per your suggestions.
I know I might be wasting your time, but, if it's possible, could you please firmly explain why OpNextStage is an Operation type?
In what way is it similar to an I/O operation?
Thanks
Attachment #8472853 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #8473360 -
Flags: review?(vdjeric)
Comment 32•10 years ago
|
||
(In reply to Rafael Kourdis from comment #31)
> I know I might be wasting your time, but, if it's possible, could you please
> firmly explain why OpNextStage is an Operation type?
> In what way is it similar to an I/O operation?
>
> Thanks
I needed a way for IOInterposer and its observers to be notified when Firefox transitions between stages. Since we already have that code in place for I/O operations, I elected to create OpNextStage as a meta-operation to do that signalling.
Comment 33•10 years ago
|
||
Also, since Vladan already gave the previous version of your patch a review+, you don't need to request a review again. Just set the review flag to + on your patch and note in the comments that you have addressed his comments and are carrying forward the review that was granted in the previous version.
Comment 34•10 years ago
|
||
Comment on attachment 8473360 [details] [diff] [review]
Bug1030487v5-fixed.patch
Marking patch as r+ (carry forward from vladan)
Attachment #8473360 -
Flags: review?(vdjeric) → review+
Reporter | ||
Comment 35•10 years ago
|
||
Comment 36•10 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Assignee | ||
Comment 37•10 years ago
|
||
Thanks a lot guys for the valuable help you provided, and for pushing the code to the tree.
Sorry for not replying earlier. My PC's PSU got fried (you can tell I'm a lucky man :) and I didn't think of using my phone.
So, thanks again for everything. This community is so strong only because of each and every one of you :)
Rafael.
You need to log in
before you can comment on or make changes to this bug.
Description
•