Closed Bug 896740 Opened 11 years ago Closed 9 years ago

Limit the number of chrome-hangs reported to Telemetry

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
1

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox45 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: vladan, Assigned: Dexter, Mentored)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [unifiedTelemetry] [measurement:client] [lang=c++])

Attachments

(1 file, 5 obsolete files)

Some clients report over 1000 chrome-hang stacks from a single session! This causes unnecessary memory use in Firefox and causes Telemetry to spend a long time uploading its ping. We should impose limits on the size of individual hang stacks (preserving only top X frames) as well as the number of chrome-hangs reported (keeping only the most recent stacks).
Blocks: 896744
I am interested for getting involved in this bug. Please tell me how to start and related information.
(In reply to subh.singh007 from comment #1) > I am interested for getting involved in this bug. Please tell me how to > start and related information. Thanks for helping out! Here's a list of things you have to do, I hope it's not too much info at once :) 1. Do you have access to a Windows machine? You'll need it to test out the change, because "chrome hangs" are currently a Windows-only feature 2. Check out and build Firefox on Windows: https://developer.mozilla.org/en-US/docs/Simple_Firefox_build You can use the following sample .mozconfig file: http://pastebin.mozilla.org/2696396 3. Read a bit of background about the chrome-hang reporting feature (the implementation details are likely to be out of date though): https://blog.mozilla.org/vdjeric/2012/03/21/introducing-chrome-hang-reporting-and-the-symbolication-server/ 4. Modify the CombinedStacks::AddStack function in toolkit/components/telemetry/Telemetry.cpp to impose a max on the number of frames in a stack: http://hg.mozilla.org/mozilla-central/file/5ecd26bc5274/toolkit/components/telemetry/Telemetry.cpp#l146 Make sure the "most recent" stack frames in a stack are preserved and the "older" frames (e.g. "int main()") are truncated 5. Further modify the CombinedStacks::AddStack function, to never hold more than some maximum number of stacks. So, for example, if the max is 50 and we already have 50 stacks when AddStack gets called again, then first drop the oldest stack and then add this latest one. 6. For testing the chrome hangs reporting mechanism, you can trigger a hang in your own Nightly: - First, enable Telemetry from about:telemetry - Write a webpage with a tight JavaScript loop that runs for more than 5 seconds - Visit the webpage, wait for the "Unresponsive script" dialog to pop up - Reload about:telemetry and verify there is new hang in the "Browser Hangs" section - You should make sure that about:telemetry never has more than the maximum number of stacks and that none of the stacks have more than the maximum number of frames. You might want to lower the maximums to some small number during testing to make it easier. You can reach me at my email address if you have any question.
Whiteboard: [mentor=vladan][lang=c++][good first bug] → [mentor=vladan][lang=c++]
> 1. Do you have access to a Windows machine? You'll need it to test out the > change, because "chrome hangs" are currently a Windows-only feature I don't have access to window machine. I am using Mac machine. I have build it on my mac machine. Is it not possible to continue on this bug using my mac machine?
OS: All → Windows 7
(In reply to subh.singh007 from comment #3) > I don't have access to window machine. I am using Mac machine. I have build > it on my mac machine. Is it not possible to continue on this bug using my > mac machine? It's a Windows-only feature, so you won't be able to test it on a Mac. You could try hacking together some test inputs, but I wouldn't recommend it. Do you have a Windows VM?
Since this looks like it's available and I'm currently building Firefox so I can start working on bugs, I'll say I'll take this one. I'm using Windows 8 as my primary OS, but I have a Windows 7 VHD that I can boot from to test this if necessary.
Taking a look at the code now... is ProcessedStack FIFO, FILO, or what? If it's FIFO, that makes dropping the oldest frames easy (pop pop pop), if it's FILO then I'll have to look further into it (which is no problem, I plan on going through and looking at everything once the build completes anyway). It's been a while since I did any programming, especially with vectors, so I welcome any suggestions and I forsee myself looking up many things to find the right function call. Also, any suggestions for the max number of frames and stacks? I'm thinking 15 stacks could be enough, but I'm not sure either how often the reports get sent or how often people experience hangs in general. For frames I'm thinking 50 is probably good - it should more than cover what's necessary to determine the cause behind most hangs.
Attached patch Patch without switching to deque (obsolete) (deleted) — Splinter Review
mStacks is still a vector as there are problems with changing it to a deque that I'm working on fixing, but this patch is fully functional and the size of the vector (50) means that it shouldn't really be a performance hit to drop the first stack off of mStacks. Once I can get it working with deque I'll submit another patch, this is mainly just because a deque may not be a viable option.
Attachment #793781 - Flags: review?(vdjeric)
Assignee: nobody → demize95
Comment on attachment 793781 [details] [diff] [review] Patch without switching to deque Thank you for working on this bug :) Just a couple of comments below. As we discussed on IRC, I'd prefer to use a circular buffer or a deque (as you suggested) to store the stacks. Either one is fine, but erasing elements from the beginning of a vector is inefficient. >+ //Limit number of stacks kept to 50 >+ if(mStacks.size() > 50) { >+ mStacks.erase(mStacks.begin()); >+ } We shouldn't hard-code "magic numbers" in the code, instead we should use a constant. It's easier to find & change the max if it's stored in one place in a variable, plus we only have to update the value in one place if we decide to change the maximum # of chrome hangs one day. See "kMaxFailedProfileLockFileSize" in the same file for an example. > // Copy the information to the return value. > ProcessedStack Ret; >+ // Limit the number of frames to 50 >+ size_t endIndex = (rawStack.size() > 50) ? (50) >+ : (rawStack.size()); > for (std::vector<StackFrame>::iterator i = rawStack.begin(), >- e = rawStack.end(); i != e; ++i) { >+ e = rawStack.begin() + endIndex; i != e; ++i) { > const StackFrame &rawFrame = *i; > ProcessedStack::Frame frame = { rawFrame.mPC, rawFrame.mModIndex }; > Ret.AddFrame(frame); > } We can save a bit of time in this function by imposing the max frames limit in the first loop, i.e. when we iterate over aPCs to copy them into rawStack.
Attachment #793781 - Flags: review?(vdjeric)
Attached patch Buggy patch after switching to deque (obsolete) (deleted) — Splinter Review
Comment on attachment 794848 [details] [diff] [review] Buggy patch after switching to deque The new patch got lost in my bugmail, I'll take a look tomorrow
Attachment #794848 - Flags: feedback?(vdjeric)
Going to be without reliable internet for a while (up to a week, I'll say). I'll make any changes you find when I get reliable internet.
Comment on attachment 794848 [details] [diff] [review] Buggy patch after switching to deque >+ >+ //Limit number of stacks kept Nit: space after comment >+ if(mStacks.size() > kMaxStacksKept) { >+ mStacks.pop_front(); >+ } > } You also need to erase the first element in mDurations, I think that was your bug Also, you need a space after "if" >+const size_t kMaxFramesKept = 50; >+ We don't want this const in the Telemetry namespace, so put the declaration next to kMaxStacksKept (in the anonymous namespace). Also, add a comment explaining what these consts represent >+ // Limit the number of frames >+ size_t endIndex = (aPCs.size() > kMaxFramesKept) ? (kMaxFramesKept) >+ : (aPCs.size()); You can use std::min and figure out the last iterator here: std::vector<uintptr_t>::const_iterator e = aPCs.begin() + std::min(aPCs.size, kMaxFramesKept); > for (std::vector<uintptr_t>::const_iterator i = aPCs.begin(), >- e = aPCs.end(); i != e; ++i) { >+ e = aPCs.begin() + endIndex; i != e; ++i) { Btw, this implementation means that we'll sometimes be holding onto module info that is no longer needed.. I don't think it's a big deal, so can you just add a comment about it?
Attachment #794848 - Flags: feedback?(vdjeric)
Mentor: vdjeric
Whiteboard: [mentor=vladan][lang=c++] → [lang=c++]
Blocks: 1186986
No longer blocks: 896744
OS: Windows 7 → All
Points: --- → 3
Priority: -- → P2
So, per bug 1187864 there are different things we can do here: (1) limit the number of hang stat entries (2) limit the stack depth to a sane number (3) collapse repeated successive "(chrome script)" to one "(chrome script)" or "(chrome script <N>)" For (3) i think we should just stay with collapsing into the simple "(chrome script)" (unless there are compelling reasons to make it more complicated).
This looks like it hasn't been worked on for a while, unassigning.
Assignee: demize95 → nobody
Points: 3 → 1
Whiteboard: [lang=c++] → [unifiedTelemetry] [measurement:client] [lang=c++]
Priority: P2 → P1
Assignee: nobody → alessio.placitelli
Attached patch bug896740.patch (obsolete) (deleted) — Splinter Review
This patch continued where the original contributor left off, by limiting the number of reported chrome hangs to 50 and the depth of the reported stack to 50. Due to the way annotations are de-duplicated, discarding stacks is a bit more involving than it was before. Please note that this patch stacks on bug 1213780.
Attachment #793781 - Attachment is obsolete: true
Attachment #794848 - Attachment is obsolete: true
Attachment #8678776 - Flags: review?(vladan.bugzilla)
Comment on attachment 8678776 [details] [diff] [review] bug896740.patch Review of attachment 8678776 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +170,5 @@ > return mModules[aIndex]; > } > > void > +CombinedStacks::PopFrontStack() { Let's make this more descriptive: RemoveOldestStack @@ +381,5 @@ > + continue; > + } > + > + // Shift the index down, to account for the removed stack in the deque. > + stackIndices[k] = stackIndices[k] - 1; instead of rewriting indices, how about an alternate approach where the stacks are stored in a circular buffer? You would only need to change (remove) references to the evicted element. @@ +402,5 @@ > size_t n = 0; > n += mStacks.SizeOfExcludingThis(); > // This is a crude approximation. See comment on > // CombinedStacks::SizeOfExcludingThis. > + n += mHangInfo.size() * sizeof(HangInfo); why this change? @@ +3999,5 @@ > ProcessedStack > GetStackAndModules(const std::vector<uintptr_t>& aPCs) > { > std::vector<StackFrame> rawStack; > + std::vector<uintptr_t>::const_iterator e = nit: e -> stackEnd @@ +4001,5 @@ > { > std::vector<StackFrame> rawStack; > + std::vector<uintptr_t>::const_iterator e = > + aPCs.begin() + std::min(aPCs.size(), kMaxChromeStackDepth); > + for (std::vector<uintptr_t>::const_iterator i = aPCs.begin(); i != e; ++i) { you can simplify several lines in this patch by using auto
Attachment #8678776 - Flags: review?(vladan.bugzilla)
Status: NEW → ASSIGNED
Blocks: 1218768
Attached patch bug896740.patch (obsolete) (deleted) — Splinter Review
Thanks for the review. This patch uses a circular queue instead of a deque. The vectors holding the stacks/hang info grow up to the maximum allowed size (50) and then wraps around.
Attachment #8678776 - Attachment is obsolete: true
Attachment #8679556 - Flags: review?(vladan.bugzilla)
Comment on attachment 8679556 [details] [diff] [review] bug896740.patch Review of attachment 8679556 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +156,5 @@ > private: > std::vector<Telemetry::ProcessedStack::Module> mModules; > std::vector<Stack> mStacks; > + size_t mLastWrittenIndex; > + size_t mNextIndex; As far as I can tell mNextIndex is just mLastWrittenIndex + 1. I think you can just keep one variable, mHangCounter, as a static variable in HangReports. It could be monotonically increasing and you could derive vector indeces from it @@ +176,5 @@ > void > CombinedStacks::AddStack(const Telemetry::ProcessedStack& aStack) { > + // Advance the indices of the circular queue holding the stacks. > + mLastWrittenIndex = mNextIndex % kMaxChromeStacksKept; > + mNextIndex = mLastWrittenIndex + 1; if you used only one index, this would become: size_t index = HangReports::GetHangCount() % kMaxChromeStacksKept; @@ +182,5 @@ > + if (mStacks.size() < kMaxChromeStacksKept) { > + mStacks.resize(mStacks.size() + 1); > + } > + // Get a reference to the location holding the new stack. > + CombinedStacks::Stack& adjustedStack = mStacks[mLastWrittenIndex]; this would become: adjustedStack = mStacks[index] @@ +335,5 @@ > int32_t aFirefoxUptime, > HangAnnotationsPtr aAnnotations) { > + // Append the new stack to the stack's circular queue. > + mStacks.AddStack(aStack); > + size_t hangIndex = mStacks.GetLastWrittenIndex(); this would become: hangIndex = mHangCounter % kMaxChromeStacksKept; @@ +386,5 @@ > + for (size_t k = 0; k < stackIndices.Length(); k++) { > + // Is this index referencing the removed stack? > + if (stackIndices[k] == aRemovedStackIndex) { > + toRemove = k; > + continue; are all the indices in iter->mHangIndices distinct? if so, why not break after finding one?
Attachment #8679556 - Flags: review?(vladan.bugzilla)
Attached patch bug896740.patch (obsolete) (deleted) — Splinter Review
As discussed over IRC, this patch leaves the counters in CombinedStack (as this is used both for lateWrites and chromeHangs).
Attachment #8679556 - Attachment is obsolete: true
Attachment #8680054 - Flags: review?(vladan.bugzilla)
Comment on attachment 8680054 [details] [diff] [review] bug896740.patch Review of attachment 8680054 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +122,5 @@ > size_t GetStackCount() const; > size_t SizeOfExcludingThis() const; > private: > std::vector<Telemetry::ProcessedStack::Module> mModules; > std::vector<Stack> mStacks; add a comment that this is a circular buffer and that mNextIndex is the next buffer element to write to @@ +305,5 @@ > + mHangInfo[hangIndex] = info; > + } > + // Remove any reference to the stack overwritten in the circular queue > + // from the annotations. > + PruneStackReferences(hangIndex); you can put this in the else above, size >= kMaxChromeStacksKept will be rare
Attachment #8680054 - Flags: review?(vladan.bugzilla) → review+
Attached patch bug896740.patch (deleted) — Splinter Review
Attachment #8680054 - Attachment is obsolete: true
Attachment #8681124 - Flags: review+
https://hg.mozilla.org/integration/fx-team/rev/c48cb2c5ccf0fbef908edb37a595fbff9c6d409f Bug 896740 - Limit the number of chrome-hangs reported to Telemetry. r=vladan
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: