1303077 - Gather client-side stacks for the content process when a hang occurs after paint is forced

Assignee

Description

•

8 years ago

According to gsvelto, this kind of wizardry is possible now.

Gabriele Svelto [:gsvelto]

Comment 1

•

8 years ago

Almost, I'm about to land bug 1280477 which introduces an executable that can analyze minidumps and spit out stacks from them. There's a few gotchas right now though:

- It's designed to populate main crash events, so it'll need some tweaking to analyze any kind of minidump and return output in a way that's not just appending a JSON blob to an existing file

- It analyzes *the entire minidump* so if you're only interested in one process or one thread we've got no option to filter it out, but this could be added

- We don't have proper metadata-based stack-walking on many platforms so when working on release builds that have no frame pointers the stacks are walked using stack scanning which might or might not yield a proper stack trace. We're planning on adding proper metadata-based stack-walking but it's a long term goal as it requires significant infrastructure work

I'll soon modify this stuff to deal with minidump generated from content process crashes (bug 1293656); once that's done we'll have 90% of what's needed to use it to get client-side stacks from any sort of hangs.

Depends on: 1280477

(not currently active) Ted Mielczarek

Comment 2

•

8 years ago

Note that using Breakpad for this might be a little heavyweight--it's going to capture a dump of the entire process. We have all the machinery for that, but I don't know what the perf impact looks like. We haven't ever worried about that in the past because it's only been used in cases of crashes/hangs. The profiler machinery feels like it might be more suitable to this, but I can't recall if it requires cooperation from the content process to get a stack out of it (which obviously doesn't work well if that process is hung).

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 3

•

8 years ago

We could use the hang monitor channel to send a message asking a tab to walk the stack and send a stack trace back. This would be a pretty simple service to implement since the hang monitor already runs on a separate thread in the child process. It could call MozStackWalk and send the result back as a string.

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 4

•

8 years ago

Supposing we do this, what's the right ingestion end point for all of these stacks? The ChromeHangs stuff (which, I think, is part of BHR? I think?) might be one place.

Bug 1225851 offers an alternative, where stacks can be sent up to Telemetry, but a (very quick) scan of the bug shows that:

a) It hasn't landed yet, and
b) It has a requirement of one stack per "key", so unless we want to bump our key per spinner, this might not be the right ingestion point.

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1225851

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 5

•

8 years ago

Cc'ing gfritzsche, who might find this stuff interesting.

Georg Fritzsche [:gfritzsche]

Comment 6

•

8 years ago

(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #4)
> Bug 1225851 offers an alternative, where stacks can be sent up to Telemetry,
> but a (very quick) scan of the bug shows that:
> 
> a) It hasn't landed yet, and
> b) It has a requirement of one stack per "key", so unless we want to bump
> our key per spinner, this might not be the right ingestion point.

Re a), it's closing in now and we could speed it up by taking over parts.
For b), the "one stack per key" limitation was to simplify design as the original request didn't need more. We could change that if needed.

Another thing:
Both chromeHangs and bug 1225851 only collect the stack for one thread.
I assume you'd need the state of all threads for diagnosis?

Dão Gottwald [:dao]

Updated

•

8 years ago

Priority: -- → P2

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

Depends on: 1225851

See Also: https://bugzilla.mozilla.org/show_bug.cgi?id=1225851 →

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 7

•

8 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> 
> Another thing:
> Both chromeHangs and bug 1225851 only collect the stack for one thread.
> I assume you'd need the state of all threads for diagnosis?

Hard to say - it might be different from cause to cause. Good thing to keep in mind though, so thanks.

Hey billm, can you describe how you plan on gathering the stack from the content process, and what kind of API you're hoping for so we can stash the stack in Telemetry? It looks like the patch in bug 1225851 adds a captureStack method to nsITelemetry, but only for the thread it's called from. If the main thread in the content process is blocked, (if I understand things correctly) I suspect we'll need to use a slightly different mechanism to get its stack.

I just want to get a sense of what kind of changes I need to make in that patch before I proceed.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 8

•

8 years ago

My basic plan was to send a message on the hang monitor channel asking for a stack. That would be received on the hang monitor thread in the child. It would invoke a method to capture the main thread's stack. So we don't actually need to pass a thread into the capture routine--we could default to the main thread. But I guess there's no harm in asking for a specific thread. We'll also need the capture routine to be threadsafe (it looks like it isn't exactly threadsafe right now, although it sort of tries to be) and to work in the child process.

Flags: needinfo?(wmccloskey)

Georg Fritzsche [:gfritzsche]

Comment 9

•

8 years ago

We'll need to file a follow-up for thread-safety anyway, i filed bug 1307746 on that.

Is the main threads stack usually sufficient for the scenarios you are looking at?
I wonder if scenarios come up where the main thread is actually just waiting on other threads.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 10

•

8 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> Is the main threads stack usually sufficient for the scenarios you are
> looking at?
> I wonder if scenarios come up where the main thread is actually just waiting
> on other threads.

I suspect the main thread is likely to be sufficient. If it's not, we can add that functionality later. I'd rather not hold up the bug any more than necessary.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/85246/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/85246/

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 12

•

8 years ago

So I've just posted a patch for review that, I believe, annotates hangs that occur on the main thread during the time between a force paint and the next time something from the content process is composited.

Note that I'm kind of a newbie to Gecko multi-threading. I know enough about it to be dangerous.

I'm hoping that aklotz can review the annotation stuff, and billm can make sure I'm popping and locking in the right places.

(No longer employed by Mozilla) Aaron Klotz

Comment 13

•

8 years ago

r+ for "popping and locking" ;-)

(No longer employed by Mozilla) Aaron Klotz

Comment 14

•

8 years ago

mozreview-review

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83828

This looks mostly good to me, but I don't see a corresponding call to UnregisterAnnotator anywhere here. You'll need to call that on the main thread when HangMonitorChild is destroyed. Failing to do so will cause BHR to crash if a hang occurs after the HangMonitorChild is gone.

r- just because I'd like to take another look at it once that call is in place.

Attachment #8800282 - Flags: review?(aklotz) → review-

Comment hidden (mozreview-request)

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/85246/diff/1-2/

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 16

•

8 years ago

mozreview-review-reply

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83828

Ah, good catch! Thanks!

(No longer employed by Mozilla) Aaron Klotz

Comment 17

•

8 years ago

mozreview-review

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83848

Thanks!

Attachment #8800282 - Flags: review?(aklotz) → review+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 18

•

8 years ago

mozreview-review

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

Is the idea here that we'll sort through the existing hang reports and now we'll be able to tell which ones are for tab switches?

::: dom/ipc/ProcessHangMonitor.cpp:1294
(Diff revision 2)
> +{
> +  MOZ_RELEASE_ASSERT(NS_IsMainThread());
> +  MOZ_RELEASE_ASSERT(XRE_IsContentProcess());
> +
> +  if (HangMonitorChild* child = HangMonitorChild::Get()) {
> +    child->ClearForcePaint();

This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.

Attachment #8800282 - Flags: review?(wmccloskey) → review+

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 19

•

8 years ago

mozreview-review-reply

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

Yep. I'll either use a Spark cluster to generate a list of stacks by frequency, or (maybe more sustainably) work with someone like rvitillo to modify the API endpoint that https://telemetry.mozilla.org/hang/bhr/ reads from so that it can display content process hangs and show / filter by annotations.

> This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.

I worry that flipping the bool at the end of the callback doesn't cast a wide enough net. I want to annotate hangs that occur between the time that the user clicks on a tab, and the time that the tab content displays itself.

Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can get the spinner to display itself (due to big layout calculations) but the hangs are not annotated because (I think) we had successfully exited RecvSetDocShellIsActive.

So I worry that knowing that we exited RecvSetDocShellIsActive isn't enough to say that we want to stop annotating hangs.

Perhaps the naming is wrong - instead of using mForcePaintInProgress, I should call it mTabPaintInProgress?

Is what I'm saying making sense, or am I missing something?

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 20

•

8 years ago

ni? for comment 19

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 21

•

8 years ago

(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #19)
> > This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.
> 
> I worry that flipping the bool at the end of the callback doesn't cast a
> wide enough net. I want to annotate hangs that occur between the time that
> the user clicks on a tab, and the time that the tab content displays itself.
> 
> Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can
> get the spinner to display itself (due to big layout calculations) but the
> hangs are not annotated because (I think) we had successfully exited
> RecvSetDocShellIsActive.

If we're hung in layout, we never should have even started RecvSetDocShellIsActive. It only runs if we're running the event loop or if we're stuck in JS.

We actually paint inside of RecvSetDocShellIsActive. So if you unset the flag at the end of it, we will have already painted.

I'm not sure why you wouldn't be getting the hang annotation in that case, but it shouldn't be related to where you set the flag. I think more investigation is needed.

Flags: needinfo?(wmccloskey)

Comment hidden (mozreview-request)

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/85246/diff/2-3/

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 23

•

8 years ago

mozreview-review-reply

Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

> I worry that flipping the bool at the end of the callback doesn't cast a wide enough net. I want to annotate hangs that occur between the time that the user clicks on a tab, and the time that the tab content displays itself.
> 
> Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can get the spinner to display itself (due to big layout calculations) but the hangs are not annotated because (I think) we had successfully exited RecvSetDocShellIsActive.
> 
> So I worry that knowing that we exited RecvSetDocShellIsActive isn't enough to say that we want to stop annotating hangs.
> 
> Perhaps the naming is wrong - instead of using mForcePaintInProgress, I should call it mTabPaintInProgress?
> 
> Is what I'm saying making sense, or am I missing something?

Ah, I understand what's going wrong. We don't currently support annotating _in progress_ hangs. That's why I wasn't seeing annotations when I expected.

So this patch is going to only annotate hangs that occur between the point where we request a force paint, and when the force paint completes. I suspect the more common case for spinners is that the content process is already hung at the time that force paint is requested. I'll file a follow-up bug to allow us to annotate in-progress hangs.

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

No longer depends on: 1225851, 1280477

Summary: Gather client-side stacks for the content process when the tab switch spinner is shown → Gather client-side stacks for the content process when a hang occurs after paint is forced

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

Blocks: 1310011

Pulsebot

Comment 24

•

8 years ago

Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aeadb583041d
Annotate hangs that occur on the content process main thread during paint forcing. r=aklotz,billm

Carsten Book [:Tomcat]

Comment 25

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/aeadb583041d

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox52: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 52

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

Blocks: 1310255

Virtual_ManPL [:Virtual] 🇵🇱 - (please needinfo? me - so I will see your comment/reply/question/etc.)

Updated

•

8 years ago

Assignee: nobody → mconley

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1310809

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 26

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/b4a6dca904d39041cff05765be2fa25f05f0c752
Backed out changeset aeadb583041d (bug 1303077) because the work in the upcoming bug 1310880 supercedes it.

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 27

•

8 years ago

I backed this out because the work in bug 1310880 effectively does the same thing, but better.

Resolution: FIXED → DUPLICATE