Closed Bug 1280484 Opened 8 years ago Closed 8 years ago

Augment the crash ping with stacks obtained on the client

Categories

(Toolkit :: Crash Reporting, defect)
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

Details

(Whiteboard: [fce-active-legacy])

Attachments

(1 file, 5 obsolete files)

+++ This bug was initially created as a clone of Bug #1280477 +++

Once bug 1280477 is implemented we should modify the crash manager to correctly process the new version of the crash.main event file and add the embedded stack trace to the crash ping. We should also bump the version of the crash ping IIRC.
Whiteboard: [fce-active]
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Attached patch [PATCH WIP] Add stack traces to the crash ping (obsolete) (deleted) — Splinter Review
This patch takes the stack traces read from the crash event file in bug 1280477 and puts them in the crash ping. I've not bumped the crash ping version number because this is an optional addition to the payload so I think that there's no need for it.

Chris, since I'm not familiar with how telemetry works yet, does this look correct to you? BTW do you know if we need server-side support first or if the extra field will just be ignored in case we land this first?
Attachment #8769179 - Flags: feedback?(chutten)
Comment on attachment 8769179 [details] [diff] [review]
[PATCH WIP] Add stack traces to the crash ping

Review of attachment 8769179 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

ddurst and I asked infra about adding a new field. Go ahead and add away. It should be stored and available for analysis immediately with no change.
Attachment #8769179 - Flags: feedback?(chutten) → feedback+
Excellent!
We should also add the crash ID as per offline discussions to simplify correlating with the data stored in Socorro.
Pending data review. Including the crash id provides us with a path to correlate possibly-sensitive information from the crashdump with information transmitted by Telemetry. We may not want to walk that road.

Maybe we need a crash-specific client_id for crashpings. Or maybe we can skip client_id altogether, as crash ID is likely more important.

Anyhoo. Yes: we probably will want the facility to attach the crash ID to the crash ping. We may not use it immediately.
(In reply to Chris H-C :chutten from comment #5)
> Maybe we need a crash-specific client_id for crashpings. Or maybe we can
> skip client_id altogether, as crash ID is likely more important.

Indeed.
Attached patch [PATCH] Add stack traces to the crash ping (obsolete) (deleted) — Splinter Review
This is just the previous patch plus tests. I haven't added the crash ID to the ping as per offline discussions. It's ready for review because it's so simple but it requires all the dependent bugs to be fixed in order to work.
Attachment #8769179 - Attachment is obsolete: true
Attachment #8777338 - Flags: review?(benjamin)
On the contrary, it's very important that we be able to look at this by user with the clientID. And I don't think in general we need or want to correlate this with crash-stats, at this point anyway.
Attached patch [PATCH] Add stack traces to the crash ping (obsolete) (deleted) — Splinter Review
Patch had bit-rotted, refreshed to apply to the current m-c tree.
Attachment #8777338 - Attachment is obsolete: true
Attachment #8777338 - Flags: review?(benjamin)
Attachment #8779180 - Flags: review?(benjamin)
Comment on attachment 8779180 [details] [diff] [review]
[PATCH] Add stack traces to the crash ping

Unless it's in a different patch, this doesn't update the crash ping docs. Please update https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html with detailed explanation of what this field will contain.
Attachment #8779180 - Flags: review?(benjamin) → review-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #10)
> Unless it's in a different patch, this doesn't update the crash ping docs.
> Please update
> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/
> telemetry/data/crash-ping.html with detailed explanation of what this field
> will contain.

You're right, it doesn't. Thanks for pointing it out.
Attached patch [PATCH v2] Add stack traces to the crash ping (obsolete) (deleted) — Splinter Review
Updated patch with full documentation for the new field added. While I was at it I've also documented the rest of the metadata that ends up in the crash ping; I've tried to be as exhaustive as possible.
Attachment #8779180 - Attachment is obsolete: true
Attachment #8781284 - Flags: review?(benjamin)
Comment on attachment 8781284 [details] [diff] [review]
[PATCH v2] Add stack traces to the crash ping

>diff --git a/toolkit/components/telemetry/docs/data/crash-ping.rst b/toolkit/components/telemetry/docs/data/crash-ping.rst

>+Stack Traces
>+------------
>+
>+The crash ping may contain a ``stackTraces`` field which has been populated with
>+stack traces for all threads in the crashed process. The format of this field
>+is similar to the one used by Socorro for representing a crash minus the PII
>+and redundant data.

Please remove the phrase starting with "minus the...". PII is a specific legal term which doesn't strictly apply here. What may be better, though, is a description of the actual privacy properties of this data: that it contains only execution traces and we're confident that it won't expose data from the application.

>+
>+.. code-block:: js
>+
>+    {
>+      status: <string>, // Status of the analysis, "OK" or an error message
>+      crash_info: { // Basic crash information
>+        type: <string>, // Type of crash, SIGSEGV, assertion, etc...
>+        address: <ip>, // IP address of the crash
>+        crashing_thread: <index> // Index in the thread array below
>+      },
>+      main_module: <index>, // Index of the main module in the array below

Does "main module" mean the executable?

>+      modules: [{
>+        base_addr: <addr>, // Base address of the module
>+        end_addr: <addr>, // End address of the module
>+        code_id: <string>, // ID of this module
>+        debug_file: <string>, // Name of the file associated with the module
>+        debug_id: <string>, // Unique ID or hash of the module
>+        filename: <string>, // File name
>+        version: <string>, // Library/executable version

I don't think I know what code_id is. And it might be useful to link this bit about debug_file and debug_id to breakpad or Microsoft documentation about debug IDs.

>+      },
>+      ... // List of modules

Will this list be in any particular order?

>+      ],
>+      thread_count: <count>, // Number of threads
>+      threads: [{ // Stack traces for every thread
>+      frame_count: <count>, // Number of frames in this stack trace

Why do we have both a thread_count and a separate array of threads? Will the threads.length sometimes be different from thread_count? Similarly for frame_count and frames.length. Let's avoid duplication if at all possible.

>+        frames: [{
>+          frame: <index>, // Index of this frame, 0 is the topmost

Why include this, since it's already an ordered list of frames?

>+          module: <string>, // Module name this frame belongs to

It seems like string encoding would make this an inefficient storage format. Also, it's possible (though unlikely) to have multiple versions of the same DLL loaded. Could this be an index into the module array, instead of a name?

>+          offset: <ip>, // Program counter

What format is <ip>? hex-string or number?

>+Notes
>+~~~~~
>+
>+Memory addresses are always stored as strings in hexadecimal format
>+(e.g. "0x4000"). They can be made of up to 16 characters for 64-bit addresses.

Maybe this answers the <ip> question above, except that is an offset not exactly an address.

One of the useful outputs of the breakpad stackwalk is a marking when we're missing symbols for a particular frame. That way we stop signature generation at the point the symbols are missing. Will that information show up anywhere in this format?
Flags: needinfo?(gsvelto)
Attachment #8781284 - Flags: review?(benjamin) → review-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> I don't think I know what code_id is. And it might be useful to link this
> bit about debug_file and debug_id to breakpad or Microsoft documentation
> about debug IDs.

code_id is the timestamp from the PE headers in hex (%08X) with the size of the file in bytes (also from the PE headers) in hex (%X) appended:
https://chromium.googlesource.com/breakpad/breakpad/+/master/src/common/windows/pdb_source_line_writer.cc#1312
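The formatting rule described here can be sketched in a couple of lines of Python. This is only an illustration of the described %08X + %X concatenation; the helper name and sample values are hypothetical, not taken from the Breakpad source.

```python
# Sketch of the code_id construction described above: the PE TimeDateStamp
# formatted as %08X, followed by the size from the PE headers formatted as %X.
# Function name and example values are illustrative only.
def pe_code_id(time_date_stamp: int, size_of_image: int) -> str:
    return f"{time_date_stamp:08X}{size_of_image:X}"

# e.g. a timestamp of 0x5F0CCAE1 and a size of 0x21000:
print(pe_code_id(0x5F0CCAE1, 0x21000))  # 5F0CCAE121000
```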

> One of the useful outputs of the breakpad stackwalk is a marking when we're
> missing symbols for a particular frame. That way we stop signature
> generation at the point the symbols are missing. Will that information show
> up anywhere in this format?

I don't think we actually use that for signature generation currently, but it is useful in general.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> Please remove the phrase starting with "minus the...". PII is a specific
> legal term which doesn't strictly apply here. What may be better, though, is
> a description of the actual privacy properties of this data: that is
> contains only execution traces and we're confident that it won't expose data
> from the application.

OK, I'll change that.

> Does "main module" mean the executable?

Yes, I'll specify that.

> I don't think I know what code_id is. And it might be useful to link this
> bit about debug_file and debug_id to breakpad or Microsoft documentation
> about debug IDs.

I'll add the link Ted mentioned in comment 14 regarding the code id. It seems that this is relevant only to Windows since on Linux and MacOS X the dumps seem to always use the "id" string making it redundant.

> Will this list be in any particular order?

Not that I know of.

> Why do we have both a thread_count and a separate array of threads? Will the
> threads.length sometimes be different from thread_count? Similarly for
> frame_count and frames.length. Let's avoid duplication if at all possible.
>
> [..]
>
> >+        frames: [{
> >+          frame: <index>, // Index of this frame, 0 is the topmost
> 
> Why include this, since it's already an ordered list of frames?

I'll remove both the redundant length and indexes. I picked them up from the code in Socorro that does minidump processing since I was trying to be as close as possible to its format.

> >+          module: <string>, // Module name this frame belongs to
> 
> It seems like string encoding would make this an inefficient storage format.
> Also, it's possible (though unlikely) to have multiple versions of the same
> DLL loaded. Could this be an index into the module array, instead of a name?

Yes, it's possible and also preferable as you point out. This would deviate from how Socorro processes dumps but I think it's OK as long as it's documented.

> >+          offset: <ip>, // Program counter
> 
> What format is <ip>? hex-string or number?

Hex string, I'll make it clearer.
> >+Notes
> >+~~~~~
> >+
> >+Memory addresses are always stored as strings in hexadecimal format
> >+(e.g. "0x4000"). They can be made of up to 16 characters for 64-bit addresses.
> 
> Maybe this answers the <ip> question above, except that is an offset not
> exactly an address.

Yeah, in that particular case, in spite of the name, it's always an address and not an offset. If the frame points to a module it can be turned into an offset (the module_offset field in Socorro's JSON dumps). I didn't bother computing the relative offset since it can be done on the server side and it doesn't need special-casing in the client code. It can be added though if it's preferable; it's trivial.
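The server-side math alluded to here really is trivial; a hedged sketch (hypothetical helper name, hex-string fields as in the documented format) might look like:

```python
# Illustrative sketch of the server-side computation described above:
# deriving a Socorro-style module_offset from the absolute program counter
# and the module's base address, both hex strings per the documented format.
def module_offset(ip_hex: str, base_addr_hex: str) -> str:
    return hex(int(ip_hex, 16) - int(base_addr_hex, 16))
```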

> One of the useful outputs of the breakpad stackwalk is a marking when we're
> missing symbols for a particular frame. That way we stop signature
> generation at the point the symbols are missing. Will that information show
> up anywhere in this format?

In this case that marking (missing_symbols) is always true because we're not symbolicating; that's why I've omitted it. We can always add it back when/if we start symbolicating on the client. BTW since that came up in discussions before I'll open a bug for it.
Flags: needinfo?(gsvelto)
Attached patch [PATCH v3] Add stack traces to the crash ping (obsolete) (deleted) — Splinter Review
Updated patch with review comments addressed. The patch holding the gecko changes that are reflected in the documentation here is attachment 8782634 [details] [diff] [review] in bug 1280477.
Attachment #8781284 - Attachment is obsolete: true
Attachment #8782645 - Flags: review?(benjamin)
Comment on attachment 8782645 [details] [diff] [review]
[PATCH v3] Add stack traces to the crash ping

>+        address: <ip>, // IP address of the crash, hex format

I'm sorry I didn't catch this the first time. In general, for bad-read or bad-write crashes, this address is not the instruction pointer: it's the address which failed to read/write properly. Assuming that's true, please clarify these docs.

OTOH, if this is in fact always the IP, then we should separately include the crash address, and name both of them with clarifying names.

>+      main_module: <index>, // Index of Firefox' executable in the module list

nit, remove apostrophe

>+      modules: [{
>+        base_addr: <addr>, // Base address of the module, hex format
>+        end_addr: <addr>, // End address of the module, hex format
>+        code_id: <string>, // Unique ID of this module, see the notes below
>+        debug_file: <string>, // Name of the file holding the debug information
>+        debug_id: <string>, // ID or hash of the debug information file
>+        filename: <string>, // File name
>+        version: <string>, // Library/executable version
>+      },
>+      ... // List of modules ordered by base memory address
>+      ],
>+      threads: [{ // Stack traces for every thread
>+        frames: [{
>+          module: <index>, // Index of the module this frame belongs to
>+          offset: <ip>, // Program counter, hex format

If this is the actual IP, we shouldn't name it "offset". The name should match: I'm ok with recording either <ip> or <offset> (it's just math to compute the other), but let's not confuse them.

>+description for further information. This filed is populated only on Windows.

spelling "field"

r=me + data-review=me with nits addressed
Attachment #8782645 - Flags: review?(benjamin) → review+
re: code_id, it is unused on non-Windows currently, but if we update our Breakpad snapshot we'll pick up my patch that fills out code_id with the full Build ID on Linux. Currently the Linux debug_id is the first 16 bytes of the Build ID stuffed into a GUID struct and formatted as such.
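The GUID packing Ted describes can be sketched as follows. This is a hypothetical illustration based on the behavior described in the comment (first 16 bytes of the Build ID read as a GUID struct, whose first three fields are little-endian, with a trailing age digit appended), not the actual Breakpad code.

```python
import struct

# Hypothetical sketch of turning an ELF Build ID into a Breakpad-style
# debug_id: the first 16 bytes are treated as a GUID whose first three
# fields are little-endian, the rest is kept as-is, and a trailing "0"
# age digit is appended. Illustration only, not the Breakpad implementation.
def build_id_to_debug_id(build_id_hex: str) -> str:
    data = bytes.fromhex(build_id_hex)[:16].ljust(16, b"\x00")
    d1, d2, d3 = struct.unpack("<IHH", data[:8])
    return f"{d1:08X}{d2:04X}{d3:04X}{data[8:].hex().upper()}0"
```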
(In reply to Benjamin Smedberg [:bsmedberg] from comment #17)
> I'm sorry I didn't catch this the first time. In general, for bad-read or
> bad-write crashes, this address is not the instruction pointer: it's the
> address which failed to read/write properly. Assuming that's true, please
> clarify these docs.
> OTOH, if this is fact always the IP, then we should separately include the
> crash address, and name both of them with clarifying names.

I've dug a little further in breakpad's code and I've realized that the meaning of this field depends on the crash reason so I'll specify it in the docs. For example, for a segmentation fault it's the address that caused the fault, for an invalid instruction it's the instruction address, for an assertion it's the assertion's address, etc...

> >+      modules: [{
> >+        base_addr: <addr>, // Base address of the module, hex format
> >+        end_addr: <addr>, // End address of the module, hex format
> >+        code_id: <string>, // Unique ID of this module, see the notes below
> >+        debug_file: <string>, // Name of the file holding the debug information
> >+        debug_id: <string>, // ID or hash of the debug information file
> >+        filename: <string>, // File name
> >+        version: <string>, // Library/executable version
> >+      },
> >+      ... // List of modules ordered by base memory address
> >+      ],
> >+      threads: [{ // Stack traces for every thread
> >+        frames: [{
> >+          module: <index>, // Index of the module this frame belongs to
> >+          offset: <ip>, // Program counter, hex format
> 
> If this is the actual IP, we shouldn't name it "offset". The name should
> match: I'm ok with recording either <ip> or <offset> (it's just math to
> compute the other), but let's not confuse them.

OK, I'll rename it to <ip> then which is clearer. Socorro uses <offset> to mean the raw IP and <module_offset> to mean the offset within a module which makes the former a little confusing.

Updated patch coming soon.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #18)
> re: code_id, it is unused on non-Windows currently, but if we update our
> Breakpad snapshot we'll pick up my patch that fills out code_id with the
> full Build ID on Linux. Currently the Linux debug_id is the first 16 bytes
> of the Build ID stuffed into a GUID struct and formatted as such.

Excellent, so I'll keep that field in the crash ping on all architectures.
Updated patch with the review comments addressed, the crash ID added to the ping, tests revised to accommodate this, and a more detailed description of the crash reason and address fields. Carrying over the r=bsmedberg flag.
Attachment #8782645 - Attachment is obsolete: true
Attachment #8789354 - Flags: review+
I'm about to land this change finally. This will augment the crash ping with stacks extracted from the minidump on all nightly builds. This will also be enabled in aurora once it hits that tree. The tool that generates the stacks is not built on beta and release so the crash ping will be unaffected there. The amount of data added to the ping varies with the size of the module list, the number of threads running at crash time and the depth of the stacks. In general we're looking at between 40 and 120 KiB of extra JSON data. It should compress very well though.

One final try run before landing:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d5664004ff8
... and one more because the previous one had a linting error (I cancelled it entirely).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ebe8513b59f
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6594e386f018
Add stack traces to the crash ping r=bsmedberg
https://hg.mozilla.org/mozilla-central/rev/6594e386f018
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Whiteboard: [fce-active] → [fce-active-legacy]