Closed Bug 1625888 Opened 5 years ago Closed 5 years ago

Minidump still not always being written by talos forced shutdowns after timeout (talos crash reporting and windows kill_and_get_minidump busted)

Categories

(Testing :: Talos, defect, P3)

Tracking

(firefox76 fixed)

RESOLVED FIXED
mozilla76

People

(Reporter: Gijs, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

(Whiteboard: dev-prod-2020)

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=294290831&repo=autoland&lineNumber=3887

The log says a minidump is being written, but there's no attachment. I don't really know how to debug this further. The job ran this past Monday, so it definitely post-dates bug 1623917 and bug 1622257.

Geoff, do you know how to investigate this further?

Flags: needinfo?(gbrown)
Depends on: 1622257, 1623917

Looks like this is the underlying reason for the intermittent on bug 1557982?

Blocks: 1557982
Priority: -- → P3

Here's an intentional crash in a variety of tasks:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e13a8b00b434718d8fd0ac706632c0664137224

I guess it just confirms this bug: crash reporting is working for most test suites, but not for Talos, on any platform.

The Talos code looks okay to me. Looking closer...

Depends on: 1626097

In some cases, a browser crash may subsequently cause an exception in the harness before
check_for_crashes is called, effectively bypassing crash reporting. This patch catches
the exception to ensure that check_for_crashes is called regardless of such exceptions.
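
A minimal sketch of the pattern this patch describes, not the patch itself. The names `run_browser` and `run_then_check_for_crashes` are hypothetical, `check_for_crashes` is passed in as a plain callable rather than the real harness function, and letting the exception propagate after the check is an assumption.

```python
def run_then_check_for_crashes(run_browser, check_for_crashes):
    """Run the browser step, then always run the crash check.

    Both arguments are hypothetical zero-argument callables standing in
    for the real harness steps.
    """
    try:
        return run_browser()
    finally:
        # Runs on normal return and when run_browser() raises, so a
        # harness exception cannot skip crash reporting. Any exception
        # still propagates afterwards (an assumption of this sketch).
        check_for_crashes()
```

The point of the design is only that the crash check no longer depends on the browser step returning cleanly.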

Assignee: nobody → gbrown
Status: NEW → ASSIGNED

My patch (comment 3) improves Talos crash reporting in the common case, but does not appear to address the problem reported in comment 0.

Bug 1626097 will provide slightly improved diagnostics if / when comment 0 is reproduced. However, I suspect that will show that check_for_crashes is checking for minidumps and not finding any, despite kill_and_get_minidump being called appropriately. Maybe that's because the process was already partly shut down when killed?

Flags: needinfo?(gbrown)
Keywords: leave-open

I'm also working on a Talos silent error and am trying to get familiar with this framework.
Geoff, one place in mozcrash.kill_and_get_minidump() that looks prone to silent errors is here.

Thanks - that's a good point. Are you going to add a warning there, or shall I?

I'd also like to see an info message like https://searchfox.org/mozilla-central/rev/fa2df28a49883612bd7af4dacd80cdfedcccd2f6/testing/mozbase/mozcrash/mozcrash/mozcrash.py#504 for the OpenProcess branch.

Flags: needinfo?(aionescu)

Let's first see a push to try. I'm not 100% sure until I see the results. Please do that and maybe we'll collaborate further if necessary. Thanks.

Flags: needinfo?(aionescu)

(In reply to Geoff Brown [:gbrown] from comment #4)

My patch (comment 3) improves Talos crash reporting in the common case, but does not appear to address the problem reported in comment 0.

Bug 1626097 will provide slightly improved diagnostics if / when comment 0 is reproduced. However, I suspect that will show that check_for_crashes is checking for minidumps and not finding any, despite kill_and_get_minidump being called appropriately. Maybe that's because the process was already partly shut down when killed?

I tracked down the problem reported in comment 0: On Windows (only), kill_and_get_minidump was consistently failing to create a minidump. That was because the minidump file_name was originally non-unicode:

https://searchfox.org/mozilla-central/rev/4ccefc3181f9d237ef4ca8bd17b4e7c101ddf7b5/testing/mozbase/mozcrash/mozcrash/mozcrash.py#495

and, when running in a python 2 environment, not converted to unicode:

https://searchfox.org/mozilla-central/rev/4ccefc3181f9d237ef4ca8bd17b4e7c101ddf7b5/testing/mozbase/mozcrash/mozcrash/mozcrash.py#527

before being used in a call to CreateFileW, which fails silently when called with an 8-bit string.

Solution: if not isinstance(file_name, string_types): -> if not isinstance(file_name, text_type):
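
Why the one-word change matters can be sketched as follows. On Python 2, `string_types` covers both the 8-bit `str` type and `unicode`, so a byte-string path passes the `isinstance` check and is never converted, while `text_type` covers only `unicode`. The sketch below runs on Python 3 and imitates that with stand-in constants; the helper names and the use of `sys.getfilesystemencoding()` for decoding are assumptions, not the mozcrash code.

```python
import sys

# Stand-ins for how the two six names behave on Python 2, where the
# 8-bit str type (modelled here by bytes) counts as a "string type"
# but not as a "text type". These constants are illustrative only.
PY2_LIKE_STRING_TYPES = (bytes, str)
PY2_LIKE_TEXT_TYPE = str


def _to_text(file_name):
    # Assumed conversion: decode with the filesystem encoding.
    return file_name.decode(sys.getfilesystemencoding())


def path_before_fix(file_name):
    # An 8-bit name is already a "string type", so it is left as-is.
    if not isinstance(file_name, PY2_LIKE_STRING_TYPES):
        file_name = _to_text(file_name)
    return file_name


def path_after_fix(file_name):
    # An 8-bit name is not a "text type", so it gets converted.
    if not isinstance(file_name, PY2_LIKE_TEXT_TYPE):
        file_name = _to_text(file_name)
    return file_name
```

With the first check an 8-bit name reaches the wide-character API unchanged; with the second it is converted to text first.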

Keywords: leave-open
Summary: Minidump still not always being written by talos forced shutdowns after timeout → Minidump still not always being written by talos forced shutdowns after timeout (talos crash reporting and windows kill_and_get_minidump busted)
Whiteboard: dev-prod-2020

Thanks so much for chasing this, :gbrown ! Great stuff. I'm wondering, could we add some kind of test for the generic Windows mozcrash side of things where we crash deliberately, and check that we get a minidump and stack info, to ensure we can't accidentally break this again in future?

Flags: needinfo?(gbrown)

I think there would be value in such a test, and I have considered one in the past, but I don't know how to implement an effective and robust test without a significant time investment.

Flags: needinfo?(gbrown)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/36010ca149f2
In Talos, check_for_crashes even when exception raised after browser run; r=perftest-reviewers,AlexandruIonescu
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
Blocks: 1629005