Closed Bug 1679994 Opened 4 years ago Closed 4 years ago

Intermittent GTest/mda PROCESS-CRASH | gtest | application crashed [@ quant_coarse_energy] or jsctypes failures

Categories

(Firefox Build System :: General, defect)

Tracking

(firefox-esr78 unaffected, firefox83 unaffected, firefox84 unaffected, firefox85 fixed)

RESOLVED WORKSFORME
Tracking Status
firefox-esr78 --- unaffected
firefox83 --- unaffected
firefox84 --- unaffected
firefox85 --- fixed

People

(Reporter: aryx, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Crash Data

central-as-early beta simulation

Both central-as-early-beta and central-as-late-beta simulations have GTest and mda tasks failing with [@ quant_coarse_energy] and timeouts.

GTest: https://treeherder.mozilla.org/logviewer?job_id=323151767&repo=try

[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  TEST-START | MediaPipelineTest.TestAudioSendNoMux
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  ExceptionHandler::GenerateDump cloned child 13512
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  [Socket 12861, Main Thread] WARNING: Shutting down Socket process early due to a crash!: file /builds/worker/checkouts/gecko/netwerk/ipc/SocketProcessChild.cpp:159
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  gtest INFO | gtest | process wait complete, returncode=-11
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  mozcrash checking /builds/worker/workspace/build/tests/gtest for minidumps...
[task 2020-12-01T11:11:25.170Z] 11:11:25     INFO -  mozcrash INFO | Copy/paste: /builds/worker/fetches/minidump_stackwalk/minidump_stackwalk /builds/worker/workspace/build/tests/gtest/4339c442-37e1-9de7-3abe-877f64ffc964.dmp /builds/worker/workspace/build/symbols
[task 2020-12-01T11:11:29.042Z] 11:11:29     INFO -  mozcrash INFO | Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/4339c442-37e1-9de7-3abe-877f64ffc964.dmp
[task 2020-12-01T11:11:29.042Z] 11:11:29     INFO -  mozcrash INFO | Saved app info as /builds/worker/workspace/build/blobber_upload_dir/4339c442-37e1-9de7-3abe-877f64ffc964.extra
[task 2020-12-01T11:11:29.042Z] 11:11:29  WARNING -  PROCESS-CRASH | gtest | application crashed [@ quant_coarse_energy]
[task 2020-12-01T11:11:29.042Z] 11:11:29     INFO -  Crash dump filename: /builds/worker/workspace/build/tests/gtest/4339c442-37e1-9de7-3abe-877f64ffc964.dmp
[task 2020-12-01T11:11:29.042Z] 11:11:29     INFO -  Operating system: Linux
[task 2020-12-01T11:11:29.043Z] 11:11:29     INFO -                    0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64
[task 2020-12-01T11:11:29.043Z] 11:11:29     INFO -  CPU: amd64
[task 2020-12-01T11:11:29.043Z] 11:11:29     INFO -       family 23 model 1 stepping 2
[task 2020-12-01T11:11:29.043Z] 11:11:29     INFO -       4 CPUs
[task 2020-12-01T11:11:29.043Z] 11:11:29     INFO -  GPU: UNKNOWN
[task 2020-12-01T11:11:29.044Z] 11:11:29     INFO -  Crash reason:  SIGSEGV /SEGV_ACCERR
[task 2020-12-01T11:11:29.044Z] 11:11:29     INFO -  Crash address: 0x7fd20b527030
[task 2020-12-01T11:11:29.044Z] 11:11:29     INFO -  Process uptime: not available
[task 2020-12-01T11:11:29.044Z] 11:11:29     INFO -  Thread 31 (crashed)
[task 2020-12-01T11:11:29.044Z] 11:11:29     INFO -   0  libxul.so!quant_coarse_energy [quant_bands.c:79f00a53c3480eacf1e849866da172aa662eb65e : 329 + 0x1a]
[task 2020-12-01T11:11:29.045Z] 11:11:29     INFO -      rax = 0x0000000000000000   rdx = 0x00007fd22cd778d1
[task 2020-12-01T11:11:29.045Z] 11:11:29     INFO -      rcx = 0x0000000000000000   rbx = 0x0000000000000080
[task 2020-12-01T11:11:29.045Z] 11:11:29     INFO -      rsi = 0x00007fd20b561170   rdi = 0x00007fd234a97ca0
[task 2020-12-01T11:11:29.045Z] 11:11:29     INFO -      rbp = 0x00007fd20b5582e0   rsp = 0x00007fd20b527030
[task 2020-12-01T11:11:29.045Z] 11:11:29     INFO -       r8 = 0x00007fd20b558178    r9 = 0x0000000000000000
[task 2020-12-01T11:11:29.046Z] 11:11:29     INFO -      r10 = 0x0000000000000000   r11 = 0x0000000000000000
[task 2020-12-01T11:11:29.046Z] 11:11:29     INFO -      r12 = 0x00007fd20b5580e0   r13 = 0x00007fd20b558030
[task 2020-12-01T11:11:29.046Z] 11:11:29     INFO -      r14 = 0x0000000000000000   r15 = 0x0000000000000000
[task 2020-12-01T11:11:29.046Z] 11:11:29     INFO -      rip = 0x00007fd23b6de422
[task 2020-12-01T11:11:29.046Z] 11:11:29     INFO -      Found by: given as instruction pointer in context

mda timeout example: https://treeherder.mozilla.org/logviewer?job_id=323151823&repo=try

Andreas, are these from bug 1666116?

Flags: needinfo?(apehrson)

If anything of mine I'd suspect bug 1605134. I'll try to get a pernosco recording to see what's up.

Please also check the other test failures on Linux debug: wpts and crashtest are webrtc and related to 1605134. The xpcshell X2 and M-1proc(c3) are about ctypes and also only fail on Linux debug: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=79f00a53c3480eacf1e849866da172aa662eb65e

This is neither from bug 1605134 nor bug 1666116 based on an ongoing bisection on Try.

Bisection points to bug 1588710 as the cause of these failures. The M-1proc(c3) also succeeded; I expect the same for xpcshell:

https://treeherder.mozilla.org/jobs?repo=try&revision=50068779aaf8e7af7fcd2fd86484abac28b74baf

Component: Audio/Video: MediaStreamGraph → General
Flags: needinfo?(apehrson) → needinfo?(sledru)
Product: Core → Firefox Build System

Serge, do you think it could be caused by stack clash protection?
Thanks

Flags: needinfo?(sledru) → needinfo?(sguelton)

We probably need to increase the stack size. This is a function name from libopus, that is called from the MTG. If it's called from MockCubeb, then I'd say we're going too deep for the regular stack size. This has happened multiple times in the past (although MockCubeb should be using an std::thread so it should have an OK stack size). If it's being called from the GraphRunner (unclear from the logs), we need to increase its stack size (as we do in cubeb).

Flags: needinfo?(apehrson)

@Sylvestre: this theoretically could be (stack clash protection does inject a few loops)

Flags: needinfo?(sguelton)

Today's central-as-early-beta simulation is not affected by the issue.

CC glandium for awareness, tasks which reproducibly failed yesterday in comment 4 (but could have been served the same artifacts from sccache).

So, for now, I will consider that it isn't caused by stack clash protection then :)

Not caused by, found by 👍.

Flags: needinfo?(apehrson)
Has Regression Range: --- → yes
Summary: Perma Beta Linux debug GTest/mda PROCESS-CRASH | gtest | application crashed [@ quant_coarse_energy] when Gecko 85 merges to Beta on 2020-12-14 → Intermittent GTest/mda PROCESS-CRASH | gtest | application crashed [@ quant_coarse_energy] or jsctypes failures

The stack clash protection enabling in bug 1588710 got reverted for now.

Not tracking for now given the revert.

Crash Signature: [@ ffi_closure_unix64_inner]

This issue seems to have subsided since the failed central-as-beta sim. I'm not sure if it was a build system problem to begin with but I think we can close this out for the time being.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME

It's this alloca: https://searchfox.org/mozilla-central/rev/c7cf087b6e1384608ca3989f042f12f7cabd0a5f/media/libopus/celt/quant_bands.c#329

It looks quite large. I can't see the original requested size in the minidump but at the point of segfault it was still 200KB away from what it wanted:

0:031> .formats @r13-@rsp
Evaluate expression:
  Hex:     00000000`00031000
  Decimal: 200704
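
For readers without a debugger at hand, the same subtraction can be reproduced from the register values in the crash log. This is only a sketch in Python; the variable names and the 4 KiB page size (taken from the `sub rsp,1000h` step in the disassembly) are choices made for illustration.

```python
# Register values copied from the crashed thread in the log.
r13 = 0x00007FD20B558030
rsp = 0x00007FD20B527030

PAGE = 0x1000  # probe step, matching `sub rsp,1000h`

gap = r13 - rsp          # distance between the two registers in bytes
pages = gap // PAGE      # the same distance expressed in 4 KiB probe steps

print(hex(gap), gap, pages)
```

The gap is 0x31000 bytes, i.e. 200704 bytes or 49 probe-sized pages.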

On second thought...

libxul+0x867341d:
00007fd2`3b6de41d 4939e5          cmp     r13,rsp
00007fd2`3b6de420 7c11            jl      libxul+0x8673433 (00007fd2`3b6de433)
00007fd2`3b6de422 48c7042400000000 mov     qword ptr [rsp],0
00007fd2`3b6de42a 4881ec00100000  sub     rsp,1000h
00007fd2`3b6de431 ebea            jmp     libxul+0x867341d (00007fd2`3b6de41d)

0:031> r @rsp, @rbp, @r13
Last set context:
rsp=00007fd20b527030 rbp=00007fd20b5582e0 r13=00007fd20b558030

If r13 is the goal, why is it greater than rsp? We'll never reach it through repeated subs. Did the alloca go negative or something?

Actually, no: this was an alloca(0) meaning the toolchain for this push didn't have the subsequent bugfix.
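
The runaway behaviour can be illustrated with a small model of the five instructions shown in the disassembly. This is a sketch in Python, not toolchain code. The function name `run_probe_loop` and the `LOWEST_WRITABLE` boundary are hypothetical; the boundary is picked so that the model faults at the crash address from the log. Starting with `rsp` equal to `r13` is an assumption that follows from the alloca(0) reading.

```python
PAGE = 0x1000  # probe step, matching `sub rsp,1000h`

def run_probe_loop(goal, rsp, lowest_writable, max_steps=1_000_000):
    """Model the loop: cmp r13,rsp / jl exit / mov [rsp],0 / sub rsp,1000h / jmp.

    Returns (outcome, rsp, probes), where outcome is "exit", "fault" or "limit".
    """
    probes = 0
    for _ in range(max_steps):
        if goal < rsp:                # cmp r13,rsp ; jl -> leave the loop
            return ("exit", rsp, probes)
        if rsp < lowest_writable:     # mov qword ptr [rsp],0 hits unwritable memory
            return ("fault", rsp, probes)
        probes += 1                   # the store succeeded
        rsp -= PAGE                   # sub rsp,1000h ; jmp back to the compare
    return ("limit", rsp, probes)

R13 = 0x00007FD20B558030              # r13 from the crash log
LOWEST_WRITABLE = 0x00007FD20B528000  # hypothetical end of the writable stack

# Assumed alloca(0) case: the goal equals the starting stack pointer.
outcome, final_rsp, probes = run_probe_loop(R13, R13, LOWEST_WRITABLE)
print(outcome, hex(final_rsp), probes)
```

Under these assumptions the exit test `r13 < rsp` is never true: it is false on the first pass because the two are equal, and every later pass only lowers `rsp`. The model makes 49 successful probes and then faults at 0x7fd20b527030, which is the crash address in the log and 0x31000 bytes below r13.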

That makes sense given that this bug was opened on December 1. But this should not be happening anymore and so it's not clear why this bug was the justification for the backout in bug 1588710 comment 49. I will ask on that bug.
