Closed
Bug 1472889
Opened 6 years ago
Closed 6 years ago
Intermittent INFO - PROCESS-CRASH | automation.py | application crashed [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&)]
Categories
(Core :: IPC, defect)
Core
IPC
Tracking
()
RESOLVED
INCOMPLETE
mozilla63
Tracking | Status | |
---|---|---|
firefox63 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: crash, intermittent-failure, regression, Whiteboard: [retriggered][stockwell needswork:owner])
Crash Data
Attachments
(1 file, 1 obsolete file)
Filed by: toros [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=186067456&repo=mozilla-central
https://queue.taskcluster.net/v1/task/f3aQ73HVTsOy9zLh5IJGzQ/runs/0/artifacts/public/logs/live_backing.log
[task 2018-07-02T23:09:31.379Z] INFO - MochitestServer : launching [u'/builds/worker/workspace/build/tests/bin/xpcshell', '-g', '/builds/worker/workspace/build/application/firefox', '-f', '/builds/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpFD87Ws.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/builds/worker/workspace/build/tests/mochitest/server.js']
[task 2018-07-02T23:09:31.380Z] INFO - runtests.py | Server pid: 4586
[task 2018-07-02T23:09:31.399Z] INFO - runtests.py | Websocket server pid: 4589
[task 2018-07-02T23:09:31.436Z] INFO - runtests.py | SSL tunnel pid: 4592
[task 2018-07-02T23:09:31.585Z] INFO - runtests.py | Running with e10s: True
[task 2018-07-02T23:09:31.587Z] INFO - runtests.py | Running tests: start.
[task 2018-07-02T23:09:31.588Z] INFO -
[task 2018-07-02T23:09:31.605Z] INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpFD87Ws.mozrunner
[task 2018-07-02T23:09:31.629Z] INFO - runtests.py | Application pid: 4611
[task 2018-07-02T23:09:31.630Z] INFO - TEST-INFO | started process GECKO(4611)
[task 2018-07-02T23:09:32.104Z] INFO - GECKO(4611) | ExceptionHandler::GenerateDump cloned child 4638
[task 2018-07-02T23:09:32.104Z] INFO - GECKO(4611) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2018-07-02T23:09:32.104Z] INFO - GECKO(4611) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2018-07-02T23:12:31.634Z] INFO - runtests.py | Waiting for browser...
[task 2018-07-02T23:12:31.635Z] INFO - TEST-INFO | Main app process: exit 11
[task 2018-07-02T23:12:31.636Z] INFO - Buffered messages finished
[task 2018-07-02T23:12:31.637Z] ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 11
[task 2018-07-02T23:12:31.638Z] INFO - runtests.py | Application ran for: 0:03:00.030697
[task 2018-07-02T23:12:31.638Z] INFO - zombiecheck | Reading PID log: /tmp/tmpw3bHhspidlog
[task 2018-07-02T23:12:31.639Z] INFO - ==> process 4611 launched child process 4633
[task 2018-07-02T23:12:31.640Z] INFO - zombiecheck | Checking for orphan process with PID: 4633
[task 2018-07-02T23:12:31.641Z] INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/Yex9d7yISmKY4DrqIAfN4A/artifacts/public/build/target.crashreporter-symbols.zip
[task 2018-07-02T23:12:37.456Z] INFO - mozcrash Copy/paste: /usr/local/bin/linux64-minidump_stackwalk /tmp/tmpFD87Ws.mozrunner/minidumps/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp /tmp/tmp9aOSGp
[task 2018-07-02T23:12:44.805Z] INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp
[task 2018-07-02T23:12:44.806Z] INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/16bda064-7e66-b27f-7a49-2526ecc5b607.extra
[task 2018-07-02T23:12:44.869Z] INFO - PROCESS-CRASH | automation.py | application crashed [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&)]
[task 2018-07-02T23:12:44.869Z] INFO - Crash dump filename: /tmp/tmpFD87Ws.mozrunner/minidumps/16bda064-7e66-b27f-7a49-2526ecc5b607.dmp
[task 2018-07-02T23:12:44.869Z] INFO - Operating system: Linux
[task 2018-07-02T23:12:44.869Z] INFO - 0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64
[task 2018-07-02T23:12:44.870Z] INFO - CPU: amd64
[task 2018-07-02T23:12:44.870Z] INFO - family 6 model 62 stepping 4
[task 2018-07-02T23:12:44.871Z] INFO - 2 CPUs
[task 2018-07-02T23:12:44.872Z] INFO -
[task 2018-07-02T23:12:44.872Z] INFO - GPU: UNKNOWN
[task 2018-07-02T23:12:44.873Z] INFO -
[task 2018-07-02T23:12:44.874Z] INFO - Crash reason: SIGSEGV
[task 2018-07-02T23:12:44.875Z] INFO - Crash address: 0x0
[task 2018-07-02T23:12:44.875Z] INFO - Process uptime: not available
[task 2018-07-02T23:12:44.876Z] INFO -
[task 2018-07-02T23:12:44.877Z] INFO - Thread 0 (crashed)
[task 2018-07-02T23:12:44.878Z] INFO - 0 libxul.so!mozilla::dom::ipc::SharedStringMap::SharedStringMap(mozilla::dom::ipc::SharedStringMapBuilder&&) [SharedStringMap.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 46 + 0x0]
[task 2018-07-02T23:12:44.878Z] INFO - rax = 0x0000000000635c80 rdx = 0x00007f9681b1cab8
[task 2018-07-02T23:12:44.879Z] INFO - rcx = 0x0000000000000003 rbx = 0x00007f9670f63820
[task 2018-07-02T23:12:44.880Z] INFO - rsi = 0x0000000000000000 rdi = 0x00007f968dd00020
[task 2018-07-02T23:12:44.880Z] INFO - rbp = 0x00007ffc1a681c90 rsp = 0x00007ffc1a681c80
[task 2018-07-02T23:12:44.881Z] INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000000
[task 2018-07-02T23:12:44.881Z] INFO - r10 = 0x001b1554477eec6f r11 = 0x0000000000000246
[task 2018-07-02T23:12:44.882Z] INFO - r12 = 0x00007f9670f63820 r13 = 0x00007ffc1a681d20
[task 2018-07-02T23:12:44.883Z] INFO - r14 = 0x00007ffc1a681cf0 r15 = 0x0000000000000000
[task 2018-07-02T23:12:44.884Z] INFO - rip = 0x00007f967eb7e3e9
[task 2018-07-02T23:12:44.884Z] INFO - Found by: given as instruction pointer in context
[task 2018-07-02T23:12:44.886Z] INFO - 1 libxul.so!SharedStringBundle::LoadProperties [nsStringBundle.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 561 + 0x8]
[task 2018-07-02T23:12:44.886Z] INFO - rbx = 0x00007f9672e380b0 rbp = 0x00007ffc1a681dc0
[task 2018-07-02T23:12:44.887Z] INFO - rsp = 0x00007ffc1a681ca0 r12 = 0x00007f9670f63820
[task 2018-07-02T23:12:44.888Z] INFO - r13 = 0x00007ffc1a681d20 r14 = 0x00007ffc1a681cf0
[task 2018-07-02T23:12:44.889Z] INFO - r15 = 0x0000000000000000 rip = 0x00007f967ce2a13c
[task 2018-07-02T23:12:44.890Z] INFO - Found by: call frame info
[task 2018-07-02T23:12:44.890Z] INFO - 2 libxul.so!mozilla::detail::RunnableMethodImpl<nsStringBundleBase*, nsresult (nsStringBundleBase::*)(), true, (mozilla::RunnableKind)2>::Run() [nsThreadUtils.h:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 1166 + 0x12]
[task 2018-07-02T23:12:44.892Z] INFO - rbx = 0x00007f968dce9100 rbp = 0x00007ffc1a681dd0
[task 2018-07-02T23:12:44.892Z] INFO - rsp = 0x00007ffc1a681dd0 r12 = 0x00007ffc1a681e90
[task 2018-07-02T23:12:44.893Z] INFO - r13 = 0x00007f9672e38100 r14 = 0x00007f967746b0a8
[task 2018-07-02T23:12:44.894Z] INFO - r15 = 0x00007ffc1a68233f rip = 0x00007f967ce26bb7
[task 2018-07-02T23:12:44.895Z] INFO - Found by: call frame info
[task 2018-07-02T23:12:44.896Z] INFO - 3 libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 1051 + 0xa]
[task 2018-07-02T23:12:44.896Z] INFO - rbx = 0x00007f968dce9100 rbp = 0x00007ffc1a682310
[task 2018-07-02T23:12:44.896Z] INFO - rsp = 0x00007ffc1a681de0 r12 = 0x00007ffc1a681e90
[task 2018-07-02T23:12:44.896Z] INFO - r13 = 0x00007f9672e38100 r14 = 0x00007f967746b0a8
[task 2018-07-02T23:12:44.897Z] INFO - r15 = 0x00007ffc1a68233f rip = 0x00007f967cdef05e
[task 2018-07-02T23:12:44.898Z] INFO - Found by: call frame info
[task 2018-07-02T23:12:44.898Z] INFO - 4 libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:7d20e7fae1039720f92db1a3a72bc2c7424b5f98 : 519 + 0xe]
[task 2018-07-02T23:12:44.899Z] INFO - rbx = 0x00007f9670f890a0 rbp = 0x00007ffc1a682340
[task 2018-07-02T23:12:44.900Z] INFO - rsp = 0x00007ffc1a682320 r12 = 0x0000000000000000
[task 2018-07-02T23:12:44.900Z] INFO - r13 = 0x0000000000000000 r14 = 0x00007f968dce9100
[task 2018-07-02T23:12:44.901Z] INFO - r15 = 0x00007ffc1a68235f rip = 0x00007f967cde6c0b
[task 2018-07-02T23:12:44.902Z] INFO - Found by: call frame info
[task 2018-07-02T23:12:4
Comment 1•6 years ago
|
||
I think this is either a mochitest or marionette bug, though I might be wrong.
Component: Applications: MozharnessCore → Mochitest
Product: Release Engineering → Testing
QA Contact: jlund
Updated•6 years ago
|
Component: Mochitest → IPC
Product: Testing → Core
Comment 2•6 years ago
|
||
This might be regression by bug 1470365 or bug 1470771. Adding kris.
Comment 3•6 years ago
|
||
This is unfortunately somewhat expected, for the moment. As I understand it, the Docker containers that we use on automation tend to limit the size of the tmpfs that backs /dev/shm, which means that our shared memory allocations fail even when the system has plenty of memory available to back them.
At some point, I'm going to migrate this code to use memory allocation APIs that don't rely on /dev/shm to avoid this problem.
Comment 4•6 years ago
|
||
So you mean it's related to bug 1245239? Also CC'ing Jed.
Flags: needinfo?(jld)
Updated•6 years ago
|
Flags: needinfo?(jld)
Comment 5•6 years ago
|
||
(In reply to Treeherder Bug Filer from comment #0)
> [task 2018-07-02T23:12:44.874Z] INFO - Crash reason: SIGSEGV
This is not SIGBUS; it's an assertion failure. Are we sure that the failure from SharedStringMapBuilder::Finalize is ENOSPC? I don't see anything in the log to either confirm or deny that.
Also, the Docker containers we use for CI generally haven't been a problem for this; the bug reports we've gotten have been other people using Docker's default of 64MB, which winds up not being enough for our graphics stack. There's a lot of indirection in our CI so I don't know what the situation is there, but I'd be surprised if string bundles caused space exhaustion but all the cross-process textures didn't.
Updated•6 years ago
|
Keywords: regression
Comment 6•6 years ago
|
||
Did some retriggers and backfill and this failure originated from this push:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=bace7eb6b60ef75660c20a883e63e2422afcdad0&filter-searchStr=84bef5e1fbb573035d1bc42d9a3079b604693ee9
Whiteboard: [retriggered]
Comment 7•6 years ago
|
||
This is unfortunately somewhat expected, for the moment. As I understand it, the Docker containers that we use on automation tend to limit the size of the tmpfs that backs /dev/shm, which means that our shared memory allocations fail even when the system has plenty of memory available to back them.
At some point, I'm going to migrate this code to use memory allocation APIs that don't rely on /dev/shm to avoid this problem.
Depends on: 1440203
Comment hidden (Intermittent Failures Robot) |
Comment 9•6 years ago
|
||
(In reply to Kris Maglione [:kmag] from comment #7)
> This is unfortunately somewhat expected, for the moment. As I understand it,
> the Docker containers that we use on automation tend to limit the size of
> the tmpfs that backs /dev/shm, which means that our shared memory
> allocations fail even when the system has plenty of memory available to back
> them.
>
> At some point, I'm going to migrate this code to use memory allocation APIs
> that don't rely on /dev/shm to avoid this problem.
I'm not convinced that this is what's happening; see comment #5. (I'm also not completely convinced that it's *not* what's happening, but there doesn't seem to be concrete evidence either way?)
Comment 10•6 years ago
|
||
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9)
> I'm not convinced that this is what's happening; see comment #5. (I'm also
> not completely convinced that it's *not* what's happening, but there doesn't
> seem to be concrete evidence either way?)
It's SIGSEGV rather than SIGBUS because we assert that the shmem creation was successful. Failing to allocate the shmem files is the only way to hit that assertion, so the only other possibility I can think of is OOM.
And since this is only happening on Linux so far, and the patch should actually significantly decrease our memory usage overall, that doesn't seem super likely. I suppose it's possible if our overall rate of OOM crashes is high on Linux automation, though.
Updated•6 years ago
|
Comment 11•6 years ago
|
||
From the failing tests resource-usage.json files, it doesn't look like we ever get above 30% of available RAM. Unfortunately, system-info.log doesn't seem to show tmpfs mounts, so I don't have much to go on...
Comment 12•6 years ago
|
||
(Er, ignore comment 7. Apparently session store saved my original comment and re-posted it when I added the dependency.)
Comment hidden (mozreview-request) |
Comment 14•6 years ago
|
||
mozreview-review |
Comment on attachment 8990893 [details]
Bug 1472889: Include all partition types in system-info.log.
https://reviewboard.mozilla.org/r/255892/#review262836
Attachment #8990893 -
Flags: review?(gbrown) → review+
Comment 15•6 years ago
|
||
OK, apparently this is something slightly more prosaic.
Once every few hundred thousand calls, we fail to stat the read-only copy of the map file we just opened with EBADF. That... is very odd.
The open definitely succeeds, so we should be guaranteed a valid file descriptor. The only plausible explanation I can think of right now is that some background thread closes the wrong file, and sometimes does it at exactly the wrong time.
No longer depends on: 1440203
Comment hidden (mozreview-request) |
Updated•6 years ago
|
Attachment #8990893 -
Attachment is obsolete: true
Comment 18•6 years ago
|
||
mozreview-review |
Comment on attachment 8991501 [details]
Bug 1472889: Work around sporadic automation failures on Linux.
https://reviewboard.mozilla.org/r/256394/#review263536
::: dom/ipc/MemMapSnapshot.cpp:124
(Diff revision 1)
> getter_AddRefs(file)));
>
> - return aMem.init(file);
> + auto result = aMem.init(file);
> +#ifdef XP_LINUX
> + // On Linux automation runs, every few hundred thousand calls, our attempt to
> + // stat the file that we just successfully opened fails with EBADF (bug
Should we fix this at that point where we stat instead?
Attachment #8991501 -
Flags: review?(erahm) → review+
Comment 19•6 years ago
|
||
mozreview-review-reply |
Comment on attachment 8991501 [details]
Bug 1472889: Work around sporadic automation failures on Linux.
https://reviewboard.mozilla.org/r/256394/#review263536
> Should we fix this at that point where we stat instead?
We have to re-open the file. The opening happens in init(nsIFile), the stat happens in initInternal(), so it would have to happen in init(). And it would get kind of ugly...
Comment 20•6 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/a7185bc9281c734eda902077ae35fc585fbc5223
Bug 1472889: Work around sporadic automation failures on Linux. r=erahm
Comment 21•6 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox63:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Comment 22•6 years ago
|
||
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=188023941&repo=try&lineNumber=12388
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot) |
Comment 25•6 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•