Closed
Bug 1051633
Opened 10 years ago
Closed 10 years ago
If we print to logcat too early the homescreen fails to start
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
()
People
(Reporter: khuey, Assigned: cyu)
References
(Blocks 1 open bug)
Details
(Whiteboard: [caf priority: p2][CR 726359])
Attachments
(7 files, 7 obsolete files)
(deleted),
text/x-log
|
Details | |
(deleted),
application/x-compressed-tar
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
khuey
:
review+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
bajaj
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review |
I built flame-kk using the provided base image from b2g-internal and ./config.sh flame-kk. Note that flame-kk does not have a 2.0 configuration, so I had to build 2.1. When I try to boot the device after flashing 2.1 the homescreen always fails to start. The vendor image comes with some version of 1.4 which works fine.
I have made the following observations:
1. Setting dom.ipc.prelaunchProcess.enabled to false makes the homescreen start.
2. The main thread of the Nuwa process is blocked in a different place.
On my Nexus 5 (working)
(gdb) bt
#0 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
#1 0xb6ec5ec8 in __pthread_cond_timedwait_relative (cond=cond@entry=0xbe83d874, mutex=mutex@entry=0xbe83d864, reltime=0xf0)
at bionic/libc/bionic/pthread.c:1117
#2 0xb6ec5f28 in __pthread_cond_timedwait (cond=0xbe83d874, mutex=0xbe83d864, abstime=<optimized out>,
clock=<optimized out>) at bionic/libc/bionic/pthread.c:1140
#3 0xb6f42e68 in __wrap_pthread_cond_wait (cond=0xbe83d874, mtx=0xbe83d864) at ../../../mozglue/build/Nuwa.cpp:1048
#4 0xb463ee5c in ConditionVariable::Wait (this=this@entry=0xbe83d874)
at ../../../ipc/chromium/src/base/condition_variable_posix.cc:40
#5 0xb4640c3a in base::WaitableEvent::TimedWait (this=this@entry=0xb6a3f08c, max_time=...)
at ../../../ipc/chromium/src/base/waitable_event_posix.cc:195
#6 0xb4640c5c in base::WaitableEvent::Wait (this=this@entry=0xb6a3f08c)
at ../../../ipc/chromium/src/base/waitable_event_posix.cc:201
#7 0xb4646140 in base::MessagePumpDefault::Run (this=0xb6a3f080, delegate=0xbe83d960)
at ../../../ipc/chromium/src/base/message_pump_default.cc:60
#8 0xb464625e in MessageLoop::RunInternal (this=this@entry=0xbe83d960)
at ../../../ipc/chromium/src/base/message_loop.cc:229
#9 0xb4646276 in RunHandler (this=0xbe83d960) at ../../../ipc/chromium/src/base/message_loop.cc:222
#10 MessageLoop::Run (this=this@entry=0xbe83d960) at ../../../ipc/chromium/src/base/message_loop.cc:196
#11 0xb4653e16 in ProcLoaderServiceRun (aArgv=<optimized out>, aArgc=1, aFd=<optimized out>, aPeerPid=186)
at ../../../ipc/glue/ProcessUtils_linux.cpp:528
#12 XRE_ProcLoaderServiceRun (aPeerPid=186, aFd=<optimized out>, aArgc=1, aArgv=<optimized out>)
at ../../../ipc/glue/ProcessUtils_linux.cpp:568
#13 0x0000b178 in RunProcesses (argv=0xbe83dac4, argc=1) at ../../../b2g/app/B2GLoader.cpp:215
#14 main (argc=1, argv=0xbe83dac4) at ../../../b2g/app/B2GLoader.cpp:247
On flame-kk (not working):
(gdb) bt
#0 __futex_wait () at bionic/libc/arch-arm/bionic/futex_arm.S:51
#1 0xb6f2f018 in __pthread_cond_timedwait_relative (cond=0xb6a04654, mutex=0xb6acc290, reltime=0xf0)
at bionic/libc/bionic/pthread.c:1117
#2 0xb6f2f078 in __pthread_cond_timedwait (cond=0xb6a04654, mutex=0xb6acc290, abstime=<optimized out>,
clock=<optimized out>) at bionic/libc/bionic/pthread.c:1140
#3 0xb6ed3bfc in __wrap_pthread_cond_wait (cond=0xb6a04654, mtx=0xb6acc290) at ../../../gecko/mozglue/build/Nuwa.cpp:1048
#4 0xb69a507c in PR_WaitCondVar (cvar=0xb6a04650, timeout=4294967295)
at ../../../../../gecko/nsprpub/pr/src/pthreads/ptsynch.c:385
#5 0xb4e2a09e in Wait (aInterval=4294967295, this=0xb6a04644) at ../../dist/include/mozilla/CondVar.h:79
#6 mozilla::Monitor::Wait (this=0xb6a04640, aInterval=aInterval@entry=4294967295)
at ../../dist/include/mozilla/Monitor.h:40
#7 0xb4f73322 in mozilla::ipc::MessageChannel::WaitForSyncNotify (this=this@entry=0xb6a54448)
at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:1420
#8 0xb4f77b34 in mozilla::ipc::MessageChannel::SendAndWait (this=this@entry=0xb6a54448, aMsg=aMsg@entry=0xb6a3fd60,
aReply=aReply@entry=0xbef2de24) at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:712
#9 0xb4f782da in mozilla::ipc::MessageChannel::Send (this=this@entry=0xb6a54448, aMsg=aMsg@entry=0xb6a3fd60,
aReply=aReply@entry=0xbef2de24) at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:619
#10 0xb4fb8a5c in mozilla::dom::PContentChild::SendPCrashReporterConstructor (this=this@entry=0xb6a54418, actor=0xb6a3fd60,
tid=@0xbef2de70: 523, processType=@0xbef2de74: 2)
at /home/khuey/dev/B2G-2.0-flame/objdir-gecko/ipc/ipdl/PContentChild.cpp:931
#11 0xb4fb8afa in mozilla::dom::PContentChild::SendPCrashReporterConstructor (this=this@entry=0xb6a54418,
tid=@0xb6756e20: 1, processType=@0x1: <error reading variable>)
at /home/khuey/dev/B2G-2.0-flame/objdir-gecko/ipc/ipdl/PContentChild.cpp:894
#12 0xb55b9d0a in mozilla::dom::ContentChild::Init (this=this@entry=0xb6a54418, aIOLoop=aIOLoop@entry=0xb3fb1ba0,
aParentHandle=aParentHandle@entry=202, aChannel=0xb6a2b1d0)
at /home/khuey/dev/B2G-2.0-flame/gecko/dom/ipc/ContentChild.cpp:581
#13 0xb55ba874 in mozilla::dom::ContentProcess::Init (this=0xb6a54400)
at /home/khuey/dev/B2G-2.0-flame/gecko/dom/ipc/ContentProcess.cpp:27
#14 0xb5ab4218 in XRE_InitChildProcess (aArgc=<optimized out>, aArgv=<optimized out>)
at /home/khuey/dev/B2G-2.0-flame/gecko/toolkit/xre/nsEmbedFunctions.cpp:532
#15 0xb506e38a in content_process_main (argc=8, argc@entry=9, argv=argv@entry=0xb6a01af0)
at ../../../gecko/ipc/contentproc/plugin-container.cpp:148
#16 0xb4f75386 in mozilla::ipc::ProcLoaderLoadRunner::DoWork (this=<optimized out>)
at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:381
#17 0xb4f73ae2 in ProcLoaderServiceRun (aArgv=<optimized out>, aArgc=1, aFd=<optimized out>, aPeerPid=202)
at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:540
#18 XRE_ProcLoaderServiceRun (aPeerPid=202, aFd=<optimized out>, aArgc=1, aArgv=<optimized out>)
at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:568
#19 0x0000af14 in RunProcesses (argv=0xbef2ea04, argc=1) at ../../../gecko/b2g/app/B2GLoader.cpp:215
#20 main (argc=1, argv=0xbef2ea04) at ../../../gecko/b2g/app/B2GLoader.cpp:247
3. The b2g-ps output contains two b2g processes (the Nuwa thing is bug 1050122 which I see on my Nexus 5, so I don't think it matters (this is not a DEBUG build)) and a preallocated process that did not change its name.
khuey@minbar:~/dev/B2G-2.0-flame$ adb shell b2g-ps
APPLICATION SEC USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 0 root 201 1 204672 74156 ffffffff b6eb58a4 S /system/b2g/b2g
b2g 0 root 618 201 50500 3640 ffffffff b6eb5a70 S /system/b2g/b2g
(Preallocated a 2 u0_a1029 1029 201 71348 26096 ffffffff b6eee8a4 S /system/b2g/plugin-container
4. The preallocated process's PPID is the main b2g process, not the Nuwa process.
5. Consistent with not changing its name, the app process never received LoadURL.
6. The about:memory report shows a ContentParent in the main process for the homescreen that is attached to the right PID.
I think the first thing we need to do is rule this out as the cause of the various 2.0 related things we see, which means we need to run B2G 2.0 on flame-kk. mwu, how soon will we be able to do that?
Flags: needinfo?(mwu)
Reporter | ||
Comment 1•10 years ago
|
||
Reporter | ||
Updated•10 years ago
|
Attachment #8470560 -
Attachment description: flame-kk.tgz → about memory report
Assignee | ||
Comment 2•10 years ago
|
||
This is the patch I used to debug bug 1048011. When I sleep 30 seconds in the child process before running the PProcLoaderChild actor, the Nuwa process never goes up and the process is stuck in PContentChild::SendPCrashReporterConstructor().
Comment 3•10 years ago
|
||
(In reply to Cervantes Yu from comment #2)
> Created attachment 8470606 [details]
> Patch to reproduce
It doesn't reproduce the case for me.
Comment 4•10 years ago
|
||
(In reply to Cervantes Yu from comment #2)
> Created attachment 8470606 [details]
> Patch to reproduce
I did not run printf_stderr() line for my previous comment. After adding printf_stderr(), it is broken for our implementation is backed by logcat. So, it is broken for different reason. For it would be broken, there are lines to set the value of gDisableAndroidLog in ProcLoaderServiceRun().
Comment 5•10 years ago
|
||
Is this only on flame-kk, not for ICS?
Assignee | ||
Comment 6•10 years ago
|
||
Comment on attachment 8470606 [details]
Patch to reproduce
Scratch the patch since printf_stderr() is not supposed to be called in this place.
Attachment #8470606 -
Attachment is obsolete: true
Assignee | ||
Comment 7•10 years ago
|
||
Using changeset 198832:a9b43778f0c2, we initially encountered a similar problem, but resetting gaia makes the problem go away. I still don't know why.
Comment 8•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #0)
> I think the first thing we need to do is rule this out as the cause of the
> various 2.0 related things we see, which means we need to run B2G 2.0 on
> flame-kk. mwu, how soon will we be able to do that?
For your purposes, switching gecko and gaia to the 2.0 branches should be enough.
Flags: needinfo?(mwu)
Comment 9•10 years ago
|
||
I could reproduce the issue with following script after one hour.
while true; do adb shell stop b2g; sleep 1; adb shell start b2g; while true; do if adb shell b2g-ps|grep Homescreen > /dev/null; then date; break; fi; sleep 1; done; done
It is not easy to reproduce with my owned build.
Comment 10•10 years ago
|
||
(In reply to Thinker Li [:sinker] from comment #9)
> I could reproduce the issue with following script after one hour.
After studied into my case, I believe it is another bug. See bug 1052363.
So, I still could not reproduce the bug here.
Assignee | ||
Comment 11•10 years ago
|
||
I have a gut feeling that this might be that somewhere __android_log_print() or the like is called before it's safe to call it. I am making this guess because the problematic stack is exactly the same as what I saw when I called __android_log_print() to add debug information where it's unsafe to do so. See https://bugzilla.mozilla.org/show_bug.cgi?id=977026#c83 for why we android logcat can't be used before we shuffle IPC file descriptors.
An easy way to verify: list the open file descriptors of of the process where PProcLoaderChild runs (the 2nd b2g process from b2g-ps). If it's stuck because of android logcat, you should see the following, where FDs open to /dev/log/main where overwritten by ShuffleFileDescriptors().
> lrwx------ root root 2014-08-12 19:32 0 -> /dev/null
> lrwx------ root root 2014-08-12 19:32 1 -> /dev/null
> l-wx------ root root 2014-08-12 19:33 10 -> pipe:[845193]
> lrwx------ root root 2014-08-12 19:33 11 -> socket:[842374]
> lrwx------ root root 2014-08-12 19:33 12 -> socket:[842376]
> lrwx------ root root 2014-08-12 19:33 13 -> socket:[842378]
> lr-x------ root root 2014-08-12 19:32 14 -> pipe:[845194]
> l-wx------ root root 2014-08-12 19:32 15 -> pipe:[845194]
> lrwx------ root root 2014-08-12 19:32 2 -> /dev/null
> lrwx------ root root 2014-08-12 19:32 3 -> socket:[843135]
> lrwx------ root root 2014-08-12 19:32 4 -> socket:[842370]
> lr-x------ root root 2014-08-12 19:32 5 -> /dev/__properties__
> lrwx------ root root 2014-08-12 19:32 6 -> anon_inode:[eventpoll]
> lrwx------ root root 2014-08-12 19:32 7 -> socket:[845191]
> lrwx------ root root 2014-08-12 19:32 8 -> socket:[845192]
> lr-x------ root root 2014-08-12 19:32 9 -> pipe:[845193]
On a healthy process, you should see open file descriptors like:
> l-wx------ u0_a11806 u0_a11806 2014-08-12 19:30 21 -> /dev/log/main
> l-wx------ u0_a11806 u0_a11806 2014-08-12 19:30 22 -> /dev/log/radio
> l-wx------ u0_a11806 u0_a11806 2014-08-12 19:30 23 -> /dev/log/events
> l-wx------ u0_a11806 u0_a11806 2014-08-12 19:30 24 -> /dev/log/system
One possible use of android logcat is IPC message logging, but I am not sure if this is the root cause of your problem. Kyle would you make a quick check whether the android logcat file descriptors are overwritten.
Flags: needinfo?(khuey)
Assignee | ||
Comment 12•10 years ago
|
||
If it's confirmed that the root cause is android logcat accidentially called, we could avoid the problem by pre-reserving (e.g. open to /dev/null) file descriptors for ShuffleFileDescriptors so it won't be taken by accident.
Reporter | ||
Comment 13•10 years ago
|
||
khuey@minbar:~/dev/B2G-2.0-flame$ adb shell b2g-ps
APPLICATION SEC USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 0 root 199 1 192660 71496 ffffffff b6ed18a4 S /system/b2g/b2g
b2g 0 root 583 199 50500 3604 ffffffff b6ed1a70 S /system/b2g/b2g
(Preallocated a 2 u0_a1020 1020 199 69300 26064 ffffffff b6ead8a4 S /system/b2g/plugin-container
khuey@minbar:~/dev/B2G-2.0-flame$ adb shell ls -l /proc/583/fd
lrwx------ root root 2014-08-10 16:18 0 -> /dev/null
lrwx------ root root 2014-08-10 16:18 1 -> /dev/null
l-wx------ root root 2014-08-10 16:20 10 -> pipe:[7676]
lrwx------ root root 2014-08-10 16:20 11 -> socket:[6083]
lrwx------ root root 2014-08-10 16:20 12 -> socket:[6085]
lrwx------ root root 2014-08-10 16:20 13 -> socket:[6087]
lrwx------ root root 2014-08-10 16:18 2 -> /dev/null
lrwx------ root root 2014-08-10 16:18 3 -> socket:[6081]
lrwx------ root root 2014-08-10 16:18 4 -> socket:[6077]
lr-x------ root root 2014-08-10 16:18 5 -> /dev/__properties__
lrwx------ root root 2014-08-10 16:18 6 -> anon_inode:[eventpoll]
lrwx------ root root 2014-08-10 16:18 7 -> socket:[7674]
lrwx------ root root 2014-08-10 16:18 8 -> socket:[7675]
lr-x------ root root 2014-08-10 16:20 9 -> pipe:[7676]
Flags: needinfo?(khuey)
Comment 14•10 years ago
|
||
We should remove the |gDisableAndroidLog| flag once we preserving fds for shuffling.
Assignee | ||
Comment 15•10 years ago
|
||
I am working on a patch to reserve the magic file descriptors.
Currently we have 3 for child side IPC channel, 4 for crash reporter and 5 for android system prop.
Assignee: nobody → cyu
Assignee | ||
Comment 16•10 years ago
|
||
This preoccupies the magic file descriptors ({3, 4, 5}) with /dev/null so other operations such as __android_log_print() won't create file descriptors in the "danger zone".
Try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=07f9fa4f56bb
Attachment #8472933 -
Flags: review?(khuey)
Assignee | ||
Comment 17•10 years ago
|
||
The less error-prone patch that won't depend on the magic list {3, 4, 5}. This won't break after a new magic FD (such as 6) is added, unless the number of magic FDs grows up to 10.
Try push log: https://tbpl.mozilla.org/?tree=Try&rev=0e76f02cde77
Attachment #8472933 -
Attachment is obsolete: true
Attachment #8472933 -
Flags: review?(khuey)
Attachment #8473582 -
Flags: review?(khuey)
Comment 18•10 years ago
|
||
I'm seeing a similar issue as khuey on JB. I locally applied attachment 8473582 [details] [diff] [review], but did not see an improvement. This is blocking me from doing any B2G dev unfortunately.
For reference my output:
erahm-25043:mozilla-central ericrahm$ adb shell b2g-ps
APPLICATION USER PID PPID VSIZE RSS WCHAN PC NAME
b2g root 2627 1 167152 78772 ffffffff b6f7163c S /system/b2g/b2g
(Nuwa) root 2630 2627 51412 7328 ffffffff b6f71808 S /system/b2g/b2g
(Preallocated a u0_a2791 2791 2627 70188 27180 ffffffff b6ed763c S /system/b2g/plugin-container
erahm-25043:mozilla-central ericrahm$ adb shell ls -l /proc/2630/fd
lrwx------ root root 2014-08-21 14:53 0 -> /dev/null
lrwx------ root root 2014-08-21 14:53 1 -> /dev/null
lrwx------ root root 2014-08-21 15:00 10 -> socket:[13227]
lrwx------ root root 2014-08-21 15:00 11 -> socket:[13229]
lrwx------ root root 2014-08-21 15:00 12 -> socket:[13231]
lr-x------ root root 2014-08-21 14:53 13 -> pipe:[7998]
l-wx------ root root 2014-08-21 15:00 14 -> pipe:[7998]
lr-x------ root root 2014-08-21 15:00 19 -> /dev/__properties__
lrwx------ root root 2014-08-21 14:53 2 -> /dev/null
lrwx------ root root 2014-08-21 15:00 20 -> socket:[7994]
lrwx------ root root 2014-08-21 14:53 3 -> socket:[7994]
lrwx------ root root 2014-08-21 14:53 4 -> anon_inode:[eventpoll]
lr-x------ root root 2014-08-21 14:53 5 -> /dev/__properties__
lrwx------ root root 2014-08-21 14:53 6 -> socket:[7995]
lrwx------ root root 2014-08-21 14:53 7 -> socket:[7996]
lr-x------ root root 2014-08-21 14:53 8 -> pipe:[7997]
l-wx------ root root 2014-08-21 15:00 9 -> pipe:[7997]
erahm-25043:mozilla-central ericrahm$
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #18)
> I'm seeing a similar issue as khuey on JB. I locally applied attachment
> 8473582 [details] [diff] [review], but did not see an improvement. This is
> blocking me from doing any B2G dev unfortunately.
>
Obviously android logcat file descriptors are still missing in the Nuwa process so it could be android logcat messing with IPC channels. The patch preoccupies 10 file descriptors right after fork(). The only reason that I can think of to make it fail is android_log_print() being called before fork(). I will need your help to debug this.
Assignee | ||
Comment 20•10 years ago
|
||
To use it, launch b2g with MOZ_DEBUG_CHILD_PROCESS env variable:
> root@android:/ # export MOZ_DEBUG_CHILD_PROCESS=1
> root@android:/ # b2g.sh
> B2GLoader CHILDCHILDCHILDCHILD
> debug me @ 17769
Then you will have 30 seconds to list the file descriptors. This is the output you should see, no android logcat FDs.
> root@android:/ # ls -l /proc/17769/fd
> lrwx------ root root 2014-08-21 23:19 0 -> /dev/pts/2
> lrwx------ root root 2014-08-21 23:19 1 -> /dev/pts/2
> lrwx------ root root 2014-08-21 23:19 2 -> /dev/pts/2
> lrwx------ root root 2014-08-21 23:19 4 -> socket:[952013]
> lr-x------ root root 2014-08-21 23:19 8 -> /dev/__properties__ (deleted)
>root@android:/ #
Eric, could you run this test? Thanks.
Flags: needinfo?(erahm)
Assignee | ||
Comment 21•10 years ago
|
||
Fix the build break for the debug code.
Attachment #8477171 -
Attachment is obsolete: true
Assignee | ||
Comment 22•10 years ago
|
||
I encountered the same on emulator debug build. The b2g loader already performed __android_log_print() even before entering main(). I sleep(30) in the beginning of main() and there are already many lines of
> I/Gecko ( 44): [44] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /mnt/SSD/data/hg/mozilla-central/xpcom/base/nsTraceRefcnt.cpp, line 148
in adb logcat, and the process already has adb logcat initialized:
> 1|root@android:/ # ls -l /proc/44/fd
> lrwx------ root root 2014-08-22 11:33 0 -> /dev/null
> lrwx------ root root 2014-08-22 11:33 1 -> /dev/null
> lrwx------ root root 2014-08-22 11:33 2 -> /dev/null
> l-wx------ root root 2014-08-22 11:33 4 -> /dev/log/main
> l-wx------ root root 2014-08-22 11:33 5 -> /dev/log/radio
> l-wx------ root root 2014-08-22 11:33 6 -> /dev/log/events
> lr-x------ root root 2014-08-22 11:33 8 -> /dev/__properties__ (deleted)
> root@android:/ #
I think we still need to disable android logcat before we fork the child process.
Assignee | ||
Comment 23•10 years ago
|
||
(In reply to Cervantes Yu from comment #22)
> I encountered the same on emulator debug build. The b2g loader already
> performed __android_log_print() even before entering main(). I sleep(30) in
> the beginning of main() and there are already many lines of
>
Scratch the above statement. The following warning are generated in loading libxul. We still can use the trick of preoccupying file descriptors, but this has to be done in the beginning of main(), and we need to close the unused ones both in parent and child processes. I'll update a patch for this.
> > I/Gecko ( 44): [44] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /mnt/SSD/data/hg/mozilla-central/xpcom/base/nsTraceRefcnt.cpp, line 148
Reporter | ||
Updated•10 years ago
|
Summary: When building flame-kk homescreen fails to start → If we print to logcat too early the homescreen fails to start
Comment 24•10 years ago
|
||
[Blocking Requested - why for this release]:
Blocks a CAF blocker.
blocking-b2g: --- → 2.1?
Comment 25•10 years ago
|
||
(In reply to Cervantes Yu from comment #20)
> Eric, could you run this test? Thanks.
I'll check this out now.
Comment 26•10 years ago
|
||
Note: this is reproduces 100% if you enable DMD.
Comment 27•10 years ago
|
||
I got three child processes, here's the output:
erahm-25043:~ ericrahm$ adb shell ls -l /proc/16377/fd
lrwx------ u0_a16377 u0_a16377 2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ u0_a16377 u0_a16377 2014-08-22 18:55 1 -> /dev/pts/1
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 10 -> /dev/cpuctl/apps/tasks
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 11 -> /dev/cpuctl/apps/bg_non_interactive/tasks
lrwx------ u0_a16377 u0_a16377 2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ u0_a16377 u0_a16377 2014-08-22 18:55 3 -> socket:[182717]
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 4 -> /dev/log/main
lr-x------ u0_a16377 u0_a16377 2014-08-22 18:55 5 -> /dev/__properties__
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 6 -> /dev/log/radio
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 7 -> /dev/log/events
l-wx------ u0_a16377 u0_a16377 2014-08-22 18:55 8 -> /dev/log/system
lrwx------ u0_a16377 u0_a16377 2014-08-22 18:55 9 -> /dev/binder
erahm-25043:~ ericrahm$ adb shell ls -l /proc/16381/fd
lrwx------ u0_a16381 u0_a16381 2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ u0_a16381 u0_a16381 2014-08-22 18:55 1 -> /dev/pts/1
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 10 -> /dev/cpuctl/apps/tasks
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 11 -> /dev/cpuctl/apps/bg_non_interactive/tasks
lrwx------ u0_a16381 u0_a16381 2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ u0_a16381 u0_a16381 2014-08-22 18:55 3 -> socket:[183930]
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 4 -> /dev/log/main
lr-x------ u0_a16381 u0_a16381 2014-08-22 18:55 5 -> /dev/__properties__
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 6 -> /dev/log/radio
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 7 -> /dev/log/events
l-wx------ u0_a16381 u0_a16381 2014-08-22 18:55 8 -> /dev/log/system
lrwx------ u0_a16381 u0_a16381 2014-08-22 18:55 9 -> /dev/binder
erahm-25043:~ ericrahm$ adb shell ls -l /proc/16158/fd
lrwx------ root root 2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ root root 2014-08-22 18:55 1 -> /dev/pts/1
lrwx------ root root 2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ root root 2014-08-22 18:55 3 -> socket:[183637]
lr-x------ root root 2014-08-22 18:55 5 -> /dev/__properties__
Flags: needinfo?(erahm)
Assignee | ||
Comment 28•10 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #26)
> Note: this is reproduces 100% if you enable DMD.
Yes, because DMD call __android_log_vprint() even before we enter main(), and because of this, preoccupying file descriptors on entering main() will not work. We might need to preoccupy FDs in libmozglue, the 1st library to load if we want to completely avoid this problem.
Assignee | ||
Comment 29•10 years ago
|
||
Update the patch to address the problem when DMD is enabled.
Eric, could you test this patch to see if it works? Thanks.
Attachment #8473582 -
Attachment is obsolete: true
Attachment #8473582 -
Flags: review?(khuey)
Attachment #8478977 -
Flags: feedback?(erahm)
Comment 30•10 years ago
|
||
Comment on attachment 8478977 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v3)
Review of attachment 8478977 [details] [diff] [review]:
-----------------------------------------------------------------
This does fix my issue, although I think we'd prefer to be able to print out those statements indicating that DMD is enabled.
If this is the only option, it might make more sense to just remove printing of those lines [1] from DMD on B2G rather than adding an env var.
[1] http://dxr.mozilla.org/mozilla-central/source/memory/replace/dmd/DMD.cpp?from=DMD.cpp&case=true#1604,1607,1614
Attachment #8478977 -
Flags: feedback?(erahm) → feedback+
Assignee | ||
Comment 31•10 years ago
|
||
Changes:
* Don't use env variable in DMD.
* Only reserve FD 3-7 in the child process and child will abort if any is taken.
Attachment #8478977 -
Attachment is obsolete: true
Attachment #8479776 -
Flags: review?(khuey)
Assignee | ||
Comment 32•10 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #30)
> Comment on attachment 8478977 [details] [diff] [review]
> Make sure the magic file descriptors won't be taken for other uses (v3)
>
> Review of attachment 8478977 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This does fix my issue, although I think we'd prefer to be able to print out
> those statements indicating that DMD is enabled.
>
> If this is the only option, it might make more sense to just remove printing
> of those lines [1] from DMD on B2G rather than adding an env var.
>
We need to disable printing these messages to android logcat on startup. I only disable the output in Init() on b2g.
I considered some alternatives like delaying the output, sending to stderr, or fork() to print, none of which is clean so I just silence the output.
Assignee | ||
Comment 33•10 years ago
|
||
Comment 34•10 years ago
|
||
NI :kyle for help with review as this is needed for 1052720.
Flags: needinfo?(khuey)
Reporter | ||
Comment 35•10 years ago
|
||
Comment on attachment 8479776 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v4)
Review of attachment 8479776 [details] [diff] [review]:
-----------------------------------------------------------------
I didn't get a chance to test this today, but it looks good.
::: memory/replace/dmd/DMD.cpp
@@ +209,1 @@
> __android_log_vprint(ANDROID_LOG_INFO, "DMD", aFmt, ap);
indent this two more spaces please.
Attachment #8479776 -
Flags: review?(khuey) → review+
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(khuey)
Assignee | ||
Comment 36•10 years ago
|
||
Emulator runs fail because of
Hit MOZ_CRASH(ProcLoader error: a magic file descriptor is occupied.) at ../../../gecko/ipc/glue/ProcessUtils_linux.cpp:194
I need to figure out why someone still occupies a magic file descriptors.
Assignee | ||
Comment 37•10 years ago
|
||
It turns out that on debug build, nsTraceRefcnt.cpp will fopen() /data/local/tests/log/runtests_leaks.log with fileno = 4. The patch aborts because if finds FD 4 is taken.
Updated•10 years ago
|
blocking-b2g: 2.1? → 2.1+
Assignee | ||
Comment 38•10 years ago
|
||
V5 WIP to address the issue that nsTraceRefcnt.cpp fopen()s before we fork().
Also the fact that it fopen()s is not fork-safe. That's why I need to add NS_LogReinit() (and also call it after Nuwa forks).
I might need to separate the parts regarding nsTraceRefcnt into another part or track the issue with it in another bug.
Reporter | ||
Comment 39•10 years ago
|
||
How much more do we have to do here? Not being able to run DMD is painful.
Flags: needinfo?(cyu)
Assignee | ||
Comment 40•10 years ago
|
||
I will update the patch later today and open a new bug to handle the issue with XPCOM leak logs.
Flags: needinfo?(cyu)
Assignee | ||
Comment 41•10 years ago
|
||
Try push log:
https://tbpl.mozilla.org/?tree=Try&rev=f50b037fc7fa
Request for review again because there is design change in the timing of FD reservation. It's done right after entering main(). I do this to avoid the problem with android logcat calls coming from where, and we don't need to disallow logcat calls before fork.
Because of this I don't call base::CloseSuperfluousFds() after fork. The assumption under this is that we fork pretty early in the startup process, the file descriptors before fork should be necessary in the child process.
Attachment #8479776 -
Attachment is obsolete: true
Attachment #8487175 -
Attachment is obsolete: true
Attachment #8491290 -
Flags: review?(khuey)
Updated•10 years ago
|
Whiteboard: [CR 726359]
Updated•10 years ago
|
Blocks: CAF-v2.1-FC-metabug
(In reply to Cervantes Yu from comment #41)
> Created attachment 8491290 [details] [diff] [review]
> Make sure the magic file descriptors won't be taken for other uses (v5)
>
> Try push log:
> https://tbpl.mozilla.org/?tree=Try&rev=f50b037fc7fa
>
> Request for review again because there is design change in the timing of FD
> reservation. It's done right after entering main(). I do this to avoid the
> problem with android logcat calls coming from where, and we don't need to
> disallow logcat calls before fork.
>
> Because of this I don't call base::CloseSuperfluousFds() after fork. The
> assumption under this is that we fork pretty early in the startup process,
> the file descriptors before fork should be necessary in the child process.
We are seeing some unknown memleak issue in FFOS 2.1 . Could you please give us a fix for v2.1 asap ? It is blocking us to debug memleak issue in FFOS v2.1
Flags: needinfo?(erahm)
Flags: needinfo?(cyu)
Comment 44•10 years ago
|
||
This is a temporary workaround for just DMD. It is untested, but based on previous attempts to fix this bug it should work.
Comment 45•10 years ago
|
||
Tapas, attachment 8491786 [details] should help workaround your DMD issues. It should apply cleanly to the 2.1 branch.
Flags: needinfo?(erahm)
(In reply to Eric Rahm [:erahm] from comment #45)
> Tapas, attachment 8491786 [details] should help workaround your DMD issues.
> It should apply cleanly to the 2.1 branch.
it seems like your patch is fixing 'website white-screen issue' in bug 1069542 comment 0 . We are landing that patch in our build for memleak debugging but we will still wait for Cervantes to land a real fix for this problem in v2.0.
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #46)
> We are landing that patch in our build for memleak debugging but we will still wait for Cervantes to land a real fix for this problem in v2.0.
We need fix in 2.1 (not in 2.0)
Sorry for the mistake
Updated•10 years ago
|
Whiteboard: [CR 726359] → [caf priority: p2][CR 726359]
Assignee | ||
Comment 48•10 years ago
|
||
Flags: needinfo?(cyu)
Reporter | ||
Comment 49•10 years ago
|
||
So this still doesn't make logcat work, right? Is that possible? If not, do we need the rest of this patch if we just block DMD from using logcat?
Flags: needinfo?(cyu)
Assignee | ||
Comment 50•10 years ago
|
||
This is supposed to make logcat work unless DMD should call StatusMsg() after it finishes initialization
( https://mxr.mozilla.org/mozilla-central/source/memory/replace/dmd/DMD.cpp#1672 )
and before entering main(). With this patch DMD is the only place that we suppress __android_log_print() calls. Once entering main(), we can safely use logcat.
Doesn't this patch work in the presense of DMD?
Flags: needinfo?(cyu)
Reporter | ||
Comment 51•10 years ago
|
||
DMD works, unless I remove the patch to DMD, of course. I was asking if we needed the rest of the patch if we just do the DMD bit.
Reporter | ||
Comment 52•10 years ago
|
||
Comment on attachment 8491290 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v5)
Review of attachment 8491290 [details] [diff] [review]:
-----------------------------------------------------------------
r=me
::: memory/replace/dmd/DMD.cpp
@@ +205,5 @@
> + // Don't call __android_log_vprint() during initialization, or the magic file
> + // descriptors will be occupied by android logcat.
> + if (gIsDMDRunning)
> +#endif
> + __android_log_vprint(ANDROID_LOG_INFO, "DMD", aFmt, ap);
I guess this is still useful if someone runs DMD on Firefox for Android.
Attachment #8491290 -
Flags: review?(khuey) → review+
Assignee | ||
Comment 53•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #51)
> DMD works, unless I remove the patch to DMD, of course. I was asking if we
> needed the rest of the patch if we just do the DMD bit.
We need to because even DMD doesn't write android logcat, XPCOM initialization (__android_log_print() in nsDebugImpl.cpp) and refcount tracing (fopen()) still creates file descriptors. They are both before we fork the child process. So we need the rest of the patch.
Assignee | ||
Comment 54•10 years ago
|
||
Comment 55•10 years ago
|
||
Reverted for non-unified build failures:
https://tbpl.mozilla.org/php/getParsedLog.php?id=48674154&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=48674722&tree=Mozilla-Inbound
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/eb2ed80f96b8
Status: NEW → ASSIGNED
Assignee | ||
Comment 56•10 years ago
|
||
Fix non-unified build failures and repush:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e25cd3e8219
Comment 57•10 years ago
|
||
This was merged to m-c but the bug wasn't marked for some reason.
https://hg.mozilla.org/mozilla-central/rev/1e25cd3e8219
Please request Aurora approval on the v2.1 patch when you get a chance.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
status-b2g-v2.1:
--- → affected
status-b2g-v2.2:
--- → fixed
status-firefox33:
--- → wontfix
status-firefox34:
--- → affected
status-firefox35:
--- → fixed
Flags: needinfo?(cyu)
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Assignee | ||
Comment 58•10 years ago
|
||
This was backed out in https://hg.mozilla.org/mozilla-central/rev/14639e630d21
Later retest shows this patch to be non-offending: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=eb441928c187
I will reland it.
Assignee | ||
Comment 59•10 years ago
|
||
Treeherder link appears to be broken. The TPBL link: https://tbpl.mozilla.org/?tree=Try&rev=eb441928c187
Assignee | ||
Comment 60•10 years ago
|
||
Reland on inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/7392ced14502
Flags: needinfo?(cyu)
Assignee | ||
Comment 61•10 years ago
|
||
Comment on attachment 8492050 [details] [diff] [review]
Patch for 2.1
Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: The device fails to boot on engineering debug build or DMD is turned on.
[Describe test coverage new/current, TBPL]: Manual verification, TPBL.
[Risks and why]: Low to median. This is not a trivial patch, but we try to make it generic and prevent similar problems in the future.
The patch changes file descriptor handling in the early stage of b2g startup. The approach of file descriptor reservation should be safe and resistant to future additions of file descriptor operations.
Attachment #8492050 -
Flags: approval-mozilla-aurora?
Comment 62•10 years ago
|
||
Updated•10 years ago
|
Attachment #8492050 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 63•10 years ago
|
||
Comment 64•10 years ago
|
||
Backed out for bustage. Yes, this was the 2.1 patch. Looks like you never rolled the non-unified build fixes into it...
https://hg.mozilla.org/releases/mozilla-aurora/rev/e10f2bc2dd00
https://tbpl.mozilla.org/php/getParsedLog.php?id=49358071&tree=Mozilla-Aurora
Assignee | ||
Comment 65•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #64)
> Backed out for bustage. Yes, this was the 2.1 patch. Looks like you never
> rolled the non-unified build fixes into it...
> https://hg.mozilla.org/releases/mozilla-aurora/rev/e10f2bc2dd00
>
> https://tbpl.mozilla.org/php/getParsedLog.php?id=49358071&tree=Mozilla-Aurora
Oops, my bad. Will update the patch with the fix.
Flags: needinfo?(cyu)
Assignee | ||
Comment 66•10 years ago
|
||
Merged the fix for nonunified build.
Assignee | ||
Comment 67•10 years ago
|
||
Repush to aurora with the fix:
https://hg.mozilla.org/releases/mozilla-aurora/rev/baaa0c3ab8fd
Updated•10 years ago
|
Comment 68•10 years ago
|
||
This is causing a startup crash when I try to start with gdb
Program received signal SIGSEGV, Segmentation fault.
0x0000b616 in ReserveFileDescriptors (aReservedFds=...) at ../../../b2g/app/B2GLoader.cpp:239
239 MOZ_CRASH("ProcLoader error: a magic file descriptor is occupied.");
(gdb) bt
#0 0x0000b616 in ReserveFileDescriptors (aReservedFds=...) at ../../../b2g/app/B2GLoader.cpp:239
#1 main (argc=1, argv=0xbee3ea24) at ../../../b2g/app/B2GLoader.cpp:279
Comment 69•10 years ago
|
||
removing this line leads to the next moz_crash at line 255. I think we have to back this out :(
Updated•10 years ago
|
Flags: needinfo?(cyu)
Assignee | ||
Comment 70•10 years ago
|
||
Did you try to restart gdb inside a gdb session? Starting the gdb session from scratch works, but if you try to kill the b2g process and restart the gdb session you will see this crash because the b2g process is started with a new socket.
This is when the crash happens:
APPLICATION SEC USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 0 root 552 32681 5056 1840 c019c3a4 0000b056 t /system/b2g/b2g
And the file descriptors of b2g and gdbserver:
root@flame:/proc/32681/fd # ls -l
lrwx------ root root 2014-10-06 11:29 0 -> /dev/pts/0
lrwx------ root root 2014-10-06 11:29 1 -> /dev/pts/0
lrwx------ root root 2014-10-06 11:29 2 -> /dev/pts/0
lrwx------ root root 2014-10-06 11:29 4 -> socket:[10196674]
lr-x------ root root 2014-10-06 11:29 8 -> /dev/__properties__
lrwx------ root root 2014-10-06 11:29 9 -> socket:[5657]
root@flame:/proc/32681/fd # cd /proc/552/fd
root@flame:/proc/552/fd # ls -l
lrwx------ root root 2014-10-06 10:36 0 -> /dev/pts/0
lrwx------ root root 2014-10-06 10:36 1 -> /dev/pts/0
lrwx------ root root 2014-10-06 10:36 2 -> /dev/pts/0
lrwx------ root root 2014-10-06 10:36 3 -> /dev/null
lrwx------ root root 2014-10-06 10:36 4 -> socket:[10196674]
lr-x------ root root 2014-10-06 10:36 8 -> /dev/__properties__
lrwx------ root root 2014-10-06 10:36 9 -> socket:[5657]
Note that file descriptor 4 triggers the guard and crashes. It is leaked from gdbserver to the inferior. This leak is from the beginning of exec() and we can't code to avoid this. gdbserver needs an upgrade.
Flags: needinfo?(cyu)
Assignee | ||
Comment 71•10 years ago
|
||
It appears that we encountered this bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=15604
I tried to upgrade gdb/gdbserver. The gdb/gdbserver from ndk-10b still has the leak (and more more than the gdbserver we are using now). If we want to totally get rid of this problem, we need to upgrade to gdb/gdbserver 7.7.
The short-term solution is 'call close(4)' within gdb. This closes the socket leaked from gdbserver and you won't encounter the crash. It should be pretty safe in gdb.
The long-term solution could be that we remove the use of magic file descriptors. Then we can remove the trick of reserving file descriptors in this patch.
Greg, did you see this crash right after ./run-gdb.sh or after you killed and restarted b2g within gdb?
Now that this happens in gdb only and we can work this around, do you think we still need to back this out?
Flags: needinfo?(anygregor)
Updated•10 years ago
|
Flags: needinfo?(anygregor)
You need to log in
before you can comment on or make changes to this bug.
Description
•