Closed Bug 1139090 Opened 10 years ago Closed 10 years ago

Random crashes with vsync refresh driver enabled

Tracking

()

Status:

RESOLVED WORKSFORME

People

(Reporter: mchang, Assigned: mchang)

References

Details

(Keywords: crash)

Crash Data

Attachments

(6 files, 18 obsolete files)

Init VsyncParent on main thread 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
add thread checking for gfxPlatform::GetPlatform() 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
WIP Part 1 Delete RefreshDriverVsyncObserver 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
Part 2: Create Parent/Child finish init messages 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
WIP Part 3 Refactor RefreshDriverVsyncObserver to listen to vsync directly 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
Part 4: MOZ_RELEASE_ASSERT gfxPlatform::mVsyncSource is not null 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
Add asserts to ensure vsync source is not null 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch	kats : review+	Details \| Diff \| Splinter Review
Full Crash Stack 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/plain		Details
First crash 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/x-log		Details
NUWA Crash after enough time 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/plain		Details
[test] check message object use-after-free problem. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
change-refresh-timer-creation-timing.patch 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
Log so far 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/plain		Details
IPC Channnel Process Startup Log 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/plain		Details
check std::queue in ipc channel. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
check std::queue in ipc channel. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
check std::queue and OutputQueuePush() in ipc channel. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
Log of race condition 10 years ago Mason Chang [Inactive] [:mchang] (deleted), text/plain		Details
Always use the async callback to init the content vsync refresh driver 10 years ago Mason Chang [Inactive] [:mchang] (deleted), patch		Details \| Diff \| Splinter Review
Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch	kats : review+	Details \| Diff \| Splinter Review
Fix ipc thread crash part2: post vsync timer creation task to main thread. v1 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch	roc : review+ cyu : feedback+	Details \| Diff \| Splinter Review
reset channel fd at the first nuwa final constructor callback. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
add a new type of callback before NuwaAddFinalConstructor called. 10 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
The counter proposal: reset transports with NuwaAddConstructor() 10 years ago Cervantes Yu [:cyu] [:cervantes] (deleted), text/plain		Details

Mason Chang [Inactive] [:mchang]

Assignee

Description

•

10 years ago

There are random, timing specific crashes when we enable the vsync refresh driver. These occur across platforms. These are bugs 1131080, 1133865, and https://crash-stats.mozilla.com/report/index/7c23e927-c2b9-4f80-b225-ec7302150302. They all probably have the same root cause, which is the vsync refresh driver crashes on startup.

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1131080, https://bugzilla.mozilla.org/show_bug.cgi?id=1133865

Robert Kaiser

Updated

•

10 years ago

Crash Signature: [@ mozilla::gfx::VsyncSource::GetRefreshTimerVsyncDispatcher() ]

Mason Chang [Inactive] [:mchang]

Assignee

Comment 2

•

10 years ago

Oh, very helpful that we have a windows crash as well https://crash-stats.mozilla.com/report/index/d9a4dfd1-d8fb-45e1-b49b-dd1d72150226

Mason Chang [Inactive] [:mchang]

Assignee

Comment 3

•

10 years ago

Attached patch Init VsyncParent on main thread (obsolete) (deleted) — Details — Splinter Review

From the crash report in https://bugzilla.mozilla.org/show_bug.cgi?id=1139090#c2. It's crashing from VsyncParent::Create in calling gfxPlatform. I can only imagine this would happen if somehow the content process loaded before the main thread in the parent process ticked the refresh driver. Also, I don't think it's valid to get the gfxPlatform off main thread in general. This patch posts a task to the main thread to get the vsync refresh driver timer during the VsyncParent creation.

Attachment #8572246 - Flags: review?(bent.mozilla)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 4

•

10 years ago

Attached patch add thread checking for gfxPlatform::GetPlatform() (obsolete) (deleted) — Details — Splinter Review

I think we also need to add the thread checking for the first time we create the gfxPlatform. We might call GetPlatform() off-main thread, but we should call GetPlatform() after we init the platform at main thread. This checking can't catch all the race condition since we don't use mutex for "gPlatform". We create the gPlatform at [1] and we access it at [2]. [1] https://hg.mozilla.org/mozilla-central/annotate/c5b90c003be8/gfx/thebes/gfxPlatform.cpp#l480 [2] https://hg.mozilla.org/mozilla-central/annotate/c5b90c003be8/gfx/thebes/gfxPlatform.cpp#l416

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 5

•

10 years ago

Comment on attachment 8572246 [details] [diff] [review] Init VsyncParent on main thread Review of attachment 8572246 [details] [diff] [review]: ----------------------------------------------------------------- ::: layout/ipc/VsyncParent.cpp @@ +52,5 @@ > + mVsyncDispatcher = vsyncSource->GetRefreshTimerVsyncDispatcher(); > + > + if (mObservingVsync) { > + mVsyncDispatcher->AddChildRefreshTimer(this); > + } We might get race when we receive message during InitOnMainThread. How about this? VsyncParent::RecvObserve() { mutex lock(mCreateLock); ... } VsyncParent::InitOnMainThread() { //init mVsyncDispatcher ..... mutex lock(mCreateLock); if (mObservingVsync) { mVsyncDispatcher->AddChildRefreshTimer(this); } }

Mason Chang [Inactive] [:mchang]

Assignee

Comment 6

•

10 years ago

Attached patch WIP Part 1 Delete RefreshDriverVsyncObserver (obsolete) (deleted) — Details — Splinter Review

Stepping back for a moment, the problem is that we're crashing when trying to read the VsyncSource::RefreshTimerVsyncDispatcher during startup. This is actually only created during gfxPlatform::Init, which is supposed to be run on the main thread. It is OK to call gfxPlatform::GfxPlatform() Off main thread, but it is NOT ok to call gfxPlatform::Init off main thread. Thus, the first call to gfxPlatform::gfxPlatform must be on main thread. The ONLY other places where we call GetHardwareVsync() off main thread is in VsyncParent::Create and HwcComposer::Vsync on b2g. However, HwcComposer::Vsync is guaranteed to be called AFTER gfxPlatform::Init() is called on the main thread because we don't actually enable Vsync until gfxPlatform::init on b2g. The b2g vsync callback is disabled on startup and only turned on during gfxPlatform::init. What this means then, is that these crashes can occur in the unlikely case that gfxPlatform::Init is called at the same time on two threads: main thread and PBackground thread. We could init incorrectly and not have the hardware vsync source. For example, at [1], we could set the flag to true on the main thread, not finish initializing when the background thread tries to read the vsync source and crash. I think the overall message is that we should move reading the gfxPlatform::GetPlatform() VsyncParent::Create() onto the main thread. We could add a few locks, but then it is kind of ugly because the child could send a SendObserve vsync message before the main thread gets to init gfxPlatform, so we'd keep having these checks to see if the parent is initialized. I think it might be better to let the VsyncChild/VsyncParent KNOW it is working with a RefreshTimerVsyncObserver, rather than just a generic VsyncObserver. Then, once the ActorCreate occurs on the child process, we can send a notification to the parent process that the child is setup. Then the parent can init on the main thread and send a message back to the child. So the order would be: 1) ChildActor sets up, sends a message to Parent 2) Parent inits gfxPlatform on main thread, sends a message to child it finished init 3) Child finishes init and swaps refresh drivers. If we only move the VsyncParent::Create just onto the main thread without some sync code, we have a race condition in the parent process between the background thread receiving an ObserveVsync message and the main thread initializing gfxPlatform. Then we have to keep a flag that says we should start sending messages once the main thread finishes initializing. The child process would also have to wait for the parent process' main thread to finish, which would hurt start up time. Instead, the child process with this proposal could keep ticking with software timers until the main thread parent process finishes. This would give a strict ordering on when the parent / child could start sending messages to each other. Another nice benefit from letting the VsyncParent/Child know it is working with the refresh driver is that we can delete the inner VsyncObserver within VsyncRefreshDriverTimer[2] and instead have the VsyncRefreshDriverTimer inherit from a RefreshTimerVsyncObserver that isn't refcounted. There is still a little bit of ugliness because the vsync observer would not be refcounted since everything is not refcounted in the refresh driver. I think this would be ok since everything is alive throughout the whole process and is already cleaned up through ActorDestroy and nsRefreshDriver::Shutdown. I'm still not sure this is actually better than adding a lock, but conceptually it's cleaner. The really weird thing though about this bug is that I would've expected one of the nsRuntimeAborts in gfxPlatform to trigger, but maybe both threads entered gfxPlatform::init at the same time. Attached are WIP to do three things: 1) Delete the RefreshDriverVsyncObserver 2) Create the IPC calls to initialize the parent and child and sync across both of them. Once the child and parent are initialized, the child sends a message to the parent. The parent inits gfxPlatform on the main thread, then sends a message back to the child. The child then swaps refresh drivers. Until then, the child continues to tick on software timers. 3) Create a new RefreshTimerVsyncObserver, which is not refcounted. Then the VsyncRefreshDriverTimer inherits from this, no longer needs an inner class to listen to vsync, and ticks on vsync messages. The lifetime is the same as it is destroyed in nsRefreshDriver::Shutdown. I'm still not sure we want to do this, but still churning in my head. [1] https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp?from=gfxPlatform.cpp#457 [2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp#312

Attachment #8572246 - Attachment is obsolete: true

Attachment #8572583 - Attachment is obsolete: true

Attachment #8572246 - Flags: review?(bent.mozilla)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 7

•

10 years ago

Attached patch Part 2: Create Parent/Child finish init messages (obsolete) (deleted) — Details — Splinter Review

Mason Chang [Inactive] [:mchang]

Assignee

Comment 8

•

10 years ago

Attached patch WIP Part 3 Refactor RefreshDriverVsyncObserver to listen to vsync directly (obsolete) (deleted) — Details — Splinter Review

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Depends on: 1139726

Mason Chang [Inactive] [:mchang]

Assignee

Comment 9

•

10 years ago

Attached patch Part 4: MOZ_RELEASE_ASSERT gfxPlatform::mVsyncSource is not null (obsolete) (deleted) — Details — Splinter Review

Mason Chang [Inactive] [:mchang]

Assignee

Comment 10

•

10 years ago

Hmm, even artificially holding the main thread and yielding() at gfxPlatform::init doesn't crash. The VsyncParent is always created much later on OS X. Will try on other platforms later.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 11

•

10 years ago

I can't reproduce this on b2g either, even with artificially holding up the main thread. Maybe there is a race condition between creating the content process and initializing the refresh driver. Still digging, but I do think we should still never read GetVsyncSource() off main thread in general, since all the enable/disable vsync stuff must occur on the main thread.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 12

•

10 years ago

I was discussing this with Mason on IRC. Personally I think it's highly unlikely that this crash is caused by off-main-thread call of gfxPlatform::Init, because (1) there's a very narrow window in which it could happen, and it would have to be concurrent with a main-thread call of gfxPlatform::Init, (2) if that were the case I would expect to see a higher volume of crashes from [1], and (3) of the 8 crash reports on crash-stats within the last 28 days, 6 of them have uptimes of over a minute (in some cases significantly higher) so this isn't really a startup crash. I think rather than jumping in and making changes we should wait until we have more data and/or can reproduce this to figure out a proper fix. [1] http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp?rev=3da36cdb4198#456

Mason Chang [Inactive] [:mchang]

Assignee

Comment 13

•

10 years ago

Attached patch Add asserts to ensure vsync source is not null (deleted) — Details — Splinter Review

Add some assertions to ensure that the vsync source is initialized. Just some checks in case to ensure we initialized the vsync source correctly.

Attachment #8572986 - Attachment is obsolete: true

Attachment #8574133 - Flags: review?(bugmail.mozilla)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

10 years ago

Attachment #8574133 - Flags: review?(bugmail.mozilla) → review+

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1140978

Mason Chang [Inactive] [:mchang]

Assignee

Comment 14

•

10 years ago

I'd really like to revisit NUWA here and see how we're initializing and cloning PBackground correctly with NUWA.

Robert Kaiser

Updated

•

10 years ago

Keywords: crash

Mason Chang [Inactive] [:mchang]

Assignee

Comment 15

•

10 years ago

Attached file Full Crash Stack (deleted) — Details

From https://crash-stats.mozilla.com/report/index/f8c4b6ae-b9ff-43ad-872c-768e52150309 Chatting with nhirata, having a sim card might make this occur, which would explain why I can't reproduce at all. Waiting on confirmation.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 16

•

10 years ago

From irc, address 0x5a5a5a5a means jemalloc poison, so either memory mismanagement or memory corruption.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 17

•

10 years ago

Just a summary, there seems to be some intermittent that occurs with or without the sim card + sd card. It doesn't seem to be carrier related at all. Crashes occur without any reference workload. It mostly looks like we create a new process in the background and then it crashes for some reason, so probably related to NUWA. I'd like to really dig into the NUWA process and see what's supposed to happen (https://bugzilla.mozilla.org/show_bug.cgi?id=1092978#c186). It also looks like this crash occurs either all the time or never. The intermittent seems to be about 2-3 minutes between crashes. From the crash signature, we're deallocating some memory somewhere and accessing it incorrectly. Naoki and Jayme Mercado tested a custom build with the patches attached to this bug where we make explicit the startup of the Vsync Child/Parent. Both were able to crash with these patches.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 18

•

10 years ago

While running a debug build, I see this a lot: I/Gecko ( 208): [208] WARNING: A runnable was posted to a worker that is already shutting down!: file ../../../dom/workers/WorkerPrivate.cpp, line 2520 I/Gecko ( 208): [208] WARNING: Failed to update worker languages!: file ../../../dom/workers/WorkerPrivate.cpp, line 3295

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Depends on: 1141018

Mason Chang [Inactive] [:mchang]

Assignee

Comment 19

•

10 years ago

Attached file First crash (deleted) — Details

Was finally able to reproduce, reflashed everything. STR: 1) Redo v18-D full flash 2) Reset to 319 MB flame 3) Build and flash Gecko: 6e5e93073a9e11ff531570b3571f1136fde04255 4) Use lots of memory - take a video, picture, browse the web 5) Keep using more memory, connect to wifi, etc. Wait for crash

Mason Chang [Inactive] [:mchang]

Assignee

Comment 20

•

10 years ago

The kicker was to change to 319mb flame. Seems like we should handle a process getting killed.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 21

•

10 years ago

Attached file NUWA Crash after enough time (deleted) — Details

While connecting to wifi, lots of crashes in adb logcat due to NUWA crashing. We're probably sending vsync messages when we shouldn't be.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 22

•

10 years ago

4 crashes from the NUWA crash https://crash-stats.mozilla.com/report/index/00870962-339d-487f-8e7f-207962150310 https://crash-stats.mozilla.com/report/index/53b4f6b1-b17d-45e4-ba04-6cd2b2150310 https://crash-stats.mozilla.com/report/index/712232ad-c453-4081-9a83-1eb6d2150310 https://crash-stats.mozilla.com/report/index/b17771fb-fdfc-4981-aa5a-bcde62150310 Waiting on bug 1141316 to be able to download the crash dumps :/

Cervantes Yu [:cyu] [:cervantes]

Comment 23

•

10 years ago

Per the discussion with Jerry offline, the report in comment #15 looks like a use-after-free crash. 0x5a5a5a6e is dereferencing file_descriptor_set_ of a dead IPC::Message. It's still unclear how come the IPC channel can get a deleted message from its output queue and use it. My suggestion is we log the address of 'this' and the stack in Message's dtor, and before we use it in ChannelImpl::ProcessOutgingMessages(). Then we can compare to see how this message got freed.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 24

•

10 years ago

(In reply to Mason Chang [:mchang] from comment #22) > Waiting on bug 1141316 to be able to download the crash dumps :/ I'm not sure how these dumps will help you, but I downloaded them and emailed them to you.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 25

•

10 years ago

(In reply to Cervantes Yu from comment #23) > Per the discussion with Jerry offline, the report in comment #15 looks like > a use-after-free crash. 0x5a5a5a6e is dereferencing file_descriptor_set_ of > a dead IPC::Message. It's still unclear how come the IPC channel can get a > deleted message from its output queue and use it. My suggestion is we log > the address of 'this' and the stack in Message's dtor, and before we use it > in ChannelImpl::ProcessOutgingMessages(). Then we can compare to see how > this message got freed. The test patch is already at: https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c29 https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c30 Thanks to Cervantes for going through the ipc message flow.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 26

•

10 years ago

Attached patch [test] check message object use-after-free problem. (obsolete) (deleted) — Details — Splinter Review

this patch is from https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c30 use this patch to check IPC::Message object use-after-free problem

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 27

•

10 years ago

Attached patch change-refresh-timer-creation-timing.patch (obsolete) (deleted) — Details — Splinter Review

this patch is from https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c31 It creates refresh timer in ctor instead of the first call of ChooseTimer().

Mason Chang [Inactive] [:mchang]

Assignee

Comment 28

•

10 years ago

Attached file Log so far (obsolete) (deleted) — Details

From some initial investigation, it looks like we have the same message on the ChannelImpl::output_queue_. Then after we process the message, we delete it, but it gets popped again off the queue.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 29

•

10 years ago

Attached file IPC Channnel Process Startup Log (deleted) — Details

About as far as I can get is that during process creation, a message is somehow in the channel and is being deleted. The message gets deleted quite a few times, still not sure how. But this deleted message is left in one of the channel's output_queue_ [1]. When we finally get to ProcessingOutputMessages, i t's already been deleted, so we crash. [1] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.h#105

Attachment #8575691 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Comment 30

•

10 years ago

Best steps to STR: 1) Set flame to 319 MB 2) Go through FTU, don't connect to wireless 3) Load every app from the top down. e.g. load contacts, then messages, then phone, then browser, etc. 4) Keep loading apps until you get to settings. Then connect to wifi. 5) Keep loading new apps such as email, clock, etc. 6) You will have probably crashed by now. If not, restart the phone and do it again. I can reproduce 4/5 times this way.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 31

•

10 years ago

Attached patch check std::queue in ipc channel. (obsolete) (deleted) — Details — Splinter Review

My device is 319m v18d-1 flame. Thanks to Mason, I can get the crash with the STR in comment 30. I have a very interesting log. When we push a new message into the queue "output_queue_"(the type is std::queue<Message*>), the size of queue is not correct. The original queue is empty, but the size becomes 2 when we just push one data into it. In addition, the last data in queue is 0x5a5a5a5a. That's why we got 0x5a5a5a5a from queue. log: I/Gecko ( 3841): bignose message queue size error, orig:0, cur:2, front data:0xb2e2f5e0, back data:0x5a5a5a5a

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 32

•

10 years ago

Attached patch check std::queue in ipc channel. (obsolete) (deleted) — Details — Splinter Review

print the last message name before crash. I got the same message Background::Msg_PVsyncConstructor for 3 time. It's used for vsync protocol creation.

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

10 years ago

Attachment #8575931 - Attachment is obsolete: true

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 33

•

10 years ago

Hi Ben, There is a weird error in ipc outgoing message queue. The size is not correct when we push a new object(please check comment 31 and comment 32). I create the BackgroundChild at [1]. If we create background at other thread before this call(e.g. if we call GetForCurrentThread() at other thread), do we still create an individual backgroundChild for refresh driver? All vsyncChild message and nsIIPCBackgroundChildCreateCallback have main thread checking. Does it have thread racing for the outgoing message queue? Or do we have memory allocation problem for the std::queue? [1] https://hg.mozilla.org/mozilla-central/annotate/6686aacf006f/layout/base/nsRefreshDriver.cpp#l829

Flags: needinfo?(bent.mozilla)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 34

•

10 years ago

Attached patch check std::queue and OutputQueuePush() in ipc channel. (obsolete) (deleted) — Details — Splinter Review

Add more log in OutputQueuePush() function. I have another weird log in OutputQueuePush(). We change the output_queue_length_ value only at io thread, but I got a mismatch value between output_queue_length_ and the output_queue_.size(). Log: I/Gecko (26099): bignose message queue size error, orig:0, cur:2, queue_length:2, front data:0xb2e2d460, back data:0x5a5a5a5a The original queue size is 0. I think output_queue_length_ should be 1, but i got 2 here.

Attachment #8575973 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Comment 35

•

10 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #33) > Hi Ben, > > There is a weird error in ipc outgoing message queue. The size is not > correct when we push a new object(please check comment 31 and comment 32). > I create the BackgroundChild at [1]. If we create background at other thread > before this call(e.g. if we call GetForCurrentThread() at other thread), do > we still create an individual backgroundChild for refresh driver? All > vsyncChild message and nsIIPCBackgroundChildCreateCallback have main thread > checking. Does it have thread racing for the outgoing message queue? > Or do we have memory allocation problem for the std::queue? > > [1] > https://hg.mozilla.org/mozilla-central/annotate/6686aacf006f/layout/base/ > nsRefreshDriver.cpp#l829 Us breaking the queue sounds very weird. Us also breaking IPC sounds weird. I traced it to this [1], which sends the VsyncConstructor before the IPC channel gets a chance to reset all the file descriptors. If we also use the callback mechanism [2], I can't crash right now. [1] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp&case=true#833 [2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp&case=true#837

Mason Chang [Inactive] [:mchang]

Assignee

Comment 36

•

10 years ago

If we ALWAYS use the callback mechanism at [2] in comment 35, I can't crash.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 37

•

10 years ago

Attached file Log of race condition (deleted) — Details

There is a race condition during the process startup. The hello message[1] is ALWAYS sent on the main thread from what I can tell. All messages to the IPC thread in the content process do not start until AFTER the hello message is sent in the working cases. The hello message is also a UniquePtr. After the hello message is sent, all IPC messages are sent on the IPC thread. The output queue [3] is not thread safe. In our case, the PVsyncConstructor sends an IPC message to construct the Vsync Child/Parent pair [2], which occurs on the IPC thread. This message is sometimes queued on the IPC channel's output queue at the exact same time the main thread is queuing up the hello message. After the IPC thread queues the vsync constructor message, the hello message from the main thread is deleted memory. I need to read up on unique_ptr and auto_ptr to understand why this gets deleted, but we have a race condition here. In the working cases, the PVSync constructor IPC message doesn't get sent/queued until AFTER the ContentChild finishes initialization[4]. There is probably a switch somewhere during the initialization process that switches IPC messages on those channels to always send messages on the IPC thread rather than the main thread OR, only the hello message is ever expected to occur on the main thread. I also want to check to ensure that the background child is setup [5] and when is the IPC thread safe to start sending messages. Or is the IPC thread always cloned with NUWA? Cervantes, does any of this make sense? Thanks! [1] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc?from=EnqueueHelloMessage&case=true#373 [2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsrefreshDriver.cpp&case=true#833 [3] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#886 [4] https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp?from=ContentChild.cpp&case=true#533 [5] https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp?from=ContentChild.cpp&case=true#737

Flags: needinfo?(cyu)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 38

•

10 years ago

Attached patch Always use the async callback to init the content vsync refresh driver (obsolete) (deleted) — Details — Splinter Review

Instead of checking if the background actor is ready, use the async callback, which is called AFTER the background thread is initialized. Not sure this is the right fix versus it just bought us enough time for the hello messages in the ipc channel to finish executing.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 39

•

10 years ago

I think the proper answer is to probably wait for the OnChannelConnected (https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel.h#30), which is called at the hello message is sent.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 40

•

10 years ago

So I think we have a couple of options here that I can think. 1) Have the BackgroundChild not actually report as started up until the OnChannelConnected message has been received for all cloned channels. We would have to extend the BackgroundChild to listen to OnChannelConnected in the IPC channel. 2) Have NUWA delay firing NuwaAddFinalConstructor until all IPC channels have been opened and acknowledged. I'm not sure this is in the domain of NUWA yet though. 3) Change ipc_channel_posix to be thread safe. 4) Send the Hello messages on the iothread instead of the main thread and don't allow non-hello thread messages until after the ack.

Cervantes Yu [:cyu] [:cervantes]

Comment 41

•

10 years ago

From the log in comment #37 it appears that it's a race pushing messages to the output queue. BackgroundChild reports that it's up, but the underlying IPC channel is not fully cloned. In this case, NuwaAddFinalConstructor() is not suitable for you because you will send IPC messages to post a task to the IPC thread, which will race with CallNuwaSpawn(): https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2418 I think using the callback will work because it dispatches a runnable. By the time the callback runs, the hello messages is guaranteed in the queue. Changing the output queue to be thread-safe just avoids the crash, but isn't the right fix because PVSync's constructor should be sent *after* the hello message. Sending the hello message on the IO thread has a similar problem. So I would suggest delaying calling CreateContentVsyncRefreshTimer() to fix the issue.

Flags: needinfo?(cyu)

Cervantes Yu [:cyu] [:cervantes]

Comment 42

•

10 years ago

Bill, since IPC channel's output queue is not thread safe and we could enqueue the hello message on the main thread or the IPC thread, do we plan to make it thread-safe? Thanks.

Flags: needinfo?(wmccloskey)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 43

•

10 years ago

We call all NuwaAddFinalConstructor() registered callbacks in [1], and then we reset channel fd at [2][A] on the main thread. In NuwaAddFinalConstructor callback, we try to init the PVsyncChld protocol. That will post a task to ipc thread to enqueue a PVsyncConstructor message[B]. [A] enqueues a hello message on the mian thread, and [B] enqueues a PVsyncConstructor on ipc thead. Thus, we have a memory corruption in the outgoing buffer queue as comment 31. [1] https://hg.mozilla.org/mozilla-central/annotate/5334d2bead3e/dom/ipc/ContentChild.cpp#l2402 [2] https://hg.mozilla.org/mozilla-central/annotate/5334d2bead3e/dom/ipc/ContentChild.cpp#l2418

Flags: needinfo?(bent.mozilla)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 44

•

10 years ago

Attached patch Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1 (obsolete) (deleted) — Details — Splinter Review

Just clean the rd header. The PVsyncActorCreated is only used by vsync-base timer. I think we don't need to show that in header file.

Attachment #8575377 - Attachment is obsolete: true

Attachment #8575379 - Attachment is obsolete: true

Attachment #8576114 - Attachment is obsolete: true

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 45

•

10 years ago

Attached patch Fix ipc thread crash part2: post vsync timer creation task to main thread. v1 (obsolete) (deleted) — Details — Splinter Review

Post a runnable to init PVsync on the main thread. With this patch, I don't get crash with the STR in comment 30 with 10 times test.

Attachment #8576656 - Flags: review?(cyu)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 46

•

10 years ago

Comment on attachment 8576655 [details] [diff] [review] Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1 Just remove PVsyncActorCreated() from header.

Attachment #8576655 - Flags: review?(bugmail.mozilla)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 47

•

10 years ago

(In reply to Cervantes Yu from comment #42) > Bill, since IPC channel's output queue is not thread safe and we could > enqueue the hello message on the main thread or the IPC thread, do we plan > to make it thread-safe? Thanks. The hello message is normally enqueued only during construction of the channel. There's no risk that the channel can be used by any other threads at that time. However, ResetFileDescriptor also enqueues a hello message, and it seems to happen during Nuwa startup, when the channel can already be shared by other threads. So this is a Nuwa-specific problem. I'm worried that the patches here might fix vsync, but we'll have the same problem in other code. I know very little about Nuwa, but it seems like the file descriptors are getting reset too late. We should do it before we unfreeze threads. Could ContentChild register a callback via NuwaAddConstructor and reset the file descriptors there?

Flags: needinfo?(wmccloskey)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 48

•

10 years ago

(In reply to Bill McCloskey (:billm) from comment #47) > (In reply to Cervantes Yu from comment #42) > > Bill, since IPC channel's output queue is not thread safe and we could > > enqueue the hello message on the main thread or the IPC thread, do we plan > > to make it thread-safe? Thanks. > > The hello message is normally enqueued only during construction of the > channel. There's no risk that the channel can be used by any other threads > at that time. However, ResetFileDescriptor also enqueues a hello message, > and it seems to happen during Nuwa startup, when the channel can already be > shared by other threads. So this is a Nuwa-specific problem. > > I'm worried that the patches here might fix vsync, but we'll have the same > problem in other code. I know very little about Nuwa, but it seems like the > file descriptors are getting reset too late. We should do it before we > unfreeze threads. Could ContentChild register a callback via > NuwaAddConstructor and reset the file descriptors there? I agree here, the proper fix should be not to delay the vsync constructor message, but either delay the Nuwa callback constructors until after the file descriptors are set or to block non hello messages in the IPC channel until the hello messages are acked. Depending on how fast we can resolve the issue here, a temporary fix for vsync might be necessary just because we need to enable the vsync refresh driver for 2.2, with the trains switching in 2 weeks. A temp fix for vsync to post a task onto the main thread might be ok with a follow up to delete the hack once a proper fix is landed.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 49

•

10 years ago

Attached patch reset channel fd at the first nuwa final constructor callback. (obsolete) (deleted) — Details — Splinter Review

Like this one? I register the reset fd callback in ContentChild::Init(). I think that's the first callback in list. Or we need another new api for this type of callback registry(e.g. reset fd callback here)?

Attachment #8576747 - Flags: feedback?(wmccloskey)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 50

•

10 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #49) > Created attachment 8576747 [details] [diff] [review] > reset channel fd at the first nuwa final constructor callback. > > Like this one? > I register the reset fd callback in ContentChild::Init(). I think that's the > first callback in list. > Or we need another new api for this type of callback registry(e.g. reset fd > callback here)? I don't think we have a guaranteed order of when the Nuwa callbacks execute do we?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 51

•

10 years ago

Comment on attachment 8576747 [details] [diff] [review] reset channel fd at the first nuwa final constructor callback. Review of attachment 8576747 [details] [diff] [review]: ----------------------------------------------------------------- ::: dom/ipc/ContentChild.cpp @@ +673,5 @@ > #endif > > +#ifdef MOZ_NUWA_PROCESS > + if (IsNuwaProcess()) { > + NuwaAddFinalConstructor(ResetChannelFD, nullptr); I was thinking we should use NuwaAddConstructor since it runs before threads unfreeze. That would avoid the ordering problem Mason mentioned. However, somebody who understands Nuwa should review this. Maybe Cervantes?

Attachment #8576747 - Flags: feedback?(wmccloskey)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 52

•

10 years ago

From comment 51 - Can we unfreeze the threads later in NUWA? Or ensure that we reset the FDs earlier? Thanks!

Flags: needinfo?(cyu)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 53

•

10 years ago

Attached patch add a new type of callback before NuwaAddFinalConstructor called. (obsolete) (deleted) — Details — Splinter Review

Just a straightforward one. Add a new callback list and call them before NuwaAddFinalConstructor callback list.

Attachment #8576747 - Attachment is obsolete: true

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 54

•

10 years ago

Jerry, why not reset before the threads are unfrozen? That seems a lot safer to me. Isn't it possible that a thread could use a channel before it gets reset?

Flags: needinfo?(hshih)

Cervantes Yu [:cyu] [:cervantes]

Comment 55

•

10 years ago

(In reply to Mason Chang [:mchang] from comment #50) > (In reply to Jerry Shih[:jerry] (UTC+8) from comment #49) > > Created attachment 8576747 [details] [diff] [review] > > reset channel fd at the first nuwa final constructor callback. > > > > Like this one? > > I register the reset fd callback in ContentChild::Init(). I think that's the > > first callback in list. > > Or we need another new api for this type of callback registry(e.g. reset fd > > callback here)? > > I don't think we have a guaranteed order of when the Nuwa callbacks execute > do we? No. There is no guaranteed order when the callbacks are executed.

Flags: needinfo?(cyu)

Cervantes Yu [:cyu] [:cervantes]

Comment 56

•

10 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #53) > Created attachment 8577024 [details] [diff] [review] > add a new type of callback before NuwaAddFinalConstructor called. > > Just a straightforward one. Add a new callback list and call them before > NuwaAddFinalConstructor callback list. Given the callback types we have: * NuwaAddConstructor() * NuwaAddFinalConstructor() * NuwaAddThreadConstructor() Adding the 4th type of callback NuwaAddThreadFinalConstructor() is making the interface too much complicated and confusing. I would prefer using the existing one instead of adding one. As Bill commented, NuwaAddConstructor() sounds like feasible but with one caveat: the actions added to NuwaAddConstructor() mustn't perform any action that depends on another thread that is not yet recreated.

Cervantes Yu [:cyu] [:cervantes]

Comment 57

•

10 years ago

Attached file The counter proposal: reset transports with NuwaAddConstructor() (obsolete) (deleted) — Details

Thinker, do you have any concern with this?

Attachment #8577064 - Flags: feedback?(tlee)

Mason Chang [Inactive] [:mchang]

Assignee

Comment 58

•

10 years ago

(In reply to Cervantes Yu from comment #56) > (In reply to Jerry Shih[:jerry] (UTC+8) from comment #53) > > Created attachment 8577024 [details] [diff] [review] > > add a new type of callback before NuwaAddFinalConstructor called. > > > > Just a straightforward one. Add a new callback list and call them before > > NuwaAddFinalConstructor callback list. > > Given the callback types we have: > * NuwaAddConstructor() > * NuwaAddFinalConstructor() > * NuwaAddThreadConstructor() > > Adding the 4th type of callback NuwaAddThreadFinalConstructor() is making > the interface too much complicated and confusing. I would prefer using the > existing one instead of adding one. > > As Bill commented, NuwaAddConstructor() sounds like feasible but with one > caveat: the actions added to NuwaAddConstructor() mustn't perform any action > that depends on another thread that is not yet recreated. Is it guaranteed that the IPC thread will be created by the time NuwaAddConstructor callbacks fire?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 59

•

10 years ago

(In reply to Mason Chang [:mchang] from comment #58) > Is it guaranteed that the IPC thread will be created by the time > NuwaAddConstructor callbacks fire? ResetFileDescriptor just creates the hello message and enqueues it directly. I don't think it needs the IPC thread to exist.

Mason Chang [Inactive] [:mchang]

Assignee

Comment 60

•

10 years ago

(In reply to Bill McCloskey (:billm) from comment #59) > (In reply to Mason Chang [:mchang] from comment #58) > > Is it guaranteed that the IPC thread will be created by the time > > NuwaAddConstructor callbacks fire? > > ResetFileDescriptor just creates the hello message and enqueues it directly. > I don't think it needs the IPC thread to exist. Sure, I just want to make sure that a message sent on the IPC thread must occur AFTER the hello messages are enqueued. From my understanding, NuwaAddConstructor callbacks fire before the IPC thread is unfrozen? Also, do we only care that the hello messages have been queued, or must they also be acked?

Cervantes Yu [:cyu] [:cervantes]

Comment 61

•

10 years ago

It only has to be enqueued.

Cervantes Yu [:cyu] [:cervantes]

Comment 62

•

10 years ago

Comment on attachment 8576656 [details] [diff] [review] Fix ipc thread crash part2: post vsync timer creation task to main thread. v1 Review of attachment 8576656 [details] [diff] [review]: ----------------------------------------------------------------- f+ because in terms of fixing the race, delaying the creation of VSyncChild works. We need someone who knows better about the refresh driver to review.

Attachment #8576656 - Flags: review?(cyu) → feedback+

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 63

•

10 years ago

Comment on attachment 8576656 [details] [diff] [review] Fix ipc thread crash part2: post vsync timer creation task to main thread. v1 We have a race problem when we send ipc message in NuwaAddFinalConstructor callback. Please check comment 43. The temporary fix here is that we don't create the PVsyncChild protocol in that callback. Instead, we post a new task to main thread to create it. With this, all ipc cloning and reset stuff are already done(comment 48 and comment 62). After we fix the nuwa cloning issue, we can remove this patch.

Flags: needinfo?(hshih)

Attachment #8576656 - Flags: review?(roc)

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

10 years ago

Attachment #8576313 - Attachment is obsolete: true

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

10 years ago

Attachment #8577024 - Attachment is obsolete: true

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

10 years ago

Attachment #8576656 - Flags: review?(roc) → review+

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 64

•

10 years ago

Comment on attachment 8577064 [details] The counter proposal: reset transports with NuwaAddConstructor() move to Bug 1142935

Attachment #8577064 - Attachment is obsolete: true

Attachment #8577064 - Flags: feedback?(tlee)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 65

•

10 years ago

Comment on attachment 8576656 [details] [diff] [review] Fix ipc thread crash part2: post vsync timer creation task to main thread. v1 Review of attachment 8576656 [details] [diff] [review]: ----------------------------------------------------------------- ::: layout/base/nsRefreshDriver.cpp @@ +873,5 @@ > +}; > +#endif //MOZ_NUWA_PROCESS > + > +static void > +CreateNuwaContentVsyncRefreshTimer(void*) This function needs to go inside the ifdef MOZ_NUWA_PROCESS as well or this won't compile.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 66

•

10 years ago

Comment on attachment 8576655 [details] [diff] [review] Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1 Review of attachment 8576655 [details] [diff] [review]: ----------------------------------------------------------------- This looks like it's just moving some code around, and I don't see why it's needed to fix this bug. r+ anyway since it looks like a no-op but if there's no need to land this then please don't, as we will need to uplift the fix to 2.2 and the smaller the patch the better.

Attachment #8576655 - Flags: review?(bugmail.mozilla) → review+

Mason Chang [Inactive] [:mchang]

Assignee

Comment 67

•

10 years ago

(In reply to Bill McCloskey (:billm) from comment #47) > (In reply to Cervantes Yu from comment #42) > > Bill, since IPC channel's output queue is not thread safe and we could > > enqueue the hello message on the main thread or the IPC thread, do we plan > > to make it thread-safe? Thanks. > > The hello message is normally enqueued only during construction of the > channel. There's no risk that the channel can be used by any other threads > at that time. However, ResetFileDescriptor also enqueues a hello message, > and it seems to happen during Nuwa startup, when the channel can already be > shared by other threads. So this is a Nuwa-specific problem. > > I'm worried that the patches here might fix vsync, but we'll have the same > problem in other code. I know very little about Nuwa, but it seems like the > file descriptors are getting reset too late. We should do it before we > unfreeze threads. Could ContentChild register a callback via > NuwaAddConstructor and reset the file descriptors there? Hmm it just occurred to me. Why does the ResetFileDescriptor need to enqueue another hello message if it's already enqueued during the construction of the channel? Do we need to have a second hello message?

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 68

•

10 years ago

(In reply to Mason Chang [:mchang] from comment #67) > Hmm it just occurred to me. Why does the ResetFileDescriptor need to enqueue > another hello message if it's already enqueued during the construction of > the channel? Do we need to have a second hello message? I think so. This is the code that receives the hello message: http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#616 Basically, the hello message passes on the pid of the new process to the parent. Since we have a new pid after the fork, we need a new hello message.

Flags: needinfo?(wmccloskey)

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Attachment #8572978 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Attachment #8572979 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Attachment #8572980 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Comment 69

•

10 years ago

I hit the infinite crash loop on my mac today and I noticed it first occurred after I switched from nightly to the dev edition. The crash stacks that look like https://crash-stats.mozilla.com/report/index/7c23e927-c2b9-4f80-b225-ec7302150302 occur due to bug 1133526 not uplifted to the developer edition. I enabled the refresh driver without hardware vsync in about:config, which caused a crash on startup since we didn't check that hardware vsync was also enabled. Thus, we tried to get the vsync signal, but we never initialized the vsync signal on startup since the pref was off. So essentially, user error / I voided my warranty.

Mason Chang [Inactive] [:mchang]

Assignee

Updated

•

10 years ago

Depends on: 1142935

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 70

•

10 years ago

Since bug 1142935 landed, we don't need to post the vsync timer creation task to main thread.

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

10 years ago

Attachment #8576656 - Attachment is obsolete: true

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

10 years ago

Attachment #8576655 - Attachment is obsolete: true

Mason Chang [Inactive] [:mchang]

Assignee

Comment 71

•

10 years ago

From the initial bug reports, we had two major crashes with this bug. The first was related to bug 1133526 not being uplifted to aurora. Bug 1142935 fixed the race condition in NUWA reported in bug 1140978. QA and smoketest blockers today confirmed these are not occurring. The OSX vsync source crashes were resolved in bug 1144638 and not dependent on the refresh driver. Since the new nightlies on b2g + OS X enable the refresh driver by default, I'll let this sit for another couple of days to verify that the refresh driver is stable. For now, the bug is unactionable!

Mason Chang [Inactive] [:mchang]

Assignee

Comment 72

•

10 years ago

I haven't seen any refresh driver crashes on bugzilla, crash-stats, or pinged for by QA other than bug 1147753, which so far looks like an L Bug. Resolving as WFM.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → WORKSFORME

You need to log in before you can comment on or make changes to this bug.