Closed Bug 1093278 Opened 10 years ago Closed 10 years ago

b2g has crashed in libxul.so!CanonicalizeXPCOMParticipant [nsCycleCollector.cpp : 934 + 0x6]

Categories

(Firefox OS Graveyard :: Stability, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.1+, firefox34 wontfix, firefox35 fixed, firefox36 fixed, firefox-esr31 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
2.1 S9 (21Nov)
blocking-b2g 2.1+
Tracking Status
firefox34 --- wontfix
firefox35 --- fixed
firefox36 --- fixed
firefox-esr31 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: tkundu, Assigned: mccr8)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 394][caf priority: p1][CR 749784])

Crash Data

Attachments

(4 files, 2 obsolete files)

Attached file stack trace (deleted) —
STR: Random stability test for a long time.
[Blocking Requested - why for this release]: b2g has crashed Please note that we are seeing following memleak at time of crash : ==- 00051712, 1414999732, 2014-11-03 12:58:52, 2014-11-03 07:28:52 UTC, up 232334 -== | megabytes | NAME PID PPID CPU(s) NICE USS PSS RSS SWAP VSIZE OOM_ADJ USER b2g 2751 1 146854.6 0 248.2 259.2 279.2 0.0 488.2 0 root (Nuwa) 2757 2751 924.5 0 1.9 4.8 13.6 0.0 53.8 -16 root Built-in Keyboa 2953 2757 257.9 18 6.7 10.4 21.3 0.0 64.1 10 u0_a2953 Homescreen 3125 2751 4829.2 18 21.0 27.1 40.5 0.0 99.1 8 u0_a3125 (Preallocated a 32701 2757 0.2 18 5.3 8.6 18.9 0.0 60.7 1 u0_a32701 System memory info: Total 847.5 MB SwapTotal 0.0 MB Used - cache 489.1 MB B2G procs (PSS) 310.2 MB Non-B2G procs 178.9 MB Free + cache 358.4 MB Free 132.2 MB Cache 226.1 MB SwapFree 0.0 MB and /sys/class/kgsl/kgsl/page_alloc is 69472256 bytes = 69MB which clearly says kgsl memleak ny b2g process. unfortunately we don't have DMD report for now.. one more point is that 1) Crash has happened inside nsCycleCollector.cpp and system has enough memory to continue although b2g is leaking kgsl memory. I will create a separate bug for memleak issue. For this crash, we need a logging patch to confirm what exactly is happening inside gecko.
blocking-b2g: --- → 2.1?
Flags: needinfo?(khuey)
Whiteboard: [CR 749784]
b2g memleak bug 1093281 for comment 2 And here is the full logcat logs : https://drive.google.com/file/d/0B1cSMS8_GuAETGFQV0piQnI4eFE/view?usp=sharing crash timestamp is : "11-03 12:58:54.887 2752 2789 E AudioFlinger: no wake lock to update!" in full logcat logs.
I'll direct this to Andrew for now since I'm swamped.
Flags: needinfo?(khuey) → needinfo?(continuation)
Flags: needinfo?(continuation)
Flags: needinfo?(continuation)
Are these non-DEBUG builds? There are some checks we do in debug builds we could try in a non-debug build.
Flags: needinfo?(tkundu)
khuey in IRC said they are not
Flags: needinfo?(tkundu)
Whiteboard: [CR 749784] → [caf priority: p1][CR 749784]
Whiteboard: [caf priority: p1][CR 749784] → [b2g-crash][caf-crash 394][caf priority: p1][CR 749784]
Keywords: crash
Observed on: Device: msm8226 Gonk Version: AU_LINUX_GECKO_B2G_KK_2.0.01.04.00.114.127 Moz BuildID: 20141029001202 B2G Version: 2.1 Gecko Version: 34.0 Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=eb0aab0f13c78c7ac378ad860e865c4b6eaf669f Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e16ae12c5ab365be13fb2c98d3811b93b25ded6d Patches: bug 1070431, bug 1083449
Olli, what do you think this crash stack means? Maybe that the argument to CanonicalizeXPCOMParticipant is null?
Assignee: nobody → continuation
Flags: needinfo?(continuation) → needinfo?(bugs)
blocking-b2g: 2.1? → 2.1+
My guess here is that there's some kind of bug in UnmarkRemainingPurple, because I think we rarely actually call it, and it sets the mObject pointer to null. In fact, looking at the code now, if mCount > 0, in FreeBlocks(), then we null out pointers in mFirstBlock, but of course we don't actually delete it. Then if we trigger something that iterates over the purple buffer before the first block is filled, I think we'll hit null pointers. So maybe that's the issue? But it would probably be safer to just null check in VisitEntries(), if in fact it is always null like this.
Eh, it looks like UnmarkRemainingPurple() is okay, thanks to the InitBlocks() call after, so I'm not sure what is going on. We can try the null check I guess.
Attached patch Null check in nsPurpleBuffer::VisitEntries. (obsolete) (deleted) — Splinter Review
This should be pretty safe. I checked and everything that reads mObject from the purple buffer goes through VisitEntries so this shouldn't move the crash elsewhere.
Attachment #8516348 - Flags: review?(bugs)
Attached patch Null check in nsPurpleBuffer::VisitEntries. (obsolete) (deleted) — Splinter Review
Rebased on top of b2g34, though the patch looks identical, I think.
Attachment #8516348 - Attachment is obsolete: true
Attachment #8516348 - Flags: review?(bugs)
Attachment #8516354 - Flags: review?(bugs)
Olli pointed out that an assert would be good.
Attachment #8516354 - Attachment is obsolete: true
Attachment #8516354 - Flags: review?(bugs)
Attachment #8516365 - Flags: review?(bugs)
Tapas, you could try this patch. It should fix the immediate cause of the crash, assuming they are all null like this. I think we won't end up crashing elsewhere, but I can't tell for sure.
Flags: needinfo?(tkundu)
Attachment #8516365 - Flags: review?(bugs) → review+
I'm waiting on landing this until we think it actually solves the 2.1 issue. The try run against trunk is green: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e42b6404cb16
(In reply to Andrew McCreight [:mccr8] from comment #15) > Tapas, you could try this patch. It should fix the immediate cause of the > crash, assuming they are all null like this. I think we won't end up > crashing elsewhere, but I can't tell for sure. We landed this patch internally and we are waiting for feedback from our test team. I will update here asap
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #17) > (In reply to Andrew McCreight [:mccr8] from comment #15) > > Tapas, you could try this patch. It should fix the immediate cause of the > > crash, assuming they are all null like this. I think we won't end up > > crashing elsewhere, but I can't tell for sure. > > We landed this patch internally and we are waiting for feedback from our > test team. I will update here asap We are not seeing this again with your patch. It seems to me that this bug is very difficult to reproduce (we have seen only once) and I am not sure this patch really fixed this bug or not. Please go ahead and land this patch if you are ok with this :) Thanks for your help .
Flags: needinfo?(tkundu) → needinfo?(continuation)
Alright, thanks!
Flags: needinfo?(continuation)
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S9 (21Nov)
Comment on attachment 8516365 [details] [diff] [review] Null check in nsPurpleBuffer::VisitEntries. [Approval Request Comment] Bug caused by (feature/regressing bug #): unknown User impact if declined: 2.1 blocker. I want to get this on non-B2G branches, too, because we run this code on desktop, and it will get us more testing. Testing completed: TBPL Risk to taking this patch (and alternatives if risky): very low, just a null check. String or UUID changes made by this patch: none
Attachment #8516365 - Flags: approval-mozilla-beta?
Attachment #8516365 - Flags: approval-mozilla-b2g34?
Attachment #8516365 - Flags: approval-mozilla-aurora?
Attachment #8516365 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Comment on attachment 8516365 [details] [diff] [review] Null check in nsPurpleBuffer::VisitEntries. I spoke with mccr8 about this request. It's not worth taking on Beta but can provide some value on Aurora. Beta- Aurora+
Attachment #8516365 - Flags: approval-mozilla-beta?
Attachment #8516365 - Flags: approval-mozilla-beta-
Attachment #8516365 - Flags: approval-mozilla-aurora?
Attachment #8516365 - Flags: approval-mozilla-aurora+
We see that crash on ESR 31.3. so would be worth to backport it there too.
Martin, any chance you have STR for the esr crash?
Flags: needinfo?(stransky)
Comment on attachment 8516365 [details] [diff] [review] Null check in nsPurpleBuffer::VisitEntries. [Approval Request Comment] If this is not a sec:{high,crit} bug, please state case for ESR consideration: Crashes, see comment 25. I don't know how common these are. User impact if declined: crashes Fix Landed on Version: 37 maybe? recent Risk to taking this patch (and alternatives if risky): very low String or UUID changes made by this patch: none
Attachment #8516365 - Flags: approval-mozilla-esr31?
Attached file ESR 31.3 Linux STR (deleted) —
There's the stracktrace. Catched on Linux/RHEL5.11.
Flags: needinfo?(stransky)
Unfortunately stack trace isn't enough here, since that is too generic. We just have a null pointer there, but the question is why.
I see a crash where also aEntry->mParticipant in VisitEntries() is null.
If I add the aEntry->mParticipant on the same place the app crashes in GCGraphBuilder::DescribeRefCountedNode(): 2051 if (refCount == UINT32_MAX) ? >2052 Fault("overflowing refcount", mCurrPi)
It is normal for mParticipant to be null in VisitEntries(). Please file a new bug for your issue if the patch here is insufficient. Thanks.
Attachment #8516365 - Flags: approval-mozilla-esr31? → approval-mozilla-esr31+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: