Closed Bug 1530207 Opened 6 years ago Closed 6 years ago

Startup crash in [@ InvalidArrayIndex_CRASH | mozilla::a11y::Accessible::InsertChildAt]

Categories

(Thunderbird :: General, defect)

Unspecified
All
defect
Not set
critical

Tracking

(thunderbird67+ affected, thunderbird68+ fixed)

VERIFIED FIXED
Thunderbird 68.0
Tracking Status
thunderbird67 + affected
thunderbird68 + fixed

People

(Reporter: wsmwk, Assigned: mkmelin)

References

Details

(Keywords: crash, reproducible, topcrash, Whiteboard: [startup crash][Thunderbird-testfailure: Z all][Thunderbird-disabled-test][XBL:3])

Crash Data

Attachments

(2 files)

#2 crash for nightlies, 20190222 and 20190223 builds. Looks like 2-3 users.

This bug is for crash report bp-aae15f22-c8f5-4eff-94ad-1d4660190222.

Top 10 frames of crashing thread:

0 mozglue.dll static void MOZ_Crash mfbt/Assertions.h:314
1 mozglue.dll MOZ_CrashPrintf mfbt/Assertions.cpp:55
2 xul.dll InvalidArrayIndex_CRASH xpcom/ds/nsTArray.cpp:27
3 xul.dll mozilla::a11y::Accessible::InsertChildAt accessible/generic/Accessible.cpp:2093
4 xul.dll mozilla::a11y::DocAccessible::MoveChild accessible/generic/DocAccessible.cpp:2125
5 xul.dll mozilla::a11y::DocAccessible::CacheChildrenInSubtree accessible/generic/DocAccessible.cpp:2153
6 xul.dll mozilla::a11y::DocAccessible::CreateSubtree accessible/generic/DocAccessible-inl.h:131
7 xul.dll mozilla::a11y::DocAccessible::ProcessContentInserted accessible/generic/DocAccessible.cpp:1734
8 xul.dll mozilla::a11y::NotificationController::WillRefresh accessible/base/NotificationController.cpp:766
9 xul.dll nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:1850

Sure, that's the crash we're seeing on the tree and which I reported in bug 1529872 comment #6. Ugly.

Should be a dupe of that bug eventually. Sadly FF doesn't crash.

We saw this crash in MozMill testing in message-header/test-header-toolbar.js. That test passes now without crash after bug 1531282 was fixed. So I declare this ...

Fixed by bug 1531282.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 67.0

I was wrong, the crash is still there in testing:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=05bbb10f557f67065a00841335e0801c336c1a5c&selectedJob=231331893
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=05bbb10f557f67065a00841335e0801c336c1a5c&selectedJob=231332272
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=05bbb10f557f67065a00841335e0801c336c1a5c&selectedJob=231330908

It's related to re-enabling the three tests in mail/test/mozmill/message-header/test-header-toolbar.js:
https://hg.mozilla.org/comm-central/rev/05bbb10f557f67065a00841335e0801c336c1a5c#l1.12

I'll disable them again :-(

Magnus, it's time for you to have a look. The test doesn't fail on its own, but it crashes the entire directory when those three tests are run.

Status: RESOLVED → REOPENED
Flags: needinfo?(mkmelin+mozilla)
Resolution: FIXED → ---
Target Milestone: Thunderbird 67.0 → ---

Victor, any idea what's going on here? This was first reported in bug 1529872 comment #6, which is the bug that captured the bustage after the tree conversion from XBL to CE.

Flags: needinfo?(vporof)

I'll have a look. Keeping ni?.

Good news and bad news: bug 1535265 will actually rework the tested parts, and those disabled tests are going to be removed in the process. So no crashing tests anymore. Bad news is of course that somewhere in the code there would still be code potentially triggering a crash (but hard to say if it would ever appear in the wild.)

Flags: needinfo?(mkmelin+mozilla)

Alexander, can you make any sense of this a11y crash? Only crashing on Thunderbird.

Flags: needinfo?(surkov.alexander)

bp-d1032989-a0a9-4526-be72-01be10190319 is only crash with a comment "closing settings tab".

I'm equally concerned about Bug 1529792 - Crash in [@ nsContentSecurityManager::doContentSecurityCheck] viewing successive nntp articles.

That said, the crash rates are quite low. So if we must wait for fixes then I think we'd benefit from shipping sooner, without fixes, so we start getting beta user feedback on 67 - we can set a low update rate and have time to assess the feedback.

(Thanks for the reminder, I asked over in the other bug.)

Flags: needinfo?(surkov.alexander)

Yes, easily reproducible if you build TB. From the object directory you run:
mozmake mozmill-one SOLO_TEST=message-header

It crashes in test-header-toolbar.js, but only if you run it as part of the entire directory. We've currently got three subtests disabled, so in that test change function disabled_test back to function test:
https://searchfox.org/comm-central/search?q=function+disabled_test&case=false&regexp=false&path=test-header-toolbar.js

Thanks in advance.

Sorry, on Linux or Mac it's just make. And it could be that it doesn't work on Mac at all, in which case you need to apply the patch from bug 1533922.

I've just tried it re-enabling the three tests and I still get:
Hit MOZ_CRASH(ElementAt(aIndex = 2, aLength = 0)) at c:/mozilla-source/comm-central/xpcom/ds/nsTArray.cpp:29

Attached patch enable-crashing-test.patch (deleted) — Splinter Review

For your convenience to enable the tests.

Alexandar, please see comment 11 - 13

This is now a topcrash for beta and nightly

Flags: needinfo?(surkov.alexander)
Keywords: topcrash

turned off beta updates until we have a patch to uplift.

Just of the record: I warned on TB drivers and I wanted this bug to be looked at before shipping TB 67 beta. But others knew better :-(

Just of the record: I warned on TB drivers and I wanted this bug to be looked at before shipping TB 67 beta. But others knew better :-(

Actually, I am totally comfortable and satisfied with that decision, and the results of that decision are excellent. The crashing could have been worse, or could have been better. Fact is we (including you) couldn't possibly know until we shipped.

As hoped it's not a crash rate which left users immobilized. Quite the opposite we are at ~50% updated and that crash rate is only double the previous #1 rated crash - that's pretty OK. I only turned off updates so as to not inconvenience those who haven't yet updated and because we now have the benefit of a) knowing there are not other new crashes associated with 67, b) knowing that the release is reasonably usable and c) having enough users on it that we get feedback that wasn't possible from nightlies. Again, that was the desired result.

~75% of crashes are startup, which we failed to note in the analysis.

But ~25% of crashes are NOT startup, which makes this Kinda interesting. Examples below. Do we have two different issues?
bp-f5f7f454-9126-409a-a400-99a920190401 2 minutes uptime
bp-d1032989-a0a9-4526-be72-01be10190319 6 minutes uptime
bp-a2fbc823-7f4d-45d0-bd9f-b394a0190322 16 minutes uptime

And extremely rare, is a couple linux crashes
bp-1ccc5e3e-c447-48dc-b753-9eb060190404 3 hours uptime
bp-6f6f0a7b-1679-44c2-b8ef-e7d5f0190405 17 hours uptime

Flags: needinfo?(mkmelin+mozilla)
Summary: Crash in [@ InvalidArrayIndex_CRASH | mozilla::a11y::Accessible::InsertChildAt] → Startup crash in [@ InvalidArrayIndex_CRASH | mozilla::a11y::Accessible::InsertChildAt]
Whiteboard: [startup crash]

I looked around, and while I assume there's some higher level bug, this looks very suspicious:
https://searchfox.org/comm-central/rev/29215b2012a35d80ea879bb5403290caa3f41242/mozilla/accessible/generic/DocAccessible.cpp#2129-2149

There we first check if the index is ok, then go on to remove one child, and go on, but then the first check regarding aIdxInParent would not necessarily be true anymore.

Flags: needinfo?(mkmelin+mozilla)

That link doesn't work. You can't use comm-central ... /mozilla links. This is what you want:
https://searchfox.org/mozilla-central/source/accessible/generic/DocAccessible.cpp#2129-2149

I'm not knowledgeable enough to know what's going on here and it looks like asurkov is looking at this already, removing ni.

Flags: needinfo?(vporof)

At a second look, that's changing different nodes so is not the issue.
But, looks like the below should be && instead of ||. Else -1 (which is passed in to the unsigned index later) is always true.

https://searchfox.org/mozilla-central/source/accessible/generic/DocAccessible.cpp#2135

bool hasInsertionPoint =
(aIdxInParent != -1) ||
(aIdxInParent <= static_cast<int32_t>(aNewParent->ChildCount()));

For -1 that would be false || true, which is still true.
I'll make a patch for that.

Assignee: nobody → mkmelin+mozilla

(In reply to Magnus Melin [:mkmelin] from comment #22)

At a second look, that's changing different nodes so is not the issue.
But, looks like the below should be && instead of ||. Else -1 (which is passed in to the unsigned index later) is always true.

https://searchfox.org/mozilla-central/source/accessible/generic/DocAccessible.cpp#2135

bool hasInsertionPoint =
(aIdxInParent != -1) ||
(aIdxInParent <= static_cast<int32_t>(aNewParent->ChildCount()));

For -1 that would be false || true, which is still true.
I'll make a patch for that.

Good catch! Alternatively (aIndexInParent >= 0 && aIndexInParent <= aNewParent->ChildCount()). Might be a bit more readable and error-prone.

Flags: needinfo?(surkov.alexander)

I changed the code in question to

-  bool hasInsertionPoint =
-      (aIdxInParent != -1) ||
-      (aIdxInParent <= static_cast<int32_t>(aNewParent->ChildCount()));
+  bool hasInsertionPoint = (0 <= aIdxInParent &&
+    aIdxInParent <= static_cast<int32_t>(aNewParent->ChildCount()));

and ran our test suite with the crash-producing tests enabled:
mozmake mozmill-one SOLO_TEST=message-header

It crashes just the same: :-(
Hit MOZ_CRASH(ElementAt(aIndex = 2, aLength = 0)) at c:/mozilla-source/comm-central/xpcom/ds/nsTArray.cpp:29
#01: mozilla::a11y::DocAccessible::MoveChild (c:\mozilla-source\comm-central\accessible\generic\DocAccessible.cpp:2149)
#02: mozilla::a11y::DocAccessible::CacheChildrenInSubtree (c:\mozilla-source\comm-central\accessible\generic\DocAccessible.cpp:2176)

Dug further and found the problem, but I'm not sure what the correct fix is.

The above mentioned fix is correct (but irrelevant). What happens is that the ChildCount()s will check check the wrong thing for trees: https://searchfox.org/mozilla-central/rev/6dab6dad9cc852011a14275a8b2c2c03ed7600a7/accessible/xul/XULTreeAccessible.cpp#354

With the below change, things will work correctly for our case (but I assume not in others, didn't check yet):

  • if (!true) return childCount;
  • if (!mTreeView) return childCount;

Since the childCount will count the rows for trees, and Accessible::InsertChildAt will look only at mChildren.Length() there will be a mismatch leading to a crash.

  • if (true) return childCount;
  • if (!mTreeView) return childCount;

if (true) was the change that made it work

Let's count the three disabled tests in test-header-toolbar.js here. Note that they will be removed in bug 1535265, but it would be good to fix the crash first before removing the reproducible evidence.

Status: REOPENED → ASSIGNED
Whiteboard: [startup crash] → [startup crash][Thunderbird-testfailure: Z all][Thunderbird-disabled-test][XBL:3]

(In reply to Magnus Melin [:mkmelin] from comment #25)

Dug further and found the problem, but I'm not sure what the correct fix is.

The above mentioned fix is correct (but irrelevant). What happens is that the ChildCount()s will check check the wrong thing for trees: https://searchfox.org/mozilla-central/rev/6dab6dad9cc852011a14275a8b2c2c03ed7600a7/accessible/xul/XULTreeAccessible.cpp#354

With the below change, things will work correctly for our case (but I assume not in others, didn't check yet):

  • if (!true) return childCount;
  • if (!mTreeView) return childCount;

Since the childCount will count the rows for trees, and Accessible::InsertChildAt will look only at mChildren.Length() there will be a mismatch leading to a crash.

Accessible::InsertChildAt operates on mChildren, so this part looks correct. It should assert on a wrong index though. I think DocAccessible::CacheChildrenInSubtree could be changed to use mChildren.Length() instead ChildCount(), what should fix the issue, but not sure wheather MoveChild() make sense on XUL trees at all. Do you know what caused MoveChild()? Is it because of aria-owns usage or it comes somehow natively?

For XULTreeAccessible, the ChildCount() is not only the mChildren, so check mChildren directly to make sure we stay within bounds

I made a try run earlier today which looks pretty good (and the THunderbird tests pass)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f885de35e89fd78b19106d682b2cd6a593e38424

Updated DocAccessible::CacheChildrenInSubtree to check mCHildren directly too. Unfortunately I don't know what's causing MoveChild, or the details of why it happens. For the crashing Thunderbird test case it is only triggered when running over all the tests in the directory (and not run one by one), so it's really hard to pinpoint what goes wrong. It does have something to do with customizing toolbar elements.

(In reply to Magnus Melin [:mkmelin] from comment #32)

It does have something to do with customizing toolbar elements.

How might that relate to the crash for users being on startup?

Flags: needinfo?(mkmelin+mozilla)

Can't say. Running the mozmill tests is the only way I know that reproduce this. There might be more than one thing triggering the crash of course.

Flags: needinfo?(mkmelin+mozilla)
Pushed by mkmelin@iki.fi: https://hg.mozilla.org/integration/autoland/rev/5982eef0c2cb fix crash [@ InvalidArrayIndex_CRASH | mozilla::a11y::Accessible::InsertChildAt] . r=surkov
Target Milestone: --- → Thunderbird 68.0

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/13e57aaba5a0
re-enable three tests in test-header-toolbar.js. r=me

Status: ASSIGNED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED

Thanks for the patch.
Good news! No trunk crashes after build 20190418003825.

Can we uplift to beta to kill the #2 beta crasher?

Status: RESOLVED → VERIFIED

Well, this is really an M-C bug, so I'd have to do it on a branch for the TB 67 beta 2.

(In reply to Jorg K (GMT+2) from comment #39)

Well, this is really an M-C bug, so I'd have to do it on a branch for the TB 67 beta 2.

I was just throwing the question to the wind - whatever you guys arrange, landing on M-beta or a branch, is fine with me so we can turn beta updates back on. Fortunately any branch will be short lived.

TB 67 beta 2:
https://hg.mozilla.org/releases/comm-beta/rev/c6411719c75a4c91d0454aeb529d293b3f3d9ae3
Re-enabled tests to make sure there's no crash ;-)

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: