Closed Bug 1846957 Opened 1 year ago Closed 1 year ago

Crash in [@ mozilla::TrackBuffersManager::RemoveFrames], [@ mozilla::TrackBuffersManager::TrackData::GetTrackBuffer]

Categories

(Core :: Audio/Video: Playback, defect, P1)

defect

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 --- unaffected
firefox116 --- unaffected
firefox117 --- unaffected
firefox118 blocking fixed

People

(Reporter: aryx, Assigned: alwu)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(4 files)

48 crashes from 23 installations with Firefox 118.0a1 after bug 1845213 had landed.

Crash report: https://crash-stats.mozilla.org/report/index/7a498ea6-f82f-4ddf-9861-43fe30230803

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(data.IsEmpty())

Top 10 frames of crashing thread:

0  xul.dll  mozilla::TrackBuffersManager::RemoveFrames  dom/media/mediasource/TrackBuffersManager.cpp:2495
1  xul.dll  mozilla::TrackBuffersManager::CodedFrameRemoval  dom/media/mediasource/TrackBuffersManager.cpp:690
2  xul.dll  mozilla::TrackBuffersManager::ProcessTasks  dom/media/mediasource/TrackBuffersManager.cpp:255
3  xul.dll  mozilla::TrackBuffersManager::QueueTask  dom/media/mediasource/TrackBuffersManager.cpp:176
4  xul.dll  mozilla::TrackBuffersManager::CodedFrameRemovalWithPromise  dom/media/mediasource/TrackBuffersManager.cpp:627
5  xul.dll  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<mozilla::media::Interval<mozilla::media::TimeUnit> > >::apply<mozilla::TrackBuffersManager, RefPtr<mozilla::MozPromise<bool, nsresult, 1> >  const  xpcom/threads/nsThreadUtils.h:1164
5  xul.dll  std::invoke  /builds/worker/fetches/vs/VC/Tools/MSVC/14.29.30133/include/type_traits:1534
5  xul.dll  std::_Apply_impl  /builds/worker/fetches/vs/VC/Tools/MSVC/14.29.30133/include/tuple:974
5  xul.dll  std::apply  /builds/worker/fetches/vs/VC/Tools/MSVC/14.29.30133/include/tuple:979
5  xul.dll  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<mozilla::media::Interval<mozilla::media::TimeUnit> > >::apply  xpcom/threads/nsThreadUtils.h:1162
Flags: needinfo?(alwu)
Crash Signature: [@ mozilla::TrackBuffersManager::RemoveFrames] → [@ mozilla::TrackBuffersManager::RemoveFrames] [@ mozilla::TrackBuffersManager::TrackData::GetTrackBuffer]

Set release status flags based on info from the regressing bug 1845213

Assignee: nobody → alwu
Severity: -- → S2
Flags: needinfo?(alwu)
Priority: -- → P1
Keywords: leave-open

Depends on D185346

Setting as release blocker given the volume on Nightly.

Attachment #9347255 - Attachment description: Bug 1846957 - remove assertion because there could be a chance where some data with zero duration are still in the buffer. → Bug 1846957 - change assertion to check data's duration.
Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/428d57ac1a6b convert interval to sample's base before doing any comparison or operations. r=padenot https://hg.mozilla.org/integration/autoland/rev/069a74f7cc88 change assertion to check data's duration. r=padenot

:alwu Since the patches landed the volume has dropped in Fenix but there are still crashes. The Desktop crash volume is still high.

Flags: needinfo?(alwu)

Yes, I'm still monitoring this crash.

Flags: needinfo?(alwu)
Duplicate of this bug: 1406118

In this crash report, I found that when doing remove frame, the interval we supposed to remove was [80,000,000-2,692,131,704], but we only removed [80,000,000-167,985,000], which left [167,985,000-2,692,131,704] in the buffered range but there was already no data left in our buffer.

So the question here would be, if the highest timestamp in data was 167,985,000, why the highest end time in our buffered range was 2,692,131,704? It seems there would still be mismatching when adding frames. I am going to add another assertion to see if we can capture this problem.

Attachment #9347989 - Attachment description: Bug 1846957 - check whether the end of the buffered range matches the highest end time. → Bug 1846957 - ensure the sample range should match the sample time if it grows.
Attachment #9347989 - Attachment description: Bug 1846957 - ensure the sample range should match the sample time if it grows. → Bug 1846957 - ensure the sample range matches the sample time when it grows.
Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a3d69906f45e ensure the sample range matches the sample time when it grows. r=padenot
Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/48324d5aa385 ensure the sample range matches the sample time when it grows. r=padenot

For these two crashes [1], when the buffered range was empty, there was still a key frame left in the buffer. That makes me wonder if we have any logic error in this part of code. One guess is that the data in the buffer are stored based on the order of DTS, instead of PTS, would it be possible we break out the loop too early?

Another crash [2] showed that there were still 944 samples left when the buffered range was empty.... I still can't get the reasonable explanation about that. If we can't find the root cause in time, then I will need to downgrade those assertions to MOZ_ASSERT and keep debugging them until finding the real problem. Paul, if you have any insight in terms of these crashes, please let me know! Thanks!

[1]
https://crash-stats.mozilla.org/report/index/5c5f22a3-2f62-494f-83e5-4bb6a0230810
https://crash-stats.mozilla.org/report/index/c5d35fa7-e88a-4857-a037-42fce0230810

[2]
https://crash-stats.mozilla.org/report/index/cedaef99-ae0e-4769-859a-ee7100230810#tab-details

Flags: needinfo?(alwu) → needinfo?(padenot)
Pushed by padenot@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4bcd192af033 turn the assertion to debug-only, and do adjustment to ensure that the buffered range and data should match on non-debug build. r=padenot

(In reply to Alastor Wu [:alwu] from comment #16)

For these two crashes [1], when the buffered range was empty, there was still a key frame left in the buffer. That makes me wonder if we have any logic error in this part of code. One guess is that the data in the buffer are stored based on the order of DTS, instead of PTS, would it be possible we break out the loop too early?
https://crash-stats.mozilla.org/report/index/cedaef99-ae0e-4769-859a-ee7100230810#tab-details

Data is always stored in DTS order, not PTS order.
Could it be that there's a sample with a duration of 0 but has a high pts/dts?

Fx118 merge day is on 2023-08-28.
Wondering if there any additional patches planned under this bug?
There is a leave-open keyword is on the bug, and there are no crash reports since the latest patch landed.

Blocks: 1849216

We can now close this bug, because the crash rate has been significantly reduced. In the meanwhile, I filed bug 1849216 to keep tracking this issue because we didn't find out the real problem yet.

Could it be that there's a sample with a duration of 0 but has a high pts/dts?
I thought about this possibility before, so I added an assertion to ensure that if the duration of sample should be zero if the buffered range is not empty. But that also failed. Crash reports showed that there were still real sample (with non-zero duration) left in the buffer, but the range is already zero. (Also saw opposite situation, zero data left, but still had a buffered range) I still suspect that there might be some incorrect time unit conversion somewhere, but I couldn't find it now.

Status: NEW → RESOLVED
Closed: 1 year ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 118 Branch
Flags: needinfo?(padenot)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: