1397539 - H264 rendering fails after several calls

Reporter

Description

•

7 years ago

After making about 10-20 calls on http://mozilla.github.io/webrtc-landing/pc_test.html with "Require H264" turned on remote video no longer gets rendered.

I repeatedly start and stop the call without reloading the page.

After several starts (10-20?) the call connects according to about:webrtc and the ICE stats indicate that data gets exchanged, but both remote video elements never start rendering anything.

When in this "blank white" state I turn off "Require H264" I get proper video rendering (with VP8) again. But as soon as I turn "Require H264" back on rendering still fails.

Reloading the page does not solve the problem. Closing the tab and trying it in another tab does not solve the problem.

This is with OpenH264 1.6, but also with the upcoming 1.7.1 releases.

Nils Ohlmeier [:drno]

Reporter

Comment 1

•

7 years ago

I create a test page to repro this: https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html

- Open the link
- give access to cam and click the check box for future permission

On my first try it took 27 connections until I start to see white boxes for both remote videos (the bigger videos).

about:webrtc suggests that no RTP gets exchanged (neither send or received) once this stage has been reached.

URL: https://nils-ohlmeier.github.io/webrt...

Randell Jesup [:jesup] (needinfo me)

Updated

•

7 years ago

Rank: 12

Bulk Bug Changes for mreavy's org

Comment 2

•

7 years ago

Mass change P1->P2 to align with new Mozilla triage process

Priority: P1 → P2

Nils Ohlmeier [:drno]

Reporter

Updated

•

7 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1284285

Nils Ohlmeier [:drno]

Reporter

Updated

•

7 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1408258

xpeng

Comment 3

•

7 years ago

we reproduced the issue using the demo : https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html
the log has been uploaded.

xpeng

Comment 4

•

7 years ago

Attached file firefox_openh264_not_work.rar (deleted) — Details

Andreas Pehrson [:pehrsons]

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1465617

Andreas Pehrson [:pehrsons]

Comment 5

•

6 years ago

We have a couple of different issues relating to H264 now. Byron could you investigate these? I'm guessing they all stem from the same root cause.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 6

•

6 years ago

Not anytime soon. I might be able to give this a look next week?

Flags: needinfo?(docfaraday)

Andreas Pehrson [:pehrsons]

Comment 7

•

6 years ago

That's fine. Can we keep the needinfo so it doesn't fall off the charts?

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 8

•

6 years ago

This seems to happen when we start encoding frames before GMP is done de-initing from the previous test... looking further, but likely a bug in either the GMP code, or the wrapper code we use to interface with GMP.

Byron Campen [:bwc]

Comment 9

•

6 years ago

Yeah, this definitely happens when the GMPParent has not been closed yet when we try to start a new call; the encoder/decoder has been torn down, but the logic in the GMPParent thinks the h264 process is still open and doesn't re-init it. This is almost certainly a bug in either the GMP code proper, or maybe the openh264 stuff.

Byron Campen [:bwc]

Comment 10

•

6 years ago

Pushing further back, this seems to happen when ActorDestroy has not been called on both encoder parents and both decoder parents before starting the new call. This does not appear to be an issue of not waiting long enough to start the new call; either all of the expected ActorDestroy calls are made in short order, or they are simply not made regardless of how long we wait.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 11

•

6 years ago

We are deadlocking in the plugin process here:

https://searchfox.org/mozilla-central/source/dom/media/gmp/GMPVideoEncoderChild.cpp#186

Byron Campen [:bwc]

Comment 12

•

6 years ago

I see the following code at the end of Encode_w (the encode function that is called on the worker thread on the plugin process) in the openh264 code:


    if (!has_frame) {
      // This frame must be destroyed on the main thread.
      TrySyncRunOnMainThread (WrapTask (
                                   this,
                                   &OpenH264VideoEncoder::DestroyInputFrame_m,
                                   inputImage));
      return;
    }

    // Synchronously send this back to the main thread for delivery.
    TrySyncRunOnMainThread (WrapTask (
                                   this,
                                   &OpenH264VideoEncoder::Encode_m,
                                   inputImage,
                                   &encoded,
encoded_type));

We are sync dispatching to main when we're done encoding a frame. Here's the implementation of EncodingComplete:

  virtual void EncodingComplete() {
    shutting_down = true;

    // Release the reference to the external objects, because it is no longer safe to call them
    host_     = nullptr;
    callback_ = nullptr;
    TearDownEncoder();

    Release();
}

TearDownEncoder is implemented thusly:

   void TearDownEncoder() {
     // Stop the worker thread first
     if (worker_thread_) {
       worker_thread_->Join();
       worker_thread_ = nullptr;
     }

     // Destroy OpenH264 encoder
     if (encoder_) {
       WelsDestroySVCEncoder(encoder_);
       encoder_ = nullptr;
     }
   }


This code is called on main. If we get a call to EncodingComplete while we are encoding a frame, "worker_thread_->Join()" will never return, because the worker thread is trying to sync dispatch back to main.

Byron Campen [:bwc]

Comment 13

•

6 years ago

I should note that, when we deadlock, the IPC logging looks like this:

[time: 1533313793767228][49950->49965] [PGMPVideoEncoderParent] Sending  PGMPVideoEncoder::Msg_Encode
[time: 1533313793767359][49950->49965] [PGMPVideoEncoderParent] Sending  PGMPVideoEncoder::Msg_EncodingComplete
[time: 1533313793767397][49965<-49950] [PGMPVideoEncoderChild] Received  PGMPVideoEncoder::Msg_Encode
[time: 1533313793769270][49965<-49950] [PGMPVideoEncoderChild] Received  PGMPVideoEncoder::Msg_EncodingComplete

This seems to line up with the openh264 bug I described in comment 12.

Byron Campen [:bwc]

Comment 14

•

6 years ago

I have filed https://github.com/cisco/openh264/issues/3007

Dan Minor [:dminor]

Updated

•

6 years ago

Blocks: 1486988

Hank Peng

Updated

•

6 years ago

Assignee: nobody → hankpeng

Hank Peng

Comment 15

•

6 years ago

This bug is supposed to be fixed by the new release v1.8.1.

Status: NEW → ASSIGNED

Dan Minor [:dminor]

Updated

•

6 years ago

Depends on: 1533001

BugBot [:suhaib / :marco/ :calixte]

Comment 16

•

3 years ago

The bug assignee didn't login in Bugzilla in the last 7 months.
:bryce, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: hankpeng → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(bvandyk)

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 17

•

3 years ago

It sounds like this should be fixed in the current OpenH264 release and needs to be verified. :bwc, it sounds like you've previously reproed this and are more familiar with the domain than I. Would you be able to test this again and close or confirm this isn't fixed as appropriate?

Severity: normal → S3

Flags: needinfo?(bvandyk) → needinfo?(docfaraday)

Priority: P2 → P3

Byron Campen [:bwc]

Comment 18

•

2 years ago

I have not had any luck reproducing this.

Status: NEW → RESOLVED

Closed: 2 years ago

Flags: needinfo?(docfaraday)

Resolution: --- → FIXED