Closed Bug 1657041 Opened 4 years ago Closed 3 years ago

[meta] Merge RDD and GPU processes into one

Categories

(Core :: Audio/Video: Playback, task)

task

Tracking

()

RESOLVED WONTFIX

People

(Reporter: jya, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: meta)

As presented in https://docs.google.com/document/d/1DxAyQpVtzzrzOHckwVk8CEFaFEiDA7aZW2E7GXy-9-s/edit?usp=sharing and https://docs.google.com/document/d/1JrPat99OGRn5XrterawmYRFes5fGJQgsXGvJDF4dVgo/edit?usp=sharing

To fix performance in the remote decoders; which is now even more troublesome following the recent use of AV1 with YouTube even for 1080p and over) ; we need to give to the RDD process access to the hardware via the win32k API.

This meta-bug will track the work required to have the RemoteDataDecoder framework running in the GPU process (likely to be renamed)

What will happen on Wayland, where there's no GPU process (but RDD is supported right now)?

https://docs.google.com/document/d/1JrPat99OGRn5XrterawmYRFes5fGJQgsXGvJDF4dVgo/edit?usp=sharing

Unfortunately the assumptions in this document are not correct:

"This will result in the RDD sandbox strength generally matching the current GPU process sandbox."
"Following these changes, there will be no differences between the RDD and GPU sandbox restrictions."

While we agreed that performance considerations mean that the RDD process would have to lose its Win32k lockdown (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1595994#c27), the GPU process is currently completely unsandboxed. We don't have a clear path forward to enabling it either, for the history see https://bugzilla.mozilla.org/show_bug.cgi?id=1347710#c61 and the linked bugs. Bug 1630860 is a severe blocker where we have little actionable data right now, and we're still seeing regressions pop us such as bug 1651670, and performance parity work like bug 1654617.

Moreover, given that we don't know how hard we can actually tighten the GPU sandbox, any conclusions that the strength of the GPU sandbox and the RDD sandbox would "generally match" seem entirely premature. But it is clear that if you go forward and move both in to same process, security will only ever be that of the lowest common denominator. That may be something we can live with if the performance gains make it worthwhile - but it seems premature to make conclusions about that.

What will happen on Wayland, where there's no GPU process (but RDD is supported right now)?

It's not only on Wayland, it was also disabled for the X11 case:

https://bugzilla.mozilla.org/show_bug.cgi?id=1653443
https://bugzilla.mozilla.org/show_bug.cgi?id=1653444

So I'd agree the security of that platform would also be negatively impacted.

We don't have a GPU process on macOS either FWIW.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)

https://docs.google.com/document/d/1JrPat99OGRn5XrterawmYRFes5fGJQgsXGvJDF4dVgo/edit?usp=sharing

Unfortunately the assumptions in this document are not correct:

"This will result in the RDD sandbox strength generally matching the current GPU process sandbox."
"Following these changes, there will be no differences between the RDD and GPU sandbox restrictions."

While we agreed that performance considerations mean that the RDD process would have to lose its Win32k lockdown (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1595994#c27), the GPU process is currently completely unsandboxed. We don't have a clear path forward to enabling it either, for the history see https://bugzilla.mozilla.org/show_bug.cgi?id=1347710#c61 and the linked bugs. Bug 1630860 is a severe blocker where we have little actionable data right now, and we're still seeing regressions pop us such as bug 1651670, and performance parity work like bug 1654617.

Moreover, given that we don't know how hard we can actually tighten the GPU sandbox, any conclusions that the strength of the GPU sandbox and the RDD sandbox would "generally match" seem entirely premature. But it is clear that if you go forward and move both in to same process, security will only ever be that of the lowest common denominator. That may be something we can live with if the performance gains make it worthwhile - but it seems premature to make conclusions about that.

I don't believe the document made any such conclusions. If you believe so that please add some comments to it so it can be corrected.

I don't think we can argue that the main security comes from not having any win32k access from the content process. Previous exploits only worked because decoding occurred in the content process where you could run user JS.
(and as follow-up the content process sendbox can be tightened as per bug 1381019).

Once you enable win32k in the RDD process ; what particular sandboxing would there be left?
If any, why couldn't that same restrictions be applied on the GPU process?

The only certainty is that such change will make massive improvements performance-wise. A single copy has shown to cause 7.3% of the CPU usage in bug 1589165. And that's currently the best scenario where decoding is performed in the content process.

Currently, when using the RDD we have 7 of such copies !
Once we move the decoding in the GPU process there will only ever be one in total.
As the resolution increase, the performance gain will be even greater

Ultimately, the process architecture will be similar to what blink is doing : all in the GPU

(In reply to Gian-Carlo Pascutto [:gcp] from comment #4)

We don't have a GPU process on macOS either FWIW.

did I state anywhere we had a GPU process on macOS ?

So I'd agree the security of that platform would also be negatively impacted.

Ultimately the aim is to make video playback suck less while minimising the security impact; with the direction we are taking we will ultimately gain the following:

1- Better performance
2- Greater overall security as no video decoders are running in the content process
3- All audio decoding will be performed in a wasm sandbox (in the content process)

Yes, AV1 will be theoretically less secure than now; except that currently the performance is so poor that people would not be using firefox.

BTW, those are all arguments we had discussed last year and are mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1595994#c27) ; they haven't been forgotten. Let's move forward

Once you enable win32k in the RDD process ; what particular sandboxing would there be left?
If any, why couldn't that same restrictions be applied on the GPU process?

Enabling a much lower level of sandboxing in the GPU process already breaks Firefox. See bug 1347710.

Ultimately, the process architecture will be similar to what blink is doing : all in the GPU

Note that Blink does ship with a GPU sandbox. (Which they've had to punch holes in on Linux to equal our video playback performance there. It's a hard problem for everyone!)

did I state anywhere we had a GPU process on macOS ?

That comment wasn't addressed at you: the question was asked what the state is on other platforms: https://bugzilla.mozilla.org/show_bug.cgi?id=1657041#c1.

BTW, those are all arguments we had discussed last year and are mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1595994#c27) ; they haven't been forgotten. Let's move forward

We discussed and agreed about removing win32k lockdown from RDD, not about removing the sandbox entirely, which is that this will end up being until/unless more progress can be made on bug 1347710.

It seems a long term project, setting severity to S3 for now. Feel free to change it.

Severity: -- → S3

(In reply to Jean-Yves Avenard [:jya] from comment #5)

Currently, when using the RDD we have 7 of such copies !

Can it be changed so that AV1 profits from bug 1403618 (bug 1629788/bug 1562818/bug 1653409/bug 1223270) as well if it doesn't already?
Wouldn't bug 1595994 go in this direction?

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #9)

(In reply to Jean-Yves Avenard [:jya] from comment #5)

Currently, when using the RDD we have 7 of such copies !

Can it be changed so that AV1 profits from bug 1403618 (bug 1629788/bug 1562818/bug 1653409/bug 1223270) as well if it doesn't already?

No it will not. Most of those changes are for decoders currently running in the content and those optimisations are possible because for now the content process has a relaxed sandbox that gives access to the HW.
All those improvements rely on the decoder having access to the hardware. And in the RDD, the sandbox prevents that.

Wouldn't bug 1595994 go in this direction?
Bug 1595994 would greatly improve things, and we need to get that landed. However, with bug 1595994 we still need to copy the decoded surface into a shareable image.
If the decoding was occurring in the GPU process we could directly render it to screen with no copy whatsoever. This is what this bug hopes to achieve.

bug 1595994 is a dependency on this bug.

Type: defect → task

Hi Bryce, is this still required now that the RDD process has the access to do the texture uploading?

Flags: needinfo?(bvandyk)

(In reply to Bob Owen (:bobowen) from comment #11)

Hi Bryce, is this still required now that the RDD process has the access to do the texture uploading?

My understanding is that it's still something media would like to see happen. I'll check with some folks involved and make sure that's correct. Holding NI.

(In reply to Bryce Seager van Dyk (:bryce) from comment #12)

(In reply to Bob Owen (:bobowen) from comment #11)

Hi Bryce, is this still required now that the RDD process has the access to do the texture uploading?

My understanding is that it's still something media would like to see happen. I'll check with some folks involved and make sure that's correct. Holding NI.

The above is incorrect, and at this stage a merge is not planned.

The whole answer is more complicated and is still somewhat up in the air. The current plan is to keep the processes separate, and may involve reworking them further. My understanding is this reworking is to get the sandbox as tight as possible, but because of various restrictions put in place by the codecs we're running in the processes, we still want to keep these separate (and possibly add yet more).

Flags: needinfo?(bvandyk)

(In reply to Bryce Seager van Dyk (:bryce) from comment #13)

The whole answer is more complicated and is still somewhat up in the air.

I wrote up a plan of record here: https://docs.google.com/document/d/1WDEY5fQetK_YE5oxGxXK9BzC1A8kJP3q6F1gAPc2UGE/edit#

I don't think this blocks win32k lockdown for the content process any more.

No longer blocks: win32k-lockdown

(In reply to Bobby Holley (:bholley) from comment #14)

I wrote up a plan of record here: https://docs.google.com/document/d/1WDEY5fQetK_YE5oxGxXK9BzC1A8kJP3q6F1gAPc2UGE/edit#

And per the above plan I don't think we're going to do this bug as-filed.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.