Closed Bug 1661528 Opened 4 years ago Closed 4 years ago

Poor texture upload performance on Mali

Categories

(Core :: Graphics: WebRender, defect)

ARM64
Android
defect

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox84 --- fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

In bug 1661135 we established that stuttery zooming on Mali G71/G72 can occur due to excessive texture upload.

On Adreno, we did not notice this issue because the texture upload path is fast enough to hide it on most sites. On Mali, however, it is unbearably slow. We should investigate whether we are hitting a slow path on mali, and if so find out how to get on the fast path instead.

Here is a profile: https://share.firefox.dev/2QpwggF

Profiles on Adreno prior to bug 1498732 clearly showed lots of time in glTexSubImage3D or similar functions. In the Mali profile, all the time is less intuitively spent in eglSwapBuffers(). This indicates that the upload was being done synchronously on Adreno whereas on Mali it is in fact being done asynchronously, but taking too long, resulting in eglSwapBuffers blocking for it to complete. Hopefully despite it being peformed asynchronously it is not in fact taking the optimal path, and we can get an improvement.

Severity: -- → S3
Blocks: wr-malig-release
No longer blocks: wr-mali

I have figured out that the time taken to uploaded the texture data is correlated with the number of separate glTexSubImage calls, rather than the number of bytes in total. (These are quite often inversely proportional, for example when zooming text: when zoomed out there are many glyphs to upload but they are very small, when zoomed in there are fewer glyphs but they are very large.)

Some higher level ideas we have had which could help are:

  • Upload all subpixel-position variants of a glyph together
  • Cache text-runs as opposed to individual glyphs

The latter might help a lot but I'm not convinced the former would be enough.

A lower-level idea is to maintain a copy of the texture data in a persistently-mapped PBO, and copy texture updates in to this with the same memory layout as in the texture. It is then possible to call glTexSubImage for coarser regions than the individual items being uploaded, as the existing texture data will just be overwritten by equal data.

I've hacked together a prototype of this and it appears to be a huge improvement. I still need to determine how eager we should be to group updates. eg If a texture has 2 updates, one in layer 0 and one in layer 15, do we upload all 16 regions together?

There is also a question of what to do about data written to the texture cache from render tasks rather than uploads. If our buffer doesn't contain these then a subsequent upload could overwrite the data in the texture. Hopefully it is possible to simply glReadPixels these in to our PBO.

My biggest concern, however, is of code quality. We already have 3 texture upload paths, and longer term goals to implement bug 1602550 and bug 1640952. I'm struggling to come up with a clean abstraction for this. Dzmitry, do you have any ideas for how you want the belt uploader's API to look? Do you think it could be compatible with this system?

Glenn, can you think of any issues with the above plan?

Flags: needinfo?(gwatson)
Flags: needinfo?(dmalyshau)

That's an interesting discovery!

Going with glReadPixels to update the parts of the PBO just to upload it afterwards seems rather concerning.

I'm thinking about a middle path between what we do now (separate glTexSubImage for update region) and composing everything into a single update. What if get all the sub-regions, and upload them into a 512x512 texture (packed in some way), only to follow-up with a bunch of glCopyTexImage2D from that temporary into the real places where that data need to live?

Flags: needinfo?(dmalyshau)

I quite like that idea, and we'll actually need to do something similar for Adreno 3xx anyway. (Due to bug 1513185, where PBO uploads always write to layer 0 instead of the specified layer.)

I'll hack together a prototype, hopefully it will perform well.

I like the idea kvark proposed above. It's something we've discussed previously (I forget the exact details, I think it was a similar issue). It seems like a good potential solution to reduce driver sync points.

Flags: needinfo?(gwatson)

Provides glBufferStorage and glFlushMappedBufferRange, and fixes
glClientWaitSync's return type.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED

Currently we upload texture data by first splitting the updates based
on their destination texture. Then we iterate through every update for
the texture and calculate the total required size, then allocate and
map a PBO large enough to contain the updates. Finally we fill the PBO
with the data and submit the uploads, before moving on to the next
texture. This results in a lot of allocating and mapping PBOs of
highly variable sizes, and also requires an extra iteration of the
updates to calculate the required size.

This change makes it so that we allocate fixed size PBOs. We fill the
PBOs with as many updates as they have room for, then start filling a
new one. Furthermore, we recycle the PBOs to avoid having to
reallocate them, and if possible we persistently map them too.

The TextureUploader struct can now be used to upload to more than one
texture, ensuring buffers are filled more efficiently. It is also now
possible to manually stage some data in the PBO to be uploaded, in
case the client requires more fine-grained control over the layout of
the data.

If a single upload is too large to fit within the fixed-size PBO, then
we allocate a single-use larger PBO for the upload. As before, on Mac
we unfortunately must use a separate PBO for each individual upload,
so we also allocate single-use custom-sized PBOs in that case.

Depends on D96023

Mali-Gxx performance really struggles when we make a large number of
glTexSubImage calls, even if the total amount of memory being uploaded
isn't unreasonably large. To work around this, this change makes it so
that we batch many texture updates in to fewer upload calls.

We sort the updates by their target texture format, then use a
guillotine allocator to pack the updates efficiently. Using the
TextureUploader's newly added ability to manually stage texture data,
we arrange the updates' data as they have been packed, then upload the
large texture in a single call. Finally, we blit each update from its
location in the staging texture to its correct location in the texture
cache.

Depends on D96024

Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/77c01f13f298 Update gleam to 0.13.1. r=kvark https://hg.mozilla.org/integration/autoland/rev/7ba854d835c4 Use pool of fixed-size PBOs for uploading textures. r=kvark https://hg.mozilla.org/integration/autoland/rev/5f5c269d15d6 Batch smaller texture updates together in to fewer upload calls. r=kvark
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch
Regressions: 1683266
Regressions: 1678585
Regressions: 1683936
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: