Closed Bug 1640441 Opened 4 years ago Closed 4 years ago

Slack hangs indefinitely in a onResize loop

Categories

(Web Compatibility :: Desktop, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Harald, Assigned: emilio)

References

(Depends on 1 open bug)

Details

(Keywords: perf, webcompat:site-wait, Whiteboard: [wfh])

Attachments

(1 file, 1 obsolete file)

In my personal profile. I have Fission enabled for a few weeks now and had Slack hang indefinitely around 4 times.

This time it happened when I clicked an external link and managed to remember to profile: https://perfht.ml/3d4vT52

First look suggests an infinite react-state/resize loop.

So I hit this really recently too in #firefox, and took a profile as well: https://perfht.ml/2X0m2Yl

Given our discussion in the thread, this seems likely a Slack bug: https://mozilla.slack.com/archives/C4D3JFF26/p1590296114130000

But would be good to get them involved to either fix it or tell us where is Gecko going wrong :-)

Component: General → Desktop
Flags: needinfo?(miket)
Product: Core → Web Compatibility

Also, I checked and this also affects 76, so it's probably urgent-ish.

At least for me, as mentioned on that thread, it seems related to the black tooltip that shows up when hovering links. Some of them work, some hang.

And for reference the link I was seeing deterministic hangs on is https://news.ycombinator.com/item?id=23287038.

Ok, multiple people reproduced it just hovering over a message with just that link. At least me (Linux) and other two other folks on Mac.

This is a HiDPI screen, which may be relevant if this is related to pixel rounding (window.devicePixelRatio is 2).

I've also experienced this on Windows with a HiDPI screen.

FWIW when I filed bug 1637425 Firefox with my Slack tab was on a low DPI screen (attached to a laptop with a high DPI display which was also active).

Severity: -- → S2
Priority: -- → P1

Just sent an email an engineering contact at Slack.

Flags: needinfo?(miket)

I've also contacted someone I know on the engineering team.

Is there anything we can do on our end here? This bug is really annoying...

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

Also, I checked and this also affects 76, so it's probably urgent-ish.

I just managed to reproduce this all the way back to 64 with mozregression (earlier versions of Firefox just crash when I use them with my profile with mozregression, probably unrelated to this bug). My reproduction steps are simple:

This may well be a bug in slack, but IMO it's still on us to try and fix what's going on. When it happens, it affects not only Slack but my entire Firefox session. This is a serious usability issue -- I'm now at the point of restarting Firefox several times a day because of this bug, if it continues for too much longer I'll probably have no choice but to install the Slack application which I really don't want to do.

Whiteboard: [wfh]

I just hit this by attempting cmd+click of a link... to this bug from a PM. I guess we still don't really understand what the issue is?

(edit: it seems like it's enough to just close the tab, and re-open a new one if you want to avoid a full browser restart. but that's super annoying)

Attached file relevant unminified code. (deleted) —

I'm poking at this fwiw. This is the code that doesn't reach an stable state... Still trying to figure out why, but maybe different rounding from getBoundingClientRect and offsetHeight or something of that sort.

Assignee: nobody → emilio

It is FP error, btw:

bounds 
Object { left: 331.9583435058594, top: 424.54998779296875 }
 contentBounds 
Object { left: 331.9583282470703, top: 424.54998779296875 }

So, relevant DOM and frame tree:

div@0x7fa55c0198b0 style="position: absolute; left: 331.958px; top: 674.75px; outline: currentcolor none medium; transition-duration: 80ms;" class="ReactModal__Content ReactModal__Content--after-open popover c-popover__content" tabindex="-1" role="presentation" state=[100020800000] flags=[00088402] primaryframe=0x7fa54d3a9178 refcount=7<
  div@0x7fa55c019820 role="presentation" state=[100020800000] flags=[00088404] primaryframe=0x7fa54d3a8fe8 refcount=2056<
    div@0x7fa55c019790 id="sk-tooltip-340" role="tooltip" class="c-tooltip__tip c-tooltip__tip--top c-tooltip__tip--small c-tooltip__tip--link" data-qa="tooltip-tip" data-sk="tooltip" state=[100020800000] flags=[00088400] primaryframe=0x7fa54d3a9c70 refcount=5<
      Text@0x7fa55c017c80 flags=[00300000] primaryframe=0x7fa54d3a98e0 refcount=3<https://news.ycombinator.com/item?id=23287038>
      div@0x7fa55c019700 class="c-tooltip__tip__arrow" style="left: 157.158px; right: initial;" state=[100020800000] flags=[00008000] primaryframe=0x7fa54d3a81e0 refcount=2057<>
    >
  >
>
Block(div)(0)@7fa54d3a9178 parent=7fa54d3a9620 (19918,40485,18859,2640) [state=0042062000d00300] [content=7fa55c0198b0] [cs=7fa54e156b58]<
  line 7fa552dd8d30: count=1 state=block,clean,prevmarginclean,not-impacted,not-wrapped,before:nobr,after:nobr[0x8] (0,0,18859,2640) <
    Block(div)(0)@7fa54d3a8fe8 parent=7fa54d3a9178 (0,0,18859,2640) [state=0002020000100200] [content=7fa55c019820] [cs=7fa54e156f18]<
      line 7fa552dd8e28: count=1 state=inline,clean,prevmarginclean,not-impacted,not-wrapped,before:nobr,after:nobr[0x180] (0,0,18859,2640) <
        Block(div)(0)@7fa54d3a9c70 parent=7fa54d3a8fe8 (0,0,18859,2160) vis-overflow=(0,0,18859,2521) scr-overflow=(0,0,18859,2521) normal-position=(0,0) [state=0002062000d00220] [content=7fa55c019790] [cs=7fa551360978]<
          line 7fa552dd8f20: count=2 state=inline,clean,prevmarginclean,not-impacted,not-wrapped,before:nobr,after:nobr[0x100] (720,480,17419,1080) <
            Text(0)"https://news.ycombinator.com/item?id=23287038"@7fa54d3a98e0 parent=7fa54d3a9c70 next=7fa558aafaf8 (720,540,17419,960) [state=00010000b0600000] [content=7fa55c017c80] [cs=7fa55172ca68:-moz-text] [run=7fa5516b92e0][0,45,T] 
            Placeholder(div)(1)@7fa558aafaf8 parent=7fa54d3a9c70 (18139,1320,0,0) [state=0000000000200000] [content=7fa55c019700] [cs=7fa55d53fa68:-moz-oof-placeholder] outOfFlowFrame=Block(div)(1)@7fa54d3a81e0
          >
          AbsoluteList 0x7fa5513c0fc0 <
            Block(div)(1)@7fa54d3a81e0 parent=7fa54d3a9c70 (9430,2011,509,509) vis-overflow=(-360,-211,720,721) scr-overflow=(-360,-211,720,721) [state=004026a000d10300] transformed [content=7fa55c019700] [cs=7fa54e159028]<
            >
          >
        >
      >
    >
  >
>

They have a mutation observer for everything in that tooltip. Positioning the tooltip causes a mutation of the style attribute of tooltip arrow's style attribute, with a stack like this:

0 Hn(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:847085]
1 Cf(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:902745]
2 Fs/<(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:915875]
3 () ["self-hosted":948:0]
4 KqkS/t.unstable_runWithPriority(e = "1", t = [function]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:198347]
5 za(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:860500]
6 Fs(e = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:913815]
7 vs(e = [unavailable], [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:909500]
8 vs([unavailable]) ["self-hosted":891:0]
9 Ja/<() ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:860790]
10 KqkS/t.unstable_runWithPriority(e = "1", t = [function]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:198347]
11 za(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:860500]
12 Ja() ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:860738]
13 Qa() ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:860671]
14 Is(e = [unavailable], t = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:906068]
15 enqueueSetState(e = [unavailable], t = [unavailable], n = [unavailable], [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:865578]
16 viRO/A.prototype.setState(e = [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/gantry-vendors.c07873c.min.js?cacheKey=gantry-1591870348":1:1155126]
17 measureContents() ["https://a.slack-edge.com/bv1-8-d2a8b3a/client-boot-imports.3281577.min.js?cacheKey=gantry-1591870348":318220:29]
18 onResize([unavailable], [unavailable]) ["https://a.slack-edge.com/bv1-8-d2a8b3a/client-boot-imports.3281577.min.js?cacheKey=gantry-1591870348":318072:92]
19 onResize([unavailable], [unavailable]) ["self-hosted":844:0]

Which in turn will cause another iteration of onResize. But the tooltip arrow changes the x of the result of getBoundingClientRect(). Here's an iteration of the hang:

// With div@0x7fa55c019700 class="c-tooltip__tip__arrow" style="left: 157.158px; right: initial;" state=[100020800000] flags=[00008000] primaryframe=0x7fa54d3a81e0 refcount=2056<>
getBoundingClientRect's aLayoutRect: (const nsRect &) @0x7fffec8f9830: {<mozilla::gfx::BaseRect<int, nsRect, nsPoint, nsSize, nsMargin>> = {x = 19918, y = 40485, width = 18859, height = 2640}, <No data fields>}
set left aValue: 157.15834045410156px
// After: div@0x7fa55c019700 class="c-tooltip__tip__arrow" style="left: 157.158px; right: initial;" state=[100020800000] flags=[00008000] primaryframe=0x7fa54d3a81e0 refcount=2058<>
// But we report change, because internally we store it with a bit more precision: 157.158325px vs 157.15834px. so onResize will run again!
// With div@0x7fa55c0198b0 style="position: absolute; left: 331.958px; top: 674.75px; outline: currentcolor none medium; transition-duration: 80ms;" class="ReactModal__Content ReactModal__Content--after-open popover c-popover__content" tabindex="-1" role="presentation" state=[100020800000] flags=[00008402] primaryframe=0x7fa54d3a9178 refcount=6<
set left aValue: 331.9583282470703px
// After: div@0x7fa55c0198b0 style="position: absolute; left: 331.958px; top: 674.75px; outline: currentcolor none medium; transition-duration: 80ms;" class="ReactModal__Content ReactModal__Content--after-open popover c-popover__content" tabindex="-1" role="presentation" state=[100020800000] flags=[00088402] primaryframe=0x7fa54d3a9178 refcount=7<
// But again, we report a change, though in this case it doesn't matter much because there's no observer going on.
getBoundingClientRect's aLayoutRect: (const nsRect &) @0x7fffec8f9830: {<mozilla::gfx::BaseRect<int, nsRect, nsPoint, nsSize, nsMargin>> = {x = 19917, y = 40485, width = 18859, height = 2640}, <No data fields>}
// Note how the x coordinate changes a fraction of a CSS pixel. That makes slack come up with different measurements.
Then we try to set left with value 157.1583251953125px and 331.9583435058594px respectively, rinse and repeat.

So this is a minor floating point difference in the style, which causes a minor subpixel difference in layout, and thus in the return value of getBoundingClientRect, which causes them to come up with an slightly different measurement.

So if this doesn't happen in other engines it seems mostly by chance (other engines use different floating point precision in layout, we use 1/60th of a CSS pixel and Blink uses 1/64th).

The floating point shenanigans we do in getBoundingClient rect here may be relevant to some extent, but this seems like an issue with Slack still.

Or maybe other browsers store it with different precision in the style attribute which causes us to fire the mutation observer, but not them...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=49c440f54dcff7ef5004747ae4c07442bfed966e seems to "fix" it... Though maybe it just fixed the case I could repro. I think the change makes sense though.

There are some issues with this code. It's using doubles, but SetRect
only takes floats, so we lose precision there already. Instead of
getting fancy, use the same conversion that the rest of Gecko uses, so
that pages that compute style values out of getBoundingClientRect values
don't get minor floating point changes.

How do we make sure that this patch does not create new bugs where pages rely on the current behavior? However, I'm not sure whether it is even possible to rely on it...

Generally the patch makes rounding consistent between CSSOM and Layout and getBoundingClientRect, so I think it's a strict improvement. Depending on this behavior in such a way is pretty hard, fwiw... Specially since engines have different fixed-point factors.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=49c440f54dcff7ef5004747ae4c07442bfed966e seems to "fix" it... Though maybe it just fixed the case I could repro. I think the change makes sense though.

I just confirmed it fixes the one link where I could reproduce 100% of the time (on macOS).

(But yeah, still it's a mildly scary change)

Pushed by ealvarez@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/edfbe6020f30 Don't do fancy floating point precision shenanigans on DOMRect::SetLayoutRect. r=dholbert

Ah, try auto fail, there's my dogfood ;)

Ok, so my patch effectively increases the precision of getBoundingClientRect() in a way that breaks consumer's expectations, because suddenly stuff that should match doesn't because floating point is fun.

Blink and WebKit have a much lower precision for these, because their LayoutUnit fixed-point denominator is 64, so they don't hit crazy fractional units when doing float conversion. We could consider lowering the precision of getBoundingClientRect and co. to be similar to theirs I guess, though I need to think about all the implications...

So after thinking a bit more and talking with Daniel, I've decided to put the current state of affairs here, because there doesn't seem like any of the potential solutions below is either trivial or particularly satisfying.

To recap for the folks that I'm cc'ing, this is effectively a result of nscoord not being precisely representable as a float, and slack introducing minor floating point errors that cause us to switch layout units back and forth.

When it comes to exposing a layout rect in app units in float precision, we have the following alternatives:

  • Right now, we round to the closest 2^16, so that the results make sense for websites. But that means that minor floating point precision errors from JS code can cause us to move 1 fractional unit or such, which is the cause of this bug.

  • My patch removes that rounding, which means that nscoord values get represented much more closely to the floats we return, making the issue above not happen. However, I suspect that's going to bring more pain than joy, because that means that suddenly results from getBoundingClientRect have minor differences like in the tests that fail in comment 25, even though it should be the same value, depending on the order of operations. We could try to push through with this, I guess, though I'm not convinced it's going to be great and it's likely to break some expectations from web developers...

  • There's the option to round to a smaller power of two (2^14, etc...). The smaller we go, the easier it gets to make div.style.left = div.getBoundingClientRect().left + "px"; not round-trip, but the harder it is for JS to cause this kind of bug.

  • There's the option to round to a bigger power of two, I guess, increasing the precision of getBoundingClientRect(), which makes it closer to the behavior of my patch, but still potentially hitting these cases which isn't great. It might be worth a shot? I need to think harder about this.

  • There's the option to switch representation of layout units to something that is precisely representable as a float. Choosing a power of two like Blink / WebKit does also has the benefit of faster multiplication and division. It loses the properties outlined in bug 177805 comment 7 and such, which is a bit unfortunate. In any case that is not a call I can make alone.

cc'ing some folks which may have opinions on all the issues above...

Still the easiest fix is for slack to drop a few toFixed(2)s around the code in attachment 9155947 [details]... :-)

Flags: needinfo?(emilio)

Rounding to a smaller power of two may be the easiest / safest short-term solution if we need to fix this bug ASAP.

(ni? myself to try to dig up more contacts)

Flags: needinfo?(miket)

OK, sent a few more messages.

Flags: needinfo?(miket)

I just got news via a Slack contact that this has been fixed -- I've just verified after clearing the cache (via cmd + opt + r) that it no longer reproduces on the 100% reproducible case I had. In case it's useful for someone in the future, Slack said:

You may need to unregister your service worker if you are still experiencing the issue.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Emilio, when you get a chance, can you verify the fix on your ycombinator link?

Flags: needinfo?(emilio)

Actually, I should have asked Harald. :)

Flags: needinfo?(emilio) → needinfo?(hkirschner)
Attachment #9155991 - Attachment is obsolete: true

No hangs for that case (but Slack performance has not been great today either).

Flags: needinfo?(hkirschner)

I can't tell if I am hitting it here (Slack tab stuck with a slow script warning) because I didn't force-reload my Slack Service Worker or if this is another issue but I took a profile anyway: https://share.firefox.dev/2ACqWm1

This seems to have regressed again :(

https://share.firefox.dev/3gpLUTJ

Mike, can you reach out to slack? I can reproduce if hover over a message that says iszoomdownforeveryoneorjustme.com (with the backticks / quoted code font).

Flags: needinfo?(miket)

Yep, I just dropped a message in our shared channel (and added you).

Flags: needinfo?(miket)
Depends on: 1719314
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: