Open Bug 1708336 Opened 4 years ago Updated 1 year ago

Content documents intermittently fail to fire any a11y events in Windows/Mac "-qr" CI jobs

Categories

(Core :: Disability Access APIs, defect, P2)

Desktop
All
defect

People

(Reporter: Jamie, Unassigned)

We're seeing an increasing number of a11y browser tests intermittently timing out on windows10-64-qr and windows10-64-shippable-qr continuous integration jobs.

(The below is from bug 1652192 comment 44 with some additional context.)

Frustratingly, I can't reproduce it locally, nor on try. I spent days trying to trace through and document the DocAccessible creation/loading code to figure out what could be going wrong, but I'm still at a loss.

I initially thought this might be a problem with our firing of document load complete events. However, I landed a patch in bug 1652192 to log events for that test. Looking at the logged events for failures, it looks like we never get any events for the actual test document. I would have expected focus at least. That means either:

  1. We don't create the DocAccessible in the content process;
  2. We don't send it to the parent process; or
  3. We don't send events to the parent process.

The fact that this only seems to happen on Windows makes me think we're not getting events, since only Windows defers all events until after the parent COM proxy is received. Assuming that's correct, that of course raises the question: why is the parent COM proxy not being received? Also, if that were true, we should hit the MOZ_ASSERT(!aParentCOMProxy.IsNull()) assertion, which isn't showing up in the debug failures.
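As a rough illustration of the deferral described above, here is a toy Python model. It is not Gecko code: the class, its fields, and its method names are all hypothetical. It only shows why a content document whose parent proxy never arrives would appear to fire no events at all, not even focus.

```python
from dataclasses import dataclass, field


@dataclass
class ContentDocModel:
    """Toy stand-in for a content document that defers a11y events.

    Hypothetical model only: none of these names exist in Gecko.
    """
    parent_proxy_received: bool = False
    pending: list = field(default_factory=list)  # events held back
    sent: list = field(default_factory=list)     # events the parent would see

    def fire(self, event: str) -> None:
        # Until the parent proxy arrives, every event is queued, not sent.
        if self.parent_proxy_received:
            self.sent.append(event)
        else:
            self.pending.append(event)

    def receive_parent_proxy(self) -> None:
        # Arrival of the proxy flushes the backlog in its original order.
        self.parent_proxy_received = True
        self.sent.extend(self.pending)
        self.pending.clear()


doc = ContentDocModel()
doc.fire("focus")
doc.fire("document load complete")
events_before_proxy = list(doc.sent)  # nothing has reached the parent yet
doc.receive_parent_proxy()
events_after_proxy = list(doc.sent)   # backlog delivered once the proxy arrives
```

In this model, a run where `receive_parent_proxy()` is never called leaves `sent` empty indefinitely, which would match a log containing no events for the test document.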

I'm currently asking around to figure out whether there's something special about the -qr jobs (other than running tests with --enable-webrender). It really would be much easier if we could reproduce this locally or at least on try.

OS: Unspecified → Windows
Hardware: Unspecified → Desktop

I'm suspicious of our accessible/tests/browser/mac/browser_webarea.js test too: all its recorded failures are on mac-qr builds, and the screenshots logged in the failures are the same as the ones I see on the windows failures: the data URI for the test is loaded, no content is visible, test stalls seemingly on "entry" (we get no logs within the test itself). Might be two separate issues, but noting here just in case :)

Some more info about the logged screenshots, for reference:

  • accessible/tests/browser/states/browser_test_visibility.js: This test has a few different screenshot types. I looked at the 30 most recent runs.
    -- Screenshot one here shows a browser window titled "Accessibility Test". There is a background tab titled "New Tab", but the acc test tab is active. In the test tab, the URL bar has a data URI which is presumably the test content, but it's hard to tell because it's URL-encoded with all the whitespace as %20, so there's a lot of noise. I see an HTML tag, a head tag, and a meta tag (but can't see the meta contents because they go past the visible edge of the field). The URL bar is also red with the same shading I see when a test first starts locally. I think locally it usually also has a tooltip that says something like "this tab is being remotely managed", but I don't see that here, presumably because no one is interacting with the tab. The webarea of the page contains a rectangle with a blue border (~5px?). The rectangle takes up about half the page width and spans the full page height. (4/30)
    -- Screenshot two here shows a "New tab" background tab and an active foreground "Crash reporter" tab with the regular "Gah! Your tab has crashed. Report it to us?" page and dialog. The URL bar has the same data URI and is shaded and red. (20/30)
    -- Screenshot three here also has the background new tab and active crash reporter but the URL bar here is white and shaded instead of red and shaded. The loaded URI is the same.

  • accessible/tests/browser/events/browser_test_focus_urlbar.js: I looked at 30 recent runs, all of which had the same screenshot: two tabs, "New Tab" and "Accessibility Test", with the latter as the active tab. The URL bar is orange and shaded, and it contains the same URI (or the same prefix as the URI I mentioned above). The web area has no content.

  • accessible/tests/browser/events/browser_test_panel.js: Screenshot here. I see two tabs with the same titles as above; the acc test one is active with the same URI loaded. In 8 of the 30 most recent runs the URL bar is white and shaded, and in the other 22 it is red and shaded. In all 30 runs, the web area has no content.

  • accessible/tests/browser/mac/browser_webarea.js: This test has three screenshot types.
    -- Screenshot here. I see two tabs with the same titles, the acc test one is active with the same URI loaded. The URL bar here is orange with the same shading as the first test above. The webarea has an iframe (I think) with the text "hello world" inside. The application bar is normal. (1/12)
    -- Screenshot here. I see a completely blank window (no chrome or content at all). The mac application bar is visible but contains no text. (9/12)
    -- Screenshot here. I see a Firefox window with an orange, shaded URL bar. It has no URL, but the placeholder "Search with Google" text is visible. The web area has no content. The application bar is normal. (1/12)
    I couldn't get a screenshot for one of the 12 builds I investigated. These stats are from builds run Mar 15 to 19.

Thanks so much for the thorough details, Morgan. I'm not quite sure where it leaves us, since it seems there are some differences across tests here, but hopefully we'll be able to make something come out of it. :)

(In reply to Morgan Reschenberg [:morgan] from comment #2)

-- Screenshot two here shows a "New tab" background tab and an active foreground "Crash reporter" tab with the regular "Gah! Your tab has crashed. Report it to us?" page and dialog. The URL bar has the same data URI and is shaded and red. (20/30)
-- Screenshot three here also has the background new tab and active crash reporter but the URL bar here is white and shaded instead of red and shaded. The loaded URI is the same.

I suspect these crashes are actually bug 1709250 (now fixed); see bug 1652192 comment 51. So I think we can disregard those.

(In reply to James Teh [:Jamie] from comment #0)

I'm currently asking around to figure out whether there's something special about the -qr jobs (other than running tests with --enable-webrender).

It looks like they just enable webrender. Despite the timing changes inherent with webrender, there doesn't seem to be any clear reason this should impact a11y. That said, based on Morgan's findings in comment 2, this may be a document loading problem (at least some of the time), before a11y even comes into the picture.

Blocks: 1636476

Here's a hypothesis, albeit an unlikely one:

  1. The AccService isn't started until the first a11y test starts it. This might be after we've already pre-started content processes.
  2. ContentParent observes that a11y has started and sends an IPC message to start a11y in the content process.
  3. If the test document somehow loaded before the content process started a11y, that would mean a11y would create the DocAccessible after the DOM document was already loaded. In that case, we wouldn't fire doc load complete.

The problem with this is that I don't see how the document could load before the content process started a11y. We initialise accService before loading the document, so presumably, the IPC message is queued at that point. I assume loading a document is also an IPC message, and presumably, queued IPC messages are processed in the order they were sent. Still, I'm making quite a few assumptions here, so I'm not completely discounting this yet.

Another counter-argument is that I think we should still fire a state change busy false event even if we don't fire doc load complete. However, I don't see such an event in the logs described in comment 0.
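The ordering argument above can be sketched as a small simulation. This is a toy Python model, not Gecko code: the message names, the `process` function, and the exact events emitted are assumptions made for illustration. It treats IPC as a strict FIFO queue and, following the belief stated above, assumes a busy=false state change still fires when doc load complete is missed.

```python
from collections import deque


def process(messages):
    """Handle messages strictly in the order given; return the events fired.

    Hypothetical model of the hypothesis above, not Gecko code.
    """
    queue = deque(messages)
    a11y_started = False
    doc_loaded = False
    events = []
    while queue:
        msg = queue.popleft()
        if msg == "start_a11y":
            a11y_started = True
            if doc_loaded:
                # The document finished loading before a11y existed, so
                # doc load complete is missed (step 3 of the hypothesis).
                # Assumption: the busy=false state change still fires.
                events.append("state change busy false")
        elif msg == "load_document":
            doc_loaded = True
            if a11y_started:
                # Modelled assumption: a normal load fires both events.
                events.append("document load complete")
                events.append("state change busy false")
    return events


# Messages processed in the order they were sent (the expected case).
in_order = process(["start_a11y", "load_document"])
# The document somehow loads first (the unlikely case in the hypothesis).
reordered = process(["load_document", "start_a11y"])
```

Under these assumptions, even the reordered case produces at least one event, whereas the logs from comment 0 show none for the test document. That is the mismatch the counter-argument points at.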

Blocks: 1713040
Blocks: 1713663
Blocks: 1714067
Blocks: 1714576

This seems to be affecting Mac as well.

OS: Windows → All
Summary: Content documents intermittently fail to fire any a11y events in Windows "-qr" CI jobs → Content documents intermittently fail to fire any a11y events in Windows/Mac "-qr" CI jobs

My findings (and hopefully fixes) in bug 1638880 comment 78 onwards might help here.

Depends on: 1638880
Blocks: 1758534
Blocks: 1783890

Through a process of tedious try debugging, I've established that we aren't sending the parent COM proxy to the content process when this happens on Windows, so the content document doesn't fire any events. I don't yet know why we aren't sending the parent COM proxy (or perhaps why it isn't being received).

Note that this doesn't apply to Mac. If there's still an issue there, it's a different one.

Depends on: 1786496
No longer blocks: 1492259
No longer blocks: 1703620