Closed Bug 1775241 Opened 2 years ago Closed 2 years ago

indexedDB tests fail randomly and are highly intermittent when running on GCP workers instead of AWS

Categories

(Core :: Storage: IndexedDB, defect, P2)

Tracking

RESOLVED FIXED

People

(Reporter: masterwayz, Assigned: janv)

References

Details

indexedDB tests fail randomly and are highly intermittent when running on GCP workers instead of AWS.

See this try push: https://treeherder.mozilla.org/jobs?repo=try&revision=065be3ea0f9e75c96e5a032efd53c110cf620a27

As we talked about on Matrix.

Flags: needinfo?(jstutte)

What are the steps to reproduce this on GCP with the test change patches? Is it enough to include D149703 in the patch stack and push to try normally?

Flags: needinfo?(jstutte) → needinfo?(mgoossens)
Component: General → Storage: IndexedDB
Product: Testing → Core
Version: Default → unspecified

Yes, it is enough to have D149703 in your patch stack (only the change to taskcluster/ci/test/mochitest.yml is needed) and then just push to try normally.

Flags: needinfo?(mgoossens)
Severity: -- → S3
Priority: -- → P2

Hi jstutte. This is currently blocking a test suite. Is there someone who can work on this fix?

Thanks!

Flags: needinfo?(jstutte)

Hi Michelle, it seems I am not able to reproduce the failures. Am I overlooking something obvious about how to start the tests?

Flags: needinfo?(jstutte) → needinfo?(mgoossens)

The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead; it shouldn't contain any additional disabled tests.

Flags: needinfo?(mgoossens)

(In reply to Michelle Goossens [:masterwayz] from comment #7)

> The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead, this shouldn't contain any additional disabled tests.

I enabled the IndexedDB tests, AFAICS. Isn't that enough? Can you confirm whether I triggered a run on GCP with my patch?

Flags: needinfo?(mgoossens)

Odd. I just made a new push, rebased on latest central, to see what happens.
I will update the findings here and leave the NI in place.

Thanks!

(FWIW, I also pushed another try run that just enables the tests without other changes: https://treeherder.mozilla.org/jobs?repo=try&revision=e7210217a9f9b37f149a995f3eb943773dba9deb - I'll get back here on Monday.)

Flags: needinfo?(mgoossens) → needinfo?(jstutte)

Yeah, my latest push also shows the error. The only difference from my other test is some more logging. There must be some fragile race here?

Flags: needinfo?(jstutte)

Maybe a race condition? I have no clue, but it is very odd that just adding logging fixes it.

(In reply to Michelle Goossens [:masterwayz] from comment #14)

> Maybe a race condition? I have no clue but that is very odd that just adding logging fixes it.

Likely a race, as I retriggered the test with logs and one of the runs shows the failure, too.

After two days of printf debugging and try pushes I think I have a trail now. Please note that the order of the debug messages is confusing: the printf_stderr messages are printed earlier than the JavaScript info messages. Reposting the log snippet here in the assumed order of events:

[task 2022-06-27T13:40:41.613Z] 13:40:41     INFO - TEST-START | dom/indexedDB/test/test_object_identity.html
...
[task 2022-06-27T13:46:10.535Z] 13:46:10     INFO - Clearing old databases
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - clearAllDatabases then nextTestHarnessStep
> [task 2022-06-27T13:40:41.672Z] 13:40:41     INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.674Z] 13:40:41     INFO - GECKO(10730) | State_DirectoryWorkOpen (numOps:1)
> [task 2022-06-27T13:40:41.675Z] 13:40:41     INFO - GECKO(10730) | State_UnblockingOpen (numOps:1)
> [task 2022-06-27T13:40:41.676Z] 13:40:41     INFO - GECKO(10730) | QuotaRequestBase::SendResults
> [task 2022-06-27T13:40:41.677Z] 13:40:41     INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
> [task 2022-06-27T13:40:41.679Z] 13:40:41     INFO - GECKO(10730) | Request::FireCallback fired.
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - Running test in worker
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - Running test in worker as GeneratorFunction
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type 
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - executeWorkerTestAndCleanUp for http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - Worker: loading ["http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js"]
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - Worker: starting tests
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type: undefined 
... (some more tests passing) ...
[task 2022-06-27T13:46:10.543Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Worker finished 
[task 2022-06-27T13:46:10.544Z] 13:46:10     INFO - Terminating the worker
[task 2022-06-27T13:46:10.544Z] 13:46:10     INFO - Cleaning up the databases
[task 2022-06-27T13:46:10.545Z] 13:46:10     INFO - clearAllDatabases then 
> [task 2022-06-27T13:40:41.726Z] 13:40:41     INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.726Z] 13:40:41     INFO - GECKO(10730) | DirectoryLockImpl::Overlaps -> true
> [task 2022-06-27T13:40:41.727Z] 13:40:41     INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
[task 2022-06-27T13:46:10.545Z] 13:46:10     INFO - Buffered messages finished
[task 2022-06-27T13:46:10.547Z] 13:46:10     INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_object_identity.html | Test timed out. - 

At the second call to clearAllDatabases we end up initializing the ClearOrigin operation and find another directory lock active (DirectoryLockImpl::Overlaps -> true). But according to State_Initial (numOps:1) we are the only existing operation, so nothing will ever unlock us again and we are stuck.

So something breaks our lock bookkeeping, and it may well be some remainder from a previous test run for the same origin. Assuming that the bookkeeping is generally well tested, and given the apparent race, my best bet would be that we need a mutex that guards all the lock-list manipulations and lookups in DirectoryLockImpl to ensure that we always have a consistent view of those lists.
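
To make that concrete, here is a minimal, self-contained sketch of what such a guard could look like. This is not the actual DirectoryLockImpl code; all class and member names (LockSketch, LockRegistrySketch, mLiveLocks, ...) are made up for illustration. The point is just that registering, unregistering and the overlap lookup all take the same mutex, so an overlap check can never observe a half-updated list.

#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a directory lock; the real overlap rule also
// considers persistence type and client, this only compares origins.
struct LockSketch {
  std::string origin;

  bool Overlaps(const LockSketch& aOther) const {
    return origin == aOther.origin;
  }
};

class LockRegistrySketch {
 public:
  void Register(const LockSketch* aLock) {
    std::lock_guard<std::mutex> guard(mMutex);
    mLiveLocks.push_back(aLock);
  }

  void Unregister(const LockSketch* aLock) {
    std::lock_guard<std::mutex> guard(mMutex);
    mLiveLocks.erase(std::remove(mLiveLocks.begin(), mLiveLocks.end(), aLock),
                     mLiveLocks.end());
  }

  // Returns true if any still-registered lock would block aNewLock.
  bool OverlapsAny(const LockSketch& aNewLock) const {
    std::lock_guard<std::mutex> guard(mMutex);
    for (const auto* existing : mLiveLocks) {
      if (existing->Overlaps(aNewLock)) {
        return true;
      }
    }
    return false;
  }

 private:
  mutable std::mutex mMutex;
  std::vector<const LockSketch*> mLiveLocks;
};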

Jan, wdyt?

Flags: needinfo?(jvarga)

Jens, thanks a lot for the info, I'll look into what's wrong.

Just summarizing our conversation on Slack:

  • a mutex is not supposed to help, since all DirectoryLock accesses run on the same thread (and we trust the corresponding asserts)
  • there can be other calls to Acquire that do not come from operations

We should still add more info to the logs for better understanding; Jan is looking into this.
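
For reference, this is the kind of extra logging meant here. The helper and its parameters are hypothetical, not existing code; the only assumption is printf_stderr as declared in nsDebug.h. The idea is to log every lock acquire/release together with the number of live locks, so a lock leaked by a previous test shows up directly in the log.

#include "nsDebug.h"  // printf_stderr

// Hypothetical logging helper (illustration only).
static void LogLockTransition(const char* aWhat, const char* aOriginSpec,
                              size_t aLiveLockCount) {
  printf_stderr("DirectoryLock %s for origin %s (live locks: %zu)\n", aWhat,
                aOriginSpec, aLiveLockCount);
}

// Intended call sites (illustrative only):
//   LogLockTransition("acquired", originSpec, liveLockCount);
//   LogLockTransition("released", originSpec, liveLockCount);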

Depends on: 1777914

The root cause has been fixed in bug 1777914. The transition to GCP is now possible.

Assignee: nobody → jvarga
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jvarga)
Resolution: --- → FIXED

That is awesome, thank you! This should resolve the two blockers that @masterwayz had noted.
