indexedDB tests fail randomly and are highly intermittent when running on GCP workers instead of AWS
Categories
(Core :: Storage: IndexedDB, defect, P2)
People
(Reporter: masterwayz, Assigned: janv)
See this try push: https://treeherder.mozilla.org/jobs?repo=try&revision=065be3ea0f9e75c96e5a032efd53c110cf620a27
Comment 2•2 years ago
What are the steps to reproduce this on GCP with the test change patches? Is it enough to include D149703 in the patch stack and push normally to try?
Updated•2 years ago
Reporter
Comment 3•2 years ago
Yes, make sure to have D149703 in your patch stack (only the change to taskcluster/ci/test/mochitest.yml is needed) and then just push to try normally.
Updated•2 years ago
Hi jstutte. This is currently blocking a test suite. Is there someone who can work on this fix?
Thanks!
Comment 5•2 years ago
Comment 6•2 years ago
Hi Michelle, it seems I am not able to reproduce the failures. Am I overlooking something obvious about how to start the tests?
Reporter
Comment 7•2 years ago
The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead; this shouldn't contain any additional disabled tests.
Comment 8•2 years ago
> (In reply to Michelle Goossens [:masterwayz] from comment #7)
> The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead; this shouldn't contain any additional disabled tests.
I enabled the IndexedDB tests, AFAICS. Isn't that enough? Can you confirm whether or not I triggered a run on GCP with my patch?
Reporter
Comment 9•2 years ago
Odd. I just made a new push, rebased on latest central, to see what happens.
I will update with the findings here, leaving the NI in place.
Comment 10•2 years ago
Thanks!
Comment 11•2 years ago
(FWIW, I also pushed another try run just enabling the tests, without other changes: https://treeherder.mozilla.org/jobs?repo=try&revision=e7210217a9f9b37f149a995f3eb943773dba9deb - I'll get back to this on Monday.)
Reporter
Comment 12•2 years ago
I still see the failures here, so it is not fixed on its own: https://treeherder.mozilla.org/jobs?repo=try&revision=ebb4f089edead4b945ec933d5e252f619eceb396
Comment 13•2 years ago
Yeah, my latest push also shows the error. The only difference from my other push is some more logging. There must be some fragile race here?
Reporter
Comment 14•2 years ago
Maybe a race condition? I have no clue, but it is very odd that just adding logging fixes it.
Comment 15•2 years ago
> (In reply to Michelle Goossens [:masterwayz] from comment #14)
> Maybe a race condition? I have no clue, but it is very odd that just adding logging fixes it.
Likely a race, as I retriggered the test with logs and one of the runs shows the failure, too.
Comment 16•2 years ago
After two days of printf debugging and try pushes I think I have a trail now. Please note that the order of the debug messages is confusing: the printf_stderr messages are printed earlier than the JavaScript info messages. Reproducing the log snippet here in the assumed order of events:
[task 2022-06-27T13:40:41.613Z] 13:40:41 INFO - TEST-START | dom/indexedDB/test/test_object_identity.html
...
[task 2022-06-27T13:46:10.535Z] 13:46:10 INFO - Clearing old databases
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - clearAllDatabases then nextTestHarnessStep
> [task 2022-06-27T13:40:41.672Z] 13:40:41 INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.674Z] 13:40:41 INFO - GECKO(10730) | State_DirectoryWorkOpen (numOps:1)
> [task 2022-06-27T13:40:41.675Z] 13:40:41 INFO - GECKO(10730) | State_UnblockingOpen (numOps:1)
> [task 2022-06-27T13:40:41.676Z] 13:40:41 INFO - GECKO(10730) | QuotaRequestBase::SendResults
> [task 2022-06-27T13:40:41.677Z] 13:40:41 INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
> [task 2022-06-27T13:40:41.679Z] 13:40:41 INFO - GECKO(10730) | Request::FireCallback fired.
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - Running test in worker
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - Running test in worker as GeneratorFunction
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - executeWorkerTestAndCleanUp for http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - Worker: loading ["http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js"]
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - Worker: starting tests
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type: undefined
... (some more tests passing) ...
[task 2022-06-27T13:46:10.543Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Worker finished
[task 2022-06-27T13:46:10.544Z] 13:46:10 INFO - Terminating the worker
[task 2022-06-27T13:46:10.544Z] 13:46:10 INFO - Cleaning up the databases
[task 2022-06-27T13:46:10.545Z] 13:46:10 INFO - clearAllDatabases then
> [task 2022-06-27T13:40:41.726Z] 13:40:41 INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.726Z] 13:40:41 INFO - GECKO(10730) | DirectoryLockImpl::Overlaps -> true
> [task 2022-06-27T13:40:41.727Z] 13:40:41 INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
[task 2022-06-27T13:46:10.545Z] 13:46:10 INFO - Buffered messages finished
[task 2022-06-27T13:46:10.547Z] 13:46:10 INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_object_identity.html | Test timed out. -
At the second call to clearAllDatabases we end up initializing the ClearOrigin operation and finding another directory lock active (DirectoryLockImpl::Overlaps -> true). But according to State_Initial (numOps:1) we are the only existing operation, so nothing will ever unlock us again and we are stuck.
So something breaks our lock bookkeeping, and it could well be some remainder from a previous test run for the same origin. Assuming that the bookkeeping is generally well tested, and given the apparent race, my best bet would be that we need a mutex that guards all the lock-list manipulations and lookups in DirectoryLockImpl, to ensure that we always have a consistent view of those lists.
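To make the suspected sequence concrete, here is a minimal standalone model of the hang. This is an illustrative sketch only; the class names (LockList, DirectoryLock) and the Acquire/Release methods are simplified stand-ins and do not match the real QuotaManager/DirectoryLockImpl code.

```cpp
// Hypothetical, simplified model of the suspected hang; NOT the real Gecko
// QuotaManager code. Directory locks are kept in a list, and a
// ClearOrigin-style operation may only proceed once no other lock for the
// same origin overlaps it (the "Overlaps -> true" log line above). If a stale
// lock from a previous test run is still registered, the second clear never
// runs and the test times out.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct DirectoryLock {
  std::string origin;
};

class LockList {
 public:
  // Registers a lock and returns true if no existing lock overlaps it, i.e.
  // the caller may proceed immediately instead of waiting for an unlock.
  bool Acquire(const DirectoryLock& aLock) {
    const bool overlaps = std::any_of(
        mLocks.begin(), mLocks.end(),
        [&](const DirectoryLock& l) { return l.origin == aLock.origin; });
    mLocks.push_back(aLock);
    return !overlaps;
  }

  // Releases the oldest lock for the origin. If this is ever skipped (broken
  // bookkeeping), every later Acquire for that origin waits forever.
  void Release(const std::string& aOrigin) {
    auto it = std::find_if(
        mLocks.begin(), mLocks.end(),
        [&](const DirectoryLock& l) { return l.origin == aOrigin; });
    if (it != mLocks.end()) {
      mLocks.erase(it);
    }
  }

 private:
  std::vector<DirectoryLock> mLocks;
};

int main() {
  LockList list;
  const std::string origin = "http://mochi.test:8888";

  // First clearAllDatabases: lock acquired, operation runs, lock released.
  std::cout << "clear #1 may proceed: " << list.Acquire({origin}) << '\n';
  list.Release(origin);

  // Suspected bug: a lock belonging to a previous test run for the same
  // origin lingers in the list because its Release() never happened.
  list.Acquire({origin});

  // Second clearAllDatabases: Overlaps -> true, so the operation waits for an
  // unlock that never comes.
  std::cout << "clear #2 may proceed: " << list.Acquire({origin}) << '\n';
  return 0;
}
```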
Jan, wdyt?
Assignee
Comment 17•2 years ago
Jens, thanks a lot for the info, I'll look into what's wrong.
Comment 18•2 years ago
Just summarizing our conversation on Slack:
- a Mutex is not supposed to help: all DirectoryLock accesses run on the same thread (and we trust the corresponding asserts) - see the sketch below
- there can be other calls to Acquire that are not coming from operations
We should still add more info to the logs for a better understanding; Jan is looking into this.
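As a concrete illustration of the first point, here is a minimal generic C++ sketch of the assert-thread-affinity-instead-of-locking pattern. The class and helper names are made up for this example and are not the actual Gecko implementation.

```cpp
// Generic sketch, not Gecko code: all accesses to the lock bookkeeping are
// expected to happen on one owning thread, so a debug assertion documents and
// enforces that invariant instead of serializing every access with a Mutex.
#include <algorithm>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

class DirectoryLockList {
 public:
  DirectoryLockList() : mOwningThread(std::this_thread::get_id()) {}

  void Add(const std::string& aOrigin) {
    AssertIsOnOwningThread();
    mOrigins.push_back(aOrigin);
  }

  void Remove(const std::string& aOrigin) {
    AssertIsOnOwningThread();
    // No locking needed: the single-thread invariant keeps the list consistent.
    mOrigins.erase(std::remove(mOrigins.begin(), mOrigins.end(), aOrigin),
                   mOrigins.end());
  }

 private:
  void AssertIsOnOwningThread() const {
    // If this ever fires, the single-thread assumption is broken and a Mutex
    // (or a dispatch to the owning thread) really would be required.
    assert(std::this_thread::get_id() == mOwningThread);
  }

  std::thread::id mOwningThread;
  std::vector<std::string> mOrigins;
};

int main() {
  DirectoryLockList list;
  list.Add("http://mochi.test:8888");
  list.Remove("http://mochi.test:8888");
  return 0;
}
```

If such an assertion ever fired, it would mean the bookkeeping is being touched off the owning thread, which is exactly the situation where a Mutex would become necessary.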
Updated•2 years ago
Assignee
Comment 19•2 years ago
The root cause has been fixed in bug 1777914. The transition to GCP is now possible.
Comment 20•2 years ago
That is awesome, thank you! This should resolve the two blockers that @masterwayz had noted.