indexedDB tests fail randomly and are highly intermittent when running on GCP workers instead of AWS
Categories
(Core :: Storage: IndexedDB, defect, P2)
People
(Reporter: masterwayz, Assigned: janv)
See this try push: https://treeherder.mozilla.org/jobs?repo=try&revision=065be3ea0f9e75c96e5a032efd53c110cf620a27
Comment 2•2 years ago
What are the steps to reproduce this on GCP with the test change patches? Is it enough to include D149703 in the patch stack and push normally to try?
Updated•2 years ago
Reporter
Comment 3•2 years ago
Yes, make sure to have D149703 in your patch stack (only the change to taskcluster/ci/test/mochitest.yml is needed) and then just push to try normally.
Updated•2 years ago
Hi jstutte. This is currently blocking a test suite. Is there someone who can work on this fix?
Thanks!
Comment 5•2 years ago
Comment 6•2 years ago
Hi Michelle, it seems I am not able to reproduce the failures. Am I overlooking something obvious about how to start the tests?
Reporter
Comment 7•2 years ago
The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead; this shouldn't contain any additional disabled tests.
Comment 8•2 years ago
> (In reply to Michelle Goossens [:masterwayz] from comment #7)
> The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead; this shouldn't contain any additional disabled tests.
I enabled the IndexedDB tests, AFAICS. Isn't that enough? Can you confirm whether or not I triggered a run on GCP with my patch?
Reporter
Comment 9•2 years ago
Odd. I just made a new push, rebased on latest central, to see what happens.
I will update with the findings here, leaving the NI in place.
Comment 10•2 years ago
Thanks!
Comment 11•2 years ago
(FWIW, I also pushed another try run just enabling the tests, without other changes: https://treeherder.mozilla.org/jobs?repo=try&revision=e7210217a9f9b37f149a995f3eb943773dba9deb - I'll get back to this on Monday.)
Reporter
Comment 12•2 years ago
I still see the failures here, so it is not fixed on its own: https://treeherder.mozilla.org/jobs?repo=try&revision=ebb4f089edead4b945ec933d5e252f619eceb396
Comment 13•2 years ago
Yeah, my latest push also shows the error. The only difference from my other push is some more logging. There must be some fragile race here?
Reporter
Comment 14•2 years ago
Maybe a race condition? I have no clue, but it is very odd that just adding logging fixes it.
Comment 15•2 years ago
> (In reply to Michelle Goossens [:masterwayz] from comment #14)
> Maybe a race condition? I have no clue, but it is very odd that just adding logging fixes it.
Likely a race, as I retriggered the test with logs and one of the runs shows the failure, too.
Comment 16•2 years ago
After two days of printf debugging and try pushes I think I have a trail now. Please note that the order of the debug messages is confusing: the printf_stderr messages are printed earlier than the JavaScript info messages. Reproducing the log snippet here in the assumed order of events:
[task 2022-06-27T13:40:41.613Z] 13:40:41 INFO - TEST-START | dom/indexedDB/test/test_object_identity.html
...
[task 2022-06-27T13:46:10.535Z] 13:46:10 INFO - Clearing old databases
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - clearAllDatabases then nextTestHarnessStep
> [task 2022-06-27T13:40:41.672Z] 13:40:41 INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.674Z] 13:40:41 INFO - GECKO(10730) | State_DirectoryWorkOpen (numOps:1)
> [task 2022-06-27T13:40:41.675Z] 13:40:41 INFO - GECKO(10730) | State_UnblockingOpen (numOps:1)
> [task 2022-06-27T13:40:41.676Z] 13:40:41 INFO - GECKO(10730) | QuotaRequestBase::SendResults
> [task 2022-06-27T13:40:41.677Z] 13:40:41 INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
> [task 2022-06-27T13:40:41.679Z] 13:40:41 INFO - GECKO(10730) | Request::FireCallback fired.
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - Running test in worker
[task 2022-06-27T13:46:10.536Z] 13:46:10 INFO - Running test in worker as GeneratorFunction
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - executeWorkerTestAndCleanUp for http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js
[task 2022-06-27T13:46:10.537Z] 13:46:10 INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - Worker: loading ["http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js"]
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - Worker: starting tests
[task 2022-06-27T13:46:10.538Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type: undefined
... (some more tests passing) ...
[task 2022-06-27T13:46:10.543Z] 13:46:10 INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Worker finished
[task 2022-06-27T13:46:10.544Z] 13:46:10 INFO - Terminating the worker
[task 2022-06-27T13:46:10.544Z] 13:46:10 INFO - Cleaning up the databases
[task 2022-06-27T13:46:10.545Z] 13:46:10 INFO - clearAllDatabases then
> [task 2022-06-27T13:40:41.726Z] 13:40:41 INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.726Z] 13:40:41 INFO - GECKO(10730) | DirectoryLockImpl::Overlaps -> true
> [task 2022-06-27T13:40:41.727Z] 13:40:41 INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
[task 2022-06-27T13:46:10.545Z] 13:46:10 INFO - Buffered messages finished
[task 2022-06-27T13:46:10.547Z] 13:46:10 INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_object_identity.html | Test timed out. -
At the second call to clearAllDatabases we end up initializing the ClearOrigin operation and finding another directory lock active (DirectoryLockImpl::Overlaps -> true). But according to State_Initial (numOps:1) we are the only existing operation, so nothing will ever unlock us again and we are stuck.
So something breaks our lock bookkeeping, and it could well be some remainder from a previous test run for the same origin. Assuming that the bookkeeping is generally well tested, and given the apparent race, my best bet would be that we need a mutex that guards all the lock-list manipulations and lookups in DirectoryLockImpl, to ensure that we always have a consistent view of those lists.
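To make the suspected sequence concrete, here is a minimal standalone model of the hang. This is an illustrative sketch only; the class names (LockList, DirectoryLock) and the Acquire/Release methods are simplified stand-ins and do not match the real QuotaManager/DirectoryLockImpl code.

```cpp
// Hypothetical, simplified model of the suspected hang; NOT the real Gecko
// QuotaManager code. Directory locks are kept in a list, and a
// ClearOrigin-style operation may only proceed once no other lock for the
// same origin overlaps it (the "Overlaps -> true" log line above). If a stale
// lock from a previous test run is still registered, the second clear never
// runs and the test times out.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct DirectoryLock {
  std::string origin;
};

class LockList {
 public:
  // Registers a lock and returns true if no existing lock overlaps it, i.e.
  // the caller may proceed immediately instead of waiting for an unlock.
  bool Acquire(const DirectoryLock& aLock) {
    const bool overlaps = std::any_of(
        mLocks.begin(), mLocks.end(),
        [&](const DirectoryLock& l) { return l.origin == aLock.origin; });
    mLocks.push_back(aLock);
    return !overlaps;
  }

  // Releases the oldest lock for the origin. If this is ever skipped (broken
  // bookkeeping), every later Acquire for that origin waits forever.
  void Release(const std::string& aOrigin) {
    auto it = std::find_if(
        mLocks.begin(), mLocks.end(),
        [&](const DirectoryLock& l) { return l.origin == aOrigin; });
    if (it != mLocks.end()) {
      mLocks.erase(it);
    }
  }

 private:
  std::vector<DirectoryLock> mLocks;
};

int main() {
  LockList list;
  const std::string origin = "http://mochi.test:8888";

  // First clearAllDatabases: lock acquired, operation runs, lock released.
  std::cout << "clear #1 may proceed: " << list.Acquire({origin}) << '\n';
  list.Release(origin);

  // Suspected bug: a lock belonging to a previous test run for the same
  // origin lingers in the list because its Release() never happened.
  list.Acquire({origin});

  // Second clearAllDatabases: Overlaps -> true, so the operation waits for an
  // unlock that never comes.
  std::cout << "clear #2 may proceed: " << list.Acquire({origin}) << '\n';
  return 0;
}
```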
Jan, wdyt?
Assignee
Comment 17•2 years ago
Jens, thanks a lot for the info, I'll look into what's wrong.
Comment 18•2 years ago
Just summarizing our conversation on Slack:
- a Mutex is not supposed to help: all DirectoryLock accesses run on the same thread (and we trust the corresponding asserts) - see the sketch below
- there can be other calls to Acquire that are not coming from operations
We should still add more info to the logs for a better understanding; Jan is looking into this.
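As a concrete illustration of the first point, here is a minimal generic C++ sketch of the assert-thread-affinity-instead-of-locking pattern. The class and helper names are made up for this example and are not the actual Gecko implementation.

```cpp
// Generic sketch, not Gecko code: all accesses to the lock bookkeeping are
// expected to happen on one owning thread, so a debug assertion documents and
// enforces that invariant instead of serializing every access with a Mutex.
#include <algorithm>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

class DirectoryLockList {
 public:
  DirectoryLockList() : mOwningThread(std::this_thread::get_id()) {}

  void Add(const std::string& aOrigin) {
    AssertIsOnOwningThread();
    mOrigins.push_back(aOrigin);
  }

  void Remove(const std::string& aOrigin) {
    AssertIsOnOwningThread();
    // No locking needed: the single-thread invariant keeps the list consistent.
    mOrigins.erase(std::remove(mOrigins.begin(), mOrigins.end(), aOrigin),
                   mOrigins.end());
  }

 private:
  void AssertIsOnOwningThread() const {
    // If this ever fires, the single-thread assumption is broken and a Mutex
    // (or a dispatch to the owning thread) really would be required.
    assert(std::this_thread::get_id() == mOwningThread);
  }

  std::thread::id mOwningThread;
  std::vector<std::string> mOrigins;
};

int main() {
  DirectoryLockList list;
  list.Add("http://mochi.test:8888");
  list.Remove("http://mochi.test:8888");
  return 0;
}
```

If such an assertion ever fired, it would mean the bookkeeping is being touched off the owning thread, which is exactly the situation where a Mutex would become necessary.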
Updated•2 years ago
Assignee
Comment 19•2 years ago
The root cause has been fixed in bug 1777914. The transition to GCP is now possible.
Comment 20•2 years ago
That is awesome, thank you! This should resolve the two blockers that @masterwayz had noted.