Closed Bug 1545438 Opened 5 years ago Closed 4 years ago

We shouldn't disable the blocklist in talos, at least for startup tests

Categories

(Testing :: Talos, task, P2)

Version 3

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla76
firefox76: fixed

People

(Reporter: Gijs, Assigned: Gijs)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxperf:p2])

Attachments

(2 files)

bug 808824 disabled these because blocklist updates were causing instability in xperf/tp5 numbers. That's fair enough (ish), but there are ~3-7% regressions in startup tests (ts_paint, sessionrestore) when toggling that pref. The talos tests should be measuring those startup costs so we can see whether attempts to reduce them actually work.

Bugbug thinks this bug is a task, but please change it back in case of error.

Type: defect → task
Whiteboard: [fxperf] → [fxperf:p2]
Priority: -- → P2
Assignee: nobody → mconley

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #2)

I have a try push up that tries to do this.

Comparison coming in here:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=370986286359f9bb5289aa3f1e2e9012f54077b7&newProject=try&newRevision=48cf0e3a00768b2768ca9e721f0d360706e12b74&framework=1

Huh. Nothing? That's odd.

I dredged up what must have been my trypush from 7 months ago. The jobs are gone on treeherder but fortunately it seems perfherder still has the results:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=91df712f4eabda28d9ce5601722636fa4132b375&newProject=try&newRevision=331132e5bb70715ab184c8368d5a392d0d1bc107&framework=1

I just removed the user_pref("extensions.blocklist.enabled", false) line from the "perf" profile ( https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#49 ). Maybe I missed something?
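For reference, this is the single line removed from the perf profile (prefs-file syntax used by the testing profiles); with it gone, the blocklist's default (enabled) applies during the startup Talos runs:

```js
// testing/profiles/perf/user.js (before the patch)
// Removing this line leaves the blocklist enabled, so startup tests
// include the blocklist's real startup cost.
user_pref("extensions.blocklist.enabled", false);
```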

So the baseline numbers here have changed over the past 6 months (which may separately be something we want to look at; it doesn't look like some of the significant increases in sessionrestore even got alerts, for some reason?), but the other thing that might make a difference here is whether artifact builds are enabled on your trypushes?

Flags: needinfo?(mconley)

Yeah, you did shippable comparisons. I'll do those next. I'll also try a patch similar to your original in that comparison.

Flags: needinfo?(mconley)

Re-trying Gijs's original patch from 7 months ago on try yields no improvement anymore, so presumably scheduling has changed such that the improvement we were seeing no longer happens.

Baseline vs Gijs's patch from 7 months ago: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c9232fa45fb6306d64bde8738cbe9cefb54382e2&newProject=try&newRevision=8c1ccbda9e06f257796e49489bf8fbedf020d97f&framework=1

Baseline vs mconley's patch to enable blocklist for startup tests: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c9232fa45fb6306d64bde8738cbe9cefb54382e2&newProject=try&newRevision=608588ccea1f1b71eeee377257c5bf4ec21f4434&framework=1

Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b4b3dc473bc4
Enable the blocklist for startup Talos tests. r=rwood
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Regressions: 1598924
Status: RESOLVED → REOPENED
Flags: needinfo?(mconley)
Resolution: FIXED → ---
Target Milestone: mozilla72 → ---

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:mconley, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(mconley)

I'll get back to this soon - I just need to figure out why we get these frequent timeouts on sessionrestore_no_auto_restore with the patch.

Flags: needinfo?(mconley)

I'm just going to be dumb and as a first step, see if this is still happening: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee11f85d22b14a7293777b07f246cc50abef4302 ...

OK, so I'm finally having some luck reproducing the hang from bug 1598924 (which got this backed out) locally. My initial hypothesis was an asyncshutdown hang related to the blocklist (or something else that got tripped by additional timing), but after doing extensive logging, it seems that is Not It. Instead, now that I have a local repro and can break into a debugger, it seems we're stuck in Quota Manager ActorsParent's SpinEventLoopUntil.

This is sounding vaguely familiar from bug 1594521 where we hit something similar - but I can't tell off-hand if it's really the same issue. This also seems to only affect Windows so pernosco is out, I think? :asuth, any idea how I gather some more debug data here, and/or why we'd be stuck in this event loop for roughly 30 seconds at a time? Happy to add logging or look at specific things when stopping... it's a bit hit-and-miss to reproduce but at the moment I'm having enough luck that hopefully rebuilding and retrying won't be much of a problem.

Edit: If there is existing NSPR/MOZ_LOG logging that'd be helpful for this that I should turn on, I'd also be happy to hear about it.

I'll also note that when this happens, talos as-is ends up killing the launcher process, which leaves the "real" process behind, which then breaks the next start, which is unhelpful, and could probably be the subject of a separate bug.

Assignee: mconley → gijskruitbosch+bugs
Status: REOPENED → ASSIGNED
Flags: needinfo?(bugmail)

Only IndexedDB really has extensive logging, but it's also the most likely to have something weird going on. You can use MOZ_LOG with the "IndexedDB", "QuotaManager", "WorkerPrivate", and "WorkerTimeouts" modules.

Note that if using the profiler, IDB also can directly create profiler markers if you set the dom.indexedDB.logging.profiler-marks pref to true, although maybe the MOZ_LOG-marker-maker thing is already good enough? (Unfortunately, it seems like the treeherder "Create Gecko Profile" option both doesn't sample enough threads and I guess doesn't cover shutdown. I ran one on your orange result and got https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=292184307&revision=ee11f85d22b14a7293777b07f246cc50abef4302 to see what it did, obviously with the timeout not happening.)

Flags: needinfo?(bugmail)
Attached file log-hang.txt (deleted) —

OK. Here's output with asyncshutdown logging as well as the 4 modules you mentioned. I think the critical bit is this:

15:48:18     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled idle GC timer
15:48:18     INFO -  PID 9456 | 3:48:18 PM - DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine google@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: callAddonListeners :: line 1613\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine amazondotcom@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine ebay@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine ddg@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine bing@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine twitter@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:18     INFO -  PID 9456 | console.error: "Could not load engine wikipedia@search.mozilla.org: [Exception... \"AddonManager is not initialized\"  nsresult: \"0xc1f30001 (NS_ERROR_NOT_INITIALIZED)\"  location: \"JS frame :: resource://gre/modules/AddonManager.jsm :: installBuiltinAddon :: line 2540\"  data: no]"
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because none
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because idle
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled idle GC timer
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because none
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because idle
15:48:19     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled idle GC timer
15:48:22     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667C800 collected idle garbage
15:48:24     INFO -  PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 collected idle garbage
15:48:25     INFO -  Browser shutdown timed out after 5 seconds, waiting
15:48:32     INFO -  PID 9456 | console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.jsm", 433))
15:48:48     INFO -  PID 9456 | [Parent 3280: IPDL Background]: D/IndexedDB ConnectionPool shutting down thread 2
15:48:48     INFO -  PID 9456 | [Parent 3280: IPDL Background]: D/IndexedDB ConnectionPool shutting down thread 3
15:48:48     INFO -  PID 9456 | 3:48:48 PM - DEBUG: Starting phase profile-before-change-telemetry
15:48:48     INFO -  PID 9456 | 3:48:48 PM - DEBUG: Spinning the event loop

Note that there's pretty much exactly 30 seconds of no action in the QM/IndexedDB logging, and then it's able to shut down threads 2 and 3.

Is it possible this is due to inactivity on a DB connection which we then close once it's been idle for a sufficient amount of time, or something? I tried looking for indexeddb activity on those threads earlier in the log but wasn't able to find anything that the log didn't claim had just finished successfully... (ni for this)

Going on that hunch, though: I spotted the failing getLastModified call from the remote settings code. Though I'm fairly sure I have seen that on successful runs too, I wondered whether the error would mean we don't close the DB connection. So I wrote a speculative patch that ensures we call db.close() in finally clauses throughout the remote settings code ( https://hg.mozilla.org/try/rev/4d70983fc3d33f95844db708c0f38ba51b985e4f ). https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c0b8501689338a99dda524386e8b1e2dacf343 to see if it helps, I guess...
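The shape of that speculative fix is roughly the following (a sketch, not the actual patch; the executeIDB name and the wrapper signature are illustrative). The point is that close() runs on the error path too, so a failing call like getLastModified can't leave the connection open until IndexedDB's 30-second idle-thread timeout:

```javascript
// Sketch: wrap each remote-settings IDB operation so the connection is
// always closed, even when the operation rejects.
async function executeIDB(openDatabase, operation) {
  const db = await openDatabase();
  try {
    return await operation(db);
  } finally {
    // Without this, an error path could leave the connection open until
    // the 30s idle-thread timeout, stalling shutdown.
    db.close();
  }
}
```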

Flags: needinfo?(bugmail)

dom/indexedDB/ActorsParent.cpp:249

// The length of time that idle threads will stay alive before being shut down.
const uint32_t kConnectionThreadIdleMS = 30 * 1000;  // 30 seconds

It does seem suspiciously like yes, an IndexedDB database is being left open and idle. Although I would have expected that connection to be forced to shut down. Perhaps something weird happens where ScheduleQueuedTransactions is being called by NoteClosedDatabase, or something is effectively bringing IDB back to life when it's meant to be shutting down. Tom, do you have any thoughts?

Flags: needinfo?(bugmail) → needinfo?(ttung)

The remote settings change doesn't seem to be sufficient/correct, at least, because there is still hanging happening on https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c0b8501689338a99dda524386e8b1e2dacf343&selectedJob=292530851 (at time of writing the jobs haven't all finished but I inspected one of the remaining ones and it is hanging in the same manner as before).

Is this a thing that could theoretically reproduce on linux and under pernosco?

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)

Is this a thing that could theoretically reproduce on linux and under pernosco?

Theoretically, maybe yes, but (a) we've never seen it there that I'm aware of, and because I'm not sure what is tripping this I don't know how to make it reproduce there, and (b) I still haven't had to use pernosco (I know!), and given that it's likely to involve JS I don't know how useful it'd be (based on my experience with JS + rr).

Hm, though I've just had an idea. I think the thing that is actually making this fail on windows is the following:

  1. we hang on shutdown
  2. the harness kills us after 5 seconds (cf. https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=292530855&repo=try&lineNumber=4293 ). That kills the launcher process but nothing else.
  3. the harness restarts Firefox for the next run, with the same profile
  4. we show the "but this profile is in use" error, because the previous process hasn't gone away yet and so the lockfile is in use -- this is what hangs the test forever.

So I imagine that as-is this won't ever reproduce on linux in the "the test is orange" sense (because there's no launcher process), but it might reproduce in the "we hang on shutdown" sense, and if we could detect this and make the job error then that might make our job easier...

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)

Is this a thing that could theoretically reproduce on linux and under pernosco?

Let's try this with my idea from comment #21... : https://treeherder.mozilla.org/#/jobs?repo=try&revision=9359d9ae0aca0a379fe616cd6ba46ba471362cd0

So in that log, the .open() call late in the process is concerning:

15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled idle GC timer
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #1]: D/IndexedDB ConnectionPool closed connection 0x000001C1C9D700B0
15:48:18 INFO - PID 9456 | [Parent 3280: Main Thread]: W/IndexedDB IndexedDB {eb2fb1e0-4609-46a0-b71d-fa52c9136483}: Child Request[69]: indexedDB.open("remote-settings", 2)
15:48:18 INFO - PID 9456 | 3:48:18 PM - DEBUG: Completed blocker OS.File: flush I/O queued before profileBeforeChange for phase profile-before-change
15:48:18 INFO - PID 9456 | 3:48:18 PM - DEBUG: Finished phase profile-before-change
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic

The fact that we're seeing unbalanced ConnectionPool shutdowns after that (filtered down to just IndexedDB here) seems bad too:

15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #3]: D/IndexedDB ConnectionPool closed connection 0x000001C1C8BF7890
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #2]: D/IndexedDB ConnectionPool closed connection 0x000001C1C8BF7040
15:48:18 INFO - PID 9456 | [Parent 3280: IPDL Background]: D/IndexedDB ConnectionPool shutting down thread 1
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #5]: D/IndexedDB ConnectionPool closed connection 0x000001C1C7EC7F20
15:48:18 INFO - PID 9456 | [Parent 3280: IPDL Background]: D/IndexedDB ConnectionPool shutting down thread 5
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because none
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled periodic GC timer
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #4]: D/IndexedDB ConnectionPool closed connection 0x000001C1C9D5C2E0
15:48:18 INFO - PID 9456 | [Parent 3280: IPDL Background]: D/IndexedDB ConnectionPool shutting down thread 4
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because none
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #23)

So in that log, the .open() call late in the process is concerning:

15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 scheduled idle GC timer
15:48:18 INFO - PID 9456 | [Parent 3280: IndexedDB #1]: D/IndexedDB ConnectionPool closed connection 0x000001C1C9D700B0
15:48:18 INFO - PID 9456 | [Parent 3280: Main Thread]: W/IndexedDB IndexedDB {eb2fb1e0-4609-46a0-b71d-fa52c9136483}: Child Request[69]: indexedDB.open("remote-settings", 2)
15:48:18 INFO - PID 9456 | 3:48:18 PM - DEBUG: Completed blocker OS.File: flush I/O queued before profileBeforeChange for phase profile-before-change
15:48:18 INFO - PID 9456 | 3:48:18 PM - DEBUG: Finished phase profile-before-change
15:48:18 INFO - PID 9456 | [Parent 3280: DOM Worker]: D/WorkerPrivate Worker 000001C1C667A000 canceled GC timer because periodic

Hm, but that request completes a little later:

15:48:18 INFO - PID 9456 | [Parent 3280: Main Thread]: W/IndexedDB IndexedDB {eb2fb1e0-4609-46a0-b71d-fa52c9136483}: Child Request[69]: Firing "success" event

right? What confuses me is that the log doesn't actually include adding / completing the shutdown blocker added in bug 1594521. I guess that one is in the worker, and perhaps the worker wasn't needed in this run. We could add a shutdown blocker for the indexeddb usage from RemoteSettingsClient.jsm, but that feels odd - is the expectation that all indexeddb consumers do their own shutdown blocking?

The fact that we're seeing unbalanced ConnectionPool shutdowns after that (filtered down to just IndexedDB here) seems bad too:

Unfortunately I don't know what that means or how to act on it...

Flags: needinfo?(bugmail)

(In reply to :Gijs (back Thu 12; he/him) from comment #22)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)

Is this a thing that could theoretically reproduce on linux and under pernosco?

Let's try this with my idea from comment #21... : https://treeherder.mozilla.org/#/jobs?repo=try&revision=9359d9ae0aca0a379fe616cd6ba46ba471362cd0

This has a Linux failure (yay!), which also has:

[task 2020-03-10T21:41:12.860Z] 21:41:12 INFO - PID 11557 | ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value

I enabled pernosco on the push but I don't know if it'll work given my bugmail + hg account aren't under my mozilla.com username, and I guess even if it tries to reproduce it might not be successful, or perhaps it won't work for talos jobs, only tests?

(In reply to :Gijs (back Thu 12; he/him) from comment #24)

We could add a shutdown blocker for the indexeddb usage from RemoteSettingsClient.jsm, but that feels odd - is the expectation that all indexeddb consumers do their own shutdown blocking?

To somewhat reiterate the second part of https://bugzilla.mozilla.org/show_bug.cgi?id=1594521#c19:

  • System code really should be aware of shutdown.
    • https://phabricator.services.mozilla.com/D53751 in bug 1594521 took care of the remote settings worker's IDB usage around this, I think, but I hadn't quite realized there was still main thread stuff going on with IndexedDB.
    • Enhancement bug 1597693 that was spun off from https://bugzilla.mozilla.org/show_bug.cgi?id=1594521#c12 and later comments would address the general problem of the worker global living beyond the point that it's safe for system level JS to be running, but there's not currently any affordance like this for main-thread system JS beyond shutdown blockers that I'm aware of.
    • It might be possible to file an enhancement such that JSMs can be associated with a phase at which point their global/compartment/sandbox/whatever-they-are gets destroyed and causes nsIGlobalObject::DisconnectEventTargetObjects() to be called, which could similarly help reduce the effort needed to be shutdown aware.
  • There's definitely an IndexedDB bug, which I think might be bug 1597312 that I filed as a result of that bug discussion, so I'm going to drop the needinfo on Tom.
    • When this is fixed, code trying to use IndexedDB will receive some combination of synchronous and asynchronous errors rather than resulting in weird hangs, but this is likely to really mess up the telemetry statistics for any callers plus any telemetry IndexedDB gathers on error rates, so it's nice for callers to be shutdown aware.
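The "be shutdown aware" pattern above can be modeled in plain JS (a sketch: the Barrier class and demo function here are stand-ins, not Gecko APIs; in Firefox the real mechanism is AsyncShutdown.profileBeforeChange.addBlocker()). A consumer registers a blocker so shutdown waits for its pending IDB work rather than the connection lingering until the idle-thread timeout:

```javascript
// Minimal model of a shutdown barrier with blockers.
class Barrier {
  constructor() {
    this.blockers = [];
  }
  addBlocker(name, promiseFn) {
    this.blockers.push({ name, promiseFn });
  }
  // Resolves once every registered blocker's promise has resolved.
  async wait() {
    await Promise.all(this.blockers.map(b => b.promiseFn()));
  }
}

async function demo() {
  const profileBeforeChange = new Barrier();
  let dbClosed = false;
  // Stands in for outstanding remote-settings IndexedDB work.
  const pendingWork = new Promise(resolve =>
    setTimeout(() => {
      dbClosed = true; // db.close() after the last transaction
      resolve();
    }, 10)
  );
  profileBeforeChange.addBlocker(
    "RemoteSettingsClient: finish IndexedDB work",
    () => pendingWork
  );
  await profileBeforeChange.wait();
  return dbClosed;
}
```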
Flags: needinfo?(ttung)
Flags: needinfo?(bugmail)

(In reply to :Gijs (back Thu 12; he/him) from comment #25)

This has a Linux failure (yay!), which also has:

[task 2020-03-10T21:41:12.860Z] 21:41:12 INFO - PID 11557 | ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value

I enabled pernosco on the push but I don't know if it'll work given my bugmail + hg account aren't under my mozilla.com username, and I guess even if it tries to reproduce it might not be successful, or perhaps it won't work for talos jobs, only tests?

Okay, this is definitely bug 1597312 then, and having a pernosco repro of the situation would be fantastic to help fix it. I'll poke the pernosco self-serve API too maybe?

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #27)

(In reply to :Gijs (back Thu 12; he/him) from comment #25)

This has a Linux failure (yay!), which also has:

[task 2020-03-10T21:41:12.860Z] 21:41:12 INFO - PID 11557 | ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value

I enabled pernosco on the push but I don't know if it'll work given my bugmail + hg account aren't under my mozilla.com username, and I guess even if it tries to reproduce it might not be successful, or perhaps it won't work for talos jobs, only tests?

Okay, this is definitely bug 1597312 then, and having a pernosco repro of the situation would be fantastic to help fix it. I'll poke the pernosco self-serve API too maybe?

Please do; I'm out today and I still haven't seen any emails from pernosco (which mccr8's thread from a while back says I'm supposed to get once the pernosco run finishes?)

(In reply to :Gijs (he/him) from comment #28)

Please do; I'm out today and I still haven't seen any emails from pernosco (which mccr8's thread from a while back says I'm supposed to get once the pernosco run finishes?)

It turns out pernosco doesn't know how to reproduce talos failures at this time (and because of long runtimes and lack of a specific test to reproduce against may be difficult to add support for), so neither of our reproduction attempts worked.

Please also see bug 1507121 where it happens permanently for wdspec (wpt) tests on mozilla-beta. Maybe that helps us to investigate the problem?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #30)

Please also see bug 1507121 where it happens permanently for wdspec (wpt) tests on mozilla-beta. Maybe that helps us to investigate the problem?

Unfortunately it seems like https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-03-06&endday=2020-03-13&tree=all&bug=1507121 only shows windows failures, which doesn't help pernosco directly, although it does seem happy with the test type in general. I've issued a bunch of re-triggers on linux for the Wd1 job at https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&selectedJob=293076085&searchStr=wd&revision=048c7cb557453e3abe49216c52fb868ed9a4819c in the hope some of them will fail and then we can encourage pernosco to reproduce.

Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/403ce6f34590
Enable the blocklist for startup Talos tests. r=rwood,perftest-reviewers,whimboo
Status: ASSIGNED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
Blocks: 1629005
