Closed Bug 1582376 Opened 5 years ago Closed 5 years ago

too many task-group-resolved pulse messages

Categories: Taskcluster :: Services
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: dustin)

References

Details

Attachments

(1 file)

Brian,

Something strange is going on here - see the attached screenshot. Any ideas?

This came up in IRC today.

Bastien / Marco may have more context.

Flags: needinfo?(mcastelluccio)
Flags: needinfo?(bstack)
Flags: needinfo?(bastien)

This message is still being sent. You can check yourself by using this direct link for Pulse Inspector and "Start Listening". After a few seconds, you should see a lot of duplicates with the routing key primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._.

This is the first time in months we've seen this happen on that queue. I initially thought it was due to a bug on our side (a Python client listening to pulse events for code coverage builds), but it looks like it's on the pulse server side...

Our code coverage generation is slowed down a lot because of this bug.

Flags: needinfo?(bastien)

Ah, indeed, something's funky. I'm seeing a lot of

2019-09-19T16:01:11.760517+00:00 app[deadlineResolver.2]:  {
  "EnvVersion": "2.0",
  "Fields": {
    "exchange": "exchange/taskcluster-queue/v1/task-exception",
    "message": "PulsePublisher.sendDeadline exceeded",
    "name": "Error",
    "retries": 1,
    "routingKey": "primary.GXWgubLFSROd1XyDt9eElw.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._",
    "stack": "Error: PulsePublisher.sendDeadline exceeded\n    at failAtDeadline (/app/libraries/pulse/src/publisher.js:371:32)",
    "v": 1
  },
  "Hostname": "8877d995-6ecd-4c8a-b7b5-f0e7b2188878",
  "Logger": "taskcluster.queue.deadline-resolver",
  "Pid": 4,
  "Severity": 4,
  "Timestamp": 1568908871760000000,
  "Type": "monitor.error",
  "message": "Error: PulsePublisher.sendDeadline exceeded\n    at failAtDeadline (/app/libraries/pulse/src/publisher.js:371:32)",
  "serviceContext": {
    "service": "queue",
    "version": "6682067b968a187e31a516195a0ae4f725515a09"
  },
  "severity": "WARNING"
}

I will restart the deadline resolver. I'm guessing that something is mixed up with the error handling here that is causing the message to be repeated.

2019-09-19T16:18:52.769992+00:00 app[deadlineResolver.1]: {
  "EnvVersion": "2.0",
  "Fields": {
    "exchange": "exchange/taskcluster-queue/v1/task-group-resolved",
    "message": "Channel closed",
    "name": "IllegalOperationError",
    "retries": 1,
    "routingKey": "primary.MTTuK83kQAmfyzfNZMBSpg.gecko-level-3._",
    "stack": "IllegalOperationError: Channel closed\n    at ConfirmChannel.<anonymous> (/app/node_modules/amqplib/lib/channel.js:154:11)\n    at ConfirmChannel.C.publish (/app/node_modules/amqplib/lib/channel_model.js:158:15)\n    at ConfirmChannel.CC.publish (/app/node_modules/amqplib/lib/channel_model.js:294:20)\n    at Promise (/app/libraries/pulse/src/publisher.js:338:21)\n    at new Promise (<anonymous>)\n    at retry (/app/libraries/pulse/src/publisher.js:337:17)\n    at process._tickCallback (internal/process/next_tick.js:68:7)",
    "stackAtStateChange": "Stack capture: Connection closed: 501 (FRAME-ERROR) with message \"FRAME_ERROR - type 2, all octets = <<>>: {frame_too_large,4852,4088}\"\n    at Object.accept (/app/node_modules/amqplib/lib/connection.js:90:15)\n    at Connection.mainAccept [as accept] (/app/node_modules/amqplib/lib/connection.js:64:33)\n    at TLSSocket.go (/app/node_modules/amqplib/lib/connection.js:478:48)\n    at TLSSocket.emit (events.js:198:13)\n    at TLSSocket.EventEmitter.emit (domain.js:448:20)\n    at emitReadable_ (_stream_readable.js:554:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)",
    "v": 1
  },
  "Hostname": "5a2ff5eb-fbcf-47b1-823b-3775e66e6e8b",
  "Logger": "taskcluster.queue.deadline-resolver",
  "Pid": 4,
  "Severity": 4,
  "Timestamp": 1568909932769000000,
  "Type": "monitor.error",
  "message": "IllegalOperationError: Channel closed\n    at ConfirmChannel.<anonymous> (/app/node_modules/amqplib/lib/channel.js:154:11)\n    at ConfirmChannel.C.publish (/app/node_modules/amqplib/lib/channel_model.js:158:15)\n    at ConfirmChannel.CC.publish (/app/node_modules/amqplib/lib/channel_model.js:294:20)\n    at Promise (/app/libraries/pulse/src/publisher.js:338:21)\n    at new Promise (<anonymous>)\n    at retry (/app/libraries/pulse/src/publisher.js:337:17)\n    at process._tickCallback (internal/process/next_tick.js:68:7)",
  "serviceContext": {
    "service": "queue",
    "version": "6682067b968a187e31a516195a0ae4f725515a09"
  },
  "severity": "WARNING"
}

And in particular: Connection closed: 501 (FRAME-ERROR) with message "FRAME_ERROR - type 2, all octets = <<>>: {frame_too_large,4852,4088}". If I'm reading the Erlang tuple right, the broker received a 4852-byte frame where the negotiated limit allows a 4088-byte payload (and frame type 2 is a content-header frame, which, unlike body frames, can't be split across multiple frames).

Flags: needinfo?(mcastelluccio)
Flags: needinfo?(bstack)

That's weird, though -- that message is of (small) finite size.

Ah, but https://queue.taskcluster.net/v1/task/GXWgubLFSROd1XyDt9eElw is very large. I think the deadline resolver is trying to resolve that and other tasks as past their deadline, but is unable to send a message about them, which causes the connection to fail. It probably sends the taskGroupResolved message first, so on every try it sends another one of those.

I've temporarily turned off the deadline resolver.

I wish there were an AMQP library for node that had decent error handling...

Regressions: 1323792

In bug 1323792, we increased the number of allowable routes from 10 to 64. I remember a lot of discussion about AMQP header sizes at the time, so.. why did we do this?

It seems that was a "temporary" change, in hopes it wouldn't bite us "soon". Well, it's bitten us.

> I've temporarily turned off the deadline resolver.

Due to a fun UI bug in Heroku, I had in fact dialed the deadline resolver up to 46 dynos. It's now at zero dynos.

By writing some test code, I can confirm that sending a message with a too-big header kills the AMQP connection and produces an error from the channel.publish callback. The publish code then retries (once), first waiting for a new connection and channel.
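
For the record, a minimal sketch of such a test (assumptions: a local RabbitMQ at amqp://localhost, a made-up exchange and routes; this is an illustration, not the exact test code). AMQP splits message bodies across frames, but the single content-header frame carrying the headers can't be split, so a pile of CC'd routes blows the frame limit. Note also that amqplib's default frameMax appears to be 0x1000 = 4096 bytes, which matches the 4088-byte payload limit in the error above:

    // Sketch: publish a message whose headers exceed the negotiated frame size.
    const amqp = require('amqplib');

    async function main() {
      // frameMax=4096 makes the small limit explicit (it is also amqplib's default).
      const conn = await amqp.connect('amqp://localhost?frameMax=4096');
      conn.on('error', err => console.log('connection error:', err.message));

      const ch = await conn.createConfirmChannel();

      // Lots of long CC'd routes, roughly how the queue's extra routes are carried;
      // together they push the content-header frame well past frameMax.
      const routes = Array.from({length: 64}, (_, i) => `route.${'x'.repeat(80)}.${i}`);

      ch.publish('amq.topic', 'some.routing.key', Buffer.from('{}'), {CC: routes},
        err => console.log('publish callback:', err ? err.message : 'confirmed'));
    }

    main().catch(err => console.log('failed:', err.message));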

I can't see anything distinct about the error object that would let me know that it's not transient. AMQP does a good job of obscuring its errors, unfortunately.

I just restarted both testing & production pulse listeners on our side: they seem to work OK now.

Confirmed, there's no way to get this information from amqplib. I did find a minor patch to make tc-lib-pulse a little better-behaved in this case (it was returning an error from .send() and then retrying the send one more time), but that wouldn't have affected this bug.

https://github.com/taskcluster/taskcluster/pull/1373 contains that fix and a change to frame_max. That should get us out of the woods here.
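
The frame_max side of that amounts to asking for a bigger frame at connection time, something like the sketch below (the URL and the 128 KiB value are illustrative, not necessarily what the PR uses):

    // Sketch: request a larger frame size so a content-header frame carrying
    // many CC routes still fits in one frame. URL and size are illustrative.
    const amqp = require('amqplib');

    async function connect() {
      return amqp.connect('amqps://pulse.mozilla.org?frameMax=131072'); // 128 KiB
    }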

I don't see the link between these too-big messages and looping sends of this message, though.

I turned it on briefly, and recorded the messages it failed to send:

{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.V_U3LTzGTL-qT_4LEh-jhA.0._._.gce.gecko-3-b-linux.gecko-level-3.RMtKaBrhRPCsuGwGcdx-Xg._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.fr9UR3s3QWmaDFIrUog1RQ.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.CYGjuZ31TyyHs9hzWt9ZmA.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cY_J14rWTua96gP7rbN10g.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.A6HQb6jLSFeULG3QCsX5Cg.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.YKQqX5XUQVqFAUggCH-qCw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.UVKH5VeNThqf66pQI2I97w.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.LQOOdQLQTDOah6mn2HyUJA.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.GU_PzIl0SEyRExL0vmKI6A.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.LQOOdQLQTDOah6mn2HyUJA.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.EzkmaAINTG6rnSv6kU5Pmw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.UVKH5VeNThqf66pQI2I97w.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.FrueLHbFSsOhmGFinRdAqQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.DeGGEeiYSX-6c4RaOHy5Xw.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.LQOOdQLQTDOah6mn2HyUJA.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.FrueLHbFSsOhmGFinRdAqQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.FrueLHbFSsOhmGFinRdAqQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.YKQqX5XUQVqFAUggCH-qCw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.SQje13b8S6ifkotvIzfXHQ.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.UC-_xRI_SFm3Ii1sPhRK9g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.SQje13b8S6ifkotvIzfXHQ.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.DeGGEeiYSX-6c4RaOHy5Xw.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.EzkmaAINTG6rnSv6kU5Pmw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.ZzbToLumSumLYPB9dTF_pQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.ZzbToLumSumLYPB9dTF_pQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.IJBy2JjIQBm74PHI0W-tdw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.fTY7QS8OQeSRF_SiANv4tQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.fTY7QS8OQeSRF_SiANv4tQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cz9KmxQqTwuwGUmMge6YhA.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cwQFLDNnTIiAvssh0Hbc0A.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.YKQqX5XUQVqFAUggCH-qCw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.UC-_xRI_SFm3Ii1sPhRK9g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cwQFLDNnTIiAvssh0Hbc0A.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}                                                                                                                                                                                                                                               
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.YKQqX5XUQVqFAUggCH-qCw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}                                                                                                                                                                                    
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.FrueLHbFSsOhmGFinRdAqQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cz9KmxQqTwuwGUmMge6YhA.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.GU_PzIl0SEyRExL0vmKI6A.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.YJDMKPguSqGlcIplywwW9Q.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.CwOBwksIQpi8pUcBsWl6ng.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.CwOBwksIQpi8pUcBsWl6ng.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.CwOBwksIQpi8pUcBsWl6ng.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.a3T5EjqFTimLfobr5zZnLQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.OKzh57orSuK9lwRc-itm2g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.OKzh57orSuK9lwRc-itm2g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.Ydj5iGVjQFiwQbMuHnHs1A.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-group-resolved","routingKey":"primary.SIlcBZkkT3e-8ZFxVK4nwQ.comm-level-1._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.IJBy2JjIQBm74PHI0W-tdw.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.B1KQlYqzRNWZm1jEvA4YTQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.Olyq7Z41QvSvlIEYAReaAQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.BrEq8PRqQXKtSy2GooIGDQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.FrueLHbFSsOhmGFinRdAqQ.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.OKzh57orSuK9lwRc-itm2g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.bWWgsPMPQF-P_EulVs9w8Q.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.F7YGbJ3aQnKhR1JpaJv-yQ.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.f1Pdm3CiR4ql46LPQiI3xg.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.f1Pdm3CiR4ql46LPQiI3xg.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.Fxj8_Wg2Sg6v0R0xcMdheA.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.Fxj8_Wg2Sg6v0R0xcMdheA.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.QxmLR1R6QqOv4pD1wmnHiw.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.P97TcU5XQAq7HDiw5B7QWQ.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.cwQFLDNnTIiAvssh0Hbc0A.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.CwOBwksIQpi8pUcBsWl6ng.0._._.aws-provisioner-v1.gecko-1-b-win2012.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.OKzh57orSuK9lwRc-itm2g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}
{"exchange":"exchange/taskcluster-queue/v1/task-exception","routingKey":"primary.f2ZHqZ-RS928Zt4FUrLF6g.0._._.aws-provisioner-v1.gecko-1-b-linux.comm-level-1.SIlcBZkkT3e-8ZFxVK4nwQ._"}

So I think what's happening is that deadlineResolver is sending one message to say "this task is past its deadline" and then another to indicate the taskGroup is complete, for each task. And something's failing there so that on the next deadlineResolver iteration it does the same thing all over again. Two questions: (1) why does it think the taskGroup is finished after each task? (2) why does it send the same message for each task on each iteration?

I can answer both of those!

deadlineResolver operates on an Azure queue, and doesn't mark the queued item as "done" until processing has succeeded. That's why it keeps retrying (2).
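
In sketch form (hypothetical names, not the actual queue code; the point is the ordering):

    // The Azure queue item is deleted only after *all* work, including both
    // pulse publishes, has succeeded. All names here are illustrative.
    async function handleDeadlineItem({azureQueue, markTaskException, publisher}) {
      const item = await azureQueue.getMessage();      // becomes invisible, not deleted
      if (!item) return;

      const {taskId, taskGroupId} = JSON.parse(item.messageText);

      await markTaskException(taskId);                 // task leaves TaskGroupActiveSet here
      await publisher.taskException({taskId});         // if either publish throws...
      await publisher.taskGroupResolved({taskGroupId});

      await azureQueue.deleteMessage(item);            // ...we never get here, so the item
                                                       // reappears and everything re-runs,
                                                       // taskGroupResolved included
    }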

Whether to send a taskGroup completion message is based on whether there are any active tasks left in the taskGroup:

    // check for emptiness of the partition
    let result = await this.TaskGroupActiveSet.query({taskGroupId}, {limit: 1});

and that's already zero, since all of these tasks have already been marked as resolved in the Azure queue (1).

This is probably OK -- it's all consistent with the "at least once" semantics of pulse; it just happens to be a lot more than once.
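
(The usual defense on the consuming side is to treat each message as an idempotent signal and drop duplicates; a rough sketch, with invented names, of what a listener could do:)

    // Consumer-side at-least-once hygiene (all names invented): dedupe on
    // (exchange, routingKey) within a window and do the real work once.
    const seen = new Set();
    const processEvent = routingKey => console.log('processing', routingKey); // stand-in

    function onPulseMessage({exchange, routingKey}) {
      const key = `${exchange}!${routingKey}`;
      if (seen.has(key)) return;                       // duplicate delivery, already handled
      seen.add(key);
      setTimeout(() => seen.delete(key), 3600 * 1000); // forget after an hour
      processEvent(routingKey);
    }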

OK, I landed part of https://github.com/taskcluster/taskcluster/pull/1373 on queue and turned the deadline resolver back on, and it seems to have spat out 170 (or maybe 510?) taskGroupResolved messages for that taskGroup, but then stopped. So I think we're good. I'll land the real thing Monday.

Assignee: nobody → dustin

Nice sleuthing, dustin! 👍

OK, landed the real thing and deployed to queue.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
