Bug 620596 (Open): opened 14 years ago, updated 8 years ago

Need some form of queue for posting of results to graphs server

Categories

(Webtools Graveyard :: Graph Server, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: fox2mike, Unassigned)

Details

If talos machines are unable to reach the graphs DB, tests fail (as of now) and the tree starts showing up red. This should be modified to use a queue system that changes the tree colour to flag possible issues with uploading to graphs, and retries every x minutes over y hours before failing and going red. Opinions? Thoughts? I'm CC'ing zandr since he's going to have a talk with joduinn about this in person.
Also, I understand that this might not be the desired behaviour, but I would like some discussion before we decide one way or another.
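A minimal sketch of the retry behaviour being floated above, assuming a hypothetical endpoint URL and made-up interval/deadline values ("every x minutes over y hours" is left unspecified in this bug):

    import time
    import urllib.request

    # Hypothetical endpoint; the real graphs-server URL is not given in this bug.
    GRAPH_SERVER_URL = "http://graphs.example.org/collect"
    RETRY_INTERVAL = 5 * 60        # "every x minutes": retry every 5 minutes
    RETRY_DEADLINE = 2 * 60 * 60   # "over y hours": give up after 2 hours

    def post_results_with_retry(payload: bytes) -> bool:
        """Try to post results, retrying until the deadline expires.

        Returns True on success; False means the caller should turn the
        tree red (the upload genuinely failed, not just a transient blip).
        """
        deadline = time.monotonic() + RETRY_DEADLINE
        while True:
            try:
                req = urllib.request.Request(GRAPH_SERVER_URL, data=payload)
                with urllib.request.urlopen(req, timeout=30) as resp:
                    if resp.status == 200:
                        return True
            except OSError:
                pass  # graphs DB/network unreachable; fall through to retry
            if time.monotonic() >= deadline:
                return False
            time.sleep(RETRY_INTERVAL)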
I think this is a Graph Server bug, not necessarily a RelEng one.
Component: Release Engineering → Graph Server
Product: mozilla.org → Webtools
QA Contact: release → graph.server
Not really. Who handles the part where the talos machines write to the graph server? RelEng I bet :)

19:24:14 < fox2mike> bhearsum: who handles the code that makes the talos machines contact the graphs server?
19:24:23 < bhearsum> releng
19:24:27 < bhearsum> it should be a server side queue, though
Component: Graph Server → Release Engineering
Product: Webtools → mozilla.org
QA Contact: graph.server → release
Fine, this can stay here. I still don't believe that such a queue should be disassociated from the server, though.
If the graphs server team handles the talos code, I'd be happy to pass this to them :) I'm not the one who decides where release engineering related bugs go, so I'll defer to you guys on that :D
We're bikeshedding about the wrong thing here.

1) graphserver as a SPOF is not new. I agree that some form of redundant collector would be good, but only if we can do it without turning graphserver into Son of Socorro.

2) Talos machines don't have any long-term persistence. They reboot and clobber with great frequency. So if the results aren't posted, they're gone. As such, this is desired behavior in the current world. There is a whole different discussion about distinguishing between failed tests and failed testers, but that's Not Trivial.

3) The recent breakage was caused by graphserver posting to the AMO db, and apparently doing that synchronously with the post from the slave. That is insane, unacceptable, and the response to https://bugzilla.mozilla.org/show_bug.cgi?id=620570#c10 is where that conversation will take place.
I think the most basic solution here is a message queue, with the Talos results as producers and the graph server as a consumer. Talos/unittests might need some adjustment if the graph server sends back any data.
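A rough sketch of that producer/consumer shape. Nothing in this bug picks a broker; RabbitMQ via pika is used here purely as an illustration, and the queue name is invented:

    import pika  # illustrative broker client; the bug doesn't mandate a technology

    QUEUE = "talos_results"  # hypothetical queue name

    # Producer side (Talos slave): hand results to the broker and move on,
    # instead of blocking on the graphs DB being reachable.
    def publish_result(payload: bytes) -> None:
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue=QUEUE, durable=True)  # survive broker restarts
        channel.basic_publish(exchange="", routing_key=QUEUE, body=payload)
        conn.close()

    # Consumer side (graph server): drain the queue into the graphs DB.
    def consume_results(write_to_db) -> None:
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue=QUEUE, durable=True)

        def on_message(ch, method, properties, body):
            write_to_db(body)  # the slow DB write happens here, not on the slave
            ch.basic_ack(delivery_tag=method.delivery_tag)

        channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
        channel.start_consuming()

Note that a plain queue is one-way, which is why the comment above flags the case where the graph server needs to send data back to Talos.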
Component: Release Engineering → Talos
Product: mozilla.org → Testing
QA Contact: release → talos
Version: other → Trunk
Component: Talos → Webdev
Product: Testing → mozilla.org
Version: Trunk → other
So this is a graphserver issue. The queue needs to be on the graphserver side, not the Talos side (potentially it could live elsewhere in the infrastructure as well, but I'm guessing graphserver makes the most sense). That said, graphserver is going to be replaced with datazilla, which already has such a queuing system in place.
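For comparison, a stdlib-only sketch of the graphserver-side variant: the HTTP handler only enqueues and acknowledges, and a worker thread does the slow DB write off the request path. The port, response code, and the write_to_graphs_db helper are all illustrative assumptions, not anything specified in this bug:

    import queue
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # In-process buffer between the HTTP front end and the (slow) DB writer.
    pending = queue.Queue()

    class CollectHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            pending.put(self.rfile.read(length))  # enqueue; don't touch the DB
            self.send_response(202)               # accepted, processed later
            self.end_headers()

    def write_to_graphs_db(payload: bytes) -> None:
        pass  # placeholder for the real graphs DB insert

    def db_writer():
        while True:
            write_to_graphs_db(pending.get())

    if __name__ == "__main__":
        threading.Thread(target=db_writer, daemon=True).start()
        HTTPServer(("", 9000), CollectHandler).serve_forever()

This keeps the slave's POST fast and decouples it from DB outages, which is the failure mode described in comment 0.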
Component: Webdev → Graph Server
Product: mozilla.org → Webtools
Product: Webtools → Webtools Graveyard