Closed Bug 1347945 Opened 8 years ago Closed 7 years ago

Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Depends on 1 open bug)

Details

Occurred twice, and each time affected both stage and prod. The latest alert was at 1230 UTC today; the previous at 2215 UTC yesterday. It seemed mainly to affect the `log_crossreference_error_lines` queues, rather than the standard log parser queues.

Example slow transaction traces on New Relic:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489673405&tw%5Bstart%5D=1489587005#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=1be2a27e-09e1-11e7-aa5e-f8bc124256a0_29864_31156
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489675876&tw%5Bstart%5D=1489654276#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=d31c019f-0a3f-11e7-aa5e-f8bc124256a0_21419_22707

Trace #1 was from last night and took 57s; #2 was from today and took 6s. (There are many more examples of each.)

Most of the profile is spent downloading files such as:
https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log

The file is only 3 MB, so it shouldn't have taken that long. Notably, however, these files are not being served with gzip, even though normal logs are. Compare:

curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
curl -IL --compressed "https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ/runs/0/artifacts/public/logs/live_backing.log"

The latter's response (after the HTTP 303) includes `Content-Encoding: gzip`, whereas the former's does not.

In addition, it would be good to get rid of the HTTP 303 if possible. It's presumably a side effect of the live log streaming features, but since the log URL is only submitted after the job has completed, we might be able to avoid it.
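For reference, a minimal sketch (not Treeherder's actual code) of how a client could check whether a response advertises gzip, since header casing varies between servers:

```python
def is_gzipped(headers):
    """Return True if the response headers advertise gzip content encoding.

    `headers` is a plain dict of header name -> value; lookup is
    case-insensitive, since servers differ in header casing.
    """
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            return "gzip" in value.lower()
    return False
```

This mirrors what the `curl -IL --compressed` comparison above shows manually: the live_backing.log response would return True, the errorsummary.log response False.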
Depends on: 1347956
So the live_backing.log links are compressed because that's a file the worker itself maintains, so it knows to upload it compressed. The worker makes no assumption about the compression to use when uploading files that a task creates. We could probably just assume that all text/plain artifacts should be gzipped. Jonas, thoughts?
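As an illustration only (this is a hypothetical sketch of the heuristic proposed above, not the worker's actual upload code), the decision could key off the artifact's MIME type:

```python
# MIME types beyond text/* that are textual and compress well.
# This set is an assumption for illustration, not an exhaustive list.
TEXTUAL_TYPES = {"application/json", "application/xml"}

def should_gzip(content_type):
    """Decide whether to gzip an artifact at upload time, based on
    its declared Content-Type (parameters like charset are ignored)."""
    main = content_type.split(";")[0].strip().lower()
    return main.startswith("text/") or main in TEXTUAL_TYPES
```

Binary artifacts (images, archives) would be left as-is, since they are typically already compressed.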
Flags: needinfo?(jopsen)
Looks like I commented on the wrong bug. I'm going to duplicate my comment under bug 1347956 and flag Jonas. Sorry.
Flags: needinfo?(jopsen)
Depends on: 1348071
Depends on: 1348072
Making this bug more generic, since the alerts are still periodically occurring, but for a variety of reasons.
Summary: [Alert] Cloudamqp: Queue total messages alarm: log_crossreference_error_lines (2017-03-16) → Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues
Depends on: 1372639
Bug 1370359 reduced the size of the error_summary json files, which should help very slightly here.
Depends on: 1295997, 1294544
Depends on: 1370359
Depends on: 1372668
Assignee: nobody → emorley
Depends on: 1372922
There was another log parser backlog alert in the last hour, due to logs that are 80 MB compressed (720 MB uncompressed!). For example:

https://queue.taskcluster.net/v1/task/Akn3hNH4S_as2GdcAAFsCw/runs/0/artifacts/public/logs/live_backing.log
https://tools.taskcluster.net/groups/ZWxhSUOhTb6lrzXkDZwp8Q/tasks/Akn3hNH4S_as2GdcAAFsCw/details

Miko, would it be possible to reduce the number of jobs run on each try push, or else reduce the amount of log output? Cancelling in-progress try jobs when you no longer need the remaining results (e.g. after pushing again) would help too :-)
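With logs this size, holding the whole decompressed file in memory is a concern for the parsers. One mitigation (a sketch under my own assumptions, not Treeherder's actual implementation) is to decompress and iterate gzip logs in fixed-size chunks:

```python
import gzip

def iter_log_lines(fileobj, chunk_size=64 * 1024):
    """Yield decoded lines from a gzip-compressed file object,
    decompressing in fixed-size chunks to bound memory use."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        buf = b""
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            # Everything before the last newline is complete lines;
            # the remainder is carried over into the next iteration.
            *lines, buf = buf.split(b"\n")
            for line in lines:
                yield line.decode("utf-8", errors="replace")
        if buf:
            yield buf.decode("utf-8", errors="replace")
```

Peak memory then stays around `chunk_size` plus the longest line, regardless of whether the log is 3 MB or 720 MB.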
(In reply to Ed Morley [:emorley] from comment #5)
> Miko, would it be possible to reduce the number of jobs that are run on each
> try push, or else reduce the amount of log output? Cancelling in-progress
> try jobs when you don't need the remaining results (eg pushing again) would
> help too :-)

Sorry about that! I forgot to remove (some very spammy) debug prints before pushing to try; I'll be more mindful of them from now on.
Priority: P2 → P1
These alerts have mostly gone away; the dependent bugs remain open for additional mitigations.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED