Closed
Bug 1347945
Opened 8 years ago
Closed 7 years ago
Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues
Categories
(Tree Management :: Treeherder: Infrastructure, enhancement, P1)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
References
(Depends on 1 open bug)
Details
Occurred twice, and each time affected both stage and prod.
Latest alert was at 1230 UTC today, previous at 2215 UTC yesterday.
Seemed to mainly affect the `log_crossreference_error_lines` queues, rather than the standard log parser queues.
Example slow transaction traces on NR:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489673405&tw%5Bstart%5D=1489587005#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=1be2a27e-09e1-11e7-aa5e-f8bc124256a0_29864_31156
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489675876&tw%5Bstart%5D=1489654276#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=d31c019f-0a3f-11e7-aa5e-f8bc124256a0_21419_22707
#1 was from last night and took 57s.
#2 was from today and took 6s.
(There are many more examples of each)
Most of the profile time is spent downloading the log, e.g.:
https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log
The file is only 3 MB, so it shouldn't have taken that long.
Notably, however, these files are not being served with gzip, even though normal logs are.
i.e. compare:
curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
curl -IL --compressed "https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ/runs/0/artifacts/public/logs/live_backing.log"
The latter's response (after the HTTP 303) is `Content-Encoding: gzip` whereas the former is not.
In addition, it would be good to get rid of the HTTP 303 if possible. It's presumably a side effect of the live log streaming feature, but since the log URL is only submitted after the job completes, we might be able to avoid it.
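For context on why the missing `Content-Encoding: gzip` matters: plain-text logs like these are highly repetitive and compress very well, so serving them uncompressed can multiply the bytes transferred many times over. A minimal local sketch of the effect (the sample log line below is made up, not taken from a real errorsummary file):

```python
import gzip

# Hypothetical sample: errorsummary logs are repetitive JSON-lines text.
line = b'{"status": "FAIL", "test": "layout/reftests/example.html"}\n'
raw = line * 50000  # roughly 2.9 MB of uncompressed log text

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}x")
```

Real logs are less uniform than this, so the actual ratio will be lower, but text logs commonly shrink by an order of magnitude under gzip.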
Comment 1•8 years ago
The links to live_backing.log are compressed because that is a file the worker itself maintains, so it knows to upload it gzipped.
The worker makes no assumptions about what compression to use for artifacts that a task creates. We could probably just assume that all text/plain artifacts should be gzipped.
Jonas, thoughts?
Flags: needinfo?(jopsen)
Comment 2•8 years ago
Looks like I commented on the wrong bug. I'm going to duplicate my comment under bug 1347956 and flag Jonas there. Sorry!
Flags: needinfo?(jopsen)
Assignee
Updated•8 years ago
Comment 3•7 years ago
Making this bug more generic, since the alerts are still periodically occurring, but for a variety of reasons.
Summary: [Alert] Cloudamqp: Queue total messages alarm: log_crossreference_error_lines (2017-03-16) → Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues
Assignee
Comment 4•7 years ago
Bug 1370359 reduced the size of the error_summary JSON files, which should help very slightly here.
Assignee
Updated•7 years ago
Assignee: nobody → emorley
Assignee
Comment 5•7 years ago
There was another log parser backlog alert in the last hour, due to logs that are 80 MB compressed (720 MB uncompressed!). For example:
https://queue.taskcluster.net/v1/task/Akn3hNH4S_as2GdcAAFsCw/runs/0/artifacts/public/logs/live_backing.log
https://tools.taskcluster.net/groups/ZWxhSUOhTb6lrzXkDZwp8Q/tasks/Akn3hNH4S_as2GdcAAFsCw/details
Miko, would it be possible to reduce the number of jobs that are run on each try push, or else reduce the amount of log output? Cancelling in-progress try jobs when you don't need the remaining results (eg pushing again) would help too :-)
Comment 6•7 years ago
(In reply to Ed Morley [:emorley] from comment #5)
> Miko, would it be possible to reduce the number of jobs that are run on each
> try push, or else reduce the amount of log output? Cancelling in-progress
> try jobs when you don't need the remaining results (eg pushing again) would
> help too :-)
Sorry about that! I forgot to remove (some very spammy) debug prints before pushing to try, I'll be more mindful of them from now on.
Assignee
Updated•7 years ago
Priority: P2 → P1
Assignee
Comment 7•7 years ago
These have mostly gone away; dep bugs are open for additional mitigations.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED