Closed Bug 1347945 Opened 8 years ago Closed 7 years ago

Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Depends on 1 open bug)

Details

Occurred twice, and each time affected both stage and prod. The latest alert was at 1230 UTC today; the previous at 2215 UTC yesterday. It seemed mainly to affect the `log_crossreference_error_lines` queues, rather than the standard log parser queues.

Example slow transaction traces on New Relic:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489673405&tw%5Bstart%5D=1489587005#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=1be2a27e-09e1-11e7-aa5e-f8bc124256a0_29864_31156
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1489675876&tw%5Bstart%5D=1489654276#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d6661696c7572652d6c696e6573222c22225d&app_trace_id=d31c019f-0a3f-11e7-aa5e-f8bc124256a0_21419_22707

Trace #1 was from last night and took 57s; #2 was from today and took 6s. (There are many more examples of each.)

Most of the profile is spent downloading files such as:
https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log

The file is only 3 MB, so it shouldn't have taken that long. Notably, however, these files are not being served with gzip, even though normal logs are. Compare:

curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
curl -IL --compressed "https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ/runs/0/artifacts/public/logs/live_backing.log"

The latter's response (after the HTTP 303) includes `Content-Encoding: gzip`, whereas the former's does not.

In addition, it would be good to get rid of the HTTP 303 if possible. It's presumably a side effect of the live log streaming features, but since the log URL is only submitted after the job has completed, we might be able to avoid it.
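For reference, a minimal sketch (not Treeherder's actual code) of how a client could check whether a response advertises gzip, since header casing varies between servers:

```python
def is_gzipped(headers):
    """Return True if the response headers advertise gzip content encoding.

    `headers` is a plain dict of header name -> value; lookup is
    case-insensitive, since servers differ in header casing.
    """
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            return "gzip" in value.lower()
    return False
```

This mirrors what the `curl -IL --compressed` comparison above shows manually: the live_backing.log response would return True, the errorsummary.log response False.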
Depends on: 1347956
So the live_backing.log links are compressed because that's a file the worker itself maintains, so it knows to upload it compressed. The worker makes no assumption about the compression to use when uploading files that a task creates. We could probably just assume that all text/plain artifacts should be gzipped. Jonas, thoughts?
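As an illustration only (this is a hypothetical sketch of the heuristic proposed above, not the worker's actual upload code), the decision could key off the artifact's MIME type:

```python
# MIME types beyond text/* that are textual and compress well.
# This set is an assumption for illustration, not an exhaustive list.
TEXTUAL_TYPES = {"application/json", "application/xml"}

def should_gzip(content_type):
    """Decide whether to gzip an artifact at upload time, based on
    its declared Content-Type (parameters like charset are ignored)."""
    main = content_type.split(";")[0].strip().lower()
    return main.startswith("text/") or main in TEXTUAL_TYPES
```

Binary artifacts (images, archives) would be left as-is, since they are typically already compressed.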
Flags: needinfo?(jopsen)
Looks like I commented on the wrong bug. I'm going to duplicate my comment under bug 1347956 and flag Jonas. Sorry.
Flags: needinfo?(jopsen)
Depends on: 1348071
Depends on: 1348072
Making this bug more generic, since the alerts are still periodically occurring, but for a variety of reasons.
Summary: [Alert] Cloudamqp: Queue total messages alarm: log_crossreference_error_lines (2017-03-16) → Periodic Treeherder CloudAMQP alerts about backlogs on the log parsing queues
Depends on: 1372639
Bug 1370359 reduced the size of the error_summary json files, which should help very slightly here.
Depends on: 1295997, 1294544
Depends on: 1370359
Depends on: 1372668
Assignee: nobody → emorley
Depends on: 1372922
There was another log parser backlog alert in the last hour, due to logs that are 80 MB compressed (720 MB uncompressed!). For example:

https://queue.taskcluster.net/v1/task/Akn3hNH4S_as2GdcAAFsCw/runs/0/artifacts/public/logs/live_backing.log
https://tools.taskcluster.net/groups/ZWxhSUOhTb6lrzXkDZwp8Q/tasks/Akn3hNH4S_as2GdcAAFsCw/details

Miko, would it be possible to reduce the number of jobs run on each try push, or else reduce the amount of log output? Cancelling in-progress try jobs when you no longer need the remaining results (e.g. after pushing again) would help too :-)
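With logs this size, holding the whole decompressed file in memory is a concern for the parsers. One mitigation (a sketch under my own assumptions, not Treeherder's actual implementation) is to decompress and iterate gzip logs in fixed-size chunks:

```python
import gzip

def iter_log_lines(fileobj, chunk_size=64 * 1024):
    """Yield decoded lines from a gzip-compressed file object,
    decompressing in fixed-size chunks to bound memory use."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        buf = b""
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            # Everything before the last newline is complete lines;
            # the remainder is carried over into the next iteration.
            *lines, buf = buf.split(b"\n")
            for line in lines:
                yield line.decode("utf-8", errors="replace")
        if buf:
            yield buf.decode("utf-8", errors="replace")
```

Peak memory then stays around `chunk_size` plus the longest line, regardless of whether the log is 3 MB or 720 MB.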
(In reply to Ed Morley [:emorley] from comment #5)
> Miko, would it be possible to reduce the number of jobs that are run on each
> try push, or else reduce the amount of log output? Cancelling in-progress
> try jobs when you don't need the remaining results (eg pushing again) would
> help too :-)

Sorry about that! I forgot to remove (some very spammy) debug prints before pushing to try; I'll be more mindful of them from now on.
Priority: P2 → P1
These alerts have mostly gone away; the dependent bugs remain open for additional mitigations.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED