Closed
Bug 1125124
Opened 10 years ago
Closed 10 years ago
Figure out why the log processor nodes regularly stop handling tasks
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1123479
People
(Reporter: emorley, Unassigned)
Details
This morning the sheriffs reported many jobs with unparsed logs.
It turns out two of the prod log processors (processor2 and processor3) had stopped handling tasks - the drop is visible here, up until the point I restarted them:
https://rpm.newrelic.com/accounts/677903/servers/4303155?tw[end]=1422021783&tw[start]=1421978583
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422021846&tw[start]=1421935446
One hadn't been taking jobs for ~7 hours, the other ~10 hours.
This resulted in a 40,000-job backlog for log parsing.
We have other bugs on file for (a) getting queue counts into New Relic, and thus alerts, and (b) improving log parser performance - but it would be great to figure out why the workers stop taking tasks like this.
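For anyone checking the backlog by hand in the meantime, per-queue message counts should be visible on the rabbitmq node with something along these lines (assuming the 'treeherder' vhost):
# list all queues with their current message counts
sudo rabbitmqctl list_queues -p treeherder name messages
# or just the log parsing queues
sudo rabbitmqctl list_queues -p treeherder name messages | grep log_parser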
Reporter
Comment 1•10 years ago
I should add - there hadn't been a deployment for 4.5 days, so it wasn't due to bug 1079270.
Reporter
Comment 2•10 years ago
Three hours after restarting the log processors, the queues are going down, but still:
default 1852
high_priority 0
log_parser 24042
log_parser_fail 0
log_parser_hp 14109
I think we could get away with clearing the log_parser_hp queue, since as I understand it it only contains on-demand log parse requests triggered by people clicking around in the UI - all of which should duplicate tasks in the main queue - if my hunch about bug 1125088 is correct.
However I can't clear the log_parser_hp queue: rabbitmqctl doesn't support clearing queues, I don't have the password for the web admin page, and the only other CLI option (rabbitmqadmin) doesn't appear to be installed.
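For reference, once either the admin credentials or rabbitmqadmin are available, purging should be possible along one of these lines - the management API port and credentials here are assumptions, not something I've verified against our deployment:
# via rabbitmqadmin, if it gets installed
rabbitmqadmin --vhost=treeherder purge queue name=log_parser_hp
# or directly against the management HTTP API
curl -i -u ADMIN_USER:ADMIN_PASSWORD -X DELETE http://localhost:15672/api/queues/treeherder/log_parser_hp/contents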
Reporter
Comment 3•10 years ago
treeherder-processor3 just became idle again, so I had to restart them all (I couldn't run |sudo supervisorctl restart run_celery_worker_gevent| on just that node, since it prompted for a password when sudoing).
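For what it's worth, a sudoers entry roughly like the one below would let that restart run on an individual node without the password prompt - the username and supervisorctl path are assumptions, so treat this as a sketch for the IT request rather than something to paste in verbatim:
# e.g. in /etc/sudoers.d/treeherder-log-workers (hypothetical file)
emorley ALL=(root) NOPASSWD: /usr/bin/supervisorctl restart run_celery_worker_gevent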
Reporter
Comment 4•10 years ago
Slowly getting there:
default 2168
high_priority 0
log_parser 21430
log_parser_fail 11
log_parser_hp 9148
Reporter
Comment 5•10 years ago
default 2490
log_parser 21219
log_parser_fail 17
log_parser_hp 5850
Reporter
Comment 6•10 years ago
One step forward, two back:
default 3015
log_parser 31179
log_parser_fail 1
log_parser_hp 2092
Reporter
Comment 7•10 years ago
We appear to have lost proc3 again:
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422053068&tw[start]=1422049468
default 3569
log_parser 33399
log_parser_fail 53
log_parser_hp 516
Restarting.
Reporter
Comment 8•10 years ago
Well on the plus side we're catching up:
default 4955
log_parser 14587
log_parser_fail 4
log_parser_hp 0
The only problem is that this is only because we're not ingesting any completed jobs (bug 1125410), so once that's resolved we'll have another backlog. Yey \o/
Reporter
Comment 9•10 years ago
default 834
log_parser 87
log_parser_fail 3
log_parser_hp 0
Comment 10•10 years ago
This happened again, so bumping the priority, since it makes sheriffing more complicated when we have no insight into the results.
Severity: normal → critical
Reporter
Comment 11•10 years ago
And again:
buildapi 0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
celeryev.1bda9ad7-fcff-429b-b520-5dcf58333982 0
celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
celeryev.609cc011-de8a-4f17-9ae9-70d944116c1d 17110
celeryev.86309a15-671d-4a24-bc6d-242d4afced72 3606
celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
default 1220
high_priority 0
log_parser 4525
log_parser_fail 478
log_parser_hp 464
pushlog 0
treeherder-processor[23] have been idle for the last 60-90 mins.
I've restarted them.
(In reply to Carsten Book [:Tomcat] from comment #10)
> this happened again, so bumping priority since this makes sheriffing more
> complicated with having no insight into the results
Agreed, this is very high priority - but we use the priority field rather than the severity field, and the priority field is already at P1.
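As a quick diagnostic next time a node goes quiet, it would be worth checking whether the "idle" workers still hold consumers on the log_parser queue before restarting them - e.g. on the rabbitmq node:
# message counts alongside the number of attached consumers per queue
sudo rabbitmqctl list_queues -p treeherder name messages consumers
# or list the individual consumers and which channel they're on
sudo rabbitmqctl list_consumers -p treeherder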
Reporter
Comment 12•10 years ago
processor3 disconnected again; restarted it. Worker log:
http://pastebin.mozilla.org/8378673 (the restart started at 13:08:59)
Queues are coming down now:
[emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
Listing queues ...
buildapi 0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
celeryev.6104655e-5688-4e29-a364-d9f59831dc80 0
celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
celeryev.b6c24cfb-fe30-4067-b962-9fa9eee76633 0
celeryev.bf41b634-d32f-46da-ad0c-1c60cf48310e 0
celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
default 1486
high_priority 0
log_parser 1443
log_parser_fail 6
log_parser_hp 1
pushlog 0
...done.
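To save re-running that listing by hand while the backlog drains, something like this keeps a rolling view (the interval is arbitrary):
sudo watch -n 60 'rabbitmqctl list_queues -p treeherder name messages'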
Reporter
Updated•10 years ago
Summary: Figure out why two log processor nodes stopped handling tasks → Figure out why the log processor nodes regularly stop handling tasks
Reporter
Comment 13•10 years ago
Going to call this fixed by bug 1123479 until proven otherwise :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE