Closed Bug 1125124 Opened 10 years ago Closed 10 years ago

Figure out why the log processor nodes regularly stop handling tasks

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1123479

People

(Reporter: emorley, Unassigned)

Details

This morning the sheriffs reported many jobs with unparsed logs. It turns out two of the prod log processors (processor2 and processor3) had stopped handling tasks - you can see the drop here, until I restarted them:
https://rpm.newrelic.com/accounts/677903/servers/4303155?tw[end]=1422021783&tw[start]=1421978583
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422021846&tw[start]=1421935446

One hadn't been taking jobs for ~7 hours, the other for ~10 hours. This resulted in a 40,000-job backlog for log parsing.

We have other bugs on file for (a) getting queue counts into New Relic, and thus alerts, and (b) improving log parser performance; but it would be great to figure out why the processors stop taking tasks like this.
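Until we have queue counts in New Relic (and thus alerts), a crude stopgap could be a cron job on the rabbitmq node along these lines - just a sketch, untested; the threshold and alert address are made up:

  #!/bin/bash
  # Hypothetical stopgap check (not deployed anywhere): warn if the log_parser
  # queue backs up. The threshold and alert address are placeholders.
  THRESHOLD=5000
  depth=$(sudo rabbitmqctl list_queues -p treeherder name messages \
            | awk '$1 == "log_parser" {print $2}')
  if [ "${depth:-0}" -gt "$THRESHOLD" ]; then
      echo "log_parser queue depth is $depth (threshold $THRESHOLD)" \
          | mail -s "Treeherder log_parser backlog" sheriffs@example.com
  fi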
I should add - there hadn't been a deployment for 4.5 days, so it wasn't due to bug 1079270.
Three hours after restarting the log processors, the queues are going down, but still:

  default         1852
  high_priority   0
  log_parser      24042
  log_parser_fail 0
  log_parser_hp   14109

I think we could get away with clearing the log_parser_hp queue, since as I understand it that's just on-demand log parse requests caused by people clicking around in the UI - all of which should duplicate the main queue - if my hunch about bug 1125088 is correct. However I can't clear the log_parser_hp queue: rabbitmqctl doesn't support purging queues, I don't have the password for the web admin page, and the only other CLI option (rabbitmqadmin) isn't installed as far as I can see.
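For reference, if rabbitmqadmin were installed (or we had credentials for the management plugin), purging that queue would look something like this - untested sketch; credentials and management port are placeholders:

  # Hypothetical: purge log_parser_hp with rabbitmqadmin (not currently installed)
  rabbitmqadmin --vhost=treeherder purge queue name=log_parser_hp

  # Or via the management HTTP API directly (credentials assumed; older
  # RabbitMQ releases use port 55672 instead of 15672):
  curl -i -u ADMIN_USER:ADMIN_PASSWORD -X DELETE \
      http://localhost:15672/api/queues/treeherder/log_parser_hp/contents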
treeherder-processor3 just became idle again, so I had to restart them all (I couldn't |sudo supervisorctl restart run_celery_worker_gevent| just on that node, since it prompted for a password when sudoing).
Slowly getting there:

  default         2168
  high_priority   0
  log_parser      21430
  log_parser_fail 11
  log_parser_hp   9148
  default         2490
  log_parser      21219
  log_parser_fail 17
  log_parser_hp   5850
One step forwards, two back:

  default         3015
  log_parser      31179
  log_parser_fail 1
  log_parser_hp   2092
We appear to have lost proc3 again:
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422053068&tw[start]=1422049468

  default         3569
  log_parser      33399
  log_parser_fail 53
  log_parser_hp   516

Restarting.
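For next time: rather than waiting for the New Relic graphs to show a node idle, we could probably ping the workers directly with celery's remote control. Untested sketch - the "-A treeherder" app name is a guess on my part, and it assumes the broker settings are importable from wherever this is run:

  # Hypothetical liveness check for the log_parser workers:
  celery -A treeherder inspect ping \
      -d celery@log_parser.treeherder-processor2.private.scl3.mozilla.com,celery@log_parser.treeherder-processor3.private.scl3.mozilla.com
  # Healthy workers reply with "pong"; a hung worker times out.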
Well on the plus side we're catching up:

  default         4955
  log_parser      14587
  log_parser_fail 4
  log_parser_hp   0

The only problem is that's just because we're not ingesting any completed jobs (bug 1125410), so once that's resolved, we'll have another backlog, yey \o/
  default         834
  log_parser      87
  log_parser_fail 3
  log_parser_hp   0
This happened again, so bumping the priority, since having no insight into the results makes sheriffing more complicated.
Severity: normal → critical
And again:

  buildapi 0
  celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celeryev.1bda9ad7-fcff-429b-b520-5dcf58333982 0
  celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
  celeryev.609cc011-de8a-4f17-9ae9-70d944116c1d 17110
  celeryev.86309a15-671d-4a24-bc6d-242d4afced72 3606
  celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
  celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
  celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
  celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
  celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
  default 1220
  high_priority 0
  log_parser 4525
  log_parser_fail 478
  log_parser_hp 464
  pushlog 0

treeherder-processor[23] have been idle for the last 60-90 mins. I've restarted them.

(In reply to Carsten Book [:Tomcat] from comment #10)
> This happened again, so bumping the priority, since having no insight into
> the results makes sheriffing more complicated.

Agree this is very high priority - we use the priority field rather than the severity field - and the priority field is at P1 already.
processor3 disconnected again, restarted. Worker log: http://pastebin.mozilla.org/8378673 (the restart started at 13:08:59).

Queues are coming down now:

  [emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
  Listing queues ...
  buildapi 0
  celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
  celeryev.6104655e-5688-4e29-a364-d9f59831dc80 0
  celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
  celeryev.b6c24cfb-fe30-4067-b962-9fa9eee76633 0
  celeryev.bf41b634-d32f-46da-ad0c-1c60cf48310e 0
  celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
  celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
  celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
  celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
  default 1486
  high_priority 0
  log_parser 1443
  log_parser_fail 6
  log_parser_hp 1
  pushlog 0
  ...done.
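Aside: to make eyeballing these dumps quicker, the output can be filtered down to just the non-empty queues - something like the following (untested sketch):

  # Hypothetical helper: show only queues with a non-zero message count.
  sudo rabbitmqctl list_queues -p treeherder name messages \
      | awk 'NF == 2 && $2 + 0 > 0'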
Summary: Figure out why two log processor nodes stopped handling tasks → Figure out why the log processor nodes regularly stop handling tasks
Going to call this fixed by bug 1123479 until proven otherwise :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE