Closed Bug 1125124 Opened 10 years ago Closed 10 years ago

Figure out why the log processor nodes regularly stop handling tasks

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1123479

People

(Reporter: emorley, Unassigned)

Details

This morning the sheriffs reported many jobs with unparsed logs. It turns out two of the prod log processors (processor2 and processor3) had stopped handling tasks - you can see the drop here, until I restarted them:
https://rpm.newrelic.com/accounts/677903/servers/4303155?tw[end]=1422021783&tw[start]=1421978583
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422021846&tw[start]=1421935446

One hadn't been taking jobs for ~7 hours, the other for ~10 hours. This resulted in a 40,000-job backlog for log parsing.

We have other bugs on file for (a) getting queue counts into New Relic, and thus alerts, and (b) improving log parser performance; but it would be great to figure out why the processors stop taking tasks like this.
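Until we have queue counts in New Relic (and thus alerts), a crude stopgap could be a cron job on the rabbitmq node along these lines - just a sketch, untested; the threshold and alert address are made up:

  #!/bin/bash
  # Hypothetical stopgap check (not deployed anywhere): warn if the log_parser
  # queue backs up. The threshold and alert address are placeholders.
  THRESHOLD=5000
  depth=$(sudo rabbitmqctl list_queues -p treeherder name messages \
            | awk '$1 == "log_parser" {print $2}')
  if [ "${depth:-0}" -gt "$THRESHOLD" ]; then
      echo "log_parser queue depth is $depth (threshold $THRESHOLD)" \
          | mail -s "Treeherder log_parser backlog" sheriffs@example.com
  fi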
I should add - there hadn't been a deployment for 4.5 days, so it wasn't due to bug 1079270.
Three hours after restarting the log processors, the queues are going down, but still:

  default         1852
  high_priority   0
  log_parser      24042
  log_parser_fail 0
  log_parser_hp   14109

I think we could get away with clearing the log_parser_hp queue, since as I understand it that's just on-demand log parse requests caused by people clicking around in the UI - all of which should duplicate the main queue - if my hunch about bug 1125088 is correct. However I can't clear the log_parser_hp queue: rabbitmqctl doesn't support purging queues, I don't have the password for the web admin page, and the only other CLI option (rabbitmqadmin) isn't installed as far as I can see.
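For reference, if rabbitmqadmin were installed (or we had credentials for the management plugin), purging that queue would look something like this - untested sketch; credentials and management port are placeholders:

  # Hypothetical: purge log_parser_hp with rabbitmqadmin (not currently installed)
  rabbitmqadmin --vhost=treeherder purge queue name=log_parser_hp

  # Or via the management HTTP API directly (credentials assumed; older
  # RabbitMQ releases use port 55672 instead of 15672):
  curl -i -u ADMIN_USER:ADMIN_PASSWORD -X DELETE \
      http://localhost:15672/api/queues/treeherder/log_parser_hp/contents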
treeherder-processor3 just became idle again, so I had to restart them all (I couldn't |sudo supervisorctl restart run_celery_worker_gevent| just on that node, since it prompted for a password when sudoing).
Slowly getting there:

  default         2168
  high_priority   0
  log_parser      21430
  log_parser_fail 11
  log_parser_hp   9148
  default         2490
  log_parser      21219
  log_parser_fail 17
  log_parser_hp   5850
One step forwards, two back:

  default         3015
  log_parser      31179
  log_parser_fail 1
  log_parser_hp   2092
We appear to have lost proc3 again:
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422053068&tw[start]=1422049468

  default         3569
  log_parser      33399
  log_parser_fail 53
  log_parser_hp   516

Restarting.
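For next time: rather than waiting for the New Relic graphs to show a node idle, we could probably ping the workers directly with celery's remote control. Untested sketch - the "-A treeherder" app name is a guess on my part, and it assumes the broker settings are importable from wherever this is run:

  # Hypothetical liveness check for the log_parser workers:
  celery -A treeherder inspect ping \
      -d celery@log_parser.treeherder-processor2.private.scl3.mozilla.com,celery@log_parser.treeherder-processor3.private.scl3.mozilla.com
  # Healthy workers reply with "pong"; a hung worker times out.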
Well on the plus side we're catching up:

  default         4955
  log_parser      14587
  log_parser_fail 4
  log_parser_hp   0

The only problem is that's just because we're not ingesting any completed jobs (bug 1125410), so once that's resolved, we'll have another backlog, yey \o/
  default         834
  log_parser      87
  log_parser_fail 3
  log_parser_hp   0
This happened again, so bumping the priority, since having no insight into the results makes sheriffing more complicated.
Severity: normal → critical
And again:

  buildapi 0
  celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celeryev.1bda9ad7-fcff-429b-b520-5dcf58333982 0
  celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
  celeryev.609cc011-de8a-4f17-9ae9-70d944116c1d 17110
  celeryev.86309a15-671d-4a24-bc6d-242d4afced72 3606
  celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
  celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
  celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
  celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
  celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
  default 1220
  high_priority 0
  log_parser 4525
  log_parser_fail 478
  log_parser_hp 464
  pushlog 0

treeherder-processor[23] have been idle for the last 60-90 mins. I've restarted them.

(In reply to Carsten Book [:Tomcat] from comment #10)
> This happened again, so bumping the priority, since having no insight into
> the results makes sheriffing more complicated.

Agree this is very high priority - we use the priority field rather than the severity field - and the priority field is at P1 already.
processor3 disconnected again, restarted. Worker log: http://pastebin.mozilla.org/8378673 (the restart started at 13:08:59).

Queues are coming down now:

  [emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
  Listing queues ...
  buildapi 0
  celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox 0
  celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox 0
  celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox 0
  celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc 0
  celeryev.6104655e-5688-4e29-a364-d9f59831dc80 0
  celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2 0
  celeryev.b6c24cfb-fe30-4067-b962-9fa9eee76633 0
  celeryev.bf41b634-d32f-46da-ad0c-1c60cf48310e 0
  celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299 0
  celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe 0
  celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc 0
  celeryev.ed36265a-b998-43b7-a0ae-80eed999710a 0
  default 1486
  high_priority 0
  log_parser 1443
  log_parser_fail 6
  log_parser_hp 1
  pushlog 0
  ...done.
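Aside: to make eyeballing these dumps quicker, the output can be filtered down to just the non-empty queues - something like the following (untested sketch):

  # Hypothetical helper: show only queues with a non-zero message count.
  sudo rabbitmqctl list_queues -p treeherder name messages \
      | awk 'NF == 2 && $2 + 0 > 0'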
Summary: Figure out why two log processor nodes stopped handling tasks → Figure out why the log processor nodes regularly stop handling tasks
Going to call this fixed by bug 1123479 until proven otherwise :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE