Closed Bug 1129731 Opened 10 years ago Closed 10 years ago

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vinh, Assigned: pir)

References

Details

Getting alerted in Nagios for:

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

I'm unable to remedy the swap usage myself. Can someone help?
Component: Treeherder → Treeherder: Infrastructure
QA Contact: laura
Group: mozilla-employee-confidential
(Unhiding since this doesn't contain anything confidential) Mauro/Cameron, do we know why this has started occurring? More load from TaskCluster?
Group: mozilla-employee-confidential
OS: Other → All
Priority: -- → P1
Hardware: x86 → All
A bit of this as well:

Thu 23:50:09 PST [5209] treeherder-rabbitmq1.private.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (131210) messages_ready CRITICAL (131161)
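(Aside, not part of the original report: when the Rabbit Unread Messages check fires like this, the backlog can be broken down per queue on the broker host. A generic rabbitmqctl invocation, shown only as a sketch of the diagnostic step, would be:)

# Run on the RabbitMQ host; lists each queue with its total, ready and unacked message counts.
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged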
fubar, any idea why the celery process memory usage has grown so much in the last week? Did we update anything? I don't seem to recall a prod push in that window: https://rpm.newrelic.com/accounts/677903/servers/5575925/processes?tw[end]=1423232521&tw[start]=1420554121#id=721569760
Flags: needinfo?(klibby)
See also bugs 1094814 and 1113115. Probably just a memory leak from general usage; we've seen it on the other celery nodes, as well as in celery tasks on other services.
Depends on: 1113115
Flags: needinfo?(klibby)
I don't believe it's a memory leak in this case; we deployed to prod twice in that range, and the deploy script would have restarted the processes.
<nagios-scl3:#sysadmins> Sun 02:54:09 PST [5804] treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 10% free (202 MB out of 2047 MB) (http://m.mozilla.org/Swap)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20968 treeherd  20   0  846m 378m 3448 S  0.0  9.9  25:03.68 python
20970 treeherd  20   0  831m 365m 3472 S  0.3  9.5  25:12.17 python
20969 treeherd  20   0  801m 336m 3544 S  0.3  8.8  23:44.89 python
25661 treeherd  20   0  582m 134m 2416 S  0.0  3.5  47:39.14 celery

495  20968  5.6  9.8 866792 387512 ?  Sl  03:37  25:03 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h

├─supervisord,5479 /usr/bin/supervisord
│ ├─celery,20771 /usr/bin/celery -A treeherder beat -f /var/log/celery/celerybeat.log
│ ├─newrelic_plugin,1747 /usr/bin/newrelic_plugin_agent -c /etc/newrelic/agent.yml -f
│ ├─python,17210 /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs ...
│ │ ├─python,20968 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │ │ └─{python},20971
│ │ ├─python,20969 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │ │ └─{python},20975
│ │ └─python,20970 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │   └─{python},20973
│ └─python,19437 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log ...
│   └─python,20949 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 ...
│     └─{python},20950

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# supervisorctl
newrelic-plugin-agent            RUNNING   pid 25178, uptime 0:02:31
run_celery_worker                RUNNING   pid 25195, uptime 0:02:26
run_celery_worker_hp             RUNNING   pid 25194, uptime 0:02:26
run_celerybeat                   RUNNING   pid 25179, uptime 0:02:31
supervisor> elp
*** Unknown syntax: elp
supervisor> help

default commands (type help <topic>):
=====================================
add    clear  fg        open  quit    remove  restart   start   stop  update
avail  exit   maintail  pid   reload  reread  shutdown  status  tail  version

supervisor> stop run_celery_worker
run_celery_worker: stopped
supervisor> stop run_celery_worker_hp
run_celery_worker_hp: stopped
supervisor> stop run_celerybeat
run_celerybeat: stopped
supervisor> stop newrelic-plugin-agent
newrelic-plugin-agent: stopped

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | wc -l
40

Some of these have been around for longer than the others, haven't been restarted in the last week:

495  13831  0.1  1.5 568428  60012 ?  Sl  Jan30  18:29 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495  13832  0.1  1.9 562700  75608 ?  Sl  Jan30  18:58 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495  13833  0.1  0.9 556728  35504 ?  Sl  Jan30  19:39 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495  15701  0.2  2.3 518292  90852 ?  Sl  Jan30  38:54 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495  15713  0.2  1.7 601252  68512 ?  Sl  Jan30  36:57 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495  15876  0.2  1.3 508244  53924 ?  Sl  Jan30  29:14 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill
kill 25460: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill -9
kill 25482: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}'
25489
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
0

supervisor> start newrelic-plugin-agent
newrelic-plugin-agent: started
supervisor> start run_celery_worker
run_celery_worker: started
supervisor> start run_celery_worker_hp
run_celery_worker_hp: started
supervisor> start run_celerybeat
run_celerybeat: started

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
7

<nagios-scl3:#sysadmins> Sun 03:14:09 PST [5805] treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is OK: SWAP OK - 99% free (2015 MB out of 2047 MB) (http://m.mozilla.org/Swap)
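(For illustration only, not part of the transcript above: a rough sketch of how the orphaned workers could be spotted automatically, i.e. celery processes that are no longer descendants of the supervisord-managed parents. The process and queue names come from this host; the script itself is an assumption, not the remediation actually used.)

#!/bin/bash
# Hedged sketch: flag celery worker processes that are not children or
# grandchildren of supervisord, i.e. likely orphans left over from an
# earlier restart (like the Jan30 processes listed above).
SUPERVISOR_PID=$(pgrep -o -x supervisord)

for pid in $(pgrep -f 'celery -A treeherder worker'); do
    ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    gppid=$(ps -o ppid= -p "$ppid" | tr -d ' ')
    if [ "$ppid" != "$SUPERVISOR_PID" ] && [ "$gppid" != "$SUPERVISOR_PID" ]; then
        echo "possible orphan:"
        ps -o pid= -o lstart= -o args= -p "$pid"
    fi
done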
Peter, thanks for spotting that - looks like we're getting zombie celery processes. I wonder if we could make the deploy script more aggressive about cleaning up processes once it has tried a graceful stop? The running instance count here makes it pretty easy to spot retrospectively: https://rpm.newrelic.com/accounts/677903/servers/5575925/processes#id=721569760

This definitely wasn't anything to do with the leaks we've seen elsewhere. Anyway, I'll file another bug for (a) a one-off check of all nodes for any other zombies, and (b) a longer-term solution for preventing them from being created in the first place, or else cleaning them up automatically.
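(Again just a sketch of the suggestion above, not the actual Treeherder deploy script: stop the workers via supervisord, allow a bounded grace period, then SIGKILL anything still matching. The 30-second grace period is an assumed value; the supervisord program names are the ones shown in the transcript.)

#!/bin/bash
# Hedged sketch of a "graceful stop, then force-kill" cleanup step for a deploy.
set -e

supervisorctl stop run_celery_worker run_celery_worker_hp run_celerybeat

# Assumed grace period: give the workers up to 30 seconds to exit cleanly.
for _ in $(seq 1 30); do
    pgrep -f 'celery -A treeherder' > /dev/null || break
    sleep 1
done

# Force-kill any survivors so they can't linger and push the box into swap.
pgrep -f 'celery -A treeherder' | xargs -r kill -9

supervisorctl start run_celerybeat run_celery_worker run_celery_worker_hp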
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → pradcliffe+bugzilla
Depends on: 1131059