Closed Bug 574537 Opened 14 years ago Closed 14 years ago

production-master03 ran out of disk space without warning

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: chizu)

Details

Today, production-master03 ran out of disk space and crashed, closing the tree. At this time, all is back up and running, but this bug is to track the bustage and figure out how to prevent it happening again.

From #build, it looks like this machine went from "all ok" to "out of space" with no warning:

[joduinn@dm-peep01 Mozilla]$ cat \#build.log | grep "production-master03"
15:18 < nagios> [58] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=88%):
15:28 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 4232 MB (14% inode=88%):

...and something similar seems to have happened a few days ago:

02:46 < nagios> [01] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=85%):
03:06 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 38 MB (0% inode=86%):

Normally, our VMs have nagios disk space alerts at three levels: OK, WARNING, CRITICAL. In this case, what happened to the WARNING alert?
The master had gone from OK to WARNING several days ago. We did some cleanup work then, but it never got back to the OK state.
fox2mike, what are the WARNING/CRITICAL thresholds set at for 'production-master03.build disk - /builds'?
The bulk of the space was taken up by buildbot logs. It was generating logs at a rate of a few megabytes a minute. I've patched buildbot to silence one major offender, so now we're getting a megabyte of log data every 5 minutes or so.
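Separate from the patch mentioned above (which isn't attached here), a belt-and-braces option would be to cap per-step log size in the master config. This is only a hedged sketch: the option names below are Buildbot master.cfg settings as I recall them for the 0.8.x series, so treat them as assumptions rather than a verified fix.

    # Sketch of a master.cfg fragment (assumed Buildbot 0.8.x option names),
    # capping how much of each step's log the master keeps on disk.
    c = BuildmasterConfig = {}

    # Truncate any single step log beyond ~2 MB, keeping the final chunk so
    # the tail of a failing step is still visible.
    c['logMaxSize'] = 2 * 1024 * 1024   # bytes kept from the start of a log
    c['logMaxTailSize'] = 32768         # bytes kept from the end when truncated

    # Compress completed logs larger than 4 KB to save space under /builds.
    c['logCompressionLimit'] = 4096

This wouldn't reduce the rate at which log data is generated, only how much of it ends up on /builds.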
Looking at nagios logs for the last month, I see:

06-22-2010 03:18:16  06-22-2010 03:28:20  0d 0h 10m 4s  SERVICE WARNING (HARD)  DISK WARNING - free space: /builds 4 MB (0% inode=86%)

...so it looks like there is some sort of nagios "low free space" warning set up. Still unclear why this did not happen this afternoon.

dmoore/Phong: how would nagios handle it if the disk usage suddenly jumped, so that the machine went from healthy, past warning, and straight to error? Would nagios report both the warning alert and the error alert at the same time? Or would nagios skip the warning and just report the more severe alert?
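One possible explanation, illustrated with a hedged config fragment below (the real service definition is quoted in a later comment): nagios only evaluates the state at each poll, so if /builds fills up within a single polling interval, the service is never observed inside the WARNING band and no WARNING notification gets sent.

    # Hypothetical fragment for illustration only -- the actual definition is
    # quoted later in this bug. With a 10-minute polling interval, a disk that
    # goes from "plenty free" past the warning threshold to 0 MB free between
    # two checks is only ever seen in the CRITICAL state, so WARNING is skipped.
    define service{
            service_description     disk - /builds
            normal_check_interval   10                            ; minutes between checks
            check_command           check_nrpe_disk!25!0!/builds  ; warn below 25 MB, crit at 0 MB
            }

With a warning band only 25 MB wide and logs growing by megabytes per minute, a check would have to land inside a very small window to ever report WARNING.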
(In reply to comment #1)
> The master had gone from OK to WARNING several days ago. We did some cleanup
> work then, but it never got back to the OK state.

Sorry, I don't follow. If I'm reading the nagios logs correctly, a few days ago we hit a WARNING at 06-22-2010 03:18:16 and then an OK at 06-22-2010 03:28:20. As best as I can tell, we then stayed OK until the CRITICAL failure today at 06-24-2010 15:18:12. Of course, I could be missing something - what makes you think we never got back to an OK state on the 22nd?
This is what I see in the nagios logs:

CRITICAL  06-22-2010 02:46:12  DISK CRITICAL - free space: /builds 0 MB (0% inode=85%)
OK        06-22-2010 03:06:12  DISK OK - free space: /builds 38 MB (0% inode=86%)
WARNING   06-22-2010 03:18:16  DISK WARNING - free space: /builds 4 MB (0% inode=86%)
CRITICAL  06-24-2010 15:18:12  DISK CRITICAL - free space: /builds 0 MB (0% inode=88%)
OK        06-24-2010 15:28:13  DISK OK - free space: /builds 4232 MB (14% inode=88%)

Rail noticed the CRITICAL alert on the 22nd and did some cleanup work, but we got back into a WARNING state almost immediately, and never got out of it.
Returning OK with only 38MB free is bogus, as is WARNING for 4MB. We just need to pick some more sensible values. Do 500MB and 1GB sound OK?
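For reference, a hedged sketch of how values like those could be expressed with the stock check_disk plugin, assuming the NRPE check ultimately calls check_disk (the exact wrapper on these hosts isn't shown in this bug) and that the natural reading is WARNING at 1GB free and CRITICAL at 500MB free. With plain integer thresholds, check_disk treats the values as minimum free space in MB:

    # Warn when /builds drops below 1000 MB free, go critical below 500 MB free.
    # (Plugin path and threshold ordering are assumptions for illustration.)
    ./check_disk -w 1000 -c 500 -p /builds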
I just went through the nagios configs with dmoore. Turns out that all RelEng masters have the same disk alert levels as the RelEng slaves:

define service{
        use                     generic-service
        host_name               production-master03.build
        service_description     disk - /builds
        notification_options    w,c,r
        normal_check_interval   10
        contact_groups          build
        check_command           check_nrpe_disk!25!0!/builds
        notification_period     24x7
        }

...which means WARNING at 25MB free and ERROR at 0MB free. This might be ok for slaves, but is *not* ok for any of our masters.

Please adjust the disk alert thresholds for all RelEng masters to be:

WARNING: 90% full
ERROR: 95% full

Pushing to ServerOps to fix the nagios configs. Also raising priority, because the missing warning cost us a suddenly crashed master, with dropped slaves, burning builds, and a surprise tree closure.
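A hedged sketch of what the adjusted definition might look like, assuming the check_nrpe_disk command simply forwards its warning/critical arguments to the disk check and accepts percentage-free thresholds (both assumptions; the change actually applied is recorded in a later comment):

    define service{
            use                     generic-service
            host_name               production-master03.build
            service_description     disk - /builds
            notification_options    w,c,r
            normal_check_interval   10
            contact_groups          build
            ; WARNING below 10% free (~90% full), CRITICAL below 5% free (~95% full)
            check_command           check_nrpe_disk!10%!5%!/builds
            notification_period     24x7
            }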
Assignee: joduinn → server-ops
Severity: normal → major
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → thardcastle
Here's the list of hosts where I changed the /builds thresholds from 25,0 to 10%,5%:

production-1.8-master.build
sm-try-master
staging-master.build
sm-staging-try-master
production-master.build
production-master01.build
production-master02.build
production-master03.build
preproduction-master.build
talos-master
test-master01.build
talos-master02.build
talos-staging-master02
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard