Bug 574537 (Closed)
production-master03 ran out of diskspace without warning
Opened 14 years ago • Closed 14 years ago
Categories: mozilla.org Graveyard :: Server Operations, task
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: joduinn; Assigned: chizu
Description
Today, production-master03 ran out of disk space and crashed, closing the tree. At this time, all is back up and running, but this bug is to track the bustage and figure out how to prevent it from happening again.
From #build, it looks like this machine went from "all ok" to "out of space" with no warning.
[joduinn@dm-peep01 Mozilla]$ cat \#build.log | grep "production-master03"
15:18 < nagios> [58] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=88%):
15:28 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 4232 MB (14% inode=88%):
...and something similar seems to have happened a few days ago:
02:46 < nagios> [01] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=85%):
03:06 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 38 MB (0% inode=86%):
Normally, our VMs have nagios disk space alerts for OK, WARNING, and CRITICAL... in this case, what happened to the WARNING alert?
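For context, a hedged illustration of how a stock nagios disk check separates the two alert levels; the numbers below are made-up example values, not our actual thresholds:

# Illustrative only: exits WARNING when /builds has less than 512 MB free,
# CRITICAL below 128 MB. The plugin also accepts percentages, e.g. -w 10% -c 5%.
check_disk -w 512 -c 128 -p /builds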
Comment 1•14 years ago
The master had gone from OK to WARNING several days ago. We did some cleanup work then, but it never got back to the OK state.
Comment 2•14 years ago
fox2mike, what are the WARNING/CRITICAL thresholds set to for 'production-master03.build disk - /builds'?
Comment 3•14 years ago
The bulk of the space was taken up by buildbot logs. It was generating logs at a rate of a few megabytes a minute.
I've patched buildbot to remove one major offender in the logs, so now we're getting about a megabyte of log data every 5 minutes.
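As a belt-and-braces measure on top of that patch (not something we've deployed; the path is a guess at where this master writes its twistd.log), a logrotate stanza along these lines would keep the log growth bounded:

# Hypothetical path -- adjust to wherever the buildbot master writes twistd.log.
/builds/buildbot/*/twistd.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate because twistd keeps the log file open
    copytruncate
}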
Comment 4•14 years ago (Reporter)
Looking at nagios logs for the last month, I see:
06-22-2010 03:18:16  06-22-2010 03:28:20  0d 0h 10m 4s  SERVICE WARNING (HARD)  DISK WARNING - free space: /builds 4 MB (0% inode=86%)
...so it looks like there is some sort of nagios "low free space" warning setup. Still unclear why this did not happen this afternoon.
dmoore/Phong: how would nagios handle it if disk usage suddenly jumped, so that the machine went from healthy, past warning, and straight to error? Would nagios report both the warning alert and the error alert at the same time? Or would nagios skip the warning and just report the more severe alert?
Comment 5•14 years ago (Reporter)
(In reply to comment #1)
> The master had gone from OK to WARNING several days ago. We did some cleanup
> work then, but it never got back to the OK state.
Sorry, I don't follow.
If I'm reading the nagios logs correctly, a few days ago, we hit a WARNING at 06-22-2010 03:18:16 and then an OK at 06-22-2010 03:28:20. As best as I can tell, we then stayed OK until the CRITICAL failure today at 06-24-2010 15:18:12.
Of course, I could be missing something - what makes you think we never got back to an OK state on the 22nd?
Comment 6•14 years ago
This is what I see in the nagios logs:
CRITICAL 06-22-2010 02:46:12 DISK CRITICAL - free space: /builds 0 MB (0% inode=85%)
OK 06-22-2010 03:06:12 DISK OK - free space: /builds 38 MB (0% inode=86%)
WARNING 06-22-2010 03:18:16 DISK WARNING - free space: /builds 4 MB (0% inode=86%)
CRITICAL 06-24-2010 15:18:12 DISK CRITICAL - free space: /builds 0 MB (0% inode=88%)
OK 06-24-2010 15:28:13 DISK OK - free space: /builds 4232 MB (14% inode=88%)
Rail noticed the CRITICAL alert on the 22nd and did some cleanup work, but we got back into a WARNING state almost immediately, and never got out of it.
Comment 7•14 years ago
Returning OK with only 38MB free is bogus, as is WARNING for 4MB. We just need to pick some more sensible values. Do 500MB & 1GB sound OK?
Comment 8•14 years ago (Reporter)
I just went through the nagios configs with dmoore. Turns out that all RelEng masters have the same disk alert levels as the RelEng slaves:
define service{
        use                     generic-service
        host_name               production-master03.build
        service_description     disk - /builds
        notification_options    w,c,r
        normal_check_interval   10
        contact_groups          build
        check_command           check_nrpe_disk!25!0!/builds
        notification_period     24x7
        }
...which means WARNING at 25MB free, and ERROR at 0MB free. This might be ok for slaves, but is *not* ok for any of our masters.
Please adjust the disk alert thresholds for all RelEng masters to be:
WARNING: 90% full
ERROR: 95% full
Pushing to ServerOps to fix the nagios configs. Also, raising priority, because the incorrect warning threshold cost us a suddenly crashed master, dropped slaves, burning builds, and a surprise tree closure.
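For concreteness, a sketch of what the adjusted service definition might look like if the same NRPE command simply switches to percentage thresholds; the exact argument format depends on how check_nrpe_disk is defined on the nagios side, so treat these values as illustrative rather than the final config:

define service{
        use                     generic-service
        host_name               production-master03.build
        service_description     disk - /builds
        notification_options    w,c,r
        normal_check_interval   10
        contact_groups          build
        # assumes the underlying check_disk takes percentage thresholds,
        # i.e. WARNING below 10% free and CRITICAL below 5% free
        check_command           check_nrpe_disk!10%!5%!/builds
        notification_period     24x7
        }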
Assignee: joduinn → server-ops
Severity: normal → major
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Here's the list of hosts where I changed the /builds thresholds to 10%,5% from 25,0 (a quick spot check follows the list):
production-1.8-master.build
sm-try-master
staging-master.build
sm-staging-try-master
production-master.build
production-master01.build
production-master02.build
production-master03.build
preproduction-master.build
talos-master
test-master01.build
talos-master02.build
talos-staging-master02
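For completeness, a quick manual spot check on any of these masters (independent of nagios) to compare current /builds usage against the new 10%/5% free-space thresholds:

# Run on the master itself; the Avail and Use% columns show headroom on /builds.
df -h /builds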
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated 10 years ago
Product: mozilla.org → mozilla.org Graveyard