Closed Bug 574537 Opened 14 years ago Closed 14 years ago

production-master03 ran out of disk space without warning

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: chizu)

Details

Today, production-master03 ran out of disk space and crashed, closing the tree. At this time, all is back up and running, but this bug is to track the bustage and figure out how to prevent it happening again.

From #build, it looks like this machine went from "all ok" to "out of space" with no warning:

[joduinn@dm-peep01 Mozilla]$ cat \#build.log | grep "production-master03"
15:18 < nagios> [58] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=88%):
15:28 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 4232 MB (14% inode=88%):

...and something similar seems to have happened a few days ago:

02:46 < nagios> [01] production-master03.build:disk - /builds is CRITICAL: DISK CRITICAL - free space: /builds 0 MB (0% inode=85%):
03:06 < nagios> production-master03.build:disk - /builds is OK: DISK OK - free space: /builds 38 MB (0% inode=86%):

Normally, our VMs have nagios disk space alerts at three levels: OK, WARNING, CRITICAL. In this case, what happened to the WARNING alert?
The master had gone from OK to WARNING several days ago. We did some cleanup work then, but it never got back to the OK state.
fox2mike, what are the WARNING/CRITICAL thresholds set at for 'production-master03.build disk - /builds'?
The bulk of the space was taken up by buildbot logs. It was generating logs at a rate of a few megabytes a minute. I've patched buildbot to silence one major offender, so now we're getting a megabyte of log data every 5 minutes or so.
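Separate from the patch mentioned above (which isn't attached here), a belt-and-braces option would be to cap per-step log size in the master config. This is only a hedged sketch: the option names below are Buildbot master.cfg settings as I recall them for the 0.8.x series, so treat them as assumptions rather than a verified fix.

    # Sketch of a master.cfg fragment (assumed Buildbot 0.8.x option names),
    # capping how much of each step's log the master keeps on disk.
    c = BuildmasterConfig = {}

    # Truncate any single step log beyond ~2 MB, keeping the final chunk so
    # the tail of a failing step is still visible.
    c['logMaxSize'] = 2 * 1024 * 1024   # bytes kept from the start of a log
    c['logMaxTailSize'] = 32768         # bytes kept from the end when truncated

    # Compress completed logs larger than 4 KB to save space under /builds.
    c['logCompressionLimit'] = 4096

This wouldn't reduce the rate at which log data is generated, only how much of it ends up on /builds.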
Looking at nagios logs for the last month, I see:

06-22-2010 03:18:16  06-22-2010 03:28:20  0d 0h 10m 4s  SERVICE WARNING (HARD)  DISK WARNING - free space: /builds 4 MB (0% inode=86%)

...so it looks like there is some sort of nagios "low free space" warning set up. Still unclear why this did not happen this afternoon.

dmoore/Phong: how would nagios handle it if the disk usage suddenly jumped, so that the machine went from healthy, past warning, and straight to error? Would nagios report both the warning alert and the error alert at the same time? Or would nagios skip the warning and just report the more severe alert?
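One possible explanation, illustrated with a hedged config fragment below (the real service definition is quoted in a later comment): nagios only evaluates the state at each poll, so if /builds fills up within a single polling interval, the service is never observed inside the WARNING band and no WARNING notification gets sent.

    # Hypothetical fragment for illustration only -- the actual definition is
    # quoted later in this bug. With a 10-minute polling interval, a disk that
    # goes from "plenty free" past the warning threshold to 0 MB free between
    # two checks is only ever seen in the CRITICAL state, so WARNING is skipped.
    define service{
            service_description     disk - /builds
            normal_check_interval   10                            ; minutes between checks
            check_command           check_nrpe_disk!25!0!/builds  ; warn below 25 MB, crit at 0 MB
            }

With a warning band only 25 MB wide and logs growing by megabytes per minute, a check would have to land inside a very small window to ever report WARNING.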
(In reply to comment #1)
> The master had gone from OK to WARNING several days ago. We did some cleanup
> work then, but it never got back to the OK state.

Sorry, I don't follow. If I'm reading the nagios logs correctly, a few days ago we hit a WARNING at 06-22-2010 03:18:16 and then an OK at 06-22-2010 03:28:20. As best as I can tell, we then stayed OK until the CRITICAL failure today at 06-24-2010 15:18:12. Of course, I could be missing something - what makes you think we never got back to an OK state on the 22nd?
This is what I see in the nagios logs:

CRITICAL  06-22-2010 02:46:12  DISK CRITICAL - free space: /builds 0 MB (0% inode=85%)
OK        06-22-2010 03:06:12  DISK OK - free space: /builds 38 MB (0% inode=86%)
WARNING   06-22-2010 03:18:16  DISK WARNING - free space: /builds 4 MB (0% inode=86%)
CRITICAL  06-24-2010 15:18:12  DISK CRITICAL - free space: /builds 0 MB (0% inode=88%)
OK        06-24-2010 15:28:13  DISK OK - free space: /builds 4232 MB (14% inode=88%)

Rail noticed the CRITICAL alert on the 22nd and did some cleanup work, but we got back into a WARNING state almost immediately, and never got out of it.
Returning OK with only 38MB free is bogus, as is WARNING for 4MB. We just need to pick some more sensible values. Do 500MB and 1GB sound OK?
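For reference, a hedged sketch of how values like those could be expressed with the stock check_disk plugin, assuming the NRPE check ultimately calls check_disk (the exact wrapper on these hosts isn't shown in this bug) and that the natural reading is WARNING at 1GB free and CRITICAL at 500MB free. With plain integer thresholds, check_disk treats the values as minimum free space in MB:

    # Warn when /builds drops below 1000 MB free, go critical below 500 MB free.
    # (Plugin path and threshold ordering are assumptions for illustration.)
    ./check_disk -w 1000 -c 500 -p /builds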
I just went through the nagios configs with dmoore. Turns out that all RelEng masters have the same disk alert levels as the RelEng slaves:

define service{
        use                     generic-service
        host_name               production-master03.build
        service_description     disk - /builds
        notification_options    w,c,r
        normal_check_interval   10
        contact_groups          build
        check_command           check_nrpe_disk!25!0!/builds
        notification_period     24x7
        }

...which means WARNING at 25MB free and ERROR at 0MB free. This might be ok for slaves, but is *not* ok for any of our masters.

Please adjust the disk alert thresholds for all RelEng masters to be:

WARNING: 90% full
ERROR: 95% full

Pushing to ServerOps to fix the nagios configs. Also raising priority, because the missing warning cost us a suddenly crashed master, with dropped slaves, burning builds, and a surprise tree closure.
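A hedged sketch of what the adjusted definition might look like, assuming the check_nrpe_disk command simply forwards its warning/critical arguments to the disk check and accepts percentage-free thresholds (both assumptions; the change actually applied is recorded in a later comment):

    define service{
            use                     generic-service
            host_name               production-master03.build
            service_description     disk - /builds
            notification_options    w,c,r
            normal_check_interval   10
            contact_groups          build
            ; WARNING below 10% free (~90% full), CRITICAL below 5% free (~95% full)
            check_command           check_nrpe_disk!10%!5%!/builds
            notification_period     24x7
            }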
Assignee: joduinn → server-ops
Severity: normal → major
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → thardcastle
Here's the list of hosts where I changed the /builds thresholds from 25,0 to 10%,5%:

production-1.8-master.build
sm-try-master
staging-master.build
sm-staging-try-master
production-master.build
production-master01.build
production-master02.build
production-master03.build
preproduction-master.build
talos-master
test-master01.build
talos-master02.build
talos-staging-master02
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard