Closed Bug 839375 Opened 12 years ago Closed 12 years ago

foopy24 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Callek, Unassigned)

References

Details

So, foopy24 currently has on it:

bash-3.2$ ls -d tegra-*
tegra-191 tegra-201 tegra-277 tegra-279 tegra-281 tegra-283 tegra-285
tegra-192 tegra-276 tegra-278 tegra-280 tegra-282 tegra-284

and it was having issues earlier (timeouts in steps that do heavy disk I/O on the foopy: rm -rf ...; unzip -oq ...). For fear of a larger issue, I stopped clientproxy on all devices on this foopy after the poke from philor.

I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`) right now, because there were a LOT of old copies of these files around here, but disk usage is actually quite low:

bash-3.2$ df -h
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk0s2   465Gi   42Gi  423Gi    10%    /
devfs          183Ki  183Ki    0Bi   100%    /dev
map -hosts       0Bi    0Bi    0Bi   100%    /net
map auto_home    0Bi    0Bi    0Bi   100%    /home

Also, ganglia doesn't show any obvious problem pattern:
http://ganglia3.build.mtv1.mozilla.com/ganglia/?r=4hr&cs=&ce=&c=RelEngMTV1&h=foopy24.build.mtv1.mozilla.com&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS

(I'll note, however, that ganglia is not showing wio stats at all on this foopy, which might just be a factor of it being a mac foopy, or something in the config.)

Rebooting this foopy in moments, before I bring all the devices back up. Having seen no obvious problems, this bug is to help track in case of future issues.

================
[Times ET]
[21:38:13] philor huh, what would produce a sudden rash of timeouts in tegra 'rm -rf build' steps?
[21:43:54] Callek|Buildduty philor: tegra rm -rf build !? odd
[21:47:14] philor Callek|Buildduty: tegra-191, tegra-279, tegra-201, tegra-278, tegra-282,
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553407&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553415&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound
[21:47:31] philor is that maybe all one foopy?
[21:48:09] Callek|Buildduty about to find out
[21:48:10] Callek|Buildduty
[21:48:58] Callek|Buildduty philor: yes, *ALL* one foopy
[21:49:13] Callek|Buildduty philor: sounds like a good case of "stop them all then investigate"
[21:49:14] Callek|Buildduty on it
[21:49:16] =-= philor is now known as philor|away
[21:49:59] Callek|Buildduty philor|away: I'm killing all on that one foopy
[23:05:52] Callek|Buildduty philor: sooo, I'm not seeing the rm -rf failure -- I do see however in all those logs timeouts when unzipping tests
[23:07:54] philor Callek|Buildduty: ah, five failures that all appeared at once, the last two were unzipping, the first three were rm -rf, and I didn't notice the last two weren't the same
[23:08:28] Callek|Buildduty ahh ok
[23:08:29] philor
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553410&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553411&tree=Mozilla-Inbound
(In reply to Justin Wood (:Callek) from comment #0)
> I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`)
> right now, because there was a LOT of these old copies of files around here,
> but disk capacity is awfully low usage:

Ok, having thought about this, it is likely the problem. Here's why:

Foopies have a high demand on disk I/O as it is; the actual jobs need a lot of reads and writes.

twistd will enumerate and rename *every* twistd.log.* file when it needs to rotate (twistd.log.2 --> twistd.log.3, twistd.log.1 --> twistd.log.2, and so on), and the set of numbered copies grows indefinitely. [c.f. http://mxr.mozilla.org/build/source/twisted/twisted/python/logfile.py#177]

clientproxy also uses log rotation, with similar logic via RotatingFileHandler: http://docs.python.org/release/2.4.4/lib/node348.html

Combined, this can yield horrid I/O throughput if the conditions line up right. I'll be going through the foopies tomorrow to clean up old logs, due to this issue.
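To make the cost concrete, here is a minimal sketch (not the actual twisted code, just an illustration of the rename cascade described above): every rotation renames every existing numbered copy, so each rotation costs N+1 renames plus a directory scan when N old copies have accumulated -- which is why months of un-pruned logs turn a cheap operation into an I/O storm.

```python
import os
import tempfile

def rotate(path):
    """Unbounded twisted-style rotation sketch: shift every numbered
    copy up by one, then move the live log to <path>.1."""
    d = os.path.dirname(path) or "."
    base = os.path.basename(path)
    # Scan the directory for existing numbered copies (twistd.log.1, ...).
    suffixes = []
    for name in os.listdir(d):
        if name.startswith(base + "."):
            tail = name[len(base) + 1:]
            if tail.isdigit():
                suffixes.append(int(tail))
    # Rename highest-numbered first so nothing gets overwritten.
    for n in sorted(suffixes, reverse=True):
        os.rename("%s.%d" % (path, n), "%s.%d" % (path, n + 1))
    os.rename(path, path + ".1")

# Demo: with three old copies on disk, one rotation does four renames.
tmp = tempfile.mkdtemp()
log = os.path.join(tmp, "twistd.log")
for name in ("twistd.log", "twistd.log.1", "twistd.log.2", "twistd.log.3"):
    open(os.path.join(tmp, name), "w").close()
rotate(log)
print(sorted(os.listdir(tmp)))
# -> ['twistd.log.1', 'twistd.log.2', 'twistd.log.3', 'twistd.log.4']
```

By contrast, Python's logging.handlers.RotatingFileHandler caps the cascade when constructed with a backupCount, deleting the oldest copy instead of renaming forever -- which is effectively what the manual log cleanup restores here.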
Summary: foopy24 problem tracking → Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps.
Hrm, maybe not... philor just reported more issues here. Re-stopping all devices on this foopy, and I will enroll it in diagnostics (a stuck rm is not a good sign, imo) -- also, the whole thing is acting VERY slow/problematic as I ssh in right now.

Processes: 97 total, 3 running, 10 stuck, 84 sleeping, 281 threads          22:38:35
Load Avg: 0.15, 0.15, 0.16
CPU usage: 0.72% user, 2.42% sys, 96.85% idle
SharedLibs: 12M resident, 4376K data, 0B linkedit.
MemRegions: 10162 total, 648M resident, 19M private, 192M shared.
PhysMem: 1303M wired, 606M active, 425M inactive, 2333M used, 5856M free.
VM: 231G vsize, 1148M framework vsize, 17489(0) pageins, 0(0) pageouts.
Networks: packets: 2098605/2001M in, 880390/337M out.
Disks: 293904/3555M read, 736228/14G written.

PID  COMMAND   %CPU TIME     #TH #WQ #POR #MRE RPRVT RSHRD  RSIZE VPRVT VSIZE PGRP PPID STATE    UID
2909 rm        0.5  00:00.01 1   0   17   22+  104K- 224K+  380K+ 17M-  2378M 646  647  stuck    501
2908 rm        0.7  00:00.01 1   0   17   21   108K- 224K+  380K+ 17M-  2378M 667  668  stuck    501
2907 rm        0.6  00:00.01 1   0   17   21   112K+ 220K-  380K+ 17M+  2378M 634  635  stuck    501
2894 top       2.9  00:00.34 1/1 0   32   38   1040K 216K   1752K 19M   2379M 2894 2892 running  0
2892 bash      0.0  00:00.01 1   0   21   23   272K  840K   960K  17M   2378M 2892 2891 sleeping 501
2891 sshd      0.0  00:00.00 1   0   13   86   252K  1628K  816K  1504K 2413M 2889 2889 sleeping 501
2889 sshd      0.0  00:00.12 2   1   37   81   480K  1628K  2956K 712K  2412M 2889 1    sleeping 0
2887 unzip     1.0  00:00.58 1   0   18   23   228K  328K   588K  17M   2378M 494  495  stuck    501
2886 rm        0.2  00:00.43 1   0   17   21   128K  220K   396K  17M   2378M 488  489  stuck    501
2884 wget      0.2  00:00.48 2   1   40   67   652K  480K   1708K 40M   2404M 360  361  stuck    501
2781 Python    0.5  00:10.90 2   1   33   209  21M   1920K  42M   104M  2478M 548  549  stuck    501
2686 unzip     0.4  00:04.96 1   0   17   23   228K  324K   576K  17M   2378M 362  363  stuck    501
2557 distnoted 0.0  00:00.01 2   1   41   54   548K  240K   1184K 42M   2411M 2557 2540 sleeping 501
2540 launchd   0.0  00:00.09 2   0   52   48   412K  412K   844K  48M   2409M 2540 1    sleeping 501
1023 taskgated 0.0  00:00.36 1   0   30   43   720K  364K   2080K 46M   2406M 1023 1    sleeping 0
668  Python    0.0  00:02.48 2   0   11   181  18M   3568K- 13M   52M   2425M 667  1    sleeping 501
647  Python    0.0  00:02.58 2   0   11   180  18M   3568K- 13M   51M   2424M 646  1    sleeping 501
635  Python    0.0  00:02.61 2   0   11   181  18M   3568K  13M   52M   2425M 634  1    sleeping 501
604  Python    0.0  00:00.59 1   0   9    135  1288K 10M    2812K 2676K 2390M 601  602  sleeping 501
602  Python    0.0  00:00.19 2   0   10   143  1724K 10M    3520K 30M   2417M 601  1    sleeping 501
566  Python    0.0  00:02.47 2   0   11   181  16M   3568K  13M   52M   2425M 565  1    sleeping 501
551  Python    0.0  00:02.54 2   0   11   181  16M   3568K  13M   52M   2425M 550  1    sleeping 501
549  Python    0.0  00:02.68 2   0   11   181  17M   3568K  13M   52M   2425M 548  1    sleeping 501
Depends on: 839407
Blocks: tegra-191
Blocks: tegra-192
Blocks: tegra-201
Blocks: tegra-276
Blocks: tegra-277
Blocks: tegra-278
Blocks: tegra-279
Blocks: tegra-280
Blocks: tegra-281
Blocks: tegra-282
Blocks: tegra-283
Blocks: tegra-284
Blocks: tegra-285
Changing the summary to reflect that this is, in the end, about foopy24, since it was still having issues after the log cleanup and no other foopies exhibited issues.
Summary: Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps. → foopy24 problem tracking
Blocks: 843918
Blocks: 851562
Depends on: 851579
No longer depends on: 851579
So, I just ignorantly started all of these tegras up on foopy127 because I was trying to work my way through the buildduty queue. They're in production now, is this going to break things?
They seem to be green....
Resolving as "incomplete" since we never actually fixed the issues on foopy24, but foopy24 no longer hosts any of our devices, so I'm not expecting the problem to resurface on the linux foopies.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.