Closed
Bug 902903
Opened 11 years ago
Closed 11 years ago
Buildbot doesn't start after post-install reboot on Lion
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Infrastructure & Operations
RelOps: Puppet
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
References
Details
Attachments
(1 file)
(deleted),
patch
|
coop
:
review+
|
Details | Diff | Splinter Review |
(moved from bug 891880)
Buildbot is not starting after the puppetize.sh-induced reboot. It starts fine on the next reboot, though.
First boot:
> Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Unable to fetch my node definition, but the agent run will continue:
> Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: Error 400 on SERVER: this master is not a CA
> Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet]/ensure) created
> Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider]/ensure) created
> Aug 6 13:02:21 talos-r4-lion-011 puppet-agent[234]: (/File[/var/lib/puppet/lib/puppet/provider/file_line]/ensure) created
> ..
> Aug 6 13:04:39 talos-r4-lion-011 puppet-agent[234]: (/Stage[main]/Buildslave::Startup::Launchd/File[/Library/LaunchAgents/com.mozilla.buildslave.plist]/ensure) defined content as '{md5}bcf1fcd8bd1b38f2f839b146945f7301'
> ..
> Aug 6 13:05:25 talos-r4-lion-011 puppet-agent[234]: Finished catalog run in 168.92 seconds
Second boot:
> Aug 6 13:05:38 talos-r4-lion-011 reboot[2533]: rebooted by _locationd
> Aug 6 13:05:38 talos-r4-lion-011 reboot[2533]: BOOT_TIME: 1375819538 424994
> rc.server[ 8 ]: Tuning server for 8 GB (rounded down).
> Aug 6 13:05:48 localhost bootlog[0]: BOOT_TIME 1375819548 0
> Aug 6 13:06:01 localhost UserEventAgent[30]: starting CaptiveNetworkSupport as SystemEventAgent built May 25 2011 12:27:35
> Aug 6 13:05:49 localhost com.apple.launchd[1]: *** launchd[1] has started up. ***
> Aug 6 13:05:49 localhost com.apple.launchd[1]: *** Verbose boot, will log to /dev/console. ***
> Aug 6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Bug: launchd_core_logic.c:5174 (25247):2
> Aug 6 13:06:00 localhost com.apple.launchd[1] (com.apple.xgridd.pcastserver): Path monitoring failed on "/var/pcast/server/xgridd/keepalive": No such file or directory
> Aug 6 13:06:01 localhost UserEventAgent[30]: CertsKeychainMonitor: configuring
> Aug 6 13:06:01 localhost UserEventAgent[30]: WirelessAirPortDeviceNameCopy(): no BSD interface name found for object 12551
> Aug 6 13:06:01 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCCopyWiFiDevices:388 WiFi Device Name == NULL
> Aug 6 13:06:03 localhost mDNSResponder[31]: mDNSResponder mDNSResponder-320.10 (Aug 2 2011 19:56:51) starting OSXVers 11
> Aug 6 13:06:03 localhost configd[34]: BLUETOOTH: Failed to start job com.apple.blued
> Aug 6 13:06:06 localhost airportd[51]: _processDLILEvent: en1 attached (down)
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CreateInterfaceWatchList:2788 WiFi Devices Found. :)
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order
> Aug 6 13:06:06: --- last message repeated 1 time ---
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptiveSCRebuildCache:81 Failed to get service order
> Aug 6 13:06:06: --- last message repeated 1 time ---
> Aug 6 13:06:06 localhost UserEventAgent[30]: CaptiveNetworkSupport:CaptivePublishState:1211 en1 - PreProbe
> Aug 6 13:06:06 localhost systemkeychain[46]: done file: /var/run/systemkeychaincheck.done
> Aug 6 13:06:06 localhost mDNSResponder[31]: D2D_IPC: Loaded
> Aug 6 13:06:06 localhost mDNSResponder[31]: D2DInitialize succeeded
> Aug 6 13:06:06 localhost com.apple.ucupdate.plist[72]: ucupdate: Checked 1 update, no match found.
> Aug 6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.out", ...): No such file or directory
> Aug 6 13:06:06 localhost com.apple.pfctl[81]: No ALTQ support in kernel
> Aug 6 13:06:06 localhost com.apple.pfctl[81]: ALTQ related functions disabled
> Aug 6 13:06:06 localhost com.apple.launchd[1] (com.mozilla.puppet[102]): open("/var/log/puppet/puppet.err", ...): No such file or directory
> Aug 6 13:06:06 localhost emond[94]: SetUpLogs: uid = 0 gid = 0
> Aug 6 13:06:06 localhost emond[94]: SetUpLogs: opening /Library/Logs/EventMonitor/EventMonitor.error.log
> Aug 6 13:06:07 localhost com.mozilla.puppet[102]: Starting run-puppet.sh at Tue Aug 6 13:06:07 PDT 2013
> Aug 6 13:06:07 localhost com.mozilla.puppet[102]: No DNS configuration available
> Aug 6 13:06:07 localhost com.mozilla.puppet[102]: ..waiting for DNS
> HeadlessStartup: Already setup or this is an upgrade so we will not set the password.
> Aug 6 13:06:07 localhost com.apple.usbmuxd[71]: usbmuxd-211 built on May 16 2011 at 00:14:56 on May 16 2011 at 00:14:55, running 64 bit
> Aug 6 13:06:07 localhost loginwindow[85]: Login Window Application Started
> Aug 6 13:06:07 localhost freshclam[67]: Can't query current.cvd.clamav.net
> Aug 6 13:06:07 localhost freshclam[67]: Invalid DNS reply. Falling back to HTTP mode.
> Aug 6 13:06:08 localhost configd[34]: bootp_session_transmit: bpf_write(en1) failed: Network is down (50)
> Aug 6 13:06:08 localhost configd[34]: DHCP en1: INIT transmit failed
> Aug 6 13:06:08 localhost com.mozilla.puppet[102]: No DNS configuration available
> Aug 6 13:06:08 localhost com.mozilla.puppet[102]: ..waiting for DNS
> Aug 6 13:06:08 localhost UserEventAgent[30]: get_backup_share_points no AFP
> Aug 6 13:06:08 localhost UserEventAgent[30]: WebUserEventAgent: installed
> Aug 6 13:06:08 talos-r4-lion-011 configd[34]: setting hostname to "talos-r4-lion-011.build.scl1.mozilla.com"
> Aug 6 13:06:08 talos-r4-lion-011 configd[34]: network configuration changed.
> Aug 6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known
> Aug 6 13:06:08 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: )
> Aug 6 13:06:09 talos-r4-lion-011 mds[83]: (Normal) FMW: FMW 0 0
> Aug 6 13:06:10 talos-r4-lion-011 loginwindow[85]: **DMPROXY** Found `/System/Library/CoreServices/DMProxy'.
> Aug 6 13:06:11 talos-r4-lion-011 netbiosd[137]: Unable to start NetBIOS name service:
> Aug 6 13:06:12 talos-r4-lion-011 loginwindow[85]: Login Window Started Security Agent
> Aug 6 13:06:12 talos-r4-lion-011 com.apple.launchctl.LoginWindow[141]: com.apple.findmymacmessenger: Already loaded
> Aug 6 13:06:12 talos-r4-lion-011 rpcsvchost[139]: sandbox_init: com.apple.msrpc.netlogon.sb succeeded
> Aug 6 13:06:12 talos-r4-lion-011 SecurityAgent[149]: Echo enabled
> Aug 6 13:06:13 talos-r4-lion-011 SecurityAgent[149]: User info context values set for cltbld
> Aug 6 13:06:13 talos-r4-lion-011 freshclam[67]: Can't query current.cvd.clamav.net
> Aug 6 13:06:13 talos-r4-lion-011 freshclam[67]: Invalid DNS reply. Falling back to HTTP mode.
> Aug 6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't get information about database.clamav.net: nodename nor servname provided, or not known
> Aug 6 13:06:14 talos-r4-lion-011 freshclam[67]: Can't read main.cvd header from database.clamav.net (IP: )
> Aug 6 13:06:14 talos-r4-lion-011 SecurityAgent[149]: Login Window login proceeding
> Aug 6 13:06:15 talos-r4-lion-011 configd[34]: network configuration changed.
> Aug 6 13:06:16 talos-r4-lion-011 ntpd[66]: proto: precision = 1.000 usec
> Aug 6 13:06:16 talos-r4-lion-011 XProtectUpdater[69]: Ignoring new signature plist: Not an increase in version
> Aug 6 13:06:16 talos-r4-lion-011 com.apple.launchd[1] (com.apple.xprotectupdater[69]): Exited with code: 252
> Aug 6 13:06:19 talos-r4-lion-011 loginwindow[85]: Login Window - Returned from Security Agent
> Aug 6 13:06:19 talos-r4-lion-011 loginwindow[85]: USER_PROCESS: 85 console
> Aug 6 13:06:19 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED!
> Aug 6 13:06:19 talos-r4-lion-011 freshclam[67]: Local version: 0.97.1 Recommended version: 0.97.8
> Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.ReportCrash): Falling back to default Mach exception handler. Could not find: com.apple.ReportCrash.Self
> Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: load: option requires an argument -- D
> Aug 6 13:06:19 talos-r4-lion-011 com.apple.launchctl.Aqua[205]: usage: launchctl load [-wF] [-D <user|local|network|system|all>] paths...
> Aug 6 13:06:20 talos-r4-lion-011 UserEventAgent[30]: CaptiveNetworkSupport:CNSServerRegisterUserAgent:187 new user agent port: 29983
> Aug 6 13:06:20 talos-r4-lion-011 mds[83]: (Warning) Server: No stores registered for metascope "kMDQueryScopeComputerIndexed"
> Aug 6 13:06:20 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.launchctl.Aqua[205]): Exited with code: 1
> Aug 6 13:06:21 talos-r4-lion-011 sshd[197]: USER_PROCESS: 228 ttys000
> Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: ;; connection timed out; no servers could be reached
> Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: releng-puppet2.srv.releng.scl3.mozilla.com has address 10.26.48.50
> Aug 6 13:06:23 talos-r4-lion-011 com.mozilla.puppet[102]: Running puppet agent against server 'releng-puppet2.build.scl1.mozilla.com'
> Aug 6 13:06:23 talos-r4-lion-011 MiniLauncher[232]: Launching iCloud prefs for user 28
> Aug 6 13:06:27 talos-r4-lion-011 iCal Helper[247]: Could not find Meta Data for persistent Store
> Aug 6 13:06:28 talos-r4-lion-011 System Preferences[241]: contentsOfDirectoryAtPath returned:Error Domain=NSCocoaErrorDomain Code=260 "The folder <E2><80><9C>PreferencePanes<E2><80><9D> doesn<E2><80><99>t exist." UserInfo=0x4004849e0 {NSFilePath=/Users/cltbld/Library/PreferencePanes, NSUserStringVariant=(
> Folder
> ), NSUnderlyingError=0x400484a20 "The operation couldn<E2><80><99>t be completed. (OSStatus error -43.)"} for path:/Users/cltbld/Library/PreferencePanes
> Aug 6 13:06:32 talos-r4-lion-011 WindowServer[119]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
> Aug 6 13:06:36 talos-r4-lion-011 Dock[223]: Bookmark failed to issue extension for item /Developer/Applications (depth=1): No such file or directory
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.749 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.750 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:40 talos-r4-lion-011 com.apple.dock.extra[287]: 2013-08-06 13:06:40.751 com.apple.dock.extra[287:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:06:41 talos-r4-lion-011 mds[83]: (/)(Warning) IndexQuery in bool preIterate_FSI(SISearchCtx_FSI*):Throttling inefficient file system query
> Aug 6 13:06:46 talos-r4-lion-011 Finder[229]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.finder.savedState/window_1.data: No such file or directory (2)
> Aug 6 13:06:48 talos-r4-lion-011 UserEventAgent[30]: CertsKeychainMonitor: ready to process keychain & timer events
> Aug 6 13:06:49 talos-r4-lion-011 System Preferences[241]: Database file is missing: /Users/cltbld/Library/Application Support/AddressBook/AddressBook-v22.abcddb
> Aug 6 13:06:49 talos-r4-lion-011 System Preferences[241]: unable to dlopen /usr/lib/sasl2/pwauxprop.so: dlopen(/usr/lib/sasl2/pwauxprop.so, 2): no suitable image found. Did find:
> /usr/lib/sasl2/pwauxprop.so: GC capability mismatch
> Aug 6 13:06:55 talos-r4-lion-011 System Preferences[241]: The IAAccountDiscovery completed notification
> Aug 6 13:06:56 talos-r4-lion-011 System Preferences[241]: Persistent UI failed to open file file://localhost/Users/cltbld/Library/Saved%20Application%20State/com.apple.systempreferences.savedState/window_3.data: No such file or directory (2)
> Aug 6 13:07:00 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.AddressBook.abd): Throttling respawn: Will start in 8 seconds
> Aug 6 13:07:01 talos-r4-lion-011 AddressBookSync[306]: Could not replace account with identifier: _local
> Aug 6 13:07:01 talos-r4-lion-011 [0x0-0xc00c].com.apple.systempreferences[241]: 2013-08-06 13:07:01.398 AddressBookSync[306:707] Could not replace account with identifier: _local
> Aug 6 13:07:01 talos-r4-lion-011 genatsdb[252]: *GENATSDB* FontObjects generated = 324
> Aug 6 13:07:03 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 69.163.100.14)
> Aug 6 13:07:03 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 150.214.142.197)
> Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getfile: daily-17290.cdiff not found on remote server (IP: 207.57.106.31)
> Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug 6 13:07:04 talos-r4-lion-011 org.clamav.freshclam-init[67]: ERROR: getpatch: Can't download daily-17290.cdiff from database.clamav.net
> Aug 6 13:07:04 talos-r4-lion-011 freshclam[67]: Incremental update failed, trying to download daily.cvd
> Aug 6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv4 rules
> Aug 6 13:07:10 talos-r4-lion-011 servermgrd[77]: servermgr_ipfilter:ipfw config:Notice:Flushed IPv6 rules
> Aug 6 13:07:25 talos-r4-lion-011 freshclam[67]: Your ClamAV installation is OUTDATED!
> Aug 6 13:07:25 talos-r4-lion-011 freshclam[67]: Current functionality level = 61, recommended = 63
> Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Darwinuser[cltbld]/gid) gid changed '0' to 'staff'
> Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[users]/Users::Builder::Account/Exec[kill-builder-keychain]) Triggered 'refresh' from 1 events
> Aug 6 13:07:40 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully
> Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/File[/usr/local/bin/changebackground.sh]/ensure) defined content as '{md5}32b1681c3cf1e2d83684d00fe233c430'
> Aug 6 13:08:09 talos-r4-lion-011 defaults[1085]:
> The domain/default pair of (com.apple.desktop, Background) does not exist
> Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]/returns) executed successfully
> Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]) Triggered 'refresh' from 1 events
> Aug 6 13:08:09 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) Triggered 'refresh' from 2 events
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[223]): Exited with code: 1
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): The GID of the account (20) changed out from under us (0)!
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.Dock.agent[1094]): In a future build of the OS, this error will be fatal.
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.826 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.827 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.828 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:10 talos-r4-lion-011 com.apple.dock.extra[1095]: 2013-08-06 13:08:10.829 com.apple.dock.extra[1095:1707] Could not connect the action buttonPressed: to target of class NSApplication
> Aug 6 13:08:11 talos-r4-lion-011 screenresolution[1097]: starting screenresolution argv=/usr/local/bin/screenresolution get
> Aug 6 13:08:11 talos-r4-lion-011 screenresolution[1097]: Display 0: 1600x1200x32
> Aug 6 13:08:13 talos-r4-lion-011 defaults[1145]:
> The domain/default pair of (/Users/cltbld/Library/Preferences/.GlobalPreferences.plist, NSQuitAlwaysKeepsWindows) does not exist
> Aug 6 13:08:13 talos-r4-lion-011 puppet-agent[240]: (/Stage[main]/Clean::Appstate/Osxutils::Defaults[builder-NSQuitAlwaysKeframework, Our status with coreservicesd is unusual, 2, and things eruser.28[203] (com.apple.cookied[1203]): The GID of the account (20) changed out from under us (0)!
> Aug 6 13:09:32 talos-r4-lion-011 com.apple.launchd.peruser.28[203] (com.apple.cookied[1203]): In a future build of the
I suspect that the changing gid is the cause. One fix, then, might be to switch to the newer version of the user provider that's now packaged with Puppet (bug 894900).
A short-term fix, if that's the case, may be to run puppet twice on Darwin before the initial reboot.
Assignee | ||
Comment 1•11 years ago
|
||
I added the following to puppetize.sh on pxe1.build.scl1:
# on Darwin, run puppet again, as the user provider doesn't quite get things (particularly gid) right on the first try - bug 902903
if [ "$OS" = "Darwin" ]; then
while ! /usr/bin/puppet agent --no-daemonize --onetime --server=${PUPPET_SERVER:-puppet} --pluginsync --ssldir=/var/lib/puppet/ssl; do
echo "Second puppet run failed; re-trying after 10m"
sleep 600
done
fi
Coop will try later today and let me know if it helped.
Comment 2•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Coop will try later today and let me know if it helped.
I've blessed talos-r4-lion-031 and rebooted it.
Comment 3•11 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #2)
> I've blessed talos-r4-lion-031 and rebooted it.
talos-r4-lion-031 came back from netboot properly, but buildbot didn't start up. It's currently sitting in that state.
Did I miss a step here?
Assignee | ||
Comment 4•11 years ago
|
||
Nope, no steps missed. I'll have a look now.
Assignee | ||
Comment 5•11 years ago
|
||
Finished the first catalog run:
> Aug 9 09:07:43 talos-r4-lion-031 puppet-agent[242]: Finished catalog run in 168.88 seconds
Started the second, including the suspect gid change:
> Aug 9 09:08:09 talos-r4-lion-031 puppet-agent[2532]: (/Stage[users]/Users::Builder::Account/Darwinuser[cltbld]/gid) gid changed '0' to 'staff'
> Aug 9 09:08:09 talos-r4-lion-031 puppet-agent[2532]: (/Stage[users]/Users::Builder::Account/Exec[kill-builder-keychain]) Triggered 'refresh' from 1 events
> Aug 9 09:08:10 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully
> Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup[3437]): posix_spawn("/etc/deploystudio/sbin/ds_finalize_cleanup.sh", ...): No such file or directory
> Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup[3437]): Exited with code: 1
> Aug 9 09:08:44 talos-r4-lion-031 com.apple.launchd[1] (com.deploystudio.finalizeCleanup): Throttling respawn: Will start in 10 seconds
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/File[/usr/local/bin/changebackground.sh]/ensure) defined content as '{md5}32b1681c3cf1e2d83684d00fe233c430'
> Aug 9 09:08:45 talos-r4-lion-031 defaults[3456]:
> The domain/default pair of (com.apple.desktop, Background) does not exist
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]/returns) executed successfully
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[set-background-image]) Triggered 'refresh' from 1 events
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]/returns) No matching processes were found
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) Failed to call refresh: /usr/bin/killall Dock returned 1 instead of one of [0]
> Aug 9 09:08:45 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Gui::Appearance/Exec[restart-Dock]) /usr/bin/killall Dock returned 1 instead of one of [0]
it tries to set the screenresolution when it shouldn't because FACTER_PUPPETIZING is not set; this is harmless
> Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: starting screenresolution argv=/usr/local/bin/screenresolution get
> Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: 3891612: (connectAndCheck) Untrusted apps are not allowed to connect to or launch Window Server before login.
> Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
> Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3465]: Error: failed to get list of active displays
> Aug 9 09:08:45 talos-r4-lion-031 screenresolution[3467]: starting screenresolution argv=/usr/local/bin/screenresolution set 1600x1200x32
> Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: 3891612: (connectAndCheck) Untrusted apps are not allowed to connect to or launch Window Server before login.
> Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: kCGErrorFailure: Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
> Aug 9 09:08:46 talos-r4-lion-031 screenresolution[3467]: Error: failed to get list of active displays
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) 2013-08-09 09:08:45.996 screenresolution[3467:707] starting screenresolution argv=/usr/local/bin/screenresolution set 1600x1200x32
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) 2013-08-09 09:08:46.000 screenresolution[3467:707] Error: failed to get list of active displays
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0]
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Screenresolution/Exec[set-resolution]/returns) change from notrun to 0 failed: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0]
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Puppet::Atboot/File[/etc/puppet/puppetmasters.txt]/content) content changed '{md5}22db71f3f0bca0d11d7bd94fc873341b' to '{md5}fdf71264736f9a8b0fcf316f384bc280'
> Aug 9 09:08:46 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Users::Builder::Setup/Python::User_pip_conf[cltbld]/File[/Users/cltbld/.pip/pip.conf]/content) content changed '{md5}a408bfe315e1832a457fc0e1193bc244' to '{md5}056e5ae36816c7b96ebd1e2fcc6adb95'
> Aug 9 09:08:48 talos-r4-lion-031 puppet-agent[2532]: (/Stage[main]/Users::Root::Setup/Python::User_pip_conf[root]/File[/var/root/.pip/pip.conf]/content) content changed '{md5}a408bfe315e1832a457fc0e1193bc244' to '{md5}056e5ae36816c7b96ebd1e2fcc6adb95'
> Aug 9 09:08:49 talos-r4-lion-031 puppet-agent[2532]: Finished catalog run in 52.19 seconds
It reboots..
> Aug 9 09:09:03 talos-r4-lion-031 reboot[3555]: rebooted by _locationd
> Aug 9 09:09:03 talos-r4-lion-031 reboot[3555]: BOOT_TIME: 1376064543 474837
And runs puppet..
> Aug 9 09:10:50 talos-r4-lion-031 puppet-agent[235]: (/Stage[main]/Tweaks::Cleanup/Exec[find /tmp/* -mmin +15 -print | xargs -n1 rm -rf]/returns) executed successfully
> Aug 9 09:11:22 talos-r4-lion-031 screenresolution[1058]: starting screenresolution argv=/usr/local/bin/screenresolution get
> Aug 9 09:11:22 talos-r4-lion-031 screenresolution[1058]: Display 0: 1600x1200x32
> Aug 9 09:11:24 talos-r4-lion-031 defaults[1106]:
> The domain/default pair of (/Users/cltbld/Library/Preferences/.GlobalPreferences.plist, NSQuitAlwaysKeepsWindows) does not exist
> Aug 9 09:11:24 talos-r4-lion-031 puppet-agent[235]: (/Stage[main]/Clean::Appstate/Osxutils::Defaults[builder-NSQuitAlwaysKeepsWindows]/Exec[osx_defaults write /Users/cltbld/Library/Preferences/.GlobalPreferences.plist NSQuitAlwaysKeepsWindows=>0]/returns) executed successfully
> Aug 9 09:11:27 talos-r4-lion-031 puppet-agent[235]: Finished catalog run in 53.52 seconds
but doesn't run Buildbot.
cltbld is logged in and its launchd is running and watching /var/tmp/puppet.finished
And I just noticed the slave is disabled! So, that would explain it for this host at least.
Assignee | ||
Comment 6•11 years ago
|
||
We re-imaged 031, and it mysteriously started Buildbot after I ran dtruss on launchd and touched puppet.finished. I'll watch more closely next time (-033).
Assignee | ||
Comment 7•11 years ago
|
||
I mentioned in bug 902970, but it might be relevant here, too -- the image we're building these r4 minis out with was originally developed for r5 minis. That *might* be relevant.
There may also be a timing issue, where puppet touches the semaphore file before cltbld's launchd is watching it.
Assignee | ||
Comment 8•11 years ago
|
||
The system logs for lion startup contain lots of errors, but those errors are the same on the old build machines and the new talos machines, so that's probably not a useful lead.
In the system logs, I see lots of logs from com.apple.launchd.peruser.28 before I see the puppet run, so I think the race condition in comment 7 is also not present. I also don't see why a reboot would fix that issue.
So, still stumped.
Assignee | ||
Comment 9•11 years ago
|
||
Aug 12 15:45:36 talos-r4-lion-005 puppet-agent[228]: Finished catalog run in 50.26 seconds
Aug 12 15:45:39 talos-r4-lion-005 XProtectUpdater[69]: Ignoring new signature plist: Not an increase in version
Aug 12 15:45:39 talos-r4-lion-005 com.apple.launchd[1] (com.apple.xprotectupdater[69]): Exited with code: 252
Aug 12 15:46:59 talos-r4-lion-005 helpd[243]: CarbonCore.framework, Our status with coreservicesd is unusual, 2, and things likely will stop working.
Aug 12 15:46:59 talos-r4-lion-005 helpd[243]: CarbonCore.framework: Unable to connect to coreservicesd. Things likely won't work. Sorry.
Aug 12 15:48:41 talos-r4-lion-005 servermgrd[77]: No requests in 300 seconds, shutting down
Aug 12 15:48:50 talos-r4-lion-005 talagent[207]: PersistentUI: CSBackupSetItemExcluded() failed with OSStatus -61
[root@talos-r4-lion-005.build.scl1.mozilla.com ~]# ls -al /builds/slave/{,talos-slave} /var/tmp/puppet.finished
-rw-r--r-- 1 root wheel 0 Aug 12 15:45 /var/tmp/puppet.finished
/builds/slave/:
total 0
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 .
drwxr-xr-x 3 root wheel 102 Aug 12 15:39 ..
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-slave
/builds/slave/talos-slave:
total 0
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 .
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 ..
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-data
It must be something about missing the "touch" in that file - some sort of launchd bug. If I touch puppet.finished again, it invokes runslave a second or so later:
[root@talos-r4-lion-005.build.scl1.mozilla.com ~]# touch /var/tmp/puppet.finished
[root@talos-r4-lion-005.build.scl1.mozilla.com ~]# ls -al /builds/slave/{,talos-slave} /var/tmp/puppet.finished
-rw-r--r-- 1 root wheel 0 Aug 12 15:56 /var/tmp/puppet.finished
/builds/slave/:
total 0
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 .
drwxr-xr-x 3 root wheel 102 Aug 12 15:39 ..
drwxr-xr-x 4 cltbld staff 136 Aug 12 15:57 talos-slave
/builds/slave/talos-slave:
total 8
drwxr-xr-x 4 cltbld staff 136 Aug 12 15:57 .
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 ..
-rw-r--r-- 1 cltbld staff 1397 Aug 12 15:57 buildbot.tac
drwxr-xr-x 3 cltbld staff 102 Aug 12 15:39 talos-data
I suspect we need to find a more robust way to start buildbot after puppet.
Assignee | ||
Comment 10•11 years ago
|
||
09:55 <@gneagle> djmitche: This fails because once a WatchPath doesn't exist (for example, if it's removed), launchd stops paying attention to it.
09:55 <@gneagle> djmitche: Instead, use KeepAlive:PathState
09:55 < djmitche> oh!
09:55 < djmitche> that explains why this only fails once - user's launchd starts *before* the path exists
09:55 < djmitche> thanks!
09:56 <@gneagle> djmitche: Example here: http://code.google.com/p/munki/source/browse/launchd/LaunchDaemons/com.googlecode.munki.managedsoftwareupdate-manualcheck.plist
Assignee | ||
Comment 11•11 years ago
|
||
09:57 <@gneagle> Also: http://code.google.com/p/munki/source/browse/launchd/LaunchAgents/com.googlecode.munki.ManagedSoftwareUpdate.plist
09:58 <@gneagle> I use this technique for user-level processes to trigger root-level processes and the other way around, which sounds like what you are doing as well.
Assignee | ||
Comment 12•11 years ago
|
||
So the process using PathState would be this:
* puppet startup deletes the semaphore file (to avoid early startup of buildbot if a machine crashes while starting buildbot)
* on success, puppet creates the semaphore file
* buildbot script immediately deletes the semaphore file when it starts
My worry is that launchd will kill the buildbot script when the semaphore file is removed, but we'll see.
Assignee | ||
Comment 13•11 years ago
|
||
Launchd doesn't see buildbot - it just sees run-buildslave.sh/runslave.py exit, so it doesn't kill anything.
I've only tested this on snow leopard so far, but let's get the review going while I finish testing.
Attachment #789884 -
Flags: review?(coop)
Assignee | ||
Comment 14•11 years ago
|
||
Works fine on mountain lion, too:
Aug 14 06:10:21 talos-mtnlion-r5-001.test.releng.scl3.mozilla.com puppet-agent[140]: Finished catalog run in 41.53 seconds
Aug 14 06:10:22 talos-mtnlion-r5-001.test.releng.scl3.mozilla.com run-buildslave[974]: starting runslave.py
Can I borrow a lion host briefly?
Comment 15•11 years ago
|
||
Comment on attachment 789884 [details] [diff] [review]
bug902903.patch
Review of attachment 789884 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/buildslave/manifests/startup/launchd.pp
@@ +23,3 @@
>
> $buildslave_path = "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin"
>
WS cleanup on landing, please? (I know it's not yours.)
Attachment #789884 -
Flags: review?(coop) → review+
Comment 16•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> Can I borrow a lion host briefly?
talos-r4-lion-001 is ready for you. I disabled it in slavealloc previously, but it is now re-enabled and should connect to buildbot on the next reboot/netboot.
Assignee | ||
Comment 17•11 years ago
|
||
Looks good there, too:
Aug 14 09:09:46 talos-r4-lion-001 puppet-agent[137]: Finished catalog run in 55.02 seconds
Aug 14 09:09:48 talos-r4-lion-001 run-buildslave[1088]: starting runslave.py
Assignee | ||
Comment 18•11 years ago
|
||
Landed, and I see no problems cropping up. We'll see for sure if this fixed the problem with imaging next time we re-image a host.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•