Closed Bug 739384 Opened 13 years ago Closed 13 years ago

Deploy verification scripts to production foopies (changes from Bug 690311 Part 1)

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

x86_64
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: armenzg)

References

Details

Armen asked me to file a bug against him for this. We need to deploy Bug 690311 Part 1 to the various foopies in a bit of a staged rollout (so that if something breaks, we don't break everything at once). He said he will tackle the first production foopy in his morning.
Steps to follow (see the sketch below):
* load up bm19 and each of the slaves for that foopy
* hit graceful shutdown for each
* once *all* of them are done, run /builds/stop_cp.sh (it stops all tegras on that foopy)
** you can modify stop_cp.sh to send each call to stop.py to the background
* once *all* have stopped, update the /builds/tools checkout
* run /builds/start_cp.sh (it starts all tegras on that foopy)

I added the --debug flag to start_cp.sh.
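For reference, a rough sketch of what the per-foopy part of this could look like as one script. The stop_cp.sh/start_cp.sh/tools paths come from the steps above; backgrounding each stop.py call is shown only as an illustration of the suggested modification, since the actual contents of stop_cp.sh aren't in this bug.

#!/bin/bash
# Hypothetical per-foopy rollout helper based on the steps above.
# Run on the foopy itself, inside the shared screen session
# ("screen -x"), after all of its slaves have been gracefully shut
# down from the buildbot master.
set -e

# 1. Stop clientproxy for every tegra on this foopy.
#    (stop_cp.sh could be modified so each stop.py call runs in the
#    background, e.g. "python stop.py ... &" per tegra followed by a
#    single "wait".)
/builds/stop_cp.sh

# 2. Update the shared tools checkout that the tegras run from.
cd /builds/tools
hg pull
hg update

# 3. Start clientproxy for every tegra again; --debug was added to
#    start_cp.sh for extra logging during this rollout.
/builds/start_cp.sh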
Callek, would you mind having a look at tegra-036, tegra-037 and tegra-200? I can add your pub key to foopy07. It seems that all of them started, but these 3 failed on some jobs after starting. I assume this is fine, but I would prefer to have an extra pair of eyes.

I wish we had a way to gracefully shut down a tegra when it starts failing and make it run through the verify steps again, or something like that.

An interesting case is this one:
http://buildbot-master19.build.mtv1.mozilla.com:8201/builders/Android%20Tegra%20250%20mozilla-inbound%20talos%20remote-twinopen/builds/1545
where the job is going well until it hits "Run Performance tests", which fails, and then everything after that fails too.
I guess we get a little bit of insight by looking at these:
http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-037_status.log
http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-036_status.log

I really don't like the message "SUTAgent not present;", but I bet that is completely unrelated to our changes. I think tegra-037 was not in good shape even prior to our changes.
I realized that I ran ./start_cp.sh without "screen -x". Why have the jobs not failed?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> I realized that I ran ./start_cp.sh without "screen -x". Why have the jobs
> not failed?

They don't fail to start; what happens is that at some point in the future they will fail. How long that takes is random (or seems to be random).
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #4)
> I guess we have a little bit of insight by looking at this:
> http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-037_status.log
> http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-036_status.log
>
> I really don't like the message "SUTAgent not present;" but I bet that is
> completely unrelated to our changes.
>
> I think tegra-037 is not in a good shape even prior to our changes.

"SUTAgent not present" just means that clientproxy could not connect to the SUTAgent daemon running on the Tegra. Why the SUTAgent isn't running is a separate question - that is beyond clientproxy's ability to determine.
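For anyone else poking at this, a minimal reachability check from the foopy might look like the following; 20701 as the SUTAgent command port and the availability of nc on the foopy are assumptions:

# Minimal SUTAgent reachability check (assumed port 20701, assumed nc).
SUT_IP="10.250.49.x"   # substitute the tegra's real IP here
if nc -z -w 5 "$SUT_IP" 20701; then
    echo "SUTAgent is answering on $SUT_IP"
else
    echo "no SUTAgent listening on $SUT_IP (matches 'SUTAgent not present')"
fi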
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> I realized that I ran ./start_cp.sh without "screen -x". Why have the jobs
> not failed?

So we have many reds on m-c and inbound from runs with the tegras attached to foopy07. What I can see so far indicates it's not because of my code, but because of this deploy. I think we should run stop_cp and re-run start_cp with the proper "screen -x".

The symptom is |hg clone| failing because python can't |import site|, with errors like the following:

hg clone http://hg.mozilla.org/build/tools tools
 in dir /builds/tegra-035/test/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  PATH=/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/tegra-035/test
  SUT_IP=10.250.49.22
  SUT_NAME=tegra-035
  __CF_USER_TEXT_ENCODING=0x1F6:0:0
 closing stdin
 using PTY: False
'import site' failed; use -v for traceback
Traceback (most recent call last):
  File "/opt/local/bin/hg", line 38, in <module>
    mercurial.dispatch.run()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/dispatch.py", line 16, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/dispatch.py", line 21, in dispatch
    u = uimod.ui()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/ui.py", line 35, in __init__
    for f in util.rcpath():
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/util.py", line 1346, in rcpath
    _rcpath = os_rcpath()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/util.py", line 1321, in os_rcpath
    path.extend(user_rcpath())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/posix.py", line 53, in user_rcpath
    return [os.path.expanduser('~/.hgrc')]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/posixpath.py", line 259, in expanduser
    userhome = pwd.getpwuid(os.getuid()).pw_dir
KeyError: 'getpwuid(): uid not found: 502'
program finished with exit code 1

There is a small code error, though, that affects the reporting but not the outcome of verify.py. I'll attach it to bug 690311 momentarily and land it as soon as it gets review.
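A quick way to spot this condition on a foopy before it shows up as red jobs might be to confirm that the uid of every running clientproxy still resolves to a user account; something like the sketch below (matching on "clientproxy.py" as the process name is an assumption):

# Check that each running clientproxy's uid still resolves to a user;
# a uid that doesn't resolve will hit the getpwuid() KeyError above.
for pid in $(pgrep -f clientproxy.py); do
    uid=$(ps -o uid= -p "$pid" | tr -d ' ')
    if id "$uid" >/dev/null 2>&1; then
        echo "pid $pid: uid $uid resolves fine"
    else
        echo "pid $pid: uid $uid does NOT resolve"
    fi
done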
I'm restarting clientproxy via screen for all the tegras connected to foopy07. Will update when done.
(In reply to Chris Cooper [:coop] from comment #9) > I'm restarting clientproxy via screen for all the tegras connected to > foopy07. WIll update when done. This is done now.
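For the record, the restart coop describes presumably boils down to something like the following on foopy07; whether stop_cp.sh needs to run first, and the use of an unnamed screen session, are assumptions:

# Reattach to the foopy's shared screen session; "screen -x" attaches
# even if someone else is already attached.
screen -x

# ...then, inside the screen session:
/builds/stop_cp.sh     # stop the clientproxies started outside screen
/builds/start_cp.sh    # start them again, this time under screen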
Thanks coop. Callek, the jobs since the restart seem good.
Callek, anything left before I take care of the remaining foopies? I would like to do this tomorrow morning.
Summary: Deploy changes from Bug 690311 Part 1 → Deploy verification scripts to production foopies (changes from Bug 690311 Part 1)
Priority: -- → P1
We are good to go forward with the rest of these, though in IRC you suggested possibly waiting until Friday. If you do delay, it will not hold me up, so whichever.
I will be doing this tomorrow as today was busy for me.
bm19 has been done (using screen). I will now work on bm20.
For the record, the original revision of tools was dace6c4e8902, in case we need to revert to it (I really hope we don't). bm20 is down and I am now stopping all foopies/tegras for it.
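If we did have to revert, it would presumably just mean pinning each foopy's tools checkout back to that revision and bouncing clientproxy, roughly:

# Hypothetical rollback on a foopy: pin /builds/tools to the pre-deploy
# revision noted above, then restart clientproxy from inside screen.
cd /builds/tools
hg pull
hg update -r dace6c4e8902
# then, inside "screen -x":
#   /builds/stop_cp.sh && /builds/start_cp.sh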
This is done. Now, let's watch that nothing breaks.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Sooooo, it looks like we were not done here: bm20 wasn't updated; bear accomplished most of that for us. bm19 has at _least_ one foopy we never updated, so we need to check all of that as well. I wrote up an etherpad with the relevant info we have/need here: https://etherpad.mozilla.org/usN39EKFDB
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Justin Wood (:Callek) from comment #18) > bm19 has at _least_ one foopy we never updated, so we need to check all that > as well. Besides that one foopy, there was only one other to update (foopy09 and foopy19 for history). Those are done now, thanks bear!
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard