Closed Bug 457885 Opened 16 years ago Closed 16 years ago

production talos skips builds based upon incorrect ftp scraping

Categories

(Release Engineering :: General, defect, P3)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: catlee)

References

Details

Attachments

(2 files, 2 obsolete files)

In the afternoon/evening of September 29, the windows talos testing machines stopped testing any builds for quite a while, despite the windows builders producing a number of builds during that interval. This was rather unfortunate, since:

* we were relying on the fact that the talos testers would test pretty much every Windows build in order to reland a patch that caused a regression only on the Windows XP talos boxes in pieces to figure out which piece was responsible
* there was, in fact, a regression, but we don't know yet which of the 20 or so changesets was responsible.

In particular, talos boxes (XP and Vista) tested:
http://hg.mozilla.org/mozilla-central/rev/61642beb4c16
but then did not test the windows builds for:
http://hg.mozilla.org//mozilla-central/index.cgi/rev/07da123ca0b3 (2 builds)
http://hg.mozilla.org//mozilla-central/index.cgi/rev/edc314aed893 (2 builds)
http://hg.mozilla.org//mozilla-central/index.cgi/rev/38a48d485876 (2 builds)
They finally then tested the third build that built
http://hg.mozilla.org//mozilla-central/index.cgi/rev/38a48d485876

There were a number of red builds during that time period, but that shouldn't have prevented the talos builds from testing the green ones.

See http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox&hours=24&maxdate=1222794000 for the waterfall.
Oops, forgot to clarify why I'm filing this: it seems like it would be good to understand why this happened so that we can prevent it from happening in the future.
My theory is that there is something going on here with multiple builders reporting to the same waterfall column - so talos ends up seeing just the busted build as the most recent and misses out on the working green builds.
Assignee: nobody → anodelman
Priority: -- → P2
Testing a possible solution on talos staging.
Attached patch ftppoller.py (obsolete) (deleted) — Splinter Review
New code to scrape ftp directory listings of builds instead of the waterfall for new browsers to test. Works for both dated directory/nightly builds. Should solve our skipping-builds issues - though, talos will still collapse the testing queue if it is overwhelmed with test requests. This will give us a better idea if we need to add more talos slaves to the system to keep up with demand. I think that this code might also be adaptable to working with try talos, which has its own problem with skipping builds. Patches to follow for integration into the master.cfg.
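As background, here is a minimal sketch (not the attached patch) of the scraping idea: read an Apache-style HTTP directory listing and treat any dated build directory newer than the last one seen as a new build to test. The regex and the epoch-seconds directory names are assumptions based on the listing format discussed in this bug:

import re
from urllib2 import urlopen

def find_build_dirs(listing_url):
    # Dated tinderbox directories show up in the listing as links like
    # <a href="1222794000/">1222794000/</a> (seconds since the epoch).
    html = urlopen(listing_url).read()
    return sorted(int(d) for d in re.findall(r'href="(\d{10})/"', html))

# poll loop: anything newer than the last directory we tested is new work
# new_builds = [d for d in find_build_dirs(url) if d > last_seen]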
Attachment #341530 - Flags: review?(bhearsum)
Added support for new ftp scraping in perfmaster and perf-staging. For moz-central we only have to scrape a single page, as both tinderbox builds and nightlies end up in the same place. For 1.8/1.9 we scrape both tinderbox build directories and nightly build directories to ensure that we are testing everything available. Post review, this should only be pushed live during a downtime. Even though it is tested on stage, we would need confirmation on the live setup that we are still correctly pulling builds and reporting to the waterfall.
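For illustration, the master.cfg hookup for one of those directory sets would look roughly like this, using the poller constructor signature visible in the attached patch (branch, tree, pollInterval, ftpURLs, searchString); the argument values and the c['change_source'] list name are assumptions, not the exact checked-in config:

from ftppoller import FtpPoller

WIN32_MOZ2_CENTRAL_BUILDDIRS = [
    "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/",
]

c['change_source'].append(FtpPoller(
    branch="mozilla-central",             # illustrative branch name
    pollInterval=60,                      # seconds between scrapes
    ftpURLs=WIN32_MOZ2_CENTRAL_BUILDDIRS,
    searchString="win32.zip",             # nightly filename fragment to watch for
))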
Attachment #341541 - Flags: review?(bhearsum)
Oh - I also took the opportunity to scrub out all the actionmonkey cruft in the various talos buildbot files.
Attachment #341530 - Flags: review?(bhearsum) → review-
Comment on attachment 341530 [details] [diff] [review]
ftppoller.py

>Index: ftppoller.py
>===================================================================
>+class ftpParser:

nit: FtpPoller would be more consistent with our existing classes

>+    def __init__(self, url, searchString):
>+        self.dirs = []
>+        self.dates = []
>+        f = urlopen(url)

See below for comments on this urlopen()

>+        s = f.read()
>+        f.close()
>+        lines = s.split('\n')
>+        #for parsing the page of dated directories
>+        for line in lines:
>+            if line == "": continue
>+            match = re.match(self.findBuildDirs, line)
>+            if match:
>+                self.dirs.append(match.group(1))
>+        #for parsing the nightly builds page
>+        findLastDate = re.compile('^.*' + searchString + '</a></td><td align="right">(\d\d-[a-zA-Z]{3}-\d\d\d\d \d\d:\d\d).*$')

This is pretty fragile. If you're polling ftp.m.o I think it'd be better to urlopen(ftp://ftp.m.o/....). Something like this:
http://pastebin.mozilla.org/546648

I fear us breaking if apache is changed on ftp.m.o, or some other factor causes the HTML output to change.

>+class ftpPoller(base.ChangeSource):

nit: FtpPoller instead of ftpPoller

>+    def __init__(self, branch="", tree="Firefox", pollInterval=30, ftpURLs=[], searchString=""):
>+        """
>+        @type ftpURLs: list of strings
>+        @param ftpURLs: The ftp directorys to monitor
>+

nit: s/directorys/directories/

>+    def poll(self):
>+        if self.working > 0:
>+            log.msg("Not polling Tinderbox because last poll is still working (%s)" % (str(self.working)))
>+        else:
>+            for url in self.ftpURLs:
>+                self.working = self.working + 1
>+                d = self._get_changes(url)
>+                d.addCallback(self._process_changes)
>+                d.addBoth(self._finished)
>+        return
>+
>+    def _finished(self, res):
>+        self.working = self.working - 1
>+
>+    def _get_changes(self, url):
>+        log.msg("Polling ftp dir %s" % url)
>+        return defer.maybeDeferred(urlopen, url)
>+
>+    def _process_changes(self, query):
>+        try:
>+            url = query.geturl()
>+            #log.msg("processing changes for dir: %s" % url)
>+            parser = ftpParser(url, self.searchString)

I don't quite understand this and the following sections. If you're doing the urlopen() here why not pass the results into ftpParser rather than passing the URL and doing it again there? You could do it the other way too, where you only pass the URL in and explicitly do not call urlopen() here. You've already got the deferreds setup here...so it's probably easier to do from here. The only change would be passing 'query' to ftpParser instead of the URL - afaict. Does that make sense?

>+        for dir in dirList:
>+            buildDate = int(dir)
>+            if self.lastChanges[url] >= buildDate:
>+                # change too old
>+                continue
>+            self.lastChanges[url] = buildDate
>+            c = changes.Change(who = url,
>+                               comments = "success",
>+                               files = ['TODO: filename goes here'],
>+                               branch = self.branch,
>+                               when = buildDate,
>+                               links = url + dir + '/')
>+            self.parent.addChange(c)
>+            log.msg("found a tinderbox build to test (%s%s)" % (url, dir))
>+
>+        #special case for nightlies
>+        for buildDate in dateList:
>+            if self.lastChanges[url] >= buildDate:
>+                # change too old
>+                continue
>+            self.lastChanges[url] = buildDate
>+            c = changes.Change(who = url,
>+                               comments = "success",
>+                               files = ['TODO: filename goes here'],
>+                               branch = self.branch,
>+                               when = buildDate,
>+                               links = url)
>+            self.parent.addChange(c)
>+            log.msg("found a nightly build to test (%s%s)" % (url, self.searchString))

I'm very excited that URLs are going to be in the Change object now.
This makes it possible for us to use the 'Rebuild' feature and re-test builds without waiting on the build machine. Overall this looks good. r-'ing specifically because of the urlopen() comments, and ftp vs http URLs. Please make sure you only call urlopen() in one place, and that it's wrapped in a Deferred object to prevent blocking. I'd also really like to switch to ftp://... URLs so we can get out of this business of parsing raw HTML.
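A tiny sketch of the change being requested here - fetch the listing once, behind a Deferred, and hand the page contents (rather than the URL) to the parser. Method and class names are illustrative, and note that a later comment in this bug points out that maybeDeferred(urlopen, ...) still blocks while the fetch runs:

from twisted.internet import defer
from urllib2 import urlopen

# methods on the poller class; FtpParser stands in for the patch's parser,
# changed to accept page contents instead of a URL
def _get_changes(self, url):
    # the only urlopen() in the poller, wrapped in a Deferred
    return defer.maybeDeferred(lambda: urlopen(url).read())

def _process_changes(self, content):
    # downstream code gets the listing text and never fetches again
    return FtpParser(content, self.searchString)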
Comment on attachment 341541 [details] [diff] [review] updates to perfmaster and perf-staging to scrape ftp >Index: perfmaster2/perfrunner.py >=================================================================== >+ #a link is always provided by the poller >+ self.url = self.changes[-1].links Hawt. >+WIN32_MOZ2_CENTRAL_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/"] >+LINUX_MOZ2_CENTRAL_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/"] >+MAC_MOZ2_CENTRAL_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx/"] >+#1.9 >+WIN32_TRUNK_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/FX-WIN32-TBOX-mozilla1.9.0/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.9.0/"] >+LINUX_TRUNK_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/fx-linux-tbox-mozilla1.9.0/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.9.0/"] >+MAC_TRUNK_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/bm-xserve08-mozilla1.9.0/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.9.0/"] >+#1.8 >+WIN32_BRANCH_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/production-pacifica-vm02-mozilla1.8/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.8/"] >+LINUX_BRANCH_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/production-prometheus-vm02-mozilla1.8/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.8/"] >+MAC_BRANCH_BUILDDIRS=["http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/bm-xserve05-mozilla1.8/", "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.8/"] This will probably need to be switched to ftp:// URIs. >-import tinderboxpoller >-reload(tinderboxpoller) >-from tinderboxpoller import TinderboxPoller >+import ftppoller >+reload(ftppoller) >+from ftppoller import ftpPoller > Is there any chance I can convince you to put this class somewhere inside of buildbotcustom?
I'm part way there for a fully ftp solution - except that the last-modified times reported for builds from ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-central/ are 7 hours later than http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-central/

The data that I'm getting from urlopen(ftp) is also a lot leaner, with dates of the format "Oct 03 09:46" - no year. I could assume a year, but then we hit weirdness around Jan 1st. I'm not very happy doing all this date munging; it always strikes me as a recipe for disaster.

I'm starting to be of the opinion that the http:// route is simply a lot safer, as it offers 1) dates consistent with other dates in our system and 2) more information associated with a given build and fewer assumptions.
Here's an example of the data that I get for a urlopen (just so that you can get an idea what I'm working with):

ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-central/

-rwxr-xr-x 1 ftp ftp 9549103 Oct 03 09:46 firefox-3.1b1pre.en-US.linux-i686.complete.mar\r\n-rwxr-xr-x 1 ftp ftp 9463372 Oct 03 09:46 firefox-3.1b1pre.en-US.linux-i686.tar.bz2\r\n-rwxr-xr-x 1 ftp ftp 10757675 Oct 03 09:59 firefox-3.1b1pre.en-US.linux-x86_64.complete.mar\r\n-rwxr-xr-x 1 ftp ftp 10690758 Oct 03 09:59 firefox-3.1b1pre.en-US.linux-x86_64.tar.bz2\r\n-rwxr-xr-x 1 ftp ftp 16840986 Oct 03 10:37 firefox-3.1b1pre.en-US.mac.complete.mar\r\n-rwxr-xr-x 1 ftp ftp 17202267 Oct 03 10:37 firefox-3.1b1pre.en-US.mac.dmg\r\n-rwxr-xr-x 1 ftp ftp 10065222 Oct 03 12:00 firefox-3.1b1pre.en-US.win32.complete.mar\r\n-rwxr-xr-x 1 ftp ftp 7547965 Oct 03 12:00 firefox-3.1b1pre.en-US.win32.installer.exe\r\n-rwxr-xr-x 1 ftp ftp 10767184 Oct 03 12:00 firefox-3.1b1pre.en-US.win32.zip\r\n

Which is also different from the 'Show page source' for that ftp page, as the browser seems to do some of its own formatting when interacting with ftp.
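To make the objection concrete, the year-guessing that raw LIST output would force on us looks something like this (a sketch of the complexity being argued against, not proposed code):

import time

def guess_timestamp(listing_date, now=None):
    # listing_date is e.g. "Oct 03 09:46" - no year anywhere in the LIST output
    now = now or time.localtime()
    parsed = time.strptime(listing_date, "%b %d %H:%M")
    year = now.tm_year
    # a month "from the future" (e.g. a Dec entry seen in early Jan)
    # must really belong to the previous year
    if parsed.tm_mon > now.tm_mon:
        year -= 1
    return time.mktime((year, parsed.tm_mon, parsed.tm_mday,
                        parsed.tm_hour, parsed.tm_min, 0, 0, 0, -1))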
I'm becoming of the very strong opinion that: "I fear us breaking if apache is changed on ftp.m.o, or some other factor causes the HTML output to change. " is not something that we should concern ourselves with too deeply. How often does apache get updated? How often does apache change the directory listing structure? I'm thinking that this risk is pretty low, and we'll notice immediately because we'll stop testing builds across the board.
How about doing the http approach for a short term solution and maybe write a serverside script to help on this. You could write a script to run on the FTP server that you query for the latest build of the type you are looking for and it returns as an XML page (or something else that can easily be parsed) the timestamp and URL path to the file.
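Something along these lines, as a very rough sketch of the proposed server-side helper (the path, the query parameter and the XML shape are all hypothetical):

#!/usr/bin/env python
# hypothetical CGI living next to the build directories on the ftp server
import cgi, glob, os

BASE = "/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32"

form = cgi.FieldStorage()
pattern = form.getfirst("match", "*.zip")   # e.g. match=*.win32.zip

newest = max(glob.glob(os.path.join(BASE, "*", pattern)), key=os.path.getmtime)

print "Content-Type: text/xml"
print
print "<latest-build>"
print "  <path>%s</path>" % newest
print "  <timestamp>%d</timestamp>" % os.path.getmtime(newest)
print "</latest-build>"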
Alice, why do we have to do date parsing like that? We currently extract the date from the tinderbox builds directory name, right? Can't we continue doing that?
I think the issue is getting nightlies for Firefox 3.0, since they don't go into firefox/tinderbox-builds/. Might be better to look in firefox/nightly/YYYY/MM/YYYY-MM-DD-HH-mozilla1.9.0 than rely on the timestamps in firefox/nightly/latest-mozilla1.9.0.
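A sketch of that suggestion - take the build time from the dated nightly directory name rather than from file timestamps; the directory-name regex is an assumption about the YYYY-MM-DD-HH-mozilla1.9.0 layout mentioned above:

import re, time

def date_from_nightly_dir(dirname):
    # e.g. "2008-10-03-02-mozilla1.9.0" -> seconds since the epoch
    m = re.match(r'^(\d{4})-(\d{2})-(\d{2})-(\d{2})-', dirname)
    if not m:
        return None
    year, month, day, hour = map(int, m.groups())
    return time.mktime((year, month, day, hour, 0, 0, 0, 0, -1))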
Attached patch [Checked in]ftppoller.py (deleted) — Splinter Review
fixes the following:
- only queries the url through deferred, no more blocking
- now provides the full path to the browser so that the perfrunner.py doesn't have to (there is a blocking urlopen hiding in there that we can excise)
- able to search through both nightly directory structure and hourly directory structure

I know, you won't like that it isn't through ftp. So, I've improved the regexes to make them less fragile. As an aside, the ftp format that would be scrapable isn't any easier to read because it is a) not delimited in any way (those are all individual spaces in there, not tabs) and b) subject to the date parsing issues mentioned previously.
Attachment #341530 - Attachment is obsolete: true
Attachment #342158 - Flags: review?(bhearsum)
Pretty much the same as before, but I've pulled out some more crufty code from perfrunner.py to get it down to just the stuff that we run/use. Also, this gets the main master.cfg for talos down to 940 lines. Woo!
Attachment #342167 - Flags: review?(bhearsum)
Depends on: 458243
Attachment #341541 - Attachment is obsolete: true
Attachment #341541 - Flags: review?(bhearsum)
Comment on attachment 342158 [details] [diff] [review]
[Checked in]ftppoller.py

This looks pretty good. I've got a nit and a question though.

I understand the recursive polling of directories, but I don't understand the 'if forceDate > 0' part. AFAICT it will always be nonzero for subdirectories - is this there to make sure the 'date' talos gets is the dated directory date, rather than the last modified date of the file? I guess I'm confused because _process_changes() appears to handle builds in the top-level and in subdirectories.

nit: I think you need 'files = [buildname]', I was looking at Talos stage and saw this: http://qm-buildbot01.mozilla.org:2008/changes/4701
Comment on attachment 342167 [details] [diff] [review]
[Checked in]updates to perfmaster and perf-staging to scrape ftp (take 2)

>Index: perfmaster2/master.cfg
>===================================================================
>-import tinderboxpoller
>-reload(tinderboxpoller)
>-from tinderboxpoller import TinderboxPoller
>+import ftppoller
>+reload(ftppoller)
>+from ftppoller import FtpPoller
>

Is there any chance I can convince you to land this in buildbotcustom instead? The only change on the Talos Buildbot master would be requiring a checkout of buildbotcustom and an adjusted PYTHONPATH.

This looks fine otherwise.
Attachment #342167 - Flags: review?(bhearsum) → review+
For comment #17: process_changes works both with files located in a single directory (like the nightly directories) and with files located in dated directories, which is why we require the forceDate. For the dated directories we want to force the date to be that of the directory; for the nightly structure we want to use the last modified time of the file.
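In other words (a sketch of the logic being described, with illustrative names):

def change_date(forceDate, file_mtime):
    # forceDate > 0 means we are walking a dated tinderbox directory,
    # so trust the directory's epoch name (the build start time)
    if forceDate > 0:
        return forceDate
    # flat nightly layout: fall back to the file's last-modified time
    return file_mtime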
Comment on attachment 342158 [details] [diff] [review] [Checked in]ftppoller.py Ahhhhhhh right. Thanks for the reminder Alice.
Attachment #342158 - Flags: review?(bhearsum) → review+
FYI, calling blocking code (urlopen) in deferreds still leaves them blocking. See http://twistedmatrix.com/trac/browser/tags/releases/twisted-8.1.0/twisted/internet/defer.py#L84 - maybeDeferred immediately calls urlopen, and returns a succeed(). The network traffic doesn't use timeouts either; adding them might be a good idea.

I'm somewhat surprised that I lost the timeouts in my version of hgpoller in the recent refactorings. http://hg.mozilla.org/users/axel_mozilla.com/tooling/file/a850a69c5b7a/buildbotcustom/buildbotcustom/changes/hgpoller.py is my current code for the hgpoller, which I apparently have to hack on again. I introduced the Pluggable class to deal with unreported network errors, but those might have been the lacking timeouts. I'll add the timeouts back, but probably leave the Pluggable in. There is one call into getPage(url, timeout = self.timeout) in there, though :-/.

getPage is kinda noisy by default; a

# make getPage silent, comment out for debugging purposes
from twisted.web.client import HTTPClientFactory
HTTPClientFactory.noisy = False

in master.cfg fixes that if wanted. I put that in master.cfg, so that it's easy to switch if you care about the debugging info.
Hey Alice, I hate to do this, but can you call getPage instead of urlopen? The last thing I want is for the Talos master to hang because of this. I think the only change here is s/maybeDeferred/getPage/.
Actually, there's a further change: getPage returns the content, not a file-like object. You could probably already specify a timeout, too; if you don't, you get locked up on network loss. Not sure what a good default would be - it might depend on the number of loads you need, too. Or you just start with 30 seconds or so, and tweak later.
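So the fetch side of the poller would end up looking roughly like this (a sketch, not a reviewed patch; the 30-second timeout is just the starting point suggested above):

from twisted.web.client import getPage

# methods on the poller class
def _get_changes(self, url):
    # getPage returns a Deferred that fires with the page body (a string),
    # and gives up after the timeout instead of wedging on network loss
    return getPage(url, timeout=30)

def _process_changes(self, content):
    # 'content' is already the listing text - no file-like object to read()
    parser = FtpParser(content, self.searchString)
    return parser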
Considering that the downtime for this change is tomorrow morning, I don't think that there is enough time to code/test to push out any changes to the patch. I think that it would be reasonable to file a separate bug with a high priority to go along with the landing of the initial patches. That, or we'll end up delayed on landing this again and it is something that we really need before the next code freeze.
(In reply to comment #24)
> Considering that the downtime for this change is tomorrow morning, I don't
> think that there is enough time to code/test to push out any changes to the
> patch.
>
> I think that it would be reasonable to file a separate bug with a high priority
> to go along with the landing of the initial patches. That, or we'll end up
> delayed on landing this again and it is something that we really need before
> the next code freeze.

wfm as long as we make sure it stays high priority
Comment on attachment 342167 [details] [diff] [review]
[Checked in]updates to perfmaster and perf-staging to scrape ftp (take 2)

Checking in perf-staging/master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perf-staging/master.cfg,v  <--  master.cfg
new revision: 1.20; previous revision: 1.19
done
Checking in perfmaster2/perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v  <--  perfrunner.py
new revision: 1.26; previous revision: 1.25
done
Checking in perfmaster2/master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/master.cfg,v  <--  master.cfg
new revision: 1.80; previous revision: 1.79
done
RCS file: /cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/ftppoller.py,v
done
Attachment #342167 - Attachment description: updates to perfmaster and perf-staging to scrape ftp (take 2) → [Checked in]updates to perfmaster and perf-staging to scrape ftp (take 2)
Attachment #342158 - Attachment description: ftppoller.py → [Checked in]ftppoller.py
Comment on attachment 342158 [details] [diff] [review]
[Checked in]ftppoller.py

Checking in perfmaster2/ftppoller.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/ftppoller.py,v  <--  ftppoller.py
initial revision: 1.1
done
I thought this over some more, and scraping the ftp directories is still going to lead to situations where we skip builds. I've seen situations on the waterfall where we have multiple builders starting one after the other but, for whatever reason, the last to start can occasionally be the first to finish. Since we date builds based upon start time and talos determines if it has seen a build before based upon that date, we would consume that one build and ignore the other builds as they finish.

The ftppoller is a good half way step, but we need to go to a system where a builder triggers a sendchange upon completion to ensure that talos gets full notification.
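For reference, the sendchange idea would look something like the sketch below - a final step in each build factory that tells the Talos master about the finished build. The master host/port, branch, and the use of the sendchange command line here are assumptions about how this could be wired up, not a worked-out design:

from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

factory = BuildFactory()
# ... packaging/upload steps would come first ...
factory.addStep(ShellCommand(
    name="notify talos",
    description=["sendchange", "to", "talos"],
    command=["buildbot", "sendchange",
             "--master", "talos-master.example.org:9989",   # hypothetical master
             "--branch", "mozilla-central",
             "--comments", "build complete",
             "packaged-build-url-goes-here"],                # file/URL for talos to fetch
))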
ftppoller pushed to production and working correctly.
Not working on this in next couple of weeks.
Assignee: anodelman → nobody
Component: Release Engineering: Talos → Release Engineering: Future
Updating summary based upon current bug status.
Summary: windows talos machines stopped testing anything for quite a few hours → production talos skips builds based upon incorrect ftp scraping
Component: Release Engineering: Future → Release Engineering: Talos
Priority: P2 → P3
Assignee: nobody → anodelman
This may be fixed by #468731
Depends on: 468731
Assignee: anodelman → catlee
Just to be clear here - we will still be skipping builds if there are not enough free talos machines to handle all the incoming builds.
No longer depends on: 458243
We're not using the ftp poller any more, except for trunk (Firefox 3.0) builds.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering