Closed Bug 590526 Opened 14 years ago Closed 14 years ago

Don't start two runs of the same job at the same time, since Tinderbox hides one of them

Categories

(Tree Management Graveyard :: TBPL, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 630538

People

(Reporter: bhearsum, Unassigned)

References

Details

But, in at least one case I can see it in Buildbot.

http://test-master01.build.mozilla.org:8012/builders/Rev3%20MacOSX%20Snow%20Leopard%2010.6.2%20tryserver%20opt%20test%20mochitests-1%2F5/builds/592 is a run of opt mochitest 1/5 on ebac4228ed4a. It doesn't show up on TBPL: http://grab.by/64tW

And I can't seem to find it on plain old Tinderbox either.
Looks like this is caused by start time collisions. The following all started opt mochitest 1/5 on 64-bit mac at Wed Aug 25 01:32:18 2010:
bb55562ff200
ae49d384bc7d
a5117807db6f
b1c4b5f417dd
eb2a4ec3eab7
ebac4228ed4a
4a7d0ce9f897
e53fab4ba86c
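
A minimal sketch of how such start time collisions could be spotted, assuming build records are available as (job, start time, revision) tuples. The tuple layout, the job label, and the function name are hypothetical and are not taken from Buildbot or Tinderbox; the revisions are the first three from the list above.

```python
from collections import defaultdict


def find_start_time_collisions(builds):
    """Group builds by (job, start time) and keep groups with more than one revision."""
    groups = defaultdict(list)
    for job, start_time, revision in builds:
        groups[(job, start_time)].append(revision)
    # Any key that maps to several revisions is a collision.
    return {key: revs for key, revs in groups.items() if len(revs) > 1}


# Hypothetical records built from the revisions and start time reported above.
START = "Wed Aug 25 01:32:18 2010"
builds = [
    ("opt mochitest 1/5", START, "bb55562ff200"),
    ("opt mochitest 1/5", START, "ae49d384bc7d"),
    ("opt mochitest 1/5", START, "a5117807db6f"),
]

collisions = find_start_time_collisions(builds)
print(collisions)
```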


Based on the sheer improbability of all of these builds having sendchanges sent at the same instant I'm going to guess that either the test master was hung shortly before this time, or had issues with the connection to the db, or something like that.
Component: Release Engineering → Tinderboxpushlog
Product: mozilla.org → Webtools
QA Contact: release → tinderboxpushlog
This happens all the time to me.  Is there some way forward towards fixing this issue?
By not depending on Tinderbox's keying on machine+starttime, I guess.
I know nothing about this system, but why can't we use the same mechanism that gets the logs into http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds ?  That seems to work consistently.
We just use what tinderbox provides us, for now that is. Dropping our dependency is long underway but progress is very, very slow. :-(
(In reply to comment #7)
> We just use what tinderbox provides us, for now that is. Dropping our
> dependency is long underway but progress is very, very slow. :-(

Yes, and we've come a long way thus far!  But the logs are out there, somewhere; is there a technical reason right now that we can't get the logs from [1], or use the same mechanism for log retrieval as [1]?

I'm not comfortable with the idea that we're stuck with this bug until a "very, very slow" process finishes.

[1] http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds
This has (very little) to do with log retrieval.
The problem at hand is that we ask tinderbox, “what jobs were run in the past 12 hours” and it just doesn’t return all the jobs.

But you are right in the sense that log retrieval and parsing is one small step to using builddb for the “tell me what happened” part.
This is becoming problematic.  Every tryserver push I've made in the past few weeks (with the exception of single-job pushes) has been missing at least one job.  I'd guess that things are getting worse because try is getting busier.
There is no shortcut for tbpl: tinderbox *will not* tell us about two builds with the same start time. The only thing we can do is completely rewrite tbpl to use a data source, not currently available to us, which will tell us about them while also giving us everything else we need. Any other solution would have to come from buildbot not starting two builds of the same job at the same time, which, from my vague understanding of how it works, you aren't going to get.
Component: Tinderboxpushlog → Release Engineering
Priority: P1 → --
Product: Webtools → mozilla.org
QA Contact: tinderboxpushlog → release
Summary: lots of test runs are missing from TBPL on try → Don't start two runs of the same job at the same time, since Tinderbox hides one of them
The fix to tinderbox would be invasive enough that it makes more sense to invest that effort in making TBPL get this log data from another source.
Component: Release Engineering → Tinderboxpushlog
Product: mozilla.org → Webtools
QA Contact: release → tinderboxpushlog
No longer blocks: 630538
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
The workaround for this problem is to load up the self-serve URL to find out the status (green/orange/red/pending/completed):
https://build.mozilla.org/buildapi/self-serve/try/rev/eb6877e334c8

If you needed to actually see the logs you would have to load up tinderbox and search for your changeset across the different columns:
http://tinderbox.mozilla.org/admintree.cgi?tree=Try

There is a greasemonkey script to help with looking around:
http://www.google.ca/search?q=greasemonkey+tinderbox&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:unofficial&client=firefox-a
Um, no. If your workaround includes the word "tinderbox" then you're misunderstanding the problem.

Buildbot sends email to Tinderbox saying "a run of Linux opt mochitests-5/5 on 281be3877d80 which started at 1306774890 was orange, here's the log" and it stores that as linux-opt-mochitests-5/5-1306774890. Buildbot then sends email to Tinderbox saying "a run of Linux opt mochitests-5/5 on 777a46249d2a which started at 1306774890 was green, here's the log" and Tinderbox overwrites linux-opt-mochitests-5/5-1306774890, completely erasing any sign of the previous orange. Tinderbox believes that there's a 1:1 correspondence between job names and physical machines, so it doesn't believe it's possible for two jobs to start at the same time.
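
The overwrite described above can be modelled in a few lines. This is only a sketch of the behaviour as described in this comment, not Tinderbox's actual code; `tinderbox_key` and `store_reports` are hypothetical names, and a plain dictionary stands in for whatever storage Tinderbox really uses.

```python
def tinderbox_key(job_name: str, start_time: int) -> str:
    # Assumed key shape: job name plus start time, as in the comment above.
    return f"{job_name}-{start_time}"


def store_reports(reports):
    """Store each report under its key; a later report with the same key wins."""
    stored = {}
    for job_name, revision, start_time, result in reports:
        stored[tinderbox_key(job_name, start_time)] = (revision, result)
    return stored


reports = [
    ("linux-opt-mochitests-5/5", "281be3877d80", 1306774890, "orange"),
    ("linux-opt-mochitests-5/5", "777a46249d2a", 1306774890, "green"),
]

stored = store_reports(reports)
print(stored)
# {'linux-opt-mochitests-5/5-1306774890': ('777a46249d2a', 'green')}
```

Two reports go in, but only one entry survives, and the orange run on 281be3877d80 is gone.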

The actual, miserable, workaround for try is to not filter try email to the trash and, for anything you got email about saying it wasn't green, to download the full logs from ftp, dig through them with no showlog.cgi to help you find the failure, and search Bugzilla with no help from tbpl to see if it's a known failure.
So could buildbot be convinced to instead send email to Tinderbox saying "a run of Linux opt mochitests-5/5 on 777a46249d2a which started at 1306774890." + random(1000000) + " was green, here's the log"? What's a few milliseconds between friends?
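A rough sketch of that suggestion, assuming the reported start time could carry a random six-digit fractional part. The function name is hypothetical, and whether Tinderbox would accept a non-integer start time is not established anywhere in this bug. The suffix only makes collisions unlikely; it does not rule them out.

```python
import random


def disambiguated_start_time(start_time: int, rng: random.Random) -> str:
    """Append a random six-digit fraction, mirroring '." + random(1000000)' above."""
    return f"{start_time}.{rng.randrange(1_000_000):06d}"


rng = random.Random()
first = disambiguated_start_time(1306774890, rng)
second = disambiguated_start_time(1306774890, rng)

# Both values still read as the same whole second, but the keys built
# from them will almost always differ.
print(first, second)
```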
Product: Webtools → Tree Management
Version: other → unspecified
Product: Tree Management → Tree Management Graveyard