Closed
Bug 1129898
Opened 10 years ago
Closed 8 years ago
Killing in-progress builds can lead to broken objdirs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: ted, Unassigned)
philor pointed out in bug 1068209 comment 48 that he killed some in-progress builds and then the next builds on those builders failed due to truncated object files in the objdir.
I'm not sure how we kill an in-progress job, but maybe we could either do that more leniently to allow the compiler to exit gracefully, or do more cleanup after it (clobber the objdir?) so we don't hit this situation.
We don't know if this is the only root cause of bug 1068209, but it's certainly a problem.
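For illustration, the two options above (kill more leniently, or clean up afterwards) could be combined roughly as follows. This is only a sketch, assuming a POSIX host and a build that runs as a child process we hold a handle to. `stop_build`, `grace_seconds` and the `objdir` argument are hypothetical names, not anything from buildbot or self-serve, and whether a SIGTERM plus a grace period is actually enough to avoid truncated object files is an assumption.

```python
import shutil
import subprocess
from pathlib import Path


def stop_build(proc: subprocess.Popen, objdir: Path, grace_seconds: float = 10.0) -> bool:
    """Stop a running build; return True if it exited within the grace period.

    Hypothetical helper: send SIGTERM first so the compiler can clean up
    after itself, and only fall back to SIGKILL (and a clobber) on timeout.
    Note that this signals the direct child only, not its whole process group.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        graceful = True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: partially written object files are now likely
        proc.wait()
        graceful = False

    if not graceful:
        # The objdir can no longer be trusted, so remove it ("clobber").
        shutil.rmtree(objdir, ignore_errors=True)
    return graceful
```

A stricter variant would clobber after every cancelled build, graceful or not, at the cost of a slower next build on that builder.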
Comment 1•10 years ago
sounds like bug 658934, which we fixed a while ago. killing builds via self-serve *should* set clobbers for them.
Comment 2•10 years ago
Clobberer's been rewritten since then, hasn't it?
Comment 3•10 years ago
(In reply to Chris AtLee [:catlee] from comment #1)
> sounds like bug 658934, which we fixed a while ago. killing builds via
> self-serve *should* set clobbers for them.
(In reply to Phil Ringnalda (:philor) from comment #2)
> Clobberer's been rewritten since then, hasn't it?
Morgan, I don't suppose you could take a look at this? Looks like the clobberer service rewrite might have regressed bug 658934. Thanks :-)
Blocks: 944005
Flags: needinfo?(winter2718)
Comment 4•10 years ago
(In reply to Ed Morley [:edmorley] from comment #3)
> (In reply to Chris AtLee [:catlee] from comment #1)
> > sounds like bug 658934, which we fixed a while ago. killing builds via
> > self-serve *should* set clobbers for them.
>
> (In reply to Phil Ringnalda (:philor) from comment #2)
> > Clobberer's been rewritten since then, hasn't it?
>
> Morgan, I don't suppose you could take a look at this? Looks like the
> clobberer service rewrite might have regressed bug 658934. Thanks :-)
Yes, so, I just added "bulk" clobbers, which are deployed when no build is happening at all. With that in mind, I'm curious whether the clobbers are no longer happening, or whether they're just not showing up in the logs?
Flags: needinfo?(winter2718)
Comment 5•10 years ago
FWIW:
https://github.com/mozilla/build-relengapi-clobberer/blob/1222dbcf4925632e68012e2318882ec47470dfbf/relengapi/blueprints/clobberer/__init__.py#L147
and https://github.com/mozilla/build-relengapi-clobberer/blob/1222dbcf4925632e68012e2318882ec47470dfbf/relengapi/blueprints/clobberer/__init__.py#L102
Coupled with self serve url:
http://mxr.mozilla.org/build/source/puppet/manifests/moco-config.pp#169
and the still existing code at:
http://mxr.mozilla.org/build/source/buildapi/buildapi/scripts/selfserve-agent.py#130
Makes me think this should be working.
BUT a check on a single relengweb node:
[root@web1.releng.webapp.scl3 ~]# grep "lastclobber" /var/log/relengapi/relengapi.log* | wc -l
4608
[root@web1.releng.webapp.scl3 ~]# grep "by-builder" /var/log/relengapi/relengapi.log* | wc -l
0
[root@web1.releng.webapp.scl3 ~]# grep "/clobber " /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
0
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/clobber " | wc -l
8
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/by-builder " | wc -l
46
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/by-builder/" | wc -l
17
[root@web1.releng.webapp.scl3 ~]# grep "/by-builder " /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
6
[root@web1.releng.webapp.scl3 ~]# grep "/by-builder/" /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
0
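The same counts can be gathered in one pass over both the current and the rotated (gzipped) logs. A minimal sketch, assuming the log files are readable as text; `count_lines_containing` is a hypothetical helper written for this comment, not part of relengapi, and it does a plain substring match, which is what the `grep` calls above do for these patterns.

```python
import gzip
from pathlib import Path
from typing import Iterable


def count_lines_containing(paths: Iterable[Path], needle: str) -> int:
    """Count lines containing `needle`, like `grep needle files | wc -l`.

    Files ending in .gz are decompressed on the fly; everything else is
    read as plain text. Undecodable bytes are replaced rather than fatal.
    """
    total = 0
    for path in paths:
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as handle:
            total += sum(1 for line in handle if needle in line)
    return total
```

Called with the files under /var/log/httpd/api.pub.build.mozilla.org/ and each of "/clobber ", "/by-builder " and "/by-builder/" in turn, it would give one combined number per endpoint instead of separate counts for live and rotated logs.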
Comment 7•10 years ago
I think I see the problem now. We have a new endpoint for requesting clobbers, so if the old one was being used clobbers are no longer being requested properly. Can we assign this a priority? I'm happy to work on this, but I need to understand how it fits in the scheme of other priorities (i.e. "deliverables")
Flags: needinfo?(winter2718)
Comment 8•10 years ago
Does bug 1139978 fix this too I wonder?
(In reply to Morgan Phillips [:mrrrgn] from comment #7)
> I think I see the problem now. We have a new endpoint for requesting
> clobbers, so if the old one was being used clobbers are no longer being
> requested properly. Can we assign this a priority? I'm happy to work on
> this, but I need to understand how it fits in the scheme of other priorities
> (i.e. "deliverables")
It causes confusing failures for sheriffs and other devs when builds are cancelled on non-try, so it would be good to get it fixed in the next few weeks, if possible? :-)
Comment 9•10 years ago
If only confusion were the only impact.
First there's the confusing failures, then there's the tree closure to stare confusedly, then there's clobbering every slave because that's the only button we have to get just a few slaves clobbered, then there's the even longer continuation of the closure as we wait for slow clobbered builds to complete to be sure both that clobbering fixed the problem and that no other problem crept in while we were confused and partly broken. Naively killing a few builds to "save resources" can easily turn into a four hour tree closure.
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Updated•7 years ago
Component: General Automation → General