Closed Bug 1129898 Opened 10 years ago Closed 8 years ago

Killing in-progress builds can lead to broken objdirs

Categories

(Release Engineering :: General, defect)

All
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ted, Unassigned)

philor pointed out in bug 1068209 comment 48 that he killed some in-progress builds and then the next builds on those builders failed due to truncated object files in the objdir. I'm not sure how we kill an in-progress job, but maybe we could either do that more leniently to allow the compiler to exit gracefully, or do more cleanup after it (clobber the objdir?) so we don't hit this situation. We don't know if this is the only root cause of bug 1068209, but it's certainly a problem.
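Both ideas in the report above (kill more leniently, and clobber when that fails) can be combined. The following is a minimal sketch, not how buildbot or self-serve actually stop jobs: `stop_build`, its parameters, and the 30-second grace period are all hypothetical. It assumes the build was launched in its own session, so that one signal reaches make and every compiler it spawned, and it assumes a Unix host.

```python
import os
import shutil
import signal
import subprocess


def stop_build(proc, objdir, grace_seconds=30.0):
    """Stop a running build. Return True if the objdir was clobbered.

    Hypothetical helper: assumes `proc` is a subprocess.Popen started with
    start_new_session=True, so its process group holds the whole build.
    """
    if proc.poll() is not None:
        return False  # build already finished; nothing to kill

    pgid = os.getpgid(proc.pid)
    # Ask politely first so compilers can remove their partial outputs.
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return False  # exited within the grace period
    except subprocess.TimeoutExpired:
        # The build ignored SIGTERM: force it, then distrust the objdir.
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
        shutil.rmtree(objdir, ignore_errors=True)
        return True
```

This clobbers only after a forced kill. A more conservative policy would clobber after every cancelled build, trading a slower next build for certainty.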
sounds like bug 658934, which we fixed a while ago. killing builds via self-serve *should* set clobbers for them.
Clobberer's been rewritten since then, hasn't it?
Blocks: 1139948
(In reply to Chris AtLee [:catlee] from comment #1)
> sounds like bug 658934, which we fixed a while ago. killing builds via
> self-serve *should* set clobbers for them.

(In reply to Phil Ringnalda (:philor) from comment #2)
> Clobberer's been rewritten since then, hasn't it?

Morgan, I don't suppose you could take a look at this? Looks like the clobberer service rewrite might have regressed bug 658934. Thanks :-)
Blocks: 944005
Flags: needinfo?(winter2718)
(In reply to Ed Morley [:edmorley] from comment #3)
> (In reply to Chris AtLee [:catlee] from comment #1)
> > sounds like bug 658934, which we fixed a while ago. killing builds via
> > self-serve *should* set clobbers for them.
>
> (In reply to Phil Ringnalda (:philor) from comment #2)
> > Clobberer's been rewritten since then, hasn't it?
>
> Morgan, I don't suppose you could take a look at this? Looks like the
> clobberer service rewrite might have regressed bug 658934. Thanks :-)

Yes, so, I just added "bulk" clobbers, which are deployed when no build is happening at all. With that in mind, I'm curious: are the clobbers no longer happening, or are they just not showing up in the logs?
Flags: needinfo?(winter2718)
FWIW:
https://github.com/mozilla/build-relengapi-clobberer/blob/1222dbcf4925632e68012e2318882ec47470dfbf/relengapi/blueprints/clobberer/__init__.py#L147
and
https://github.com/mozilla/build-relengapi-clobberer/blob/1222dbcf4925632e68012e2318882ec47470dfbf/relengapi/blueprints/clobberer/__init__.py#L102

Coupled with the self-serve URL:
http://mxr.mozilla.org/build/source/puppet/manifests/moco-config.pp#169
and the still-existing code at:
http://mxr.mozilla.org/build/source/buildapi/buildapi/scripts/selfserve-agent.py#130

this makes me think it should be working. BUT a check on a single relengweb node:

[root@web1.releng.webapp.scl3 ~]# grep "lastclobber" /var/log/relengapi/relengapi.log* | wc -l
4608
[root@web1.releng.webapp.scl3 ~]# grep "by-builder" /var/log/relengapi/relengapi.log* | wc -l
0
[root@web1.releng.webapp.scl3 ~]# grep "/clobber " /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
0
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/clobber " | wc -l
8
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/by-builder " | wc -l
46
[root@web1.releng.webapp.scl3 ~]# for f in /var/log/httpd/api.pub.build.mozilla.org/*.gz; do gunzip -c $f; done | grep "/by-builder/" | wc -l
17
[root@web1.releng.webapp.scl3 ~]# grep "/by-builder " /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
6
[root@web1.releng.webapp.scl3 ~]# grep "/by-builder/" /var/log/httpd/api.pub.build.mozilla.org/{access,error}_log | wc -l
0
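For anyone repeating this check on another node, the per-endpoint counts can be gathered in one pass over both the live and the rotated (gzipped) logs. This is only a sketch: `count_matches` is a hypothetical helper, not part of relengapi or buildapi, and it does plain substring matching like the grep calls above.

```python
import glob
import gzip


def count_matches(pattern, needle):
    """Count lines containing `needle` in all files matching glob `pattern`.

    Files ending in .gz are read through gzip; everything else as plain text.
    """
    total = 0
    for path in sorted(glob.glob(pattern)):
        opener = gzip.open if path.endswith(".gz") else open
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with opener(path, "rt", errors="replace") as fh:
            total += sum(1 for line in fh if needle in line)
    return total
```

For example, `count_matches("/var/log/httpd/api.pub.build.mozilla.org/*", "/by-builder ")` would combine the live and archived counts for that endpoint, assuming the same log layout as on the node checked above.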
Would you mind driving this? :-)
Flags: needinfo?(winter2718)
I think I see the problem now. We have a new endpoint for requesting clobbers, so if the old one was being used clobbers are no longer being requested properly. Can we assign this a priority? I'm happy to work on this, but I need to understand how it fits in the scheme of other priorities (i.e. "deliverables")
Flags: needinfo?(winter2718)
Does bug 1139978 fix this too, I wonder?

(In reply to Morgan Phillips [:mrrrgn] from comment #7)
> I think I see the problem now. We have a new endpoint for requesting
> clobbers, so if the old one was being used clobbers are no longer being
> requested properly. Can we assign this a priority? I'm happy to work on
> this, but I need to understand how it fits in the scheme of other priorities
> (i.e. "deliverables")

It causes confusing failures for sheriffs and other devs when builds are cancelled on non-try, so it would be good to get it fixed in the next few weeks, if possible? :-)
If only confusion were the only impact. First there are the confusing failures, then there's the tree closure while we stare at them confusedly, then there's clobbering every slave because that's the only button we have to get just a few slaves clobbered, then there's the even longer continuation of the closure as we wait for slow clobbered builds to complete, to be sure both that clobbering fixed the problem and that no other problem crept in while we were confused and partly broken. Naively killing a few builds to "save resources" can easily turn into a four-hour tree closure.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General