Closed
Bug 782542
Opened 12 years ago
Closed 12 years ago
Secure necko IPDL usage
Categories
(Core :: IPC, defect, P1)
Core
IPC
Tracking
()
People
(Reporter: jduell.mcbugs, Assigned: jduell.mcbugs)
References
Details
(Whiteboard: [LOE:M])
Attachments
(15 files, 15 obsolete files)
(deleted),
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
justin.lebar+bug
:
feedback+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jdm
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
ted
:
review+
jduell.mcbugs
:
feedback+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
Necko IPDL channels are opened from the child. Right now we trust whatever appid/inBrowserElement they tell us they are. We must verify that the correct AppID is used for the channel using parent-side verification. From my understanding so far, this will probably be achieved by making sure all necko IPDL channels are constructed passing a TabParent/child instance which we can use to get the appID on the parent side (cjones tells me we already have the code to detect if a bogus TabChild is passed from the child).
There may be some cases where necko channels do not have a TabChild and/or a nsILoadContext. (smaug and jdm inform me that "frame scripts" and addons are cases for child-side code w/o a LoadContext: we get the nsITabChild from a channel's callbacks, so I'm not sure that's always set either.) We need to figure out what appID/inbrowser/etc credentials they need/get in that case (in particular what are valid channels, what might be rogue processes trying to get more permission than they ought to, and how to detect/enforce the difference).
This bug also include the IPDL protocols for Cookies and the AppCache, and we may have similar issues with ensuring we always have the Tabparent/loadcontext in those cases too.
Assignee | ||
Comment 1•12 years ago
|
||
Most of the other bugs blocking bug 776652 are blockers, so I assume this is too.
Hoping this is LOE:S, but likely to be M to deal with all the security holes/issues here. We can split off into sub-issues and categorize as blocking or not as needed.
blocking-basecamp: --- → ?
Whiteboard: [LOE:M]
Assignee: nobody → jduell.mcbugs
blocking-basecamp: ? → +
Updated•12 years ago
|
Whiteboard: [LOE:M] → [LOE:M][WebAPI:P2]
Assignee | ||
Comment 2•12 years ago
|
||
cjones, mounir, jlebar, and other b2g folks,
I'm wondering if there's an easier way to fix this.
The plan so far has been that channels (which are created from the child-side) would need to have callbacks that implement both TabParent and nsILoadContext, so that we can 1) send the {appId, inBrowserElement} for the load (and also usePrivateBrowsing for nextgen private browsings) to the parent, and 2) provide the parent with a TabParent that can be securely used to get the mozIApplication.localId, which will tell us if the child is passing in a bogus appId.
The problem with this plan is that lots of necko channels seem to not have callbacks set that provide nsILoadContext and/or TabParent. There are potentially a lot of call sites to change to fix that. We'll need to fix them eventually to make nextGen private browsing work (and ehsan has been valiantly fixing some of them--see the dependencies for bug 787743), but it's unclear if we'll get to them all soon enough.
Here's a different plan, which may be less work?
1) Change IPDL inheritance so that PNecko is managed by PBrowser instead of PContent.
2) Change 'mozbrowser' processes (which is just firefox now?) to use it's own unique appID instead of sharing an appID with it's app (IIUC the mozbrowser process is currently using 0, and so is already not sharing B2G firefox's appID, correct? I'd be proposing we change it to some non-zero unique appid).
Change #1 would let me get the appID for a channel (via IPDL GetParent()) without relying on the child to pass in the TabParent. Change #2 would let me get the isInBrowserElement value without the child passing in a nsILoadGroup (assuming we provide some kind of mapping of which appIDs are mozbrowsers).
Thoughts?
I think #1 would be very hard to do since you'd have to find the right PBrowser any time we make a network request so that you could then find the PNecko belonging to that browser. That seems basically just as hard as finding the appropriate {appid,browserflag}
Comment 4•12 years ago
|
||
> 2) Change 'mozbrowser' processes (which is just firefox now?) to use it's own unique appID instead
> of sharing an appID with it's app (IIUC the mozbrowser process is currently using 0, and so is
> already not sharing B2G firefox's appID, correct? I'd be proposing we change it to some non-zero
> unique appid).
Our story with these IDs is already so complicated, I'm really hesitant to complicate it further by having one part of the system tell a different story about the ID than another part of the system.
We use the current system of (app-id, is-browser) everywhere, including in Necko itself for managing cookies, right?
> Change #2 would let me get the isInBrowserElement value without the child passing in a
> nsILoadGroup (assuming we provide some kind of mapping of which appIDs are mozbrowsers).
If you can get the app-id off the tabparent, can't you also get the is-browser-element value off the tab-parent, or does that not work for some reason?
Assignee | ||
Comment 5•12 years ago
|
||
> I think #1 would be very hard to do...
Ah, ok. I was hoping it would be easier.
> can't you also get the is-browser-element value off the tab-parent...
No, it comes from the nsILoadContext.
I've got :baku working on figuring out how many sites lack TabParent/nsILoadContext info, so we should soon have more info on how many call sites need to be fixed, and hopefully we can fix them quickly.
thanks for fast replies!
Comment 6•12 years ago
|
||
This should probably never have had the feature keyword. Sorry for the noise.
Keywords: feature
Priority: -- → P2
Whiteboard: [LOE:M][WebAPI:P2] → [LOE:M]
Comment 8•12 years ago
|
||
Milestoning for C2 (deadline of 12/10), as this meets the criteria of "known P2 bugs found before or during C1".
Target Milestone: --- → B2G C2 (20nov-10dec)
Comment 9•12 years ago
|
||
No updates since *September*. We need an update on where this is at, and whether it truly needs to block the release.
Assignee | ||
Comment 10•12 years ago
|
||
Talked to jlebar, have design, am writing patch right now. Should have in a day or so.
Whether it should block: we'd allow apps to read other app's cookies, so probably needs to block.
Flags: needinfo?(jduell.mcbugs)
Comment 11•12 years ago
|
||
Thanks Jason. Any update on a patch for this? We're 7 days out from our C2 milestone, so need to get this landed as soon as possible.
Assignee | ||
Comment 12•12 years ago
|
||
I'm close here, but have context-switched to work on bug 815523, which seems of equal or higher priority and was higher risk. I think I've got a handle on both now.
Comment 13•12 years ago
|
||
Still have a handle on both, Jason? Pressure is high. :)
Assignee | ||
Comment 14•12 years ago
|
||
We need to be able to disable IPC checks for xpcshell tests, which don't have the TabParent needed to pass them.
Attachment #689839 -
Flags: review?(josh)
Assignee | ||
Comment 15•12 years ago
|
||
Same thing but for NeckoChild, so we can fail in debug mode, etc, when we know on child side that we lack info needed to pass checks on parent.
Attachment #689840 -
Flags: review?(josh)
Comment 16•12 years ago
|
||
Comment on attachment 689839 [details] [diff] [review]
part1: add new network.disable.ipc.security pref and make NeckoParent observe it
Review of attachment 689839 [details] [diff] [review]:
-----------------------------------------------------------------
The vast majority of these changes can be avoided by just using a static global var and Preferences::AddUintVarCache.
Attachment #689839 -
Flags: review?(josh) → review-
Comment 17•12 years ago
|
||
Comment on attachment 689840 [details] [diff] [review]
part2: also make NeckoChild observe
Review of attachment 689840 [details] [diff] [review]:
-----------------------------------------------------------------
Likewise.
Attachment #689840 -
Flags: review?(josh) → review-
Assignee | ||
Comment 18•12 years ago
|
||
ted: the alternative would be to add head.js files to directories that need it, but that's essentially all xpcshell tests that do networking, so seemed easier to make it opt-in. We'll get test coverage with checks on in mochitests.
Attachment #689845 -
Flags: review?(ted)
Comment 19•12 years ago
|
||
Comment on attachment 689845 [details] [diff] [review]
part2.5: disable necko IPC security by default for xpcshell tests
Review of attachment 689845 [details] [diff] [review]:
-----------------------------------------------------------------
::: testing/xpcshell/head.js
@@ +40,5 @@
> try {
> + runningInParent = Components.classes["@mozilla.org/xre/runtime;1"].
> + getService(Components.interfaces.nsIXULRuntime).processType
> + == Components.interfaces.nsIXULRuntime.PROCESS_TYPE_DEFAULT;
> +}
nit: trailing whitespace
Attachment #689845 -
Flags: review?(ted) → review+
Assignee | ||
Comment 20•12 years ago
|
||
jlebar: could you take a look at the checks I do in NeckoParent::CreateChannelLoadContext() and tell me if they seem right to you?
This is a WIP: it handles HTTP channels (it compiles and run w/o crashing, not extensively tested yet). We need to add same TabParent param and checks to rest of necko channels (PFtpChannel, PwyciwygChannel, PWebSocket.ipdl, PCookieSvc, and POfflineCacheUpdate). jdm, if you're up for working on this while I finish bug 815523 you're my hero.
Attachment #689874 -
Flags: feedback?(justin.lebar+bug)
Assignee | ||
Comment 21•12 years ago
|
||
Comment 22•12 years ago
|
||
Assignee | ||
Comment 23•12 years ago
|
||
Comment on attachment 689920 [details] [diff] [review]
Parts 1+2: Control Necko IPC security checks via a pref.
Review of attachment 689920 [details] [diff] [review]:
-----------------------------------------------------------------
Yes, AddTYPEVarCache is much easier than registering observers. Yet we've somehow never used it before in necko-land. Good to know.
::: netwerk/ipc/NeckoChild.cpp
@@ +21,5 @@
> namespace mozilla {
> namespace net {
>
> +static bool gDisableIPCSecurity = false;
> +static const char kPrefDisableIPCSecurity[] = "network.disable.ipc.security";
I'd rather see this as a #define in NeckoCommon.h so it's centralized. Linker might also be more likely to coalesce two identical string constants than two static vars with same value?
Attachment #689920 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Attachment #689839 -
Attachment is obsolete: true
Assignee | ||
Updated•12 years ago
|
Attachment #689840 -
Attachment is obsolete: true
Comment 24•12 years ago
|
||
Status update: I have completely untested patches that add validation for PCookieService requests, PFTPChannel, PWyciwygChannel, PWebSocket, and POfflineCacheUpdate. I have also completely broken the private browsing-related code, so I'm going to come up with a plan to fix that up.
Comment 25•12 years ago
|
||
Attachment #690727 -
Flags: review?(jduell.mcbugs)
Comment 26•12 years ago
|
||
Attachment #690728 -
Flags: review?(jduell.mcbugs)
Comment 27•12 years ago
|
||
Attachment #690729 -
Flags: review?(jduell.mcbugs)
Comment 28•12 years ago
|
||
Attachment #690730 -
Flags: review?(jduell.mcbugs)
Comment 29•12 years ago
|
||
Attachment #690731 -
Flags: review?(jduell.mcbugs)
Comment 30•12 years ago
|
||
Attachment #690732 -
Flags: review?(jduell.mcbugs)
Updated•12 years ago
|
Attachment #690732 -
Attachment description: Add PBrowser context to remote wyciwyg channels. → Part 8: Add PBrowser context to remote wyciwyg channels.
Comment 31•12 years ago
|
||
I thinking of making a utility GetPBrowserChildCallback method that could replace the goop that's present in every patch.
Comment 32•12 years ago
|
||
Sorry, last minute changes caused a build error.
Attachment #690736 -
Flags: review?(jduell.mcbugs)
Updated•12 years ago
|
Attachment #690732 -
Attachment is obsolete: true
Attachment #690732 -
Flags: review?(jduell.mcbugs)
Comment 33•12 years ago
|
||
Sorry, too many patches to keep track of.
Attachment #690738 -
Flags: review?(jduell.mcbugs)
Updated•12 years ago
|
Attachment #690731 -
Attachment is obsolete: true
Attachment #690731 -
Flags: review?(jduell.mcbugs)
Updated•12 years ago
|
Attachment #690736 -
Attachment is obsolete: true
Attachment #690736 -
Flags: review?(jduell.mcbugs)
Assignee | ||
Updated•12 years ago
|
Attachment #689845 -
Attachment description: part3: disable necko IPC security by default for xpcshell tests → part2.5: disable necko IPC security by default for xpcshell tests
Assignee | ||
Updated•12 years ago
|
Attachment #689874 -
Attachment is obsolete: true
Attachment #689874 -
Flags: feedback?(justin.lebar+bug)
Assignee | ||
Comment 35•12 years ago
|
||
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.
Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------
I don't think we actually need all the nsIPrivateBrowsingChannel support here and in the rest of the patch series, but it's harmless to include for now. Filed bug 820304 to remove it.
jdm: re: our IRC chat about using UNKNOWN_APP vs NO_APP as default for xpcshell tests: looks like you were right and we should go with NO_APP.
::: docshell/base/LoadContext.h
@@ +31,5 @@
> public:
> NS_DECL_ISUPPORTS
> NS_DECL_NSILOADCONTEXT
>
> +#ifndef CHECK_WE_ONLY_USE_THIS_FROM_NEW_CODE
We need to get rid of this, or the code it surrounds. I'll file a followup patch with this and other changes from my comments.
Attachment #690727 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Updated•12 years ago
|
Attachment #690728 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 36•12 years ago
|
||
Comment on attachment 690729 [details] [diff] [review]
Part 5: Use validated app data for remote cookie requests.
Review of attachment 690729 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/ipc/NeckoParent.cpp
@@ +119,5 @@
> + }
> + if (!aSerialized.IsNotNull()) {
> + return "SerializedLoadContext from child is null";
> + }
> + }
Above checks are duplicated in call to GetValidatedAppInfo, so we should remove them here.
Attachment #690729 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 37•12 years ago
|
||
Comment on attachment 690730 [details] [diff] [review]
Part 6: Use verified app data for remote offline cache requests instead of insecure values from the child.
Review of attachment 690730 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/TabParent.cpp
@@ +1125,5 @@
> {
> nsRefPtr<mozilla::docshell::OfflineCacheUpdateParent> update =
> new mozilla::docshell::OfflineCacheUpdateParent();
>
> + nsresult rv = update->Schedule(aManifestURI, aDocumentURI, stickDocument, this);
Minor change: I think it's cleaner to have the OfflineCacheUpdate constructor take appID/inBrowser and call it from here with (OwnOrContainingAppId(), IsBrowserElement).
Attachment #690730 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Updated•12 years ago
|
Attachment #690738 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 38•12 years ago
|
||
Comment on attachment 690740 [details] [diff] [review]
Part 8: Use validated app data for remote wyciwyg channels.
Review of attachment 690740 [details] [diff] [review]:
-----------------------------------------------------------------
This patch is different than all the other IPDL protocols in that we don't do the security checks right at IPDL construction time, which means we have to be more careful that a malicious child process can't send us IPDL messages in some sequence that bypasses the checks we do in AsyncOpen. That said, I don't see any useful attacks here, because there's not much that can be done until AsyncOpen is called. Also wyciwyg channels only touch the HTTP cache, which doesn't have per-app isolation (yet), so there's not much to protect. So I think this is good for now.
Attachment #690740 -
Flags: review?(jduell.mcbugs) → review+
Assignee | ||
Comment 39•12 years ago
|
||
jdm: let me know if you find any of these changes objectionable.
Attachment #690792 -
Flags: review?(josh)
Assignee | ||
Comment 40•12 years ago
|
||
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.
Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------
jlebar: it would still be nice to have feedback as to whether the sec checks (see NeckoParent::CreateChannelLoadContext in this patch) look right to you.
Attachment #690727 -
Flags: feedback?(justin.lebar+bug)
Assignee | ||
Comment 41•12 years ago
|
||
try build:
https://tbpl.mozilla.org/?tree=Try&rev=4eb7c5a23f42
Patches don't apply cleanly to beta. Haven't had a chance to try on phone yet. Will get back to this when I've got patches for bug 815523 done.
Updated•12 years ago
|
Attachment #690792 -
Flags: review?(josh) → review+
Comment 42•12 years ago
|
||
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.
Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------
This looks right to me. Sorry for leaving you hanging!
Attachment #690727 -
Flags: feedback?(justin.lebar+bug) → feedback+
Assignee | ||
Comment 43•12 years ago
|
||
Merged into big combo patch for easier approvals/backouts/etc.
https://hg.mozilla.org/integration/mozilla-inbound/rev/34c9ccee8058
Attachment #690960 -
Flags: review+
Comment 44•12 years ago
|
||
Backed out because this broke some mochitests: https://hg.mozilla.org/integration/mozilla-inbound/rev/cdc37128bd0d
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=34c9ccee8058
Comment 46•12 years ago
|
||
Assignee: jduell.mcbugs → josh
Attachment #691155 -
Flags: review?(ted)
Assignee | ||
Comment 47•12 years ago
|
||
Comment on attachment 691155 [details] [diff] [review]
Part 10: Disable security tests for browser-element tests.
Review of attachment 691155 [details] [diff] [review]:
-----------------------------------------------------------------
// our test harness tests browser elements
// without sticking them in apps, and the security checks dislike that.
jlebar told me we weren't using this configuration in production yet, and I didn't know about the tests. I've filed bug 820712 for the work to figure this out--doesn't need to block this bug IMO. Meanwhile disabling security for test runs is good workaround.
Attachment #691155 -
Flags: feedback+
Assignee | ||
Comment 48•12 years ago
|
||
Comment on attachment 690960 [details] [diff] [review]
Combined patch as landed on m-c
I've got a beta version of the combined patch ready (aurora doesn't require any mods). akeybl told me on IRC we can land these w/o +a process since this is all e10s.
Comment 49•12 years ago
|
||
Comment on attachment 691155 [details] [diff] [review]
Part 10: Disable security tests for browser-element tests.
Review of attachment 691155 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/browser-element/mochitest/browserElementTestHelpers.js
@@ +70,5 @@
> + },
> +
> + setIPCSecurityDisabledPref: function(value) {
> + return this._setBoolPref("network.disable.ipc.security", value);
> + },
I don't understand the extra layers of indirection here, but it's consistent with what's already here.
Attachment #691155 -
Flags: review?(ted) → review+
Comment 51•12 years ago
|
||
Rollup for inbound.
Updated•12 years ago
|
Attachment #690960 -
Attachment is obsolete: true
Assignee | ||
Comment 52•12 years ago
|
||
Hopefully this is the last of the test fixed. Pushed to try
https://tbpl.mozilla.org/?tree=Try&rev=f5dc5426ebe5
Attachment #691565 -
Flags: review?(ted)
Assignee | ||
Updated•12 years ago
|
Attachment #691155 -
Attachment description: Disable security tests for browser-element tests. → Part 10: Disable security tests for browser-element tests.
Comment 53•12 years ago
|
||
Comment on attachment 691565 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests
You should clear the pref once test_child_process_shutdown_message.html finishes running.
Assignee | ||
Comment 54•12 years ago
|
||
Attachment #691565 -
Attachment is obsolete: true
Attachment #691565 -
Flags: review?(ted)
Attachment #691693 -
Flags: review?(ted)
Comment 55•12 years ago
|
||
Comment on attachment 691693 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests
Review of attachment 691693 [details] [diff] [review]:
-----------------------------------------------------------------
::: content/base/test/test_child_process_shutdown_message.html
@@ +113,5 @@
> SpecialPowers.addPermission("browser", true, window.document);
> SpecialPowers.addPermission("embed-apps", true, window.document);
> +
> + // TODO: remove in bug 820712
> + SpecialPowers.setBoolPref("network.disable.ipc.security", true);
Given the way this test is structured, you could pretty easily change it to use SpecialPowers.pushPrefEnv, which does the cleanup for you:
http://mxr.mozilla.org/mozilla-central/source/testing/specialpowers/content/specialpowersAPI.js#538
But since the test already has a cleanup function it's not super important.
::: dom/devicestorage/ipc/test_ipc.html
@@ +152,5 @@
> ["device.storage.enabled", true],
> ["device.storage.testing", true],
> ["device.storage.prompt.testing", true],
>
> + // TODO: remove this as part of bug 820712
nit: trailing whitespace here
::: dom/indexedDB/ipc/test_ipc.html
@@ +167,5 @@
> addEventListener("load", function() {
> SpecialPowers.addPermission("browser", true, document);
> SpecialPowers.pushPrefEnv({
> "set": [
> + // TODO: remove this as part of bug 820712
nit: here too
Attachment #691693 -
Flags: review?(ted) → review+
Assignee | ||
Comment 56•12 years ago
|
||
apparently this is going to be an iterative process...
https://tbpl.mozilla.org/?tree=Try&rev=9385fe428974
Assignee | ||
Comment 57•12 years ago
|
||
Attachment #691693 -
Attachment is obsolete: true
Attachment #692007 -
Flags: review+
Assignee | ||
Comment 58•12 years ago
|
||
fixed typo
Attachment #692001 -
Attachment is obsolete: true
Attachment #692001 -
Flags: review?(ted)
Attachment #692012 -
Flags: review?(ted)
Assignee | ||
Comment 59•12 years ago
|
||
Assignee | ||
Comment 60•12 years ago
|
||
Comment on attachment 692012 [details] [diff] [review]
part 12: v2 another test needs checks disabled.
Hmm, test fails reliably on all platforms on try, but works fine on my desktop. Probably some interaction issue with other mochitests--investigating.
Attachment #692012 -
Flags: review?(ted)
Assignee | ||
Comment 61•12 years ago
|
||
ok, this seems to be working:
https://tbpl.mozilla.org/?tree=Try&rev=bb0efd6090e6
Started full try run:
https://tbpl.mozilla.org/?tree=Try&rev=4336a651a3d4
Attachment #692012 -
Attachment is obsolete: true
Attachment #692255 -
Flags: review?(ted)
Updated•12 years ago
|
Attachment #692255 -
Flags: review?(ted) → review+
Assignee | ||
Comment 62•12 years ago
|
||
Attachment #691351 -
Attachment is obsolete: true
Attachment #692697 -
Flags: review+
Comment 63•12 years ago
|
||
The try build for this got this Linux Opt Cipc failure:
TEST-UNEXPECTED-FAIL | http://localhost:4444/1355488614395/2/330010-1.xul | application timed out after 330 seconds with no output
It was thought to be random, however every build on inbound since this landed has resulted in this error, so it would seem this is a real issue.
Comment 64•12 years ago
|
||
Backed out for crashtest-ipc timeouts:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=3e4e600adc3b
https://hg.mozilla.org/integration/mozilla-inbound/rev/463b740c3e2f
Why did this land if the try run was broken?
Whilst people not using Try is frustrating, I can at least empathise that in many cases it's just not worth the infra load to test a particular changeset before landing. However, using our (scarce) Try resources and then still landing after it was broken on Try (without asking #developers whether they thought the failure was real, or even just doing a retrigger on Try), seems like the worst of both worlds...
Comment 65•12 years ago
|
||
> Why did this land if the try run was broken?
Because the Cpic failure matched the signature of a known randomorange?
Comment 66•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #65)
> > Why did this land if the try run was broken?
>
> Because the Cpic failure matched the signature of a known randomorange?
But it doesn't...
Comment 67•12 years ago
|
||
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #66)
> > Because the Cpic failure matched the signature of a known randomorange?
>
> But it doesn't...
The Cpic failure in comment 61 says:
> TEST-UNEXPECTED-FAIL | http://localhost:4444/1355488614395/2/330010-1.xul | application timed out after 330 seconds with no output
> Bug 742455 - Intermittent crashtest content/xul/templates/src/crashtests/397148-1.xul, 330010-1.xul, 330012-1.xul | Exited with code -1073741819 during test run [@ RDFContentSinkImpl::HandleEndElement] PROCESS-CRASH | http://localhost:4444/1355488614395/2/330010-1.xul | application crashed [@ linux-gate.so + 0x424]
> Bug 742455 - Intermittent crashtest content/xul/templates/src/crashtests/397148-1.xul, 330010-1.xul, 330012-1.xul | Exited with code -1073741819 during test run [@ RDFContentSinkImpl::HandleEndElement] Thread 0 (crashed)
So you're right; if you look at this closely, the failure in the try push is a timeout, whereas the known orange is a crash.
But given that at least three people (jduell, whoever was sheriffinb m-i, and now me) looked at this TBPL output and believed the failure to be a random orange, it seems reasonable to me to suggest that the tool is more at fault here than the users.
Comment 68•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #67)
> But given that at least three people (jduell, whoever was sheriffinb m-i,
> and now me) looked at this TBPL output and believed the failure to be a
> random orange, it seems reasonable to me to suggest that the tool is more at
> fault here than the users.
It's a weekend, no one is really sheriffing inbound, and it's not clear at this point whether jduell saw the cipc orange.
That said, if you can think if ways in which to make it easier to spot that string 1 != string 2 (given that in this case you didn't even need to look at the log, and string matches are bolded, so most of this string wasn't in bold, indicating it didn't match... so it seems pretty clear already), then I'll gladly try and improve the annotated summary :-)
Comment 69•12 years ago
|
||
Comment 63 made it sound like someone believed that the failure was random on m-i. Maybe it was jduell himself looking at the m-i failures, but anyway it seems unlikely that I'm the only one who looked at this TBPL log and misinterpreted it.
> if you can think if ways in which to make it easier to spot that string 1 != string 2
How about not requiring the user to do a strstr (computer-aided or otherwise)? tbpl could explicitly separate matches into "likely your failure" and "likely not your failure" instead of relying on users to look for typographic hints.
If TBPL can't do this collation correctly based on boldness, then surely naive users also can't make this determination of "likely/not likely a known failure" based on boldness, and then the claim that "likely a known failure" is a straightforward determination isn't true.
Comment 70•12 years ago
|
||
I wrote comment 63 and was giving you the benefit of the doubt as to why you may have landed despite the obvious try server test issue. Don't try to make my comment justify this action.
Comment 71•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #69)
> instead of relying on users to look
> for typographic hints.
Calling them 'hints' is a bit of a stretch. The two failure types shown were really quite different.
Perhaps we just need to have a day a week where inbound is closed and devs self-sheriff, so people don't lose touch with how to see if their push was successful?
> If TBPL can't do this collation correctly based on boldness, then surely
> naive users also can't make this determination of "likely/not likely a known
> failure" based on boldness, and then the claim that "likely a known failure"
> is a straightforward determination isn't true.
Using a simple % of string bolded is just not practical, due to additional content often needed in bug summaries. Unless we have structured logging (and a dozen accompanying bugzilla fields to be able to perform definitive matching against filed intermittents), then we're always going to have to match against bug summaries.
There are many many ways in which TBPL could be improved (and hopefully will be over the coming months, with a possible TBPLv2 on the cards next year), but ultimately at some point, devs need to accept they may not always get a dumbed down red/green indication, and actually have to read 20 characters of text now and again ;-)
Comment 72•12 years ago
|
||
Again, I do not think blaming users is a productive course of action here.
> devs need to accept they may not always get a dumbed down red/green indication, and
> actually have to read 20 characters of text now and again ;-)
That's fine if you can accept that we'll get it wrong occasionally. But comment 64 suggests that you're expecting a high degree of accuracy from a process carried out by busy non-experts many times a week. I think is unrealistic and ultimately unfair to developers.
Indeed, if I'm going to get this kind of flack when I push to try and miss an orange, then I think a rational reaction would be to push to try less often and let experts sort out the potential failures on m-i. That's obviously counter to your goal, so I again suggest that blaming users here is counter-productive.
> Perhaps we just need to have a day a week where inbound is closed and devs self-sheriff,
> so people don't lose touch with how to see if their push was successful?
If the goal here is to maximize developer productivity, it seems unlikely to me that this is a better course of action than the status quo.
I maintain that it's unreasonable to expect that engineers, who may push to try or m-i ten times in a week and see five random oranges per push, should examine every orange / red in their push beyond clicking it and seeing if tbpl suggests a match (and even that is pushing it, imo). Indeed, the presence of a TBPL match is strongly indicative of whether the failure is a known orange, so you're asking us to look carefully at each of these fifty failures to catch what is in reality an unlikely event, making the expectation that we will carry this out with a high degree of accuracy even more unrealistic.
If you want fewer csets pushed to m-i after failing try pushes, let's figure out what can be done automatically. For example, maybe we can automatically re-trigger oranges on try. Or we could push for increased test capacity so we could stop coalescing tests on m-i so pushing an orange wouldn't be as big a deal.
Comment 73•12 years ago
|
||
I don't get you position here. My comment 63 was trying to give you an excuse for how you obviously screwed up by landing something that failed on try.
Comment 74•12 years ago
|
||
(In reply to Bill Gianopoulos [:WG9s] from comment #73)
> I don't get you position here. My comment 63 was trying to give you an
> excuse for how you obviously screwed up by landing something that failed on
> try.
I didn't land anything; Jason did. I'm defending his oversight, which I myself also made when looking at the try results after I read comment 64, on the grounds that this oversight is an easy one to make and that it's unreasonable to expect a high degree of accuracy from developers interpreting their try results when that process requires so many manual steps.
My position is that if the accuracy we're seeing from developers interpreting try results is too low, we should focus on ways to improve the system, rather than blaming users of that system for being human and screwing up.
Comment 75•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #72)
> I maintain that it's unreasonable to expect that engineers, who may push to
> try or m-i ten times in a week and see five random oranges per push, should
> examine every orange / red in their push beyond clicking it and seeing if
> tbpl suggests a match (and even that is pushing it, imo).
I really don't think it's pushing it to ask devs to click on 3-4 failures on a run to even see if there are suggestions - I suspect we're going to have to agree to disagree on this front :-)
Both in this discussion (and previous on dev.{platform,planning}) you have advocated pushing most of the burden onto sheriffs - which is great in principal (and a place where I would like us to be at eventually, given it's an inefficient use of dev time for them to be doing some of these tasks), but is just not practical given the single employed sheriff, the tooling deficit we've built up over the years & our hard-working but understaffed releng team who don't have the cycles to fix our capacity problems as fast as we would like.
Thankfully, we're almost about to fill another sheriff position, which will reduce the load on me slightly (and actually give me more time to work on things like TBPL improvements), but even then, the world you (and myself too!) would like us to be in, just isn't where we're at - so (IMO) both the devs and the sheriffs need to share the burden for the short term.
> If you want fewer csets pushed to m-i after failing try pushes, let's figure
> out what can be done automatically. For example, maybe we can automatically
> re-trigger oranges on try. Or we could push for increased test capacity so
> we could stop coalescing tests on m-i so pushing an orange wouldn't be as
> big a deal.
The recently stated Futurama project has some great suggestions so far for these kind of things - and I'm really excited to see what comes out of it :-D
https://wiki.mozilla.org/Auto-tools/Projects/Futurama
Comment 76•12 years ago
|
||
> Both in this discussion (and previous on dev.{platform,planning}) you have advocated
> pushing most of the burden onto sheriffs
I'd love it if you guys could fix the problems we've identified with TBPL, but that's only half of my argument. The other half is that given the current state of affairs -- given that you don't have the resources to fix problems in TBPL, for example -- you shouldn't blame devs when we get bitten by deficiencies in the system. I further contend that blaming devs for this kind of mistake may cause us to use try less often, and may thus be counter to your goals.
In contrast, I think you were/are explicitly suggesting that the burdens here should be carried by developers -- that developers should spend time carefully combing through their pushes' oranges so as to notice when TBPL incorrectly suggests a bug, for example -- and that it's not acceptable for developers to screw up when carrying out these tasks. In other words, you were/are arguing that the failure here was the user's fault. Again I object to this notion as inefficient, counter-productive, and ultimately unrealistic.
When you spend your whole day interacting with one tool, it's easy to think that people are careless for missing aspects about that tool which are obvious to you. For example, you said in comment 71 that calling the boldness a "typographic hint" was a stretch, because the types of failures "really [are] quite different".
Instead, I'd encourage you to take an alternative point of view and use the fact that this indicator was missed not as evidence of carelessness but as a pointer to an area of the tool which can be improved. When you blame a user instead of looking at things from the user's point of view, you lose important pieces of evidence for how to improve and essentially trade them for an ad hom. This is not productive.
Assignee | ||
Comment 77•12 years ago
|
||
FWIW here's what I was thinking (and I'm totally open to hearing from the sheriffs if this approach is a terrible strategy). I assumed the oranges in my try build were random (I tend to make that assumption more especially with timeout reports that only happened on one platform and where there's already some sort of bad exit report for the test). I pushed to inbound. I starred what were obvious known oranges and stared at the rest. I was actually worried about a different orange (browser_819510_perwindowpb.js failed, without suggestions, which seemed likely to be a problem: turns out it's not). I considered backing my patch out, but I figured that it wasn't burning the tree, so one or two following pushes to inbound would provide essentially more try runs to reveal if there was really a problem here. I had rolled my patch up into one commit so it's easy to backout, so... no problem? (I also figured I'd get up early and check/backout it myself, but instead slept through my alarm and woke up instead to find my bug comments section morphed into a poor cousin of dev.platform :)
Unless I hear otherwise from the sheriffs, I'm going to assume that I should have just backed out at the least sign of trouble and pushed to try a few more times (unless there was a sheriff on IRC who I could let know what was going on, and who was OK with letting my patch bake for a while). There's a trade-off here between sheriff resources, developer time, and try server consumption, obviously. Maybe we can talk about that larger issue somewhere other than in more comments in this bug? Unless you're brief :)
Comment 78•12 years ago
|
||
Are we going to try for a re-land today? This is one of a small handful of remaining LOE:M work, and needs to land as early this week as possible.
Priority: P2 → P1
Target Milestone: B2G C2 (20nov-10dec) → B2G C3 (12dec-1jan)
Assignee | ||
Comment 79•12 years ago
|
||
I can't repro the issue on my box. Try builds are taking forever to return 'Cipc' test results:
https://tbpl.mozilla.org/?tree=Try&rev=9a532b225c42
https://tbpl.mozilla.org/?tree=Try&rev=a73db2de25f0
https://tbpl.mozilla.org/?tree=Try&rev=3b9367e2ed7c
One of these builds is using a different strategy--to simply allow the "mozbrowser w/o app" to pass security checks. If that passes and the others don't I'll land that tomorrow.
Assignee | ||
Comment 80•12 years ago
|
||
So we're reliably failing Cipc on the try servers regardless of whether we turn on mozbrowser-w/o-app checks or not. And I can't repro the failure on my desktop :(
I'm installing a 32bit Fedora on a VM in case that shows the issue for me. I've also filed bug 822998 to get access to an actual buildbot in case that doesn't work.
Assignee | ||
Comment 81•12 years ago
|
||
Maybe patch 13 is my lucky patch!
This skip-if's a crashtest in e10s mode because it uses RDFService which doesn't provide TabChild in HTTP channel callbacks (and thus crashes).
I filed bug 823470 about the larger issue of whether we'll ever need RDFService in child.
Try build:
https://tbpl.mozilla.org/?tree=Try&rev=34bab1c0e437
Attachment #694328 -
Flags: review?(ted)
Comment 82•12 years ago
|
||
Comment on attachment 694328 [details] [diff] [review]
Part 13: disable RDF test when in e10s mode
Review of attachment 694328 [details] [diff] [review]:
-----------------------------------------------------------------
::: content/xul/templates/src/crashtests/crashtests.list
@@ +1,4 @@
> load 257752-1-recursion.xul
> load 329335-1.xul
> load 329884-1.xul
> +skip-if(winWidget||browserIsRemote) HTTP load 330010-1.xul # bug 742455
Can you add the new bug # that you filed here?
Attachment #694328 -
Flags: review?(ted) → review+
Assignee | ||
Comment 83•12 years ago
|
||
> Can you add the new bug # that you filed here?
done
Attachment #694534 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Attachment #692697 -
Attachment is obsolete: true
Assignee | ||
Comment 84•12 years ago
|
||
Assignee | ||
Comment 85•12 years ago
|
||
Assignee | ||
Updated•12 years ago
|
status-b2g18:
--- → fixed
status-firefox19:
--- → fixed
Assignee | ||
Comment 86•12 years ago
|
||
Note: if your B2G child process starts crashing after this lands with a message that it's failed necko security checks, it's because you have a bug (you need to pass security info to all necko channels), so backing this patch out is probably not your best first response. If we run into enough trouble that cjones, etc, thinks this needs backing out, then by all means, back out :) So far we're passing automated tests at least...
Sorry folks, this code may be an innocent bystander but it caused bug 823962. Need to back out and investigate.
Aaaand I just saw comment 86. However, I don't see any necko errors.
I/Gecko ( 3436): [Parent 3436] WARNING: NeckoParent::AllocPHttpChannel: FATAL error: missing required PBrowser argument: KILLING CHILD PROCESS
Clock is running though.
Hahahahahahahahhahaha
(gdb) printf "%s", PrintJSStack()
0 anonymous() ["jar:file:///system/b2g/omni.ja!/components/contentSecurityPolicy.js":338]
this = [object Object]
1 anonymous() ["jar:file:///system/b2g/omni.ja!/components/contentSecurityPolicy.js":518]
this = function () {
[sourceless code]
}
Sorry guys, we have some nontrivial regressions from this. Clock is up :/. Will help with regressions where I can.
Updated•12 years ago
|
status-b2g18:
fixed → ---
status-firefox19:
fixed → ---
Assignee | ||
Comment 93•12 years ago
|
||
ready to land once we fix the dependencies.
Attachment #694534 -
Attachment is obsolete: true
Assignee | ||
Comment 94•12 years ago
|
||
Missed a '+' from merging patch.
Attachment #695193 -
Attachment is obsolete: true
Attachment #695238 -
Flags: review+
Comment 95•12 years ago
|
||
* bug 823962 has landed on aurora and b2g18.
* bug 824177 has landed on inbound.
* jdm told me that bug 824200 is likely fixed by bug 753981
* jdm _also_ told me that he's going to try to land this again :)
Comment 96•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/6cd44513d285
https://hg.mozilla.org/releases/mozilla-aurora/rev/46e707cbe7b3
status-firefox19:
--- → fixed
status-firefox20:
--- → fixed
Comment 97•12 years ago
|
||
status-b2g18:
--- → fixed
Comment 98•12 years ago
|
||
If any further problems are encountered, let's please just flip the pref. The full patch is a pain to rebase.
Comment 99•12 years ago
|
||
Nevermind, there was an actual test failure :(
https://hg.mozilla.org/integration/mozilla-inbound/rev/8aa341f3e3a6
https://hg.mozilla.org/releases/mozilla-aurora/rev/831232d1e065
https://hg.mozilla.org/releases/mozilla-b2g18/rev/c707d415f13b
I'll fix it up tonight and try again.
Comment 100•12 years ago
|
||
Comment 101•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•