Closed Bug 1385249 Opened 7 years ago Closed 7 years ago

Reenable the preallocated process manager

Categories

(Core :: DOM: Content Processes, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: gkrizsanits, Assigned: gkrizsanits)

References

Details

(Whiteboard: [e10s-multi:+])

Attachments

(2 files, 1 obsolete file)

No description provided.
Depends on: 1385252
Depends on: 1376895
Blocks: 1364849
Whiteboard: [e10s-multi:+]
Blocks: 1382608
Assignee: nobody → gkrizsanits
Attachment #8895940 - Flags: review?(mrbkap)
Comment on attachment 8895940 [details] [diff] [review] Reenable the preallocated process. Let's do this!
Attachment #8895940 - Flags: review?(mrbkap) → review+
Priority: -- → P1
Hi Bill, I'm struggling with a frequent content process startup crash like this one: https://treeherder.mozilla.org/logviewer.html#?job_id=124114971&repo=try&lineNumber=1793 more crashes can be found here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a867feff713b819a851508ca15925af0028d724 I cannot reproduce it locally for some reason... The main thread is in the ProcessChild ctor waiting in Thread::StartWithOptions while another thread is calling CheckResponsivenessTask::DoFirstDispatchIfNeeded which is trying to DispatchToMainThread which is crashing. Do you know what this might mean? The patch is about reenabling the preallocated process manager but I don't know what would that case a process startup crash... Do you have any ideas? (maybe the main process start to shut down in the meanwhile, short after it initiated the child process creation...)
Flags: needinfo?(wmccloskey)
It's a little odd that we're crashing here instead of just printing a warning. Maybe we crash for NS_ASSERTIONS in xpcshell tests now? I'm not sure. I don't see any place where we would do that. In any case, it seems like the profiler thread is trying to post a task very early in startup. Maybe a profiler person could help?
Flags: needinfo?(wmccloskey) → needinfo?(n.nethercote)
mstange understands CheckResponsivenessTask and profiler IPC better than I do, so I will defer to him...
Flags: needinfo?(n.nethercote) → needinfo?(mstange)
(In reply to Bill McCloskey (:billm) from comment #4) > In any case, it seems like the profiler thread is trying to post a task very > early in startup. That's what's happening, yes. If a content process is launched while the parent process is being profiled, then we launch the content process with the MOZ_PROFILER_STARTUP=1 environment variable, which causes us to start the profiler [1] very early during startup [2], and then we'll try to dispatch a check responsiveness event to the main thread [3]. [1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/tools/profiler/core/platform.cpp#2375 [2] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/xre/nsEmbedFunctions.cpp#411 [3] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/tools/profiler/gecko/ThreadResponsiveness.cpp#45 It's possible that I've never tested startup profiling on a debug build. The assertion that it's hitting seems like it would have always been triggered under these conditions. The crash stack in the log of the failed test shows this frame: 05:01:31 INFO - 3 XUL!NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int) [nsThreadUtils.cpp:4a867feff713 : 258 + 0x21] so it's inside nsThreadUtils.cpp at line 258, which is this assertion: nsresult rv = NS_GetMainThread(getter_AddRefs(thread)); if (NS_WARN_IF(NS_FAILED(rv))) { NS_ASSERTION(false, "Failed NS_DispatchToMainThread() in shutdown; leaking"); So NS_GetMainThread isn't set up at that stage. I think we can work around this by checking NS_GetMainThread in ThreadResponsiveness.cpp before calling SystemGroup::Dispatch.
Flags: needinfo?(mstange)
Depends on: 1393123
I've filed bug 1393123 about this.
backing this out appears to cause an improvement in awsy memory measurements: == Change summary for alert #9135 (as of August 30 2017 22:48 UTC) == Improvements: 3% Explicit Memory summary linux64 opt 297,866,944.90 -> 290,115,887.16 2% Heap Unclassified summary linux64 opt 54,748,050.26 -> 53,583,599.85 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9135
So the failures here are only happening in setup that we don't run on try, if I understand it correctly that means they are tier2 tests and patches should not be backed out because of them. I cannot reproduce the failures locally and believe that it's just some stupid timing issue in the reftest harness, can I re-land the patch as it is? It would help to figure out if there is any other issue going on with this feature and detect it in time. Otherwise I have no idea how long it will take to figure out why are these tests failing on this specific machine. One thing is for sure that the process reusing that makes these tests fail, as another option I can just turn that part off for reftests temporarily until I can figure out what's going on.
Flags: needinfo?(gkrizsanits) → needinfo?(jmaher)
If we want to land it fully and see if it sticks that is fine- a few lower frequency failures would be ok to live with. What is the work required to turn this off for reftests only? could we do that after the fact if we find the failures are too frequent?
Flags: needinfo?(jmaher)
To be clear, Windows 8 reftests are *not* Tier 2. You do have to opt into them explicitly in your Try syntax, however.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13) > To be clear, Windows 8 reftests are *not* Tier 2. You do have to opt into > them explicitly in your Try syntax, however. Alright, then I guess I have no choice and have to figure out how to fix this properly even if that might take long.
So it seems like something goes wrong in the nsStringBundle::LoadProperties in the first content process which we usually kill early and won't cause any harm, but since this patch attempts to reuse that process and the function keeps failing after the first failed LoadProperties attempt, we cannot get the localized string for anything. The failing reftests all use some sort of localized strings. It's been fun and joy that this code just keeps failing silently when the entire browser is probably broken without localized strings. Kris, you worked on this area recently and I suspect your patches (bug 1363482) are causing this issue (this used to work a couple of month ago so either that or Bug 1377377), do you have any idea what might be going on? First, I think if something fails in this method we should handle it better (in debug build it should be a crash for sure). Second, I have the feeling that the try once and then never again approach is not the best any more... I can probably just mark the process not to be reused, land my patch and move on, but this bug scares me a bit, so I don't want to just walk away without investigate it further. A content process without localized string access (with all this code that just fails silently and renders everything incorrectly) might be a worse user experience than a crash, so let's try to fix this. So my plan is to figure out what fails (ofc I cannot reproduce it locally so that's hard). If that takes too much time, I will mark the process, land this patch, file a followup to handle failure better and then see where that other bug will lead us. Does that plan make sense to you? Do you have any ideas what might fail? https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fae22fd38442a9d611ea70d19396497844d4b99&selectedJob=131918009
Depends on: 1363482
Flags: needinfo?(kmaglione+bmo)
I guess the reason for the failure is this: {"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1505807737494, "action": "process_output", "data": "Couldn't convert chrome URL: chrome://branding/locale/brand.properties", "pid": 6828} Note that this happens really early in the startup.
The channel creation fails sometimes, if we're too early in the startup: http://searchfox.org/mozilla-central/rev/3dbb47302e114219c53e99ebaf50c5cb727358ab/intl/strres/nsStringBundle.cpp#106 So it probably unrelated to your patch. If I remove the bits that prevents consequent attempts, after a few tries it always works: https://treeherder.mozilla.org/#/jobs?repo=try&revision=464da6e565ba4050439b40669a0ac6f14d8de5d6&selectedJob=133569282 I'm not sure who owns this code though...
Flags: needinfo?(kmaglione+bmo)
Attached patch Let nsStringBundle retry LoadProperties. v1 (obsolete) (deleted) — Splinter Review
Olli, could you review this patch? Or if not, do you know who is the right person to talk to? I have no idea who owns this code...
Attachment #8912751 - Flags: review?(bugs)
Comment on attachment 8912751 [details] [diff] [review] Let nsStringBundle retry LoadProperties. v1 So in which case does the load fail? I assume when http://searchfox.org/mozilla-central/rev/3dbb47302e114219c53e99ebaf50c5cb727358ab/dom/base/nsContentUtils.cpp#804 triggers the preload, am I right? If so, shouldn't we move that preloading happening when we know that loading should succeeed, like when necko is up and running or so. We don't want to basically regress Bug 1377377 in some cases when preloading has failed. Have you looked at necko why loading might fail, or asked necko devs? Could we start preloading once we have both gecko (in practice nsContentUtils) and necko up and running?
Attachment #8912751 - Flags: review?(bugs) → review-
(In reply to Olli Pettay [:smaug] from comment #19) > Comment on attachment 8912751 [details] [diff] [review] > Let nsStringBundle retry LoadProperties. v1 > > So in which case does the load fail? I assume when > http://searchfox.org/mozilla-central/rev/ > 3dbb47302e114219c53e99ebaf50c5cb727358ab/dom/base/nsContentUtils.cpp#804 > triggers the preload, am I right? Possibly, I haven't checked, I cannot reproduce it locally which does not help. But based on the logs: [1] it seems like there are several failed attempts from various places so simply fixing the preloading part won't be enough. > If so, shouldn't we move that preloading happening when we know that loading > should succeeed, like when necko is up and running or so. I'm planing to file a followup as this code is clearly not working as intended, but that should be a separate bug. > We don't want to basically regress Bug 1377377 in some cases when preloading > has failed. Note that loading actual content happens much later so that is very unlikely. And if this situation occurs (and it does just luckily in a content process we throw away) the current behavior is a completely broken content process that cannot render content that has localized string. I would prefer some non-optimal initial performance over a completely broken behavior any time of the day. > > Have you looked at necko why loading might fail, or asked necko devs? The chrome URL is unknown based on comment 16. So it's simply too early and the chrome URL registration is not finished completely yet. > Could we start preloading once we have both gecko (in practice > nsContentUtils) and necko up and running? I think figuring that out should be a followup since that has nothing to do with this bug. Probably timing that right will involve some sort of message from parent to content to signal the right time, or just setting some initial process data to tell the process that it's too early to start preloading for this process. But as I said earlier timing preloading won't be enough. Is there a reason to block the preallocated process patch by this work? My fix is a clear improvement over the broken situation we have today and I'd prefer to tackle this problem as a followup, more precisely it would be great if someone who worked on this could fix this problem in a followup. [1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/a97aa77a290a31f6360118f744c4f8544e90a7c3c0d536d1eeb7fde5eb357ec58f4368d707ab044713ac35cc2d69fd4364955e418103dab1265c664f796e6c20
Flags: needinfo?(bugs)
(discussed on IRC, and I think we should fix the actual bug, and not make service first fail during startup and then starting to work later. If it can't work during startup, then its users should use it later.)
Flags: needinfo?(bugs)
Let's try a weaker version. This patch enables retry's only if the first failure comes from a preload. https://treeherder.mozilla.org/#/jobs?repo=try&revision=efb59a47c3cbe22a3288574c215a8897890b5238
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #22) > Let's try a weaker version. This patch enables retry's only if the first > failure comes from a preload. > > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=efb59a47c3cbe22a3288574c215a8897890b5238 So the failure comes from preloading. We can do 2 things: [1] Do what the patch in Comment 22 does and let nsStringBundle retry the load ONLY if the first failing attempt came from a preload. (so if the preload fails we just ignore it and try again later but only once as the original API) [2] Throttle the preload until it is guaranteed to succeed (all the chrome URLs are registered in the content process) [2] Seems like the cleaner approach but it might come with some consistent performance cost and we will always have to worry about the right timing of the preload in early content processes. I'm not quite sure what should we wait for to make sure that the given string bundle is ready to be loaded or what sort of performance cost this approach might have. [1] is a bit less predictable, it's a hit or miss for preload with high probability of success. In case of success it's a bit faster than [2] in case of failure probably a bit slower. I guess we should somehow aim for [2] even though I'm not sure what should trigger the preload then, but let's ask Kris since he probably knows better this code than me. Kris, what's your preference here?
Flags: needinfo?(kmaglione+bmo)
FWIW, I like 2. Could we initiate async preload (using idle period) once ContentChild::RecvRegisterChrome is called? Basically not call nsContentUtils::AsyncPrecreateStringBundles() in nsContentUtils::Init in child process but in ContentChild::RecvRegisterChrome. Is this StringBundle issue such that we need to fix it in FF57 too.
So, we don't actually use the URLPreloader in the content process. Or, rather, we use the API, but it always does a synchronous read when called in the content process (and when called in the parent process after startup is over). If it's failing, chances are it's because it's being initialized too late. We normally initialize the URLPreloader from the ScriptPreloader initializer, and in the content process, that happens via an async message sent from the ContentParent. That can be fixed by just calling URLPreloader::GetSingleton() somewhere like the ContentChild constructor.
Flags: needinfo?(kmaglione+bmo)
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25) > So, we don't actually use the URLPreloader in the content process. Or, > rather, we use the API, but it always does a synchronous read when called in > the content process (and when called in the parent process after startup is > over). > > If it's failing, chances are it's because it's being initialized too late. > We normally initialize the URLPreloader from the ScriptPreloader > initializer, and in the content process, that happens via an async message > sent from the ContentParent. That can be fixed by just calling > URLPreloader::GetSingleton() somewhere like the ContentChild constructor. I tried calling URLPreloader before the async preload call from various places from the content process but it only gave me crashes. (In reply to Olli Pettay [:smaug] from comment #24) > FWIW, I like 2. Could we initiate async preload (using idle period) once > ContentChild::RecvRegisterChrome is called? Basically not call > nsContentUtils::AsyncPrecreateStringBundles() in nsContentUtils::Init in > child process but in ContentChild::RecvRegisterChrome. > > Is this StringBundle issue such that we need to fix it in FF57 too. Let's go with this version. It's the safest option IMO too. I don't know how could this preload ever work without the child learning about the registered chrome urls. About 57... I'm not sure, how late are we for uplifts? Seems to be more risky not to uplift it...
Attachment #8912751 - Attachment is obsolete: true
Attachment #8913657 - Flags: review?(bugs)
yeah, feels like more risky to not uplift. You may want to move the patch to another bug and ask beta approval there. But I'm about to review the patch here anyhow.
Comment on attachment 8913657 [details] [diff] [review] Delay nsStringBundle preloading in content processes. v1 >+ static bool preloadDone = false; >+ if (preloadDone) { This is wrong. preloadDone gets never set to true. Should be !preloadDone
Attachment #8913657 - Flags: review?(bugs) → review-
Assuming tryserver looks good, r+ to the patch you pushed there.
(In reply to Olli Pettay [:smaug] from comment #27) > You may want to move the patch to > another bug and ask beta approval there. Bug 1404383.
Depends on: 1404383
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
some perf improvements: == Change summary for alert #9762 (as of October 02 2017 11:54 UTC) == Improvements: 7% tart summary windows7-32 pgo e10s 4.22 -> 3.93 7% tart summary windows7-32 opt e10s 5.20 -> 4.86 6% tart summary windows10-64 opt e10s 4.33 -> 4.07 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9762
Depends on: 1423261
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: