Bugzilla

Comment 10

•

7 years ago

backing this out appears to cause an improvement in awsy memory measurements:

== Change summary for alert #9135 (as of August 30 2017 22:48 UTC) ==

Improvements:

  3%  Explicit Memory summary linux64 opt    297,866,944.90 -> 290,115,887.16
  2%  Heap Unclassified summary linux64 opt  54,748,050.26 -> 53,583,599.85

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9135

Assignee

Comment 11

•

7 years ago

So the failures here are only happening in a setup that we don't run on try. If I understand it correctly, that means they are tier 2 tests, and patches should not be backed out because of them. I cannot reproduce the failures locally and believe that it's just some stupid timing issue in the reftest harness. Can I re-land the patch as it is? It would help to figure out whether there is any other issue going on with this feature and detect it in time. Otherwise I have no idea how long it will take to figure out why these tests are failing on this specific machine. One thing is for sure: it is the process reuse that makes these tests fail. As another option, I can just turn that part off for reftests temporarily until I can figure out what's going on.

Flags: needinfo?(gkrizsanits) → needinfo?(jmaher)

Ryan VanderMeulen [:RyanVM]

Comment 12

•

7 years ago

If we want to land it fully and see if it sticks, that is fine; a few lower-frequency failures would be OK to live with. What is the work required to turn this off for reftests only? Could we do that after the fact if we find the failures are too frequent?

Flags: needinfo?(jmaher)

Comment 13

•

7 years ago

To be clear, Windows 8 reftests are *not* Tier 2. You do have to opt into them explicitly in your Try syntax, however.

Assignee

Comment 14

•

7 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> To be clear, Windows 8 reftests are *not* Tier 2. You do have to opt into
> them explicitly in your Try syntax, however.

Alright, then I guess I have no choice and have to figure out how to fix this properly even if that might take long.

Assignee

Comment 15

•

7 years ago

So it seems like something goes wrong in nsStringBundle::LoadProperties in the first content process. We usually kill that process early, so the failure causes no harm. But this patch attempts to reuse that process, and since the function keeps failing after the first failed LoadProperties attempt, we cannot get the localized string for anything. The failing reftests all use some sort of localized strings. It's been fun and joy that this code just keeps failing silently while the entire browser is probably broken without localized strings.

Kris, you worked on this area recently and I suspect your patches (bug 1363482) are causing this issue (this used to work a couple of months ago, so either that or Bug 1377377). Do you have any idea what might be going on?

First, I think if something fails in this method we should handle it better (in a debug build it should be a crash for sure). Second, I have the feeling that the try-once-and-then-never-again approach is not the best any more...

I can probably just mark the process not to be reused, land my patch and move on, but this bug scares me a bit, so I don't want to just walk away without investigating it further. A content process without localized string access (with all this code that just fails silently and renders everything incorrectly) might be a worse user experience than a crash, so let's try to fix this.

So my plan is to figure out what fails (ofc I cannot reproduce it locally, so that's hard). If that takes too much time, I will mark the process, land this patch, file a followup to handle failure better, and then see where that other bug will lead us. Does that plan make sense to you? Do you have any ideas what might fail?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fae22fd38442a9d611ea70d19396497844d4b99&selectedJob=131918009

Depends on: 1363482

Flags: needinfo?(kmaglione+bmo)

Assignee

Comment 16

•

7 years ago

I guess the reason for the failure is this:

{"source": "reftest harness", "process": "ProcessReader", "thread": "ProcessReader", "time": 1505807737494, "action": "process_output", "data": "Couldn't convert chrome URL: chrome://branding/locale/brand.properties", "pid": 6828}

Note that this happens really early in the startup.

Assignee

Comment 17

•

7 years ago

The channel creation sometimes fails if we're too early in the startup:

http://searchfox.org/mozilla-central/rev/3dbb47302e114219c53e99ebaf50c5cb727358ab/intl/strres/nsStringBundle.cpp#106

So it is probably unrelated to your patch. If I remove the bits that prevent subsequent attempts, after a few tries it always works:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=464da6e565ba4050439b40669a0ac6f14d8de5d6&selectedJob=133569282

I'm not sure who owns this code though...
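To make the "try once and then never again" behavior concrete, here is a minimal sketch. It is not the real nsStringBundle code: BundleSketch, its loader callback and the allowRetry switch are all hypothetical names made up for illustration. It only shows how a latched first failure hides a load that would succeed a moment later.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a string bundle; this is NOT the real nsStringBundle.
class BundleSketch {
 public:
  BundleSketch(std::function<bool()> loader, bool allowRetry)
      : mLoader(std::move(loader)), mAllowRetry(allowRetry) {
    assert(mLoader && "a loader callback is required");
  }

  // Returns true once the properties have been loaded.
  bool LoadProperties() {
    if (mLoaded) {
      return true;
    }
    if (mAttempted && !mAllowRetry) {
      return false;  // Latched: the first failure is remembered forever.
    }
    mAttempted = true;
    mLoaded = mLoader();  // May fail if called too early in startup.
    return mLoaded;
  }

 private:
  std::function<bool()> mLoader;
  bool mAllowRetry;
  bool mAttempted = false;
  bool mLoaded = false;
};
```

With allowRetry set to false, a loader that fails once keeps every later caller from getting strings; with it set to true, callers recover as soon as the loader starts working, at the cost of repeating the failed work.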

Flags: needinfo?(kmaglione+bmo)

Assignee

Comment 18

•

7 years ago

Attached patch Let nsStringBundle retry LoadProperties. v1 (obsolete) (deleted) — Details — Splinter Review

Olli, could you review this patch? Or if not, do you know who is the right person to talk to? I have no idea who owns this code...

Attachment #8912751 - Flags: review?(bugs)

Comment 19

•

7 years ago

Comment on attachment 8912751 [details] [diff] [review] Let nsStringBundle retry LoadProperties. v1

So in which case does the load fail? I assume when http://searchfox.org/mozilla-central/rev/3dbb47302e114219c53e99ebaf50c5cb727358ab/dom/base/nsContentUtils.cpp#804 triggers the preload, am I right? If so, shouldn't we move that preloading to a point where we know that loading should succeed, like when necko is up and running or so? We basically don't want to regress Bug 1377377 in the cases where preloading has failed.

Have you looked at necko to see why loading might fail, or asked necko devs? Could we start preloading once we have both gecko (in practice nsContentUtils) and necko up and running?

Attachment #8912751 - Flags: review?(bugs) → review-

Assignee

Comment 20

•

7 years ago

(In reply to Olli Pettay [:smaug] from comment #19)
> Comment on attachment 8912751 [details] [diff] [review]
> Let nsStringBundle retry LoadProperties. v1
>
> So in which case does the load fail? I assume when
> http://searchfox.org/mozilla-central/rev/3dbb47302e114219c53e99ebaf50c5cb727358ab/dom/base/nsContentUtils.cpp#804
> triggers the preload, am I right?

Possibly. I haven't checked, and not being able to reproduce it locally does not help. But based on the logs [1], it seems like there are several failed attempts from various places, so simply fixing the preloading part won't be enough.

> If so, shouldn't we move that preloading happening when we know that loading
> should succeed, like when necko is up and running or so.

I'm planning to file a followup, as this code is clearly not working as intended, but that should be a separate bug.

> We don't want to basically regress Bug 1377377 in some cases when preloading
> has failed.

Note that loading actual content happens much later, so that is very unlikely. And if this situation occurs (and it does, just luckily in a content process we throw away), the current behavior is a completely broken content process that cannot render content that has localized strings. I would prefer some non-optimal initial performance over completely broken behavior any time of the day.

> Have you looked at necko why loading might fail, or asked necko devs?

The chrome URL is unknown, based on comment 16. So it's simply too early, and the chrome URL registration is not completely finished yet.

> Could we start preloading once we have both gecko (in practice
> nsContentUtils) and necko up and running?

I think figuring that out should be a followup, since that has nothing to do with this bug. Timing that right will probably involve some sort of message from parent to content to signal the right time, or just setting some initial process data to tell the process that it's too early to start preloading.

But as I said earlier, timing the preloading won't be enough. Is there a reason to block the preallocated process patch on this work? My fix is a clear improvement over the broken situation we have today, and I'd prefer to tackle this problem as a followup. More precisely, it would be great if someone who worked on this could fix this problem in a followup.

[1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/a97aa77a290a31f6360118f744c4f8544e90a7c3c0d536d1eeb7fde5eb357ec58f4368d707ab044713ac35cc2d69fd4364955e418103dab1265c664f796e6c20

Flags: needinfo?(bugs)

Comment 21

•

7 years ago

(Discussed on IRC. I think we should fix the actual bug, and not make the service first fail during startup and then start working later. If it can't work during startup, then its users should use it later.)

Flags: needinfo?(bugs)

Assignee

Comment 22

•

7 years ago

Let's try a weaker version. This patch enables retries only if the first failure comes from a preload.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=efb59a47c3cbe22a3288574c215a8897890b5238

Assignee

Comment 23

•

7 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #22)
> Let's try a weaker version. This patch enables retry's only if the first
> failure comes from a preload.
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=efb59a47c3cbe22a3288574c215a8897890b5238

So the failure comes from preloading. We can do 2 things:

[1] Do what the patch in Comment 22 does and let nsStringBundle retry the load ONLY if the first failing attempt came from a preload. (So if the preload fails, we just ignore it and try again later, but only once, as in the original API.)

[2] Throttle the preload until it is guaranteed to succeed (all the chrome URLs are registered in the content process).

[2] seems like the cleaner approach, but it might come with some consistent performance cost, and we will always have to worry about the right timing of the preload in early content processes. I'm not quite sure what we should wait for to make sure that the given string bundle is ready to be loaded, or what sort of performance cost this approach might have.

[1] is a bit less predictable: it's hit or miss for the preload, with a high probability of success. In case of success it's a bit faster than [2]; in case of failure, probably a bit slower.

I guess we should somehow aim for [2], even though I'm not sure what should trigger the preload then. But let's ask Kris, since he probably knows this code better than I do. Kris, what's your preference here?
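Option [1] can be sketched roughly as follows. This is an illustration only, not the patch from Comment 22: PreloadTolerantBundle, the Caller enum and the loader callback are hypothetical names. The idea shown is that a failed preload is not recorded as "the" attempt, so the next regular caller still gets its single try.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical: who is asking for the bundle to be loaded.
enum class Caller { Preload, Normal };

// Hypothetical stand-in for a string bundle; NOT the real nsStringBundle.
class PreloadTolerantBundle {
 public:
  explicit PreloadTolerantBundle(std::function<bool()> loader)
      : mLoader(std::move(loader)) {
    assert(mLoader && "a loader callback is required");
  }

  bool LoadProperties(Caller caller) {
    if (mLoaded) {
      return true;
    }
    if (mAttempted) {
      return false;  // A regular attempt already failed: stay failed.
    }
    const bool ok = mLoader();
    if (!ok && caller == Caller::Preload) {
      return false;  // Ignore a failed preload: do not latch the failure.
    }
    mAttempted = true;
    mLoaded = ok;
    return ok;
  }

 private:
  std::function<bool()> mLoader;
  bool mAttempted = false;
  bool mLoaded = false;
};
```

Note that this sketch would let a second preload call try again as well; a real implementation would have to decide whether that is acceptable.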

Flags: needinfo?(kmaglione+bmo)

Comment 24

•

7 years ago

FWIW, I like 2. Could we initiate the async preload (using an idle period) once ContentChild::RecvRegisterChrome is called? Basically, in the child process, don't call nsContentUtils::AsyncPrecreateStringBundles() in nsContentUtils::Init but in ContentChild::RecvRegisterChrome.

Is this StringBundle issue such that we need to fix it in FF57 too?
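The ordering proposed here can be sketched with a toy model. Everything below is hypothetical (ChildProcessSketch, its members and the idle queue are made-up stand-ins, not Gecko classes); the method names only echo the ones mentioned above. It shows the preload being queued from the chrome registration message instead of from early init, so the chrome URL is already known when the preload runs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Toy model of a content process; NOT the real ContentChild.
struct ChildProcessSketch {
  std::set<std::string> registeredChrome;
  std::queue<std::function<void()>> idleTasks;  // Stand-in for idle dispatch.
  bool preloadSucceeded = false;

  // The preload only works once the chrome URL has been registered.
  void PreloadBundles() {
    preloadSucceeded =
        registeredChrome.count("chrome://branding/locale/brand.properties") > 0;
  }

  // Early init: deliberately does not schedule the preload in the child.
  void Init() {}

  // Registration message from the parent: record the URLs, then queue the
  // preload for a later idle period.
  void RecvRegisterChrome(const std::set<std::string>& urls) {
    registeredChrome.insert(urls.begin(), urls.end());
    idleTasks.push([this] { PreloadBundles(); });
  }

  void RunIdleTasks() {
    while (!idleTasks.empty()) {
      std::function<void()> task = std::move(idleTasks.front());
      idleTasks.pop();
      assert(task && "queued tasks must be callable");
      task();
    }
  }
};
```

The design trade-off is the one discussed in the thread: the preload starts a little later, but it can no longer run before the child has learned about the registered chrome URLs.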

Kris Maglione [:kmag]

Comment 25

•

7 years ago

So, we don't actually use the URLPreloader in the content process. Or, rather, we use the API, but it always does a synchronous read when called in the content process (and when called in the parent process after startup is over). If it's failing, chances are it's because it's being initialized too late. We normally initialize the URLPreloader from the ScriptPreloader initializer, and in the content process, that happens via an async message sent from the ContentParent. That can be fixed by just calling URLPreloader::GetSingleton() somewhere like the ContentChild constructor.

Flags: needinfo?(kmaglione+bmo)

Assignee

Comment 26

•

7 years ago

Attached patch Delay nsStringBundle preloading in content processes. v1 (deleted) — Details — Splinter Review

(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25)
> So, we don't actually use the URLPreloader in the content process. Or,
> rather, we use the API, but it always does a synchronous read when called in
> the content process (and when called in the parent process after startup is
> over).
>
> If it's failing, chances are it's because it's being initialized too late.
> We normally initialize the URLPreloader from the ScriptPreloader
> initializer, and in the content process, that happens via an async message
> sent from the ContentParent. That can be fixed by just calling
> URLPreloader::GetSingleton() somewhere like the ContentChild constructor.

I tried calling URLPreloader before the async preload call from various places in the content process, but it only gave me crashes.

(In reply to Olli Pettay [:smaug] from comment #24)
> FWIW, I like 2. Could we initiate async preload (using idle period) once
> ContentChild::RecvRegisterChrome is called? Basically not call
> nsContentUtils::AsyncPrecreateStringBundles() in nsContentUtils::Init in
> child process but in ContentChild::RecvRegisterChrome.
>
> Is this StringBundle issue such that we need to fix it in FF57 too.

Let's go with this version. It's the safest option IMO too. I don't know how this preload could ever work without the child learning about the registered chrome URLs.

About 57... I'm not sure, how late are we for uplifts? It seems more risky not to uplift it...

Attachment #8912751 - Attachment is obsolete: true

Attachment #8913657 - Flags: review?(bugs)

Comment 27

•

7 years ago

Yeah, it feels more risky not to uplift. You may want to move the patch to another bug and ask for beta approval there. But I'm about to review the patch here anyhow.

Comment 28

•

7 years ago

Comment on attachment 8913657 [details] [diff] [review] Delay nsStringBundle preloading in content processes. v1

>+ static bool preloadDone = false;
>+ if (preloadDone) {

This is wrong. preloadDone never gets set to true. Should be !preloadDone
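For reference, the run-once guard the review asks for would look roughly like this. Only the two quoted lines come from the patch; the function name, the counter and the rest of the body are assumptions added so the effect can be observed.

```cpp
#include <cassert>

// Hypothetical counter standing in for the real preload work.
static int gPreloadCount = 0;

// Hypothetical wrapper; the real patch places the guard inside existing code.
void MaybePreloadStringBundles() {
  static bool preloadDone = false;
  if (!preloadDone) {  // Negated check, as requested in the review.
    preloadDone = true;
    ++gPreloadCount;  // Stand-in for kicking off the async preload.
    assert(gPreloadCount == 1 && "the guarded body must run only once");
  }
}
```

With the un-negated condition from the patch, the body would never run, because preloadDone starts out false and nothing else sets it.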

Attachment #8913657 - Flags: review?(bugs) → review-

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0b658abc1cc9155036b8dd9acb203cbb1d8c94e

Assignee

Comment 29

•

7 years ago

Comment 30

•

7 years ago

Assuming tryserver looks good, r+ to the patch you pushed there.

Assignee

Comment 31

•

7 years ago

(In reply to Olli Pettay [:smaug] from comment #27) > You may want to move the patch to > another bug and ask beta approval there. Bug 1404383.

Depends on: 1404383

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8e9cd055e2c597ba90290aa198dcd4296b9f46f

Assignee

Comment 32

•

7 years ago

Pulsebot

Comment 33

•

7 years ago

Pushed by gkrizsanits@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/5ef005eb34d9 Reenable the preallocated process. r=mrbkap

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 34

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5ef005eb34d9

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox58: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla58

Comment 35

•

7 years ago

some perf improvements:

== Change summary for alert #9762 (as of October 02 2017 11:54 UTC) ==

Improvements:

  7%  tart summary windows7-32 pgo e10s   4.22 -> 3.93
  7%  tart summary windows7-32 opt e10s   5.20 -> 4.86
  6%  tart summary windows10-64 opt e10s  4.33 -> 4.07

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9762