Closed Bug 887192 Opened 11 years ago Closed 11 years ago

GonkHal should write lowmemorykiller parameters in increasing order

Categories: Firefox OS Graveyard :: General (defect)
Platform: ARM / Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal
Tracking

(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18? fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

VERIFIED FIXED
1.1 QE4 (15jul)
blocking-b2g leo+

People

(Reporter: alan.yenlin.huang, Assigned: alan.yenlin.huang)

References

Details

(Keywords: perf, Whiteboard: [c= ][LeoVB+])

Attachments

(1 file, 3 obsolete files)

I found that we currently write the lowmemorykiller parameters in decreasing order to "/sys/module/lowmemorykiller/parameters/minfree" and "/sys/module/lowmemorykiller/parameters/adj".

$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
6,4,3,2,1,0
$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
5120,2048,1792,1536,1280,1024

In lowmemorykiller.c, the code assumes that these parameters are in increasing order. It calculates the number of free pages and compares it against the minfree entries in order, starting with the first. The loop breaks as soon as the number of free pages is less than the minfree entry it has in hand.

in lowmem_shrink of lowmemorykiller.c:

>        for (i = 0; i < array_size; i++) {
>                if (other_free < lowmem_minfree[i] &&
>                    other_file < lowmem_minfree[i]) {
>                        min_adj = lowmem_adj[i];
>                        break;
>                }
>        }

So if we set the lowmemorykiller parameters in decreasing order, lowmem_shrink in lowmemorykiller.c produces unexpected results.

I think this bug comes from Bug 870181 patch part 1, and the fix should be easy: just invert the loop through NUM_PROCESS_PRIORITY in EnsureKernelLowMemKillerParamsSet() in GonkHal.cpp.
blocking-b2g: --- → tef?
Attached patch Invert the loop through NUM_PROCESS_PRIORITY (obsolete) (deleted) — Splinter Review
Hello Justin,
I just inverted the loop that goes through all process priorities in this patch. Please help review this, thanks!
Attachment #767802 - Flags: review?(justin.lebar+bug)
Comment on attachment 767802 [details] [diff] [review]
Invert the loop through NUM_PROCESS_PRIORITY

Ouch, thanks for finding this bug, Alan.

Can you please add an assertion in the code that we set these params in the right order, and add a comment briefly explaining why this is important?
Attachment #767802 - Flags: review?(justin.lebar+bug) → feedback+
As I read the kernel code, the effect of this bug is that when there's less than 5120 * 4KB = 20MB free, we'll always kill some process, even if there are no bg processes present.  After this bug is fixed, we'll kill a fg process only when there's less than 1280 * 4KB = 5MB free.

So this patch lets fg apps use 15MB more memory, which is great.

However, with or without this patch, we still kill processes in strict order of priority.

Is that right?

I'm not totally sure that we want to take this for 1.1, because the effect of letting processes use more memory is that we'll have less memory available for the file cache, which has an impact on performance.  OTOH, this effectively expands the amount of memory on our phone significantly, which should help us avoid some OOM crashes, and I know partners care dearly about those.
blocking-b2g: tef? → leo?
This might explain some of the behavior we were seeing when receiving phone calls & memory pressure events at the same time, I definitely need to re-test with this patch applied.

BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see if they're interested in having the patch uplifted.
> BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see 
> if they're interested in having the patch uplifted.

Great, let me know if there's anything I can do to help.
We'd like to get the leo+ status for this bug because it can bring significant benefits:

- As Justin pointed out in comment 3 it will enable the fg application to get 15 more megs of free memory which is significant on 256MiB devices

- It will make the order in which applications are killed due to OOM more predictable. This is important also for devices with lots of memory; with many applications open the user might find that certain important applications are killed unexpectedly even though they should keep running (such as the music player, see bug 824000)

If we can get the patch through our regular tests the risk factor for taking this patch should be fairly low. The reason is that we already exercise this functionality extensively and we would only be changing its configuration, not the logic behind it. Additionally the affected code is nearly identical between b2g18 and central so we don't need to land different patches. If we can prove in a test that applications will get killed in the proper order then I think we should be pretty safe.
Blocks: 824000
Patch v2; I also added an assertion to make sure we won't break this rule in the future.
Attachment #767802 - Attachment is obsolete: true
Attachment #768148 - Flags: review?(justin.lebar+bug)
> - It will make the order in which applications are killed due to OOM more predictable.

I was trying to say in comment 3 that I don't think this is the case.  I'm looking again and I /think/ I was right, but I'm not sure.

I'm pretty tired; can you look at the kernel source and let me know what you think?
What's weird is, it looks like we'll only kill processes with oom_adj 6 using the lmk, since selected_oom_adj will always be 6.

But then why do we ever see the LMK killing fg apps?  I've also seen the OOM killer get involved, and that makes sense.  But it's not the only thing that kills fg apps.  So I feel like I'm missing something here.

This might explain why I've seen the homescreen app (oom_adj 4) outlive a fg app (oom_adj 2).  That's certainly bad, and this patch will probably fix that.  I'm just not expecting miracles here.
Comment on attachment 768148 [details] [diff] [review]
Write LMK parameters in increasing order, and add assertion when someone breaks the rule

>@@ -1074,21 +1077,33 @@ EnsureKernelLowMemKillerParamsSet()
>     int32_t killUnderMB;
>     if (!NS_SUCCEEDED(Preferences::GetInt(
>           nsPrintfCString("hal.processPriorityManager.gonk.%s.KillUnderMB",
>                           ProcessPriorityToString(priority)).get(),
>           &killUnderMB))) {
>       MOZ_CRASH();
>     }
> 
>+    // LMK in kernel calculates # of free pages, and begins to compare the # 
>+    // of frees pages with the first minfree parameters and so on. The loop 
>+    // in LMK breaks when # of free pages is less than the minfree parameter 
>+    // it has in hand. So, we have to make sure our parameters are in 
>+    // increasing order

It's probably better to say "number" instead of "#".

Reading this, I don't think we can describe in a comment what's going on well
enough for someone who isn't familiar with this bug to understand this.  We
should just say something like

  The LMK silently malfunctions if we assign the parameters in non-increasing
  order; see bug 887192.

>+    MOZ_ASSERT(oomScoreAdj > lowerBoundOfNextOomScoreAdj && 

Nit: Trailing space.

>+               killUnderMB > lowerBoundOfNextKillUnderMB);

Also, please split this assertion in two.

>     // adj is in oom_adj units.
>     adjParams.AppendPrintf("%d,", OomAdjOfOomScoreAdj(oomScoreAdj));
> 
>     // minfree is in pages.
>     minfreeParams.AppendPrintf("%d,", killUnderMB * 1024 * 1024 / PAGE_SIZE);
>+
>+    // Update the lower buonds of next oomScoreAdj and killUnderMB parameters,

s/,/./
s/buonds/bounds/

Actually, I don't think you need this comment at all.  :)

>+    lowerBoundOfNextOomScoreAdj = oomScoreAdj;
>+    lowerBoundOfNextKillUnderMB = killUnderMB;
>   }
Attachment #768148 - Flags: review?(justin.lebar+bug) → review+
(In reply to Justin Lebar [:jlebar] from comment #8)
> I'm pretty tired; can you look at the kernel source and let me know what you
> think?

Sure, I'll have a look ASAP.
Revised the comments and split the assertion statement into two.

I tested this on my unagi, both debug and non-debug builds. I also pushed this to the try server and am waiting for the results. See https://tbpl.mozilla.org/?tree=Try&rev=bf7acaa7ad89 .
Attachment #768148 - Attachment is obsolete: true
Attachment #768281 - Flags: review?(justin.lebar+bug)
Someone on #developers noticed that the try push for this patch used:
"try: -b do -p all -u all -t none"

This causes 160-180 hours of machine time to be used - when the patch was B2G only.

Please choose more specific platforms using http://trychooser.pub.build.mozilla.org/ when you don't need to test everywhere, to avoid using resources unnecessarily (which in the case of the EC2-run jobs directly costs $$$ per push).

Thanks! :-)
(In reply to Ed Morley [:edmorley UTC+1] from comment #13)
> Someone on #developers noticed that the try push for this patch used:
> "try: -b do -p all -u all -t none"
> 
> This causes 160-180 hours of machine time to be used - when the patch was
> B2G only.
Got it, thanks for the reminder.
I didn't notice that I was only touching GonkHal. I stopped the try push, thanks!
> Attachment #768281 [details] [diff] - Flags: review?(justin.lebar+bug@gmail.com)

I gave you r+ on the last patch, so you don't need to ask for another review.

If on the other hand there's something here that you'd like me to look at, I'm happy to; just let me know what it is.
Attachment #768281 - Flags: review?(justin.lebar+bug)
(In reply to Gabriele Svelto [:gsvelto] from comment #6)
> We'd like to get the leo+ status for this bug because it can bring
> significant benefits:
> 
> ...
> 
> If we can get the patch through our regular tests the risk factor for taking
> this patch should be fairly low...

From Triage:

Since the TBPL tests didn't complete, we don't have the low-risk confirmation that Gabriele mentioned in comment #6, so the risk is too high for leo+. Nominating for koi and tracking.
Status: NEW → ASSIGNED
blocking-b2g: leo? → koi?
tracking-b2g18: --- → ?
Keywords: perf
Whiteboard: [c= ]
Maybe we misunderstood each other.

Whether or not this patch passes tests by the time triage happens to look at the patch has nothing to do with how risky the patch is.  That has to do with how loaded our infrastructure is.  We're not making blocking decisions based on infra load, right?

I'm pretty sure Gabriele was saying in comment 6 that he thinks this patch will be safe, in the sense that we'd notice any regressions as part of our automated test suite.  I don't think he was saying that you should not approve for blocking if triage happened to occur before the automated tests completed.
blocking-b2g: koi? → leo?
Allan, can you please push this to try again (for the relevant platforms) so release drivers can make an informed decision here?
Next time you would like completed test runs before making a triage decision, I would really appreciate it if you'd consider asking for those and waiting, instead of proactively marking the bug as blocking-minus.  That would make us in engineering feel more like partners in this process.
(In reply to Justin Lebar [:jlebar] from comment #19)
> Next time you would like completed test runs before making a triage
> decision, I would really appreciate it if you'd consider asking for those
> and waiting, instead of proactively marking the bug as blocking-minus.  That
> would make us in engineering feel more like partners in this process.

Thanks, Justin. Next time I will make the triage decision only after the test runs have completed successfully.

I pushed this to try again and am waiting for the results. I will mark "leo?" again once I get positive results from the tests.

https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603
blocking-b2g: leo? → ---
Attachment #768281 - Attachment is obsolete: true
> I pushed this to try again and am waiting for the results. I will mark "leo?"
> again once I get positive results from the tests.
> 
> https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603

The try server tested this completely and we got positive results for this patch.
blocking-b2g: --- → leo?
Keywords: checkin-needed
https://hg.mozilla.org/projects/birch/rev/a5c067c78ffc
Keywords: checkin-needed
Hi Alan & Justin,
Since we have a crash on the modem, Bug 886217, we have tested a lot using marionette scripts. During the tests, what we found is that when the system enters an OOM state because of a memory leak in the b2g process, it starts to kill the OOP applications one by one. After this happens a number of times, it finally kills the Homescreen app and then tries to reopen it, but it can't open it properly, and it keeps killing and reopening it again and again.
I think we need to take this patch for 1.1 to fix this unexpected LMK behavior.
I will change the flag to leo+. Please consider this.
blocking-b2g: leo? → leo+
(In reply to Jinho Hwang [:jeffhwang] from comment #24)
> I think we need to take this patch for 1.1 to fix this unexpected behavior
> of LMK.
> I will change the flag to leo+. Please consider this.

I wouldn't expect this patch to fix the problem you describe.

If the phone is so low on memory that the homescreen immediately dies, then that's probably because the main process is using a lot of memory.  That's a problem, but it does not have anything to do with this fix.
m-c may be closed for a while, so I've gone ahead and pushed to b2g18 based on the patch being green on birch.

https://hg.mozilla.org/releases/mozilla-b2g18/rev/561f78c43eb7
Target Milestone: --- → 1.1 QE4 (15jul)
https://hg.mozilla.org/mozilla-central/rev/a5c067c78ffc
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Hi, Alan and all,

Thanks for your effort on this bug.
The patch works as expected.
The following is the detail information that QA collected.
-----------------------------------------------------------------
* Defect/Bug information:
  When the user launches more than 4 apps on Fx OS, only 2~3 apps survive.
  The other apps are "randomly" killed by the kernel's memory management.
  Please refer to the following video.
  -> https://dc1.safesync.com/LMmDYKWv/VIDEO0008.mp4?a=kwBNaKoMF6Q

* Root cause:
  In "lowmemorykiller.c", the wrong order of minfree caused this problem.

* Reproduce steps:
  1. Launch several apps on Fx OS. (Ex: dialler -> Message -> Contact -> Browser -> FM radio -> Settings)
  2. Switch to card view
  @ Expected result: The card view shows app previews in sequence
      (Settings -> FM radio -> Browser -> Contact -> Message -> dialler).
      The "Message" and dialler apps may be killed by the system as default behavior.
  @ Actual result: The launched apps are randomly killed by the system and only 2~3 apps remain in the card view.

* Verified build:
   + Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10
      - HG - Gecko : "a078412b1671"
      - HG - Gaia : "NA"
      - Git - Gecko: "38e6b257a832fbd6c91fac54bf8a8c5d3572e3fb"
      - Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"
   + Mozilla-b2g18_v1_1_0_hd-unagi/2013-07-03-07-02-14
      - HG - Gecko : "6f874c20e68e"
      - HG - Gaia : "NA"
      - Git - Gecko: "NA"
      - Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"

* Verified methods:
  1. Command line:
      william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
      1024,1280,1536,1792,2048,5120 -> (Pass, it shows in increasing order)
      william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
      0,1,2,3,4,6 -> (Pass, it shows in increasing order)
  2. Follow the reproduction steps mentioned above. You can click the following link for the details.
      -> https://dc1.safesync.com/LMmDYKWv/VIDEO0009.mp4?a=fE0C7WFP6K0
Status: RESOLVED → VERIFIED
Whiteboard: [c= ] → [c= ][LeoVB+]