Closed Bug 1458447 Opened 6 years ago Closed 6 years ago

7.88% installer size (linux32) regression on push cdd70d25704ec2791fe02cd26104b8fd7b45df34 (Fri Apr 27 2018)

Categories

Product/Component: Core :: Gecko Profiler
Type: defect
Platform: x86 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 - wontfix
firefox62 - wontfix

People

(Reporter: igoldan, Unassigned)

References

Details

(Keywords: regression)

We have detected a build metrics regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=cdd70d25704ec2791fe02cd26104b8fd7b45df34

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

  8%  installer size linux32 pgo      58,437,611.17 -> 63,040,222.42
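
Those byte counts work out to the 7.88% in the bug summary; a quick sanity check (plain Python, with the sizes copied from the alert above):

  # Sanity-check the regression percentage reported by the perf alert.
  before = 58437611.17  # linux32 pgo installer size before the push, in bytes
  after = 63040222.42   # linux32 pgo installer size after the push, in bytes
  print(round((after - before) / before * 100, 2))  # -> 7.88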


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=12961

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
Component: Untriaged → Gecko Profiler
Product: Firefox → Core
:m_kato The installer size grew a lot, so can you look over this?
Flags: needinfo?(m_kato)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> :m_kato The installer size grew a lot, so can you look over this?

Hmm, according to Perfherder, my landing brought the binary size back to where it was on April 16.

Has anyone investigated the reason for [*1]?  I suspect infra...

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=00d978a04775bd28ddeedad7c33fd232868ae659&tochange=02df9dc8a300c705cdaabebf52d9e96a6fa906a3

down ~7%

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=02df9dc8a300c705cdaabebf52d9e96a6fa906a3&tochange=057c58bcb1b19b282e6a9afe20cd90cee87244a4

up ~7% [*1] (no C++ change)

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=ae7c124af9f03b88cf5efd53d4d2f9383fa959f2&tochange=a0a108e7e7b34e2f90a9298d225ae811024615dd

down ~7% (only JS changed!!)
Flags: needinfo?(m_kato) → needinfo?(igoldan)
(In reply to Makoto Kato [:m_kato] (offline or slow 5/3-5/6) from comment #2)
> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> > :m_kato The installer size grew a lot, so can you look over this?
> 
> Hmm, according to Perfherder, my landing brought the binary size back to
> where it was on April 16.
> 
> Has anyone investigated the reason for [*1]?  I suspect infra...
> 
> https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=00d978a04775bd28ddeedad7c33fd232868ae659&tochange=02df9dc8a300c705cdaabebf52d9e96a6fa906a3
> 
> down ~7%
> 
> https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=02df9dc8a300c705cdaabebf52d9e96a6fa906a3&tochange=057c58bcb1b19b282e6a9afe20cd90cee87244a4
> 
> up ~7% [*1] (no C++ change)
> 
> https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=ae7c124af9f03b88cf5efd53d4d2f9383fa959f2&tochange=a0a108e7e7b34e2f90a9298d225ae811024615dd
> 
> down ~7% (only JS changed!!)

:bholley m_kato made a good point. There's something strange about this regression, because the improvements from April 16th don't seem to make sense. Can you validate this regression alert?
Flags: needinfo?(igoldan) → needinfo?(sledru)
Sorry, instead of :bholley I should have written :sledru.
Interestingly, there's a bump and a drop in compiler warnings at the same times, on April 17 and April 23, matching the size changes.
For some reason it doesn't become hot code, so its size is small as of April 16.

With my landing, SharedLibraryInfo::GetInfoForSelf might become hot code, so the size is bigger.  I am still looking at the generated code...
Correction: s/it doesn't become hot code/everything becomes cold code for PGO/
There's apparently a change in the number of -Wmaybe-uninitialized warnings.

Overall, this all smells like PGO not really working properly, e.g. Firefox crashing during profiling, which is actually a known issue: see https://bugzilla.mozilla.org/show_bug.cgi?id=1252556#c8
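
For context on why a broken profiling run makes the binary smaller: GCC PGO is a two-phase build. An instrumented binary is built and run to collect .gcda profile data, and the final build then uses that data; functions the profile marks as cold (or that have no usable profile data at all, e.g. because the training run crashed) are optimized for size, while hot functions get the more aggressive, size-increasing optimizations. A minimal toy sketch of that flow, assuming gcc is on PATH (this is just an illustration, not the Firefox build):

  # toy_pgo.py - hypothetical illustration of GCC's two-phase PGO flow
  import pathlib
  import subprocess

  # A trivial C program; in the real build the "hot" paths come from the
  # profiling workload (profileserver.py in Firefox's case).
  pathlib.Path("toy.c").write_text(
      "int work(int n){int s=0;for(int i=0;i<n;i++)s+=i;return s;}\n"
      "int main(void){return work(1000) ? 0 : 1;}\n"
  )

  # Phase 1: build with instrumentation and run it; the run writes toy.gcda.
  subprocess.run(["gcc", "-O2", "-fprofile-generate", "toy.c", "-o", "toy-instr"], check=True)
  subprocess.run(["./toy-instr"], check=True)

  # Phase 2: rebuild using the collected profile.  If the instrumented run
  # had crashed and left no usable .gcda, everything would look cold and the
  # optimizer would favour size over speed.
  subprocess.run(["gcc", "-O2", "-fprofile-use", "toy.c", "-o", "toy-pgo"], check=True)

  for exe in ("toy-instr", "toy-pgo"):
      print(exe, pathlib.Path(exe).stat().st_size, "bytes")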
I see.

Before my landing:
https://taskcluster-artifacts.net/fx46e6hNRaOcVfRQckpalw/0/public/logs/live_backing.log

[task 2018-04-27T11:57:52.450Z] 11:57:52     INFO -  MOZ_PGO_INSTRUMENTED=1 JARLOG_FILE=jarlog/en-US.log /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python /builds/worker/workspace/build/src/build/pgo/profileserver.py 10
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::GenerateDump cloned child 32581
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  jarlog: /builds/worker/workspace/build/src/obj-firefox/jarlog/en-US.log
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::GenerateDump cloned child 32614
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2018-04-27T11:57:52.451Z] 11:57:52     INFO -  ExceptionHandler::WaitForContinueSignal waiting for continue signal...

After my landing:
https://taskcluster-artifacts.net/cnX-KOQKSlKpLGml_IMIVg/0/public/logs/live_backing.log

[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  MOZ_PGO_INSTRUMENTED=1 JARLOG_FILE=jarlog/en-US.log /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python /builds/worker/workspace/build/src/build/pgo/profileserver.py 10
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  ###!!! [Parent][DispatchAsyncMessage] Error: PClientManager::Msg_Teardown Route error: message sent to unknown actor ID
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 408: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_close
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  jarlog: /builds/worker/workspace/build/src/obj-firefox/jarlog/en-US.log
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  *** WIFI GEO: shutdown called
[task 2018-04-27T12:37:45.709Z] 12:37:45     INFO -  make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'

So PGO works properly after my landing.
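
The Breakpad crash-handler lines in the "before" log are easy to check for mechanically. A rough, hypothetical sketch of how that comparison could be automated (this is not existing infra; bug 1252556 tracks the general problem of detecting a broken profiling stage):

  # check_pgo_log.py - hypothetical check: did the PGO profiling run crash?
  import sys
  import urllib.request

  CRASH_MARKERS = (
      "ExceptionHandler::GenerateDump",
      "ExceptionHandler::WaitForContinueSignal",
  )

  def profiling_run_crashed(log_url):
      """Return True if the live_backing.log shows Breakpad catching a crash."""
      with urllib.request.urlopen(log_url) as resp:
          log_text = resp.read().decode("utf-8", errors="replace")
      return any(marker in log_text for marker in CRASH_MARKERS)

  if __name__ == "__main__":
      # e.g. pass the two live_backing.log URLs quoted above
      for url in sys.argv[1:]:
          print(url, "-> crashed" if profiling_run_crashed(url) else "-> looks ok")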
Ionuț, sorry but I am not sure what you expect from me here?
Flags: needinfo?(sledru) → needinfo?(igoldan)
Before bug 1350500 landed, the PGO process didn't work correctly.  And our infra cannot detect whether the profiling stage of PGO works (bug 1252556).  So until bug 1252556 is fixed, the binary size of the PGO package is unreliable.

Ionuț, should we keep this bug open?
Also, since we don't run Talos on Linux 32-bit builds, what we don't see is that while PGO was broken (i.e. before *this* regression), performance was degraded. In other words, if Talos were running on Linux 32-bit builds, we would have seen performance improvements at the same time as this installer size regression.
(In reply to Sylvestre Ledru [:sylvestre] from comment #10)
> Ionuț, sorry but I am not sure what you expect from me here?

I wanted to know what happened around April 17: was there a real installer size reduction, or did the test break and report erroneous results?
Flags: needinfo?(igoldan)
(In reply to Makoto Kato [:m_kato] from comment #11)
> Before bug 1350500 landed, the PGO process didn't work correctly.  And our
> infra cannot detect whether the profiling stage of PGO works (bug 1252556).
> So until bug 1252556 is fixed, the binary size of the PGO package is
> unreliable.
> 
> Ionuț, should we keep this bug open?

My conclusion here is that PGO went wrong. This made the test report fake improvements. Then things returned to normal.
I'd say we resolve this as worksforme, as it was a temporary glitch with the infra.
Doesn't sound like there's a real issue here, other than the whole "we managed to break Linux32 PGO without knowing it" thing. We could definitely use a follow-up bug about being better able to detect when that happens, but this isn't the place for that discussion.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID