track macos cpu usage/temp in influx
Categories
(Infrastructure & Operations :: RelOps: Posix OS, task)
Tracking
(Not tracked)
People
(Reporter: dhouse, Assigned: dhouse)
Attachments
(4 files)
Start collecting CPU usage, temperature, and throttling metrics on the Mac minis.
Confirmed that we can get the temperature and throttling data under macOS Mojave (I also tested both on a Mojave Mac mini):
~$ sysctl machdep.xcpm.cpu_thermal_level
machdep.xcpm.cpu_thermal_level: 48
$ sysctl "machdep.xcpm."|grep thermal_level
machdep.xcpm.cpu_thermal_level: 71
machdep.xcpm.gpu_thermal_level: 0
machdep.xcpm.io_thermal_level: 12
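The sysctl output above is plain `name: value` text, so it is easy to turn into numeric readings. The following is a minimal sketch, assuming the output has already been captured as a string; the function name `parse_sysctl` is hypothetical and not part of any existing tool.

```python
def parse_sysctl(output):
    """Parse `name: value` lines from sysctl output into a dict of ints."""
    levels = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a `name: value` line
        value = value.strip()
        if value.isdigit():
            # Keep only non-negative integer readings such as thermal levels.
            levels[name.strip()] = int(value)
    return levels


# Sample copied from the transcript above.
sample = """\
machdep.xcpm.cpu_thermal_level: 71
machdep.xcpm.gpu_thermal_level: 0
machdep.xcpm.io_thermal_level: 12
"""
levels = parse_sysctl(sample)
print(levels["machdep.xcpm.cpu_thermal_level"])
```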
The Go ps package is used by the Telegraf temp (and cpu/other) input plugins, and it collects sysctl data on Darwin, so it may already include the temperature. If not, we can get it directly with sysctl.
For throttling, there may not be anything already available in Telegraf.
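If the readings do have to come from sysctl directly, one option is for a small script to print them in InfluxDB line protocol, which Telegraf's exec input can consume when its data_format is set to "influx". This is only a sketch under stated assumptions: the measurement name `mac_thermal`, the `host` tag value, and the helper `to_line_protocol` are all hypothetical, and the helper assumes names and tag values contain no spaces, commas, or equals signs (those would need escaping).

```python
def to_line_protocol(measurement, tags, fields):
    """Format integer fields as one InfluxDB line-protocol line (no timestamp)."""
    # Tags follow the measurement name, each prefixed by a comma.
    tag_part = "".join(",{}={}".format(k, v) for k, v in sorted(tags.items()))
    # Integer field values carry an `i` suffix in line protocol.
    field_part = ",".join("{}={}i".format(k, v) for k, v in sorted(fields.items()))
    return "{}{} {}".format(measurement, tag_part, field_part)


line = to_line_protocol(
    "mac_thermal",                      # hypothetical measurement name
    {"host": "t-mojave-r7-230"},        # example tag value only
    {"cpu_thermal_level": 71, "gpu_thermal_level": 0},
)
print(line)
# mac_thermal,host=t-mojave-r7-230 cpu_thermal_level=71i,gpu_thermal_level=0i
```

Omitting the timestamp leaves it to the collector to stamp each point on arrival.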
~$ pmset -g thermlog
2019-10-02 12:21:53 -0600 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 12
CPU_Speed_Limit = 97
2019-10-02 12:21:54 -0600 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 12
CPU_Speed_Limit = 100
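Since the thermlog output is a repeating block of a timestamped header followed by `key = value` lines, it can be grouped into one record per notification. A minimal sketch, assuming the output has been captured as text in the shape shown above; `parse_thermlog` is a hypothetical helper, not an existing Telegraf plugin.

```python
import re

# Header line, e.g. "2019-10-02 12:21:53 -0600 CPU Power notify"
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) CPU Power notify"
)
# Field line, e.g. "CPU_Speed_Limit = 97" (leading whitespace allowed)
FIELD = re.compile(r"^\s*(CPU_\w+)\s*=\s*(\d+)")


def parse_thermlog(output):
    """Group thermlog lines into one dict per 'CPU Power notify' block."""
    records = []
    current = None
    for line in output.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"timestamp": header.group(1)}
            records.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = int(field.group(2))
    return records


# Sample copied from the transcript above.
sample = """\
2019-10-02 12:21:53 -0600 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 12
CPU_Speed_Limit = 97
2019-10-02 12:21:54 -0600 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 12
CPU_Speed_Limit = 100
"""
records = parse_thermlog(sample)
print([r["CPU_Speed_Limit"] for r in records])
```

A CPU_Speed_Limit below 100 in any record would indicate that the CPU was being throttled at that time.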
When I load my MacBook Pro to 80% CPU, the cpu_thermal_level rises, the fans start up, and the thermal level dips and then rises again. Finally (presumably because the fans reach max speed and the thermal level no longer drops), around a thermal level of 99, the CPU_Speed_Limit decreases until either the thermal level starts falling or the limit bottoms out around 30. In my testing at 80% CPU, it stabilizes at a limit of 40 with thermal level 135. At 95% CPU, the readings stabilize at a limit of 29 and thermal level 145.
macOS has also been logging the thermal level (at the default logging level):
https://my.papertrailapp.com/groups/1223184/events?focus=1115767836724334605&q=%28program%3Athermald%20%22thermal%20level%22%29&selected=1115767836724334605
For example:
Sep 30 05:12:33 t-mojave-r7-230.test.releng.mdc2.mozilla.com thermald: [thermald:log] Thermal pressure level: Nominal based on CPU thermal level 49
Sep 30 05:12:33 t-mojave-r7-309.test.releng.mdc1.mozilla.com thermald: [thermald:log] Thermal pressure level: Nominal based on CPU thermal level 49
Sep 30 05:12:33 t-mojave-r7-286.test.releng.mdc1.mozilla.com thermald: [thermald:log] Thermal pressure level: Nominal based on CPU thermal level 49
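Because these thermald lines follow a fixed pattern, historical thermal levels could be recovered from archived logs with a regular expression. A minimal sketch, assuming the line format is exactly as in the samples above; `parse_thermald` is a hypothetical helper.

```python
import re

THERMALD = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) thermald: "
    r"\[thermald:log\] Thermal pressure level: (?P<pressure>\w+) "
    r"based on CPU thermal level (?P<level>\d+)$"
)


def parse_thermald(line):
    """Return time, host, pressure, and CPU thermal level, or None if no match."""
    match = THERMALD.match(line.strip())
    if match is None:
        return None
    return {
        "time": match.group("time"),
        "host": match.group("host"),
        "pressure": match.group("pressure"),
        "cpu_thermal_level": int(match.group("level")),
    }


# Sample copied from the log excerpt above.
sample = (
    "Sep 30 05:12:33 t-mojave-r7-230.test.releng.mdc2.mozilla.com thermald: "
    "[thermald:log] Thermal pressure level: Nominal based on CPU thermal level 49"
)
entry = parse_thermald(sample)
print(entry["host"], entry["cpu_thermal_level"])
```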
For the performance difference on weekends (bug 1578694), I'll check whether we have this in the logs (Stackdriver/archives) and whether it is lower overall on weekends. From my spot checks, it does not look extreme or high on the last few weekdays in the currently searchable logs.
I had turned off Telegraf because of its performance impact in production. After testing varying configurations last week and this week (including on a production worker), the performance impact appears to come from the filecount plugin.
I attached a screenshot of stats collected with different Telegraf host-stats plugins on and off. The 30% CPU spikes occur when the filecount plugin is enabled and there are many files in the task user directory tree. I'm running additional tests that vary the filecount parameters to confirm this.
After merging the last change (2019 Nov 26, 6:35am Pacific), I turned the monitoring back on for the macOS workers.
Once the monitoring has collected stats on the machines through next week, I can review whether thermal throttling happens during typical daily workloads (and not on weekends).
The macOS workers have run with this monitoring for the last 7 days. I see CPU throttling when the machines are under load and heat up, but the speed limit averages 100% (no limit) for the majority of the workers. Most drop to 80% for about an hour and then recover. 6-10 of them show high temperatures and significant throttling.
The throttling happened less over the weekend. I don't know whether the tasks running before and during the throttling are the same.