#CPU Frequency Scaling broken on some machines


ruby mauve
#

Symptom: All CPUs are locked to 1500 MHz (the minimum P-state) and never boost, even under sustained load. This makes compute-intensive workloads 2-4x slower than expected.

Evidence from within the container:

# Governor is schedutil, should ramp up under load — but doesn't
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
schedutil

# Driver is legacy acpi-cpufreq (not amd-pstate)
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
acpi-cpufreq

# Only 3 P-states available
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
3300000 2400000 1500000

# Stuck at minimum even under sustained CPU load
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
1500000    # ALL cores report 1500000, even while running compute workloads

# Limits look correct - max is 3300 MHz, boost is enabled
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
3300000
$ cat /sys/devices/system/cpu/cpufreq/boost
1
$ cat /sys/devices/system/cpu/cpu0/cpufreq/bios_limit
3300000
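
To back up the "ALL cores" claim without reading 128 files by hand, something like the following could tally the reported frequency across every core. This is only a rough sketch: the function name `freq_summary` and its optional root-directory argument are my own additions (the argument exists so the function can be pointed at a test directory), and it assumes each core exposes the same `scaling_cur_freq` file shown above.

```shell
# Tally scaling_cur_freq across all cores.
# Prints one line per distinct value: "<freq_khz> <number_of_cores>".
# The optional first argument (hypothetical, for testing) overrides the sysfs root.
freq_summary() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        # Skip cores without a readable cpufreq entry (or an unmatched glob).
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done | sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

Called with no argument while a compute workload is running, a single output line such as `1500000 128` would match the symptom described here; a healthy machine should show at least some cores at higher values.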

Hardware: AMD EPYC 9575F (Zen 5 Turin), 128 threads visible. Should reach at least its 3.3 GHz base clock (or higher with turbo).

Root cause hypothesis: The schedutil governor on the host isn't receiving proper CPU utilization feedback for this VM/container, so it never ramps above the minimum P-state. Alternatively, the host is using acpi-cpufreq instead of amd-pstate or amd-pstate-epp, which are the recommended drivers for Zen 4+ CPUs and handle frequency scaling much better.

Comparison with a working machine (using the amd-pstate-epp driver, correct for Zen 4): boosts to 5342 MHz under load, everything working as expected.

Requested fix (any of these):

  1. Switch host CPU frequency driver to amd-pstate-epp (kernel param amd_pstate=active)
  2. Switch governor to performance on the host: echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  3. If neither is possible, at least ensure schedutil is responding to load (may require kernel update or BIOS CPPC enablement)
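
For option 2, a slightly more defensive version of the one-liner might look like the sketch below. It writes the governor per core and reads it back, so a silently rejected write shows up as an error. The function name `set_governor` and the optional root-directory argument are hypothetical (the argument is only there so it can be tried against a scratch directory). On a real machine this would need root on the host; from inside the container these files are most likely not writable.

```shell
# Apply a governor to every core, then read it back to confirm it took effect.
# Usage: set_governor <governor> [sysfs_cpu_root]
# Returns non-zero if any core could not be written or did not keep the value.
set_governor() {
    local gov="$1"
    local root="${2:-/sys/devices/system/cpu}"
    local status=0
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        if [ ! -w "$f" ]; then
            echo "cannot write $f" >&2
            status=1
            continue
        fi
        echo "$gov" > "$f" || status=1
        if [ "$(cat "$f")" != "$gov" ]; then
            echo "$f did not accept $gov" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `set_governor performance && echo ok` on the host. Checking scaling_cur_freq under load afterwards would show whether the governor was actually the problem.
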
neat turret
#

@ruby mauve I suggest you post this to #1185337232517759028 too as a feature request

ruby mauve
#

@neat turret I'm pretty sure this is an error and not a feature - the CPUs are advertised as 5 GHz+ but locked to 1.5 GHz

neat turret
#

@ruby mauve I mean - if RunPod fixes this for you it still won't be released for everyone. Requesting it as a 'new feature' and getting wider support from other customers could help with that.

ruby mauve
#

@neat turret it is a fleet-wide issue, and it can only be fixed on the host. It’s a misconfiguration that affects anyone unlucky enough to be randomly assigned to a host with this issue. The resolution is independent of the customer: it cannot be applied on a per-customer basis and must be fixed across the entire fleet

#

Computers are supposed to dynamically scale frequencies as a power-saving measure - but what’s currently happening is that the frequency is locked to the lowest possible setting (worse performance)

neat turret
#

@floral heath @fleet tapir ?