#Dev machine discussion: "performance" and "efficiency" cores
"Performance" cores can do more work per clock but require more power and take more die space than their "efficiency" counterparts. Modern OS schedulers take this into account when assigning work to cores and will put threads that have a history of CPU intensive work on performance cores. So when running multiple compiles, the threads running compilations will tend to run on the performance cores. If all the performance cores are in use, the scheduler will run CPU intensive threads on the efficiency cores rather than make them wait.
Any speed gains realized using efficiency cores come from the additional parallelization of the overall workload they make possible.
I've used CPUs with a single core type and with a mix of core types doing large builds like CP. There might be measurable performance differences between the core types, but what really makes a difference is the total number of cores and how well the workload is parallelized.
As a purely practical matter for CP builds, only up to 8 cores appear to get utilized. For more than 8 cores there's no appreciable performance improvement as measured by total elapsed time for the build.
What is your disk like in the Optiplex? I’d look at increasing disk speed if possible, too, especially when building things with thousands of often relatively small files. Compiling code from very fast SSDs that are on PCIe can make a huge difference. Samsung has some of the fastest-rated NVMe SSDs, but I’m not sure if that translates into reality. I just went SSD shopping and was surprised at how much prices have fallen (4TB is now much less than a kidney or mortgage).
I have a laptop that has a 12th Gen Intel(R) Core(TM) i5-1235U with 12 CPU threads: it has 2 P-cores with 2 threads each + 8 E-cores. Logical processors 0 and 2 are the first thread of each P-core.
I'm still using gcc 10 for building CP.
I'll gather up some timings for you.
Using "make -j2" and taskset to use 1 thread from each P core,
reporting "real" (elapsed) time to build adafruit_feather_rp2040
commandline                               Real time (s)
taskset --cpu-list 0 make -j1             100
taskset --cpu-list 0-3 make -j4           57
taskset --cpu-list 4-11 make -j8          39
taskset --cpu-list 0,2,3-11 make -j10     36
taskset --cpu-list 0-11 make -j12         39
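To make the scaling in those timings easier to read, here is a rough sketch that turns them into speedup and per-job efficiency relative to the -j1 run. The function names are made up for illustration, and the rows use different CPU sets (P-core threads, E-cores, or a mix), so treat the efficiency column as a rough indicator only.

```python
# Elapsed ("real") seconds from the timings above, keyed by the make -jN job count.
# Caveat: the rows use different CPU sets (P-core threads, E-cores, or a mix).
timings = {1: 100, 4: 57, 8: 39, 10: 36, 12: 39}


def speedup(baseline_s, elapsed_s):
    """How many times faster a run is than the baseline run."""
    return baseline_s / elapsed_s


def efficiency(baseline_s, elapsed_s, jobs):
    """Speedup divided by job count; 1.0 would be perfect scaling."""
    return speedup(baseline_s, elapsed_s) / jobs


baseline = timings[1]
rows = [
    (jobs, elapsed, speedup(baseline, elapsed), efficiency(baseline, elapsed, jobs))
    for jobs, elapsed in sorted(timings.items())
]

for jobs, elapsed, s, e in rows:
    print(f"-j{jobs:<3} {elapsed:>4}s  speedup {s:4.2f}x  efficiency {e:4.2f}")
```

Efficiency drops as jobs are added, which is what you would expect when part of the build is sequential.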
```
$ lscpu
... Model name: 12th Gen Intel(R) Core(TM) i5-1235U ...
$ lstopo
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (12MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (2048KB)
        L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2 + PU L#4 (P#4)
        L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3 + PU L#5 (P#5)
        L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4 + PU L#6 (P#6)
        L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5 + PU L#7 (P#7)
      L2 L#3 (2048KB)
        L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#8 (P#8)
        L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#9 (P#9)
        L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#10 (P#10)
        L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#11 (P#11)
...
```
I don't have any way to directly measure power consumption of each build, but for me building with all 8 "E" cores together is just about the fastest, faster than building with the 2 "P" cores and barely slower than building with all cores.
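For anyone who wants to script this kind of core pinning instead of typing taskset each time, here is a minimal sketch that expands a taskset-style CPU list into logical CPU numbers. `parse_cpu_list` and `pin_current_process` are hypothetical helper names, and the parser only handles the comma and range forms used above. The pinning helper relies on `os.sched_setaffinity`, which is Linux-only, and is never called here.

```python
import os


def parse_cpu_list(spec):
    """Expand a list such as "0,2,4-11" into a sorted list of logical CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_current_process(cpus):
    """Restrict this process (and children it starts later) to the given CPUs.

    Linux-only; raises OSError if a CPU number does not exist on the machine.
    """
    os.sched_setaffinity(0, cpus)


# On the i5-1235U described above, logical CPUs 4-11 are the eight E-cores.
e_cores = parse_cpu_list("4-11")
print(e_cores)
print(len(parse_cpu_list("0,2,3-11")), "logical CPUs in 0,2,3-11")
```

Note that `0,2,3-11` expands to 11 logical CPUs, one more than the `-j10` job count.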
this is basically main branch commit 28c2094b
compare to AMD Ryzen 7 3700X 8-Core Processor (16 threads): with 1 thread per core it builds in 22s (-j8) and with 2 threads per core it builds in 15s (-j16). building at -j1 takes 95 seconds.
the machines are: debian 12, 16GB, micron NVME, ext4 filesystem; and debian 11, 32GB, samsung NVME, zfs filesystem
I can run another set of tests for a different port if you like; I think rp2040 is a best case because it uses MESSAGE_COMPRESSION_LEVEL=1 and does not use LTO, which are two steps of the build that are not effectively parallel in other builds.
(a best case for larger number of threads improving build times)
Thanks - that is interesting. On my 6-core i7-8700, time says the real time for this build is 24s.
so ryzen 7 3700X "per thread" performance is only a hair faster than i5-1235U "P" core "per thread" performance in this test, which surprises me a bit (expected the ryzen to be faster). "E" core time on the i5 is 110s, so its single thread performance is not much behind the "P" core.
The "U" is a laptop part with 15W base power & 55W turbo power, while the 3700X is rated 65W TDP, but Intel and AMD numbers are not directly comparable because they use different methodologies.
-j1 for me is 1m40s
anyway that's my info for now. I don't think you're going to get a 4x improvement but I could be wrong.
Thanks very much for the testing. No, I don't think I will either. The single-core CPU performance of 13700 vs 8700 is about 1.5x according to cpubenchmark.net
so i might get 2x when all is said and done, but my cores are not terribly slower. This contrasts with the Moore's law era when I upgraded my machine every few years and got a 6x improvement each time
current disk is a Samsung SSD 970 EVO Plus 1TB
but this is a pretty old motherboard
M.2 PCIe 3x4 NVMe
Q370 chipset
I would have been really surprised if you didn't already have a decent SSD... Samsung has some newer ones that I think are about double the speed but they need PCIe 4.0, that chipset is 3.0
I wonder if there'd be enough benefit to building from a RAM disk to make it worthwhile
I was thinking that the files are going to be cached in RAM anyway, and that the compilation is not necessarily going to crowd that out. I have 24GB
O for the halcyon days of yore when new generations of CPUs really did have dramatic performance improvements...
What is helpful is ccache, sometimes. If I build with 16 threads on my ryzen machine AND it was exactly equivalent to a previous build, it's 6.5s, down from 11s or <50% of the 15s at -j16 without ccache. HOWEVER because of how qstrs & translations work, many changes will cause virtually all source files to be different (e.g., addition/removal/change of a qstr or a message)
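A toy model of why one qstr or message change defeats the cache. This is not ccache's actual hashing scheme, just a sketch of the general idea that a compiler cache keys each result on the source text, the compiler flags, and everything the source includes. The file contents and the assumption that every source file includes the generated qstr header are made up for illustration.

```python
import hashlib


def cache_key(source_text, headers, flags):
    """Hash the flags, the source, and every included header (toy model only)."""
    h = hashlib.sha256()
    h.update(flags.encode() + b"\0")
    h.update(source_text.encode() + b"\0")
    for name in sorted(headers):
        h.update(name.encode() + b"\0")
        h.update(headers[name].encode() + b"\0")
    return h.hexdigest()


# Hypothetical sources; assume each one includes the generated qstr header.
sources = {"a.c": "int a(void) { return 1; }\n", "b.c": "int b(void) { return 2; }\n"}
headers = {"qstrdefs.generated.h": "#define MP_QSTR_foo 1\n"}

before = {name: cache_key(text, headers, "-O2") for name, text in sources.items()}

# Adding a single qstr changes the shared header, and with it every key.
changed = dict(headers)
changed["qstrdefs.generated.h"] += "#define MP_QSTR_bar 2\n"
after = {name: cache_key(text, changed, "-O2") for name, text in sources.items()}

invalidated = sorted(name for name in sources if before[name] != after[name])
print("cache misses after adding one qstr:", invalidated)
```

In this model no source file was edited, yet every key changes, so every file is recompiled.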
distcc or icecream for sharing builds across a local trusted network are also interesting tech but I've never checked if they help with circuitpython. as remarked elsewhere, we have some slow sequential steps that limit the benefit of multiple parallel build tasks
there's even a law heuristic about that, https://en.wikipedia.org/wiki/Amdahl's_law
In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states that "the overall performance improvement gained by optimizing a single part of a system is limited by the ...
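As a rough worked example of that formula, speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel, here is a sketch that backs p out of the Ryzen 7 3700X numbers quoted earlier (95s at -j1, 22s at -j8). It assumes all 8 workers are equally fast and ignores SMT, disk, and cache effects, so the result is only a ballpark.

```python
def amdahl_speedup(p, n):
    """Predicted speedup on n workers when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(measured_speedup, n):
    """Invert Amdahl's law: solve for p given a measured speedup on n workers."""
    return (1.0 - 1.0 / measured_speedup) / (1.0 - 1.0 / n)


t1, t8 = 95.0, 22.0          # seconds at -j1 and -j8 on the Ryzen 7 3700X
p = parallel_fraction(t1 / t8, 8)
limit = 1.0 / (1.0 - p)      # speedup ceiling as n grows without bound

print(f"estimated parallel fraction: {p:.2f}")
print(f"speedup ceiling with unlimited cores: {limit:.1f}x")
```

Under these assumptions roughly 88% of the build is parallel, which caps the speedup at a bit over 8x no matter how many cores are added.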
what board are you building to test? I have an AMD 5950x for my desktop and an AMD 7480U in my new laptop
Phoronix is my go-to for performance testing. Laptop CPUs: https://phoronix.com/benchmark/result/fedora-39-amd-ryzen-laptop-benchmarks/timed-linux-kernel-compilation-defconfig.svgz
Jeff tested adafruit_feather_rp2040
time make -j 32 BOARD=adafruit_feather_rp2040
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
Traceback (most recent call last):
File "/home/tannewt/repos/circuitpython/ports/raspberrypi/gen_stage2.py", line 3, in <module>
import cascadetoml
ModuleNotFoundError: No module named 'cascadetoml'
make: *** [Makefile:431: build-adafruit_feather_rp2040/genhdr/flash_info.h] Error 1
make: *** Waiting for unfinished jobs....
________________________________________________________
Executed in 383.05 millis fish external
usr time 385.53 millis 297.00 micros 385.23 millis
sys time 54.28 millis 28.00 micros 54.25 millis
! ~/r/circuitpython idf5.1.2 *$+ ports/raspberrypi source ~/repos/venv/bin/activate.fish
(venv) ◰³ venv ~/r/circuitpython idf5.1.2 *$+ ports/raspberrypi make -j 32 BOARD=adafruit_feather_rp2040 clean
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
rm -rf build-adafruit_feather_rp2040
(venv) ◰³ venv ~/r/circuitpython idf5.1.2 *$+ ports/raspberrypi time make -j 32 BOARD=adafruit_feather_rp2040
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
GEN build-adafruit_feather_rp2040/genhdr/mpversion.h
text data bss dec hex filename
244 0 0 244 f4 build-adafruit_feather_rp2040/boot2.elf
Root pointer registrations updated
GEN build-adafruit_feather_rp2040/genhdr/root_pointers.h
Module registrations updated
GEN build-adafruit_feather_rp2040/genhdr/moduledefs.h
QSTR updated
GEN build-adafruit_feather_rp2040/genhdr/qstrdefs.generated.h
/usr/lib/gcc/arm-none-eabi/13.2.0/../../../../arm-none-eabi/bin/ld: warning: build-adafruit_feather_rp2040/firmware.elf has a LOAD segment with RWX permissions
Memory region Used Size Region Size %age Used
FLASH_FIRMWARE: 805320 B 1020 KB 77.10%
RAM: 48240 B 256 KB 18.40%
SCRATCH_Y: 0 GB 4 KB 0.00%
SCRATCH_X: 2 KB 4 KB 50.00%
Converted to uf2, output size: 1610752, start address: 0x10000000
Wrote 1610752 bytes to build-adafruit_feather_rp2040/firmware.uf2
________________________________________________________
Executed in 9.03 secs fish external
usr time 120.29 secs 315.00 micros 120.29 secs
sys time 49.70 secs 38.00 micros 49.70 secs
-j 1 is 82.28 seconds
so 9 seconds on this 32-thread machine vs 24 seconds on my 6-core i7-8700
Cache is king. Maybe. Would be interesting to see back to back timings without the error.
haha, oops. I copied them both
AMD also has the large cache variants
the X3D chips
I'm thinking filesystem cache. Would be interesting to see back to back builds after a boot.
gets out the new laptop
Laptop took 22.89 seconds with -j 16
what CPU is in the laptop?
AMD 7480U
following up on this thread from over two years ago. I finally replaced my Intel i7-8700 machine (6 cores, 12 threads) with an Ultra 7 265 machine (8 performance cores, 12 efficiency cores). Here are some CircuitPython build-time comparisons. A single performance core on the 265 is about 1.8 times faster than an i7-8700 core. These builds use automatic -j<n>, where <n> is the number of cores:
$ make BOARD=trinket_m0
(i7-8700)
real 0m20.804s
user 1m4.790s
sys 0m12.647s
(Ultra 7 265)
real 0m7.225s
user 0m29.888s
sys 0m6.702s
$ make BOARD=feather_m4_express
(i7-8700)
real 0m47.506s
user 3m54.140s
sys 0m38.988s
(Ultra 7 265)
real 0m11.424s
user 1m34.986s
sys 0m20.907s
$ make BOARD=adafruit_metro_esp32s3
(i7-8700)
real 1m21.665s
user 6m57.475s
sys 2m57.367s
(Ultra 7 265)
real 0m26.298s
user 3m12.123s
sys 1m12.385s
So 2.9-4.2 times faster in elapsed time. The old machine has 24GB RAM; the new only has 16GB (due to the exorbitant cost of RAM right now).
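For reference, a small sketch that computes the per-board ratios from the "real" times above. It assumes the first timing of each pair is the old i7-8700 and the second is the new Ultra 7 265, and `to_seconds` is just a helper name made up here.

```python
def to_seconds(stamp):
    """Convert a shell `time` value such as "1m21.665s" to seconds."""
    minutes, rest = stamp.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


# (old machine, new machine) "real" times per board, copied from above.
# Assumption: the slower first entry of each pair is the i7-8700.
real_times = {
    "trinket_m0": ("0m20.804s", "0m7.225s"),
    "feather_m4_express": ("0m47.506s", "0m11.424s"),
    "adafruit_metro_esp32s3": ("1m21.665s", "0m26.298s"),
}

ratios = {
    board: to_seconds(old) / to_seconds(new)
    for board, (old, new) in real_times.items()
}

for board, ratio in ratios.items():
    print(f"{board}: {ratio:.2f}x faster")
```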
The SSDs on the two machines have very similar performance, I believe.
I bought an open-box lower-end Dell machine from Best Buy for about $550: no graphics card, etc.
For compilation speed, this was worth the upgrade. For typical purposes like web browsing, there is no obvious performance difference.