#Dev machine discussion: "performance" and "efficiency" cores


broken torrent
#

starting a thread

keen mural
#

"Performance" cores can do more work per clock but require more power and take more die space than their "efficiency" counterparts. Modern OS schedulers take this into account when assigning work to cores and will put threads that have a history of CPU-intensive work on performance cores. So when running multiple compiles, the threads running compilations will tend to run on the performance cores. If all the performance cores are in use, the scheduler will run CPU-intensive threads on the efficiency cores rather than make them wait.
Any speed gains realized using efficiency cores come from the additional parallelization of the overall workload they make possible.

#

I've used CPUs with a single core type and with a mix of core types doing large builds like CP. There might be measurable performance differences between the core types, but what really makes a difference is the total number of cores and how well the workload is parallelized.

#

As a purely practical matter for CP builds, only up to 8 cores appear to get utilized. For more than 8 cores there's no appreciable performance improvement as measured by total elapsed time for the build.

queen token
#

What is your disk like in the Optiplex? I’d look at increasing disk speed if possible, too, especially when building things with thousands of often relatively small files. Compiling code from very fast SSDs that are on PCIe can make a huge difference. Samsung has some of the fastest-rated NVMe SSDs, but I’m not sure if that translates into reality. I just went SSD shopping and was surprised at how much prices have fallen (4TB is now much less than a kidney or mortgage).

tacit lantern
#

I have a laptop that has 12th Gen Intel(R) Core(TM) i5-1235U with 12 CPU threads: It has 2 P-cores with 2 threads each + 8 E cores. Logical processors 0 and 2 are the first thread of each P-core.
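
On Linux, one way to check which logical CPUs belong to which core type is to read the hybrid PMU entries in sysfs. This is a minimal sketch that assumes a hybrid Intel CPU and a kernel that exposes `/sys/devices/cpu_core/cpus` and `/sys/devices/cpu_atom/cpus`; the function names are made up for illustration, and on machines without those files the result is just two empty lists.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as "0-3,8,10-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_core_types(sysfs_root="/sys/devices"):
    """Return {"P": [...], "E": [...]} from the hybrid PMU entries, if present.

    Assumption: cpu_core lists the P-core threads and cpu_atom lists the E-cores.
    Missing files give an empty list rather than an error.
    """
    result = {}
    for label, pmu in (("P", "cpu_core"), ("E", "cpu_atom")):
        path = Path(sysfs_root) / pmu / "cpus"
        result[label] = parse_cpu_list(path.read_text()) if path.exists() else []
    return result


if __name__ == "__main__":
    print(read_core_types())
```

The same list syntax is what `taskset --cpu-list` accepts, so the output can be pasted straight into a pinned build command.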

I'm still using gcc 10 for building CP.

I'll gather up some timings for you.

#

Using "make -j2" and taskset to use 1 thread from each P core,
reporting "real" (elapsed) time to build adafruit_feather_rp2040

command line                            Real time
taskset --cpu-list 0        make -j1     100s
taskset --cpu-list 0-3      make -j4      57s
taskset --cpu-list 4-11     make -j8      39s
taskset --cpu-list 0,2,3-11 make -j10     36s
taskset --cpu-list 0-11     make -j12     39s

$ lscpu 
...  Model name:                         12th Gen Intel(R) Core(TM) i5-1235U ...

$ lstopo
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (12MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (2048KB)
        L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2 + PU L#4 (P#4)
        L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3 + PU L#5 (P#5)
        L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4 + PU L#6 (P#6)
        L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5 + PU L#7 (P#7)
      L2 L#3 (2048KB)
        L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#8 (P#8)
        L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#9 (P#9)
        L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#10 (P#10)
        L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#11 (P#11)
...
#

I don't have any way to directly measure power consumption of each build, but for me building with all 8 "E" cores together is just about the fastest, faster than building with the 2 "P" cores and barely slower than building with all cores.
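
To make the comparison in the table easier to read, the timings can be turned into speedups relative to the `-j1` run. This is just arithmetic on the numbers reported above; the helper name is arbitrary.

```python
# (taskset CPU list, make -j value, elapsed seconds) copied from the table above
RUNS = [
    ("0", 1, 100),
    ("0-3", 4, 57),
    ("4-11", 8, 39),
    ("0,2,3-11", 10, 36),
    ("0-11", 12, 39),
]


def speedups(runs):
    """Return (cpu_list, jobs, speedup vs. the first run, speedup per job) for each run."""
    baseline = runs[0][2]
    rows = []
    for cpu_list, jobs, seconds in runs:
        speedup = baseline / seconds
        rows.append((cpu_list, jobs, speedup, speedup / jobs))
    return rows


if __name__ == "__main__":
    for cpu_list, jobs, speedup, per_job in speedups(RUNS):
        print(f"{cpu_list:<10} -j{jobs:<3} {speedup:4.2f}x  ({per_job:.0%} per job)")
```

The E-core-only run (`4-11`) comes out at about 2.6x, against about 1.75x for the two P-cores with both threads (`0-3`).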

#

this is basically main branch commit 28c2094b

#

compare to AMD Ryzen 7 3700X 8-Core Processor (16 threads): with 1 thread per core it builds in 22s (-j8) and with 2 threads per core it builds in 15s (-j16). building at -j1 takes 95 seconds.

#

the machines are: debian 12, 16GB, micron NVME, ext4 filesystem; and debian 11, 32GB, samsung NVME, zfs filesystem

#

I can run another set of tests for a different port if you like; I think rp2040 is a best case because it uses MESSAGE_COMPRESSION_LEVEL=1 and does not use LTO, which are two steps of the build that are not effectively parallel in other builds.

#

(a best case for larger number of threads improving build times)

broken torrent
#

Thanks - that is interesting. On my 6-core i7-8700, time says the real time for this build is 24s.

tacit lantern
#

so ryzen 7 3700X "per thread" performance is only a hair faster than i5-1235U "P" core "per thread" performance in this test, which surprises me a bit (I expected the ryzen to be faster by a wider margin). "E" core time on the i5 is 110s, so its single thread performance is not much behind the "P" core.

The "U" is a laptop part with 15W base power & 55W turbo power, while the 3700X is rated 65W TDP, but Intel and AMD numbers are not directly comparable because they use different methodologies.

broken torrent
#

-j1 for me is 1m40s

tacit lantern
#

anyway that's my info for now. I don't think you're going to get a 4x improvement but I could be wrong.

broken torrent
#

Thanks very much for the testing. No, I don't think I will either. The single-core CPU performance of 13700 vs 8700 is about 1.5x according to cpubenchmark.net

#

so i might get 2x when all is said and done, but my cores are not terribly slower. This contrasts with the Moore's law era when I upgraded my machine every few years and got a 6x improvement each time

broken torrent
#

but this is a pretty old motherboard

#

M.2 PCIe 3x4 NVMe

#

Q370 chipset

queen token
#

(replying to broken torrent: "current disk is a Samsung SSD 970 EVO Plus 1TB")

I would have been really surprised if you didn't already have a decent SSD... Samsung has some newer ones that I think are about double the speed but they need PCIe 4.0, that chipset is 3.0

I wonder if there'd be enough benefit to building from a RAM disk to make it worthwhile

queen token
#

O for the halcyon days of yore when new generations of CPUs really did have dramatic performance improvements...

tacit lantern
#

ccache is sometimes helpful. If I build with 16 threads on my ryzen machine AND the build is exactly equivalent to a previous one, it takes 6.5s, down from 11s, or <50% of the 15s at -j16 without ccache. HOWEVER, because of how qstrs & translations work, many changes will cause virtually all source files to be different (e.g., addition/removal/change of a qstr or a message), and then the cache doesn't help.
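
A toy model of why a shared generated header hurts cache hit rates: a compiler cache keys each compilation on (roughly) the content that goes into it, so if every source file includes a header that changed, every key changes. The sketch below is not how ccache is implemented; the file contents and the `cache_key` and `count_hits` helpers are invented for illustration.

```python
import hashlib


def cache_key(source_text, included_headers):
    """Hash a source file together with the headers it includes (toy stand-in for a cache key)."""
    digest = hashlib.sha256()
    digest.update(source_text.encode())
    for name in sorted(included_headers):
        digest.update(name.encode())
        digest.update(included_headers[name].encode())
    return digest.hexdigest()


def count_hits(sources, headers_before, headers_after):
    """Count how many sources keep the same key after the headers change."""
    hits = 0
    for text in sources.values():
        if cache_key(text, headers_before) == cache_key(text, headers_after):
            hits += 1
    return hits


if __name__ == "__main__":
    # Hypothetical sources that all include one generated header.
    sources = {f"file{i}.c": f"int f{i}(void) {{ return {i}; }}" for i in range(5)}
    before = {"qstrdefs.generated.h": "QDEF(MP_QSTR_a)"}
    after = {"qstrdefs.generated.h": "QDEF(MP_QSTR_a) QDEF(MP_QSTR_b)"}
    print(count_hits(sources, before, before))  # header unchanged: all files hit
    print(count_hits(sources, before, after))   # one qstr added: all files miss
```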

#

distcc or icecream for sharing builds across a local trusted network are also interesting tech but I've never checked if they help with circuitpython. as remarked elsewhere, we have some slow sequential steps that limit the benefit of multiple parallel build tasks

#

there's even a law (really a heuristic) about that, https://en.wikipedia.org/wiki/Amdahl's_law

In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states that "the overall performance improvement gained by optimizing a single part of a system is limited by the ...
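
As a rough illustration of Amdahl's law, the sketch below computes the speedup for a given parallel fraction and, going the other way, estimates the parallel fraction from a measured speedup. Plugging in the 3700X numbers from earlier in the thread (95s at -j1, 22s at -j8 with one thread per core) is only a back-of-the-envelope estimate, since it assumes the build splits cleanly into a perfectly parallel part and a strictly serial part.

```python
def amdahl_speedup(parallel_fraction, n):
    """Speedup with n workers when parallel_fraction of the work scales perfectly."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def estimate_parallel_fraction(speedup, n):
    """Invert Amdahl's law: the parallel fraction implied by a measured speedup on n workers."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    measured = 95 / 22  # -j1 time / -j8 time reported earlier for the 3700X
    p = estimate_parallel_fraction(measured, 8)
    print(f"estimated parallel fraction: {p:.2f}")
    print(f"upper bound with unlimited workers: {1.0 / (1.0 - p):.1f}x")
    for n in (8, 16, 32):
        print(f"{n:>2} workers: {amdahl_speedup(p, n):.1f}x")
```

Under these assumptions the estimated parallel fraction comes out a little under 0.9. Treat that as indicative only: SMT threads are not full cores, and other machines in this thread will not fit the same curve exactly.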

pastel creek
#

what board are you building to test? I have an AMD 5950X for my desktop and an AMD 7840U in my new laptop

broken torrent
#

Jeff tested adafruit_feather_rp2040

pastel creek
#
$ time make -j 32 BOARD=adafruit_feather_rp2040
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
Traceback (most recent call last):
  File "/home/tannewt/repos/circuitpython/ports/raspberrypi/gen_stage2.py", line 3, in <module>
    import cascadetoml
ModuleNotFoundError: No module named 'cascadetoml'
make: *** [Makefile:431: build-adafruit_feather_rp2040/genhdr/flash_info.h] Error 1
make: *** Waiting for unfinished jobs....

________________________________________________________
Executed in  383.05 millis    fish           external
   usr time  385.53 millis  297.00 micros  385.23 millis
   sys time   54.28 millis   28.00 micros   54.25 millis

ports/raspberrypi $ source ~/repos/venv/bin/activate.fish
(venv) ports/raspberrypi $ make -j 32 BOARD=adafruit_feather_rp2040 clean
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
rm -rf build-adafruit_feather_rp2040 
(venv) ports/raspberrypi $ time make -j 32 BOARD=adafruit_feather_rp2040
- Verbosity options: any combination of "steps commands rules", as `make V=...` or env var BUILD_VERBOSE
GEN build-adafruit_feather_rp2040/genhdr/mpversion.h
   text    data     bss     dec     hex filename
    244       0       0     244      f4 build-adafruit_feather_rp2040/boot2.elf
Root pointer registrations updated
GEN build-adafruit_feather_rp2040/genhdr/root_pointers.h
Module registrations updated
GEN build-adafruit_feather_rp2040/genhdr/moduledefs.h
QSTR updated
GEN build-adafruit_feather_rp2040/genhdr/qstrdefs.generated.h
/usr/lib/gcc/arm-none-eabi/13.2.0/../../../../arm-none-eabi/bin/ld: warning: build-adafruit_feather_rp2040/firmware.elf has a LOAD segment with RWX permissions
Memory region         Used Size  Region Size  %age Used
  FLASH_FIRMWARE:      805320 B      1020 KB     77.10%
             RAM:       48240 B       256 KB     18.40%
       SCRATCH_Y:          0 GB         4 KB      0.00%
       SCRATCH_X:          2 KB         4 KB     50.00%
Converted to uf2, output size: 1610752, start address: 0x10000000
Wrote 1610752 bytes to build-adafruit_feather_rp2040/firmware.uf2

________________________________________________________
Executed in    9.03 secs    fish           external
   usr time  120.29 secs  315.00 micros  120.29 secs
   sys time   49.70 secs   38.00 micros   49.70 secs
#

-j 1 is 82.28 seconds
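
The `time` output above also says how well the build kept the CPU busy: dividing CPU time (usr + sys) by elapsed time gives the average number of busy hardware threads. A small sketch using the numbers from the -j 32 run and the -j 1 figure; the function name is arbitrary.

```python
def average_busy_threads(user_s, sys_s, real_s):
    """Average number of hardware threads kept busy over the run."""
    return (user_s + sys_s) / real_s


if __name__ == "__main__":
    # Values copied from the fish `time` output above (-j 32 build).
    busy = average_busy_threads(user_s=120.29, sys_s=49.70, real_s=9.03)
    print(f"average busy threads at -j 32: {busy:.1f}")
    # Elapsed-time speedup of the -j 32 build over the reported -j 1 build.
    print(f"speedup over -j 1: {82.28 / 9.03:.1f}x")
```

That works out to roughly 19 threads busy on average and about a 9x speedup in elapsed time.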

broken torrent
#

so 9 seconds on this 32-thread (16-core) machine vs 24 seconds on my 6-core i7-8700

keen mural
pastel creek
#

haha, oops. I copied them both

#

AMD also has the large cache variants

#

the X3D chips

keen mural
#

I'm thinking filesystem cache. Would be interesting to see back to back builds after a boot.

pastel creek
#

gets out the new laptop

pastel creek
#

Laptop took 22.89 seconds with -j 16

broken torrent
#

what CPU is in the laptop?

pastel creek
#

AMD 7840U

broken torrent
#

following up on this thread from over two years ago. I finally replaced my Intel i7-8700 machine (6 cores, 12 threads) with an Ultra 7 265 machine (8 performance cores, 12 efficiency cores). Here are some CircuitPython build-time comparisons. A single performance core on the 265 is about 1.8 times faster than an i7-8700 core. These builds use the automatic -j<n>, where <n> is the number of cores. For each board, the first set of timings is from the i7-8700 and the second is from the 265:

$ make BOARD=trinket_m0
real    0m20.804s
user    1m4.790s
sys    0m12.647s

real    0m7.225s
user    0m29.888s
sys    0m6.702s

$ make BOARD=feather_m4_express
real    0m47.506s
user    3m54.140s
sys    0m38.988s

real    0m11.424s
user    1m34.986s
sys    0m20.907s

$ make BOARD=adafruit_metro_esp32s3
real    1m21.665s
user    6m57.475s
sys    2m57.367s

real    0m26.298s
user    3m12.123s
sys    1m12.385s

So the new machine is 2.9-4.2 times faster in elapsed time. The old machine has 24GB RAM; the new only has 16GB (due to the exorbitant cost of RAM right now).
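
The ratios can be reproduced from the `real` times above. The sketch below parses the `XmY.YYYs` format printed by `time` and divides the slower run by the faster one for each board; the dictionary just restates the figures quoted in this message, and the helper names are arbitrary.

```python
import re


def parse_time(text):
    """Convert a bash-style `time` value such as "1m21.665s" to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (slower real time, faster real time) for each board, copied from above
REAL_TIMES = {
    "trinket_m0": ("0m20.804s", "0m7.225s"),
    "feather_m4_express": ("0m47.506s", "0m11.424s"),
    "adafruit_metro_esp32s3": ("1m21.665s", "0m26.298s"),
}


def speedup(slow, fast):
    """Ratio of elapsed times: how many times faster the second run was."""
    return parse_time(slow) / parse_time(fast)


if __name__ == "__main__":
    for board, (slow, fast) in REAL_TIMES.items():
        print(f"{board}: {speedup(slow, fast):.2f}x")
```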

The SSDs on the two machines have very similar performance, I believe.

I bought an open-box lower-end Dell machine from Best Buy for about $550: no graphics card, etc.

For compilation speed, this was worth the upgrade. For typical purposes like web browsing, there is no obvious performance difference.