#Monomorphization slows down code?

bright locust
#

I’m building a kdtree library, and my build and query times slow down when I introduce generics over several float types

For now I'm looking only at the tree build time via into_tree, because it needs the fewest changes: I simply changed CP32 to a generic T: PartialOrd, and the code runs about 6% slower.
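To make the change concrete, here is a minimal sketch of a median split written generically over `T: PartialOrd`. The function name `split_at_median` and the `[T; 3]` point layout are assumptions for illustration, not code from the library. The compiler emits a separate copy of such a function for each concrete `T` it is used with (monomorphization), so in principle each copy can be optimized like a hand-written concrete version.

```rust
/// Hypothetical helper (not from the library): partially sorts `points`
/// so that the element at the median index is in its sorted position
/// along `axis`, with smaller-or-equal elements before it and
/// greater-or-equal elements after it.
///
/// Returns the median index, or `None` for an empty slice.
/// Panics if `axis >= 3` or if two coordinates cannot be compared (NaN).
fn split_at_median<T: PartialOrd>(points: &mut [[T; 3]], axis: usize) -> Option<usize> {
    if points.is_empty() {
        return None;
    }
    let mid = points.len() / 2;
    // `partial_cmp` returns an Option because floats are only partially
    // ordered; the `expect` documents the "no NaN" assumption.
    points.select_nth_unstable_by(mid, |a, b| {
        a[axis]
            .partial_cmp(&b[axis])
            .expect("coordinates must be comparable (no NaN)")
    });
    Some(mid)
}
```

Calling it as `split_at_median(&mut points_f32, 0)` and `split_at_median(&mut points_f64, 0)` would produce two independent instantiations.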

These are the build/build and build/build_f32 benchmarks run with RUSTFLAGS="-C target-cpu=native" cargo bench

(this compares generic vs main, i.e. main is 6% faster so generic is ≈6% slower)

build/build             time:   [3.0397 ms 3.0498 ms 3.0706 ms]
                        change: [-7.4159% -6.8037% -6.1435%] (p = 0.00 < 0.05)
                        Performance has improved.
build/build_f32         time:   [1.8228 ms 1.8253 ms 1.8280 ms]
                        change: [-2.7180% -2.2652% -1.8788%] (p = 0.00 < 0.05)
                        Performance has improved.
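A relative change measured in one direction is not the mirror image of the change measured in the other direction, which matters when reading "main is X% faster" as "generic is X% slower". The sketch below shows the conversion; `inverse_change` is a hypothetical helper name and the inputs are fractions rather than percentages.

```rust
/// Convert the relative change of A measured against baseline B into
/// the relative change of B measured against baseline A.
/// Values are fractions, e.g. -0.068 means "-6.8%".
fn inverse_change(change: f64) -> f64 {
    // If A = B * (1 + change), then B = A / (1 + change).
    1.0 / (1.0 + change) - 1.0
}
```

For the build/build estimate above, `inverse_change(-0.068037)` comes out at roughly 0.073, so "main is 6.8% faster" corresponds to "generic is about 7.3% slower".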

The query code, which adds generics over the input types for queries, is more complicated to review and includes a few other changes; it takes 20-30% more time.

https://github.com/cavemanloverboy/bosque/compare/generic?expand=1

#

here is a more minimal summary of changes that makes the tree build generic

#

and to substantiate the claim, here is another measurement in the other direction

build/build             time:   [3.3711 ms 3.3765 ms 3.3814 ms]
                        change: [+9.3331% +10.029% +10.719%] (p = 0.00 < 0.05)
                        Performance has regressed.
build/build_f32         time:   [1.8636 ms 1.8699 ms 1.8816 ms]
                        change: [+2.4101% +3.8233% +6.0918%] (p = 0.00 < 0.05)
                        Performance has regressed.
young olive
#

try opt-level = "z" and enabling LTO

slender juniper
#

are you using criterion for the benchmarks?

#

sometimes making a function generic can influence how much it gets inlined. a non-generic function always gets generated once, whereas a generic function only gets generated where it is called (with its concrete type), which may make inlining a little more likely
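The difference described here can be sketched with two versions of the same function. The names `squared_distance_f32` and `squared_distance` are hypothetical and not taken from the library. The non-generic version is compiled in the crate that defines it, so callers in other crates may not be able to inline it unless it is marked `#[inline]` or LTO is enabled; the generic version is instantiated in the crate that uses it with a concrete type, which makes its body available to the optimizer there.

```rust
use std::ops::{Add, Mul, Sub};

/// Non-generic: one copy, compiled in the defining crate.
/// `#[inline]` is a hint that also makes the body available to
/// callers in other crates.
#[inline]
pub fn squared_distance_f32(a: &[f32; 3], b: &[f32; 3]) -> f32 {
    let (dx, dy, dz) = (a[0] - b[0], a[1] - b[1], a[2] - b[2]);
    dx * dx + dy * dy + dz * dz
}

/// Generic: a copy is generated for each concrete `T` at the point of
/// use, so the calling crate sees the body and can inline it.
pub fn squared_distance<T>(a: &[T; 3], b: &[T; 3]) -> T
where
    T: Copy + Sub<Output = T> + Mul<Output = T> + Add<Output = T>,
{
    let (dx, dy, dz) = (a[0] - b[0], a[1] - b[1], a[2] - b[2]);
    dx * dx + dy * dy + dz * dz
}
```

Whether either version is actually inlined is up to the optimizer, so a change like this can shift performance in either direction.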

serene solstice
#

without taking a closer look, it could also be code bloat filling up the instruction cache
as B3NNY said, inlining noise is also always a possibility with any change

torn nest
#

#[inline(never)] on it and see what happens
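As a sketch of this experiment: placing `#[inline(never)]` on the function under suspicion in both the concrete and the generic branch asks the compiler to keep it as a real call in both, which takes inlining decisions out of the comparison. The function `sum_of_squares` below is a hypothetical stand-in, not the library's code, and the attribute is a hint rather than a guarantee.

```rust
use std::ops::{Add, Mul};

/// Hypothetical hot function. `#[inline(never)]` asks the compiler to
/// keep each monomorphized copy as a standalone function.
#[inline(never)]
fn sum_of_squares<T>(values: &[T]) -> T
where
    T: Copy + Default + Add<Output = T> + Mul<Output = T>,
{
    // `T::default()` is zero for the built-in numeric types.
    values.iter().fold(T::default(), |acc, &v| acc + v * v)
}
```

If the gap between the two branches shrinks or disappears with the attribute in place, inlining differences are a likely cause; if it stays, the cause is probably elsewhere.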

#

Also, I see an lto = true. Try lto = "thin"

bright locust
#

there are only 3 types for which these functions are used btw: f32, f64, and a custom compressed float CP32

thin grove
#

On my system there's no difference in performance detected (in fact main is 0.27% to 1.85% slower than generic, though criterion doesn't think it's statistically significant). Your benchmark system is probably not very stable.

Here are some things to try:

  • disable SMT/hyper-threading
  • disable core frequency boosting and pin the CPU frequency to a specific value
  • ensure your CPU is not thermal throttling
  • ensure there's nothing else running on the system (web browsers, and browser-based apps like VS Code, are terrible)
  • give the benchmark a niceness of -20
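One way to check whether steps like these are helping is to time the same unchanged workload repeatedly and look at the spread. The sketch below uses only the standard library; `time_runs` and `coefficient_of_variation` are hypothetical helper names, and this is a rough stability check rather than a replacement for criterion's statistics.

```rust
use std::time::Instant;

/// Run `work` `runs` times and return each wall-clock duration in seconds.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<f64> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_secs_f64()
        })
        .collect()
}

/// Sample standard deviation divided by the mean.
/// Returns `None` for fewer than two samples or a zero mean.
fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean)
}
```

For example, `coefficient_of_variation(&time_runs(30, || { /* build the tree */ }))` gives a relative spread for 30 identical runs. If that spread is comparable to the regression being investigated, the measurement setup is probably too noisy to trust the comparison.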
bright locust
#

on an AMD EPYC on a computing cluster (nothing else is running; I have exclusive access to this compute node), I got the following results (some omitted to keep the post short):

build/build             time:   [4.5284 s 4.6006 s 4.7057 s]
                        change: [+18.352% +21.554% +25.259%] (p = 0.00 < 0.05)
                        Performance has regressed.
build/build_f32         time:   [2.5496 s 2.5627 s 2.5776 s]
                        change: [-9.2576% -8.5813% -7.8888%] (p = 0.00 < 0.05)
                        Performance has improved.
query_nearest/query     time:   [3.0764 s 3.0790 s 3.0814 s]
                        change: [+9.4986% +9.6279% +9.7646%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/query_par time:   [104.52 ms 105.31 ms 106.17 ms]
                        change: [+2.4514% +4.1084% +5.7435%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32  time:   [1.6714 s 1.6742 s 1.6771 s]
                        change: [+3.1144% +3.3475% +3.5817%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32_par
                        time:   [62.639 ms 63.250 ms 63.959 ms]
                        change: [-0.6636% +1.1376% +2.9097%] (p = 0.21 > 0.05)
                        No change in performance detected.
query_nearest/query_periodic
                        time:   [3.0774 s 3.0816 s 3.0860 s]
                        change: [+12.636% +12.959% +13.251%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32_periodic
                        time:   [1.6857 s 1.6894 s 1.6932 s]
                        change: [+6.0268% +6.4027% +6.7685%] (p = 0.00 < 0.05)
query_k/query_k         time:   [11.780 s 11.811 s 11.835 s]
                        change: [+59.072% +60.224% +61.298%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_k/query_k_f32     time:   [16.479 s 16.509 s 16.551 s]
                        change: [+89.791% +90.508% +91.231%] (p = 0.00 < 0.05)
                        Performance has regressed.

for the most part, performance regressed significantly

thin grove
#

did you pin the core frequency?
is this a VM?

#

read "Reducing randomness in benchmarks"

bright locust