Monomorphization slows down code? | Rust Programming Language Community | Page 1

bright locust May 29, 2023, 3:42 PM

#

I’m building a k-d tree library, and my build and query times slow down when I introduce generics over several float types

For now I'm looking only at the tree build time using into_tree, because it involves the fewest changes: I simply changed CP32 to some generic T: PartialOrd and the code runs 6% slower.

These are the build/build and build/build_f32 benchmarks run with RUSTFLAGS="-C target-cpu=native" cargo bench

(this compares generic vs main, i.e. main is 6% faster so generic is ≈6% slower)

build/build             time:   [3.0397 ms 3.0498 ms 3.0706 ms]
                        change: [-7.4159% -6.8037% -6.1435%] (p = 0.00 < 0.05)
                        Performance has improved.
build/build_f32         time:   [1.8228 ms 1.8253 ms 1.8280 ms]
                        change: [-2.7180% -2.2652% -1.8788%] (p = 0.00 < 0.05)
                        Performance has improved.

The query code adds generics over the input types for queries. It is more complicated to review and includes a few other changes, and it adds 20-30% more time.

https://github.com/cavemanloverboy/bosque/compare/generic?expand=1

GitHub

Comparing main...generic · cavemanloverboy/bosque


#

https://github.com/cavemanloverboy/bosque/commit/a99090ee9f668dd4ea19fec0e2ee10d2c21ba5d5

GitHub

make tree build generic · cavemanloverboy/bosque@a99090e

#

here is a more minimal summary of changes that makes the tree build generic

#

and to substantiate the claim, here is another measurement in the other direction

build/build             time:   [3.3711 ms 3.3765 ms 3.3814 ms]
                        change: [+9.3331% +10.029% +10.719%] (p = 0.00 < 0.05)
                        Performance has regressed.
build/build_f32         time:   [1.8636 ms 1.8699 ms 1.8816 ms]
                        change: [+2.4101% +3.8233% +6.0918%] (p = 0.00 < 0.05)
                        Performance has regressed.
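
A side note on reading the two criterion runs above: `change` is reported relative to whichever branch was measured first, so a slowdown and the matching speedup are not mirror-image percentages. The sketch below converts one direction into the other; the helper name `inverse_change` is made up for illustration and is not part of criterion.

```rust
/// Converts a relative change measured as "B vs A" (in percent) into the
/// equivalent "A vs B" change. Hypothetical helper, not part of criterion.
pub fn inverse_change(percent: f64) -> f64 {
    // Ratio time_B / time_A implied by the reported change.
    let ratio = 1.0 + percent / 100.0;
    // Flip it to time_A / time_B and express that as a percent change.
    (1.0 / ratio - 1.0) * 100.0
}
```

If the code change were the only difference, the -6.80% from the first run would correspond to roughly +7.30% in the other direction. The measured +10.03% is higher than that, which may mean some run-to-run noise sits on top of the real difference.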

young olive May 30, 2023, 4:38 AM

#

try opt-level = "z" and using LTO

slender juniper May 30, 2023, 4:46 AM

#

are you using criterion for the benchmarks?

#

sometimes making a function generic can influence how much it gets inlined: a non-generic function always gets generated, whereas a generic function only gets generated where it is called (once per concrete type), which may make inlining a little more likely
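
A minimal sketch of the difference described here, using made-up functions (`max_f32` and `max_generic` are stand-ins, not bosque's code). The non-generic version is compiled once in its defining crate. The generic version has no machine code until a caller picks a concrete `T`, and each `T` in use gets its own copy in the crate that uses it, where the optimizer can see both caller and callee. How much this matters with `lto = true` is an open question; this only illustrates the mechanism.

```rust
/// Non-generic: compiled once, in the crate that defines it.
/// Hypothetical stand-in, not bosque's code.
pub fn max_f32(data: &[f32]) -> Option<f32> {
    let mut iter = data.iter().copied();
    let mut best = iter.next()?; // empty slice -> None
    for x in iter {
        if x > best {
            best = x;
        }
    }
    Some(best)
}

/// Generic: instantiated separately for every concrete `T` that callers use
/// (monomorphization), so each instance is optimized on its own.
pub fn max_generic<T: PartialOrd + Copy>(data: &[T]) -> Option<T> {
    let mut iter = data.iter().copied();
    let mut best = iter.next()?; // empty slice -> None
    for x in iter {
        if x > best {
            best = x;
        }
    }
    Some(best)
}
```

Calling `max_generic` with `f32` and `f64` slices produces two separate instances, and the compiler makes inlining decisions for each one independently.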

serene solstice May 30, 2023, 5:51 AM

#

without taking a closer look, it could also be code bloat filling up the instruction cache
inlining noise, as B3NNY said, is also always a possibility with any change

torn nest May 30, 2023, 8:12 AM

#

#[inline(never)] on it and see what happens

#

Also, I see an lto = true. Try lto = "thin"

bright locust Jun 1, 2023, 12:33 AM

#

slender juniper are you using criterion for the benchmarks?

yes

bright locust Jun 1, 2023, 12:34 AM

#

torn nest `#[inline(never)]` on it and see what happens

on which function exactly? there's an inner recursive function and a public facing entry function
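
For reference, a sketch of the two possible placements, assuming a layout like the one described (a public entry function that calls an inner recursive function). The names `build` and `build_recursive` and the 1-D median split are invented for illustration and are not bosque's actual code. Here the attribute sits on the inner function; moving it to `build` is the other experiment.

```rust
/// Public entry point (hypothetical name).
pub fn build<T: PartialOrd>(points: &mut [T]) {
    build_recursive(points);
}

/// Inner recursive worker (hypothetical name). With `#[inline(never)]`,
/// each monomorphized instance stays a separate function, which makes it
/// easier to compare generic and non-generic builds in a profiler.
#[inline(never)]
fn build_recursive<T: PartialOrd>(points: &mut [T]) {
    if points.len() <= 1 {
        return;
    }
    let mid = points.len() / 2;
    // Partition around the median; panics if a value is incomparable (NaN).
    let (left, _median, right) = points.select_nth_unstable_by(mid, |a, b| {
        a.partial_cmp(b).expect("values must be comparable (no NaN)")
    });
    build_recursive(left);
    build_recursive(right);
}
```

Since this toy version splits down to single elements in one dimension, running it on a slice leaves the slice sorted, which makes it easy to check that both placements behave identically.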

bright locust Jun 1, 2023, 12:35 AM

#

torn nest Also, I see an `lto = true`. Try `lto = "thin"`

can you explain the rationale here? Note that this library by design basically has zero dependencies. With no features (parallel -> rayon, cbindgen for build script) it has zero deps

#

there are only 3 types for which these functions are used btw: f32, f64, and a custom compressed float CP32

thin grove Jun 1, 2023, 7:03 AM

#

On my system there's no difference in performance detected (in fact main is 0.27% to 1.85% slower than generic, though criterion doesn't think it's statistically significant). Your benchmark system is probably not very stable.

Here's some things to try:

disable SMT/hyper-threading
disable core frequency boosting and pin the CPU frequency to a specific value
ensure your CPU is not thermal throttling
ensure there's nothing else running on the system (web browsers like VS Code are terrible)
give the benchmark a niceness of -20
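
One rough way to check whether a machine is stable enough before trusting small differences is to time the same workload repeatedly and look at the spread. This is a std-only sketch, separate from criterion; `sample_ms` and `relative_spread` are hypothetical helper names.

```rust
use std::time::Instant;

/// Runs `f` `runs` times and returns each wall-clock duration in milliseconds.
pub fn sample_ms<F: FnMut()>(mut f: F, runs: usize) -> Vec<f64> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_secs_f64() * 1e3
        })
        .collect()
}

/// (max - min) / median as a rough, unitless measure of run-to-run spread.
/// Uses the upper median for even-length input. Returns None for empty
/// input or a zero median.
pub fn relative_spread(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = sorted[sorted.len() / 2];
    if median == 0.0 {
        return None;
    }
    Some((sorted[sorted.len() - 1] - sorted[0]) / median)
}
```

If the spread for an unchanged workload is already a few percent, differences of a similar size between two branches are hard to attribute to the code. When timing real work, wrapping inputs and results in `std::hint::black_box` helps keep the optimizer from removing it.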

bright locust Jun 1, 2023, 4:30 PM

#

thin grove On my system there's no difference in performance detected (in fact main is 0.27...

interesting. i am on an arm machine btw (M2 Max). I will try it out on an AMD EPYC on a computing cluster and report back. I've never heard of benchmark niceness

bright locust Jun 1, 2023, 6:04 PM

#

thin grove On my system there's no difference in performance detected (in fact main is 0.27...

on an AMD epyc on a computing cluster (nothing else is running, I have exclusive access to this compute node), I got the following results (some omitted to make post shorter):

build/build             time:   [4.5284 s 4.6006 s 4.7057 s]
                        change: [+18.352% +21.554% +25.259%] (p = 0.00 < 0.05)
                        Performance has regressed.
build/build_f32         time:   [2.5496 s 2.5627 s 2.5776 s]
                        change: [-9.2576% -8.5813% -7.8888%] (p = 0.00 < 0.05)
                        Performance has improved.
query_nearest/query     time:   [3.0764 s 3.0790 s 3.0814 s]
                        change: [+9.4986% +9.6279% +9.7646%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/query_par time:   [104.52 ms 105.31 ms 106.17 ms]
                        change: [+2.4514% +4.1084% +5.7435%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32  time:   [1.6714 s 1.6742 s 1.6771 s]
                        change: [+3.1144% +3.3475% +3.5817%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32_par
                        time:   [62.639 ms 63.250 ms 63.959 ms]
                        change: [-0.6636% +1.1376% +2.9097%] (p = 0.21 > 0.05)
                        No change in performance detected.
query_nearest/query_periodic
                        time:   [3.0774 s 3.0816 s 3.0860 s]
                        change: [+12.636% +12.959% +13.251%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_nearest/queryf32_periodic
                        time:   [1.6857 s 1.6894 s 1.6932 s]
                        change: [+6.0268% +6.4027% +6.7685%] (p = 0.00 < 0.05)
query_k/query_k         time:   [11.780 s 11.811 s 11.835 s]
                        change: [+59.072% +60.224% +61.298%] (p = 0.00 < 0.05)
                        Performance has regressed.
query_k/query_k_f32     time:   [16.479 s 16.509 s 16.551 s]
                        change: [+89.791% +90.508% +91.231%] (p = 0.00 < 0.05)
                        Performance has regressed.

for the most part, performance regressed significantly

thin grove Jun 1, 2023, 6:06 PM

#

did you pin the core frequency?
is this a VM?

#

https://bench.cr.yp.to/supercop.html

#

read "Reducing randomness in benchmarks"

bright locust Jun 1, 2023, 8:34 PM

#

thin grove did you pin the core frequency? is this a VM?

i don't think i'm allowed to do this. I am not sudo/admin. This is an HPC cluster w/ SLURM.
