#Compiler Suboptimal SIMD Instruction Scheduling


sand herald
#

i made my own custom simd wrapper because i wanted more control over the simd instructions than std::simd offers. however, for some reason the compiler seems to schedule my simd instructions worse, reusing registers immediately after they were just used without regard for dependency chains. the same code (or very nearly the same) gets scheduled very differently when using std::simd.

the first attached image is the .ll of the code using my custom simd wrapper, which uses intrinsics directly (x86 vendor-specific intrinsics, not llvm intrinsics or core intrinsics). the second attached image is the same code but using std::simd. you can clearly see that the second one has logically grouped the instructions together, which allows it to work on more things simultaneously (i think). in any case, it performs a good amount faster despite using the same instructions (apart from using a logical shift instead of an arithmetic shift, which shouldn't matter anyways). my version actually compiles into fewer assembly instructions than the std::simd one, but it's still slower despite that.

im very confused on how std::simd is doing this so much better. is it because it's using the core intrinsics? does llvm just do better scheduling with those? is there some way i can get that same behavior for my simd wrapper?
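for context on the shift remark, here is a rough sketch of what a thin wrapper over the x86 vendor intrinsics can look like. the original wrapper isn't shown in the thread, so `I32x4`, `sra` and `srl` are made-up names, and it assumes an x86_64 target (where SSE2 is always available). it only illustrates that arithmetic and logical right shifts give the same lanes when every lane is non-negative, and differ once a lane is negative.

```rust
use std::arch::x86_64::*;

/// Hypothetical 4 x i32 wrapper over an SSE2 register (not the poster's actual type).
#[derive(Clone, Copy)]
pub struct I32x4(__m128i);

impl I32x4 {
    /// Build a vector with `a` in the lowest lane.
    pub fn new(a: i32, b: i32, c: i32, d: i32) -> Self {
        // SSE2 is part of the x86_64 baseline, so no runtime feature check is done here.
        I32x4(unsafe { _mm_setr_epi32(a, b, c, d) })
    }

    /// Copy the lanes out, lowest lane first.
    pub fn to_array(self) -> [i32; 4] {
        // __m128i and [i32; 4] are both 16 bytes wide.
        unsafe { std::mem::transmute(self.0) }
    }

    /// Arithmetic shift right by N bits: the sign bit is replicated.
    pub fn sra<const N: i32>(self) -> Self {
        I32x4(unsafe { _mm_srai_epi32::<N>(self.0) })
    }

    /// Logical shift right by N bits: zeros are shifted in.
    pub fn srl<const N: i32>(self) -> Self {
        I32x4(unsafe { _mm_srli_epi32::<N>(self.0) })
    }
}
```

so swapping one shift for the other is only harmless if the shifted lanes are known to be non-negative, e.g. `I32x4::new(16, 256, 0, 7)` gives identical results for `sra::<2>()` and `srl::<2>()`, while a lane holding `-16` does not.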

summer dagger
#

I don’t know what I’m talking about, but would it make sense that LLVM has minimal awareness of the behaviour of specific instructions and doesn’t do a good job of optimising them, but has a very good understanding of its own LLVM simd mechanisms and can optimise those better?

sand herald
#

i think i might have fixed it! i explicitly inlined absolutely everything. it didn't change the assembly, but performance is better? im currently on my laptop not plugged in so im not using the same clock speed as earlier but im getting the same performance with both now?

#

i didn't think to explicitly inline because the compiler was clearly already doing that, but it's better to be explicit about it anyways
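a minimal sketch of what "explicitly inlining everything" can look like for a wrapper like this, assuming an x86_64 target. `F32x4` and `axpy` are hypothetical names, not the code from the thread. `#[inline(always)]` is a strong request rather than a guarantee, and since the assembly reportedly didn't change here, this shows the annotation pattern and is not a claim about why the timings moved.

```rust
use std::arch::x86_64::*;

/// Hypothetical 4 x f32 wrapper over an SSE register.
#[derive(Clone, Copy)]
pub struct F32x4(__m128);

impl F32x4 {
    /// Broadcast one value to all four lanes.
    #[inline(always)]
    pub fn splat(x: f32) -> Self {
        // SSE is part of the x86_64 baseline.
        F32x4(unsafe { _mm_set1_ps(x) })
    }

    /// Unaligned load from an array.
    #[inline(always)]
    pub fn from_array(a: &[f32; 4]) -> Self {
        F32x4(unsafe { _mm_loadu_ps(a.as_ptr()) })
    }

    /// Unaligned store back into an array.
    #[inline(always)]
    pub fn to_array(self) -> [f32; 4] {
        let mut out = [0.0f32; 4];
        unsafe { _mm_storeu_ps(out.as_mut_ptr(), self.0) };
        out
    }

    /// Lane-wise self * b + c, as a separate multiply and add (not a fused FMA).
    #[inline(always)]
    pub fn mul_add(self, b: Self, c: Self) -> Self {
        F32x4(unsafe { _mm_add_ps(_mm_mul_ps(self.0, b.0), c.0) })
    }
}

/// Computes a * x + y lane-wise using only the wrapper's methods.
pub fn axpy(a: f32, x: &[f32; 4], y: &[f32; 4]) -> [f32; 4] {
    F32x4::splat(a)
        .mul_add(F32x4::from_array(x), F32x4::from_array(y))
        .to_array()
}
```

the annotation matters most when the wrapper lives in a different crate from the hot loop, since small non-generic functions without `#[inline]` may not be available for inlining across crates unless LTO is on.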