I made my own custom SIMD wrapper because I wanted more control over the SIMD instructions than std::simd offers. However, for some reason the compiler seems to schedule my SIMD instructions worse: it uses a register again immediately after it was just written, without regard for dependency chains. The same code (or nearly exactly the same) gets scheduled very differently when using std::simd.
The first attached image is the .ll of the code using my custom SIMD wrapper, which calls the x86 vendor-specific intrinsics directly (no LLVM intrinsics or core intrinsics). The second attached image is the same code but using std::simd. You can clearly see that the second one has grouped the instructions together logically, which (I think) lets it work on more things simultaneously. In any case, it runs a good amount faster despite using the same instructions, apart from using a logical shift instead of an arithmetic shift, which shouldn't matter anyway. My version actually compiles to fewer assembly instructions than the std::simd one, but it's still slower.
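To spell out why I don't think the shift difference matters: as far as I understand it, a logical and an arithmetic right shift give the same result whenever the shifted value is non-negative, and only differ when the sign bit is set. Here is a rough scalar sketch of that, not my actual code. The helper names `shr_arithmetic`, `shr_logical` and `demo` are made up for illustration, and plain `[i32; 4]` arrays stand in for SIMD lanes.

```rust
/// Arithmetic right shift on each lane: copies of the sign bit come in from the left.
/// `n` must be less than 32.
fn shr_arithmetic(lanes: [i32; 4], n: u32) -> [i32; 4] {
    lanes.map(|x| x >> n)
}

/// Logical right shift on each lane: zeros come in from the left.
/// `n` must be less than 32.
fn shr_logical(lanes: [i32; 4], n: u32) -> [i32; 4] {
    lanes.map(|x| ((x as u32) >> n) as i32)
}

/// Hypothetical check: the two shifts agree on non-negative lanes only.
fn demo() {
    // Non-negative lanes: both shifts produce identical results.
    let non_negative = [0, 1, 1024, i32::MAX];
    assert_eq!(shr_arithmetic(non_negative, 3), shr_logical(non_negative, 3));

    // A negative lane: the results differ in that lane.
    let with_negative = [-8, 1, 1024, i32::MAX];
    assert_ne!(shr_arithmetic(with_negative, 3), shr_logical(with_negative, 3));
}
```

So if any lane can be negative, the two versions aren't strictly equivalent, but that would change the results rather than the scheduling.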
I'm very confused about how std::simd is doing this so much better. Is it because it's using the core intrinsics? Does LLVM just do better scheduling with those? Is there some way I can get the same behavior for my SIMD wrapper?