#Mojo dictionary benchmarks

1 messages · Page 1 of 1 (latest)

mint owl
#

I've published a GitHub repository intended as a baseline for comparing the performance of various dictionary implementations in Mojo, with Python and Rust implementations as additional references. You can find it here: https://github.com/dorjeduck/mojo-dictionary-benchmarks.

My hope for this is to serve as a foundation for community discussions on improving Mojo's dictionary performance , especially since current results show Python and Rust implementations with significant performance advantages. The best Mojo approach I am aware of so far is @mossy axle's compact-dict: https://github.com/mzaks/compact-dict

My current focus on dictionaries is driven by the desire to improve the performance of my tokenizer implementation in https://github.com/dorjeduck/minbpe.mojo.

All comments and contributions are most welcome. I quickly put this together to provide a starting point for discussions. Thanks! 🙏

GitHub

A fast and compact Dict implementation in Mojo 🔥. Contribute to mzaks/compact-dict development by creating an account on GitHub.

mossy axle
#

The Rust benchmark is slightly different as the dict type is dic: HashMap<usize, usize> hashing a usize is much cheaper than hashing a string.

#

Generally if we want to benchmark a Dict where key is string and value is int, we should decouple the creation of keys from the actual dictionary operations. In my experiments Compact dict was faster than Python stdlib dict, so my suspicion is that the usage of str(i*2) might lead to a slow down. It woudl be nice if the keys would be generated outside of the benchmark, say by creating a List[String] and the references are used as keys.

#

One more thing regarding CompactDict, there is an upsert method to mutate or add a value for key, this should be a bit faster as it translates to only one keys lookup (hashing of the key), but one could argue that it makes the benchmark unfair as other dicts do not provide such method.

#

Sorry if I am too direct in the messages. I really like the idea of the benchmark, great work!

mint owl
#

thanks @mossy axle i am more than happy to accept any pull request if you have the time to improve this repo - as said i just put it quickly together as starting point.

mossy axle
#

I am a bit behind on multiple things, but I will try to contribute to this repo!

reef citrus
#

I think it's also probably reasonable to give Rust -C target-cpu=native, since Mojo is doing the same. Ideally, we should also compare with a standard hash algorithm, since the default Rust one is quite a bit slower than ahash.

#

@mossy axle Did you want to take a pass at mojo ahash using llvm intrinsics or should I?

mint owl
#

@reef citrus very welcome to make a pull request for that or i do it if you prefer

reef citrus
#

I'll do it because I want to make some modifications to the benchmark script to make it more consistent.

#

FIFO scheduling is great for benchmarking because you remove scheduler overhead.

mint owl
#

very happy for all contributions as i am not an expert on these matters at all , just saw a need to look into dictionary performance.

reef citrus
#

Ideally, we want to break out the raw hash table so people can use that.

young topaz
young topaz
reef citrus
#

Mojo defaults to whatever CPU you have, which for me is x86-64-v4 + some more stuff, which has AVX-512.

#

Thats a MASSIVE capability difference.

reef citrus
reef citrus
#

Also, I have an implementation of the x86 AES intrinsics for std. I need to spend some more time with ahash to figure out where I can safely make use of 4x wider instructions. Code-gen is very close to perfect for larger spans of known size, so I'll do some benchmarking to find the performance cliff so I can generate unrolled versions of the processing loop.

@mossy axle What are your thoughts on having a fixed-width write in the Hasher trait (probably both SIMD and UnsafePointer with a compile-time known size)? I know a lot of cases where the a key can be reliably constrained to 256 or 512 bytes, which translate to 4 and 8 instructions respectively with avx512. A SmallKeyDict with keys that are a maximum of N bytes could do wonders for the codegen here.

mossy axle
mint owl
#

just played around with creating the String keys outside the benchmark and as @mossy axle already assumed, it changes the picture - i will update the repo later today

mint owl
#

Updated the progs as suggested by @mossy axle (key generation outside the benchmark) - wow, compact-dict rocks mojo

half shell
#

Maybe I am missing something but, why not just change the stdlib Dict code to match the compact dict ones? Is there any caveat?

young topaz
half shell
mint owl
#

wonderful to hear that all this knowledge is flowing into stdlib dict, thanks to everyone involved in this 🙏 mojo

reef citrus
# half shell About the AHash algorithm, I think we are almost there, as recently this PR from...

It's important to note that Maxim implemented the fallback algorithm, not the real one. The fallback is good, but the real one backed by hardware AES instructions is quite a bit faster. I'm still working on implementing all the plumbing in std to enable it, but the speedup is generally proportional to the vector width of the system plus some more based on how efficient the fallback instructions were.

#

One area went from ~26 cycles to 3 cycles on Zen 5.

young topaz
reef citrus
#

Does anyone know if there are any Mojicians who are reasonably qualified cryptographers? I can roll my own software AES fallback, but if std is going to have an AES implementation then it probably should be a good one.

young topaz
#

I mean you could get the implementation on supported platform in first and leave everything else constrainted[False]

reef citrus
young topaz
#

Don't hand roll your fallback. Blow up in the users face at compile time is safer

reef citrus
#

The newest processor from intel that DOESN'T support this is the broadwell pentiums/celerons (2014). For AMD pre-bulldozer doesn't have it.

#

ARM is where the fallback is more likely due to a lot of CPUs not implementing the cryptographic extensions.

#

Apple is frustratingly vague about whether apple silicon has FEAT_AES.

young topaz
#

I think you will need to add a bunch of functions to sys.info. We can get those in first

reef citrus
#

hw.optional.arm.FEAT_AES

young topaz
#

My point is, you don't have to provide a complete solution all at once, it's more difficult to review

reef citrus
#

That's true.

#

This is already going to be a fairly big PR since I made x86 vector width agnostic.

young topaz
#

I can imagine

reef citrus
#

It's not that bad, only 95 lines for the x86 implementation, but that was accomplished with rather dense metaprogramming I should probably comment more.

young topaz
#

You could make a draft PR first and let's figure out what's the best way to get that in

reef citrus
#

I'll do that once I have better comments. Some parts of this deserve better explanation:

@parameter
for position in range(start_offset, (width // instruction_width) * instruction_width, instruction_width):
  var out = llvm_intrinsic[
        intrinsic,
        type=SIMD[DType.uint64, instruction_width],
        has_side_effect=False
  ](data.slice[instruction_width, offset=position](), round_keys.slice[instruction_width, offset=position]())

  @parameter 
  for offset in range(instruction_width):
    output[position + offset] = out[offset]

The copy loop is only because pop.insert was doing weird things on Zen 5, using 256 bit stores.

young topaz
#

(width // instruction_width) * instruction_width use _align_down

reef citrus
#

I may also get some grief for making ARM go up to the level of x86 by chaining 2 intrinsics to do a full round of AES.

reef citrus
#

We should probably have a lint for "manual align down", since a lot of people used to writing SIMD will write what I did.

young topaz
#

And I'd really hope you used shorter variable names. 🙃

reef citrus
# young topaz And I'd really hope you used shorter variable names. 🙃

I'd rather have longer variable names to make it vaguely self-documenting (yes I will still add comments). I'll shrink things if I run into the 80 character limit too often. I also want to bring up moving to 88 characters at some point because that cuts down on the number of lines which need to wrap a lot. Those extra 8 characters dropped 5k lines from a ~100kloc python codebase I worked on.

young topaz
reef citrus
reef citrus
# young topaz I'm a believer that mathy code should look like math. Plus I'm a functional prog...

I tend to move further away from math as what I am doing becomes more complex unless I am implementing a formula. I write a decent amount of low-level code, and I have a strong preference for mmio_nic_cmd_queue_doorbell_register over reg because you already have enough stuff going on in your head trying to keep the state of the driver as you are manipulating it without needing to track the meaning of variables. This is especially true in drivers which often have lots of magic constants.

cedar hemlock
#

I also want to bring up moving to 88 characters at some point

#

Yes plz @reef citrus. The frequent spilling of function signatures over multiple lines is such an eyesore!

reef citrus
#

Doing math in the type system has some “fun” consequences in Mojo.

warm gyro
#

@mint owl any interest in sharing this work during our community meeting/showcase on October 21st?

mint owl