Figure I'd share here for anyone with a gfx1100 (7900 XTX or W7900) or gfx1151 (Strix Halo) that wants to kick the tires: https://github.com/shisa-ai/hipEngine
This is a Python/CPP/HIP inference engine that is specifically tuned to run currently, a single mode (Qwen 3.5 MoE ParoQuant) extremely fast on gfx1100/gfx1151. It does not have a PyTorch dependency, instead directly driving native HIP libraries, with >100+ custom fused/unfused kernels.
The upshot to this is that while a very early implementation (initial kernel work started about 3 weeks ago, the hipEngine harness is 1 week old) , performance is quite good. On gfx1151 it beats llama.cpp prefill+decode speed across the board, and on gfx1100 it beats llama.cpp HIP by a healthy margin (prefill+decode), and is a little behind vs llama.cpp Vulkan on decode. It is however >2X prefill at long context.
I've released a ROCm ParoQuant fork and a 19GB Qwen 3.6 35B-A3B PARO packed safetensor model), and there's a FastAPI OpenAI api server as well, but it's brand new software, so I'd be interested in feedback for those that can get it working/testiing.
Developers may also be interested in:
- gfx1100 roofline doc (includes low level kernel-launch analysis): https://github.com/shisa-ai/hipEngine/blob/main/docs/ROOFLINE.md
- gfx1151-specific addendum: https://github.com/shisa-ai/hipEngine/blob/main/docs/ROOFLINE-gfx1151.md
- some of the LESSONS LEARNED grinding RDNA3 kernels: https://github.com/shisa-ai/hipEngine/blob/main/docs/LESSONS-LEARNED.md