#hipEngine: ROCm-native local LLM inference for RDNA3/3.5

1 messages · Page 1 of 1 (latest)

round radish
#

Figure I'd share here for anyone with a gfx1100 (7900 XTX or W7900) or gfx1151 (Strix Halo) that wants to kick the tires: https://github.com/shisa-ai/hipEngine

This is a Python/CPP/HIP inference engine that is specifically tuned to run currently, a single mode (Qwen 3.5 MoE ParoQuant) extremely fast on gfx1100/gfx1151. It does not have a PyTorch dependency, instead directly driving native HIP libraries, with >100+ custom fused/unfused kernels.

The upshot to this is that while a very early implementation (initial kernel work started about 3 weeks ago, the hipEngine harness is 1 week old) , performance is quite good. On gfx1151 it beats llama.cpp prefill+decode speed across the board, and on gfx1100 it beats llama.cpp HIP by a healthy margin (prefill+decode), and is a little behind vs llama.cpp Vulkan on decode. It is however >2X prefill at long context.

I've released a ROCm ParoQuant fork and a 19GB Qwen 3.6 35B-A3B PARO packed safetensor model), and there's a FastAPI OpenAI api server as well, but it's brand new software, so I'd be interested in feedback for those that can get it working/testiing.

Developers may also be interested in:

compact stump
#

Hi, the prefill performance looks promising. If you have a 7900 XTX or Strix Halo machine and want to push the limits of local MoE inference on AMD hardware, this is worth trying. Obviously as it is an early software bugs are expected. Here are few suggestions from me. 1) Add support for more models. 2) Only gfx1100 and gfx1151. No support for other RDNA3 variants or other architectures, so you can work on that as well. 3) gfx1100 decode is still slightly behind llama.cpp Vulkan on decode speed, so for pure decode-heavy workloads Vulkan may still win. Hope these are helpfull, thanks again for sharing, love to see the community filled with such interesting projects.

round radish
#

INT8 kvcache support has been added. It is possible to run Qwen 3.6's full 256K context window in <24GiB of memory.

@compact stump your clanker's suggestions are in fact not useful. The code is open source so you (or anyone else) can work on additional model and GPU support and submit a PR if you want, however.