Hey everyone,
There is a massive misconception right now that AMD hardware struggles with local AI compared to the competition. After doing a deep dive, I realized the silicon isn't the problem—the bloated software stacks (like Ollama, generic wgpu wrappers, and standard framework ports) are severely bottlenecking the hardware.
To prove this, I completely bypassed the standard AI stack and built a proprietary, bare-metal inference engine from scratch in Rust (Zelly OS). I pitted it against the standard behavior of tools like Ollama on a highly constrained, 5-year-old legacy GPU: an AMD Radeon Pro 560X (2019).
Here is the architectural benchmark comparison:
❌ The "Standard Stack" (Ollama / generic ports):
The bottleneck: Heavy OS overhead and constant CPU-GPU synchronization between every transformer layer. The PCIe bus is choked by dispatch latency.
The result on constrained AMD silicon: Massive VRAM overhead, unmanageable TTFT (Time-To-First-Token) for multi-turn agentic loops, and throughput that frequently forces fallback to the CPU.
✅ The "Zelly OS" Stack (Custom Bare-Metal Engine):
The fix: I built a custom engine that eliminates CPU-GPU ping-pong entirely. The CPU sits 100% idle during decode, and the GPU handles the entire pipeline internally with strict memory locality.
The benchmark results on the EXACT same 2019 hardware (INT4):
🚀 SmolLM2-135m: 45.8 tokens/sec peak | ~150ms TTFT (via custom session caching) | Only 84 MB VRAM footprint.
🧠 Llama 3.2 1B: 7.5 tokens/sec (Locked steady) | Only 643 MB VRAM footprint.
The Reality Check:
Getting a perfectly stable 7.5 tok/s on Llama 3.2 1B and pushing 45+ tok/s on a 5-year-old Mac GPU proves that AMD hardware is an absolute beast when you remove the software abstractions.
If anyone from the internal AMD AI or hardware architecture team wants to discuss the performance implications of bypassing these standard stacks, I’d love to connect!