Before I get into it, some context: I daily drive Linux and can confirm the issue myself, and I've found multiple threads from people on Windows having a similar issue.
The latest public version of ROCm (6.4.3) is causing substantial issues with the Ollama LLM backend, which now uses its own engine, with llama.cpp as a fallback. I've confirmed this is caused directly by ROCm 6.4.3: downgrading the Ollama package doesn't fix the issue, but downgrading ROCm does. However, that's not a real fix, especially for those of us on rolling release Linux distros. Partial upgrades = nono. I've attached a few images showing some of the Ollama server behaviour. It appears that something regarding GTT is broken, as ROCm regularly tries to allocate more memory than even exists, at least in this context with Ollama. I haven't directly tried llama.cpp (neither the Vulkan nor the ROCm backend) yet for this specific issue, but I wouldn't be surprised if it's also impacted.
Multiple threads are reporting the same issue on several generations of AMD hardware. It's likely that other hardware is impacted too, but I haven't directly observed that.
The issue at hand:
When running ollama serve, it starts fine until you actually try loading any model (regardless of size, even a 270M model will cause this). At that point it loads the model into VRAM, then "inferences" absolutely nothing. Not a single token. It then immediately aborts the inference altogether, to the point where it even unloads the model from VRAM. **This is not a system resource issue.** I've confirmed that by testing multiple different model sizes & architectures, plus some of the threads linked below mention it too.
It seems to be related to ROCm allocating too much GTT, exceeding my VRAM (I have 24GB, but the GTT allocation in Ollama's debug log attempts to use ~30GB). This is not an issue with any of my models or parameters; it only began happening with the new ROCm release. The result is a segfault: it attempts to access address 0x28, which I assume is in a reserved address space (see screenshot 2).
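If anyone wants to compare their own numbers against what Ollama logs, here's a rough Python sketch that reads the VRAM and GTT counters the amdgpu kernel driver exposes in sysfs (the `mem_info_*` files under `/sys/class/drm/cardN/device/`). It's Linux-only, the function name `read_amdgpu_memory` is just something I made up for this post, and it assumes your card uses the amdgpu driver. It only reports what the kernel sees, so it won't tell you why ROCm asks for what it asks for.

```python
import re
from pathlib import Path


def read_amdgpu_memory(drm_root="/sys/class/drm"):
    """Return {card: {field: bytes}} for cards exposing amdgpu mem_info_* files.

    Assumption: the amdgpu driver is in use; other drivers won't have these files.
    """
    fields = ("vram_total", "vram_used", "gtt_total", "gtt_used")
    result = {}
    for card in sorted(Path(drm_root).glob("card*")):
        # Skip connector entries such as card0-DP-1; keep only cardN.
        if not re.fullmatch(r"card\d+", card.name):
            continue
        info = {}
        for field in fields:
            path = card / "device" / f"mem_info_{field}"
            if path.is_file():
                info[field] = int(path.read_text().strip())
        if info:
            result[card.name] = info
    return result


if __name__ == "__main__":
    for card, info in read_amdgpu_memory().items():
        for field, value in info.items():
            # Values in sysfs are in bytes; print them as GiB for readability.
            print(f"{card} {field}: {value / 2**30:.2f} GiB")
```

Run it once before starting ollama serve and once while a model is trying to load, then compare the GTT and VRAM totals with the sizes in the Ollama debug log.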