Would it be normal to expect Ollama (running on another machine) to take 32 seconds to reply? I have a Quadro P2200 backing it up, and it barely holds the suggested model from the Ollama integration page.
Already looking to upgrade the GPU in the next few months, but that seems slow for some basic questions. And I want to make sure the GPU is my bottleneck, not something else in the pipeline.
#32 seconds for ollama
Most probably your context makes Ollama spill over into regular RAM. If the model itself barely fits in VRAM, then there's no room for the additional context that HA provides to the LLM.
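One way to check where the 32 seconds go: the final JSON that Ollama returns from its generate/chat endpoints carries timing fields in nanoseconds (`total_duration`, `load_duration`, `prompt_eval_duration`, `eval_duration`, plus the `prompt_eval_count` and `eval_count` token counts, as far as I recall from the Ollama API docs). A minimal sketch that turns those into seconds and tokens per second is below. The `summarize_timings` helper and the `sample` numbers are made up for illustration, not a real measurement.

```python
def summarize_timings(resp: dict) -> dict:
    """Convert Ollama timing fields (nanoseconds) to seconds and tokens/s."""
    ns = 1e9
    load_s = resp.get("load_duration", 0) / ns            # loading the model
    prompt_s = resp.get("prompt_eval_duration", 0) / ns   # reading the prompt/context
    gen_s = resp.get("eval_duration", 0) / ns             # generating the reply
    total_s = resp.get("total_duration", 0) / ns
    prompt_tokens = resp.get("prompt_eval_count", 0)
    gen_tokens = resp.get("eval_count", 0)
    return {
        "total_s": total_s,
        "load_s": load_s,
        "prompt_s": prompt_s,
        "gen_s": gen_s,
        "prompt_tok_per_s": prompt_tokens / prompt_s if prompt_s else 0.0,
        "gen_tok_per_s": gen_tokens / gen_s if gen_s else 0.0,
    }


# Hypothetical response: numbers invented to show the shape of the data.
sample = {
    "total_duration": 32_000_000_000,
    "load_duration": 2_000_000_000,
    "prompt_eval_count": 6000,
    "prompt_eval_duration": 24_000_000_000,
    "eval_count": 60,
    "eval_duration": 6_000_000_000,
}

if __name__ == "__main__":
    for key, value in summarize_timings(sample).items():
        print(f"{key}: {value:.1f}")
```

If most of the time sits in the prompt phase, the large HA context is the likely culprit; if generation itself is slow, the model is probably running partly on the CPU. Running `ollama ps` on the server while the model is loaded should also show how it is split between CPU and GPU.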
Yeah, that was my concern. Are there some slightly smaller models that could work decently with home assistant?
There are smarter people than me here, but there's no point using anything smaller than 14B at Q4, preferably Q8. All the smaller models have already proven to be hit or miss.
Yeah, there have been many people trying to use Qwen2.5:14b at q4, and it can do some things very well but still seems to struggle with instruction following. Maybe at q8 it would do better, but that would require something like a 3090/4090/5090 card with at least 24GB VRAM to hold everything, unless you are working with a very small set of devices and a small context window.
Here's the calculation for default context size of 8192 with q8 quant on qwen 14b:
https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator if you want to see the VRAM usage of other models and quants/context windows 🙂
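For a rough idea of what such a calculator does, here is a back-of-the-envelope sketch: weights take roughly parameters × bits per weight / 8 bytes, and the KV cache takes 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The Qwen2.5 14B figures used here (about 14.7B parameters, 48 layers, 8 KV heads, head dimension 128) and the bits-per-weight values for q8 and q4 are my assumptions, not numbers from this thread, and runtime overhead such as compute buffers is ignored.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values for every layer and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed model shape for Qwen2.5 14B (check the model card before relying on it).
N_PARAMS = 14.7e9
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 8192

# Assumed effective bits per weight; real quant formats vary a little.
QUANTS = {"q8": 8.5, "q4": 4.85}

if __name__ == "__main__":
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT)
    for name, bits in QUANTS.items():
        weights = weight_bytes(N_PARAMS, bits)
        total = weights + kv
        print(f"{name}: weights {weights / GIB:.1f} GiB, "
              f"KV cache {kv / GIB:.1f} GiB, total {total / GIB:.1f} GiB")
```

Under these assumptions the q8 weights alone already exceed what a 12GB or 16GB card can hold, which is why the 24GB class comes up.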
well, the 3080 Ti in my Windows system whipped up a response in 3 seconds
now that has more VRAM and WAAAY more CUDA cores, and I'm not sure which was more impactful to the speed diff
both tbh
more vram, faster vram, more cuda cores to process the network faster
and probably a higher CUDA version
gotcha, the card I'm looking at is an RTX A2000 12GB
it supports all the encode/decode codecs I'm looking for, and jumps me from 5GB to 12GB VRAM
but CUDA cores only go from 1280 (Quadro P2200) to 3328 (RTX A2000); I also gain tensor cores, which my Quadro does not have
now these cores are certainly newer, but my 3080ti has 10,240
Yeah I mean I use a 4060 Ti and I think that would outdo that card
and has more VRAM
and isn't terribly expensive, think they go for $400 or less new now
actually maybe closer to $500 for the 16GB variant, forgot the 8GB is the default
yeah, I'm trying to keep the card in my server on motherboard power only, but I don't think I have a great reason for that besides space savings