#32 seconds for ollama

hollow olive
#

Would it be normal to expect Ollama (running on another machine) to take 32 seconds to reply? I have a Quadro P2200 backing it up, and it barely holds the suggested model on the Ollama integration page.
Already looking to upgrade the GPU in the next few months, but that seems slow for some basic questions. And I want to make sure the GPU is my bottleneck, not something else in the pipeline.
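
One way to check whether the time is being spent inside Ollama rather than elsewhere in the pipeline is to send it a single request directly and read the timing fields it returns. The sketch below assumes Ollama's default port 11434 and its `/api/generate` endpoint with `stream` set to false; the host and model name in the usage note are placeholders, and the helper names `ask_ollama` and `summarize_timings` are made up for this example. Durations in the response are in nanoseconds.

```python
import json
import urllib.request

NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def ask_ollama(host, model, prompt):
    """Send one non-streaming request to /api/generate and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


def summarize_timings(resp):
    """Split a response's timing fields into load, prompt and generation phases."""
    # .get() with a default because some fields may be absent from a response.
    load_s = resp.get("load_duration", 0) / NS_PER_S
    prompt_s = resp.get("prompt_eval_duration", 0) / NS_PER_S
    gen_s = resp.get("eval_duration", 0) / NS_PER_S
    total_s = resp.get("total_duration", 0) / NS_PER_S
    return {
        "load_s": load_s,
        "prompt_s": prompt_s,
        "gen_s": gen_s,
        "total_s": total_s,
        "prompt_tok_per_s": resp.get("prompt_eval_count", 0) / prompt_s if prompt_s else 0.0,
        "gen_tok_per_s": resp.get("eval_count", 0) / gen_s if gen_s else 0.0,
    }
```

Calling `summarize_timings(ask_ollama("my-server", "some-model", "Hello"))` with your own host and model gives a rough split: a large `load_s` points at model loading, a large `prompt_s` at a long prompt being processed slowly, and a low `gen_tok_per_s` at slow generation. This only measures Ollama in isolation, not the Home Assistant side. Running `ollama ps` on the server while the model is loaded should also show how the model is split between CPU and GPU.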

silent raven
#

Most likely your context is making Ollama spill over into regular RAM. If the model itself barely fits in VRAM, then there's no room for the additional context that HA provides to the LLM.

hollow olive
#

Yeah, that was my concern. Are there some slightly smaller models that could work decently with home assistant?

silent raven
#

There are smarter people than me here, but there's no point using anything smaller than 14B at Q4, preferably Q8. Everything smaller has already proven to be hit or miss.

spark cedar
#

Yeah, there have been many people trying to use Qwen2.5:14b at q4, and it can do some things very well, but it still seems to struggle with instruction following. Maybe at q8 it would do better, but that would require something like a 3090/4090/5090 card with at least 24GB VRAM to hold everything, unless you are working with a very small set of devices and a small context window.

#

Here's the calculation for default context size of 8192 with q8 quant on qwen 14b:
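
The calculation posted here is not included in the thread text. As a separate, rough illustration of how such an estimate is commonly done (not a reproduction of the original), the sketch below adds the weight size to the KV cache size. The architecture numbers are assumptions taken from the published Qwen2.5-14B model card (about 14.7B parameters, 48 layers, 8 KV heads, head dimension 128). It also assumes roughly 8.5 bits per weight for a Q8_0 quant and an fp16 KV cache, and it ignores compute buffers and other runtime overhead, so real usage will be somewhat higher.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the key/value cache: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8


# Assumed values, see the note above; none of these come from the thread
# except the 8192-token context size.
N_PARAMS = 14.7e9
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128
Q8_BITS_PER_WEIGHT = 8.5  # 32 int8 weights plus one fp16 scale per block
CONTEXT_TOKENS = 8192

weights = weight_bytes(N_PARAMS, Q8_BITS_PER_WEIGHT)
kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
total = weights + kv

print(f"weights:  {weights / GIB:.1f} GiB")
print(f"kv cache: {kv / GIB:.1f} GiB")
print(f"total:    {total / GIB:.1f} GiB (before runtime overhead)")

# Compare against the card sizes mentioned in this thread.
for vram_gib in (5, 12, 24):
    verdict = "fits" if total <= vram_gib * GIB else "does not fit"
    print(f"{vram_gib} GB card: {verdict}")
```

Under these assumptions the total comes out at around 16 GiB, which lines up with the 24GB suggestion above and shows why a 5GB or 12GB card would push part of the model or context into system RAM.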

hollow olive
#

Well, the 3080 Ti in my Windows system whipped up a response in 3 seconds.

#

Now, that has more VRAM and WAAAY more CUDA cores, and I'm not sure which had more impact on the speed difference.

spark cedar
#

both tbh

#

more vram, faster vram, more cuda cores to process the network faster

#

and probably a higher CUDA version

hollow olive
#

Gotcha, the card I'm looking at is an RTX A2000 12GB.
It supports all the encode/decode codecs I'm looking for, and jumps me from 5GB to 12GB VRAM,
but CUDA cores only go from 1280 (Quadro P2200) to 3328 (RTX A2000). I also gain tensor cores, which my Quadro does not have.
Now, these cores are certainly newer, but my 3080 Ti has 10,240.
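
To put those core counts side by side, here is a quick sketch that uses only the numbers quoted in this thread. Core counts from different GPU generations are not directly comparable, so the ratios are a rough orientation and not a speed prediction.

```python
# CUDA core counts as quoted in this thread.
cuda_cores = {
    "Quadro P2200": 1280,
    "RTX A2000 12GB": 3328,
    "RTX 3080 Ti": 10240,
}

baseline = cuda_cores["Quadro P2200"]
ratios = {name: cores / baseline for name, cores in cuda_cores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {cuda_cores[name]} cores, {ratio:.1f}x the P2200")

# Response times reported earlier in the thread, in seconds.
observed_speedup = 32 / 3
print(f"observed P2200 -> 3080 Ti speedup: {observed_speedup:.1f}x")
```

The A2000 has 2.6x the cores of the P2200 and the 3080 Ti has 8x. The reported 32 s versus 3 s gap is a bit larger than 8x, which would be consistent with the P2200 also losing time to spilling into system RAM, though the thread does not confirm that.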

spark cedar
#

Yeah I mean I use a 4060ti and I think that would outdo that card

#

and has more vram

#

and isn't terribly expensive, think they go for $400 or less new now

#

actually maybe closer to 500 for the 16gb variant, forgot the 8gb is the default

hollow olive
#

Yeah, I'm trying to keep the card in my server on mobo power only, but I don't think I have a great reason for that besides space savings.