#Local LLM
3060 not the best. It will do, but don't expect miracles 🙂
Is there a list of recommended graphics cards somewhere for people to reference?
Just take the best you can get (Nvidia for sure). 4090 is good. 🙂
It sounds like I'm probably not going to be able to shove one of those into a 1L PC 😅
Unfortunately....
That's my problem as well. I'm thinking of buying Jetson Orin Nano Super when it's available.
One thing to think about with the Nano Super is that it's just 60 or so TOPS, and that 8 GB of RAM is shared memory, so the OS is going to use some of it. So you may really only have about 6-7 GB available for the LLM, depending on the OS.
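A rough way to sanity-check that memory budget is sketched below. The bits-per-weight figures, the OS reserve, and the runtime overhead (KV cache, buffers) are all assumptions picked for illustration, not measured values for the Jetson.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (decimal)."""
    return params_billion * bits_per_weight / 8


def fits_in_shared_memory(
    params_billion: float,
    bits_per_weight: float,
    total_gb: float = 8.0,        # board memory, shared between OS and model
    os_reserved_gb: float = 1.5,  # assumption: what the OS keeps for itself
    overhead_gb: float = 1.0,     # assumption: KV cache and runtime buffers
) -> bool:
    """Return True if weights plus overhead fit in what the OS leaves free."""
    available = total_gb - os_reserved_gb
    needed = weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= available


# Assumed figures: a ~3B model at roughly 4.85 bits/weight (4-bit class quant)
# versus an 8B model at roughly 8.5 bits/weight (8-bit class quant).
print(fits_in_shared_memory(3.2, 4.85))
print(fits_in_shared_memory(8.0, 8.5))
```

Longer contexts grow the KV cache, so the overhead number is the one to raise first when testing bigger prompts.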
Exactly. Enough for tinkering, but not enough for using tools on Home Assistant with the context.
But that should still work if you only ask "enclosed" questions without using tools, right? Like, if I feed all my sensor data into my prompt, it would still generate a usable output... just that it can't retrieve information or perform actions on its own.
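A minimal sketch of that "enclosed" approach: all the state goes into the prompt text and the model only has to answer from it. The sensor names, values, and the `build_prompt` helper are hypothetical, and no Home Assistant or LLM API is called here.

```python
def build_prompt(sensors: dict[str, str], question: str) -> str:
    """Embed a snapshot of sensor states in the prompt so no tools are needed."""
    state_lines = [f"- {name}: {value}" for name, value in sorted(sensors.items())]
    return (
        "Current sensor states:\n"
        + "\n".join(state_lines)
        + "\n\nQuestion: "
        + question
        + "\nAnswer using only the states listed above."
    )


# Hypothetical snapshot; in practice this would come from your own export.
snapshot = {
    "sensor.living_room_temperature": "21.5 °C",
    "binary_sensor.front_door": "closed",
    "sensor.outdoor_humidity": "64 %",
}

print(build_prompt(snapshot, "Is any door open right now?"))
```

Every sensor line costs tokens, so on a memory-limited board it probably helps to send only the states relevant to the question.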
Really hope a consumer-friendly machine will be available to do all this. I ordered an NVIDIA Jetson Orin Nano Super for local LLM... didn't have tools in mind when I ordered it....
For running HA and local LLM I really wouldn't want to buy a loud and power hungry graphics card :/
For running a local LLM you have very few choices and they are all power hungry
Well, llama3.2 on Orin Nano Super will be okay I guess. Definitely less than my 3080.
Yes, everything you wrote makes sense. I use LLM exactly like that.
They are really only power hungry when processing commands. A spike there and then back to idle. Gamers who play all day and stream for hours are pegging it that whole time
But yes, at idle they are more power hungry than a Jetson Nano or similar
True, but a lack of power when needed will ruin the experience
Well I ended up ordering 2 x 4090s
remember that you can also run on both cpu and gpu. i run a granite model with 50/50 split between my 1080 and ryzen 3600. works fine and my desktop is still responsive while interacting with the llm, but the fans do turn on...
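For anyone wanting to reproduce a split like that: llama.cpp-based runtimes split the work by offloading a number of transformer layers to the GPU (the `--n-gpu-layers` option in llama.cpp, a GPU offload setting in LM Studio) and keeping the rest on the CPU. The helper below only does the arithmetic; the function name and the layer count of 40 are assumptions for illustration, so check the real count in your model's metadata.

```python
import math


def gpu_layers_for_split(total_layers: int, gpu_fraction: float) -> int:
    """Number of layers to offload to the GPU for a given GPU share (0.0 to 1.0).

    Rounds down so the GPU is never asked to hold more than its share.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0.0 and 1.0")
    return math.floor(total_layers * gpu_fraction)


# Assumed layer count, used here only as an example.
total_layers = 40
print(gpu_layers_for_split(total_layers, 0.5))
```

The result is the value you would pass as the GPU layer count; lowering it frees VRAM for the desktop at the cost of speed.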
What's the t/s on that kind of setup though?
right now i am running a model that's too big. 😄 bartowski/granite-3.0-8b-instruct-GGUF (Q8_0) in LM Studio. my gpu also drives my 49" ultrawide screen and i have a ton of browser tabs open, so it's not exactly idle. I tried the query "make a list of 100 countries and their currencies in MD table use a column for numbering" as a kind of benchmark and i get "4.45 tok/sec • 1485 tokens • 2.61s to first token • Stop: eosFound"
that model seems really good for coding advice though
with Q6_K instead I get 8.23 tokens/second
and with Q4_K_M i get 26.11 tokens/second
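To put those three numbers side by side, here is a small sketch that computes the speedup over Q8_0 and how long the benchmark answer would take at each rate. It assumes the reply is about 1485 tokens in every run, which was only reported for the Q8_0 one. The big jump at Q4_K_M may simply mean more of the model fits in VRAM, but that is a guess, not something measured here.

```python
# Generation rates reported above, in tokens per second.
RATES = {"Q8_0": 4.45, "Q6_K": 8.23, "Q4_K_M": 26.11}

# Assumption: every run produces roughly the same number of tokens.
BENCHMARK_TOKENS = 1485


def speedup(quant: str, baseline: str = "Q8_0") -> float:
    """How many times faster `quant` generates compared to `baseline`."""
    return RATES[quant] / RATES[baseline]


def generation_seconds(quant: str, tokens: int = BENCHMARK_TOKENS) -> float:
    """Time to generate `tokens` tokens, ignoring time to first token."""
    return tokens / RATES[quant]


for name in RATES:
    print(f"{name}: {speedup(name):.2f}x, {generation_seconds(name):.0f} s")
```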
btw, granite models are from IBM/Watson. they are released under the Apache 2.0 license. ibm also discloses their training methodology and training data set.