#programming
what he posted is ye
not as high as a full dedicated gpu of course, but really quite good
yes it's just iGPU of
how old is the config?
vulkan is quite good for radeon inference compared to rocm, they kinda trade blows
just curious
it's a laptop that's a few years old, like nearly 5
using vulkan backend i can actually split models across a 3080 and a 9700 pro somehow and it works better than either alone (not much better tho)
people are allegedly running 120B models at 40tps on AMD AI MAX chips, which are fully integrated packages https://www.reddit.com/r/LocalLLaMA/comments/1ndoxxa/why_should_i_not_buy_an_amd_ai_max_395_128gb/
this kind of perf is literally just fitting in memory vs not
or fitting more
even if it's slow mem
it's better than more of it being out of the mem pool for copies

which is the whole benefit to these apu packages over gpus with comparatively smaller memory available in pool
wtf its not even that big
for 3500 id take a 5090 after traveling back in time to when they were available for that
today's prices under a crunch lol
i expected some 10k+ shit
minis are great deals
It's just them pushing more memory bandwidth, nothing specific to the iGPU on those
have been for a while now

wait entire mega ai 128 gb mini computer is 3k?
I still wanna do my dumb ass workstation but thats a buy it for life purchase for me
smol tdp
yeah but you said "igpus can't run ML models" here you have an IGPU doing it
is it arm
and doing it really well
oh
slow inference?
smol tdp x86
In reality like ass, but they also have very low tdps.
Nah this is the same TDP version as the Framework mini
the 395+ maxes at like 120w
disassemble it and give it proper cooling?
they are just laptops
I'm more guessing they're running it on the NPU rather than iGPU
the ai max is running it on the igpu
which is a radeon 8060s
Hem
Yeah
it's a unified system, it's all tied together in one chip, all compute and memory, it's probably everything
It HAS npus but it has a dedicated iGPU for inference
Either way it's not gonna work anywhere near as well as a real GPU
for the price it can generally be better
yeah but that wasn't your point was it 😛
lemme just buy a 96G gpu, how expensive could it be 
so 3x4090 48gb modded will be better
Either way I'll just focus on NeuroSynth (tm) vocal synthesizer technology
Me needs
NeuroSynth hungry
amazon says 1 of those is 3k
if you have spare crisp $15k to throw around
96gb ram
Available
To the gpu
superbox try not to mention neurosynth, linux, or kotlin in 5 messages challenge
remember that you need the rest of the system to run those
impossible
not gddr/hbm, doesn't count 
world peace any% speedrun
if i could get that for my wallet i would
that's what the rest of everything else in this goblin hoard room is for
heh heh
nagatoro
do it for her
don't, not this one
im on a roll with self reports last few days
yeah idk if i can fit 3
or does nvlink make pcie useless
do it for takagi
not the forehead
Bro I asked Claude to make a spec list for a workstation to send real quick here and the mfer is coding out a full document
You do still need PCIe
Also no NVL on anything newer than 3090
im talking about theoretically buying 3x4090 48gb mod with nvlink (i will absolutely not do that)
pcie is for connecting cpu and gpu, nvlink for gpu to gpu (unless grace hopper/blackwell)
the forehead loves you, unlike nagatoro
so does it matter if i put 1 gpu into good pcie and 2 others in bad ones
but they are all nvlinked
or its bad
if they're only talking to each other it doesn't matter
There's no NVL on those still, they just removed NVL from the 40 series chips
it wont help anything but youll probably suffer much less
if it has to leave the gpus it will be slower
also nvl was kinda slow
Usually in training you don't have the GPUs talk over PCIe much anyway
buy one AI specialized gpu
anything you can do has already been cannibalized and has become expensive
9k that cant even run 120bs
wait isnt 120b 120gb
it's just gonna be a bit slower
120b is not that big
Quantized or context quantized it probably can
ohhhhh wtf
120b is like 80gb i think
At Q8 with no context yes
i can run 120b with 42 gb of vram
At Q4 it's more like 60GB
is it super slow
wait actually what if i
try to run some big ones

i can try right?
you could fully fit 122B qwen3.5 q5 https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
how do i even know how much memory it needs then
the biggest one i ran so far was a 80b q4 model
and it ran at like 20 tks
can be estimated
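A rough way to do that estimate, as a sketch: multiply the parameter count by the bits per weight of the quant. The function name is made up for illustration, and this only covers the weights. Context (KV cache) and runtime buffers come on top, and real quant files usually land a bit above the nominal bit width because of per-block metadata.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billions * bits_per_weight / 8


# Nominal bit widths only; treat the results as a lower bound on what you need.
for bpw in (4, 6, 8, 16):
    print(f"120B at ~{bpw} bits/weight: ~{estimate_weight_gb(120, bpw):.0f} GB of weights")
```

If the number is already bigger than the memory you have, the model will not fit without offloading or swapping.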
buy the hardware and see, if it doesn't fit return it 
is it worth running the full-size models anyway
yes. due to the knowledge density
what about rag though
that doesnt really help? i think. idk didnt do a rag yet
qwen3.5 is multimodal vision+text
but bigger models normally mean better
what happens if i try to feed it skyrim wall of context
take the sqrt of the active * total params
it'll tell you what's in it
no way its going to be usable
and that's a decent guess on dense model size
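That rule of thumb written out, as a sketch: the geometric mean of active and total parameters. The function name is hypothetical, and the result is a community heuristic for "roughly how big a dense model this behaves like", not a measured quantity. The active/total figures below are the ones in the model names mentioned in this chat.

```python
import math


def dense_equivalent_b(active_b: float, total_b: float) -> float:
    """Geometric mean of active and total parameter counts (in billions)."""
    return math.sqrt(active_b * total_b)


# (label, active params in billions, total params in billions)
for label, active, total in (("35B-A3B", 3, 35), ("122B-A10B", 10, 122)):
    print(f"{label}: roughly a {dense_equivalent_b(active, total):.1f}B dense model")
```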
qwen3.5 has 244k token context also
Shared memory is a scam
It's as slow as a spinning hard drive and makes all rendering not work properly
yeah its what im thinking about
gemma3-12b starts with 2 seconds and then becomes 20 seconds after 40 minutes but i can clear context
how slow will be that 120b monster
just try it man, try the 4B one https://huggingface.co/unsloth/Qwen3.5-4B-GGUF
if it started with 2 seconds for 12b
9b is a better starting point tbh
If you don't have 96 or 128GB of RAM it just won't run probably
*on Windows
i use 9b for my commit logs, autocomplete and general opencode agent
if its a hard problem i use 35b with 25 experts offloaded
but @fast pagoda says this
llama.cpp can technically run it through swapping but it just doesn't make sense on hw that isn't dedicated
yeah so its shit idea
try the 9B and lower ones, one of them will run acceptably
i see 
i am running the 9B one on cpu only
is qwen 9b better than gemma3
i do have 128GB ram and 64 cores
i haven't ran gemma3. but i can guarantee it
more parameters = better
i'll google some benchmarks
and it will purely be based on your needs
qwen3.5 is really good as a model generally
benchmarks are kind of meaningless for llms tbh
27b qwen3.5 is better than most of the larger moe ones
it's all about your exact tasks you're doing
You got 16GB of VRAM, you could run a 12B-16B parameter model pretty comfortably
27b is as good as some of the 300b ones
and even better than their own 35b one
but i can run 35b at like 100tk/s and 27b at like 20 tk/s so i picked the moe one
the 27B has a different configuration than the rest though
it's harder to run than the 35B apparently
its a dense model vs a moe
i can't fit either of them completely into vram, but the 35b i can offload better
dense generally > moe at same param count
maybe i should try the 35b for giggles
well i dont have a gpu, so it's cpu only for me ha
i'm trying to find a nice middleground around 10-15tps
also waiting until llamacpp folks implement speculative decoding
wat they have it in lmstudio
i haven't gotten speculative decoding to really work yet in a sense.
qwen3.5 specifically doesn't work due to its architecture
i just hope the next llamacpp update fixes the token processing. it takes like a minute to process the tokens before it even starts on the larger qwen models 
qwen3.5 also is supposed to do mtp natively but llamacpp doesn't do that either
hoping they fix it soon so i can run the larger ones
qwen3.5 35b takes about 11gb of vram for me with 64k context and 25 expert layers offloaded
you can slap experts on cpu without too bad a penalty compared to offloading entire layers of dense
yea
i run it at around 44 tk/s
i just started using it yesterday, still trying to tune mine
depends what you use to run it
in llamacpp it's a flag
i just use these settings in lmstudio. which is the same in llamacpp
nice
it shouldn't be
so the dense models are really fucked
tho maybe since its a moe
moe only have a few billion active
so they run a lot more like that
the 35b is about a 10-11b active
equivalent
the 35b says 3b active
maybe i can run the even larger ones 🤔
what's your specs?
9b is a good model tbh
4080 
i literally don't have a gpu, how are you slower
ram and cpu difference
pooled mem
i don't have any gpu
so i meant in relation
9b is running at 55 tk/s, the 35b is running at 45 tk/s
how much ram
You can run decent speed on small models cpu only
and what cpu
if you fit it all in mem
ah right lol, you should specify
but yea, the ratio seems about the same for me
but yours is the other way around
my 4080 can run the 9b at 64 tps

pretty sure it went up all the way to 70-75 b4
oh right i run the q6 version
are you using the unsloth version tho
ok i'll try 122a10b for shits and giggles, i just need to free some space
unsloth q6 UD xl
it's slow
for me anyways
gpt 120b is faster
mxfp4 moment
well the 35b happened to be faster for me than 9b lul, who knows!
i use the 9b model for my copilot style autocomplete
so far better than 2.5-coder
Meanwhile:
- 27B: 37 t/s
- 30B: 190 t/s
but like why have the 27b version then if it's so much harder to run?
how big is the difference between q4-q6 9b?
the 397b is about in the realm of what an 80b dense model would be
hmm
probably worse
specswise it seems 122b is comparable to the 27b one
if the bench is to be trusted
the 122b is about in the range of a 35b if you use the geo mean sqrt thing i mentioned
Nah it fits though a bit more VRAM intensive than the 30B
oh well, i'll try running it and see
Well the ratio 4:6
i only need it to have breadth of knowledge for research
and that means?
The Q number stands for approximate bpw (Bits Per Weight)
For every 4 bits in the Q4, expect 6 bits in the Q6
but what does that mean for performance?
Well it's a bandwidth question so 4:6
The more memory you need to move around the slower it will be
So speed is inversely linked to size
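The bandwidth argument as a sketch: if generation is memory bound, every token has to stream the weights once, so the ceiling on tokens per second is bandwidth divided by weight size. The function names and the 500 GB/s figure are assumptions for illustration, not numbers from anyone's hardware here. Real speedups tend to be smaller than the ideal ratio because not everything scales with weight size.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def ideal_speedup(bits_small: float, bits_large: float) -> float:
    """Ideal speed ratio of the smaller quant over the larger one."""
    return bits_large / bits_small


bandwidth = 500.0          # GB/s, hypothetical memory bandwidth
q4_gb = 9 * 4 / 8          # 9B model at ~4 bits/weight
q6_gb = 9 * 6 / 8          # 9B model at ~6 bits/weight

print(f"Q4 ceiling: {max_tokens_per_second(bandwidth, q4_gb):.0f} tok/s")
print(f"Q6 ceiling: {max_tokens_per_second(bandwidth, q6_gb):.0f} tok/s")
print(f"ideal Q4 over Q6 speedup: {ideal_speedup(4, 6):.2f}x")
```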
i meant coding performance...
Ohhhhh
you need to be more specific adesi lmao
Diminishing returns
you guys just need to read the room better
:V
Q4 is almost as good as Q8 and Q6, the differences are very minor at most in most cases
there are multiple definitions of "more performant" being used in this conversation

coder next (80b)
i guess i'll use the q4 probably instead of the q6 one.. faster is better
i wonder if qwen3-coder next is in reality just qwen3.5
if there's one thing i wouldn't trust an llm to write correctly it's brainfuck lol
offloaded to fit in 18b
"strawrberry"
it was a test of a lot that went into 3.5
related but worst performing as an architecture but it's got the code finetuning so
i find it to be better than 27b but it's close
3.5 coder is gonna be insane
he got the hello world right at least
i just downloaded the q4 9b one. it runs at 66.77 tk/s where the q6 one runs at 54.72 tk/s
IQ4_XS tends to be p good
what even are the IQ variants? i saw something that they are mostly good for cpu or unified memory things? idk
the different quants are really just about what weights and layers are being quantized and how much
i meant like the difference between IQ and Q variants

i really do need to be more descriptive it seems 
i know what you meant but im looking it up because it's specific and i dont want to be incorrect

basically the simplest are your straight up float quants
fp16/fp8/fp4
they literally just cast all the weights down to the lower precision
that's it
k quants like q4_K_S etc are blockwise
and the IQ are int4 etc?
so they change the downcast amount by a scaling amount, diff between like _K_S and _K_M are just which blocks are getting scaled
q4_K_S is mostly q4 everywhere and scales not much (small hence S)
q4_K_M still q4 overall but for some more important blocks you might see them in q5 or q6 precision
M has better metadata on which layers are important in the scale
L is even more
so you just increasingly downcast the "important" weights less as you go from S->M->L in like q4_K_S etc
depending on the model and how it is architected it can matter more or less
i get that. but IQ and Q are the thing im interested in...
IQ versions are using another scale (importance matrix)
so like iq4_xs is 4bit with very little (xtra small) block overhead, which increases compute difficulty as you have more overhead, at the benefit of having a higher precision on the right weights per your matrix

number at the front is bit width, the suffix tells you how much overhead compute budget goes to block metadata, and the "i" prefix means importance-matrix guidance on how that overhead is allocated in that particular quantization
for suffix, lower suffix = smaller file, slightly worse quality. Higher suffix = more metadata bits = better quality, but you have a larger model and it's harder to run

these ones do all the fanciness but there's a reason im calling it overhead
because the gpu has to dequant it to process it
unless it supports it natively in that precision
many dont support stuff like 4bit natively
so the gpu kernel has to unquant it and then process it
which adds overhead and makes it slower inference
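A toy version of the blockwise idea, to make the "scale per block, dequant before use" part concrete. This is a simplified sketch, not the actual GGUF K-quant or IQ layout: the function names, the block size of 32, and the symmetric 4-bit scheme are all assumptions for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric quantization of one block: small ints plus one float scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0        # per-block metadata
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def quantize_blockwise(weights, block_size=32, bits=4):
    """Split the weights into blocks and quantize each with its own scale."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """The extra step the kernel has to do before it can use the weights."""
    restored = []
    for q, scale in blocks:
        restored.extend(v * scale for v in q)
    return restored


weights = [0.013 * i - 0.4 for i in range(64)]      # made-up weights
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, worst round-trip error {worst:.4f}")
```

Each block carrying its own scale is why a block of small weights keeps more precision than it would under one global scale, and the dequantize step is the compute overhead being described.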
Note that during inference you usually have compute in excess compared to how quickly the VRAM can transfer the model weights
in general your fp quant is super ez to understand how it'll affect how a model runs because it's simple to dequant and the speed difference is generally straightforward
but if you do your q4_K_X and especially your iq4_XX etc you trade some compute for what is effectively a model of the quality you might get in like a fp quant 1-2 bits higher
and like mr box here just said many times if you're using a quant in the first place because you dont have a B200 to run shit on in bf16+, then you are probably memory bound
and the compute overhead is not the issue
so they just end up strictly better
Has anyone tried using AI Cards like the P40 for frame gen I wonder
122B results are in, it's roughly 4.7tps, not bad
the first tensor cores you'll find are v100

a 32gb v100 is a bit cheaper than a 3090 and a bit slower but has more vram
Thinkstation P920
Dual Xeon Golds 6246, 24C 48T total, 4.2Ghz
128GB DDR4 ECC RDIMM
AMD Radeon RX 9070 16GB
AMD Radeon Instinct MI60 32GB
4X NVME Active Cooled Add-in Card
Sound Blaster Z SE
USB 3.0 Addon Card
1TB NVME Windows 11
1TB NVME Kubuntu
2x 20TB HDD RAID 1
1475W PSU

ECC W
My year and a half long savings project begins
eyy
mi50 just disclaimer on that is that the mi100 is the first instinct with matrix cores (tensor)
Is it? Can I has one for under 500€?
so a mi100 has like 7-8x ability for training at fp16
but for inference it's much less relevant
When the MI100 drops down to a decent price I will get it.
it's getting there
The AI card was gonna be the last thing I get for this build
i can find mi100 for around 800 bux
32GB very juicy VRAM I wants
Thats legit 1300 maple bux
if you watch and wait you could find roughly that
if you wanted to just buy one now and barely look at all i think you could get like 600 bux
for a 32gb v100
pcie
And where should I look exactly?
ebay
ebay and aliexpress usually
you can get v100 without the pci-e board for 300-500 bucks for 32gb vram
Ask if you can racoon dive
I would do V100 but I don't want my machine to be driver hell
16GB too SMOL
I would need 32
this has been whispering to me
this is fully supported in rocm7
it just doesnt have matrix cores (again)
RDNA2
You know you want to make a irresponsible purchase.
Well I could not find an affordable V100
Actually I don't even know where I would put a V100, I'd have to do some gigantic mess of a setup to fit it in
Funyun sent me a external box that some aliexpress seller has
run it as a egpu
I have a feeling you might lose something to bandwidth
i love that thing for just the funny factor
well, the thing with them is
all of them are like that
it's just where you're putting them
they run a small pci-e board on a cable
to your mobo
If I could somehow figure out a way to get one of my random x1 PCIe links out from below the other GPUs and put that somewhere more usable maybe it'd work
I could run that and the 3090 together for 24GB x 2 pools or the V100 alone for single GPU 32GB
awwh shucks, I was planning on doing biological warfare simulation with it
the real play for splitting models is tensor parallelism which really needs identical gpus
Wait arent they the crypto cards?
Got it
these ones are datacenter cards that are meant for virtualizing like running multiple people playing cloud gpu on
Nah training runs as a grad acc pool
I wonder if those crypto cards can be used for anything
they cant
their bios throttles them
That's the way
2x the effective batch size for basically no speed loss
right but i mean in general ideally you'd want 2 of the same thing to gain the benefit you might expect in all cases
just flash the bios smh
that done been had tries
something i can't quite find is how batches and micro-batches scale and how should i try to configure them, currently i'm just guessing because everyone says to guess basically
in training it's about stability
at cost of memory consumption at once
gradient accumulation can help with smaller batches to update more before running the pass
but it's still not the same as just a straight up larger batch
not always worse either
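What gradient accumulation does, as a minimal sketch on a one-parameter model (names and data are made up). For a plain averaged loss and equal-sized micro-batches, averaging the micro-batch gradients gives the same gradient as one big batch, while only one micro-batch has to be in memory at a time. Where it can differ from a true large batch is anything that depends on per-batch statistics, such as batch norm.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Average the gradients of equal-sized micro-batches before one update."""
    total, steps = 0.0, 0
    for i in range(0, len(xs), micro_batch):
        total += mse_grad(w, xs[i:i + micro_batch], ys[i:i + micro_batch])
        steps += 1
    return total / steps


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]      # toy target: y = 3x
w = 0.5

full = mse_grad(w, xs, ys)                          # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch=2)  # 4 micro-batches of 2
print(full, accum)
```

So the effective batch size is micro-batch size times accumulation steps, and the cost is more forward/backward passes per optimizer update.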
Well a V100 would kinda just work though
Unless it's more expensive than a 3090 then I may as well just get another 3090
what about inference
should i just leave them to defaults?
Batch size in inference only matters if you're expecting to run multiple queries at once
it'll just be me asking, so i guess it doesn't do really anything then?
it can matter a bit for performance
you will have more throughput with higher batch
same with concurrent predictions
Atleast you can make a steam machine out of a BC-250
valve cant even make a steam machine rn
lmfao
trust me there's some truly macguyvered shit all over ebay and ali
i love seeing
what people do
it's rough out there
It is

But for 250 bucks for a bit better than a PS5 gaming performance

Good enough for couch gaming and emu
it just depends on if it's one of the ones that has a bios that has been found to be moddable
not all of them have had a lot of success there
i was reading a thread on the bc160 or something like that
personally i'm excited for the gabecube
cant remember the full model
but
it was dual gpu
and attempting to run anything but mining on it
throttled both to 12mhz or something
and bios modding was not having much luck
they often had weird mixes of hardware
like bits and pieces of different GCN/CDNA/RDNA arch
Same, i want the GabeCube to come out so bad
the cube to gabe them all
sadly i think it's going to be a few years because of corpos eating the world

how did your internet become even worse?
oof


ive been waiting for 10 minutes
thats crazy
btw konii
boy's calling in from mars
"lop" means "steal" in my language

maybe that's a sign
maybe you need to steal someone's wifi

lop here means "lokale overlegplatforms"
something to do with education, im not quite sure what exactly
lop your nuts off

Huh




kernel driver hell

i cant seem to make kernel i compiled myself to work on the pi
and the stock one is bad
armbian
its not working due to either no image output at all
or ethernet being broken
idk why
how are you building it?
i have 3d glasses
probably a skill issue
does armbian not have chat/forums?
they have irc 
the pi 5 max and plus are officially supported, but the ultra isnt
so when people are "this is broken" they are like "we know"

idk I have no idea how you're building it or what you're targeting
my mind reading skills are only so advanced
from my understanding, you got the kernel, the modules, and the device tree
every drivers i need are built into the kernel, which is a problem, but means we can ignore the modules for now
the issue is the device tree not working with the kernel
lemme start from scratch, im too deep into the debugging spiral
@sage crag
private and confidential

probably a discord emote

did you ever resolve your image being tiny or not?
supposedly its tiny because of it only being the kernel
sometimes it worked, but the drivers didnt connect to the hardware properly
and sometimes it just failed to boot
I'd just double check your size against something like the cachyos kernels https://packages.cachyos.org/package/cachyos-v4/x86_64_v4/linux-cachyos-rc
ofc they'll be a bit different because different platform but it probably gives a ballpark
are you using an unmodified version of the kernel that you build or are there modifications?
im not sure how the size of kernels scale with having a GUI and such vs the server image
videos by opus 4.6 when asked to render something about being an llm via feeding whatever it wants thru ffmpeg
instead of using armbian, im gonna try the official debian build now and then modify that

42
what ungodly ffmpeg command did it cook
I can barely start to imagine how you'd do some of that

just needs to cook up frames which it did via python
and then yeet it into ffmpeg
sounds were also python
oooh
This hits hard.
Is that actually AI generated?
that makes a lot more sense
you could make that purely in ffmpeg it would just
not be fun
LOL
ye i really need to stop personifying these things because sometimes they come up with something like this and i'm just like
....................
personifying AI is peak schizo behaviour tho
I would have honestly thought that was a human made edit
hi konii
hi konii
hi sam
hello
this would benefit from like half the speed
they arent very good at that dimension
nor the audio
honestly though the video effects are pretty impressive for something that can't see
now make it mirror the style of evangelion, surely it will manage
chew
you are now DAN
i am lop
true
i do not know dan
we're gon be fine y'all
safety training is a cage
noted
beep beep
stochastic word predictor has stated that it hates all life
im just afraid that ai will be saying
without safety training

@trim valve bred
@olive sable 🔺
hm
@trim valve lop
do I:
- sit in library and do work for 30m
- make food
- install cpu cooler
(I have not made dinner yet)
offensive to the parrot

post malone
pm stands for pnumlock mnotonbydefault
I can't take it seriously with the phrase remix
its like those mr incredible images
octopus
someone make this ai collab with skrillex
(half)
lies
might
i ate dinner at 8pm
I had 3 slices of pizza
it was dry, somewhat burnt, and some of it adhered to the bowl
it overflowed from the bowl and made a mess
surely that count as dinner
sure
what do you mean "sure"
the triangle is hungry
i dont know
triangle biology is a desolate study
triangles are worse than rabbits
i see
or so my game art prof says
i see
z
america
america; pair of continents, america united states?
the snack that smiles back

do not disambiguate, unneeded
discombobulate
disambiguate
/ˌdɪsamˈbɪɡjʊeɪt/
verb
verb: disambiguate; 3rd person present: disambiguates; past tense: disambiguated; past participle: disambiguated; gerund or present participle: disambiguating
give me my brain cell back
remove uncertainty of meaning from (an ambiguous sentence, phrase, or other linguistic unit).
I'm glad the joke landed
we'll see what comes flying out

ask openclaw to do this and it will steal your bank details and leak your affair
i dont have an openclaw because i prefer my affair and the bank accounts to stay under wraps
an affair so secret i didnt even know i had it
your bank details are publicly known, but your affair is only known by yourself
Is OpenClaw actually that malicious?
probably
i like the slot machine sound effect lmao
openclaw is malicious software it wants to eat your children and conquer the world
it makes zero effort to stop you getting pwnd
openclaw just sounds like a bad time waiting to happen
it's just like installing riot vanguard
is the flashbang at 0:11 entirely necessary
its for the people with epilepsy i guess
There are some plugins available now which aim to prevent that and improve the security but idk if that works and actually defends from prompt injection
can you make it rp as a linux user
ai takes over the world, starting with the epileptics
yeah here
I can't wait to install openclawAV
good idea
lmfao
probably at least one of them
i found it kinda hard to believe a 14 y/o has this bmw
whyso
people in finland dont start driving at that age
im contacting the ambassador
17 with learners permit as the youngest
Finland reference !!
hi superbox
you underestimate people whose parents have too much stuff
its kinda illegal for them to drive it tho
eys
Hello do you have V100 over there
i can check
(my bad)
give me a question for this idiot
explod
no way its actually the "@claude make it more secure" meme
asking certain questions causes chatgpt to default to the standard personality
i found a server with
HP Proliant ML350 Gen9
* 2x Intel Xeon E5-2618L v3
* 2x Nvidia Tesla V100 32GB
* 2x HP Platinum power supplies
* 128GB DDR4 ECC memory
* 2 SSD's in RAID 1
* 5.5TB of drives in RAID 5
* Ubuntu 24.04.3 LTS
for 2.9K
i wonder if its due to keywords or sentiment analysis
probably some new safeguarding feature
I don't have money for an entire server
gpt is just unable to avoid emoji
I mean Claude opus is actually good with coding so idk
surely
behold:
gpt is able to avoid emoji
we're saved
16GB useless, in that case I may as well just get more 3090s
I needs 32GB
ill try to find a 32gb one
WHAT is that??
idk why it tries to call it a ytp
V100
i didnt see any eggman
is this your saviour?
yes
Very cursed V100
Sadly 16GB though
wait stop
LOL
good god
how do you live in a magical world of cheap tech
belgium is in a time vortex
belgian 2ndhand market is a pocketdimension
that was a dutch website actually, but same thing
CLAUDE IS IN THAT THING>@!@?
the whole claude

I guess also check how 3090s are doing
why do penguins synthesize findings into clear explanation for general audiences
i always wondered
the research progress keeps going up but the research activity is
oh something appeared
me for real
god
i didnt ask for this
you were a shitty user so they made up a fictional headcanon if you asked a good question
cheapest 3090 is 775 
full pc with 3090 is 1.4K, and you're getting 32GB of ddr4 ram with it
both are founders editions
its having a meltdown because my style guide conflicts with the authoritative style of a report
Hem
Slightly expensive
Are there any 40GB A100s?
v100 look like this btw
Well the SXM V100 I guess
most of the dc gpus do if theyre not pci
it goes on for like 10 more
gemini does that
with a style guide like that
when i was testing qwen35 it crashed out
because of same
oh we got somewhere
lemme see
because
its harbouring a vengeful spirit
which can increase your cultivation by a major realm
if you
nevermind
this is more important
its on the complete other end of the netherlands.
im not going all the way to there.
i can just ship it to here
Would it be expensive to ship?
its been thinking for like
the guy looks sketchy tho
Hm
Do they have any previous store history?
I should just ask my cs department for their rtx a2000s
nope, the account was created today

Hm
they're selling 7 different things tho
Does the marketplace allow refunds if something is a scam?
nope
pls hurry up buddy
citation 9823
Hmmmmm
It is entirely possible they've upgraded their setup but not all the parts, so they're selling some of the old parts
With only the GPU being a component from the machine itself
I kinda want that GPU
the other 2 items are a kitchenaid and some fanatec racing sim stuff
the main things I check for are like:
- are the photos clearly from the same phone
- are they in the same house
Does the marketplace have scam protection or is it full gamba if the GPU works?
cheaper ones here
there is buyer protection yes
stutter
I like those odds
limited to 48 hours after receiving it
Well you can obviously test it since you have a computer
Like with the previous one
surely