How to add more RAM to the Pod | Runpod | Page 1

stray glacier Jul 21, 2025, 7:32 AM

#

Hi! Is there a way to set RAM manually? I mean, the current options are limited, for example to a max of 283 GB RAM, but I need 1 TB. How do I add more RAM?

Thank you.

half rapids Jul 21, 2025, 8:47 AM

#

use clusters with more GPUs

#

or cpu pods, with larger ram

stray glacier Jul 21, 2025, 11:05 AM

#

half rapids use clusters with more GPUs

Is this what you mean? It's empty, I don't know why..

CPU pods are limited to about 250 GB RAM.

half rapids Jul 21, 2025, 11:46 AM

#

they have larger vram because they have more gpu's yes

#

(in one cluster)

#

btw clusters are expensive tho

#

nvm, you can go here (see my screenshot below), increase the number of GPUs, and still get 1 TB of RAM

#

These setups allow you to get 1 TB of RAM

#

these are in pods

stray glacier Jul 21, 2025, 2:47 PM

#

half rapids These setups allows you to get 1tb of ram

Ah okayy, finally understand the parameter for it. Thank you so much..

So.... with about $7/hour, I can try using Kimi K2 for myself? hahahaha. And try the quantized version at a much cheaper price? :D:D

Anyway thank youu

half rapids Jul 21, 2025, 2:50 PM

#

Hmm or use the amd ones

#

Try APIs if you don't wanna self-host just for yourself (if it's cheaper that way, why not)

stray glacier Jul 22, 2025, 3:43 AM

#

half rapids Try apis if you don't wanna self host just for your self (if it's cheaper that w...

apis? Do you mean like OpenRouter, Chutes?

half rapids Jul 22, 2025, 4:23 AM

#

Ye, OpenRouter, Together AI

#

Unless you need high throughput, that's where cloud GPUs are efficient

stray glacier Jul 22, 2025, 11:10 AM

#

half rapids Unless you need high throughput, that is effecient on cloud gpus

I see... I thought of using Runpod because I hit the rate limit on the free one. And if I'm paying, I wonder whether an assumed 1 hour of Kimi usage on OpenRouter would be cheaper than 1 hour on Runpod, weighed against speed. Basically it comes back to price vs. value.. If Runpod is cheaper for 1 hour of full usage then it's really worth it (based on my tests over a couple of days and for my use cases).

What do you think?

#

If you say OpenRouter will be cheaper then I won't bother trying, since it's also kind of a lot of work to make the template for the GGUF one first. << Never tried this. I also want to know if the GGUF one (from Unsloth) is good enough for Roo Code usage.

half rapids Jul 22, 2025, 11:27 AM

#

Hmm depends on what you use

#

How much tokens estimate

#

Most likely, if you're alone, using the full Kimi K2 is more expensive on Runpod or wherever it is, paying more than .8 per hr

stray glacier Jul 22, 2025, 11:42 AM

#

half rapids How much tokens estimate

wow... This is a really hard question hahaahah. Honestly I don't know, but I assume it's actually a lot, because when I tried using Gemini Pro it took maybe about 80M tokens within 3 hours, so it's around 27M tokens per hour or more depending on the task.

I see there are cases that don't use many tokens, but there are others that use so many tokens that I didn't even expect it was possible.

#

Based on your comment, I think it's wise for me to just try Runpod for a full paid hour to see how much it costs (I didn't want to do this at first). hahahah. But still, on OpenRouter the total token count matters a lot, while on Runpod it doesn't, so that's also something to consider. If we're using the maximum tokens for a full hour, of course Runpod will be cheaper, right?
I need maybe 50t/s, since that's the speed I got when using Roo Code (based on what the Chutes provider reports on OpenRouter)
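The per-token vs per-hour comparison asked about here comes down to a break-even calculation. The following is only a rough sketch: the function names are made up for illustration, the API price is a hypothetical placeholder (not a real OpenRouter or Runpod quote), and a single blended price per million tokens is a simplification, since real APIs usually bill input and output tokens differently. It also does not check whether one pod could actually process that many tokens in an hour.

```python
def api_cost_per_hour(tokens_per_hour: float, usd_per_million_tokens: float) -> float:
    """Cost of one hour of API usage at a flat (blended) per-token price."""
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_hour(pod_usd_per_hour: float, usd_per_million_tokens: float) -> float:
    """Tokens per hour above which a flat-rate pod is cheaper than the API."""
    return pod_usd_per_hour / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    tokens_per_hour = 80_000_000 / 3   # usage estimate quoted in the thread
    pod_price = 7.00                   # "$7/hour" figure mentioned in the thread
    api_price = 0.50                   # HYPOTHETICAL blended $/1M tokens

    print(f"API cost/hour:  ${api_cost_per_hour(tokens_per_hour, api_price):.2f}")
    print(f"Break-even at:  {break_even_tokens_per_hour(pod_price, api_price):,.0f} tokens/hour")
```

Swapping in real prices from the provider pages gives a first estimate before spending anything on a pod.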

half rapids Jul 22, 2025, 11:54 AM

#

stray glacier Based on your comment, I think it's wise for me to just try using runpod for ful...

Runpod costs can be estimated if you're using pods, even without running one

#

Just multiply (the price per hour + the cost of your storage per hour) by your hours of usage
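That formula can be written out as a tiny helper. This is a minimal sketch: the function name and the example prices are hypothetical and only illustrate the arithmetic; actual rates come from the pod's pricing page.

```python
def estimate_pod_cost(gpu_usd_per_hour: float, storage_usd_per_hour: float, hours: float) -> float:
    """(price per hour + storage cost per hour) * hours of usage."""
    return (gpu_usd_per_hour + storage_usd_per_hour) * hours


if __name__ == "__main__":
    # HYPOTHETICAL prices: $1.50/hr for the pod, $0.02/hr for storage, 3 hours of use.
    print(f"${estimate_pod_cost(1.50, 0.02, 3):.2f}")
```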

half rapids Jul 22, 2025, 11:55 AM

#

stray glacier wow... This is really a hard question hahaahah. Honestly I don't know, but I ass...

Wow that's quite heavy use

stray glacier Jul 22, 2025, 11:56 AM

#

half rapids Wow that's quite heavy use

I was also shocked by it.. I didn't know how Gemini CLI worked at that time, but it really burns through tokens.. hahaha.

half rapids Jul 22, 2025, 11:56 AM

#

stray glacier Based on your comment, I think it's wise for me to just try using runpod for ful...

Yeah sure try the perf on runpod

#

Oh gemini cli? Maybe it's agentic use

#

It explores many files, thinking

#

Tool calling

#

Not sure if kimi k2 can support cli like use

#

If it's not fine tuned for it

stray glacier Jul 22, 2025, 11:59 AM

#

half rapids Not sure if kimi k2 can support cli like use

No no.. after I got billed about $35 within 4 days, I tried Roo Code right away with the free tier, and honestly it's good. But for about the last 2 days, somehow I've been rate limited a lot (maybe Chutes is doing the limiting, since OpenRouter says 1000 requests / day, so it should be okay, but no).

That's why, if I must pay, I want it to have no restrictions and of course low cost, since it's all for my personal research.

half rapids Jul 22, 2025, 12:01 PM

#

ahh ic

#

How much are you willing to pay?

#

my friend said he managed to run it in 4x or 5x mi300x (its low stock currently) with 4k context 50t/s, usually you'd want a higher context tho
if you're okay with ram offloading ig it fits with other gpus

stray glacier Jul 22, 2025, 12:05 PM

#

half rapids How much are you willing to pay?

This is also hard to say, since it's a relative kind of question. Like if it's 50t/s, I feel that's already pretty good, and I'm okay with paying $1.5/hour (Note: I'm not some crazy person who leaves this on 24 hours HAHAHA. I only use it when I'm at the computer, so the maximum is 3 hours a day). If it's 50t/s with 64K tokens, I feel that's enough. Even 30t/s is okay for me.

So let's say I can get it at a much, much lower rate, then I'm also okay with 15t/s 🤣. Sorry to be such a cheapskate, but I'm not rich enough to burn money.

Is it too much to ask?

half rapids Jul 22, 2025, 12:06 PM

#

woah

#

1.5$ an hour?

elfin tiger Jul 22, 2025, 12:06 PM

#

stray glacier This is also hard to say since this is relative type question. Like if it's 50t/...

When I said 50t/s I meant prompt processing

stray glacier Jul 22, 2025, 12:07 PM

#

half rapids my friend said he managed to run it in 4x or 5x mi300x (its low stock currently)...

Depends on t/s. If it's acceptable, of course I'm okay with offloading to RAM. On my local PC I tried it with offloading, but I can only run up to a 14B model at Q4, and with high context I already got very poor speed.

half rapids Jul 22, 2025, 12:07 PM

#

so to run kimi you need around 1tb+ of mixed ram and vram?

#

is that correct

elfin tiger Jul 22, 2025, 12:08 PM

#

If you load a full 64K prompt at 50t/s pp it would take 21 mins

#

But currently testing how fast it is on nvidia, the 50 was on AMD
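The 21-minute figure follows directly from dividing the prompt length by the prompt-processing (pp) speed. A minimal sketch of that arithmetic, with a made-up function name:

```python
def prompt_processing_minutes(prompt_tokens: int, pp_tokens_per_second: float) -> float:
    """Time to ingest a prompt before the first generated token, in minutes."""
    return prompt_tokens / pp_tokens_per_second / 60


if __name__ == "__main__":
    # 64K-token prompt at 50 t/s prompt processing, as quoted in the thread.
    print(round(prompt_processing_minutes(64_000, 50), 1))  # 21.3
```

Generation speed (the 30t/s mentioned later in the thread) is a separate number; for agentic tools that resend large prompts, the pp time tends to dominate.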

stray glacier Jul 22, 2025, 12:08 PM

#

half rapids so to run kimi you need around 1tb+ of mixed ram and vram?

yeah hahahaah! That's why I want to try the quantized one by unsloth team ( the 280gb size one ).

half rapids Jul 22, 2025, 12:09 PM

#

i guess $1.5 is not going to run it (the full one)

elfin tiger Jul 22, 2025, 12:09 PM

#

stray glacier yeah hahahaah! That's why I want to try the quantized one by unsloth team ( the ...

Which quant is that?

half rapids Jul 22, 2025, 12:10 PM

#

#

alright lets see

stray glacier Jul 22, 2025, 12:10 PM

#

elfin tiger Which quant is that?

https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF

The IQ1_S one. But also want to try 2-bit one, since it should be much better somehow with only a little differences.

unsloth/Kimi-K2-Instruct-GGUF · Hugging Face

elfin tiger Jul 22, 2025, 12:10 PM

#

I doubt those will be coherent

#

You'd be better off with a smaller model

#

Either way my template is https://koboldai.org/runpodcpp it can run these

#

Just link to the 00001-of file and it fetches the rest

stray glacier Jul 22, 2025, 12:11 PM

#

half rapids i guess $1.5 nots going to run it (the full one)

Yeah, for the full one, I already gave up when I saw the Runpod price to achieve that. Anyway, it's not impossible that I'd use it, but unlikely. If the quantized one is already good enough, I won't use the full one at all.

half rapids Jul 22, 2025, 12:12 PM

#

iQ1_S would run on 4x a5000 ig

half rapids Jul 22, 2025, 12:12 PM

#

elfin tiger I doubt those will be coherent

other models?

elfin tiger Jul 22, 2025, 12:12 PM

#

1-bit tends to be incoherent

half rapids Jul 22, 2025, 12:12 PM

#

i see

elfin tiger Jul 22, 2025, 12:12 PM

#

Wouldn't be surprised if a good 100B beats it in 4-bit

half rapids Jul 22, 2025, 12:13 PM

#

search for agentic coding models in reddit lol

elfin tiger Jul 22, 2025, 12:13 PM

#

Hope my 32K context fits

half rapids Jul 22, 2025, 12:14 PM

#

elfin tiger Hope my 32K context fits

Fits in what gpu

elfin tiger Jul 22, 2025, 12:14 PM

#

half rapids iQ1_S would run on 4x a5000 ig

Probably very slow

elfin tiger Jul 22, 2025, 12:14 PM

#

half rapids Fits in what gpu

4xB200

half rapids Jul 22, 2025, 12:14 PM

#

welp

stray glacier Jul 22, 2025, 12:14 PM

#

half rapids search for agentic coding models in reddit lol

hahaha. It's not all about agentic though... hmm. I tried SmolLM3, and it's really good at tool calling etc, it can run Roo Code somehow, but..... it's not good enough for coding @_@.

elfin tiger Jul 22, 2025, 12:14 PM

#

I am testing it with Q4 though

half rapids Jul 22, 2025, 12:14 PM

#

stray glacier hahaha. It's not all about agentic though... hmm. I try smolLM3 , and it's reall...

try deepseek distilled models

elfin tiger Jul 22, 2025, 12:14 PM

#

I wanna see what the speed is on a beefy nvidia

half rapids Jul 22, 2025, 12:14 PM

#

like to llama or qwen

#

or qwen models

elfin tiger Jul 22, 2025, 12:15 PM

#

half rapids try deepseek distilled models

My community doesn't like the distils

half rapids Jul 22, 2025, 12:15 PM

#

idk honestly

half rapids Jul 22, 2025, 12:15 PM

#

elfin tiger My community doesn't like the distils

for coding? what do you think

elfin tiger Jul 22, 2025, 12:15 PM

#

For code GLM / Devstral maybe?

half rapids Jul 22, 2025, 12:15 PM

#

is glm new, its not on artificial analysis yet hhh

stray glacier Jul 22, 2025, 12:16 PM

#

half rapids try deepseek distilled models

Already tried... It's not good enough. So far I've tried many kinds of DeepSeek & R1, Devstral, Kimi, Qwen. And nothing beats Kimi K2 so far... It's really far ahead of the others. I think I've only felt performance like that on Gemini 2.5 Pro; never tried Claude though (of course Gemini is better, but Kimi is cheaper and can be free with rate limits 😄).

half rapids Jul 22, 2025, 12:17 PM

#

devstral seems interesting

stray glacier Jul 22, 2025, 12:17 PM

#

elfin tiger For code GLM / Devstral maybe?

Devstral ALMOST GOOD... But not enough

#

Already try it too :D:D

half rapids Jul 22, 2025, 12:17 PM

#

💀 i guess dont compare gemini pro here

#

its like comparing with o3 too

elfin tiger Jul 22, 2025, 12:17 PM

#

You wont be able to beat Kimi (Maybe full sized deepseek gets close) but the smaller ones are way cheaper

stray glacier Jul 22, 2025, 12:17 PM

#

Btw, @half rapids you told me about 4x A5000 << A5000 is the previous generation, is its performance still good though? If yes, it's quite cheap @_@...

stray glacier Jul 22, 2025, 12:18 PM

#

half rapids 💀 i guess dont compare gemini pro here

yeah yeah ahahahah I'm sorry hahahaha

half rapids Jul 22, 2025, 12:18 PM

#

np

elfin tiger Jul 22, 2025, 12:18 PM

#

stray glacier Btw, @half rapids you tell me about 4x A5000 << A5000 is the prev gener...

For the big models no

#

Like if I find the MI300X too slow for Kimi, an A5000 certainly will be slow

#

I do have to test the Mi300x again after AMD's PR lands

stray glacier Jul 22, 2025, 12:20 PM

#

I see...........

half rapids Jul 22, 2025, 12:20 PM

#

elfin tiger I do have to test the Mi300x again after AMD's PR lands

what? pr

stray glacier Jul 22, 2025, 12:21 PM

#

elfin tiger Like if I find the Mi300x to slow for kimi an A5000 certainly will be slow

What speed did you find too slow, btw?

elfin tiger Jul 22, 2025, 12:21 PM

#

50 tokens per second prompt processing

#

Token generation speed was good with 30t/s

#

But the 50t/s pp is very slow

#

Feels like a beefy CPU

#

But its just a massive model

#

During generation its a 32B, but during processing its the full 1T

half rapids Jul 22, 2025, 12:24 PM

#

elfin tiger During generation its a 32B, but during processing its the full 1T

ahh thats why its slow

elfin tiger Jul 22, 2025, 12:24 PM

#

Yup

half rapids Jul 22, 2025, 12:24 PM

#

hows the download

#

on b200s

stray glacier Jul 22, 2025, 12:24 PM

#

elfin tiger During generation its a 32B, but during processing its the full 1T

I see...........

elfin tiger Jul 22, 2025, 12:24 PM

#

half rapids Jul 22, 2025, 12:24 PM

#

nice, last file

elfin tiger Jul 22, 2025, 12:25 PM

#

Expensive benchmark though, I really hope this 32K fits

#

Not gonna download it a second time, its $10 for the download in this config

half rapids Jul 22, 2025, 12:26 PM

#

🤣

#

damn

#

ima try roo code next week

stray glacier Jul 22, 2025, 12:26 PM

#

elfin tiger Not gonna download it a second time, its $10 for the download in this config

crazyyyyyyyyyy hahahaa. And here I just said I accept $1.5/hour.. lol

elfin tiger Jul 22, 2025, 12:27 PM

#

I am using 4xB200 because I want to speed test the best case scenario

#

Its $23 an hour

stray glacier Jul 22, 2025, 12:27 PM

#

half rapids ima try roo code next week

So far, if there's no problem on OpenRouter's side, Roo Code + Kimi K2 is really more than enough. << But I always use Test Driven Development to make sure the development is high quality. This may be why the token usage is so high. lol

stray glacier Jul 22, 2025, 12:28 PM

#

elfin tiger Its $23 an hour

I want to cry.... hahahahaahha.

half rapids Jul 22, 2025, 12:28 PM

#

i dont get it how test driven development makes more token usage

#

i see, openrouter has one provider for kimi k2 that's free right?

#

or not

elfin tiger Jul 22, 2025, 12:28 PM

#

Best case scenario, 3 times faster than MI300x

half rapids Jul 22, 2025, 12:29 PM

#

https://openrouter.ai/qwen/qwen3-235b-a22b-07-25:free
nvm this is free

Qwen3 235B A22B 2507 (free) - API, Providers, Stats

Qwen3-235B-A22B-Instruct-2507 is a multilingual, instruction-tuned mixture-of-experts language model based on the Qwen3-235B architecture, with 22B active parameters per forward pass. It is optimized for general-purpose text generation, including instruction following, logical reasoning, math, code, and tool usage. Run Qwen3 235B A22B 2507 (free...

stray glacier Jul 22, 2025, 12:29 PM

#

half rapids i dont get it how test driven development makes more token usage

Because it needs to start from zero code and a dummy test. Then it runs the test, and it fails.

After that, it creates 1 basic test, then runs the tests again << fails again. Then it writes the simplest code possible to pass that test. Then it tests again << if it fails, it fixes the code until it works; if it works, it goes to the next cycle.

half rapids Jul 22, 2025, 12:30 PM

#

i see, aslong the code works right

stray glacier Jul 22, 2025, 12:30 PM

#

Basically, I'm not letting the model do complex coding before it's time, so it doesn't need super high intelligence.

stray glacier Jul 22, 2025, 12:30 PM

#

half rapids i see, aslong the code works right

yeah.. So the requests to the API are about 3 times more than usual at minimum, I think.

elfin tiger Jul 22, 2025, 12:30 PM

#

half rapids https://openrouter.ai/qwen/qwen3-235b-a22b-07-25:free nvm this is free

They have a free kimi

half rapids Jul 22, 2025, 12:30 PM

#

oh rlly

#

ah i see it

stray glacier Jul 22, 2025, 12:31 PM

#

elfin tiger Best case scenario, 3 times faster than MI300x

Even with that beast, it's only getting 27t/s?? hahahahaah

elfin tiger Jul 22, 2025, 12:31 PM

#

Its a big model

half rapids Jul 22, 2025, 12:32 PM

#

~2x more expensive on B200s

elfin tiger Jul 22, 2025, 12:32 PM

#

I don't know why I cap out at 30t/s on all of them

stray glacier Jul 22, 2025, 12:32 PM

#

elfin tiger They have a free kimi

yeah, this free Kimi is the one that I use.

elfin tiger Jul 22, 2025, 12:32 PM

#

Interesting the PP sped up a bit

half rapids Jul 22, 2025, 12:32 PM

#

elfin tiger They have a free kimi

they have no limits on that?

elfin tiger Jul 22, 2025, 12:32 PM

#

half rapids they have no limits on that?

Probably do

half rapids Jul 22, 2025, 12:33 PM

#

did you use same cache? or maybe its just batching

elfin tiger Jul 22, 2025, 12:34 PM

#

half rapids did you use same cache? or maybe its just batching

There is no batching

#

For batching you gonna be using stuff like vllm not koboldcpp

half rapids Jul 22, 2025, 12:35 PM

#

oh

#

wonder if vllm's gonna achieve better perf

elfin tiger Jul 22, 2025, 12:36 PM

#

Probably, this isn't exactly what llamacpp/koboldcpp was designed to do lol

#

But it did work, so thats something 😄

#

And on the B200 at 32K single user it was usable for me

#

2 mins to build the cache

stray glacier Jul 22, 2025, 12:40 PM

#

Hmm. If I want to try using vLLM + a GGUF model, I need to make a custom Docker image, right? Because I need to add a script to download the model first, then run the start command that loads the downloaded model. Am I right?

elfin tiger Jul 22, 2025, 12:40 PM

#

vLLM isn't that good at gguf

half rapids Jul 22, 2025, 12:40 PM

#

idk

stray glacier Jul 22, 2025, 12:41 PM

#

elfin tiger vLLM isn't that good at gguf

So, what should I use? Normal llama.cpp?

elfin tiger Jul 22, 2025, 12:41 PM

#

For gguf you can use the koboldcpp template

half rapids Jul 22, 2025, 12:41 PM

#

stray glacier Hmm. If I want to try using vLLM + gguf model, I need to make custom docker righ...

You can just launch one that downloads automatically, use vllm docker image

elfin tiger Jul 22, 2025, 12:41 PM

#

Extremely easy to use just don't expect kimi to be economical

stray glacier Jul 22, 2025, 12:42 PM

#

half rapids You can just launch one that downloads automatically, use vllm docker image

By the way, this one can download the model automatically if it's not GGUF.. << I just read the documentation this afternoon.

elfin tiger Jul 22, 2025, 12:42 PM

#

stray glacier Jul 22, 2025, 12:42 PM

#

elfin tiger Extremely easy to use just don't expect kimi to be economical

🤔 wait, let me check, what it is about.. I never read about koboldcpp..

elfin tiger Jul 22, 2025, 12:43 PM

#

stray glacier 🤔 wait, let me check, what it is about.. I never read about koboldcpp..

We're based on llamacpp

#

Comes with an optional bundled UI and a super easy runpod template

stray glacier Jul 22, 2025, 12:43 PM

#

elfin tiger

Yeah, I already read this.. That's a good point, I haven't checked whether the Kimi GGUF download is 1 file or multiple files. lol

elfin tiger Jul 22, 2025, 12:44 PM

#

stray glacier Yeah, I already read this.. That's a good point, I haven't check the kimi gguf d...

It's 13 files in the Q4

#

HF's upload limit is 50GB

stray glacier Jul 22, 2025, 12:44 PM

#

I see......

stray glacier Jul 22, 2025, 12:44 PM

#

elfin tiger Were based on llamacpp

Ah okay, will check it out..

elfin tiger Jul 22, 2025, 12:44 PM

#

With KoboldCpp you give the first link and it automatically finds the others if you use a split model

#

https://get.runpod.io/koboldcpp customize before deploying

#

Has built in download acceleration, no manual steps once deployed

#

And we have official support for runpod

half rapids Jul 22, 2025, 12:46 PM

#

i see

half rapids Jul 22, 2025, 12:46 PM

#

elfin tiger And we have official support for runpod

yeah look who is the maintainer, this guy

elfin tiger Jul 22, 2025, 12:47 PM

#

😄

stray glacier Jul 22, 2025, 12:47 PM

#

HAHAHAHAAH. nice..

#

So, I can download the model from the UI? Or do I need to use SSH? And... does it already support an OpenAI-compatible API?

#

Provides many compatible APIs endpoints for many popular webservices (KoboldCppApi OpenAiApi OllamaApi A1111ForgeApi ComfyUiApi WhisperTranscribeApi XttsApi OpenAiSpeechApi) << oh this right?

elfin tiger Jul 22, 2025, 12:48 PM

#

stray glacier So, I can download the model from the UI? Or I need to use SSH? And... Is it alr...

Only yes to the last one

#

You edit these to pick a model

#

In the args you can define the max context size (or reduce the layers for a partial cpu/gpu)

#

And of course if you don't need them you can delete the image generation one, etc

stray glacier Jul 22, 2025, 12:50 PM

#

elfin tiger You edit these to pick a model

Where can I see all the parameters available?

elfin tiger Jul 22, 2025, 12:51 PM

#

stray glacier Where can I see all the parameters available?

Like the screenshot or every KCPP_ARGS variable?

stray glacier Jul 22, 2025, 12:51 PM

#

elfin tiger Like the screenshot or every KCPP_ARGS variable?

Oh, I mean like a link to the documentation, so I know all the names of the parameter and the description on it. hahahahahaha.

#

Or that's already all?

elfin tiger Jul 22, 2025, 12:52 PM

#

These are for the KCPP_ARGS

📎 kcpp-help.txt

#

The defaults are quite complete though

stray glacier Jul 22, 2025, 12:53 PM

#

elfin tiger These are for the KCPP_ARGS

Sorry >,<.. I mean the parameter for the kobold pod though. hahahaah.

These "KCPP_MODEL", "KCPP_ARGS", etc...

elfin tiger Jul 22, 2025, 12:53 PM

#

All of them are there except for KCPP_MMPROJ

stray glacier Jul 22, 2025, 12:53 PM

#

elfin tiger All of them are there except for KCPP_MMPROJ

Ah okay... Thank you thank you :D.. Will try it soon.

elfin tiger Jul 22, 2025, 12:54 PM

#

Should be a breeze to set up for GGUF as long as what you're doing fits the specs

stray glacier Jul 22, 2025, 1:00 PM

#

So, in KCPP_MODEL I just need to put: "https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-UD-IQ2_XXS-00001-of"

Like that?

elfin tiger Jul 22, 2025, 1:05 PM

#

The full link of part 1
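Split GGUF uploads conventionally end in `-00001-of-000NN.gguf`, so a quick check that a link really points at part 1 can catch a truncated URL before a pod is deployed. This is a hedged sketch: the helper name is made up, the `example.com` URLs are placeholders rather than the real Hugging Face paths, and the check only looks at the file name, not at whether the file exists.

```python
import re
from urllib.parse import urlparse

# Matches names like "model-00001-of-00013.gguf" (5-digit part index and part count).
_SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def is_first_split_part(url: str) -> bool:
    """True if the URL path ends in '-00001-of-NNNNN.gguf'."""
    match = _SPLIT_RE.search(urlparse(url).path)
    return bool(match) and int(match.group(1)) == 1


if __name__ == "__main__":
    # PLACEHOLDER URLs for illustration only.
    print(is_first_split_part("https://example.com/repo/model-Q4-00001-of-00013.gguf"))  # True
    print(is_first_split_part("https://example.com/repo/model-Q4-00001-of"))             # False
```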

stray glacier Jul 22, 2025, 1:06 PM

#

Sorry for asking more:
so, do I need to use --usecuda? The doc says "if you're on windows", but on Runpod we're using Linux.

Does this pod already accept "--flashattention"?

stray glacier Jul 22, 2025, 1:06 PM

#

elfin tiger The full link of part 1

Ah I see.

elfin tiger Jul 22, 2025, 1:06 PM

#

If you don't intend to put everything on the GPU you also need to customize --gpulayers, but I don't know the optimal partial offload for this model

#

--usecuda you need yes, it forces nvidia GPU mode

#

--flashattention should be specified in KCPP_ARGS already
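Putting the flags from this exchange together, a KCPP_ARGS value might be assembled like this. It is only a sketch under assumptions: the helper name is made up, the numbers are illustrative rather than the template's defaults, `--contextsize` is included on the assumption that it is the flag behind the "max context size" setting mentioned earlier, and the template is assumed to read KCPP_ARGS as one shell-style string.

```python
import shlex


def build_kcpp_args(gpu_layers: int, context_size: int, flash_attention: bool = True) -> str:
    """Assemble a shell-style argument string for the KCPP_ARGS variable."""
    args = ["--usecuda", "--gpulayers", str(gpu_layers), "--contextsize", str(context_size)]
    if flash_attention:
        args.append("--flashattention")
    return shlex.join(args)


if __name__ == "__main__":
    # ILLUSTRATIVE values: 99 layers means "everything on the GPU", 32K context.
    print(build_kcpp_args(gpu_layers=99, context_size=32768))
```

Checking the flag names against the attached kcpp-help.txt before deploying is the safer route, since options can change between KoboldCpp releases.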

stray glacier Jul 22, 2025, 1:08 PM

#

elfin tiger If you don't intend to put everything on the GPU you also need to customize --gp...

From the doc, I read I can just put "-1" so it automatically offloads what it needs.

#

Okay2, Thanks!!!

elfin tiger Jul 22, 2025, 1:08 PM

#

stray glacier From the doc, I read I can just put "-1" so it's automatically offload what it n...

Don't do that

#

It can currently only see the vram of one GPU when it picks them

#

It will put too few

#

-1 is meant for home users

stray glacier Jul 22, 2025, 1:13 PM

#

I see.. Okay... let's say I have 10 GB of VRAM in total, and the model is 40GB.

How do I find the max GPU offload value, so I know what value to put? << I never know how to calculate this.

elfin tiger Jul 22, 2025, 1:14 PM

#

stray glacier I see.. Okay... let say if I have 10 VRAM in total, and the model is 40GB. How...

10vram on a single GPU?

stray glacier Jul 22, 2025, 1:14 PM

#

elfin tiger 10vram on a single GPU?

Let say it's across 4GPU to make the example more relevant. hhahaha

elfin tiger Jul 22, 2025, 1:14 PM

#

We usually trial and error

#

Lets say it has 40 layers to keep it simple

#

You're 4 times over

#

So then its around 10 layers
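The rule of thumb in this example (the share of the model that fits in VRAM, times the layer count) can be sketched as a starting point for the trial and error. The function name and the `reserve_gb` parameter are assumptions added for illustration; in practice some VRAM also goes to context and buffers, so the real number usually ends up lower.

```python
def estimate_gpu_layers(total_layers: int, model_size_gb: float,
                        total_vram_gb: float, reserve_gb: float = 0.0) -> int:
    """First guess for --gpulayers: the fraction of the model that fits in VRAM."""
    usable_gb = max(total_vram_gb - reserve_gb, 0.0)
    fraction = min(usable_gb / model_size_gb, 1.0)
    return int(total_layers * fraction)


if __name__ == "__main__":
    # The example from the thread: 40 layers, 40 GB model, 10 GB of VRAM in total.
    print(estimate_gpu_layers(40, 40, 10))  # 10
```

Assuming layers are roughly equal in size is itself a simplification, which is one reason MoE models are treated as a special case.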

#

MoE's are a bit of a special case

#

Unsloth recommends keeping it 99 but using override tensors

#

--overridetensors ".ffn_.*_exps.=CPU" is one they say you can try

#

#

For us -ot is --overridetensors
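To see what that override pattern selects, the regex can be tried against a few tensor names. This is a sketch with assumptions: the tensor names below are illustrative examples in the style of llama.cpp GGUF naming, not read from the actual Kimi K2 file, and Python's `re` stands in for the regex engine the real program uses (for a pattern this simple the two should agree).

```python
import re

override = ".ffn_.*_exps.=CPU"
pattern, _, device = override.rpartition("=")  # -> ".ffn_.*_exps." and "CPU"

# ILLUSTRATIVE tensor names for one MoE block plus the output layer.
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_shexp.weight",
    "output.weight",
]

# Tensors whose name matches the pattern are kept on the named device.
on_device = [name for name in tensor_names if re.search(pattern, name)]

if __name__ == "__main__":
    print(f"Placed on {device}:")
    for name in on_device:
        print("  ", name)
```

Only the per-expert feed-forward tensors match, which is why `--gpulayers` can stay at 99: the attention and shared tensors remain on the GPU while the bulky expert weights sit in system RAM.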

stray glacier Jul 22, 2025, 1:17 PM

#

Oh wow... Okay2. Will read that first too...

elfin tiger Jul 22, 2025, 1:17 PM

#

That puts specific ones on the GPU so its more performant

#

Thing is on runpod because ram and vram are tied its not always a good idea

#

The MoE's consume a lot of ram usually, non MoE's only consume the ram for stuff thats not on the GPU

stray glacier Jul 22, 2025, 1:23 PM

#

elfin tiger Thing is on runpod because ram and vram are tied its not always a good idea

hooo.. ic ic..

#

Okay, I think I'll need a little trial and error. At least I'll post the results here, and whether it's good enough to run Roo Code :D.

half rapids Jul 22, 2025, 1:24 PM

#

elfin tiger Thing is on runpod because ram and vram are tied its not always a good idea

Fair stuff for hosts

#

Because the pricing mostly depends on the multiple of gpu so..

elfin tiger Jul 22, 2025, 1:25 PM

#

stray glacier Okay, I think I will need some trial and error a little. At least I will post ho...

Personally I doubt it

#

Roo code switches prompts all the time, ton of reprocessing

#

Gonna have long delays and tunnel timeouts if you don't edit the template so you can connect to port 5001 over TCP

#

And I think in general unless you hyper optimize it it will be more expensive than API providers

#

A model this large single user just isn't economical

#

For the smaller ones like GLM / Devstral runpod + koboldcpp works very well

#

Same if you use a 100B
