#Why "CUDA out of memory" today? Same image to generate a portrait: yesterday it was OK, today it is not.


lofty coral
#

"delayTime": 133684,
"error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
"executionTime": 45263,
"id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
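
The figures in that error message can be read off mechanically. The following is a minimal sketch, assuming the older PyTorch OOM message format quoted above; `parse_oom` is a hypothetical helper, not part of PyTorch or RunPod.

```python
import re

# Error text copied from the job output above (truncated after the figures).
ERROR = (
    "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; "
    "18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch)"
)

def parse_oom(message: str) -> dict:
    """Extract the GiB figures from an OOM message in the format shown above.

    Hypothetical helper: newer PyTorch versions word this message differently.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        stats[name] = float(match.group(1))
    return stats

stats = parse_oom(ERROR)

# Memory PyTorch holds in its cache but is not currently using for tensors.
cached_unused = stats["reserved"] - stats["allocated"]

# Memory in use on the GPU that PyTorch's allocator does not account for
# (CUDA context, other processes, and so on).
outside_pytorch = stats["total"] - stats["reserved"] - stats["free"]

print(f"requested:       {stats['requested']:.2f} GiB")
print(f"free on device:  {stats['free']:.2f} GiB")
print(f"cached, unused:  {cached_unused:.2f} GiB")
print(f"outside PyTorch: {outside_pytorch:.2f} GiB")
```

Here the 1.50 GiB request is slightly larger than the 1.47 GiB free on the device, and the gap between reserved and allocated memory is small compared with the total. That suggests the card is simply close to full rather than badly fragmented, so the `max_split_size_mb` hint (set via `PYTORCH_CUDA_ALLOC_CONF`) may not help much in this case.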

vital oak
#

Ask the developer of the application, it has nothing to do with RunPod.

steep fjord
#

meaning you need a bigger GPU for that

#

or try unloading your other models somehow

vital oak
#

Then it needs a larger GPU as nerdy said.

normal isle
#

It looks like you are trying to use a 24GB GPU when you need more VRAM. Try to run it on a 48GB GPU. If that is still not enough, then try to run it on an 80GB GPU.

lofty coral
#

OK, I see, I will test.

rigid cove
#

I have exactly the same problem. We have changed nothing in our setup. Just today, most image generations fail.

#

I have a second serverless endpoint running that uses the same template. That one is running fine.

steep fjord
#

how does your setup work?

#

does it unload models?

#

what models are loaded in VRAM? maybe too many models are loaded
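
One way to answer the "what is in VRAM" question from inside a worker is to log memory figures before each job. This is a minimal sketch, assuming the handler runs in a process where PyTorch can see the GPU; `has_headroom` and `vram_report` are hypothetical helper names, and the 1.5 GiB threshold is only an illustration taken from the failed allocation in the error above.

```python
GIB = 1024 ** 3

def has_headroom(free_bytes: int, needed_gib: float) -> bool:
    """Return True if the free device memory covers the amount we expect to need."""
    return free_bytes >= needed_gib * GIB

def vram_report() -> dict:
    """Collect device-level and PyTorch-level memory figures, in GiB.

    torch is imported lazily so that has_headroom can be used without a GPU.
    """
    import torch

    # Device-wide free/total memory in bytes, as reported by the CUDA driver.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        "free_gib": free_bytes / GIB,
        "total_gib": total_bytes / GIB,
        # Memory occupied by live tensors in this process.
        "allocated_gib": torch.cuda.memory_allocated() / GIB,
        # Memory held by PyTorch's caching allocator in this process.
        "reserved_gib": torch.cuda.memory_reserved() / GIB,
    }

# Example with the figures from the error message: 1.47 GiB free, 1.50 GiB needed.
print(has_headroom(int(1.47 * GIB), 1.50))
```

Logging `vram_report()` at the start of each request on every worker would show whether the failing worker starts out with less free memory than the healthy ones.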

rigid cove
#

I have just realised this only happens on one specific worker: m07jdb658oetph

#

that's why not all of the generations failed and my other endpoint runs fine

steep fjord
#

Interesting. When it happens, try to collect a traceback or logs.

rigid cove
#

I have not switched it back on. But I can give you the logs from the weekend when it happened

steep fjord
#

Any stacktrace?

#

maybe somewhere here:


2024-08-03  20:33:07.360 | info | m07jdb658oetph | ', memory monitor disabled

2024-08-03  20:33:07.360 | info | m07jdb658oetph | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-08-03  20:33:07.360 | info | m07jdb658oetph | For debugging consider passing CUDA_LAUNCH_BLOCKING=1

2024-08-03  20:33:07.360 | info | m07jdb658oetph | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

2024-08-03  20:33:07.360 | info | m07jdb658oetph | Warning: caught exception 'CUDA error: out of memory

rigid cove
#

Sorry, not sure how I would get a stacktrace. I just downloaded the logs directly from RunPod.

#

This?

{"endpointId":"6oe3safoiwidj3", "workerId":"m07jdb658oetph", "level":"info", "message":"Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. ", "dt":"2024-08-03 18:27:11.64919904"}

steep fjord
#

What's the application that you're using?

#

that creates that

rigid cove
#

We run Stable Diffusion with AUTOMATIC1111.

steep fjord
#

so all workers run the same specific model, LoRAs, etc.?

rigid cove
#

yes

steep fjord
#

then all of them should be throwing that OOM error if they load the same models

#

so there must be some workers that are loading and unloading models dynamically

#

and that's something you should find out from the application that you're using

rigid cove
#

We don't have that functionality in our code.

They should all load the very same way

#

This specific worker had a 100% fail rate though.
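
A quick way to confirm that the errors are confined to one worker is to count OOM log lines per `workerId`. This is a minimal sketch, assuming the logs can be exported as one JSON object per line with `workerId` and `message` fields like the record pasted above; `oom_counts_by_worker` and the second worker ID in the sample are hypothetical.

```python
import json
from collections import Counter

def oom_counts_by_worker(lines):
    """Count log records mentioning 'out of memory', grouped by workerId."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if "out of memory" in str(record.get("message", "")).lower():
            counts[record.get("workerId", "unknown")] += 1
    return counts

# Sample records: the first worker ID is from this thread, the second is made up.
sample = [
    json.dumps({"workerId": "m07jdb658oetph",
                "message": "Warning: caught exception 'CUDA error: out of memory"}),
    json.dumps({"workerId": "m07jdb658oetph",
                "message": "CUDA out of memory. Tried to allocate 1.50 GiB"}),
    json.dumps({"workerId": "hypothetical-worker-b",
                "message": "Model loaded"}),
]

counts = oom_counts_by_worker(sample)
for worker, n in counts.most_common():
    print(worker, n)
```

If one worker accounts for all of the OOM lines while others running the same template have none, that points at the state of that particular worker rather than at the template itself.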

steep fjord
#

where's the code for loading the model tho

#

i'll try to look at it

rigid cove
#

I'll send you the start.sh and handler script.

steep fjord
#

maybe you didn't unload the models somewhere

rigid cove
#

How? I don't see a file option in DM.

steep fjord
#

just the model loading code maybe

#

A1111 loads models dynamically as far as I know

vital oak
#

It's an OOM issue. Why are you using SDP attention and not xformers?

steep fjord
#

what about it? does xformers have lower VRAM usage?

vital oak
#

yes

rigid cove
#

I'll try this in a new deployment. I just thought it was odd that only this one worker failed.

vital oak
#

A1111 can fail intermittently with OOM errors based on your request. I experienced random/intermittent OOM and had to upgrade from 24GB to 48GB GPU tier.