"delayTime": 133684,
"error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
"executionTime": 45263,
"id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
# Why "CUDA out of memory" today? Same image to generate a portrait: yesterday it was OK, today it is not.
Ask the developer of the application, it has nothing to do with RunPod.
seems like an out-of-memory error
meaning you need a bigger GPU for that
or try unloading your other models somehow
I am the developer. When I use my AI app, I get CUDA out of memory. I changed nothing in the app.
Then it needs a larger GPU as nerdy said.
It looks like you are trying to use a 24GB GPU when you need more VRAM. Try to run it on a 48GB GPU. If that is still not enough, then try to run it on an 80GB GPU.
OK, I see, I will test.
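Before jumping to a bigger card, the numbers in the error message itself can be broken down. This is a rough sketch, not something anyone in the thread ran: `summarize_oom` is a hypothetical helper, and it assumes the message uses GiB for every figure, as the one posted at the top does.

```python
import re

# The error text posted at the top of the thread (hint sentence trimmed).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; "
    "18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch)"
)

_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB.*?"
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) GiB free; "
    r"(?P<reserved>[\d.]+) GiB reserved"
)


def summarize_oom(message):
    """Hypothetical helper: pull the GiB figures out of a PyTorch OOM message."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a PyTorch CUDA OOM error")
    stats = {name: float(value) for name, value in match.groupdict().items()}
    # Memory PyTorch keeps in its cache but has not handed out to tensors.
    stats["cached_unused"] = round(stats["reserved"] - stats["allocated"], 2)
    # Memory on the card that is neither free nor reserved by this process.
    stats["outside_pytorch"] = round(
        stats["total"] - stats["reserved"] - stats["free"], 2
    )
    return stats


if __name__ == "__main__":
    for name, gib in summarize_oom(OOM_MESSAGE).items():
        print(f"{name}: {gib} GiB")
```

For this message, `cached_unused` comes out at 1.62 GiB, which is more than the 1.50 GiB request. That is the situation the error's `max_split_size_mb` hint is about, so fragmentation may be part of the problem. `outside_pytorch` comes out at 1.75 GiB, which covers the CUDA context plus anything else on the GPU that this process's allocator does not own.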
I have exactly the same problem. We have changed nothing in our setup. Just today, most image generations fail.
I have a second serverless endpoint running that uses the same template. That one is running fine.
how does your setup work?
does it unload models?
what models are loaded in VRAM? maybe too many models are loaded
I have just realised this only happened on one specific worker: m07jdb658oetph
that's why not all of the generations failed and my other endpoint runs fine
Interesting. When it happens, try to collect a traceback or logs
I have not switched it back on. But I can give you the logs from the weekend when it happened
Any stacktrace?
maybe somewhere here:
2024-08-03 20:33:07.360 | info | m07jdb658oetph | ', memory monitor disabled
2024-08-03 20:33:07.360 | info | m07jdb658oetph | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-08-03 20:33:07.360 | info | m07jdb658oetph | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-08-03 20:33:07.360 | info | m07jdb658oetph | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-08-03 20:33:07.360 | info | m07jdb658oetph | Warning: caught exception 'CUDA error: out of memory
sorry, not sure how I would get a stacktrace. I just downloaded the logs directly from runpod
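One way to get a stack trace into the worker logs is to wrap the generation call and print the traceback before re-raising. This is only a sketch: `failure_report` and `run_with_report` are hypothetical names, and it assumes your handler calls into PyTorch in the same process. If the handler only talks to A1111 over HTTP, the traceback has to come from the A1111 process's own output instead.

```python
import traceback

try:
    import torch  # optional: only used to add GPU memory statistics
except ImportError:
    torch = None


def failure_report(exc):
    """Hypothetical helper: format an exception plus GPU memory stats if available."""
    parts = ["".join(traceback.format_exception(type(exc), exc, exc.__traceback__))]
    if torch is not None and torch.cuda.is_available():
        # memory_summary() reports the caching allocator's view of the device.
        parts.append(torch.cuda.memory_summary(abbreviated=True))
    return "\n".join(parts)


def run_with_report(job, *args, **kwargs):
    """Run job(*args, **kwargs); on RuntimeError, print a report and re-raise."""
    try:
        return job(*args, **kwargs)
    except RuntimeError as exc:  # PyTorch's CUDA OOM error is a RuntimeError
        print(failure_report(exc), flush=True)
        raise
```

Anything printed to stdout this way should show up next to the other log lines for that worker, so the failing call can be identified the next time it happens.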
This?
{"endpointId": "6oe3safoiwidj3", "workerId": "m07jdb658oetph", "level": "info", "message": "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. ", "dt": "2024-08-03 18:27:11.64919904"}
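The log lines suggest `CUDA_LAUNCH_BLOCKING=1`, and the original error suggests `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. Both are environment variables that need to be in place before CUDA is initialised. A minimal sketch of setting them from Python follows; `enable_cuda_debug_env` is a hypothetical helper and the 512 MB split size is an illustrative value, not a recommendation from the thread.

```python
import os


def enable_cuda_debug_env(max_split_size_mb=512):
    """Hypothetical helper: set CUDA debug variables unless already set.

    Call this before the first `import torch` in the process, so the values
    are visible when CUDA and the caching allocator start up.
    """
    # Report kernel errors at the call that caused them (slower, debug only).
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    # Limit the size of blocks the allocator will split, to reduce fragmentation.
    os.environ.setdefault(
        "PYTORCH_CUDA_ALLOC_CONF", f"max_split_size_mb:{max_split_size_mb}"
    )
    return {
        name: os.environ[name]
        for name in ("CUDA_LAUNCH_BLOCKING", "PYTORCH_CUDA_ALLOC_CONF")
    }


if __name__ == "__main__":
    print(enable_cuda_debug_env())
```

On a serverless endpoint the same two variables can usually be set in the template's environment instead, which avoids touching the code at all.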
we run stable diffusion with automatic1111
so all workers run the same specific model, LoRAs, etc.?
yes
then all of them should be throwing that OOM error if they load the same models
so there must be some workers that are loading and unloading models dynamically
and that's something you should find out from the application that you're using
We don't have that functionality in our code.
They should all load the very same way
This specific worker had a 100% fail rate though.
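A per-worker failure count can be pulled out of the downloaded logs to back this up. The sketch below assumes the `timestamp | level | workerId | message` layout of the lines pasted earlier in the thread; `oom_count_by_worker` is a hypothetical helper, and the sample text is two of those pasted lines.

```python
from collections import Counter

# Two of the log lines pasted earlier in the thread.
SAMPLE_LOG = """\
2024-08-03 20:33:07.360 | info | m07jdb658oetph | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-08-03 20:33:07.360 | info | m07jdb658oetph | Warning: caught exception 'CUDA error: out of memory
"""


def oom_count_by_worker(log_text):
    """Hypothetical helper: count 'out of memory' log lines per worker ID."""
    counts = Counter()
    for line in log_text.splitlines():
        # Split into at most four fields so a '|' inside the message survives.
        parts = [part.strip() for part in line.split(" | ", 3)]
        if len(parts) != 4:
            continue  # not a log line in the expected layout
        _timestamp, _level, worker_id, message = parts
        if "out of memory" in message.lower():
            counts[worker_id] += 1
    return counts


if __name__ == "__main__":
    for worker_id, count in oom_count_by_worker(SAMPLE_LOG).most_common():
        print(worker_id, count)
```

If the counts pile up on a single worker ID while other workers on the same template stay at zero, that is worth including when reporting the worker.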
try reporting it to RunPod for now, but I guess this is more of an application issue
maybe you didn't unload the models somewhere
How? I don't see a file option in DM
It's an OOM issue. Why are you using sdp attention and not xformers?
What about it? Does xformers have lower VRAM usage?
yes
I'll try this in a new deployment. Just thought it was odd that just this one worker failed
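For the new deployment, switching attention backends in A1111 comes down to the launch flags. The sketch below assumes the flags are passed as a single `COMMANDLINE_ARGS`-style string and that the `xformers` package is installed in the image; `use_xformers` is a hypothetical helper, and the actual launch configuration in this thread was never posted.

```python
import shlex

# A1111 flags that select scaled-dot-product attention.
SDP_FLAGS = {"--opt-sdp-attention", "--opt-sdp-no-mem-attention"}


def use_xformers(commandline_args):
    """Hypothetical helper: drop the sdp attention flags and add --xformers."""
    args = [arg for arg in shlex.split(commandline_args) if arg not in SDP_FLAGS]
    if "--xformers" not in args:
        args.append("--xformers")
    return shlex.join(args)


if __name__ == "__main__":
    print(use_xformers("--api --opt-sdp-attention"))
```

Whether this lowers VRAM usage enough is something to measure on the affected worker rather than assume.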
A1111 can fail intermittently with OOM errors based on your request. I experienced random/intermittent OOM and had to upgrade from 24GB to 48GB GPU tier.