#Slow startup times

meager tinsel
#

Has anyone experienced really variable startup times? Loading comfyui today took 45+ minutes when it usually takes 1-2 minutes.

Also, jupyterlab in general has been laggy / not responsive. Yesterday was working just fine, so not sure if that's just me.

austere moth
#

@meager tinsel not just you, I am also experiencing exactly the same issues with EU-RO-1, just joined here to see if anyone else was having trouble

#

Jupyter is very laggy and keeps 'sticking,' echoing back some seconds later. Everything seems very slow/unusable when usually it's very responsive. I also noticed the pod local storage got to 107% at one stage, despite the fact that I don't do anything outside of /workspace. As usual I am running runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 with network storage, and I never usually have any issues with it.

bronze shuttle
#

yup, the jupyter lags a little bit, mine too

bronze shuttle
#

maybe open a ticket

sleek saffron
#

same here

brave lagoon
#

same, EU-RO-1 as well

#

disk I/O seems to be extremely slow i think?

#

python3 -m venv venv took about a minute to finish
and libraries are taking forever to load:

Testing torch import...
torch imported successfully in 70.69s
torch version: 2.7.1+cu128
Testing transformers import...
transformers imported successfully in 99.36s
transformers version: 4.52.4
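
The script that produced the timings above was not posted. A minimal sketch of the same kind of check, assuming the goal is just to time how long each import takes from the current environment (`time_import` is a hypothetical helper name, not something from the thread):

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Import a module by name and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Module names taken from the output above; swap in whatever your app loads.
    for name in ("torch", "transformers"):
        print(f"Testing {name} import...")
        try:
            elapsed = time_import(name)
        except ImportError as exc:
            print(f"{name} could not be imported: {exc}")
            continue
        print(f"{name} imported successfully in {elapsed:.2f}s")
```

Only the first import in a fresh interpreter is meaningful, since later imports of the same module are served from Python's module cache. Running it once from a venv on /workspace and once from a venv on the container disk gives a rough comparison of the two.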
fathom quiver
#

Same for us: running multiple pods with network storage on EU-RO-1 and they are extremely slow. ComfyUI takes 10 to 30 minutes to start, we constantly get Cloudflare timeouts, and the instance is pretty much inaccessible.

I opened a ticket https://contact.runpod.io/hc/en-us/requests/19551 yesterday and am looking to see if I can provide more relevant info.

Hope this gets solved! 🤓

bronze shuttle
#

many users are reporting this

spiral glade
#

Facing the same issue

fathom quiver
#

The container volumes work at expected speeds though, it seems that it's only network volume related

bronze shuttle
#

Oh ic

#

nvm

fathom quiver
#

😅

bronze shuttle
#

trying to run this sageattn installer on slow ns be like

fathom quiver
#

happy to check for you

bronze shuttle
#

doesn't work, slow, frustrating

fathom quiver
#

yea!!!

#

ComfyUI is pretty much unusable, especially if you have a large install with multiple custom nodes

bronze shuttle
#

still usable, just longer startup heheh

white bone
#

way too slow on RTX 4000 Ada in EU-RO-1, can't even start comfyui

fathom quiver
#

It seems that this is affecting serverless workers using a different network storage on EU-RO-1

#

Our serverless workers are no longer starting up due to timeouts

bronze shuttle
#

🥲 are you able to migrate to another region for now?

#

seems like runpod staff aren't too active at these times

fathom quiver
#

That would mean moving around 1 TB of data from different network volumes and reinstalling multiple containers

white bone
#

i am using a permanently mounted disk, but it gets stuck

sleek saffron
#

How is it possible that nothing has happened, it's been almost 24H

fathom quiver
#

But there's clearly an issue here

bronze shuttle
#

i'll try to notify some staff

fathom quiver
#

I'd be happy to provide more info if needed

bronze shuttle
#

like what kind of info

fathom quiver
#

Anything that's needed by staff for debugging on our side

tawny cargo
#

is the issue still going on? can someone provide me some pod ids? I will check with the infra team.

bronze shuttle
#

i guess so

#

0r8r7mugbuw42j this is a cpu pod

#

let me try loading comfyui again

#

if i haven't reported back after 4 minutes, just assume it's still there

white bone
#

@tawny cargo
here is mine:
zafzxwm66rcvy8

#

Same issue

bronze shuttle
#

yup still loading until now

#

so its still slow

fathom quiver
#

Also, serverless worker: 1kxd62zyws4zq0

#

The pod 4n17vfvlfgxuwj took around 20 minutes to boot ComfyUI from /workspace/ComfyUI but I could not connect to it after it booted

#

Trying to deploy another one to see how it behaves

tawny cargo
#

i've just run a speedtest on the machine, the network is good. can you give me a screenshot of what is slow for you?

fathom quiver
#

For e.g. serverless worker n85qs8g9swe0ed is currently trying to initialize comfyUI from /workspace/ComfyUI (network storage)

sleek saffron
#

same here, my comfyui backend is loading forever and not booting

bronze shuttle
#

Not the network speed..

fathom quiver
#

Again, the issue is with network volumes

bronze shuttle
#

its the network storage (disk speed)

sleek saffron
#

I use network storage

tawny cargo
#

network volume speed is slow?

fathom quiver
#

Yes!!

#

😄

tawny cargo
#

got it, let me run some tests

bronze shuttle
#

want my script for testing?

white bone
#

On my end the network seems good, but comfyui is taking too long to boot. Seems like the issue is in the volume!

fathom quiver
#

I'm connected to two separate teams running services on different network volumes on EU-RO-1 and they both fail

#

both normal pods and serverless workers

#

the speed on the temp volume on the container works well

#

it's just the network volumes that are getting hit

#

e.g. pod: enzp9x316vp4yo

fathom quiver
#

and serverless worker 1kxd62zyws4zq0 is currently stalling

bronze shuttle
#

i hope this can be fixed soon, sent a disk speed testing script to your dm if you need it yhlong

tawny cargo
#

Ran a test, seems a bit slower than normal. I’ve pinged the infra team to take a look, it’s the weekend so the response might be a bit slow. If you need a quicker workaround, you could temporarily switch to another region and copy the files over. I know it’s not ideal, but it might help for now. Appreciate your patience!

fathom quiver
#

We're actually running a live production which needs the serverless to work in the next hour. Transferring 400 GB between regions is pretty much impossible

#

Will try to find a different solution

bronze shuttle
#

setting up from another source would be more ideal

bronze shuttle
# tawny cargo

yeah it's not a great option, i tried using syncthing, reading files takes long too (my files are about 60 GB)

fathom quiver
# tawny cargo

Yep, thanks, but this would take around 3 days to complete

#

We'll re-route it to our local machines

#

Thanks for looking into this!

#

Fingers crossed that someone will look into it soooooon 🤞

tawny cargo
#

The network speed actually looks pretty solid, 400GB should finish transferring in about an hour.

fathom quiver
#

Yea, but the network volume that I'm transferring from is the one affected

#

I get less than 100Mb/s

#

And it's patchy
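
For reference, the gap between "about an hour" and "days" comes down to sustained throughput and to whether the figure is in megabytes or megabits per second, which the messages above do not make clear. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 10^9 bytes); `transfer_hours` is a hypothetical helper:

```python
def transfer_hours(size_gb: float, rate: float, unit: str = "MB/s") -> float:
    """Estimate hours needed to move size_gb gigabytes at a sustained rate.

    unit is "MB/s" (megabytes per second) or "Mb/s" (megabits per second).
    Decimal units are assumed: 1 GB = 1e9 bytes, 1 MB = 1e6 bytes.
    """
    if unit == "MB/s":
        bytes_per_second = rate * 1e6
    elif unit == "Mb/s":
        bytes_per_second = rate * 1e6 / 8  # 8 bits per byte
    else:
        raise ValueError(f"unknown unit: {unit}")
    return size_gb * 1e9 / bytes_per_second / 3600


if __name__ == "__main__":
    for rate, unit in ((100, "MB/s"), (100, "Mb/s"), (10, "MB/s")):
        hours = transfer_hours(400, rate, unit)
        print(f"400 GB at {rate} {unit}: {hours:.1f} h")
```

These are best-case numbers for one steady stream. A patchy source volume and many small files will take longer than the estimate.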

tawny cargo
#

yeah, I am getting about 180-190 Mib/s

bronze shuttle
#

one big file?

fathom quiver
#

Exactly, but it goes up and down!! 🤓

#

I've benchmarked with both small and large files:

dd if=/dev/zero of=/workspace/slowtest bs=1M count=1024 oflag=direct
dd if=/dev/zero of=/workspace/testfile bs=1G count=1 oflag=dsync
bronze shuttle
#

and whats the result

fathom quiver
#

Different results: 50, 100, 180

#

I tried it multiple times and it goes up and down, but never more than 180-190Mb/s

dapper lodge
#

This is what I see until I get timed out while trying to run Comfy with a Network Volume. A Romanian 5090. Will this be resolved?

fathom quiver
#

Same here

#

ComfyUI takes super long to load, and then if it loads, you cannot connect to it

grave sonnet
#

i am stuck and it's very laggy when using it

formal sphinx
#

same here, dismal load times

grave sonnet
#

want to confirm whether our problem is the same or not

#

u are also using network volume?

blissful pendant
#

same problem here

grave sonnet
#

two things u can help me with

  1. can u ssh into the pod, cd /workspace, and try running some linux commands, to see if it is laggy or not?
  2. can u try to make a folder with many files in root, then copy them to /workspace with cp -rv, and see if the copying lags after some files, goes back to normal, then lags again?
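
The second check can also be scripted so the stalls show up as numbers instead of something you watch scroll by. A minimal sketch, assuming the network volume is mounted at /workspace as described in the thread; the helper names, file count, and the 0.5 s stall threshold are all made up for illustration:

```python
import shutil
import sys
import tempfile
import time
from pathlib import Path
from typing import List


def make_small_files(directory: Path, count: int = 200, size: int = 4096) -> None:
    """Create `count` files of `size` bytes each inside `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    payload = b"x" * size
    for i in range(count):
        (directory / f"file_{i:05d}.bin").write_bytes(payload)


def copy_with_latencies(src: Path, dst: Path) -> List[float]:
    """Copy every file in `src` to `dst`, returning seconds taken per file."""
    dst.mkdir(parents=True, exist_ok=True)
    latencies = []
    for path in sorted(src.iterdir()):
        start = time.perf_counter()
        shutil.copy2(path, dst / path.name)
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Destination defaults to a folder on the network volume (assumed path).
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/workspace/copytest")
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "src"
        make_small_files(source)
        latencies = copy_with_latencies(source, target)
    stalls = [(i, t) for i, t in enumerate(latencies) if t > 0.5]
    print(f"copied {len(latencies)} files, slowest {max(latencies):.3f}s")
    for index, seconds in stalls:
        print(f"  stall at file {index}: {seconds:.2f}s")
```

A healthy volume should show no stalls at all. Stalls that come in bursts would match the "lag, normal, lag again" pattern described above.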
grave sonnet
#

if our issue is the same then i think we need to tag the admin to notice this issue

fathom quiver
#

We have been tagging everyone and opening tickets since yesterday, but no real intervention yet

tulip linden
#

raised this internally, we'll be looking into it, no eta on a fix

fathom quiver
#

We had to move our infrastructure to a different provider since EU-RO-1 network volumes do not work properly

fathom quiver
#

I wish this was reported as an incident to be able to track it correctly from our side

#

And to stop losing cash on booting up pods and serverless workers

#

Luckily we noticed these workers running and shut them down manually....

#

i.e. the requests were triggering workers that were timing out

fathom quiver
#

Running most operations locally and other providers now until the problem is fixed...

#

We tried to do a transfer to EUR-IS-1, but it would take days since the network volume simply cannot transfer fast enough

#

This affects multiple teams and projects on our side unfortunately

icy bronze
#

glad i'm not the only one whose work relies on this

open fulcrum
#

There should be zero charge to switch locations. I assumed this was an established company. I no longer feel comfortable referring this service.

#

why would you charge someone to change locations esp when one does not have the GPU or network?!?!

tropic flint
#

The same issue is still occurring on EU-RO-1. Since this is incurring additional charges, we would greatly appreciate your prompt assistance.

bronze shuttle
#

hows the status

maiden goblet
#

Hey all - apologies for the delay, the team was able to track down the congestion on EU-RO-1’s storage cluster and resolved it at 00:37 UTC. We’ve been monitoring for the past hour - at this time, performance should be restored to normal levels.

bronze shuttle
#

yup faster loading i see

spiral glade
#

It's still not working @maiden goblet

bronze shuttle
#

it should be working rn, what are you getting

spiral glade
#

Page not found

fathom quiver
#

@maiden goblet It still doesn't work properly on our side

#
root@90eb9f3820be:/# dd if=/dev/zero of=/workspace/slowtest bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.3823 s, 145 MB/s
#

This is the average speed I get

#

ComfyUI still takes a long time to load all modules and I cannot connect to the interface at all...

#

Is the team still on this rn?

#

As a note, this particular network volume is 540GB — another volume that's considerably smaller that belongs to another team I'm part of works

fathom quiver
#

@prisma ermine ComfyUI imports many small files (most of them only a few kb), so I feel that it's more relevant to test with a smaller block size
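
A large-block dd run mostly measures streaming throughput, while a Python startup is dominated by opening and reading thousands of small files. A minimal sketch of a read test shaped like that workload, assuming you point it at a directory on the network volume such as a venv's site-packages (the function name, the 64 KiB cutoff, and the default path are illustrative):

```python
import os
import sys
import time
from pathlib import Path
from typing import Tuple


def read_small_files(root: Path, max_bytes: int = 64 * 1024,
                     limit: int = 2000) -> Tuple[int, int, float]:
    """Read up to `limit` files no larger than `max_bytes` under `root`.

    Returns (files_read, total_bytes, elapsed_seconds).
    """
    files_read = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_size > max_bytes:
                    continue  # skip large files, we only want the small ones
                total_bytes += len(path.read_bytes())
            except OSError:
                continue  # unreadable or vanished file, ignore it
            files_read += 1
            if files_read >= limit:
                return files_read, total_bytes, time.perf_counter() - start
    return files_read, total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/workspace")
    count, size, seconds = read_small_files(root)
    per_file_ms = seconds / count * 1000 if count else 0.0
    print(f"{count} files, {size / 1e6:.1f} MB in {seconds:.2f}s "
          f"({per_file_ms:.2f} ms per file)")
```

The per-file time is the number to watch. A second run right after the first may be served from the page cache and look much faster, so compare first runs on a freshly started pod.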

#

This was not a problem a few days ago BTW

bronze shuttle
#

i feel like it's faster but it's still kind of slow

#

my comfyui with a bunch of custom nodes took like ~15 mins to load

#

totally not normal, the swarmui creator even had to patch a fix for me to make swarmui correctly detect whether comfyui is still running after loading for 15 mins

fathom quiver
#

The loading is one thing, but the overall slowness is a big issue: ComfyUI doesn't load in the browser, and if it does, models load very slowly, files cannot be uploaded or downloaded etc.

austere moth
#

Working better for me but still not entirely right, Jupyter is still a bit laggy

spiral glade
#

Hey runpod team, are you gonna fix this or not??

austere moth
#

I spoke too soon, just moved from an A2000 to a 4090 and it's very slow to load. Burning through credit here with no output!

#

The 4090 pod is giving me:
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.98087 s, 216 MB/s

I had about 600 MB/s on the A2000

#

As others have said, it's just the network storage, for root I get

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.61319 s, 666 MB/s

#

Please let us know an ETA to get this resolved. If you are not going to sort this soon then I want to delete my network volume, as I am currently paying for 450GB that I can't use.

hot bough
#

I can't even get into JupyterLab or fluxgym on my pods х_х
i lost 20 min waiting

void lake
#

I've been unable to use runpod the entire weekend. My entire weekend is just wasted not being able to get a second of work done. First time user experience isn't that great i tell ya

marble forge
#

Is it working for you guys? Still very slow here...

fathom quiver
#

Still slow for us too

#

It seems to work better every now and then, but it's super variable and unreliable

#

For e.g. we're struggling to upload a 6MB mp4 for 10 mins now

grave sonnet
#

waiting for runpod team resolve the problem

#

can someone who is using the service raise a support ticket through email?

glass pewter
#

Finally saw this thread. I spent all day trying to move to a new network storage 🤦🏻‍♂️ A heads up from the team would've been useful.

bronze shuttle
#

@normal flint whats the status on this, is it still ongoing

normal flint
#

?

bronze shuttle
#

People are experiencing slow ns speeds

#

Is runpod still working on this?

normal flint
#

what exactly is slow: uploads/downloads? from where, local or remote?

bronze shuttle
#

Network storage speed, try to read some of the chats above

#

Especially in the EU-RO-1 region

#

Slow disk speeds

fathom quiver
#

Currently spent a lot of cash to make things work and it's impossible

normal flint
#

It's just me as I'm OOO till tomorrow.

#

Though even as support I can't do much, as it's the infrastructure and reliability team's responsibility.

fathom quiver
#

Just to make it clear: the problem started friday and it affects the EU-RO-1 network volumes

#

We're currently experiencing i/o speeds between 10 and 180 MB/s

normal flint
#

I mean EU-RO-1 is often heavily used, mostly because of CPU pods

fathom quiver
#

So your recommendation is to...?

normal flint
#

usually would say change region or submit ticket so we can forward it to the team

fathom quiver
#

There are multiple of us that opened multiple tickets from different teams / accounts

#

Please read the thread starting from above

#

One of your team members acknowledged the issue and then they said it was fixed

normal flint
#

discord is not the main support platform though

normal flint
#

I don't have access to my work device right now so I'm unable to check

fathom quiver
#

Sorry, but you're the only Runpod rep online now

normal flint
#

I mean I will be checking on Monday but my friend works on weekend tickets.

#

I'm only tech support, issues like drive slowdowns need to go to the eng team as I do not have high level access.

fathom quiver
#

Alright! Don't mean to throw blame, sorry, I know it's not your personal fault, but we need someone from Runpod to communicate and provide support even during the weekends because this is affecting our projects

dapper lodge
#

I'm spending 400$ per month on Runpod and all we get when it's down for 3 days is.. nothing actually

normal flint
#

I mean all are valid things.

umbral dune
#

Please check other regions too when having the time. I had the issue with 4 pods on 2 regions since yesterday.
As someone mentioned before, the pod startup time is one issue, but the biggest one I see is performance (training takes twice as long!), and lastly the laggy Jupyter.
E.g. I'm training Flux: before I had 4s/it, since yesterday it's 8s/it!

bronze shuttle
#

Which region is the other one? (EU ro 1 and?)

normal flint
#

let me guess Fluxgym?

umbral dune
#

Just Flux dev. It was EU RO and IS, as I remember.

bronze shuttle
#

Which app do you use to train it

umbral dune
#

But cannot check as I've deleted the pods.

bronze shuttle
#

Or toolkit

umbral dune
#

OneTrainer with CLI (Onetrainer CLI 1.1 is the template name)

glass pewter
#

Do you guys have any recommendations on what region to move to? I don't want to end up in another one with issues.

umbral dune
#

And that was with RTX5090, secure cloud but guess the issue is general with any GPU.

fathom quiver
#

We had the issue with A100 PCIe, A100 SXM, RTX PRO 6000, 5090, 4090 (this on serverless) so it's definitely GPU independent imo

normal flint
#

ok did what I could do and sent a message on internal chat.

fathom quiver
#

Cool, thanks!

normal flint
#

also tried myself and also seeing it

fathom quiver
#

I wish there was a strategy to run without the network volume — this would save a lot of headaches — but for us it would be impossible to manage the python venv updates via image. And the small python modules are definitely the i/o bottleneck here

void lake
#

I'm kinda curious why all these big tech/ai companies who get most traffic on weekends when people are free have all their staff off lol.

Civitai too. Site goes to hell every friday to monday🤣 every staff is off. Makes no sense to me

normal flint
#

I have huge hopes for S3 API

fathom quiver
#

But isn't EU-RO-1 deployed on S3 too?

normal flint
#

they are a test region

fathom quiver
#

😅 is this the reason why things are not working, then?

normal flint
#

nope don't think so

#

but S3 API would help to move data between regions

umbral dune
#

For info and if I remember correctly:
Yesterday the pod could start, slow but it started. But training time was double the usual, really double! It was on EU IS.
Today on EU RO, I had to cancel the pod setup after waiting 10 minutes, it usually takes just one minute.
So it seems that some regions are slower than others but the problem is general, all GPUs and regions.

#

Note if this can help: pod deployment through SSH, on demand plan and not using network volume.

fathom quiver
#

Just getting this constantly

#

Different pods on EU-RO-1

#

Every 10-15 mins

normal flint
#

Problem should be solved pls check

fathom quiver
#

It still doesn't work

#

now a bunch of serverless workers started to fail again

#

have to switch back to local

#

We get a lot of file not found errors, as if the network volume keeps disconnecting:

{
    "error_type": "<class 'FileNotFoundError'>",
    "error_message": "[Errno 2] No such file or directory: '/runpod-volume/ComfyUI/temp/ComfyUI_temp_pcefp_00010_.png'",
    "error_traceback": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py\", line 134, in run_job\n    handler_return = handler(job)\n  File \"/rp_handler.py\", line 187, in handler\n    with open(image_path, 'rb') as image_file:\nFileNotFoundError: [Errno 2] No such file or directory: '/runpod-volume/ComfyUI/temp/ComfyUI_temp_pcefp_00010_.png'\n",
    "hostname": "pe2mc92j5rwzsd-644113f9",
    "worker_id": "pe2mc92j5rwzsd",
    "runpod_version": "1.6.2"
}
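
The traceback shows the handler opening `image_path` directly and failing straight away. The handler source was not posted, so the following is only a sketch of one possible mitigation, assuming the file does eventually become visible on the volume. `read_with_retry` is a hypothetical helper and the attempt count and delays are arbitrary:

```python
import time


def read_with_retry(path: str, attempts: int = 5, delay: float = 0.5,
                    backoff: float = 2.0) -> bytes:
    """Read a file, retrying on FileNotFoundError with growing delays.

    Raises the last FileNotFoundError if the file never appears.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except FileNotFoundError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before the next try
                delay *= backoff   # wait longer each time
    raise last_error
```

Inside a handler this would replace the bare `open(image_path, 'rb')` call. It only hides short delays in file visibility. If the output file is never written, or the volume really is disconnecting, the error still surfaces after the last attempt, and every retry adds billed worker time.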
#

Really a nightmare tbh. We'll probably switch to another provider completely next week. Can't justify this to the folks that are relying on our productions

#

And now the serverless workers are just eating through funds like crazy

normal flint
#

Why are you using such an old version of the sdk?

marble canopy
#

Same issue. Spent like 2 hours trying to get something to run in EU-RO-1 - it's just not working. Looks like storage issues again...

dapper lodge
#

Not working for me either. I am able to start Comfy in the terminal, but it's not loading in the new tab, it just shows a loading animation and after some time it throws an error

#

(RO network volume)

normal flint
#

Tried deploying a new pod?

bronze shuttle
#

1) Sequential write

• Total size: 2GiB
• Block size: 256KiB
• Count: 8192 blocks

8192+0 records in
8192+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.23865 s, 1.7 GB/s

2) Sequential read

• Same file
• Block size: 256KiB

8192+0 records in
8192+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.78476 s, 771 MB/s

pretty fast
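
The script behind these numbers was shared by DM and is not in the thread. A minimal sketch of a comparable sequential write and read test, assuming the same 2 GiB total and 256 KiB block size reported above (function names and the default path are illustrative):

```python
import os
import sys
import time


def sequential_write(path: str, total_bytes: int = 2 * 1024 ** 3,
                     block_size: int = 256 * 1024) -> float:
    """Write total_bytes of zeros in block_size chunks; return bytes/second."""
    block = b"\0" * block_size
    blocks = total_bytes // block_size
    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(blocks):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data actually left the cache
    elapsed = max(time.perf_counter() - start, 1e-9)
    return blocks * block_size / elapsed


def sequential_read(path: str, block_size: int = 256 * 1024) -> float:
    """Read the whole file in block_size chunks; return bytes/second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / elapsed


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/workspace/seqtest.bin"
    write_rate = sequential_write(target)
    read_rate = sequential_read(target)
    os.remove(target)
    print(f"write: {write_rate / 1e6:.0f} MB/s, read: {read_rate / 1e6:.0f} MB/s")
```

Reading a file right after writing it can be served partly from the page cache, so the read figure may overstate what the volume delivers. Large sequential blocks also say little about the small-file behaviour discussed earlier in the thread.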

spiral glade
#

So is it working right now??

prisma ermine
#

@bronze shuttle @maiden goblet @normal flint

It works! Thank you for your effort to resolve the issue.

fathom quiver
#

The load times seem significantly faster here too. Will report back as the teams are starting their days.

#

Thank you!

marble forge
#

It works here too! Thanks a lot @maiden goblet

umbral dune
#

Same on EU RO 1, it came back to the normal figures for performance.

dapper lodge
#

Same problem again, Comfy loads forever. Worked just 10 minutes ago

fathom quiver
#

I had the same issue a few minutes ago. It does feel like either the issue is still there, or that now it's a different, network-related issue, that wasn't observed earlier.

#

The network volume speed seems really good now, but the HTTP loading time is still taking a cap every now and then, which also leads to some components of the ComfyUI interface not loading (e.g. css files etc.)

#

Our manual fix is to see what's not loading using browser dev tools and reload those items individually, then refresh the main interface

dapper lodge
#

yeah I am not a dev so I have no idea how to do all that, I just need to generate some images man

#

@normal flint

bronze shuttle
#

can you try opening your comfyui web, in an incognito page

#

if it doesn't work press f12 and try to screenshot the network tab
(my comfyui does load btw)

dapper lodge
#

incognito tab

bronze shuttle
#

can you reinstall comfyui frontend

#

or the whole comfyui

#

to see if it works, might be your installation

glass pewter
#

I am also suffering from VERY slow starts on EU-RO-1 still. I migrated my network storage to US-TX-3, which loads perfectly fine (except it quickly runs out of pods lol).

bronze shuttle
#

slow comfyui starts?

glass pewter
#

Yeah, I start it on the console and it takes over 10 minutes to get running. Just a few seconds on the other region.

bronze shuttle
#

maybe open a ticket so staff can look at it

prisma ermine
# dapper lodge

well, I've had this issue since day one. Somehow I've accepted that as normal behaviour.. just wait a few minutes for it to load.

meager tinsel
#

Is anyone seeing slow startup times again or is that just me? Have been trying to launch for the past few hours

bronze shuttle
#

Comfyui? Which region?