Extremely slow network storage | Runpod

plush sapphire Dec 23, 2024, 8:34 AM

#

For a couple of weeks now, it's taken >15 minutes (sometimes up to 20) to load a 70B model from network storage in CA-MTL. This is, of course, paid time on the GPU while waiting for it (as well as being quite inconvenient) - it's actually quicker to download the model fresh from an internet server every time rather than waiting for it to load from network storage. Is the "network storage" actually on the same network as the GPU server, or is it on some cloud somewhere? Why is it so slow?

shut yew Dec 23, 2024, 9:59 AM

#

Hey, I'm hitting the same problem here at CA-MTL-1, hope it gets fixed fast!
This is my pod ID: d4cfdt3tvtinod

#

Today my model training was delayed by the terrible model loading time, and then my kohya-ss training was killed halfway through by a log-writing failure, three times. This never happened before, but today I can't get through a complete training run even once.
I am a faithful user of yours, and really need it to be fixed soon!

#

Should I keep pod d4cfdt3tvtinod around so you can debug it?

fading saffron Dec 23, 2024, 10:42 AM

#

Same problem in EU-SE-1

#

Also happens with new pods

celest igloo Dec 23, 2024, 1:18 PM

#

shut yew Should I keep pod d4cfdt3tvtinod around so you can debug it?

No, I think you can just stop it.

#

As long as you can reference the ID, it's fine.

#

@nimble tangle

celest igloo Dec 23, 2024, 1:19 PM

#

plush sapphire For a couple of weeks now, it's taken >15 minutes (sometimes up to 20) to load a...

No, it's most likely on different servers, but interconnected within the same DC.

plush sapphire Dec 23, 2024, 1:20 PM

#

It's weird how I can download the same model from a VM in the Netherlands faster than from network storage in the same DC
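To put a number on that comparison, one option is to time a plain sequential read of a file that already sits on the network volume. The following is a minimal sketch, not anything Runpod provides: the function name `read_throughput_mib_s` and the example path are hypothetical, and a repeat run on the same file may be served from the page cache and look faster than the storage really is.

```python
import time


def read_throughput_mib_s(path, chunk_mib=8):
    """Read `path` sequentially and return the observed rate in MiB/s."""
    chunk = chunk_mib * 1024 * 1024
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:  # end of file
                break
            total += len(data)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / elapsed


# Hypothetical usage; replace the path with a real file on your volume:
# print(read_throughput_mib_s("/workspace/models/model.safetensors"))
```

Dividing the model size by this rate gives a rough expected load time, which can be compared directly with the download time from an external server.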

celest igloo Dec 23, 2024, 1:22 PM

#

Yeah, maybe the bandwidth for network storage is lower, or it's down to the architecture. I'm not sure, but staff can debug and give more information.

#

Guys, try opening a support ticket to report it too.

sand peak Jan 13, 2025, 5:35 AM

#

Same problem in EU-SE-1 today. Starting a new pod doesn't fix it (slightly faster, but the same instability), and moving files within the storage has the same problem:

rsync -ah --info=progress2 outputs/model-out-stage2/checkpoint-150/*.safetensors outputs/stage2_results/
          3.25G   3%    2.19MB/s   11:37:31
rsync: [receiver] write failed on "/workspace/data/outputs/stage2_results/model-00001-of-00021.safetensors": Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(380) [receiver=3.2.7]

I've already submitted a ticket for this.
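While the volume is flaky, a copy that stages to a temporary name and retries on I/O errors can at least avoid leaving half-written checkpoints behind. This is a minimal sketch under assumptions: `copy_with_retry`, the `.partial` suffix, and the retry counts are all hypothetical choices, and retrying does nothing about the underlying storage problem.

```python
import os
import shutil
import time


def copy_with_retry(src, dst_dir, attempts=3, delay_s=5.0):
    """Copy `src` into `dst_dir`, retrying on OSError (e.g. EIO).

    The file is written under a temporary ".partial" name and renamed
    only after the sizes match, so a failed attempt never leaves a
    truncated file at the final path.
    """
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    tmp = dst + ".partial"
    for attempt in range(1, attempts + 1):
        try:
            shutil.copyfile(src, tmp)
            if os.path.getsize(tmp) != os.path.getsize(src):
                raise OSError("size mismatch after copy")
            os.replace(tmp, dst)  # rename within the same directory
            return dst
        except OSError:
            if os.path.exists(tmp):
                os.remove(tmp)  # discard the partial file
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_s)
```

Calling it once per `*.safetensors` file gives roughly the same effect as the rsync command above, with the final error still raised if every attempt fails.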

snow briar Jan 13, 2025, 8:39 AM

#

Yep, the volume problems happened again today: models are too slow to load and log writes failed once again. I've created a ticket for them.

hybrid chasm Jan 13, 2025, 8:43 AM

#

I'm also using network storage in EU-SE-1. I tested the I/O speed with the dd command: on /tmp it's 4.6 GB/s, but on /workspace it's 5.1 MB/s.

dd if=/dev/zero of=/workspace/testfile bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 211.795 s, 5.1 MB/s

dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.232843 s, 4.6 GB/s
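For anyone who wants to repeat this comparison from inside a training script, the same idea can be sketched in Python. This is only an approximation of the dd test: it uses `fsync` rather than `oflag=direct`, so the absolute numbers will differ, and the function name `write_throughput_mib_s` and its defaults are hypothetical.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory, size_mib=64, block_mib=1):
    """Write `size_mib` MiB of zeros to a temp file in `directory`.

    Returns the observed rate in MiB/s, measured up to and including
    the fsync. The temp file is removed afterwards.
    """
    block = b"\0" * (block_mib * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory, prefix="iotest_")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mib // block_mib):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mib / elapsed


# Usage, assuming both directories exist on the pod:
# for d in ("/tmp", "/workspace"):
#     print(d, round(write_throughput_mib_s(d), 1), "MiB/s")
```

A gap of several orders of magnitude between the two directories, like the one in the dd output above, points at the network volume rather than the pod itself.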

sand peak Jan 13, 2025, 8:53 AM

#

Anything that involves network storage was congested; local machine storage is okay.

upper pagoda Jan 13, 2025, 9:39 AM

#

All 6 of our pods are down because of that. It looks like even SSH/JupyterLab becomes unresponsive for a while, and then it's up again. We will definitely need to move off the network disks. It's not the first time.

plush sapphire Jan 15, 2025, 8:35 AM

#

I gave up on network storage. My use case is better handled by the KoboldCpp template, which downloads the model each time - a 15-20 minute start-up process is now 5-8 minutes at most.

celest igloo Jan 15, 2025, 8:45 AM

#

Yeah, there were some incidents in EU-SE-1: https://discord.com/channels/912829806415085598/1185333521124962345

brittle sluice Jan 15, 2025, 9:08 AM

#

But is there an update on when they are fixing it?

celest igloo Jan 15, 2025, 9:08 AM

#

They post when an incident starts and when it's fixed, I believe.

brittle sluice Jan 15, 2025, 9:10 AM

#

Hm, okay, because we've been paying running costs for effectively no service for 2 days now.

celest igloo Jan 15, 2025, 9:10 AM

#

brittle sluice Hm, okay, because we've been paying running costs for effectively no service for 2 days now.

They can refund you if there's a service problem on Runpod's side. Just create a ticket.

plush sapphire Jan 15, 2025, 9:18 AM

#

I had the same issue for months, not just in SE, it's just a general problem with network storage, it's crap - thankfully I have a workaround for my use case

gray tide Jan 15, 2025, 9:28 AM

#

My network storage is in EU SE 1 and I am facing the same issue, the technical support wrote this to my ticket a few hours ago:
Thank you for reaching out! I’m very sorry that you’ve encountered this issue. It turns out the data center hosting your volume was experiencing network problems, which have been resolved as of yesterday evening/this morning.

I just tested it and unfortunately it is even worse than before... I will write them again

pseudo spruce Jan 15, 2025, 9:54 PM

#

Same, loading models in ComfyUI somehow takes longer than before lol, to the point that I wasn't even able to load it at all.

shadow gyro Jan 15, 2025, 10:42 PM

#

Same for me, my pod with storage takes around 45 minutes to load the ports. It's really annoying to pay 1 buck just for starting.

brittle sluice Jan 16, 2025, 7:43 AM

#

They told me it is fixed, but I wrote again that I need an hour to start ComfyUI and even longer to load models. They want my pod ID to check deeper, but nothing has happened since then.