#Billed for endpoint stuck in state: Service not ready yet. Retrying...


dawn scaffold
#

Hey there, we have a serverless endpoint that seems to be stuck in a state: Service not ready yet. Retrying...

It has been stuck in this state for about 18 hours now and it appears we're getting billed for this (already $22 for a day) even though no resources are being used. We can't get it out of this state and we don't want to put more money on our account until this is resolved.

Is there anything we can do to stop this? Is this a common problem? Or a rare glitch?

I submitted a support ticket online as well with further details about the specific endpoint ID

astral wyvern
#

It could be a rare glitch, but more likely there is a problem with the code in your serverless handler. Let's wait for staff to pick up the case.

#

You should be able to stop the running worker at any time from your endpoint page, and you can also check the logs of each running worker.

dawn scaffold
#

Thanks @astral wyvern, I was mistaken on a few things. The total billed time was 14 hr.

Looking through the logs, this was happening for 2 hours. But I think it was happening to multiple workers, so perhaps that is how it added up to a 14 hr charge.

Could it perhaps have to do with workers trying to connect to network storage and failing? And then we are getting billed for that?

astral wyvern
#

I see.
Yes, it could be. When did it happen?

#

I'm not really sure that's the cause, though; it's quite unlikely.

#

Also, which region is it in?

#

Also, has it ever worked before, or did you just deploy it?

dawn scaffold
#

Region: Oregon
Time: July 5, 11:51pm MT – July 6, 1:58am MT

#

We've had the same endpoint deployed for roughly a month or so now.. maybe 3 weeks? Been working great AFAIK. But I have seen these messages before.

astral wyvern
#

So it's working now?

#

I think "Service not ready yet. Retrying..." comes from your handler code.

#

Can you check what it does? Does it check some internal service that might not have booted up yet in your worker?

#

And no other logs?

dawn scaffold
#

Additional info if it helps.
You can see the total execution time is 164 s and the cold start time is 95 s, so the total is
(164 + 95) × 0.00044, which should be roughly $0.11.
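
As a rough check on the numbers in this thread, the sketch below applies the 0.00044 per-second rate quoted above to both the single request and the 14 billed hours. It assumes billing is simply billed seconds times a flat rate, which the thread does not confirm; `RATE_PER_SECOND` and `cost` are illustrative names.

```python
RATE_PER_SECOND = 0.00044  # USD per worker-second, the rate quoted in this thread


def cost(seconds: float, rate: float = RATE_PER_SECOND) -> float:
    """Assumed billing model: billed seconds multiplied by a flat per-second rate."""
    return seconds * rate


single_request = cost(164 + 95)  # execution time + cold start, in seconds
stuck_workers = cost(14 * 3600)  # 14 billed hours, summed over all workers

print(f"single request:  ${single_request:.3f}")
print(f"14 billed hours: ${stuck_workers:.2f}")
```

Under that assumption, 14 billed hours come to about $22, which is in line with the charge reported at the top of the thread, while a single normal request costs a little over 11 cents.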

astral wyvern
#

Yeah

dawn scaffold
#

So wait for service... not sure what this method does tbh
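
The handler's actual wait-for-service code is not shown in this thread, so the following is only a sketch of what a bounded version of such a loop might look like. `wait_for_service`, `is_ready`, and the timeout values are hypothetical names and defaults, not taken from the real handler or from the RunPod SDK.

```python
import time
from typing import Callable


def wait_for_service(
    is_ready: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
) -> None:
    """Poll is_ready() until it returns True, or raise after timeout_s seconds.

    is_ready is a caller-supplied check (for example, an HTTP request to the
    local web UI); it is hypothetical and not taken from the real handler.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_ready():
                return
        except Exception:
            # Treat any error raised by the check as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"service not ready after {timeout_s} s")
        print("Service not ready yet. Retrying...")
        time.sleep(interval_s)
```

The point of the deadline is that a worker whose internal service never comes up fails loudly instead of retrying forever. Whether a failed worker stops accruing charges is something to confirm with RunPod support.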

astral wyvern
#

Can you launch a pod using that network volume?

#

Use the RunPod PyTorch template.

#

Then run this command in the terminal from Jupyter:
cat /workspace/logs/webui.log
and copy the output here.

dawn scaffold
#

Ah does it have to be through jupyter? Can I ssh?

#

I just started the pod without Jupyter, hah

#

sqlite3.DatabaseError: database disk image is malformed

#

Hmm

astral wyvern
#

Can you check the usage of your network volume in your pods page?

#

What percentage is it?

dawn scaffold
#

77% of the volume, 0% of container

astral wyvern
#

sqlite3.DatabaseError: database disk image is malformed

It means your SQLite DB is corrupted. I don't know what could have caused it.
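
One way to confirm which file is damaged is to ask SQLite itself. The sketch below opens a database read-only and runs `PRAGMA integrity_check` through Python's standard `sqlite3` module. The function name `sqlite_is_healthy` is illustrative, and the thread does not say where the database in question lives, so the path in the usage comment is a placeholder.

```python
import sqlite3
from pathlib import Path


def sqlite_is_healthy(path: str) -> bool:
    """Return True if SQLite's own integrity check reports the file as OK."""
    # Open read-only so the check cannot create or modify the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.Error:
        # Covers errors such as "database disk image is malformed",
        # "disk I/O error", or a file that cannot be opened at all.
        return False
    return rows == [("ok",)]


# Hypothetical usage; replace the placeholder with the real database path:
# print(sqlite_is_healthy("/workspace/path/to/cache.db"))
```

A `False` result only says the file failed the check or could not be read; it does not say why, so a full disk or a flaky network mount would look the same as real corruption.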

dawn scaffold
#

Yeah, it might be the sqlite db for A1111, might just delete it and try relaunching

astral wyvern
#

alright

dawn scaffold
#

So I can relaunch A1111 no problem, no db issues. Hmm

astral wyvern
#

After deleting the db file?

dawn scaffold
#

No just running from the Pod.
I did delete the cache db anyway. I'm not sure that was the issue though because it continued to execute on the inferences. Very strange.

astral wyvern
#

ah

dawn scaffold
#

So perhaps it's possible that the network volume is taking a long time to attach, it got stuck somehow and the endpoint is billing right away.

But we had set Execution timeout to 600 seconds (default). So 10 minutes max. I would think that would kill the worker. But this went on for 2 hours

#

Not sure where you are @astral wyvern but it is getting late here. I'm really grateful for all your support! I will update the zendesk ticket to fill them in about what we learned here, and maybe that will help us understand what happened to the billing.

#

I think we'll try to move to direct storage vs network storage moving forward.

Thanks again @astral wyvern !

astral wyvern
#

Oh alright

#

But direct storage on serverless is non-persistent.

#

You're welcome!

dawn scaffold
#

@astral wyvern I can start another thread about this, but I understand that the main container disk is non-persistent. However, I assumed that with templates we can specify a volume disk which is persistent? In that case, we could have our Dockerfile (or start.sh?) load and configure A1111 + models onto the volume disk if they don't already exist, and then reuse them between executions?

Anyway, sleep for me! hah

Thanks again

astral wyvern
#

Sure, yes: put your files in your image and they're there all the time.
You should access the files directly from your Docker image rather than moving them to the container disk (it'll be more efficient).

dawn scaffold
#

@astral wyvern btw this is happening again. Nothing is getting picked up from the queue

astral wyvern
#

wew

#

Do you use any extensions?

dawn scaffold
#

It's all the stuff from Ashley's image, including ControlNet, ADetailer, etc.

#

But honestly it just seems like everything is moving very slowly on RunPod right now. I haven't changed anything: same configuration and same extensions as I've had for the past month. Just the past few days have been really glitchy.

#

Will check logs now one sec

#

webui.log is empty

dawn scaffold
#

Ohhhh, geez. I'm sorry. I am out of network volume space. That must be the issue

astral wyvern
#

Ahh, I see. What were you doing that made you run out of space?

dawn scaffold
#

Well, it's strange because I had 77% usage of a 65GB network volume. So roughly 15GB free right?

I just downloaded a checkpoint ~7GB.

And suddenly now it's all used up.
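
The arithmetic here, plus a guard that could run before a large download, can be sketched as follows. `has_room` is an illustrative helper built on the standard `shutil.disk_usage` call. It is an assumption that the free space reported for a network mount reflects the volume's quota rather than the underlying filesystem, so treat the result as a hint.

```python
import shutil

GB = 10**9


def has_room(path: str, needed_bytes: int, reserve_bytes: int = 0) -> bool:
    """Return True if `path` has at least needed_bytes + reserve_bytes free."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes + reserve_bytes


# Expected headroom from the figures in this thread: a 65 GB volume at 77% used.
expected_free_gb = 65 * (1 - 0.77)           # about 15 GB
after_checkpoint_gb = expected_free_gb - 7   # about 8 GB should have remained

print(f"expected free: {expected_free_gb:.2f} GB")
print(f"after a 7 GB checkpoint: {after_checkpoint_gb:.2f} GB")

# Hypothetical usage before downloading a ~7 GB checkpoint:
# if not has_room("/workspace", 7 * GB, reserve_bytes=2 * GB):
#     raise SystemExit("not enough free space on the network volume")
```

By these numbers the checkpoint alone should not have filled the volume, which is consistent with something else having consumed the remaining space.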

astral wyvern
#

Hmm, any failed?

#

You can check your usage; try searching for a Linux command that reports folder sizes.
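
If a shell tool is not handy, a folder-size check like this can also be sketched in Python with only the standard library. The function names are illustrative. The sketch sums apparent file sizes, so its totals can differ somewhat from what a block-based tool reports.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum apparent file sizes under root; symlinks are counted, not followed."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
    return total


def largest_subdirs(root: str, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` largest immediate subdirectories of root as (name, bytes)."""
    sizes = [
        (entry.name, dir_size_bytes(entry.path))
        for entry in os.scandir(root)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


# Example usage on the network volume mount point mentioned in this thread:
# for name, size in largest_subdirs("/workspace"):
#     print(f"{size / 1e9:6.2f} GB  {name}")
```

Running it once at the volume root and then again inside the largest entry is usually enough to find what is eating the space.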

dawn scaffold
#

Yeah I did, somehow venv is using 14G

astral wyvern
#

So it's your packages. Yeah, they can use a bunch of space, hahaha.

dawn scaffold
#

I didn't install any new packages, just a new model ~7G. Anyway, yeah something filled that space for sure. I will need to investigate.

I think I'd really like to find a way to stop using network storage, and have a single model per endpoint or something. Do you know what most people do?

astral wyvern
#

Like, you put the model, files, everything in the container. It will slow down your start times if your image is bulky, though.

dawn scaffold
#

If it needs to then run from network volume anyway, wouldn't it also be the same speed, even slower?

#

Ah, it's happening again! It seems only one or two workers get stuck like this. Very strange.

dawn scaffold
#

Yup just creating a pod again, hah

dawn scaffold
#

89% still (I deleted an unused model to free up some space) and it hasn't changed since

#

Super weird, the cat /workspace/logs/webui.log keeps changing every time I run it

#

Like drastically different

astral wyvern
#

Oh yeah, I think it writes a new file on every worker run (?), not sure.

dawn scaffold
#

Got a live one.

astral wyvern
#

sqlite3.OperationalError: disk I/O error
Seems like this is because it ran out of space?

#

huh nice

dawn scaffold
#

The sqlite dbs are only 8MB

#

AFAIK, it's just the stable diffusion webui cache files

#

It's strange because we'll run all 10 workers, and it will chew through the queue, but one will get stuck in this state.

astral wyvern
#

Huh

#

Maybe because the workers are conflicting 😂

#

I'm not sure what's causing this.

main oar
#

If you want a closer look at your network volume, run this pod (it gives you a web file explorer view):

https://runpod.io/console/deploy?template=lkpjizsb08&ref=a57rehc6
Mount the network volume you want to work with when you deploy this template. It should be mounted at /workspace. By default the username and password are both "admin".

hot owl
#

I didn't touch anything.

#

Why am I getting this issue now?

#

This issue happens with all my storage volumes.

#

It's definitely an issue with RunPod or the template itself.

astral wyvern
#

Oh it causes some kind of sqlite error?

weary terrace
#

Seems to be the case, but I am not sure.

#

I think the sqlite DB becomes corrupted when upgrading because the structure changes, but that is just an assumption, someone will need to test it to confirm.

astral wyvern
#

Ye might be

weary terrace
# hot owl Hi, did you fix this issue ? I have the same problem

This issue is due to corrupt files within the venv. It seems to happen when you use more than 1 template for A1111 on the same network storage. It seems that you can fix it as follows:

Step 1: Activate venv

Step 2: Reinstall torch modules and clear __pycache__ files

pip3 install -U --force-reinstall torch==2.4.0+cu121 xformers==0.0.27.post2 torchvision torchaudio --index-url=https://download.pytorch.org/whl/cu121
find . -type d -name __pycache__ -prune -exec rm -rf {} +
hot owl
#

Thanks a lot Marcus