#Job timeout constantly (bug?)


lyric reef
#

I'm getting the job timeout error constantly on every worker, after a random amount of time. I've checked the logs and there is no error; the pod is just killed for no reason, even though no timeout is set on the serverless endpoint (I watched it live). It seems totally bugged.

The software is the same, nothing has been changed, and I'm getting this issue all the time, whether I use 16 GB or 48 GB.

#

Also, the GPU memory isn't reaching the limit; it just stops for no reason.

granite sonnet
#

That means the job is getting lost: the worker picks up the job but then stops reporting status on the job it's working on. Can you make sure you're using the latest SDK?

lyric reef
#

2024-10-03T11:16:25.150260367Z 30%
2024-10-03T11:16:25.168653046Z 30%
2024-10-03T11:16:25.186904329Z 30%
2024-10-03T11:16:25.205032555Z 30%
2024-10-03T11:16:25.224321673Z 30%
2024-10-03T11:16:27.099648500Z {"logger": "cog.server.http", "timestamp": "2024-10-03T11:16:27.098688Z", "severity": "INFO", "message": "stopping server"}
2024-10-03T11:16:27.109831864Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.109335Z", "severity": "INFO", "message": "Shutting down"}
2024-10-03T11:16:27.211000754Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.210347Z", "severity": "INFO", "message": "Waiting for application shutdown."}
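
The gap between the last progress line and the shutdown message can be measured straight from a log like the one above. This is a minimal sketch, assuming every line starts with a UTC timestamp, then one space, then the payload, as in this excerpt. The function names `parse_ts` and `seconds_before_shutdown` are made up for illustration and are not part of any SDK.

```python
import json
from datetime import datetime, timezone


def parse_ts(stamp: str) -> datetime:
    """Parse a stamp like 2024-10-03T11:16:25.224321673Z, truncating nanoseconds."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    micros = (frac + "000000")[:6]  # pad or truncate the fraction to microseconds
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_before_shutdown(lines):
    """Seconds between the last plain (non-JSON) line and the first
    'stopping server' record, or None if either one is missing."""
    last_progress = None
    for line in lines:
        stamp, _, payload = line.strip().partition(" ")
        if not payload:
            continue  # blank or malformed line
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            record = None
        if isinstance(record, dict):
            if record.get("message") == "stopping server":
                if last_progress is None:
                    return None
                return (parse_ts(stamp) - last_progress).total_seconds()
        else:
            # Anything that is not a JSON object is treated as progress output.
            last_progress = parse_ts(stamp)
    return None
```

On the excerpt above this gives roughly 1.88 seconds between the last `30%` line and the `stopping server` message, which fits a worker that was still making progress when it was told to shut down.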

#

I tested an image from 5 months ago and it seems to fix the issue, so it looks like a library issue.

#

I'm still using the same libraries, so it must come from an update. I'm trying a downgrade of the cog SDK and the runpod SDK.

lyric reef
#

I fixed the issue by downgrading the runpod SDK to version 1.6.0; now it's working fine.
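
To confirm which SDK version actually ends up inside a worker image after a downgrade like this, a small check at startup can help. This is a sketch using only the standard library, assuming the SDK is installed under the PyPI distribution name `runpod`. The helper names `installed_version` and `matches_pin` are hypothetical, and the `1.6.0` pin is simply the version reported as working in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed, pinned: str) -> bool:
    """True only when the installed version string equals the pinned one exactly."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_version("runpod")  # assumed distribution name
    print(f"runpod installed: {found}, matches 1.6.0 pin: {matches_pin(found, '1.6.0')}")
```

Pinning the dependency in the image's requirements, for example `runpod==1.6.0`, is one way to keep a rebuild from silently pulling a newer release.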

#

It seems the latest version, or the latest few versions, have a timeout bug.

granite sonnet
#

we found the bug, fix in progress

near mason
#

I'm encountering the same issue with version 1.7.3 @tranquil goblet

turbid ice
#

Same issue with 1.7.3

tranquil goblet
#

Hi. Please file a support ticket and mention this thread so that you can share more info that would help us determine what's going on and how to fix it. Feel free to mention me on your tickets. Thank you.

rare bloom
#

I just discovered that if the idle timeout setting is set too long and your job also takes a long time to finish, it might cause the job to retry. I’m still testing this and will share more info soon. For now, try setting the idle timeout to less than 20 seconds and see if that helps.

near mason
# tranquil goblet Hi. Please file a support ticket and mention this thread so that you can share m...

I reported the issue but quickly resolved it by downgrading the SDK. However, I recall experiencing other strange behaviors, such as two workers starting up simultaneously for a single request. Additionally, there were instances where a worker would start even though the Docker image was still downloading, which incurred costs even after I canceled the request. I had to manually terminate the worker in those cases. Maybe these issues are related.

near mason
#

By the way, this only happened with longer-running requests, where the timeout occurred after around 100-200 seconds, I think. Shorter-running jobs completed without any issues.
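
One way to check whether failures really cluster around a certain run time is to log how long each job runs inside the worker. Below is a minimal, SDK-independent sketch. The decorator `log_duration` and the `handler` function are hypothetical stand-ins, not part of the runpod SDK, and the job shape (a dict with an `input` key) is an assumption.

```python
import functools
import time


def log_duration(func):
    """Print a start line and an elapsed-time line around every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        print(f"{func.__name__} started", flush=True)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions, but not if the
            # process is killed outright.
            elapsed = time.monotonic() - start
            print(f"{func.__name__} finished after {elapsed:.1f}s", flush=True)
    return wrapper


@log_duration
def handler(job):
    # Hypothetical handler: a real one would run the model here.
    return {"echo": job.get("input")}
```

If the worker is killed abruptly, the "finished" line never appears, so a "started" line with no matching "finished" line, together with the log timestamps, gives a lower bound on how long the job ran before it was stopped.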
