#Issues in SE region causing a massive amount of jobs to be retried
Obviously I am referring to the "Connection timeout" errors, which cause the job results to fail to be returned, and not the single exception among them.
@vapid vigil DO YOU MIND SUBMITTING A TICKET ON THE WEBSITE? IT'S EASIER TO ESCALATE
No need to shout but sure 😁
oops, sorry for the caps
Ticket number is 4208
done
Thank you
my jobs work well btw
You probably didn't try and send 1000 jobs today
Yes yes
I said 10% are retried, NOT ALL 🤦‍♂️
I'm using dev on SE
Ooh so 10% expected to fail?
They are retried, they don't fail
well, good luck with your problem
RunPod needs to check it out, I switched to CA in the meantime and it works fine without any issues.
I was using CA but then switched to SE because my jobs were failing, but the failures were actually caused by my own Redis server running out of memory (OOM), not by RunPod.
So I upgraded my ElastiCache instance on AWS from cache.t3.medium to cache.m4.large and now it's fine.
Because it's a cluster, not a single instance