#All workers in CA region went to initialising and all my jobs started failing
Endpoint ids:
- 5y6svi6m3g5tk3
- oic105cyzlovnk
2 different GPU tiers as well: one is 24GB and the other is 48GB.
I just got 1 running now
and an error 🤨
2024-04-22T14:59:15.661713428Z engine.py :105 2024-04-22 14:59:15,660 Error initializing vLLM engine: [Errno -3] Temporary failure in name resolution
2024-04-22T14:59:15.661755703Z Traceback (most recent call last):
2024-04-22T14:59:15.661761523Z File "/vllm-installation/vllm/utils.py", line 176, in get_ip
2024-04-22T14:59:15.662006038Z s.connect(("dns.google", 80)) # Doesn't need to be reachable
2024-04-22T14:59:15.662025868Z socket.gaierror: [Errno -3] Temporary failure in name resolution
Looks like some networking issue if it can't resolve DNS
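For anyone hitting the same traceback, here is a minimal sketch of a DNS check you could run inside a worker to confirm that name resolution is what's failing. It only uses Python's standard `socket` module. The function name `check_dns` and its `resolve` parameter are made up for illustration (they are not part of vLLM or the platform); `dns.google` is taken from the traceback above.

```python
import socket


def check_dns(hostname="dns.google", resolve=socket.getaddrinfo):
    """Try to resolve `hostname` and report the outcome.

    Returns (ok, detail). `resolve` is a hypothetical injection point so the
    check can be exercised without real network access.
    """
    try:
        results = resolve(hostname, 80)
    except socket.gaierror as exc:
        # Same exception type as in the traceback above ([Errno -3] ...).
        return False, f"{hostname}: {exc}"
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved address.
    addresses = sorted({info[4][0] for info in results})
    return True, f"{hostname} -> {', '.join(addresses)}"
```

Calling `check_dns()` with no arguments does a real lookup. If it returns `False` with the same `Temporary failure in name resolution` message, the problem is in the worker's network or resolver rather than in the model or the vLLM setup.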
They said in #📢|announcements they were doing some kind of networking change last week. Maybe related?
I don't see anything in #📢|announcements, only something in #🚨|incidents for US-OR-1.
Sorry, that's what I meant.
Nah, that's US and this is CA, so it's different.
Appears to be an outage. I would expect a post in announcements soon.
All my workers have also recovered but would be nice to know what the issue was that took them all out.
Same here, I'm back online now.
Mine works
Yes, it's resolved now, you are late to the party.
Ooh