#All workers in CA region went to initialising and all my jobs started failing

oblique yoke
#

2 different endpoints

sacred belfry
#

I think I am getting the same thing

#

@oblique yoke solidarity with ya lol

oblique yoke
#

Endpoint ids:

  • 5y6svi6m3g5tk3
  • oic105cyzlovnk
#

2 different GPU tiers as well: one is 24GB and the other is 48GB.

sacred belfry
#

I just got 1 running now

#

and an error 🤨

2024-04-22T14:59:15.661713428Z engine.py :105 2024-04-22 14:59:15,660 Error initializing vLLM engine: [Errno -3] Temporary failure in name resolution
2024-04-22T14:59:15.661755703Z Traceback (most recent call last):
2024-04-22T14:59:15.661761523Z File "/vllm-installation/vllm/utils.py", line 176, in get_ip
2024-04-22T14:59:15.662006038Z s.connect(("dns.google", 80)) # Doesn't need to be reachable
2024-04-22T14:59:15.662025868Z socket.gaierror: [Errno -3] Temporary failure in name resolution

oblique yoke
#

Looks like some networking issue if it can't resolve DNS
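
For anyone wanting to confirm this from inside a worker, here is a minimal Python sketch, not vLLM's own code. The traceback above fails because connecting to the hostname "dns.google" needs a DNS lookup first. The helper names `can_resolve` and `local_ip_without_dns` and the probe address 8.8.8.8 are my own assumptions for illustration; the second helper shows that the same UDP trick works with an IP literal, which skips the resolver entirely.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver can look up `hostname`."""
    try:
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        # Same exception class as in the traceback (e.g. Errno -3).
        return False


def local_ip_without_dns(probe_ip: str = "8.8.8.8") -> str:
    """Find the outbound interface address without any DNS lookup.

    connect() on a UDP socket sends no packets; it only asks the
    kernel to pick a route, so the probe address (an assumption here)
    does not need to be reachable.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            s.connect((probe_ip, 80))
            return s.getsockname()[0]
        except OSError:
            # No route at all (e.g. networking fully down): fall back.
            return "127.0.0.1"


if __name__ == "__main__":
    print("dns.google resolves:", can_resolve("dns.google"))
    print("local IP:", local_ip_without_dns())
```

If `can_resolve("dns.google")` prints False while `local_ip_without_dns()` still returns a real address, that would point at the resolver rather than the network link itself.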

sacred belfry
#

they said in #📢|announcements they were doing some kind of networking change last week. maybe related?

oblique yoke
#

I don't see anything in #📢|announcements, only something in #🚨|incidents for US-OR-1

sacred belfry
#

Sorry, that's what I meant.

oblique yoke
#

Nah, that's US and this is CA, so it's different.

keen vault
#

Appears to be an outage; I would expect a post in announcements soon.

oblique yoke
#

All my workers have also recovered, but it would be nice to know what the issue was that took them all out.

sacred belfry
#

Same here, I'm back online now.

tardy abyss
#

Mine works

oblique yoke
tardy abyss
#

Ooh