#discordjs/ws big bot memes (old)


stable hatch
#

@dusty dove

#

discordjs/ws issue

dim oracle
#

@sullen snow for any follow up questions

stable hatch
#

Also was the version not injected??

dim oracle
#

doesn't seem like it

stable hatch
#

Tfffff

#

Pain

sullen snow
#

what I did is mostly just inject erlpack, since d.js doesn't support it yet, so the connection flow was untouched on my version of the discordjs/ws port to v14

stable hatch
#

Wait

#

This is injected ws into djs?

sullen snow
#

yes I have my own impl.

#

I wanted to port it myself but someone already did it so I didnt

dusty dove
#

i cant debug off that lol

#

i need an actual repro sample

#

either way that looks like its heartbeating a closed connection

#

which doesnt sound all that related to the bubbling pr compared to some other stuff i changed

stable hatch
#

Tbf it looks like a long lasting conn issue that you can't easily repro in quick prs

sullen snow
#

I feel like this could be caused by a missed interval clean

#

though I'm not sure how it happens

#

cause seems like it doesn't happen on all the bots I did

dusty dove
#

doesnt help you dont run with --enable-source-maps

#

so i have no idea what index.js line 795 is

sullen snow
#

oh that is disabled by default?

dusty dove
#

i feel like this could be tied to the jitter pr somehow

#

yes

stable hatch
#

I don't remember

dusty dove
#

yea

#

i need to know what 795 is

#

its either in the interval

#

or its the awaited send call

sullen snow
#

also on prior discordjs versions

#

like my prod rn

#

is running on 0.6.x

#

has all the shards intact

#

so it could be isolated as 0.7.x issue

dusty dove
#

nvm its def interval now that i paid more attention

dusty dove
#

are you running into this consistently

dim oracle
#

795 - 798

    await this.send({
      op: import_v102.GatewayOpcodes.Heartbeat,
      d: this.session?.sequence ?? null
    });
dusty dove
#

or was this a one off earlier

sullen snow
#

right now yes

dusty dove
#

interesting

dim oracle
#

second time it has happened now

dusty dove
#

is there anything that seems to lead there

#

like a resume or reconnect

dim oracle
#

sec

stable hatch
#

You'd need debug logs for that

dusty dove
#

because the interval is cleared in destroy()

#

which should be called the moment we get a close event

#

or a payload telling us to resume

dim oracle
#

Exported the logs to a file so ignore the markdown but we get a few reconnects/resumes before (Ignore the no close code, that's on us)

INFO [Wed,03/15/23,12:04:47] (Cluster Process [ID: 27]): [EventHandler]: Shard Reconnecting => Shard: 888 | Close Code: none
INFO [Wed,03/15/23,12:04:47] (Cluster Process [ID: 27]): [EventHandler]: Shard Resumed => Shard: 888 => Replayed Events: 1
INFO [Wed,03/15/23,12:04:47] (Cluster Process [ID: 28]): [EventHandler]: Shard Reconnecting => Shard: 914 | Close Code: none
ERROR [Wed,03/15/23,12:04:47] (Cluster Process [ID: 28]): WebSocket is not open: readyState 0 (CONNECTING)
    err: {
      "type": "Error",
      "message": "WebSocket is not open: readyState 0 (CONNECTING)",
      "stack":
          Error: WebSocket is not open: readyState 0 (CONNECTING)
              at WebSocket.send (/main/node_modules/ws/lib/websocket.js:442:13)
              at WebsocketShard.send (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:187:25)
              at async WebsocketShard.heartbeat (/main/node_modules/@discordjs/ws/dist/index.js:795:5)
    }
dusty dove
#

lmao nvm

#

i think i found it

#

nop just chrome search being busted

sullen snow
#

we could enable debug logs and filter things out but that would require us some time

dim oracle
#

Best case would be like a week or two lol, so not really an option I think

sullen snow
#

@dusty dove also could I request a way to inject a custom identify manager, especially for multi-process bots, so we could implement our own identify throttler? also pass the shardId of the shard that requests the identify

dusty dove
#

the strategy handles identify throttling

#

so u can just write your own

#

the identifythrottler class is just used for our built in strategies

sullen snow
#

yes but the websocket shard dont pass the shardId

#

I would appreciate if it would also pass the shardid

#

since its needed to calculate for buckets

dusty dove
#

wut

#

where

sullen snow
#

a sec

#

this one

dusty dove
#

so why do you need the shardId there

sullen snow
#

like just send the shardId as well on the payload

#

of the shard that asks for identify

#

because on big bots

#

its needed to properly calculate buckets of shard to identify

stable hatch
#

If this is about max concurrency, that's not how it works

#

You don't really need shard id to calculate it

dusty dove
#

Thonk yeah you dont need it per api docs afaik

sullen snow
#

are you sure? seems like all the devs I asked that has big bots calculates it using the shard id and concurrency

stable hatch
#

That's bc they don't realize buckets are already just n shards based on max concurrency

#

If you need 64 shards and max concurrency 16, it's shard 0-15, 16-31, 32-47, etc

sullen snow
#

const bucket = shardId % concurrency; like I calculate if the shards are able to login by using this formula

#

and if the redis lock is on this bucket it will let the shard login
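
A minimal sketch of the formula quoted above, assuming nothing beyond `bucket = shardId % concurrency`. The helper names `identifyBucket` and `groupByBucket` are hypothetical, and the Redis lock is left out entirely. It only shows which shards end up sharing a bucket.

```typescript
// Hypothetical helper: the bucket a shard falls into under the modulo formula.
function identifyBucket(shardId: number, maxConcurrency: number): number {
  return shardId % maxConcurrency;
}

// Hypothetical helper: lists the shard ids that share each bucket.
function groupByBucket(shardCount: number, maxConcurrency: number): Map<number, number[]> {
  const buckets = new Map<number, number[]>();
  for (let shardId = 0; shardId < shardCount; shardId++) {
    const key = identifyBucket(shardId, maxConcurrency);
    const members = buckets.get(key) ?? [];
    members.push(shardId);
    buckets.set(key, members);
  }
  return buckets;
}

// With 64 shards and a concurrency of 16, bucket 0 holds shards 0, 16, 32 and 48.
const example = groupByBucket(64, 16);
console.log(example.get(0));
```

Under this formula a sequential batch such as shards 0-15 contains exactly one shard from every bucket, so identifying in sequential batches and identifying per bucket pick compatible groups of shards.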

stable hatch
#

We can't easily test max concurrency

sullen snow
#

yes im not asking for d.js to implement the max concurrency, but rather just pass the shardId for this to be possible

stable hatch
#

So for all intents and purposes you could be right, but the docs example just shows batches of max_concurrency identifies

dusty dove
#

unless this is some big bot thing only and its secret and not in the api docs no, thats not the case

#

theres no buckets, it just says you can identify max_concurrency shards per 5 seconds

#

it doesnt matter what id they have

stable hatch
dusty dove
#

well idk

stable hatch
#

I doubt you can identify the same shard for n amount of times

dusty dove
#

the idea pisses me off

#

that theyd not put it in the docs

#

for lib devs to actually handle

#

so i almost dont want to in protest

#

but anyway ill look into it

sullen snow
#

thanks also let me know if what we are doing is wrong or right

#

but it was stable for like

#

months now

dim oracle
#

5s after shard 0 identifies you can do 16

sullen snow
#

we dont have any reidents

stable hatch
#

There's definitely a bug in 0.7

#

From what you said

dim oracle
#

you need 5s between identifies for shards in the same bucket, where bucket = shard_id % max_concurrency

#

other than that you can do whatever order you want
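
The rule as stated here (5s between identifies within a bucket, any order otherwise) could be sketched as a small in-memory throttle. This is an illustration under that assumption only: `BucketThrottle` and `reserve` are hypothetical names, not part of @discordjs/ws, and a multi-process bot would need a shared store in place of the local `Map`.

```typescript
const IDENTIFY_SPACING_MS = 5000;

// Hypothetical in-process throttle keyed by shardId % maxConcurrency.
class BucketThrottle {
  private readonly nextFree = new Map<number, number>();
  private readonly maxConcurrency: number;

  constructor(maxConcurrency: number) {
    this.maxConcurrency = maxConcurrency;
  }

  // Reserves the next identify slot for the shard's bucket and returns
  // how many milliseconds the shard should wait, measured from `now`.
  reserve(shardId: number, now: number): number {
    const bucket = shardId % this.maxConcurrency;
    const startAt = Math.max(now, this.nextFree.get(bucket) ?? 0);
    this.nextFree.set(bucket, startAt + IDENTIFY_SPACING_MS);
    return startAt - now;
  }
}

const throttle = new BucketThrottle(16);
console.log(throttle.reserve(0, 1000));  // bucket 0 is free
console.log(throttle.reserve(16, 1000)); // same bucket, has to wait
console.log(throttle.reserve(1, 1000));  // different bucket, no wait
```

Passing `now` in explicitly keeps the sketch deterministic; a real caller would pass `Date.now()`.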

stable hatch
#

Thats unrelated to max concurrency

sullen snow
#

yes I thought I just want to bring that up

#

since we are already talking about the ws

stable hatch
#

I'll poke some people about it later

sullen snow
#

thank you

#

cause personally myself

#

concurrency is also something I'm not sure of

#

if its just idc just login the shards as long as the max concurrency allows

dim oracle
#

also, kinda weird they don't allow you to have 16x to test stuff

stable hatch
#

But even the docs show that its batches of max-concurrency shards that are sequential

sullen snow
#

or there is buckets

#

the buckets implementation is stable so far but then again without knowing what they mean about it I'm unsure as well

dusty dove
#

re the heartbeats

#

looks like it can happen after reconnects

#

will see whats up

stable hatch
#

Ty

#

I'll look into version injection too

#

Seems like it broke

dusty dove
#

is that broken for every pkg

stable hatch
#

Possible

#

But idk

#

I'll need to check

#

When home

sullen snow
#

let us know if we could help somehow it would depend on my timezone though

dim oracle
dusty dove
#

no we dont rly have comms

#

well

#

vlad does and a few other ppl on the team

#

but lib devs r treated like nobodies mostly lol

dim oracle
#

same in the bot space unless you're 10m+

sullen snow
#

I wish I have 16x cause its a pain to login 112 shards with 1x concurrency

dim oracle
#

for what it's worth, if you need to test something I'm down to run it on my token

#

finally, next time we have an issue do I just DM you again Vladdy?

stable hatch
dim oracle
#

Sure that works

#

Thank you both for the quick replies ^^

dusty dove
#

so this is you hacking it into discord.js right

#

that makes it a bit hard to track which of those events come from /ws and what is patched up by you

#

because i do not have a reconnect event

#

yeah sorry I can't get too far on this without you listening to my debug event

dim oracle
#

iirc

#

we don't use it for anything besides logging/graphing @dusty dove

#

Hopefully this can be done without the debug listener, I'd rather not restart prod

dusty dove
#

unless i can repro it with a super clean minimal sample to get debug logs myself, probably not

dim oracle
#

Not sure if that would help

#

Saya could probably answer most of your questions, but it's midnight for him

#

cc @sullen snow

dim oracle
#

@dusty dove alright so it is happening quite often, with a few occurrences the past hour, here's what I've found so far

Here are two examples, one per affected shard. For the first (shard 394), we try to reconnect and hit the error within the same second. For the second (shard 866), we try to reconnect and hit the error 3 seconds later.

[Fri,03/17/23,14:54:02] [EventHandler]: Shard Reconnecting => Shard: 394 | Close Code: none
[Fri,03/17/23,14:54:02] [ERROR] WebSocket is not open: readyState 0 (CONNECTING)
    err: {
      "type": "Error",
      "message": "WebSocket is not open: readyState 0 (CONNECTING)",
      "stack":
          Error: WebSocket is not open: readyState 0 (CONNECTING)
              at WebSocket.send (/main/node_modules/ws/lib/websocket.js:442:13)
              at WebsocketShard.send (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:187:25)
              at async WebsocketShard.heartbeat (/main/node_modules/@discordjs/ws/dist/index.js:795:5)
    }

[Fri,03/17/23,14:54:02] WebSocket is not open: readyState 0 (CONNECTING)
    err: {
      "type": "Error",
      "message": "WebSocket is not open: readyState 0 (CONNECTING)",
      "stack":
          Error: WebSocket is not open: readyState 0 (CONNECTING)
              at WebSocket.send (/main/node_modules/ws/lib/websocket.js:442:13)
              at WebsocketShard.send (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:187:25)
              at async WebsocketShard.heartbeat (/main/node_modules/@discordjs/ws/dist/index.js:795:5)
    }
[Fri,03/17/23,20:17:18] [EventHandler]: Shard Reconnecting => Shard: 866 | Close Code: none
[Fri,03/17/23,20:17:21] WebSocket is not open: readyState 0 (CONNECTING)
    err: {
      "type": "Error",
      "message": "WebSocket is not open: readyState 0 (CONNECTING)",
      "stack":
          Error: WebSocket is not open: readyState 0 (CONNECTING)
              at WebSocket.send (/main/node_modules/ws/lib/websocket.js:442:13)
              at WebsocketShard.send (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:187:25)
              at async WebsocketShard.heartbeat (/main/node_modules/@discordjs/ws/dist/index.js:795:5)
    }
[Fri,03/17/23,20:17:21] WebSocket is not open: readyState 0 (CONNECTING)
    err: {
      "type": "Error",
      "message": "WebSocket is not open: readyState 0 (CONNECTING)",
      "stack":
          Error: WebSocket is not open: readyState 0 (CONNECTING)
              at WebSocket.send (/main/node_modules/ws/lib/websocket.js:442:13)
              at WebsocketShard.send (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:187:25)
              at async WebsocketShard.heartbeat (/main/node_modules/@discordjs/ws/dist/index.js:795:5)
    }

The error originates from the line below. The readyState of CONNECTING suggests we're trying to send a heartbeat while the shard is still connecting (not sure if this could cause issues). But we get the error twice per shard Thonkang

// djs/ws index.js line 795
    await this.send({
      op: import_v102.GatewayOpcodes.Heartbeat,
      d: this.session?.sequence ?? null
    });

Then finally it gets moved to Vanguard's send function https://github.com/Deivu/Vanguard/blob/master/src/ws/WebsocketShard.ts#L157

#

I do want to move this to high priority since it's happening often, without enabling debug on prod (for now), how can I enable you to get debug logs on your own? I'm really out of touch with Djs so I'd need you to point me to the right parts.

If needed I can even stream and we can go through this in vc or something

sullen snow
#

i didnt really hack anything, i just included erlpack on the ws build, and didnt touch any connection flow of ws package

#

i left that as is

#

i tried to mimic the old ws manager, so its mostly just discord.js package who is modified to fit into ws, not ws modified to fit into discord.js

dusty dove
#

ive heavily changed it in ws because of certain bugs

#

nevermind, yours is up to date

#

sigh

#

ill see what i can do today

sullen snow
#

@dusty dove is it possible to lock into a specific commit in the current repo setup?

dusty dove
#

also, if you're around

#

how often were you saying you run into this

#

since I think I figured it out

sullen snow
#

very frequent

#

based on the latest news I know

#

50 shards are down in less than a day

dusty dove
#

mmm

#

no but im asking more like

#

time frequency on a per-shard basis

#

like this issue will hit all shards regardless

#

does it take a while before it happens the first time on a given shard

#

and then it keeps happening?

#

because if so I think I have it

sullen snow
#

once the shard goes in this state

#

you cant revive it

dusty dove
#

no idea then lol

#

I'd probably figure it out instantly with debug logs but /shrug

#

I can't quite get it into your state locally

#

I just see something wrong with the code that vaguely resembles your issue

sullen snow
#

what do you think it is?

#

so I have an idea and see

#

actually I have the commit that is known to be the last stable

#

if you want I could give you the files

#

and try to compare it to latest master

#

"version": "0.6.1-dev.1675904160-0e4224b.0", this is the current version on my prod bot and it doesnt crash

dusty dove
#

```ts
case GatewayOpcodes.Hello: {
    this.emit(WebSocketShardEvents.Hello);
    const jitter = Math.random();
    const firstWait = Math.floor(payload.d.heartbeat_interval * jitter);
    this.debug([`Preparing first heartbeat of the connection with a jitter of ${jitter}; waiting ${firstWait}ms`]);

    await sleep(firstWait);
    await this.heartbeat();

    this.debug([`First heartbeat sent, starting to beat every ${payload.d.heartbeat_interval}ms`]);
    this.heartbeatInterval = setInterval(() => void this.heartbeat(), payload.d.heartbeat_interval);
    break;
}
```

as it is, that `await sleep` call isn't cancelled in any circumstances, so if your shard goes through a reconnect very soon after another one (or after the initial connect), you end up in a state where:
- you end up sending 2 heartbeats once the connection is finally fully re-established
- the old `heartbeatInterval` is never cleared and instead gets lost since there's no reference to it anymore as its overwritten by the new one
- the part that confuses me - only with very specific bad timing would you end up sending a heartbeat before the conn is actually open sometime in the future, since well, now there's just 2 or more heartbeatIntervals running, not in sync
- if that initial condition keeps occurring, the issue adds up, with more and more loose unbound intervals
#

so it would actually be the heartbeat jitter PR that's guilty
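
One way to avoid the leaked timers described above is to keep a handle on both the jittered first wait and the repeating interval, and to cancel both whenever a new Hello arrives or the shard is destroyed. The sketch below uses the `setTimeout` variant rather than `await sleep`, and it is not the fix that was merged; `HeartbeatTimer` and its members are hypothetical names.

```typescript
// Hypothetical helper that owns every timer belonging to one connection.
class HeartbeatTimer {
  private firstBeat: ReturnType<typeof setTimeout> | null = null;
  private interval: ReturnType<typeof setInterval> | null = null;
  private readonly beat: () => void;

  constructor(beat: () => void) {
    this.beat = beat;
  }

  // Call on every Hello. Anything left over from a previous connection is cancelled first.
  start(heartbeatIntervalMs: number, jitter: number = Math.random()): number {
    this.stop();
    const firstWait = Math.floor(heartbeatIntervalMs * jitter);
    this.firstBeat = setTimeout(() => {
      this.firstBeat = null;
      this.beat();
      this.interval = setInterval(this.beat, heartbeatIntervalMs);
    }, firstWait);
    return firstWait;
  }

  // Call from destroy(). Cancels the pending first beat as well as the interval.
  stop(): void {
    if (this.firstBeat !== null) {
      clearTimeout(this.firstBeat);
      this.firstBeat = null;
    }
    if (this.interval !== null) {
      clearInterval(this.interval);
      this.interval = null;
    }
  }

  get active(): boolean {
    return this.firstBeat !== null || this.interval !== null;
  }
}
```

Because `start()` always calls `stop()` first, two Hellos in quick succession leave only one timer chain behind, whichever order the reconnects happen in.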

sullen snow
#

actually

#

that is probably the issue

dusty dove
#

yeah most def.

sullen snow
#

cause I also encounter 2 reconnects on my current prod bot

dusty dove
#

I guess since your bot is big maybe your heartbeat interval is much smaller than mine

#

(my test bot has 45 seconds)

#

so I can't really trigger it

#

even manually

sullen snow
#

this is on a 1.7mil bot or 1.5m

#

the bot I own is around 105k

dusty dove
#

and you also just have more shards so you're more likely to run into it lol

#

but yeah I'll PR a fix now

sullen snow
gloomy skyBOT
dusty dove
#

@stable hatch ^

#

yeah I know one of those = null assignments is redundant in some cases, I just want to be extra safe

stable hatch
#

that is...jankkkk

dusty dove
#

a little

#

but its ok

stable hatch
#

couldn't you have done like

#

what someone else suggested somewhere

#

in your jitter pr actually

dusty dove
#

well they mean setTimeout on that first call, first off

stable hatch
#

well ye

dusty dove
#

i dont know if i like that more

#

i do

stable hatch
#

ignore me

dusty dove
#

nice delete

stable hatch
#

gh didn't show the method name dead

dusty dove
#

ugh

#

this is honestly just JS sucking here lmao

#

I took this super async/await approach everywhere when it came to events and waiting for things

#

and what I did with the controller is consistent with that

stable hatch
#

Also

dusty dove
#

but the other pattern does look nicer here i guess

stable hatch
#

you could've / should've used a try/catch/finally

dusty dove
stable hatch
#

LISTEN

dusty dove
#

will keep this approach but will do try/catch i guess after lunch

dim oracle
#

Do you perhaps know when this will be merged?

dusty dove
#

whenever space and kyra review it

dim oracle
#

Also (unrelated to the rest), what is now the official/best way to support Djs, I'm still subscribed to https://patreon.com/discordjs but I haven't seen Amish in eons, and iirc he's also no longer part of Djs

dusty dove
#

yeah he only really runs /voice

#

you can donate there

#

it's all transparent and you can also see where the money is going in the Expenses tab

#

every so often we (contribs) can bill if we've had any significant work on the library ^^

dim oracle
#

Alright, I'll look into that

stable hatch
#

I mean either way it goes to/through open collective

dusty dove
#

@dim oracle @sullen snow it's been merged

#

just wait for the next @dev release and let me know if it helps

#

oh, sorry about that ping

#

dont know if that's helpful either kek

dusty dove
#

while you're at it, since this is gonna take a prod deploy anyway, enable debug logs so I can actually figure out what's going on if this doesn't fix it

dim oracle
#

Yeah will do that for sure, will probably release it on prod this Monday

dim oracle
#

@stable hatch @dusty dove alright so it appears the issue has not yet been resolved. Surprisingly it took over 24 hours before the first shard got hit. Here's what I can find in the debug logs for shard 873 (Once again ignore markdown)

Seems like there's... a lot going on

#

You can ignore the INFO lines here

stable hatch
#

Well we haven't released the fix, unless you put it in yourself in the code?

dim oracle
#

We used the dev release

stable hatch
#

Hmm

#

Can you npm ls it?

#

Just to be sure 🙏

#

Or yarn why, whichever package manager you use

#

Also HOLY did you just say shard 873

#

😵‍💫😵‍💫😵‍💫

sullen snow
#

yeah we have 1.5k shards

stable hatch
#

Damnnn

sullen snow
#

i just did some magic to make d.js scale at this number

dim oracle
#

We don't have the files on disk, we run npm install in the docker file

stable hatch
#

You can access the container!

sullen snow
#

we are on 0.7.1-dev

dim oracle
#

I'd normally just eval the version but that isn't injected

#

Yeah I exec'd the container

stable hatch
#

docker exec -it name /bin/bash

dim oracle
#

package.json shows "version": "0.7.1-dev.1679184639-9842082.0"

sullen snow
#

"version": "0.7.1-dev.1679184639-9842082.0" to be precise

dusty dove
#

ill look at those logs when i have a sec

dim oracle
#

And yeah we run 1440 shards

dusty dove
#

probably will be a while

stable hatch
#

Well its definitely latest version at least

sullen snow
#

seems like an error occurred before the send error happened

stable hatch
#

Can you provide us the strategy you use too?

sullen snow
#

worker

stable hatch
#

Shards per worker?

sullen snow
#

1

stable hatch
#

Mkayyyy

#

Ty

sullen snow
#

my guess is when ws.on error happened

#

the interval was not cleaned for some reason

stable hatch
#

I'll also take a look to see if anything jumps out but I doubt I'll figure it out as quick as dd might

sullen snow
#

thank you ayase_smile

dusty dove
#

theres no send calls erroring anymore

dim oracle
#

Yeah the previous error is not present which kinda surprised me

dusty dove
#

either way that looks like a good clue

#

outside of that SSL error

#

which ???

dim oracle
#

But it results in the same thing

#

Yeah no clue

dusty dove
#

i need lunch im running on 0 calories for the past 18 hours or so kek

dim oracle
#

np ^^

#

@sullen snow do you know what the SSL error is about

sullen snow
#

i dont touch anything about ssl

#

nor how this ws open a connection

#

but the error happened after an error

#

also it looks like ws is trying to close something that isnt established?

#

could be a good clue since it means we may have been missing some checks here

dusty dove
#

yeah that's what I eye'd as well

sullen snow
#

though

#

why it happened

#

after several "error"

#

it could happen after the first error

#

but it decided not to happen on that and happened after several errors

dusty dove
#

@dim oracle yeah so, it looks like the WS docs just straight up betrayed me

#

Prevent the server from accepting new connections and close the HTTP server if created internally. If an external HTTP server is used via the server or noServer constructor options, it must be closed manually. Existing connections are not closed automatically. The server emits a 'close' event when all connections are closed unless an external HTTP server is used and client tracking is disabled. In this case the 'close' event is emitted in the next tick. The optional callback is called when the 'close' event occurs and receives an Error if the server is already closed.

#

nothing here says I can't call .close() if it's CONNECTING

#

but that's what happened to you

#

```
DEBUG [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): [EventHandler](Discord.JS): [WS => Shard 873 => Worker] Connection status during destroy
    Needs closing: true
    Ready state: 0
ERROR [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): [EventHandler]: Shard Errored => Shard: 873
    err: {
      "type": "Error",
      "message": "WebSocket was closed before the connection was established",
      "stack":
          Error: WebSocket was closed before the connection was established
              at WebSocket.close (/main/node_modules/ws/lib/websocket.js:285:7)
              at WebsocketShard.destroy (/main/node_modules/@discordjs/ws/dist/index.js:638:25)
              at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
              at async WebsocketShard.bubbleWaitForEventError (/main/node_modules/@discordjs/ws/dist/index.js:693:7)
              at async WebsocketShard.connect (/main/node_modules/@discordjs/ws/dist/index.js:587:20)
              at async WebsocketShard.connect (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:79:9)
    }
```
stable hatch
dusty dove
#

fricking insane

#

hate how poorly documented this package is

stable hatch
#

And for what it matters it seems that we handle those gracefully

dusty dove
#

yeah it looks like it was handled fine

stable hatch
#

only the server

dusty dove
#

lmao yeah i think you're right

#

yeah

#

there's the client one

#

thanks, very informative

#

either way, I guess like

stable hatch
#

I guess try catch connection close, assume its fine and carry on?

dusty dove
#

nuh-uh

#

I feel like that could leak

#

somehow

#

leaves an open connection or smth

#

I was thinking we make that if (this.connection.readyState === WebSocket.OPEN)

#

and proceed as we currently do

#

and else if (this.connection.readyState === WebSocket.CONNECTING)

stable hatch
#

its like

dusty dove
#

use connection.terminate() instead

stable hatch
#

not a big deal

dusty dove
#

oh yeah I guess that's fair

#

much simpler for us to handle too then

stable hatch
#

connection.terminate is nodejs ws only
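
The branching being floated here (close when OPEN, terminate when CONNECTING, otherwise leave the socket alone) could look roughly like the sketch below. `SocketLike` and `shutDown` are hypothetical names; `terminate` is optional because it only exists on the Node.js `ws` client. Whether `terminate()` avoids the "closed before the connection was established" error is not settled in this thread, so treat this as the shape of the idea only.

```typescript
// Ready-state values shared by the browser WebSocket API and the `ws` package.
const READY_STATE = { CONNECTING: 0, OPEN: 1, CLOSING: 2, CLOSED: 3 } as const;

// Hypothetical minimal view of a socket, so the sketch runs without the `ws` package.
interface SocketLike {
  readyState: number;
  close(code?: number, reason?: string): void;
  terminate?(): void;
}

type CloseAction = "close" | "terminate" | "skip";

// Picks a shutdown path based on the current ready state and reports which one was taken.
function shutDown(socket: SocketLike, code: number, reason: string): CloseAction {
  if (socket.readyState === READY_STATE.OPEN) {
    socket.close(code, reason);
    return "close";
  }
  if (socket.readyState === READY_STATE.CONNECTING && socket.terminate) {
    socket.terminate();
    return "terminate";
  }
  // CLOSING, CLOSED, or CONNECTING without terminate(): nothing safe to call.
  return "skip";
}
```

Returning the chosen action keeps the decision visible to the caller, which makes it easy to include in a debug log.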

dusty dove
#

great

#

anyway

#

this seems like the only issue kyoso ran into

#
DEBUG [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): [EventHandler](Discord.JS): [WS => Shard 873 => Worker] Destroying shard
    Reason: Something timed out
    Code: 1000
    Recover: Reconnect
DEBUG [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): [EventHandler](Discord.JS): [WS => Shard 873 => Worker] Connection status during destroy
    Needs closing: false
    Ready state: 3
ERROR [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): WebSocket was closed before the connection was established
    err: {
      "type": "Error",
      "message": "WebSocket was closed before the connection was established",
      "stack":
          Error: WebSocket was closed before the connection was established
              at WebSocket.close (/main/node_modules/ws/lib/websocket.js:285:7)
              at WebsocketShard.destroy (/main/node_modules/@discordjs/ws/dist/index.js:638:25)
              at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
              at async WebsocketShard.bubbleWaitForEventError (/main/node_modules/@discordjs/ws/dist/index.js:693:7)
              at async WebsocketShard.connect (/main/node_modules/@discordjs/ws/dist/index.js:587:20)
              at async WebsocketShard.connect (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:79:9)
    }
ERROR [Mon,03/20/23,11:55:22] (Cluster Process [ID: 27]): WebSocket was closed before the connection was established
    err: {
      "type": "Error",
      "message": "WebSocket was closed before the connection was established",
      "stack":
          Error: WebSocket was closed before the connection was established
              at WebSocket.close (/main/node_modules/ws/lib/websocket.js:285:7)
              at WebsocketShard.destroy (/main/node_modules/@discordjs/ws/dist/index.js:638:25)
              at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
              at async WebsocketShard.bubbleWaitForEventError (/main/node_modules/@discordjs/ws/dist/index.js:693:7)
              at async WebsocketShard.connect (/main/node_modules/@discordjs/ws/dist/index.js:587:20)
              at async WebsocketShard.connect (/main/node_modules/vanguard/dist/src/ws/WebsocketShard.js:79:9)
    }
#

well this looks weird

#

but they got into a broken state by that point already

#

sooo

#

we'll pretend we don't see that part

gloomy skyBOT
dusty dove
#

@stable hatch ^

#

also updated that debug log since it was a bit misleading after some refactors

stable hatch
#

hmmm, you should remove all listeners from the conn if its not in a should close state

#

👀

#

oh wait i see

#

misread

dusty dove
#

ye

#

its not done there either way

stable hatch
#

maybe add a debug message in the else for shouldClose to log that "shit broke, oh well"

dusty dove
#

the debug log there is already clear enough though if it'll enter the if or not

#

Needs closing: false, will mean it didn't

stable hatch
#

ugh its still pain that we cannot close while its connecting

dusty dove
#

yeah, shrug

#

u're right though it'll eventually just die anyway if it does end up staying open with no refs to it

#

WS connections do that in the first place if they don't send any payloads for a while

#

discord might be even faster if they see we aren't heartbeating or anything

stable hatch
#

its not IDEAL

#

but yea

dusty dove
#

its p rare it'll happen anyway

#

and leaks basically nothing realistically

#

it's just a tcp handle

stable hatch
#

still hate it but its the best we can do I guess because ws fucking throws on attempting to close when connecting

dusty dove
#

rofl

#

though I must ask

#

@dim oracle did the shard eventually recover w/o your intervention or

dim oracle
dusty dove
#

so like.. what is it doing

#

is it just destroy-looping

dim oracle
#

I don't think its doing anything, after those logs the shard ID never showed up again

dusty dove
#

lmao

#

not the most intended of behaviors

#

is that the only shard that broke?

#

if so it sounds like we did squash the original bug and you just ran into something new that's much less likely to happen

#

(which seems to be trying to destroy a shard that hasn't fully connected yet)

dim oracle
#

I think it is, I don't really have a good way to check since my PC cries when I open the 8GB log file

dusty dove
#

yeah good, cool

stable hatch
#

For confirmation sake, can you patch-package it with the PR fix?

dim oracle
#

I don't really want to restart prod during peak hours tbh

#

Only one shard down atm after 30+ hours, so not really worth restarting prod for atm

stable hatch
#

Well you dont have to do it now

dim oracle
#

I can do it, but probably in a few days

dusty dove
#

pr was merged anyways

dim oracle
#

Just thought I'd keep you posted~ surpassed 48 hours of no issues so far

dusty dove
#

have u pulled the last fix we merged

#

or is this w/o

dim oracle
#

Running 0.7.1-dev.1679400254-950fc47.0 still

dusty dove
#

yeah so i was probs right your last issue was just a super rare thing

dusty dove
#

either way good to know u're stable now

#

cc @stable hatch, looks like I nailed it the first try

dusty dove
#

too late

dusty dove
#

yes ok

#

so everything should just be good

dim oracle
#

or is it

#

Let me actually confirm if this is related at all

dim oracle
#

@stable hatch @dusty dove it seems like there's another issue? Here's an example for one of the shards affected. After this it is completely radio silence and I never see that shard again.

stable hatch
#

ohlawd

dim oracle
#

cc @sullen snow as well I guess

dusty dove
#

what 1k shards does to an mf

#

dude ur ass got a 520 from discord

#

wth

#

yeah ok

#

that is insane

#

i have no idea what im looking at

dusty dove
#

or not actually

#

nvm

#

i dont know why theres just radio silence after

#

either way that shard just completely broke

#

it managed to get into a state where it identified after resuming?

#

there is one reference to this.identify in the whole file

#

and its in connect()

#

you had 2 concurrent connect calls running

#

one with a session and one without
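
A generic guard against overlapping calls like the two `connect()` runs described above is to let concurrent callers share one in-flight promise. This is a sketch of that general pattern, not the fix the library adopted; `SingleFlight` and `run` are hypothetical names.

```typescript
// Hypothetical helper: callers that arrive while a task is running share its promise.
class SingleFlight<T> {
  private inFlight: Promise<T> | null = null;

  run(task: () => Promise<T>): Promise<T> {
    if (this.inFlight !== null) {
      return this.inFlight;
    }
    const attempt = task();
    this.inFlight = attempt;
    // Release the slot once the attempt settles, whether it succeeded or failed.
    const release = (): void => {
      if (this.inFlight === attempt) {
        this.inFlight = null;
      }
    };
    attempt.then(release, release);
    return attempt;
  }
}

// Usage: a stand-in for an expensive connect step that must not overlap with itself.
let connectCalls = 0;
const flight = new SingleFlight<string>();
const fakeConnect = (): Promise<string> => {
  connectCalls += 1;
  return new Promise<string>((resolve) => setTimeout(() => resolve("ready"), 10));
};
const first = flight.run(fakeConnect);
const second = flight.run(fakeConnect);
console.log(first === second, connectCalls);
```

The trade-off is that a second caller gets whatever the first attempt produces, which may not be what a caller with different session state wants.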

dim oracle
#

I can check if the same happened for the other 18 shards I suppose

#

Weird thing is, this happened on all 3 bots at the exact same time

dusty dove
#

yeah that makes a lot of sense

#

because you got a 520 from discord

#

which means this is specific breakage from them im not handling appropriately

#

you just keep running into the most absurd edge cases

dim oracle
#

what 1440 shards does to a mf ig

dusty dove
#

```
DEBUG [Fri,03/24/23,22:20:18] (Cluster Process [ID: 42]): [EventHandler](Discord.JS): [WS => Shard 1370 => Worker] Resuming session
DEBUG [Fri,03/24/23,22:20:18] (Cluster Process [ID: 42]): [EventHandler](Discord.JS): [WS => Shard 1370 => Worker] Identifying
    shard id: 1370
    shard count: 1440
    intents: 131
    compression: none
```
#

this is nonsense

#

nope

#

I think I get it

dim oracle
#

time for a third pr I guess

#

Though, whatever changed from 0.6.x -> 0.7.0 seems to have added a lot of edge cases

#

Things were running fine for weeks before we upgraded to 0.7.0

dusty dove
#

omg

#

that's such a cool bug

#

look at this

#
```ts
    /**
     * Does special error handling for waitForEvent calls, depending on the current state of the connection lifecycle
     * (i.e. whether or not the original connect() call has resolved or if the user has an error listener)
     */
    private async bubbleWaitForEventError(
        promise: Promise<unknown>,
    ): Promise<{ error: unknown; ok: false } | { ok: true }> {
        try {
            await promise;
            return { ok: true };
        } catch (error) {
            // Any error that isn't an abort error would have been caused by us emitting an error event in the first place
            // See https://nodejs.org/api/events.html#eventsonceemitter-name-options for `once()` behavior
            if (error instanceof Error && error.name === 'AbortError') {
                this.emit(WebSocketShardEvents.Error, { error });
            }

            // As stated previously, any other error would have been caused by us emitting the error event, which looks
            // like { error: unknown }
            // eslint-disable-next-line no-ex-assign
            error = (error as { error: unknown }).error;

            // If the user has no handling on their end (error event) simply throw.
            // We also want to throw if we're still in the initial `connect()` call, since that's the only time
            // the user can catch the error "normally"
            if (this.listenerCount(WebSocketShardEvents.Error) === 0 || !this.initialConnectResolved) {
                throw error;
            }

            // If the error is handled, we can just try to reconnect
            await this.destroy({
                code: CloseCodes.Normal,
                reason: 'Something timed out or went wrong while waiting for an event',
                recover: WebSocketShardDestroyRecovery.Reconnect,
            });

            return { ok: false, error };
        }
    }
```
#

so there's this method, right

#

in your case, what happened is

#

the shard was trying to resume from the await this.resume() call in connect()

#

and it timed out for whatever reason

#

but since this isn't your first connection

#

it goes to those last 2 statements

#
```ts
            await this.destroy({
                code: CloseCodes.Normal,
                reason: 'Something timed out or went wrong while waiting for an event',
                recover: WebSocketShardDestroyRecovery.Reconnect,
            });

            return { ok: false, error };
```
#

and it awaits the destroy call

#

which initiates a fresh reconnect

#

before returning { ok: false }

#

so back in connect

#
```ts
        const { ok } = await this.bubbleWaitForEventError(
            this.waitForEvent(WebSocketShardEvents.Hello, this.strategy.options.helloTimeout),
        );
        if (!ok) {
            return;
        }

        if (session?.shardCount === this.strategy.options.shardCount) {
            this.session = session;
            await this.resume(session);
        } else {
            await this.identify();
        }
```
#

the waitForEvent call for Hello fails in a connect() where resume() would be called because of the state of session

#

and it initiates another connect because of that inner destroy call - before hitting the if (!ok) return;

#

so the 2nd wait for Hello on the non-resume connect goes through

#

and both of them end up going through, I guess?

#

though I still don't quite understand how, since we still eventually return ok: false

#

but it's def something racing there

dim oracle
#

I'm surprised you found this already

dusty dove
#

lol

#

my instinct for async races is crazy good since its all ive been working with in my time programming

dim oracle
#

Welp, it sure worked this time I suppose

dusty dove
#

still super odd

#

but yeah it's def smth to do with this given both a resume and identify fired right after a Hello came through

dim oracle
#

also, 520 is from cloudflare mmLol

dusty dove
#

lol

#

classic

dim oracle
#

tbh, this direct line has helped a ton the past few days so glad this came to be 🙏

dusty dove
#

yeah i mean

#

its good you're running it in prod

#

because I get to iron it out like this

#

yea im not 100% this will do it but its a one-line diff that should improve the behavior either way

#

while im at it

#

i wanna append something to the readme

dim oracle
#

Seems like this only happens rarely so I don't really need to rush any of this

dusty dove
#

yeah

#

its the v specific "shard timed out on hello while resuming"

stable hatch
#

..just for 2 shards

dusty dove
stable hatch
#

thats the fix?

dusty dove
#

rofl

#

unclear 100%

#

but it's def. more correct now

#

since before it made connect calls just kinda hang about if something timed out

#

until the shard fully reconnected

#

and that could start nesting up
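
The nesting described here can be reproduced with a toy model. This is only a sketch: `ToyShard`, `bubble`, and the `awaitReconnect` flag are invented for illustration and are not the @discordjs/ws code. It assumes the recovery path starts a fresh `connect()`, and compares awaiting that reconnect against letting it run detached.

```ts
// Toy model only: every name here is hypothetical, not the @discordjs/ws implementation.
const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

class ToyShard {
  public readonly log: string[] = [];
  private calls = 0;
  private readonly awaitReconnect: boolean;

  public constructor(awaitReconnect: boolean) {
    this.awaitReconnect = awaitReconnect;
  }

  public async connect(): Promise<void> {
    const id = ++this.calls;
    this.log.push(`connect ${id}: waiting for hello`);

    const ok = await this.bubble(this.waitForHello(id));
    if (!ok) {
      this.log.push(`connect ${id}: bailed out`);
      return;
    }

    this.log.push(`connect ${id}: identify`);
  }

  // The first attempt "times out"; later attempts succeed after a short delay.
  private async waitForHello(id: number): Promise<void> {
    if (id === 1) {
      throw new Error('hello timed out');
    }
    await sleep(5);
  }

  private async bubble(promise: Promise<void>): Promise<boolean> {
    try {
      await promise;
      return true;
    } catch {
      // Stand-in for a destroy() that recovers by starting a new connect().
      const reconnect = this.connect();
      if (this.awaitReconnect) {
        // The failed connect() now stays pending until the new one has fully finished.
        await reconnect;
      }
      return false;
    }
  }
}
```

With `new ToyShard(true)`, the first `connect()` only reaches its "bailed out" line after the second call has already identified, which is the "hanging about" behavior. With `new ToyShard(false)`, the first call returns right away and the reconnect proceeds on its own.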

stable hatch
#

so heres

#

a wild shot

#

can you yolo unit test this?

dusty dove
#

lmao

#

not really

#

mocking WS even enough just for this is insane

#

well its not insane its just completely unmaintainable if I just hack it up like this

stable hatch
#

hrm right, you'd need to make some mock ws that just sends the 3 payloads

dusty dove
#

yeah

dim oracle
#

are versions injecting correctly yet?

stable hatch
#

in dev, yea

#

they should be

stable hatch
#

wont this throw undefined too?

dusty dove
#

shouldn't?

#

when would it

stable hatch
#
if (this.listenerCount(WebSocketShardEvents.Error) === 0 || !this.initialConnectResolved) {
  throw error;
}
dusty dove
#

yeah, but when would it be undefined

stable hatch
#

I mean

#

when an aborterror is thrown

#

ALSO emitting error will throw if theres no error listener

stable hatch
dusty dove
#

thinking

dim oracle
#

cool

dusty dove
#

in the !this.initialConnectResolved case its so the user can try..catch connect()

stable hatch
#

it is still technically able to throw undefined, right?

dusty dove
#

oh lmao

#

yes there is a missing conditional

stable hatch
#

if an aborterror is thrown

dusty dove
#

I always destructure even if its an aborterror

#

ill push a fix for that into this PR as well

stable hatch
#

also why do you even emit error if its an aborterror

#

thats really not useful, right?

#

if you get an aborterror you just wanna return false, or?

dusty dove
stable hatch
#

maybe rethink the whole catch block then?

dusty dove
#

because of control flow

stable hatch
dusty dove
#

like

#

you call .connect()

#

the first time

#

I need it to throw so connect() throws

#

and so the user can catch it

stable hatch
#

I guess from the main ws.connect call

dusty dove
#

yeah

#

in any other case i dont really care

stable hatch
#

@dusty dove honestly tho

#

error = isAbortError ? error : (error as { error: unknown }).error; should be error = error instanceof Error ? error : theCast

dusty dove
#

oh

#

sure
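
For reference, the unwrapping suggested above could look roughly like this. It is a sketch under assumptions: `unwrapWaitError` is a made-up helper name, and the only shapes it expects are a real `Error` (such as an abort error) or an emitted payload shaped like `{ error: unknown }`.

```ts
// Hypothetical helper, not part of @discordjs/ws.
function unwrapWaitError(caught: unknown): unknown {
  // Real Error instances (AbortError included) are passed through untouched.
  if (caught instanceof Error) {
    return caught;
  }

  // Anything else is assumed to be the emitted payload: { error: unknown }.
  if (typeof caught === 'object' && caught !== null && 'error' in caught) {
    return (caught as { error: unknown }).error;
  }

  // Unknown shape: hand it back as-is rather than turning it into undefined.
  return caught;
}
```

The point of the `instanceof Error` check is that an abort error never gets destructured, so the later `throw error` cannot end up throwing `undefined`.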

dim oracle
#

merged already sheesh? Will just wait for dev release then

dusty dove
#

@dim oracle @sullen snow do u guys keep perf metrics

#

we're getting rid of zlib-sync in favor of node:zlib

#

and i wanna see what the impact is

sullen snow
#

we dont compress

dusty dove
#

oh

#

wth why do you use etf then

#

it's literally just bigger payloads in a bunch of cases if you don't zlib

sullen snow
#

used to be for performance

#

but nowadays

#

due to threaded ws

#

might just remove it to reduce maintenance for me or wait for your implementation

stable hatch
#

Please tell me you've installed bufferutil and utf-8-validate

dusty dove
#

discordjs/ws big bot memes

sullen snow
#

both are installed, but even with those, cpu usage without etf encoding is high

dusty dove
#

thats nuts

stable hatch
#

That's why you use zlib-stream

dusty dove
#

yeah lol

sullen snow
#

zlib has leaks on node

stable hatch
#

It wat

dusty dove
#

yeah it actually does

#

I've seen that before

#

but i dont know why/how

sullen snow
#

wait a second

stable hatch
dusty dove
#

I'd imagine its because abal doesn't maintain zlib-sync

sullen snow
#

ill show you a really old

#

experimentation I have

dusty dove
#

so there's just no updates to zlib being pulled in

#

but node:zlib shouldn't leak

dim oracle
#

I don't really know the compression stuff but we basically have no CPU usage on prod

sullen snow
#

with etf ^

dim oracle
#

Like <5%

sullen snow
#

how do you search for

#

messages that I sent with a specific keyword again

#

@stable hatch @dusty dove

#

surely you dont want those pauses

dusty dove
#

yea dunno

#

once we merge this try it i guess

#

and see if it breaks ur stuff

sullen snow
#

I'll see if I can add garbage collection pauses again

#

as that was my 2020 code

#

who knows where I even put it

stable hatch
#

@dusty dove where pr

dusty dove
#

making now

#

just doing some cleanup

sullen snow
#

thats why personally I used etf over compression

#

but then again, once this zlib issue is fixed

#

we may want to try it but this is prod so we don't want anything breaking like the 0.7.0 commit CheshireXD

dim oracle
#

"big bot memes" Deadge

stable hatch
#

Lets make a (dev) release with both

dim oracle
stable hatch
#

So

dusty dove
stable hatch
#

...add a new compression method called nativezlib?

dusty dove
#

awful

#

fine

stable hatch
dusty dove
#

yeah im not doing this today anymore

#

dont feel like it

#

too much conditional work

stable hatch
#

I'll do it if you push code to a branch

#

Lol

dusty dove
#

ok

dim oracle
#

anything else you want to test on a big bot before I pull the latest dev release somewhere tomorrow

dusty dove
#

we have one too now actually

#

0 guilds but it has big bot sharding toggled on

stable hatch
dim oracle
#

oh that's nice

stable hatch
#

Yeah LOL so we can test max concurrency

dusty dove
#

well we can't actually test stress and stuff

#

but its still useful

dim oracle
#

yeah its not gonna be anything like the 'real' thing though

dusty dove
#

ye

dim oracle
#

16x or higher?

dusty dove
stable hatch
#

Pls tell me you let maintainers push to it

dusty dove
#

i also split useIdentifyCompression into its own option but thats probably not how i should have done it

#

yes

stable hatch
#

Otherwise i will ping you every minute for the next day

dusty dove
#

i never tick that box off

dusty dove
dim oracle
#

do you actually know how many bots with big bot sharding use djs

stable hatch
dusty dove
#

mostly cus of how I deal with the compression enum

stable hatch
#

You can't mix identify compression with zlib-stream

dusty dove
#

yes, that's handled elsewhere

#

join them back if u want to

stable hatch
#

THEN WHY'D YOU SPLIT

#

Reeeee

stable hatch
#

Zoomers i s2g

dim oracle
dusty dove
#

I pass the value as-is into the query params of the WS url

dusty dove
stable hatch
#

I'll look into it Deadge

dim oracle
#

it is now, if you hit 150k and have a multiple of 16 as shard count you're just moved automatically to 16x

#

same for 32x iirc

#

not sure about 64x or 128x though

stable hatch
#

Are there any bots in that range o.o

dim oracle
#

but only like 4 bots have access to that anyways

#

iirc Mee6 is the only one on 128x

dusty dove
#

p sure rythm still is as well

dim oracle
#

yeah but dead bot so shrug

dusty dove
#

well

#

we still keep a gateway connection going

#

so they def. care lol

#

esp. considering the last time we did anything spicy we took the platform down

stable hatch
dusty dove
#

to display presence, marketing things i guess meguFace

stable hatch
#

Discord rolling in their grave

#

If you ever disconnect and the sessions fully shut down and you reconnect you'll be yelled at 😂

dusty dove
#

idk tho i had nothing to do w bot eng and still dont have anything to do with eng there

dim oracle
#

I do wonder what you'll come up with to support 16x if that's gonna be a thing now

stable hatch
#

We already support max concurrency

dusty dove
#

yeah i had vlad double check and we're still waiting for a response but

#

p sure you were wrong abt the max concurrency formula thing

#

there's just no such thing as bucketing on them, you can just identify max_concurrency shards at any given time regardless of their id

#

so we already fully support it lol

stable hatch
#

We'll see when I get an answer

dim oracle
#

Hmm, tbf I haven't actually used djs's sharding manager in a while

stable hatch
#

Good

#

Save yourself

dim oracle
#

💀

stable hatch
#

The amount of issues it has is insane

#

Anyways this is what i asked discord

We do actually have a question about max_concurrency! sweats

Is there any formula to what shards can connect concurrently? Does max_concurrency only alter the identify limit (so we could technically identify shard id 0/16 16 times), or does it alter the sequence too (so only shard 0-15 first, then 16-31)

#

We shall wait and see

#

The docs suggest the latter

#

(which wouldn't affect us anyways bc it works already)
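
For context on the bucketing question: as far as I recall, Discord's sharding docs describe the identify rate limit key as `shard_id % max_concurrency`. The sketch below just groups shard IDs by that key. `identifyBucket` and `groupByBucket` are hypothetical names, and whether the gateway actually enforces these buckets is exactly what is still being asked above.

```ts
// Sketch of the documented formula; helper names are made up for illustration.
function identifyBucket(shardId: number, maxConcurrency: number): number {
  return shardId % maxConcurrency;
}

// Shards that share a bucket would have to identify one after another;
// shards in different buckets could identify at the same time.
function groupByBucket(shardIds: readonly number[], maxConcurrency: number): Map<number, number[]> {
  const buckets = new Map<number, number[]>();
  for (const id of shardIds) {
    const key = identifyBucket(id, maxConcurrency);
    const bucket = buckets.get(key) ?? [];
    bucket.push(id);
    buckets.set(key, bucket);
  }
  return buckets;
}
```

Under that reading, with `max_concurrency` 16 the shards 0, 16, and 32 all land in bucket 0, while shards 0 through 15 each get their own bucket.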

dim oracle
#

We'll see, every bot dev I have contact with does say the bucketing is a thing

#

how many sessions does the test bot have actually? Since it scales off guild count I guess just the normal 1000

dim oracle
#

@dusty dove @stable hatch I'm not sure if this is caused by Djs, but again a shard is showing some weird things before never recovering again. Still on ws 0.7.1-dev.1679400254-950fc47.0

#

Again after these logs shard 162 is never seen again

dusty dove
#

aaaa

stable hatch
#

likely bc of outdated ws

dusty dove
#

why do you not run with --enable-source-maps

stable hatch
#

I mean @dusty dove

#

for all we know

#

your awaiting destroy bug fix could've solved this

#

Best bet is wait for them to update, and check after

dusty dove
#

if u look at the later logs

#

thats a network issue anyway

#

opening handshake timeout

stable hatch
dusty dove
#

well

#

that specifically patched a race caused by uh

#

waitForEvent timing out while resuming

#

but yes, thatd typically happen from network issues

stable hatch
#

I stand by what i said mm

#

@dim oracle

  1. update ws
  2. run with --enable-source-maps
  3. followup if it dies again
  4. contemplate why you made a discord bot when networking is so reliable
dim oracle
#

Just making sure this isn't missed before I update ws again

stable hatch
#

i mean dd knows that better than me but its not impossible its already handled

#

this will end up being a cat and mouse against async race conditions

#

still more stable than djs's current ws which has at least 1-2 dead locks LMAO

dim oracle
#

no weird network shenanigans around that time though

stable hatch
#

your logs show networking issues

#

@dusty dove fyi the error seems to have been thrown in waitForEvent

dusty dove
#

yeah makes sense

stable hatch
#

probably waiting for hello?

dusty dove
#

but the handshake timeouts are def. outside of my control

stable hatch
#

ye

dusty dove
#

if its not your network its discord

dusty dove
stable hatch
#
DEBUG [Sun,03/26/23,18:55:51] (Cluster Process [ID: 5]): [EventHandler](Discord.JS): [WS => Shard 162 => Worker] Waiting for event hello for 60000ms
ERROR [Sun,03/26/23,18:56:51] (Cluster Process [ID: 5]): [EventHandler]: Shard Errored => Shard: 162
    err: {
      "type": "Error",
      "message": "The operation was aborted",
      "stack":
          AbortError: The operation was aborted
              at EventTarget.abortListener (node:events:958:14)
              at [nodejs.internal.kHybridDispatch] (node:internal/event_target:735:20)
              at EventTarget.dispatchEvent (node:internal/event_target:677:26)
              at abortSignal (node:internal/abort_controller:308:10)
              at AbortController.abort (node:internal/abort_controller:338:5)
              at Timeout.<anonymous> (/main/node_modules/@discordjs/ws/dist/index.js:650:91)
              at listOnTimeout (node:internal/timers:569:17)
              at process.processTimers (node:internal/timers:512:7)
    }

#

ya dont say

dusty dove
#

no but

#

nvm

#

i missed the destroy log

#

thats what i was looking for

#

yeah it timed out on hello

stable hatch
#

Yknow, I do wonder if we can deadlock shards like that

dusty dove
#

then subsequent reconnects timed out on the TCP handshake

dim oracle
dusty dove
#

sooo discord was just dying

stable hatch
#

imma test if shards deadlock when ws never responds with a hello

dusty dove
#

wut why would it

#

it just times out n starts over

dim oracle
#

Just to make sure, ^0.8.0-dev.1679789487-b8b852e.0 would be the latest right

stable hatch
#

theres enough async to kill an elephant

stable hatch
#

why did we bump major again Thonk

#

eh w/e

#

yes, its correct

dim oracle
#

Will update prod somewhere tomorrow

stable hatch
#

@dusty dove so this is fun...

#
```
{
  message: 'Connecting to ws://localhost:8080?v=10&encoding=json',
  shardId: 0
}
{ message: 'Waiting for event hello for 60000ms', shardId: 0 }
[WSS] Connected
Exception in PromiseRejectCallback:
file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:305
      });
      ^

RangeError: Maximum call stack size exceeded
node:events:958
      reject(new AbortError(undefined, { cause: signal?.reason }));
             ^

AbortError: The operation was aborted
    at EventTarget.abortListener (node:events:958:14)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:735:20)
    at EventTarget.dispatchEvent (node:internal/event_target:677:26)
    at abortSignal (node:internal/abort_controller:308:10)
    at AbortController.abort (node:internal/abort_controller:338:5)
    at Timeout.<anonymous> (/Users/vlad/Development/Discord/discord.js/packages/ws/src/ws/WebSocketShard.ts:258:65)
    at listOnTimeout (node:internal/timers:569:17)
    at process.processTimers (node:internal/timers:512:7) {
  code: 'ABORT_ERR',
  [cause]: DOMException [AbortError]: This operation was aborted
      at new DOMException (node:internal/per_context/domexception:53:5)
      at AbortController.abort (node:internal/abort_controller:336:18)
      at Timeout.<anonymous> (/Users/vlad/Development/Discord/discord.js/packages/ws/src/ws/WebSocketShard.ts:258:65)
      at listOnTimeout (node:internal/timers:569:17)
      at process.processTimers (node:internal/timers:512:7)
}
Node.js v18.14.0
```
#

i.. dont know why or how theres a rangeerror from async ee

dusty dove
#

this looks like not my problem

stable hatch
#

well

#

the process crashed

#

you said it should resume

dusty dove
#

so true, how could i forget to catch the stack overflow error

stable hatch
#

what in the name of sweet lord

#

I logged the error that kept getting thrown

#
```
AEE ERROR Error: Unhandled 'error' event emitted, received [object Object]
    at WebSocketManager.emit (file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:213:19)
    at WebSocketShard.<anonymous> (file:///Users/vlad/Development/Discord/discord.js/packages/ws/dist/index.mjs:977:51)
    at file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:297:34
    at new Promise (<anonymous>)
    at Object.wrappedFn [as wrappedFunc] (file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:292:23)
    at WebSocketShard.emit (file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:224:25)
    at file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:301:30
    at new Promise (<anonymous>)
    at Object.wrappedFn [as wrappedFunc] (file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:292:23)
    at WebSocketShard.emit (file:///Users/vlad/Development/Discord/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:224:25) {
  context: { context: { context: [Object], shardId: 0 }, shardId: 0 }
}
```
#

I'll have to take a look at that

#

that might be my bad

#

ok so that aside, @dusty dove didnt u say that uhhh...if the hello times out it should retry?

#

bc it sure doesnt do that

#

It DOES work if the conn is closed

#

but not if it times out

dusty dove
#

this is only if its not the initial connect, actually

#

im not sure why its that way

#

anymore

#

but it is on purpose

stable hatch
#

sounds..stupid

dusty dove
#

maybe, i think its this way bcus like

stable hatch
#

First off how are you sure its only on initial connect

#

secondly it shouldn't even be like this ever

dusty dove
#

the idea is that connect only resolves once its ready right

stable hatch
#

hello timeouts can happen bc internet is just dead

dusty dove
#

and to accomplish that itd just

#

recurse down on every failure

#

so in practice the retry logic would need to be handled outside the shard class

stable hatch
#

Right, then explain why connect fails on waiting for hello but not when conn dies

dusty dove
dusty dove
#

what does "conn dies" mean

stable hatch
#

Consider a wss local server
Consider ws = the connection to the wss

When the manager spawns the shard that connects to the local wss

  • if the timeout is reached, process exits with the abort error
  • if the ws is closed in the wss, it tries to reconnect constantly, as it should
dusty dove
#

mmmh

#

that actually sounds bad

stable hatch
#

oh god I've found bugs in AEE too brb crying

dusty dove
#

does the promise ever resolve in that latter case

#

i have a vague feeling it doesnt

stable hatch
dusty dove
#

the more i think about this the more absurd i realize it is to make it so connect is catchable

#

so many edge cases

#

its why all the error handling is so dank

stable hatch
#

HAH

#

uhhh

#

Check dms

dusty dove
#

oh i just had an epiphany @stable hatch

#

i know how to fix all of this

stable hatch
#

i broke shit so hard

#

check dms

dusty dove
#

what version is this

stable hatch
#

latest main

#

and uh

#

ok so tbf it could be my test script

#

B u t uhhhhh it like breaks breaks

dusty dove
#

either way

#

heres what ill do

stable hatch
#

handle this not at midnight

dusty dove
#

ill scrap the thing that makes connect throw if things timeout during the initial connects

#

anddd i also know a way to still guarantee it only resolves on ready

dusty dove
#

@dim oracle I tracked down a new bug related to what you ran into a couple of days ago

#

it seems waitForEvent calls were never cancelled by the WS shard closing regularly

#

I just did a massive refactor to address this and some of the janker error handling
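
To make the "never cancelled on close" problem concrete, here is a rough sketch of a wait helper that is aborted by either a timeout or the connection closing. It is not the refactor mentioned above: the function shape, the `'closed'` event name, and the string return values are all assumptions. It relies on `once()` from `node:events` accepting an `AbortSignal`.

```ts
import { EventEmitter, once } from 'node:events';

// Hypothetical sketch; the event names and return values are made up for illustration.
async function waitForEvent(
  emitter: EventEmitter,
  event: string,
  timeoutMs: number,
): Promise<'received' | 'cancelled'> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  // Cancel the wait as soon as the connection reports that it closed.
  const onClose = (): void => controller.abort();
  emitter.once('closed', onClose);

  try {
    await once(emitter, event, { signal: controller.signal });
    return 'received';
  } catch (error) {
    // A timeout and a close both surface as an AbortError here.
    if (error instanceof Error && error.name === 'AbortError') {
      return 'cancelled';
    }
    throw error;
  } finally {
    clearTimeout(timer);
    emitter.off('closed', onClose);
  }
}
```

The `finally` block matters as much as the abort: without it, every finished wait would leave a timer and a `'closed'` listener behind.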

dim oracle
#

hmm, prod is currently on the dev release I mentioned yesterday

dusty dove
#

yeah nws

#

I haven't even opened the PR

#

vlad has been toying with things and edge cases using a fancy script

#
```js
import { WebSocketManager } from '@discordjs/ws';
import { REST } from '@discordjs/rest';
import { WebSocketServer } from 'ws';

let initial = true;
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
    console.log('[WSS] Connected');

    ws.on('close', () => {
        console.log('[WSS] Disconnected');
        initial = false;
    });

    ws.close();
});

const rest = new REST({}).setToken('');

const manager = new WebSocketManager({
    intents: 0,
    rest,
    token: '',
    retrieveSessionInfo(shardId) {
        if (initial) {
            return {
                shardId,
                shardCount: 1,
                sequence: 1337,
                resumeURL: 'ws://localhost:8080',
                sessionId: 'owo',
            };
        }
        return null;
    },
    shardCount: 1,
    shardIds: [0],
    helloTimeout: 10000,
});

manager.on('debug', console.log);
manager.on('heartbeat', console.log);
manager.on('ready', console.log);

await manager.connect();
console.log('Connected');
```
#

we hijack the WS server it connects to using the resumeURL, lol

#

to test some weirder stuff like if it insta closes

dim oracle
#

interesting

dusty dove
#

@stable hatch you around to mess w my branch

#

im pushing now

stable hatch
dusty dove
#

lmao

stable hatch
#

but i mean

dusty dove
#

the zlib one?

#

dw this is probs more important

stable hatch
#

ye

#

aite

stable hatch
dusty dove
#

oh yea

#

sure

#

like, just let it time out?

stable hatch
#

yes

dusty dove
#

worked

stable hatch
#

i'm more interested in the behavior in that condition

dusty dove
#

mostly

stable hatch
dusty dove
#

just a range error from AEE

#

lmao

#
```
➜ node --enable-source-maps vlad.mjs
{
  message: 'Connecting to ws://localhost:8080?v=10&encoding=json',
  shardId: 0
}
{ message: 'Waiting for event hello for 10000ms', shardId: 0 }
[WSS] Connected
Exception in PromiseRejectCallback:
file:///home/didinele/Documents/Code/didinele/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:308
    }, "wrappedFn");
    ^

RangeError: Maximum call stack size exceeded

Exception in PromiseRejectCallback:
file:///home/didinele/Documents/Code/didinele/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:308
    }, "wrappedFn");
    ^

RangeError: Maximum call stack size exceeded

Exception in PromiseRejectCallback:
file:///home/didinele/Documents/Code/didinele/discord.js/node_modules/@vladfrangu/async_event_emitter/dist/index.mjs:308
    }, "wrappedFn");
    ^

RangeError: Maximum call stack size exceeded

{
  message: 'Destroying shard\n' +
    '\tReason: Something timed out or went wrong while waiting for an event\n' +
    '\tCode: 1000\n' +
    '\tRecover: Reconnect',
  shardId: 0
}
{
  message: 'Connection status during destroy\n\tNeeds closing: true\n\tReady state: 1',
  shardId: 0
}
[WSS] Disconnected
{
  message: 'Connecting to wss://gateway.discord.gg?v=10&encoding=json',
  shardId: 0
}
{ message: 'Waiting for event hello for 10000ms', shardId: 0 }
{
  message: 'Preparing first heartbeat of the connection with a jitter of 0.6635785282947928; waiting 27372ms',
  shardId: 0
}
{ message: 'Waiting for identify throttle', shardId: 0 }
{
  message: 'Identifying\n\tshard id: 0\n\tshard count: 1\n\tintents: 0\n\tcompression: none',
  shardId: 0
}
{ message: 'Waiting for event ready for 15000ms', shardId: 0 }
{
  data: {
    v: 10,
    user_settings: {},
    user: {
      verified: true,
     ----------- [snip] ----------
  },
  shardId: 0
}
Connected
```
#

huh, verified: true

#

what bot token have I been using

stable hatch
dusty dove
#

do they

stable hatch
#

Its not the verified bot flag

dusty dove
#

oh its email verified

#

lol

#

yes that makes a lot of sense

stable hatch
#

Past the aee errors

#

Nice

gloomy skyBOT
dusty dove
#

have fun

stable hatch
#

I sure love getting 5 notifications whenever a pr is open

dusty dove
#

this should really iron things out

dim oracle
#

watch there be many edge cases after this anyways

dusty dove
#

well of course

#

but i've addressed a p fundamental flaw

#

lmao

stable hatch
#

We're just glad you're reporting these and that, overall, ws has been more stable than djs ws

dusty dove
#

yeah honestly

#

its been better out the gate

#

any version past 0.3

stable hatch
#

Gives me more confidence for 14.10

dusty dove
#

that didn't have the send bug that caused all shards to eventually reconn loop

stable hatch
dim oracle
#

Running debug logs for a week straight taught me I need to set a size limit for the logfile

stable hatch
#

@dusty dove heres the thing with your emitting of error events in waitForEvent

#

it will ALWAYS throw the error

#

because just like in node, emitting an error event when theres no error listeners will throw

#

so the destroy call will never happen

#

and same with the return

dusty dove
#

wait what

#

I thought the throw was async on the next tick or something

#

not on the .emit call

stable hatch
#

its on the emit call

dusty dove
#

at least that's how native EE behaves IME

#

huh really

#

I must be misremembering

stable hatch
#

no, native ee also does it

#

no listeners on error event = throw error
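
A quick way to check this with the native emitter from `node:events` (the async event emitter discussed here is said to behave the same way, but this sketch only exercises the built-in one). `emitErrorSafely` is a made-up helper name.

```ts
import { EventEmitter } from 'node:events';

// Hypothetical helper: reports whether emit('error') threw on the spot.
function emitErrorSafely(emitter: EventEmitter, payload: unknown): 'delivered' | 'threw' {
  try {
    emitter.emit('error', payload);
    return 'delivered';
  } catch {
    return 'threw';
  }
}

// No 'error' listener: emit() throws synchronously, so nothing after it would run.
const bare = new EventEmitter();
console.log(emitErrorSafely(bare, new Error('boom'))); // prints "threw"

// With a listener bound, the same emit is simply delivered.
const guarded = new EventEmitter();
guarded.on('error', () => {});
console.log(emitErrorSafely(guarded, new Error('boom'))); // prints "delivered"
```

So any code placed after an unguarded `emit('error', ...)`, such as a destroy call or a return, is skipped whenever no listener is attached.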

dusty dove
#

what's interesting is if you run simple strategy there'll always be a bound error event anyway

#

so the throw ends up coming from the manager

stable hatch
#

i mean u only emit it for abort errors

#

which

#

as i've said before

dusty dove
#

yeah, was just a consistency thing

stable hatch
#

is kinda useless

dusty dove
#

fair

stable hatch
#

I mean an abort error emitted in error events is kinda...useless?

#

especially since the method can reject instead

dusty dove
#

ye, done

#

either way

#

I like this solution

#

I'm finally actually pleased with waitForEvent

sullen snow
#
```
    at [kNewListener] (node:internal/event_target:514:17)
    at eventEmitter.<computed> (node:internal/worker/io:307:12)
    at MessagePort.addEventListener (node:internal/event_target:623:23)
    at MessagePort.on (node:internal/event_target:873:10)
    at VanguardBootstrap.setupThreadEvents (/main/node_modules/@discordjs/ws/src/utils/WorkerBootstrapper.ts:83:5)
    at VanguardBootstrap.bootstrap (/main/node_modules/vanguard/src/worker/VanguardBootstrap.ts:51:14)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
```
@dusty dove there is a memory leak issue from event emitter on thread ws
#

not sure how it started, but I doubt this is from my code

sullen snow
#

there is some unneeded code there that I don't use, like the extendederrordata, since d.js supports it already

dusty dove
#

that implies the setup method is called a bunch of times? hm

sullen snow
#

yes

#

not sure why as well

#

you can probably prepatch it by just checking if the method was initialized once, but then again that would mean this would be a patch fix rather than fixing it from the base

dusty dove
#

id rather you hacked in a trace that figures out where its being called from

sullen snow
#

not sure how I can make a trace for that

dusty dove
#

like so:

```js
const err = new Error();
Error.captureStackTrace(err);

console.log(err);
```
sullen snow
dusty dove
#

used this trick a bunch to debug shard
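
A slightly extended version of the same trick, in case it helps: instead of printing every trace, count how often each distinct call path reaches the suspect method. This is just a sketch. `recordCaller` and `callSites` are made-up names, and it leans on the V8-style `stack` string that Node produces.

```ts
// Hypothetical debugging aid, not part of @discordjs/ws.
const callSites = new Map<string, number>();

function recordCaller(label: string): number {
  const stack = new Error(label).stack ?? '';
  // Drop the message line and recordCaller's own frame; the rest identifies the call path.
  const path = stack.split('\n').slice(2).join('\n');
  const count = (callSites.get(path) ?? 0) + 1;
  callSites.set(path, count);
  return count;
}
```

Calling `recordCaller('setupThreadEvents')` at the top of the suspect method and dumping `callSites` later would show whether it is one path being hit repeatedly or several different callers.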

sullen snow
#

is trace warnings not complete in this regard?

dusty dove
#

doesnt seem to be, since it only goes to bootstrap

#

but i guess an error trace wont help more either then

#

idk, if setup is only called in bootstrap that'd imply bootstrap is called multiple times

#

wait

#

is that the only warning u got

#

setup makes multiple listeners

#

u shouldve gotten one for each event i feel

#

ohhh wait

#

@sullen snow whats your shardsPerWorker

#

this might actually just not be a leak

#

we do actually just bind that many listeners

#

one per shard
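
If the warning really is just "one listener per shard" crossing Node's default limit of 10 listeners per event, raising the limit to match the expected count is one option. A minimal sketch, assuming a plain `EventEmitter`; `ensureListenerBudget` and its `headroom` parameter are hypothetical. For a `MessagePort`, the static `setMaxListeners(n, target)` export from `node:events` should be the analogous call, as far as I know.

```ts
import { EventEmitter } from 'node:events';

// Hypothetical helper: makes room for one listener per shard plus a little headroom.
function ensureListenerBudget(emitter: EventEmitter, shardCount: number, headroom = 2): number {
  const current = emitter.getMaxListeners();

  // 0 means "unlimited" in Node, so there is nothing to raise.
  if (current === 0) {
    return current;
  }

  const needed = shardCount + headroom;
  if (current < needed) {
    emitter.setMaxListeners(needed);
  }
  return emitter.getMaxListeners();
}
```

This only silences a legitimate high listener count. If listeners keep piling up over time, that is still a leak and raising the limit would just hide it.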

dusty dove
#

nvm then

#

lol

sullen snow
#

thats why theoretically it should not emit

dusty dove
#

does this repro every startup

sullen snow
#

nope usually it emits after some time

dim oracle
#

ehh this does occur for every cluster we spawn

#

so if we spawn 60 clusters during startup we get the message 60 times

dusty dove
#

so like.. only after some time huh

#

v odd

sullen snow
#

oh

dim oracle
#

nah directly after we spawn the cluster iirc

sullen snow
#

my bad then

dim oracle
#

but only past a certain amount of shards per cluster, so e.g. for like 2 shards per cluster we don't get the message

#

did some testing this morning, didn't mention that to saya yet ^^

dusty dove
#

but for 3 you do?

dim oracle
#

didn't test at what point we got the message, can do in a bit though

dusty dove
#

ye would be helpful

dim oracle
#

And we get it instantly after we launch the cluster

dim oracle
#

cc @sullen snow I guess

sullen snow
#

we usually run around

#

1 shard per worker

#

@dim oracle are you changing how many shards we run

#

worker !== cluster

#

@dusty dove we run the same amount of threads for our websocket

#

so basically if we run 32 shards

#

thats 32 threads

dim oracle
#

Used to be 32 per cluster, now its 24 per cluster to match the core count of the dedi

dusty dove
#

make up your minds derpsnail

sullen snow
#

probably its just confusing but

#

the structure is like this

#

master process -> cluster process -> thread for websocket

#

where we run 24 threads in that cluster

#

each thread handles 1 websocket

#

if you ask me why I do that, it's because I want each ws thread to have its own dedicated event loop, and it's not that expensive to spawn them once; it's not like they're being spawned every time

#

this way heartbeats are as accurate as they can be