#discordjs/ws big bot memes (old)


sullen snow
#

@stable hatch so far its fine to me

#

since the ws actually heartbeats

#

this is on 50 shards per cluster with 1x concurrency

#

so the non identified shards didnt dc on this time frame

stable hatch
#

oh wait

#

do heartbeats keep it alive?

#

@dusty dove i'm testing my hypothesis by spawning 255 shards

#

uh

#

wat

#
```
[251] Identifying
    shard id: 251
    shard count: 256
    intents: 0
    compression: none
[251] Waiting for event ready for 15000ms
[251] Ready
Connected
[251] Identifying
    shard id: 251
    shard count: 256
    intents: 0
    compression: none
[251] Waiting for event ready for 15000ms
[251] More than one auth payload was sent.
[251] Destroying shard
    Reason: none
    Code: 4005
    Recover: Reconnect
```
stable hatch
#

only shard that had this issue somehow

dusty dove
#

no idea what that's all about

stable hatch
#

on your PR

stable hatch
#

yeah me neither, somehow it happened right after the connect promise resolved OMEGALUL

#

The gateway closed with an unexpected code 1006

#

god i love the internet

dusty dove
#

this looks like it double-destroyed

stable hatch
#

have fun

#

256 shards

dusty dove
#
```
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.5238143187542947; waiting 21607ms
[251] Waiting for identify throttle
[251] First heartbeat sent, starting to beat every 41250ms
[251] The gateway closed with an unexpected code 1006, attempting to resume.
[251] Destroying shard
[251] Connection status during destroy
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.34028765017804896; waiting 14036ms
[251] Waiting for identify throttle
[251] First heartbeat sent, starting to beat every 41250ms
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] Ready
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] More than one auth payload was sent.
[251] Destroying shard
[251] Connection status during destroy
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.49264618846197816; waiting 20321ms
[251] Waiting for identify throttle
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] Ready
[251] First heartbeat sent, starting to beat every 41250ms
```
#

got it

#

oh noo

#

i didnt extract everything if its multiline

#

ugh

stable hatch
#

i think i can guess the issue

#

the identify throttle wait is never cancelled

#

even if the shard dies

dusty dove
#

ahhhh

#

yeah

#

I just saw it too

#

oh no

stable hatch
#

good luck

#

sounds like HELL to handle

dusty dove
#

yeah this looks like I need those abort controllers passed into the async queue

stable hatch
#

which you can do but lord.

dusty dove
#

actually no

#

i can just do a Promise.race in the shard

#

and it should be enough

#

worst case what ends up happening is the shard after waits a bit extra

#

if i dont do proper aborts

#

though

#

i could

#

nah ill just do my favorite "hack"

#
```
this.debug(['Waiting for identify throttle']);

const controller = new AbortController();

// Race the identify throttle against the shard closing; whichever settles first wins.
const interrupted = await Promise.race<boolean>([
    this.strategy.waitForIdentify(this.id).then(() => false),
    once(this, WebSocketShardEvents.Closed, { signal: controller.signal }).then(() => true),
]);

if (interrupted) {
    // The shard closed while queued, so it must not identify on the stale connection.
    this.debug(['Was waiting for an identify, but the shard closed in the meantime']);
    return;
}

// clean up the once listener
controller.abort();
```
@stable hatch lol
#

should be fixed now

stable hatch
#

LOL

dusty dove
#

(we did this properly after all since kyra was moaning about it)

stable hatch
#

why do people have to moan to do something

dusty dove
#

merged

#

just need to wait for release now

dim oracle
#

Alright, will probably test on Monday though

rare shard
#

We also added an AbortSignal parameter, I hope you can handle it someway, @sullen snow

sullen snow
#

yeah I can probably connect that in some sort

dusty dove
#

lol i mean

#

if you don't handle it

#

things will break

sullen snow
#

Scary means I need to reconfigure the thread to throw errors and handle the abort signal eh

dusty dove
#

the reason why we needed it was because apparently if a shard closed while it was waiting for an identify

#

the shard would duplicate its connection

#

yeah just look at how I do it in the worker sharding strategy

#

you can probs follow a similar pattern

sullen snow
#

though on your case its just passed on async queue?

#

then let the function throw an error

dusty dove
#

yeah

#

but I meant how it gets there

#

since it's cross-thread

#

though now that i think about it

#

you are still using the worker strategy

sullen snow
#

with our scale only worker sharding is viable

dusty dove
#

you just need to add the param to your throttler

#

and figure out how it should work

#

i guess

sullen snow
#

my confusion just arises from what the abort signal does

#

when it emits, and how it should interact with waitForIdentify

dusty dove
#

when controller.abort() is called the signal fires an event

#

in my case that's handled in the async queue and I just let it throw

sullen snow
#

cause our identify handling never really needs to be cancelled, it would just clear up, then let other shards get the identify

dusty dove
#

that's still a cancel though

#

like

sullen snow
#

so my option is: when the abort signal fires, abort the thread's wait and reject the promise

dusty dove
#

shard starts connecting and needs an identify

#

and then the shard tells you it no longer wants the identify

#

since the connection closed

#

that is cancellation

#

it frees up the identify it wanted for the next shard in line, though, yes

rare shard
#

It's better to handle the AbortSignal in some way because it lets you clean up resources and let the next entry go thru asap

#

If you have a blocking mechanism like a queue, and you don't free it up for a cancelled entry, you'll block the following entry unnecessarily

sullen snow
#

and d.js manager also needs to know if the waitForIdentify throws an error?

dusty dove
#

yes

rare shard
#

It's true many APIs don't support AbortSignal, but a lot of them support a way to cancel/abort

dusty dove
#

if you get told to abort

#

you need to throw

#

and free your lock so another shard can grab that identify

#

that's basically it
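
A minimal sketch of that rule, assuming a simple in-process queue. The class name `AbortableIdentifyQueue` and its `wait`/`release` methods are hypothetical and are not the @discordjs/ws API. A waiter that gets aborted rejects and takes itself out of line, so the next shard is not stuck behind a dead entry. It only models the queueing, not Discord's actual identify rate limit.

```typescript
// Hypothetical throttler sketch; names are illustrative, not the library's API.
class AbortableIdentifyQueue {
    private held = false;
    private readonly waiters: Array<() => void> = [];

    // Number of shards currently queued behind the holder.
    public get waiting(): number {
        return this.waiters.length;
    }

    // Resolves once the caller holds the identify slot; rejects if `signal` aborts first.
    public wait(signal?: AbortSignal): Promise<void> {
        if (signal?.aborted) return Promise.reject(new Error('Identify wait aborted'));

        if (!this.held) {
            this.held = true;
            return Promise.resolve();
        }

        return new Promise<void>((resolve, reject) => {
            const onAbort = () => {
                // Free our place in line so the next shard can grab the identify.
                const index = this.waiters.indexOf(grant);
                if (index !== -1) this.waiters.splice(index, 1);
                reject(new Error('Identify wait aborted'));
            };

            const grant = () => {
                signal?.removeEventListener('abort', onAbort);
                resolve();
            };

            signal?.addEventListener('abort', onAbort, { once: true });
            this.waiters.push(grant);
        });
    }

    // Called by the current holder once its identify is done.
    public release(): void {
        const next = this.waiters.shift();
        if (next) next(); // the slot passes straight to the next waiter
        else this.held = false;
    }
}
```

The caller is expected to catch the rejection and simply stop; by then the shard is already closed and going through its own reconnect path.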

sullen snow
#

hmmGe we'll see how I can cook it up on the redis since that's also offloaded on another thread

#

cause for some magic node.js reason

#

a while loop, even when the function inside of it is async

#

blocks the event loop
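
One way this can happen (an assumption about the cause, since the original loop is not shown): if every `await` inside the loop resolves immediately, the loop resumes as a microtask each time, and timers and I/O never get a turn until it ends. A small sketch, with `spin` as a hypothetical stand-in for such a loop:

```typescript
// Each `await` here resumes as a microtask, so the loop never yields to timers or I/O.
async function spin(iterations: number): Promise<void> {
    for (let i = 0; i < iterations; i++) {
        await Promise.resolve();
    }
}

const order: string[] = [];

// Scheduled first, but it cannot fire until the microtask queue is empty.
setTimeout(() => order.push('timer'), 0);

const done = spin(10000).then(() => {
    order.push('loop finished');
});
```

Awaiting something that really defers to the event loop (a timer or actual I/O) inside the loop lets other work run between iterations.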

#

one last thing

#

signal: AbortSignal this is just the signal and not the whole abort controller class?

dusty dove
#

yup

#

though the global types on it are incomplete for some reason

#

I had to hack this in

#
```
// Because the global types are incomplete for whatever reason
interface PolyFillAbortSignal {
    readonly aborted: boolean;
    addEventListener(type: 'abort', listener: () => void): void;
    removeEventListener(type: 'abort', listener: () => void): void;
}
```
#

and I do (signal as unknown as PolyFillAbortSignal).addEventListener('abort', listener);

#

lol

sullen snow
#

does this emit an event of some sort

dusty dove
#

yes, abort

#

when controller.abort() is called signal's abort event fires
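
A tiny sketch of that relationship, assuming a runtime with the standard global `AbortController`: the controller stays with whoever decides to cancel, and only its `signal` is handed out to be listened on.

```typescript
const controller = new AbortController();
const { signal } = controller; // this is the part you pass around

const events: string[] = [];

// The consumer only ever sees the signal, never the controller.
signal.addEventListener('abort', () => events.push('abort fired'), { once: true });

events.push(`aborted before: ${signal.aborted}`);
controller.abort(); // listeners run synchronously during this call
events.push(`aborted after: ${signal.aborted}`);

console.log(events);
```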

sullen snow
#

oh, ok that makes things a bit easy I guess

#

so in your code, it's just passed to the "queue" instance, then when abort is called the queue rejects, and the rejection propagates to where waitForIdentify is called?

dusty dove
#

yeah, basically

sullen snow
#

ok thanks I'll cook something up

sullen snow
#

also forget that promise reject, was due to original impl. without the abort signal KEKW

sullen snow
#

yes what I mean on that shard what it will do

#

reconnect, or leave it hanging

dusty dove
#

well it should never throw in the first place unless the shard aborted it

#

lol

#

and it only aborts it if it closed in the meantime

#

so it's already reconnecting by that point

sullen snow
#

since its return, oh nvm, you have another handler on closed do you?

sullen snow
#

thanks for the idea actually, it looks a lot cleaner if the promisify class handles the cancellation internally

rare shard
#

No problem ^^

stable hatch
#

Otherwise you're spawning 270? threads?

dim oracle
#

Not sure how the math works but we've got 1440 shards with shardsPerWorker at 1

sullen snow
#

i could but then again, i really like each websocket to have its own thread so i can be assured it is as fast as it can be

#

may change in future but the memory penalty is so negligible

rare shard
#

Spawning 40 threads on a 10 core CPU won't make it magically perform better than spawning 20

rare shard
#

There's only so much the hardware can do

#

The reason why Kyoso can run 1440 threads and be just fine, is because /ws is very lightweight and uses very little resources per worker

#

Think about it, the reason you two run so many threads is because of sockets blocking other sockets on the same thread

#

But if you have a 10-thread CPU, it can only run 10 workers at a given time. If you have 1440 workers, 1430 will be idling and waiting for the OS's scheduler to give them a chance to run, and that happens far more frequently than if you have one worker per CPU thread (where it virtually never happens, and all sockets basically run with little to no interruption)

stable hatch
#

I mean that said, cramming 1440 in like 10 threads is also not ideal

#

Speaking from experience, cramming that many ws connections (~144/thread) in one single process/thread will cripple it

rare shard
#

Realistically, such large bots are likelier to run on servers with a lot more cores. If they have 64 cores, 1440 shards will require only 23 (22.5 rounded up) sockets per thread
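
The arithmetic behind those numbers, as a quick sketch. The helper names are made up for illustration, and 16 shards per worker is only an example value.

```typescript
// Shards each worker must carry when `totalShards` are spread over `workers` threads.
function shardsPerWorker(totalShards: number, workers: number): number {
    return Math.ceil(totalShards / workers);
}

// Worker threads needed when each one carries `perWorker` shards.
function workersNeeded(totalShards: number, perWorker: number): number {
    return Math.ceil(totalShards / perWorker);
}

console.log(shardsPerWorker(1440, 64)); // 23 (22.5 rounded up)
console.log(shardsPerWorker(1440, 10)); // 144
console.log(workersNeeded(1440, 16)); // 90
console.log(workersNeeded(1440, 1)); // 1440
```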

#

And many services force you to increase CPU core count when increasing RAM count, so to account for the RAM needs large bots need, together with the amount of CPU required to run so many sockets...

stable hatch
#

64 cores is not cheap

#

Anywhere

#

My bigger worry is if threads stay alive when the parent dies

rare shard
#

The only positive side I see of running more workers than CPU threads, is that the GC has less memory to sweep

stable hatch
#

Not like shard threads keep much in ram that needs GC

rare shard
#

But even so, depending on how the objects are managed, it's possible and likely that the scavenger deals with basically almost all the objects

#

So the GC does little to nothing

dim oracle
#

We have 36c/72t

#

But CPU isn't really an issue

stable hatch
#

Nice sin wave

dim oracle
#

thanks

stable hatch
#

Wouldn't recommend it

#

I'd go for max concurrency then

dim oracle
#

I mean we used to run all just fine

stable hatch
#

That's 1440 shards in one thread

#

Badddd idea

dim oracle
#

nah we have clusters, so probably 24 per worker?

stable hatch
#

I'd go for powers of 2

#

..aka 16

#

Or 32

#

Whichever floats your boat

#

Tho you have uh, around 90 shards per ratelimit key

dim oracle
#

eh I'm not concerned about the startup/concurrency ratelimits

#

that'll work fine in whatever shardcount/shards per worker I run

stable hatch
#

Fair

rare shard
#

I think 16 is a sweet spot

#

But then again WS is capable of handling far more than one would give credit for, my main bot runs 14 internal shards in Discord.js v13, and Discord.js puts a heavy overhead on every message it gets, plus it's using a lot of intents, yet it's handling everything fine and the event loop latency is unnoticeable

#

Raw /ws should be able to handle much more, so I think you'd be fine even with 20 shards per worker

dim oracle
#

this is our eventloop with 1 shard per worker atm

dusty dove
#

me when still no custom rust erlpack

rare shard
#

Not like erlpack would fix the performance, DD meguFace

#

The optional WS dependencies do, tho

dusty dove
#

no i know im just harassing you for not doing things

dusty dove
#

thats not stable yet is it

#

last i looked at it it def wasnt useable lmao

rare shard
#

I don't think it's released

#

Oh I think that even if it's released, the performance would suck because it uses the McBloaty web event API, which has even worse performance than pre-fixed AEE

sullen snow
#

even you run 1 or all

#

but then since we now use json encoding, cpu usage is definitely a thing now

#

I do have experience with the original d.js ws, and with your original code before I refactored it, where the bot was literally screaming on cpu usage because the js event loop was getting overloaded by JSON.stringify and JSON.parse, and erlpack did fix it. we only dropped it now because you can run the ws in its own "thread"

#

besides I don't see any issue on running a bit of threads on a dedicated container since most of our pcs anyways even handle more than that amount of threads without issues

#

like even with 56 shards / cluster, discord.js was able to run flawlessly with 1 shard per thread on /ws

dusty dove
#

yeah ure lucky im so hot and talented :^)

dusty dove
#

?

#

they work

#

i dont need to do anything special to "support" those

sullen snow
#

i dont know / think so (?)

dusty dove
#

it's purely on the ws package

#

if you install them ws uses them

#

it just wasn't in the README until some point

sullen snow
#

not sure since idk if ws even use it

#

but anyways, on that kind of guild / process, stock d.js will also not work

#

even with probably less cache, still it may not work

sullen snow
#

the amount of overhead d.js have will just make the whole process blow

#

can you even check if the ws use that

#

not the /ws package but the ws package itself

dusty dove
#

.. it does

#

this is literally how discord.js "implements" bufferutil and utf-8-validate too

#

we don't do anything

#

we just tell you you can install them and they'll be used

sullen snow
#

oh well, probably it works then

sullen snow
#

can't tell but yeah

#

i keep both the buffer util and utf 8 on my package file

#

so I guess ws should use it

stable hatch
#

Json parse can be very fast

sullen snow
#

it's crucial for us to be stable, and until I get verification that the gc spikes are now fixed, we can't run it in prod

stable hatch
#

Aren't you guys already running it

stable hatch
#

Also you wouldn't notice the gc spikes (if there's any) since they're on different threads

stable hatch
#

Hmm, strange

sullen snow
#

i have an old message here

#

showing the issue

#

what we used is erlpack which works fine

rare shard
#

Oh, so zlib-sync makes the GC go brr?

sullen snow
#

yes, if nothing has fundamentally changed since I noticed that issue

rare shard
#

I'll probably look into making a replacement for that library using CF's zlib sometime this summer, same for erlpack (but in Rust), although the latter will probably happen first

stable hatch
#

Etf shouldn't be the priority @rare shard

rare shard
#

I know it isn't, but it's more fun to write

upbeat ermine
#

If I am looking at my logs right, this error seems to cause the rest of my shards to eventually spiral out causing a full bot crash, but I could be misinterpreting this as the main cause.

dusty dove
#

if you're having issues id prefer you open them on github with full logs

#

i cant keep track of so many people in the discord

dim oracle
#

While this thread has been reopened I might as well mention that we haven't had any issues since the last one 🙏

#

It also seems like a few more people joined in

dusty dove
#

i dont mind

#

its just like, i want to limit this thread to specific convos

#

but now that /ws is in mainlib

#

i dont want like everyone to come with issues here

#

github is still best when theres many of them so i can track it

upbeat ermine
#

I apologize, I was just searching for similar errors and came across this. I figured since it was relating to big bots, and with my bot also being a big bot, this would be an appropriate place to ask. I've made some comments on this Github issue:
https://github.com/discordjs/discord.js/issues/9139


dusty dove
#

all of the above

dim oracle
#

side note, I will start to donate a bit now and then via Open Collective and I have stopped supporting on Hydra's Patreon so I'll probably lose the sponsor role, anyhow first $200 is sent for now

dusty dove
#

if its auto removed by patreon just DM crawl w a transaction from OC and he'll grant it back

#

that's the process

#

oh actually we have a command for it now

dim oracle
#

The Patreon bot is pretty ass anyways so I doubt it'll actually remove it correctly

dusty dove
#

yeah lol

dim oracle
#

Alright so we seem to have another issue: a small spike where a lot of shards disconnected and resumed, but in all of this one shard never recovered and has been stuck reconnecting since then. These are the logs around that specific shard; the ECONNRESET is most likely related but I'm not 100% sure. After these logs the shard is never seen again and is stuck reconnecting like I said.

#

There were about 11 ECONNRESET errors around that time

dusty dove
#

i love networking issues

dusty dove
#

"waiting for ready" is the last thing you got?

dim oracle
#

Yeah

#

I don't really have a decent way to access all of the logs (log file is 60gb atm) but the "waiting for ready" line is at 06:13:17 and the snippet I got goes until 08:30:55

#

and no mention of that shard within that timespan

dusty dove
#

gotcha

#

yeah ive seen some similar behavior described

#

im a bit unsure whats going on there

#

realistically that should timeout and throw

#

all signs point to this just being the worker dying entirely

#

but im unsure why

#

if it dies to an unhandled exception, unless you guys messed with the worker strategy, that should cause your main process to re-throw the error and therefore quit entirely

#

but i cant imagine what it could be dying to that isnt an unhandled exception

#

sounds like a graceful exit?

#

@dim oracle do you guys have your own worker strategy still

#

think you could do me a favor and attach an exit log to the worker since i dont

#

and log when it fires

#

and when you run into this again check if it did fire
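
A rough sketch of what such an exit log could look like, assuming you can get at the `Worker` instances your own strategy creates. The `attachExitLog` helper and its `label` parameter are hypothetical; only the `exit` event from `node:worker_threads` is relied on.

```typescript
import { Worker } from 'node:worker_threads';

// Logs when a worker thread exits, with its exit code and a timestamp.
// Deliberately adds no 'error' listener: that would swallow the unhandled
// exception that is otherwise re-thrown on the main thread.
function attachExitLog(
    worker: Worker,
    label: string,
    log: (line: string) => void = console.log,
): void {
    worker.once('exit', (code) => {
        log(`[${label}] worker exited with code ${code} at ${new Date().toISOString()}`);
    });
}
```

If a shard goes silent again and no exit line was logged for its worker, the thread is probably still alive and the hang is somewhere else.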

dim oracle
#

cc @sullen snow
