#discordjs/ws big bot memes (old)
since the ws actually heartbeats
this is on 50 shards per cluster with 1x concurrency
so the non identified shards didnt dc on this time frame
oh wait
do heartbeats keep it alive?
@dusty dove i'm testing my hypothesis by spawning 255 shards
uh
wat
```
[251] Identifying
	shard id: 251
	shard count: 256
	intents: 0
	compression: none
[251] Waiting for event ready for 15000ms
[251] Ready
Connected
[251] Identifying
	shard id: 251
	shard count: 256
	intents: 0
	compression: none
[251] Waiting for event ready for 15000ms
[251] More than one auth payload was sent.
[251] Destroying shard
	Reason: none
	Code: 4005
	Recover: Reconnect
```

only shard that had this issue somehow
no idea what that's all about
on your PR
yeah they should
yeah me neither, somehow it happened right after the connect promise resolved 
The gateway closed with an unexpected code 1006
god i love the internet
do u have more logs for this shard
this looks like it double-destroyed
funny you should ask that
have fun
256 shards
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.5238143187542947; waiting 21607ms
[251] Waiting for identify throttle
[251] First heartbeat sent, starting to beat every 41250ms
[251] The gateway closed with an unexpected code 1006, attempting to resume.
[251] Destroying shard
[251] Connection status during destroy
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.34028765017804896; waiting 14036ms
[251] Waiting for identify throttle
[251] First heartbeat sent, starting to beat every 41250ms
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] Ready
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] More than one auth payload was sent.
[251] Destroying shard
[251] Connection status during destroy
[251] Connecting to wss://gateway.discord.gg?v=10&encoding=json
[251] Waiting for event hello for 10000ms
[251] Preparing first heartbeat of the connection with a jitter of 0.49264618846197816; waiting 20321ms
[251] Waiting for identify throttle
[251] Identifying
[251] Waiting for event ready for 15000ms
[251] Ready
[251] First heartbeat sent, starting to beat every 41250ms
got it
oh noo
i didnt extract everything if its multiline
ugh
i think i can guess the issue
the identify throttle wait is never cancelled
even if the shard dies
which you can do but lord.
actually no
i can just do a Promise.race in the shard
and it should be enough
worst case what ends up happening is the shard after waits a bit extra
if i dont do proper aborts
though
i could
nah ill just do my favorite "hack"
```ts
this.debug(['Waiting for identify throttle']);
const controller = new AbortController();
const interrupted = await Promise.race<boolean>([
	this.strategy.waitForIdentify(this.id).then(() => false),
	once(this, WebSocketShardEvents.Closed, { signal: controller.signal }).then(() => true),
]);
if (interrupted) {
	this.debug(['Was waiting for an identify, but the shard closed in the meantime']);
	return;
}
// clean up the once listener
controller.abort();
```
@stable hatch lol
should be fixed now
LOL
(we did this properly after all since kyra was moaning about it)
why do people have to moan to do something
Alright, will probably test on Monday though
https://safe.saya.moe/6fd72iliioxi.png thats a lot cleaner, lets hope this works 
We also added an AbortSignal parameter, I hope you can handle it someway, @sullen snow
yeah I can probably connect that in some sort
means I need to reconfigure the thread to throw errors and handle the abort signal eh
the reason why we needed it was because apparently if a shard closed while it was waiting for an identify
the shard would duplicate its connection

yeah just look at how I do it in the worker sharding strategy
you can probs follow a similar pattern
though on your case its just passed on async queue?
then let the function throw an error
yeah
but I meant how it gets there
since it's cross-thread
though now that i think about it
you are still using the worker strategy
with our scale only worker sharding is viable
you just need to add the param to your throttler
and figure out how it should work
i guess
my confusion just arises from: what does the abort signal do,
when it emits, and how it should interact with waitForIdentify
when controller.abort() is called the signal fires an event
in my case that's handled in the async queue and I just let it throw
cause our identify handling never really needs to be cancelled, it would just clear up, then let other shards get the identify
so my option is: when the abort signal fires, abort the thread's wait and reject the promise
shard starts connecting and needs an identify
and then the shard tells you it no longer wants the identify
since the connection closed
that is cancellation
it frees up the identify it wanted for the next shard in line, though, yes
It's better to somehow handle the AbortSignal in some way because it lets you clean up resources and let the next entry go thru asap
If you have a blocking mechanism like a queue, and you don't free it up for a cancelled entry, you'll block the following entry unnecessarily
and d.js manager also needs to know if the waitForIdentify throws an error?
yes
It's true many APIs don't support AS, but a lot of them support a way to cancel/abort
if you get told to abort
you need to throw
and free your lock so another shard can grab that identify
that's basically it
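A minimal sketch of that rule, assuming a hypothetical in-process throttler. `SimpleIdentifyThrottler`, `abortable`, and the fixed `delayMs` are illustrative names, not the actual @discordjs/ws implementation. Each caller waits for its turn; if its `AbortSignal` fires while it is queued, the wait rejects and its place is handed straight to the next caller:

```ts
// Rejects as soon as `signal` aborts, otherwise settles with `promise`.
function abortable<T>(promise: Promise<T>, signal: AbortSignal): Promise<T> {
	return new Promise<T>((resolve, reject) => {
		if (signal.aborted) {
			reject(new Error('Identify wait aborted'));
			return;
		}
		const onAbort = () => reject(new Error('Identify wait aborted'));
		signal.addEventListener('abort', onAbort, { once: true });
		promise.then(
			(value) => {
				signal.removeEventListener('abort', onAbort);
				resolve(value);
			},
			(error) => {
				signal.removeEventListener('abort', onAbort);
				reject(error);
			},
		);
	});
}

// Hypothetical throttler: callers are served in order, one identify per `delayMs`.
class SimpleIdentifyThrottler {
	private tail: Promise<void> = Promise.resolve();

	public constructor(private readonly delayMs: number) {}

	public async waitForIdentify(signal: AbortSignal): Promise<void> {
		const previous = this.tail;
		let release!: () => void;
		this.tail = new Promise<void>((resolve) => {
			release = resolve;
		});

		try {
			// Wait until the caller in front of us has released the slot.
			await abortable(previous, signal);
		} catch (error) {
			// Told to abort: free our place so the next shard is not blocked, then throw.
			void previous.then(release);
			throw error;
		}

		// We hold the identify now; the next caller may go after `delayMs`.
		setTimeout(release, this.delayMs);
	}
}
```

The shard side would call `waitForIdentify(controller.signal)` and call `controller.abort()` when its connection closes; a cross-thread or Redis-backed throttler would need the same two steps (reject, then release) in whatever form its lock takes.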
we'll see how I can cook it up on the redis since thats also offloaded on another thread
cause for some magic node.js reason
a while loop, even when the function inside of it is async,
blocks the event loop
one last thing
signal: AbortSignal this is just the signal and not the whole abort controller class?
yup
though the global types on it are incomplete for some reason
I had to hack this in
```ts
// Because the global types are incomplete for whatever reason
interface PolyFillAbortSignal {
	readonly aborted: boolean;
	addEventListener(type: 'abort', listener: () => void): void;
	removeEventListener(type: 'abort', listener: () => void): void;
}
```
and I do (signal as unknown as PolyFillAbortSignal).addEventListener('abort', listener);
lol
does this emit an event of some sort
oh, ok that makes things a bit easy I guess

so in your code, it's just passed on to the "queue" instance, then when abort is called, the queue will reject, and the rejection propagates to where waitForIdentify is called?
yeah, basically
ok thanks I'll cook something up :
https://safe.saya.moe/o0u3crqcu65u.png here I assume this is that, also one thing, if waitForIdentify throws an error, what happens?
also forget that promise reject, was due to original impl. without the abort signal 
well it should never throw in the first place unless the shard aborted it
lol
and it only aborts it if it closed in the meantime
so it's already reconnecting by that point
since its return, oh nvm, you have another handler on closed do you?
https://safe.saya.moe/ar5u7cl35jir.png this should do it thanks
I'm curious, idk what promisify.send is, but does it not support a signal/is not trivial to do?
just a personal class for making promise based ipc, now you mentioned it, I could technically
thanks for the idea actually, it looks a lot cleaner if the promisify class handles the cancellation internally
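A rough sketch of what "the promisify class handles the cancellation internally" could look like. The real `promisify.send` is not shown in this thread, so `PromisifiedChannel`, `send`, `handleReply`, and the message shape are all hypothetical:

```ts
type Post = (message: { id: number; payload: unknown }) => void;

// Hypothetical promise-based IPC helper with built-in AbortSignal support.
class PromisifiedChannel {
	private nextId = 0;
	private readonly pending = new Map<number, (value: unknown) => void>();

	public constructor(private readonly post: Post) {}

	public get pendingCount(): number {
		return this.pending.size;
	}

	public send(payload: unknown, signal?: AbortSignal): Promise<unknown> {
		return new Promise<unknown>((resolve, reject) => {
			if (signal?.aborted) {
				reject(new Error('Request aborted'));
				return;
			}

			const id = this.nextId++;
			const onAbort = () => {
				// Drop the entry so a late reply is ignored, then reject the caller.
				this.pending.delete(id);
				reject(new Error('Request aborted'));
			};
			signal?.addEventListener('abort', onAbort, { once: true });

			this.pending.set(id, (value) => {
				signal?.removeEventListener('abort', onAbort);
				resolve(value);
			});
			this.post({ id, payload });
		});
	}

	// Call this from the transport's message handler when a reply arrives.
	public handleReply(id: number, value: unknown): void {
		const resolve = this.pending.get(id);
		if (!resolve) return; // reply for a cancelled or unknown request
		this.pending.delete(id);
		resolve(value);
	}
}
```

With this shape the caller only forwards the signal it was given, and the other thread can additionally be told to free its lock when the abort fires.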
No problem ^^
You could really bump shardsPerWorker to like 4
Otherwise you're spawning 270? threads?
Not sure how the math works but we've got 1440 shards with shardsPerWorker at 1
Then you have... 1440 threads
That seems a bit many? I don't know much about our websocket stuff but would that still be the case when we've got a cluster setup? (Basically like Kurasuta)
i could but then again, i really like each websocket to have its own thread so i can be assured it is as fast as it can be
may change in future but the memory penalty is so negligible
No, Kurasuta uses as many threads as there are cores
Spawning 40 threads on a 10 core CPU won't make it magically perform better than spawning 20
I mean if there's no performance increase if we go above the CPU's thread count then we might as well just use the same amount of threads as the CPU has
There's only so much the hardware can do
The reason why Kyoso can run 1440 threads and be just fine, is because /ws is very lightweight and uses very little resources per worker
Think about it, the reason you two run so many threads is because sockets block other sockets on the same thread
But if you have a 10T CPU, it can only run 10 workers at any given time. If you have 1440 workers, 1430 will be idling and waiting for the OS's scheduler to give them a chance to run, and that waiting happens way more often than with one worker per CPU thread (where it virtually never happens, all sockets would basically run with little to no stop)
I mean that said, cramming 1440 in like 10 threads is also not ideal
Speaking from experience, cramming that many ws connections (~144/thread) in one single process/thread will cripple it
Realistically, such large bots are likelier to run on servers with a lot more cores. If they have 64 cores, 1440 shards will require only 23 (22.5 rounded up) sockets per thread
And many services force you to increase CPU core count when increasing RAM count, so to account for the RAM needs large bots need, together with the amount of CPU required to run so many sockets...
64 cores is not cheap
Anywhere
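The per-thread numbers above are plain division, shown here as a quick sketch (`shardsPerThread` is a hypothetical helper, and it assumes shards are spread evenly with one worker per CPU thread):

```ts
// Hypothetical helper: how many shards each thread hosts if shards are spread evenly.
function shardsPerThread(totalShards: number, threads: number): number {
	return Math.ceil(totalShards / threads);
}

console.log(shardsPerThread(1440, 10)); // 144
console.log(shardsPerThread(1440, 64)); // 23 (22.5 rounded up)
```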
My bigger worry is if threads stay alive when the parent dies
The only positive side I see of running more workers than CPU threads, is that the GC has less memory to sweep
Not like shard threads keep much in ram that needs GC
But even so, depending on how the objects are managed, it's possible and likely that the scavenger deals with basically almost all the objects
So the GC does little to nothing
Nice sin wave
thanks
So you can have ~20 shards per worker
Wouldn't recommend it
I'd go for max concurrency then
I mean we used to run all just fine
nah we have clusters, so probably 24 per worker?
I'd go for powers of 2
..aka 16
Or 32
Whichever floats your boat
Tho you have uh, around 90 shards per ratelimit key
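For reference, Discord documents the identify bucket as `rate_limit_key = shard_id % max_concurrency`. The sketch below assumes 1440 shards and a `max_concurrency` of 16; the 16 is an assumption, picked because it is the value that yields about 90 shards per key. `groupByRateLimitKey` is a hypothetical helper:

```ts
// Groups shard ids by Discord's documented bucket: shard_id % max_concurrency.
function groupByRateLimitKey(shardCount: number, maxConcurrency: number): Map<number, number[]> {
	const buckets = new Map<number, number[]>();
	for (let shardId = 0; shardId < shardCount; shardId++) {
		const key = shardId % maxConcurrency;
		const bucket = buckets.get(key) ?? [];
		bucket.push(shardId);
		buckets.set(key, bucket);
	}
	return buckets;
}

// Assumed values: 1440 shards, max_concurrency of 16.
const byKey = groupByRateLimitKey(1440, 16);
console.log(byKey.size); // 16
console.log(byKey.get(0)?.length); // 90
```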
eh I'm not concerned about the startup/concurrency ratelimits
that'll work fine in whatever shardcount/shards per worker I run
Fair
I think 16 is a sweet spot
But then again WS is capable of handling far more than one would give credit for, my main bot runs 14 internal shards in Discord.js v13, and Discord.js puts a heavy overhead on every message it gets, plus it's using a lot of intents, yet it's handling everything fine and the event loop latency is unnoticeable
Raw /ws should be able to handle much more, so I think you'd be fine even with 20 shards per worker
this is our eventloop with 1 shard per worker atm
me when still no custom rust erlpack
no i know im just harassing you for not doing things
I don't think it's released
Oh I think that even if it's released, the performance would suck because it uses the McBloaty web event API, which has even worse performance than pre-fixed AEE
it should not affect anything
even if you run 1 or all
but then since we now use json encoding, cpu usage is definitely a thing now
I did have experience with the original d.js ws, and also your original code before I refactored it, where the bot was literally screaming on cpu usage due to the js event loop getting overloaded by JSON.stringify and JSON.parse, and erlpack did fix it. we just dropped it now because you can run the ws in its own "thread"
besides I don't see any issue on running a bit of threads on a dedicated container since most of our pcs anyways even handle more than that amount of threads without issues
like even with 56 shards / cluster, discord.js was able to run flawlessly with 1 shard per thread on /ws
reference: https://grafana.saya.moe/d/kashima-is-a-good-girl/kashima-cluster?orgId=1&refresh=5s https://safe.saya.moe/zcltakyivkcw.png
would drop this down to 4 clusters again, but then again the memory penalty is very unnoticeable to the point I'd rather just have each ws on its own event loop to ensure their stability
yeah ure lucky im so hot and talented :^)
will be saving that until the optional deps (bufferutil and utf-8-validate) are on /ws
i dont know / think so (?)
it's purely on the ws package
if you install them ws uses them
it just wasn't in the README until some point
not sure since idk if ws even uses it

but anyways, on that kind of guild / process, stock d.js will also not work
even with probably less cache, still it may not work
the amount of overhead d.js has will just make the whole process blow up
can you even check if ws uses that
not the /ws package but the ws package itself
.. it does
this is literally how discord.js "implements" bufferutil and utf-8-validate too
we don't do anything
we just tell you you can install them and they'll be used
oh well, probably it works then
can't tell but yeah
i keep both the buffer util and utf 8 on my package file
so I guess ws should use it
Your issue was not using zlib-sync
Json parse can be very fast
Yep, you can see it in ws/lib/buffer-util.js and ws/lib/validation.js
but then again are the gc spikes fixed on that
its crucial for us to be stable, and until I get verification that the gc spikes are now fixed on that, we can't run it in prod
Aren't you guys already running it
Also you wouldn't notice the gc spikes (if there's any) since they're on different threads
yes but since it increases over time, it will eventually lead to a very slow thread
Hmm, strange
i have an old message here
showing the issue
what we used is erlpack which works fine
Oh, so zlib-sync makes the GC go brr?
yes, if nothing has fundamentally changed since I noticed that issue
I'll probably look into making a replacement for that library using CF's zlib somewhere this summer, same for erlpack (but in Rust), although the latter will probably happen first
Etf shouldn't be the priority @rare shard
I know it isn't, but it's more fun to write
What was the solution to this? I also got this error code 1006 issue 👀
If I am looking at my logs right, this error seems to cause the rest of my shards to eventually spiral out causing a full bot crash, but I could be misinterpreting this as the main cause.
please dont necro a month old issue
if you're having issues id prefer you open them on github with full logs
i cant keep track of so many people in the discord
While this thread has been reopened I might as well mention that we haven't had any issues since the last one 🙏
It also seems like a few more people joined in
i dont mind
its just like, i want to limit this thread to specific convos
but now that /ws is in mainlib
i dont want like everyone to come with issues here
github is still best when theres many of them so i can track it
I apologize, I was just searching for similar errors and came across this. I figured since it was relating to big bots, and with my bot also being a big bot, this would be an appropriate place to ask. I've made some comments on this Github issue:
https://github.com/discordjs/discord.js/issues/9139
The logs you've posted show a completely normal WS shard resume so unless there's logs you haven't shown then I don't think that's the right issue, if you really think it is related to d.js then its probably best to create a new issue with more info.
Also anything above 150k guilds is considered a "big bot" since you will get access to big bot sharding then ^^
all of the above
side note, I will start to donate a bit now and then via Open Collective and I have stopped supporting on Hydra's Patreon so I'll probably lose the sponsor role, anyhow first $200 is sent for now
we grant the sponsor role over OC contributions anyway
if its auto removed by patreon just DM crawl w a transaction from OC and he'll grant it back
that's the process
oh actually we have a command for it now
The Patreon bot is pretty ass anyways so I doubt it'll actually remove it correctly
yeah lol
Alright so we seem to have another issue, a small spike where a lot of shards disconnected and resumed but in all of this one shard never recovered and is stuck reconnecting since then. These are the logs around that specific shard, the ECONNRESET is most likely related but not 100% sure. After these logs the shard is never seen again and is stuck reconnecting like I said.
There were about 11 ECONNRESET errors around that time
i love networking issues
wait so its never seen again in logs as in
"waiting for ready" is the last thing you got?
Yeah
I don't really have a decent way to access all of the logs (log file is 60gb atm) but the "waiting for ready" line is at 06:13:17 and the snippet I got goes until 08:30:55
and no mention of that shard within that timespan
gotcha
yeah ive seen some similar behavior described
im a bit unsure whats going on there
realistically that should timeout and throw
all signs point to this just being the worker dying entirely
but im unsure why
if it dies to an unhandled exception unless you guys messed with the worker strategy that should cause your main process to re-throw the error and therefore quit entirely
but i cant imagine what it could be dying to that isnt an unhandled exception
sounds like a graceful exit?
@dim oracle do you guys have your own worker strategy still
think you could do me a favor and attach an exit log to the worker since i dont
and log when it fires
and when you run into this again check if it did fire
cc @sullen snow
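One possible shape for that exit log, as a sketch only. `attachExitLog` and `ExitEmitter` are hypothetical names, and how you get hold of the worker object depends on your own strategy. Node's `worker_threads` `Worker` emits an `'exit'` event with the exit code, which is all this relies on:

```ts
// Structural type so this works with a node:worker_threads Worker or any look-alike.
interface ExitEmitter {
	on(event: 'exit', listener: (exitCode: number) => void): unknown;
}

// Hypothetical helper: logs a timestamped line when the worker hosting `shardIds` exits.
function attachExitLog(
	worker: ExitEmitter,
	shardIds: readonly number[],
	log: (line: string) => void = console.log,
): void {
	worker.on('exit', (exitCode) => {
		log(`[${new Date().toISOString()}] worker for shards [${shardIds.join(', ')}] exited with code ${exitCode}`);
	});
}
```

Calling `attachExitLog(worker, [251])` right after the worker is created would show whether a shard that went silent after "Waiting for event ready" actually lost its thread.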
I have this exact same issue with my bot
So it's definitely not just you if that is helpful :)
Would be cool if you could include some more info you might have gathered from debug logs or anything ^^



