# Koyamie's caching hell

318 messages · Page 1 of 1 (latest)

autumn gazelle

cache sucks

weak kraken

hi

so basically here are my cache sizes

queen chasm

Don't disable the PermissionOverwriteManager, there's a reason altering it is unsupported

weak kraken

yeah was joking sorry lmao

autumn gazelle

purging users & messages are the key ones for me

bans also seems sensible to nuke to 0

weak kraken

yeah I use bans cache a lot

autumn gazelle

Ah

queen chasm

hmm

weak kraken

thinking about it, it doesn't seem that huge for a 1.4M guilds bot

autumn gazelle

What's your total usage again?

weak kraken

453GB right now

autumn gazelle

I'm at 3.5gb w/ 14k, so that roughly scales up
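
back-of-the-envelope, that's the per-guild arithmetic:

```js
// Quick check of "roughly scales up": MB of RAM per guild at both sizes.
const perGuildSmall = (3.5 * 1024) / 14_000;    // 3.5 GB across 14k guilds
const perGuildLarge = (453 * 1024) / 1_400_000; // 453 GB across 1.4M guilds
console.log(perGuildSmall.toFixed(2), perGuildLarge.toFixed(2)); // ~0.26 vs ~0.33 MB/guild
```

so the bigger bot pays a bit more per guild, but it's the same order of magnitude.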

weak kraken

but in fact my issue is more that since the v13 update my ram just keeps climbing until it hits the max memory

and that in a few hours

instead of 1 week before

my bot could run for 2 weeks straight without memory issues (it still got close to the limit at some moments, but not enough to kill the database and so kill the bot)

autumn gazelle

What caches seem to be climbing?

weak kraken

i'll need more time to see which, I just added caches to my metrics

  • I can see some shards suddenly growing in ram; average usage is 500MB and sometimes I see a shard using 2GB

like right now

coral orbit

Hey guys

weak kraken

what

weak kraken

made a quick graph so you see the difference

I have to reboot approx 2 times per day right now

it takes 8 hours to reach ram limit

it was taking ~9 days before v13 and was even working fine (see the 14-day graph), it was just flirting with the limits but never to the point that it would make the db crash (which is the case right now)

there is definitely something wrong

this is past 2 days

gusty moth

this is prior to the real 13 release:

each line is a shard

now:

and it will keep increasing until death

here are my relevant client options:
```js
intents:
  Intents.FLAGS.GUILDS |
  Intents.FLAGS.GUILD_MEMBERS |
  Intents.FLAGS.GUILD_BANS |
  Intents.FLAGS.GUILD_EMOJIS_AND_STICKERS |
  Intents.FLAGS.GUILD_MESSAGES |
  Intents.FLAGS.GUILD_MESSAGE_REACTIONS |
  Intents.FLAGS.DIRECT_MESSAGES,
makeCache: Options.cacheWithLimits({
  ...Options.defaultMakeCacheSettings,
  ApplicationCommandManager: 0,
  GuildInviteManager: 0,
  GuildMemberManager: {
    maxSize: 2_000,
    sweepInterval: 900,
  },
  PresenceManager: 0,
  StageInstanceManager: 0,
  UserManager: {
    maxSize: 2_000,
    sweepInterval: 900,
  },
  VoiceStateManager: 0,
}),
```

weak kraken

this is interesting

autumn gazelle

im holding pretty steady now @ 14k guilds

looking back 56 days ago, so roughly 2 months, i wasn't much different @ 12k guilds

Koyamie, have you got any plots of collection sizes that correlate to that increase in mem usage?

autumn gazelle

Similar to koyamie, please log collection sizes so we can see if anything is growing rapidly that might be the cause

gusty moth

you're right, i thought it had a default filter for some reason

weak kraken

guild members cache goes from 2M on boot to 6M when the bot reaches its limits

for users, 1M to 6M

weak kraken

heads up about this. I made a workaround to run my shards in a "hybrid" mode, or "clustering", and this has (for now) fixed my memory issues. I'll continue monitoring but it's heading in the right direction

gusty moth

cluster mode on the shardingmanager?

why does the up arrow put a sticker in the message

stupid discord

weak kraken

but doing that means I handle concurrency somewhere else, otherwise you can say hello to cf ban

weak kraken

I've got some news about this: using nodejs v16 seems to be the reason for the increased memory usage.
After talking to another dev, testing his d.js v12 code on nodejs v16 also significantly increases his memory usage.
We cannot try d.js v13 code on nodejs v14 because v16 is required, but I guess it could be possible if we just removed the v16 features from v13?

spare hemlock

We'll eventually have to change either way, sadly :/

It's not like we can stay in v12/v14 forever

weak kraken

ikr

I would say, as a temporary solution

on a fork

my question is more "is it technically possible with v13"

spare hemlock

It's really not

@queen chasm did you measure the memory impact of having potentially hundreds of thousands of WeakRefs?

(One MessageManager for each TextBasedChannel)

weak kraken

what's the reason behind that? do we use so many v16 features that it would break everything to stop using them?

spare hemlock

Yeah, WeakRef is v16+

weak kraken

ah

alright thanks

spare hemlock

We use it to cancel the sweep interval when it's removed from cache
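
a minimal sketch of that pattern (class and names made up here, not the actual d.js source):

```js
// Sketch: a Map whose sweep interval is cancelled once the cache itself
// is garbage-collected. The interval closes over a WeakRef instead of
// `this`, so the timer alone can't keep the cache alive, and the
// FinalizationRegistry clears the interval after collection.
const sweepRegistry = new FinalizationRegistry((interval) => clearInterval(interval));

class SweptCache extends Map {
  constructor(shouldSweep, sweepIntervalMs) {
    super();
    this.shouldSweep = shouldSweep;
    const ref = new WeakRef(this);
    const interval = setInterval(() => ref.deref()?.sweep(), sweepIntervalMs);
    interval.unref(); // don't keep the process alive just to sweep
    sweepRegistry.register(this, interval);
  }

  sweep() {
    for (const [key, value] of this) {
      if (this.shouldSweep(value)) this.delete(key);
    }
  }
}

const messages = new SweptCache((m) => m.stale, 900_000);
messages.set('a', { stale: true });
messages.set('b', { stale: false });
messages.sweep(); // only 'b' survives
```

the catch being exactly what gets discussed below: the registry itself holds a callback per cache.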

weak kraken

I see

I only had abort controller in mind

spare hemlock

I... think, maybe it was added in v12?

weak kraken

yes, but it used to work via an npm package

now it's native

spare hemlock

AbortController and ??= (used first time in the https.Agent PR) are the only v15+ features

Guides use v16+'s packages (timers/promises) to enormously simplify a lot of things

autumn gazelle

when did weakrefs get added

spare hemlock

With SweptCollection (now renamed, I think), Ckohen added it after I suggested it

autumn gazelle

Oh, so as part of v13 original

weak kraken

is it a v13 feature? because as said, d.js v12 was tested with node 16 and the memory issue appeared

spare hemlock

Yeah, WeakRef was used first time in djs v13 for sweeping archived text channels

But the feature itself is available since Node.js v12 iirc

autumn gazelle

If this is reproducible on v12 code, that might make it easier to compare two dumps?

spare hemlock

Sorry for the confusion there ablobcatsipsweats

I suppose you could check the memory usage before and after the SweptCollection, mind you it's also very outdated by now, and was long before the option helpers

autumn gazelle

Nah, if this is happening on v12, SweptCollection isn't the issue

spare hemlock

But the truth is, SweptCollection has a very large memory impact

There are sort of memory leaks in v12, namely when a channel's type changes between GUILD_TEXT and GUILD_NEWS, since a lot of messages would have a TextChannel instance instead of a NewsChannel one, but that happens very rarely as the feature is rarely used

autumn gazelle

Wait, where does SweptCollection (LimitedCollection) use WeakRefs?

spare hemlock

It's in the manager

Uh, GitHub doesn't give results

autumn gazelle

yeah i see no weakrefs anywhere now

Koyamie, if you can repro this on v12 code, maybe do two heap dumps on diff node versions and see if anything stands out?

spare hemlock

Mind you, that bind creates a new function for each manager, which has a large impact

It's not WeakRef, but it's the feature created alongside it

autumn gazelle

Hm yeah that could be part of the mem diff between v12 and v13

but it sounds like this is more just a node v14 vs v16 issue, if its also happening on v12

spare hemlock

Let's wait for the results then

I personally didn't get a memory increase when going from node v14 to v16, although I did when going from djs v12 to v13, badly

My bot has always been sitting at 1860 MB RAM

weak kraken

d.js v12 with nodejs v14

spare hemlock

I saw that later

It's more about FinalizationRegistry having potentially hundreds of thousands of callbacks and stuff

(Which should in theory use more memory than the Map itself, thus using more memory than it helps reduce)

queen chasm

I mean, an unreffed timeout I would hope just gets discarded during GC, but I never tested that after our conversations

spare hemlock

It doesn't get discarded

It's just that Node.js doesn't wait for their completion when the process is closing

spare hemlock

No

17 is the newest nightly

solid bolt

Yeah lol, rip brain

spare hemlock

Meister said "try on v18"

weak kraken

oh lmao, I see

solid bolt

node v14 | Same Code | 2.1 gb | v13 | 21 Shards | 41k Guilds
node v16.6 | Same Code | 3.5 gb | v13 | 21 Shards | 41k Guilds

weak kraken

I will send mine today, downgraded to v14 and booted tonight

spare hemlock

Have you thought of making an issue in Node.js's repo?

weak kraken

Not a big difference here, my ram usage looks similar to node v16

spare hemlock

Same, it's nearly the same for me

weak kraken

not the same code for me tho LuffyThink

weak kraken

you have summarized the situation well, RAM usage is increasing very fast

queen chasm

Okay, I have a thought that kaira sparked: that 12% RAM increase on v13 is probably caused by the FinalizationRegistry.
Here is a list of things that may have impacted memory usage from v12 to v13:

- timeout caches in client
- removal of typing caches
+ application command managers
+ Guild bans managers
+ Guild Invite managers
+ Guild Sticker Managers
+ Stage Instance managers
+ Thread managers
+ Finalization Registry callbacks

spare hemlock

I bet the manager changes have negligible impact compared to that

queen chasm

It was just a list of the possible things, bans, invites, stickers, and especially stages probably don't contribute much

but they were things that weren't cached at all before

I'm trying to figure out whether we can find a way to not use the finalization registry and still stop the interval, cause in my testing I never even saw a mark-and-sweep happen; I had to call it manually

gusty moth

and I'm using the defaultMakeCacheSettings thing as well

queen chasm

oh, you can watch the memory consumption of Client#_cleanups (it's a Set). A reference to every one of those bound functions is stored there

queen chasm

Also wait, am I dumb, can we just WeakRef the reference to "this" in the setInterval, so the object gets GCed, and then if the sweepFn no longer exists, cancel the interval? Get rid of all of the finalization registry stuff

(would be opt in behavior cause semver minor)

spare hemlock

You also forgot to include the removal of Client#_timeouts, Client#_intervals, and Client#_immediates, which decreased memory usage significantly as well

In the list you wrote earlier

At this point we're gonna need C++, Ckohen, lol. C++'s destructors have next to no memory impact

It's just a number that goes up by one when referenced, and down by one when dereferenced

queen chasm

I wish smolHands

spare hemlock

Come think of it

In my first year of C++, my professor refused to teach us std::shared_ptr

So we had to ++ and -- the reference count manually

queen chasm

dead

spare hemlock

We... could achieve something similar

queen chasm

oh

oh

I see

yes

I like

spare hemlock

Make Map#delete do entry.decreaseReferenceCount()

As well as Map#clear

(The latter becomes O(n) but... we need to do so to clear intervals either way)
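
a rough sketch of that ref-count idea (increaseReferenceCount/decreaseReferenceCount are hypothetical names, not existing API):

```js
// Sketch: a Map that notifies entries when they are added or removed,
// so an entry can clear its own timers once its count hits zero.
class RefCountedCache extends Map {
  set(key, value) {
    if (this.get(key) !== value) {
      this.get(key)?.decreaseReferenceCount?.(); // replaced entry loses a ref
      value.increaseReferenceCount?.();          // optional chain: plain entries still work
    }
    return super.set(key, value);
  }

  delete(key) {
    this.get(key)?.decreaseReferenceCount?.();
    return super.delete(key);
  }

  // clear() becomes O(n), but every entry must be notified anyway.
  clear() {
    for (const value of this.values()) value.decreaseReferenceCount?.();
    return super.clear();
  }
}

const entry = {
  refs: 0,
  increaseReferenceCount() { this.refs += 1; },
  decreaseReferenceCount() { this.refs -= 1; },
};

const cache = new RefCountedCache();
cache.set('123', entry); // refs -> 1
cache.delete('123');     // refs -> 0, entry could now clear its timers
```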

queen chasm

but hold on, we don't clear the caches of managers that just disappear

we relied on GC for that

spare hemlock

Uhmm

It's more for handling an edge case

aka the user calling clear() directly

queen chasm

oh, but also the interval is on the map, not the entry

spare hemlock

Uhmmm

We can optional chain the decr ref call

An alternative is to just not do ref counters

And instead, as soon as we stop referencing it from Client (aka its holder has been removed), we cancel its timeouts if any

queen chasm

but...how

spare hemlock

Although that'd require Base#destroy

So when Guild#destroy is called, all channels, roles, members, etc, are destroyed

(obv we can fast-track some parts so they don't have to do a full scan)

queen chasm

hmmm, okay. We probably don't need destroy() on base, the majority of structures don't have managers on them right?

spare hemlock

That's right

So entry.destroy?.(), perhaps?

queen chasm

yea

spare hemlock

I was thinking of having it in base so it's a NOP

queen chasm

fair, abstract (but not quite)

spare hemlock

Which is less expensive than a conditional

By fast-tracking some stuff I mean destroying each channel individually only when it has a timer

Same for message reactions
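
roughly the shape of it (made-up structures, just to show the NOP-in-base idea):

```js
// Sketch: destroy() is a NOP on Base, so callers can invoke it
// unconditionally; structures that actually own timers override it,
// and holders destroy their children recursively.
class Base {
  destroy() {} // NOP; cheaper than an `entry.destroy?.()` conditional everywhere
}

class ThreadChannel extends Base {
  constructor() {
    super();
    this.destroyed = false;
    this.archiveTimer = setTimeout(() => {}, 60_000);
    this.archiveTimer.unref();
  }
  destroy() {
    clearTimeout(this.archiveTimer); // the part that actually frees resources
    this.destroyed = true;
  }
}

class Guild extends Base {
  constructor() {
    super();
    this.channels = new Map();
  }
  destroy() {
    // Fast-track candidate: only visit channels that actually hold timers.
    for (const channel of this.channels.values()) channel.destroy();
  }
}

const guild = new Guild();
const thread = new ThreadChannel();
guild.channels.set('1', thread);
guild.destroy(); // recursively clears the thread's timer
```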

queen chasm

what about if the guild channel manager doesn't have an interval, but each channel has a thread manager with a timer on it

spare hemlock

ugh, lol

queen chasm

okay, let me propose something rq

spare hemlock

This would be so much better with -next's design

queen chasm

yes

spare hemlock

TypeScript, global access to client via DI, and also the option to memoise client options

So we can resolve and globally reference a memoised dependency map that tells whether or not something should recursively destroy

queen chasm

Okay

```js
let interval;
const maybeThis = new WeakRef(this);

/**
 * The id of the interval being used to sweep.
 * @type {?Timeout}
 */
this.interval =
  sweepInterval > 0 && sweepInterval !== Infinity && sweepFilter
    ? setInterval(() => {
        const reffedThis = maybeThis.deref();
        if (!reffedThis) {
          clearInterval(interval);
          return;
        }
        const sweepFn = reffedThis.sweepFilter(reffedThis);
        if (sweepFn === null) return;
        if (typeof sweepFn !== 'function') throw new TypeError('SWEEP_FILTER_RETURN');
        reffedThis.sweep(sweepFn);
      }, sweepInterval * 1000).unref()
    : null;
interval = this.interval;
```

so this is on LimitedCollection

that way the collection (cache) itself can get GCed

and if it does, interval magically clears itself

spare hemlock

Does the WeakRef even get invalidated, with the interval being accessible somewhere in Node.js's internals and having access to `this` in its scope?

queen chasm

I don't have a good way to test that; I have no idea if having those variables within the constructor makes the entire class count as referenced

spare hemlock

We have been over this before, although admittedly we didn't test because I didn't see the logic/sense if it did

Not really on that last bit

The GC wants to run fast

It doesn't have time for smart tricks

queen chasm

right

spare hemlock

Ugh, if we could do something like ECS here, lol

Because in a way, Discord's structures are similar to the patterns used with ECS

Why can't they just give us a GC reference counter

There are two alternatives that pop up in my mind, @queen chasm

One is to make all caches global, so there's only 1 instance for all managers

Comes with the upside that cleaning them up is VERY fast, similar to ECS, and very reliable; also no need to do memory traversal

But the downside that we'd need to be careful when removing structures from cache, and *Remove events (such as guildRemove) would return empty caches unless ofc we do that in the following tick

The other is to do deep memory traversal just like we did with messages, but for all other caches

It doesn't need magic nor advanced features, no GC tricks, no more 2+ properties for each limited collection (admittedly, LimitedCollection takes more memory for me than it manages to clear), no LimitedCollection at all (simpler cache, one less inheritance level, thus faster), and it is highly reactive to changes (if something becomes inaccessible from the client, it's no longer swept)

Only downside is deep memory traversal being slow, but all of them have that issue anyway, just in 1 level instead of ~5 levels of nesting

queen chasm

So, sweeping would then fall on the client?

spare hemlock

We could also look into building dependency graphs, so sweepers that depend on one another run together and we save a lot of traversals

Yeah

It's how we did it for many years and always worked fine

You see, one of the things that annoy me the most, is that we're sweeping dead caches with the current system too

queen chasm

by dead you mean empty right

spare hemlock

Because the GC takes up to 5 minutes since the cache removal to call the finalizer

queen chasm

oh, yeah

mark-sweep doesn't run often at all

spare hemlock

So it's going to sweep a few times if it's below 5 mins

(For those who run sweepers more aggressively, they're customisable after all)

queen chasm

So, client#sweepers for deep traversal of every cache type

spare hemlock

I know a lot of people who had their message sweep interval set at 2 minutes, so I wouldn't be surprised to see that case

Yeah

queen chasm

not a bad idea, how would we do global caches though, cause I like the sound of that, just don't see the way

spare hemlock

global caches?

spare hemlock

That's a separate idea

queen chasm

yeah

I wanted to hear more about that

I guess they're not technically mutually exclusive

spare hemlock

It's basically client.caches.messages, client.caches.channels, client.caches.roles, etc

One of its upsides is that we don't need to rely on guild_id too

But local cache management is pain

queen chasm

oooh

and those caches would have the entire cache?

spare hemlock

It's extremely complicated at the end of the day, unless we go @klasa/core's way and implement... ProxyCache

Which was basically a Cache (similar to djs Collection) but holding keys only

queen chasm

oh, so basically our datamanagers

except storing the keys so filter isn't slow, using get instead

spare hemlock

And when you queried it, it'd check if the key existed in it and grab the entry from the cache it references to if it does

Yeah
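
so something like this, roughly (a guess at the shape, not @klasa/core's actual code):

```js
// Sketch: a keys-only cache that resolves entries from one shared
// "global" store on read, so per-member/per-guild views never hold
// duplicate objects.
class ProxyCache {
  constructor(store, keys = []) {
    this.store = store; // the single cache holding the real entries
    this.keys = new Set(keys);
  }

  has(key) {
    return this.keys.has(key);
  }

  get(key) {
    // Only resolve keys this view actually owns.
    return this.keys.has(key) ? this.store.get(key) : undefined;
  }

  add(key) {
    this.keys.add(key);
    return this;
  }

  *values() {
    for (const key of this.keys) {
      const entry = this.store.get(key);
      if (entry !== undefined) yield entry;
    }
  }
}

// One global role cache, one keys-only view per member:
const allRoles = new Map([
  ['1', { id: '1', name: 'admin' }],
  ['2', { id: '2', name: 'mod' }],
]);
const memberRoles = new ProxyCache(allRoles, ['2']);
```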

queen chasm

global caches sound like a good idea for -next

spare hemlock

@klasa/core uses ProxyCache for member.roles

queen chasm

I like client#sweepers as a replacement to LimitedCollection (do we keep it for max size? I assume yes)

spare hemlock

And a few more places too

We keep it for maxSize, yes

queen chasm

luckily thats a super easy semver minor change, deprecate the sweep properties in Limited

and the deprecation on message sweeping still applies

spare hemlock

Yeah

queen chasm

Cool, I'll add that to my todo list, hopefully to get in before 13.2

spare hemlock

Hopefully that gets Koyamie's caching hell solved

v13 was supposed to use far less memory than v12

queen chasm

yeah

weak kraken

I could try removing my clustering thing after the changes to see whether it's better, but honestly I think I'll keep my workaround, because having 656 processes is way too many

spare hemlock

@queen chasm are you going to write an issue regarding all the issues and ideas we discussed here?

I'd write one if it weren't for the fact I'm going to be phone only until late Sunday

queen chasm

I can try tomorrow, I don't know if I'll remember everything, I'll be scrolling here for a bit smolLUL

spare hemlock

Otherwise I can write it on Sunday

weak kraken

datadog 🗿

timid kernel

never heard of it 😶‍🌫️

weak kraken

it's cool and easy to set up

but I'd prefer using Grafana

spare hemlock

@queen chasm @weak kraken writing that document now, I'll share it here before posting it in the repo's issue tracker so I can add your feedback on it. Should take 15-30 minutes to complete iara_thumbsup

weak kraken

bet

spare hemlock

Since you replied before, and ckohen will check later, @weak kraken ^

Enjoy the read, 7852 characters

I'd also like your input @autumn gazelle, since you have dealt with caches a lot

autumn gazelle

I like it

weak kraken

seems fine to me

autumn gazelle

Personally I vote for 2 (global sweepers) first, as this is likely semver patch/minor, and then work to also do 1 (global cache) if folks feel it's viable down the road

spare hemlock

2 is semver minor as long as we stick to using the utility over LimitedCollection

Which is the most ideal solution as well, due to its 0 memory overhead

spare hemlock

Added some notes

queen chasm

Looks good. I'll note that we cannot prevent people from sweeping channels, because threads are stored in both the channel manager and the guild channel manager, meaning they need to be swept when trying to sweep threads. I don't like that paradigm, but with global sweepers we don't even need to do a deep traversal for channels; we simply let sweeping happen ever so slightly differently on the channel manager, using the _remove method on ChannelManager, which already does that for us.

```js
_remove(id) {
  const channel = this.cache.get(id);
  channel?.guild?.channels.cache.delete(id);
  channel?.parent?.threads?.cache.delete(id);
  this.cache.delete(id);
}
```

spare hemlock

Sorry, but I'm having trouble understanding your message in there

queen chasm

You said we can prevent sweeping channels; we can't

guilds yes

separate from that

spare hemlock

The global sweepers (without the global caches) require us to write code for how they must traverse the structures to find the relevant objects

We just don't write the code for the channels, but for threads

queen chasm

we don't need to traverse for channels because we do have a global cache for all channels, with a helpful existing method

spare hemlock

We do for messages, reactions, etc

queen chasm

yeah

spare hemlock

It's the same system, Ckohen

queen chasm

> spare hemlock: It's the same system, Ckohen

it can be, but I don't think it should be; if you want to sweep threads, presumably you want to sweep them everywhere, which requires you to set sweeping for channels, guild channels, and threads, unless you are saying we isolate threads specifically in the sweepers

spare hemlock

Anything I should note in the doc? I don't know what exactly to write

queen chasm

I just explained to myself how we can prevent sweeping channels while allowing sweeping threads, so I think you're fine

although idk if you want to allow sweeping dm channels maybe?

spare hemlock

Good point

Feel free to write something for me to add in the doc, so it's included in the issue I'll create tomorrow or so

queen chasm

I suppose we could do threads, dms, and even voice??? not much should rely on a voice channel except itself now unless the voice adapter uses it, which you probably know way more about

spare hemlock

Voice adaptor only relies on the ID

queen chasm

so we could allow that too

I can write a little something in a bit

spare hemlock

Alrighty agooglethumbsup

queen chasm

Something like this probably

- And because we need to write this code manually, we can hard-forbid specific managers from being swept, such as guilds and non-thread channels.
+ And because we need to write this code manually, we can hard-forbid specific managers from being swept, such as guilds and permission overwrites. It also provides additional flexibility so that we can handle channels significantly better in two ways.
+ 1. Make it much easier for users to set up sweeping for channels by providing one channel sweeper (or one per type) instead of users having to remember to sweep up to 3 levels of caching (i.e. no more duplication between ChannelManager, GuildChannelManager, and ThreadManager). We can do this because every channel is stored in the client channel manager, which already has a way to traverse guild channel managers and thread managers to remove the desired channel.
+ 2. We can hard code a filter that runs before sweeping of channels to only allow sweeping of channels we determine can be swept.

spare hemlock

If there are no further requests, I'll proceed to posting the Gist's contents in a couple of hours

spare hemlock

Just asking a couple of things in internal, and then I'll post if nobody has anything to add

autumn gazelle

blessed

weak kraken

👍

solid bolt

Sorry for opening this thread again, but after doing some modifications on the library, with support for nearly all events,
I came to the conclusion that with the raw-cache storage planned for v14/v15, you can decrease your ram by about 50%

spare hemlock

@queen chasm @lofty mist look at this

solid bolt

I built it as a layer on Discord.js with just one Collection, which holds the raw api data (and removes properties when a rule has been provided).

When you call .get, it transforms the modified raw api data and passes it to the transformer class (-> a d.js class)

The raw object is modified as mentioned above: firstly for the property removal, and secondly for changing the arrays of role/channel/thread/emoji objects into arrays of ids, whereby the objects are cached in their appropriate collections.

So on a transform of a guild object, it pulls the values from the collections via the ids in the array.

Modify Guild - roles: [{}] -> roles: [id] + roles.set
Transform Guild Modified Data: roles: [id] -> roles.get -> [{}]

This allows caching channels etc. just once, instead of also inside the guild itself.
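
As a rough sketch of that normalisation (simplified, helper names made up):

```js
// Sketch: store raw guild payloads with nested role objects replaced by
// id arrays (the role objects live once, in their own collection), and
// re-hydrate on read before passing data to a transformer class.
const caches = { guilds: new Map(), roles: new Map() };

function storeRawGuild(raw) {
  caches.guilds.set(raw.id, {
    ...raw,
    roles: raw.roles.map((role) => {
      caches.roles.set(role.id, role); // cached once, in its own collection
      return role.id;                  // the guild keeps only the id
    }),
  });
}

function getGuild(id) {
  const raw = caches.guilds.get(id);
  if (!raw) return undefined;
  // roles: [id] -> roles.get -> [{}], as in the schematic above
  return { ...raw, roles: raw.roles.map((roleId) => caches.roles.get(roleId)) };
}

storeRawGuild({ id: 'g1', name: 'test', roles: [{ id: 'r1', name: 'admin' }] });
```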

https://media.discordapp.net/attachments/861124237221560320/934830814598934618/unknown.png

solid bolt

Would be interesting to check the usage with redis.

To sum up, the advantages:

  • 1-level cache, instead of duplicated caches
  • raw objects use less memory in comparison to nested classes
  • raw data can be modified, very interesting for large bot devs
  • compatibility with redis and easier handling of hot reloads
  • easy handling of sending data between processes via IPC
  • classes do not have to be converted back when doing rest requests
  • fewer breaking changes on the user side when they use the raw api data, since Discord does fewer breaking changes
  • avoids memory leaks

And the drawbacks:

  • a lot of changes required
  • degraded performance, bc of transformation on every .get? - maybe a temporary class cache to solve this.

queen chasm

the whole "raw data" storage thing is planned, probably for the ts rewrite though (looking like it might be possible in v15, but more likely v16)

as for duplicate caches: the duplicate caches don't take up nearly as much memory as the originals, since they store by reference. Obviously it's still > 1 map with a complete list of duplicates (+ threads have 3), so memory still goes smolStonks The question becomes the trade-off between memory and performance at some point, and it seems performance was chosen in the past (filtering 5000+ channels down based on what channels would be in the current guild cache, or compiling the client cache from the guild caches, etc...). I think the performance hit on that isn't super terrible though, and given a proper investigation it's likely the benefits of the memory decrease far outweigh the performance improvement

solid bolt

v16 ablobsweats, I thought this had a higher priority, with Discord adding even more and more features.
But I can understand that this requires a lot of changes and is better done in the ts-rewrite.

queen chasm

V16 isn't that far out either. It's been said a couple of times that we want to have majors every few months instead of once a year now