This is a breakout thread for @deep raft's http ratelimit rewrite started in #2418
(Some pings @keen trout, @topaz mango)
1 messages · Page 1 of 1 (latest)
This is a breakout thread for @deep raft's http ratelimit rewrite started in #2418
(Some pings @keen trout, @topaz mango)
I am not sure if channels are the best way to handle ratelimits as things can happen in a uncontrolled order, but I think a rewrite to remove them would probably be quite large.
I have also been looking a bit at another "twilight-proxy"-like proxy called nirn-proxy which seems to be able to handle arbitary paths which is something which would be cool if we could add without compromising how well it works.
be careful with looking at nirn code btw, it is AGPL
I use nirn-proxy personally on top of serenity
Yeah I am aware
Its rather annoying tbh.
The issue with AGPL is that it is a bit strange what is part of the work covered by the license.
So we just need someone to read the code and describe it to us :)
You mean the mpsc channel? I don't see a problem, it's collected and split into individual queues asap (similar to the gateway queue).
I have to read through your changes again but iirc you send your rate limit request through a queue, with a one shot to message back or am I completely off with that?
I have to read through your changes again but iirc you send your rate limit request through a queue, with a one shot to message back or am I completely off with that?
Hmm I did not send that twice??
And then we have some channels where we just run a sleep and then return which is more the place I could see issues or have those been removed?
Queues are sequential, but do not block each other
You could have multiple requests in-flight in parallel within a queue, but that complicates staying in sync with the returned limits (i.e. they may return out of order)
Also most of the time requests are sequential, so it's a somewhat niche optimization
I did however remove the ratelimiters receive timeout. I figure that the http client can manages that (dropping the oneshot receiver to unstuck the ratelimiter)
hmm if I am a consumer of a ratelimit and I want to know how long is left you can't do that with a channel even though it already is know right?
Or is there something that can change how long a ticket needs to sleep for after the initial creation?
And one I think we have seen is a 429 is hit and then the other requests in-flight for that endpoint still hit the 429 even though we know they will hit it, but I am not sure how to solve that.
I might just be talking shit though, because I am not sure how you make it in a better way without some rather hard central sync code.
It is possible to implement something like that, but pending requests may unpredictably change your calculation. Note for example that my implementation does not decrement the queue's remaining/available count on making a request, but always assigns whatever Discord returns. Discord has stated to prefer a leaky bucket algorithm, but they have also stated that there may be exceptions to this.
The actor (runner function) is now feature complete if anyone wants to take a look. The last piece was interactions bypassing the global limit. It's incredibly dense and will definitely need lots of integration tests ⚠️
I added a new API in addition to acquire allowing for consumers to signal backpressure and inspect queues. Note that the predicate takes a function pointer and not a closure! You cannot reference any local variables with function pointers (the predicate runs in the actor, so closures would require allocations and Send bounds).
pub fn acquire(&self, path: Path) -> PermitFuture; // PermitFuture: Future<Permit>
pub fn acquire_if(&self, path: Path, predicate: fn(&Queue) -> bool) -> MaybePermitFuture; // MaybePermitFuture: Future<Option<Permit>>
I guess Queue will expose methods similar to what Bucket does today.
Rethinking this, taking a closure is probably better cause the allocation should never be a performance problem. Oh well
I have just marked the PR as ready for review and it is now also ready for early testing
I chose to exclude this for now. The rate limiter needs to just be able to hash the path, so dynamic parsing can easily be slipt in later
Discord's bucket hash is not globally unique, but only unique within the "top level resource".
However, HashMap<Path, Vec<u8>> and HashMap<(TopLevelResource, Vec<u8>), Queue> doesn't play well together because of ownership over the bucket hash (Vec<u8>). That's why I use the hashbrowns HashTable and hash by hand.
Makes sense, I will continue looking through it
I hope that I have not produced write only code 
I honestly don't even remember how everything ties together. That is part of my motivation for writing integration tests haha
I do have a in-progress comment about documenting the macros, mostly so it is clear when to use them and such as they can be a bit hard to decipher.
I inlined Queue::pop into pop! (renamed from on_pop!). I this is a bit better
@deep raft i deployed this fix to prod
because the gc removes queues bound to buckets it will eventually panic
atleast thats my thought
Did the crash happen after a gc? It is meant to only drop unused queues…
not sure my traces weren't on
I look at writing some tests for the garbage collection and see if I can make it crash
Ideally your fix should not be necessary, as it is more of a work around for the gc being overly aggressive
GC occurs at a 6 hour interval so you could check if that lines up with your crash time
ill check in a moment
bot started ~3:45 and people started complaining around ~10 that it was broken
so it was likely a gc
i was asleep when it broke so i dont have an exact time
yeah this isn't true 
at least with what i'm trying to do to replicate
got it to crash
How did you replicate it?
The GC left invalid bucket references to removed queues. Fixed by removing the reference from the bucket map too
I think this stems from the invalid bucket references too, so it should be resolved
@wanton seal are you still using the pr? And if so, did you run into any more issues?
@keen trout I managed to shorten the in_flight branch. I believe that one is the most complicated so PTAL regarding documentation
Is this ready to test yet?
Yes! Some internals are pending final approval but the API/functionality should be mostly stable
been looking at this and i wonder for the http proxy if we still want this using that once the rewrite is in. it might be more stable (and correcter) for the proxy to not be tied to specific routes but do some generic parsing to find the major route and id. and collect them on the bucket id
Yeah we have been talking about doing that for a while