#Result caching issue

cunning oracle
#

I'm making a bot to query Javadocs, for this I've already parsed and formatted the messages the way I want, but I'm having a dilemma on whether or not my caching strategy is the best, maybe someone can give some insight:

Discord requirements

  • Autocompletions must be answered ASAP, as Discord only allows 3 seconds to respond to them. Ideally they must be ready in memory.
  • Autocompletion values (what is actually sent when selecting an autocompletion, not what is shown) must be shorter than 100 characters.
  • The actual messages can take a while to be replied to.

My approach

  1. After parsing, separate into two types of entities: objects (classes, interfaces, enum classes, annotations) and members (fields, methods, enum values, annotation elements).
  2. For each entity, create an ID using its name and an appended short ID, to avoid possible conflicts. For instance: JavaPlugin-LZbCb.
  3. The whole Javadoc cache has a folder inside a __cache__ folder, for instance, __cache__/spigot-1.20.1. The files mentioned later will always be under this folder.
  4. Save autocompletions in a data.json with this schema:
{
  d: <timestamp>,
  // Holds objects
  o: {
    <ID>: <Autocomplete Message>
  },
  // Holds members
  m: {
    <ID>: <Autocomplete Message>
  }
}

The differentiation between objects and members in the schema is done to allow searching differently. For instance, the bot can try to query members only if the user has inputted a . or #.
  5. Save actual messages in objects and members folders, for instance: objects/JavaPlugin-LZbCb.txt, using a promise queue of fixed size.
  6. In runtime, keep autocompletions (data.jsons) in memory, but only load full messages when queried. I can also keep an in-memory LRU cache for frequent queries.
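
The in-memory LRU cache mentioned in the last step could look roughly like the sketch below. It is only an illustration: `LruCache`, `getMessage`, and the `readMessage` callback are hypothetical names, and the real file read is left to the caller.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so the key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}

// Hypothetical helper: `readMessage` stands in for the real file read
// (for example, reading objects/<id>.txt from the cache folder).
async function getMessage(
  cache: LruCache<string>,
  id: string,
  readMessage: (id: string) => Promise<string>,
): Promise<string> {
  const hit = cache.get(id);
  if (hit !== undefined) return hit;
  const text = await readMessage(id);
  cache.set(id, text);
  return text;
}
```

Only the full messages go through this cache; the autocompletion tables stay fully in memory as described above.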

A remark after testing with Spigot 1.21.3: while the data.json is surprisingly small (<7 MB), there's a massive number of files (>65.7k), though the total size isn't that big (30 MB).

#

I'm considering grouping objects and members in folders per first letter

#

that wouldn't help with file size or amount obv but it'd avoid having a single folder with a gazillion files
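
A minimal sketch of the per-first-letter grouping, assuming IDs like the ones above. `shardedPath` and the `_` fallback bucket are hypothetical choices, not something the thread settled on.

```typescript
type EntityKind = "objects" | "members";

// Builds a relative path such as "objects/j/JavaPlugin-LZbCb.txt".
function shardedPath(kind: EntityKind, id: string): string {
  const first = id.charAt(0).toLowerCase();
  // Anything outside a-z and 0-9 (or an empty ID) goes into a shared "_" bucket.
  const shard = /^[a-z0-9]$/.test(first) ? first : "_";
  return `${kind}/${shard}/${id}.txt`;
}
```

First letters are unlikely to be evenly distributed, so some folders would still end up much larger than others.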

dreamy ravine
#

> In runtime, keep autocompletions (data.jsons) in memory, **but only load full messages when queried.**
is the bolded part because it would be infeasible to keep everything in memory? do you know how much RAM that would require?

#

i ask because given infinite memory the most efficient strategy would be to preprocess everything (you can get away with this because the source is static data). when a user queries for Object.hashCode your code could do an O(1) lookup like responses['Object.hashCode']
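
The "preprocess everything" idea could be sketched like this, assuming a hypothetical `PreparedEntry` shape with a qualified name and a ready-to-send message:

```typescript
// Hypothetical shape of one fully preprocessed entry.
interface PreparedEntry {
  qualifiedName: string; // e.g. "Object.hashCode"
  message: string;       // the already formatted reply
}

// Builds the lookup table once, up front; each later query is a single Map.get.
function buildResponses(entries: PreparedEntry[]): Map<string, string> {
  const responses = new Map<string, string>();
  for (const entry of entries) {
    responses.set(entry.qualifiedName, entry.message);
  }
  return responses;
}
```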

#

(it's possible i have the wrong idea of what your bot does though. you mentioned "autocompletion" so maybe the queries are fuzzier than that?)

cunning oracle
# dreamy ravine > In runtime, keep autocompletions (`data.jsons`) in memory, **but only load ful...

> it would be infeasible to keep everything in memory?
probably: while the total file size per javadoc is not big, I need to keep multiple javadocs, so there are multiple sets of this massive number of messages. there's also no real need to keep them in memory, the only "need" is to keep autocompletes (in "" because it's not needed per se, but it's part of the requirement to reply to autocompletes asap)
> it's possible i have the wrong idea of what your bot does though. you mentioned "autocompletion" so maybe the queries are fuzzier than that?
I would like to allow some level of fuzziness, so it doesn't instantly fail if the user types one letter wrong, though the name (what the user sees) and the value (what is actually sent when the user selects an option) are completely different, that's why the name can say something like "Class JavaPlugin - Some description" but the value can be its id or whatever else, as long as it's shorter than 100 chars

#

sorry for the late reply, I completely missed yours when I checked a while ago

dreamy ravine
#

> the name (what the user sees) and the value (what is actually sent when the user selects an option) are completely different, that's why the name can say something like "Class JavaPlugin - Some description" but the value can be its id or whatever else, as long as it's shorter than 100 chars
you mean the thing they type in searches across multiple fields of the thing you're matching against (not just its primary key/Java identifier name)?

#

if so i might structure it as one data structure for the source of truth (which may be lazily-loaded or whatever) and one or more indexes that contain references to things within that data structure
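
One way to read "a source of truth plus indexes that reference it" is sketched below. `IndexedEntity` and `EntityStore` are hypothetical names, and a lowercase-name index is just one example of a secondary index.

```typescript
// Hypothetical record type; the full message could still be loaded lazily by ID.
interface IndexedEntity {
  id: string;
  name: string;
  description: string;
}

class EntityStore {
  // Source of truth: exactly one record per ID.
  private readonly byId = new Map<string, IndexedEntity>();
  // Secondary index: lowercase name -> references to the same records.
  private readonly byLowerName = new Map<string, IndexedEntity[]>();

  add(entity: IndexedEntity): void {
    this.byId.set(entity.id, entity);
    const key = entity.name.toLowerCase();
    const bucket = this.byLowerName.get(key) ?? [];
    bucket.push(entity);
    this.byLowerName.set(key, bucket);
  }

  getById(id: string): IndexedEntity | undefined {
    return this.byId.get(id);
  }

  findByName(name: string): IndexedEntity[] {
    return this.byLowerName.get(name.toLowerCase()) ?? [];
  }
}
```

Because the index stores references rather than copies, adding another index (by description words, by package, and so on) does not duplicate the records themselves.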

#

if you need prefix matching (which is what "autocomplete" suggests to me) then maybe this would be a use case for a Trie
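
A small Trie sketch for prefix matching, assuming names are matched case-insensitively and each name maps to one or more entity IDs. `PrefixTrie` and the default limit of 25 (which, as far as I know, is Discord's cap on autocomplete choices) are assumptions for illustration.

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  ids: string[] = []; // IDs of entities whose name ends exactly at this node
}

class PrefixTrie {
  private readonly root = new TrieNode();

  insert(name: string, id: string): void {
    let node = this.root;
    for (const ch of name.toLowerCase()) {
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.ids.push(id);
  }

  // Returns up to `limit` IDs whose name starts with `prefix`.
  search(prefix: string, limit = 25): string[] {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    // Depth-first walk of the subtree below the prefix node.
    const out: string[] = [];
    const stack: TrieNode[] = [node];
    while (stack.length > 0 && out.length < limit) {
      const current = stack.pop() as TrieNode;
      for (const id of current.ids) {
        if (out.length < limit) out.push(id);
      }
      for (const child of current.children.values()) stack.push(child);
    }
    return out;
  }
}
```

This only handles exact prefixes; it does not tolerate typos on its own, so it would complement rather than replace a fuzzy search.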

#

i've been thinking in terms of "ASAP" since that's what you said, but if the requirement really is just "fast enough" then i don't think you need to get too fancy. even with multiple sequential rounds of database I/O or whatever, 3 seconds is a long time

cunning oracle
# dreamy ravine > the name (what the user sees) and the value (what is actually sent when the us...

I think an example could help: say I have scraped a Javadoc called jdoc and saved a class called MyClass with a method #foo(). It'd look like this in the cache/jdoc/data.json:

{
  "o": {
    "myclass-abcd": "Class MyClass: <description>"
  },
  "m": {
    "myclassfoo-defg": "Method MyClass#foo: <description>"
  }
}

where abcd and defg have been generated randomly for that class/method to avoid conflicts.
Then, the full messages are saved in cache/jdoc/objects/myclass-abcd.txt and cache/jdoc/members/myclassfoo-defg.txt.
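
Generating such IDs could be sketched as follows. `makeId`, the letters-only alphabet, and the 99-character cap (to stay "shorter than 100") are assumptions for illustration.

```typescript
const ID_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

function randomSuffix(length: number): string {
  let suffix = "";
  for (let i = 0; i < length; i++) {
    suffix += ID_ALPHABET.charAt(Math.floor(Math.random() * ID_ALPHABET.length));
  }
  return suffix;
}

// Returns an ID such as "myclass-abcd" that is not yet in `taken`, and records it there.
function makeId(name: string, taken: Set<string>, suffixLength = 4, maxLength = 99): string {
  // Leave room for the "-" separator and the suffix.
  const base = name.toLowerCase().slice(0, maxLength - suffixLength - 1);
  let id = `${base}-${randomSuffix(suffixLength)}`;
  // Retry on the (unlikely) chance that the random suffix collides.
  while (taken.has(id)) {
    id = `${base}-${randomSuffix(suffixLength)}`;
  }
  taken.add(id);
  return id;
}
```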

When the user starts the autocompletion, I'll first try to reply with objects, so I fuzzy search o's keys for what the user is typing.

Only when the user inputs either . or # will I do the same as above, but for members (m), though I first strip the # or . before searching. The benefit of this is that the user can type both the class and method names wrong, but it should still match fuzzily

When the user actually selects something, they'll see the description in the client, but in the background they would've inputted the id (and whether it's an object or a member), I can then just look up the message in the filesystem and reply
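
The flow described above could be sketched like this. The fuzzy matching here is a plain Levenshtein distance against a same-length prefix of each key, which is only one possible choice; `suggest`, the `o:`/`m:` value prefix, and the limit of 25 choices are assumptions, not the bot's actual code.

```typescript
interface AutocompleteData {
  o: Record<string, string>; // object ID -> autocomplete message
  m: Record<string, string>; // member ID -> autocomplete message
}

// Levenshtein distance, computed row by row.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

function suggest(data: AutocompleteData, input: string, limit = 25): { name: string; value: string }[] {
  // Members are searched only once the user has typed "." or "#".
  const wantsMembers = /[.#]/.test(input);
  const table = wantsMembers ? data.m : data.o;
  const query = input.replace(/[.#]/g, "").toLowerCase();
  return Object.keys(table)
    .map((id) => {
      // Drop the "-abcd" suffix so only the name part is compared.
      const dash = id.lastIndexOf("-");
      const key = dash === -1 ? id : id.slice(0, dash);
      // Compare against a same-length prefix so partial input still scores well.
      return { id, score: editDistance(query, key.slice(0, query.length)) };
    })
    .sort((x, y) => x.score - y.score)
    .slice(0, limit)
    .map(({ id }) => ({
      name: table[id].slice(0, 100),
      // The value encodes both the kind and the ID (hypothetical format).
      value: `${wantsMembers ? "m" : "o"}:${id}`,
    }));
}
```

For example, `suggest(data, "MyClss#fo")` would search only `m` with the query `myclssfo`, so a misspelled class name can still rank `myclassfoo-defg` first.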

cunning oracle
# dreamy ravine i've been thinking in terms of "ASAP" since that's what you said, but if the req...

in theory yeah but I also have to account for other stuff like:

  • ping: the countdown starts when discord sends the interaction to my vps, so by the time I actually receive it I have less than 3 seconds left, and that doesn't count the trip from my vps back to discord for the reply
  • my own vps' performance: it's a personal lightweight one, ideally I only want in memory what's needed, and I think having only the autocompletes is acceptable
  • the overhead for the actual fuzzy searching over all of those keys
#

it should be acceptable but I'm trying to play the safe route since it's inside my "budget"

dreamy ravine
#

do you have an implementation already? if so how long is it taking?
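
A rough way to answer the timing question is to measure the search over a key set of realistic size. The sketch below is an assumption-laden illustration: `timeSearch` is a hypothetical helper, and the search function is passed in so any implementation can be measured.

```typescript
// Returns the average wall-clock time in milliseconds of one search call.
function timeSearch(
  keys: string[],
  query: string,
  search: (keys: string[], query: string) => unknown,
  runs = 5,
): number {
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    search(keys, query);
  }
  return (Date.now() - start) / runs;
}

// Synthetic keys roughly matching the scale mentioned in the thread (about 65k entries).
function makeSyntheticKeys(count: number): string[] {
  const keys: string[] = [];
  for (let i = 0; i < count; i++) {
    keys.push(`class${i}method${i % 50}-abcd`);
  }
  return keys;
}
```

Running this on the VPS itself, rather than on a development machine, would give the number that matters for the 3-second budget.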

#

not sure i understand the conflict problem. my first thought would be to use fully-qualified names for keys (so like com.package.whatever.MyClass instead of myclass-abcd)

cunning oracle
#

you know how java has long verbose names and tons of packages

#

I actually now think the method should use the prototype, e.g. myclassfoo(string,int)

#

to support different overloads
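
Building such a prototype-based key could be sketched as below, assuming parameter types arrive as Java type names. `memberKey` is a hypothetical helper; it drops generics and package prefixes to keep the key short.

```typescript
// Builds a key such as "myclassfoo(string,int)" from a class, a method, and its parameter types.
function memberKey(className: string, methodName: string, paramTypes: string[]): string {
  const simpleTypes = paramTypes.map((type) => {
    // Drop generic arguments: "java.util.List<java.lang.String>" -> "java.util.List".
    const erased = type.replace(/<.*>/, "");
    // Keep only the simple name: "java.util.List" -> "list".
    return erased.slice(erased.lastIndexOf(".") + 1).toLowerCase();
  });
  return `${className}${methodName}`.toLowerCase() + `(${simpleTypes.join(",")})`;
}
```

Two types with the same simple name in different packages would still produce the same key, so the random suffix would remain useful as a tie-breaker.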

cunning oracle
#

though I think I can serialize it to make it shorter

#

but still

#

I don't think there's a benefit for using the FQN over just the name with some nonce

#

the latter is shorter and easier to read in the files as well

#

and I definitely can't use it with methods

#

it'd be way too long

dreamy ravine
#

ah i missed that you were sending the full IDs to the client

cunning oracle
#

yeah it'd be for the autocomplete values