#Best Vector DB for Local Use

2390 messages · Page 3 of 3 (latest)

meager siren
#

For one, I have validated this thing on small data

inland lynx
#

uh i meant at an OS level

meager siren
#

second, the incrementation of the db size is scaling accordingly to the number of checkpoints

meager siren
inland lynx
meager siren
inland lynx
#

hmm

meager siren
#

2GB per write (table.add() call).

#

and about 3 TB for the whole table

inland lynx
#

How many concurrent writes?

#

woah is that all parallel?

meager siren
#

with my data, (assume 6 kb per entry), it would be 325.000 entries per table.add()

meager siren
#

I just add up all the new batches into a big chunk and when I reach a safe point (100,000 for my program), I call table.add() to that data and reset the stuff after before repeating again.

inland lynx
meager siren
#

No fancy parallelization with multiprocessing/multithreading but it's the best of what I can do

meager siren
#

other way is on the reading of the table which is more manual and inspecting the embedding returned

inland lynx
#

oh ok so there is clearly a difference when you read a bad embedding

#

so basically you could do a check intermediatelly for bad embeddings and clean it up

meager siren
#

No need to clean if it doesnt go into the db

inland lynx
#

right

meager siren
#

But in those cases i have found a bad embedding, the problem was in the string itself

inland lynx
#

uh

meager siren
#

ie it was an empty string or a string of white space

#

and the produced embedding would be NaN as a result

inland lynx
meager siren
#

Im preprocessing/remove such strings from the categories list before we even search the vector db for existng entries.

inland lynx
#

thats odd

meager siren
#

and also, like I mentioned before, in the case that NaNs are detected, I just skip adding that batch to the big batch chunk

inland lynx
#

how did it come in the first place

#

i havent seen any categories like that

meager siren
meager siren
#

@inland lynx Should I tempt fate and try and get my embedding script to try and parallelize? The idea is that for the big embedding loop:

  1. chunk categories into sizes of 100,000
  2. pass those chunks to processors to get 100,000 embeddings (per processor)
  3. Wait for processors to finish their chunks
  4. upon aggregation (back on the main thread/processor), iterate over each processors 100,000 category, embedding pairs and write each one to table (one chunk at a time).
  5. move to the next set of chunks and repeat starting from step 2.
#

Unfortunately, I cannot apply the same thing to running the global search to count which categories have already been embedding (so that is going to be VERY slow -> I was getting 30 minutes with a semi populated table).

#

Cuz at this point, I'm at around 48 hours into the run time and only 2M out of 4M categories have been embedded.

#

Back when I was running multiprocessing, I was done in under 24 hours.

inland lynx
meager siren
#

So you either do 16 processors x 16 batch size or some variation that gets you 256

meager siren
#

I got the 16 x 16 (processors x batch size) to run in around the 21 hours I had before.

#

So far I don’t think it’s a bad run

meager siren
#

So another note to myself is that searches will default to a limit of 10 results. Gotta specify the limit to what you want/need. Found this out the hard way when running my search for existing keys/categories and I only kept getting back 10 results instead of 100,000 (based on my SQL query parameters).

#

Also, that check after the DB is supposedly full takes around 9 hours with no parallelization (multithreading/processing). Oof.

inland lynx
meager siren
#

I’m off for a week to go home but iirc, it was about 2 min on my server CPU

meager siren
inland lynx
meager siren
tacit mica
#

why are there so many deleted messages here? it's hard to read 😭

meager siren
#

I havent deleted much. But I figure this thread is still good to carry the convo around this work so I keep contributing to it.

meager siren
#

I'm back from all the school and vacations to work on this project. I have added parameters to scale the category processing and embedding as necessary but I still need to append the "missed" categories to the downloaded category tree.

Fine. I've got the code for that but now I'm running into the issue from my vector DB such that my search is returning nothing. Specifically, the query to the vector DB for the nearest embedding given the query embedding and the restriction on categories.

Code for this section:

    for node in tqdm(missing_categories):
        # Get current category node embedding.
        _, embedding = table.search()\
            .where(f"category = '{node}'")\
            .limit(1)\
            .to_list()

        # Search for "closest" node embedding from the graph.
        best_category, _ = table.search(embedding)\
            .where(f"category IN ({nodes_for_query})")\
            .limit(1)\
            .to_list()

        # Create an edge in the downloaded graph connecting the "best
        # matching" category (from the original downloaded graph) to
        # the current category node from the "missing" categories list.
        downloaded_graph.add_edge(best_category, node)
#

I'm away from my server for a bit today but I'm thinking that, once I have the vector DB initialize and populated, I train/create the ann_index with the table.create_index() function and see if I can query the vector DB appropriately and get back some result.

meager siren
#

Ok, so querying after initializing the index helped but the .where(SQL) line is still over filtering and returning nothing.

Current plan is to build a second table based on only items from the downloaded graph (query from main table, no recomputing embeddings) and have that be the “filtered” table. I then use the filtered table to help build similarity between missing categories and those in the downloaded graph.

#

Why do this? Well, because I can’t find a way to properly filter the table in a few commands so this will have to suffice for now (at the cost of doing the same steps as part of the initial table read).

meager siren
#

I'm starting to wonder if the 10 deep category tree downloaded was the right idea. It is taking an immense amount of processing/compute time to do the following:

  1. embed all categories (local and downloaded).
  2. verify which and all categories are embedded.
  3. create/isolate a filtered table of embeddings from the downloaded category tree.
  4. perform ANN between local categories and downloaded ones.

To my understanding, there are around 4.6M total categories under consideration with around 3M being from the local dump and the rest from the 10 deep category tree.

I have category trees for depths of 2, 3, and 5 as well as depth 10. Those trees will have considerably less nodes. The cost of this is a significantly flatter tree with more documents being returned (not as fine tooth comb as compared to the depth 10 tree). Maybe it's worth considering over the depth 10 model.

meager siren
#

Running on depth 5, there seems to be 3.8M total categories vs depth 10 4.6M.

#

Maybe it would be a good idea to revisit the problem of identifying the minimum number of categories needed for full document coverage.

meager siren
#

@inland lynx can I get your thoughts on this?

Building the category tree by connecting the downloaded tree to the categories from the dumped files is taking a long time. It takes a day just to embed all 4.6M categories (depth 10 tree + dump). From there, I have to map around 3M "dump" categories to the downloaded tree categories. Even with parallelization and caching, this is taking a really long time for each time I run the script and get farther into the program.

So I was thinking of some alternatives.

  1. Cluster the dump categories to reduce the number way down from 3M
    Pros:
  • Reduces number of clusters to map from the downloaded category tree
    Cons:
  • Requires there be another map for groups to category names
  1. Build my own category tree with just the dump data
    Pros:
  • Allows me to disregard the categories from the downloaded tree (benefits scale with that tree's size)
    Cons:
  • Still very compute expensive to deal with 3M categories

For the second option, I was considering clustering the categories based on their computed semantic vectors with k means clustering (maybe implement the algo myself). I also considered implementing HNSW too which would be a "fast" abstraction of the tree. While researching what I'd need for each algo, I just realized something: Doesn't LanceDB already do a lot of this?

And that brought my to option 3:
3) Store all dump categories + embeddings to lancedb. Query with a high K limit (ie k = 100 or 1,000)
Pros:

  • Drastically simplifies things. No need to look at downloaded category tree or even category trees in general
  • LanceDB already does the work.
  • Disk based thanks to Lancedb
    Cons:
  • Not as "audit-able" as the tree structure (but still does the job)
  • Don't have any metrics on query time.
meager siren
#

Looks like this is pretty good with existing categories (no brainer) BUT using raw text does not translate well in this system. Going to have to see if keyword extraction can help with this.

#

Also, storage is only costing 13GB so that's nice

meager siren
#

The direct retrieval of the categories via the vectors is pretty fast, even for high number for top_n (ie 1,000). Downside is that I'm unable to translate this success to raw queries (ie "what does the fox say?" will not give good categories around foxes nor speaking).

#

To try and fix this, I'm thinking of doing some form of key word extraction from the query and try to run the vector search that way. Maybe I'll get something better.

meager siren
#

@timid sphinx I think I found one of the issues. I have an inverted index formatted as a prefix tree that is initialized as a class (tree class and node class) with the load/save being recursive (stored in msgpack files).

#

Profiler is giving me around 5+ minutse for load trie + retrieve documents tied to word.

#

That's one part (even with multihreading) is taking too much time

timid sphinx
#

Hehe I am just glad I guessed roughly right

meager siren
#

Welp, now I'm on to figure out a new inverted index structure/storage solution

meager siren
#

Was pleasantly surprised to see that parquet files saved 3GB of storage (total) comapred to msgpack (20GB vs 23GB). That's nice

meager siren
# timid sphinx Hehe I am just glad I guessed roughly right

I ran the profiler again using pandas. Good news is that the whole process of

  1. Load parquet data
  2. filter "redirect" documents
  3. compute TF-IDF
  4. compute cosine similarity
  5. update top-n heap

is coming in really fast at a low-low time of 165s (that's 1/2 the time it took just to initialize the prefix trees in the inverted index). Downside is that it's still way too slow for that section and that is per file (with ~200 in the database I have).

timid sphinx
#

Well parquet is not exactly lightweight. And you are still reading from disk in a loop?

meager siren
#

Unfortunately

timid sphinx
#

How many times do you read files each loop?

#

It's not 10M, is it?

meager siren
#

I'm still shopping for a solution like lancedb that has low fileIO cost

meager siren
timid sphinx
#

Hmm so what is the hottest path?

#

In profiler results

#

Do you have a flame graph?

meager siren
#

Nope

#

I'm not surprised that most of these ops are around the pandas dataframe operations

#

Current rate puts things at around 10 minutes per query just for TF-IDF

#

Substantially better compared to the 2+ hours I had before but still not great

#

How much time do you think I can cut down if I precompute TF-IDF for each doc, word pair?

#

Effectively making this a reading task?

timid sphinx
#

I presume that most of pandas time?

#

Also why are you indexing frames

meager siren
# timid sphinx Also why are you indexing frames

I think it's these lines here:

            doc_vectors = df_doc2words.groupby("doc")\
                .apply(lambda group: create_aligned_tfidf_vector(group, words))
            
            # Compute the document cosine similarity.
            # doc_vectors["cosine similarity"] = doc_vectors.apply(
            #     lambda vec: cosine_similarity(vec, query_tfidf_vector)
            # )
            results = doc_vectors.reset_index(name="tfidf_vector")
            results["cosine_similarity"] = results["tfidf_vector"].apply(
                lambda vec: cosine_similarity(vec, query_tfidf_vector)
            )
#

where doc_vectors is getting the full TF-IDF vector for a document

#

and doesn't return a full dataframe. <- this is where I think I get the most trouble,

meager siren
#

I'm looking at my code and I'm not sure why I didn't just precompute the word level TF-IDF/BM25 values and store that in parquet files. It would probably be 2x as fast right now to just "read" and aggregate the target values for each method instead of compute it dynamically. I think the only reason I wanted to have that compute on inference time was in case values had change or something.

timid sphinx
#

Yeah, you should precompute them and put them directly into the index

#

Or separately I guess. Just don't use pandas so much. Convert them to numpy vectors at least.

meager siren
#

Should I just put those hparams as part of the config.json file I use and let it run from there?

timid sphinx
#

You just need to get rid of any non-O(1) or just slow parts of this

#

E.g. indexing into pandas is slow - use numpy array

#

You have a hot path - the part that takes too long to run (preferably also the part that runs in a loop). On that path you need to eliminate all excessive complexity. That could be pandas (and polars), any complex computations - precompute as much as feasible, any file access or complex file parsing. Anything from the top of the profiler results has to go. But get a flame graph visualizer

meager siren
#

Thinking aloud, even with optimizations of numpy array, the sorting is going to take n log n runtime because you have to sort the vector values. At a scale of 40M values to compare....

#

This reminds me again why I was looking at different inverted index methods to try and filter that down.

meager siren
#

Adding the extra steps means more storage and compute... Ugh, I'm not sure how to sort this out.

timid sphinx
#

What are you sorting?

meager siren
#

Sorting the cosine similarity/aggr BM25 value

meager siren
#

Current performance has me considering expanding the scope of Minimum hardware specs for the project. 4 threads/processors should yield final results in around 2.5 hours, 8 thread/processors in around 1.25 hours, so on with 16 threads/processors and so forth (2.5M articles per worker at that point).

timid sphinx
inland lynx
#

what happened so far? so many messages

inland lynx
inland lynx
#

looks like not a h/w issue just the code

meager siren
timid sphinx
#

It's n log k for a k sized heap

#

The constant can be smaller as well

meager siren
#

I think my problem is more of the n aspect

#

40M articles to scan through is A LOT in general.

#

Even if the insert/sort is log k for the heap

timid sphinx
#

Well what do you do in that loop?

#

I am still confused how it can be so slow. Plus if you have an inverted index, the number should drop drastically based on the search terms. Cosine similarity is trickier.

#

But then again, it's python. Even just porting to Java I used to be able to get 50x over python

meager siren
#

I had the inverted index initially connected BUT I took it out because it was in that OOP trie structure that was taking ages to deserialize back to its full form.

#

I'm looking at my options for what I can do with just a simple {term, [doc, doc, doc]} structure without using a compression technique like tries

#

Best I can come up with is another parquet table with columns term, file, docs (SHA1 hashes)

#

I was pondering additional compression by taking the doc or file strings and converting them to ints and keeping a decoding map for each one. It would keep memory down compared to strings but takes additional costs to lookup in the the map (even if it's O(1) lookup, for M (M < N) docs you'd still get an O(M) overall runtime).

timid sphinx
#

Omg

#

What's the format of the index files?

meager siren
#

The original format was a sharded trie (1 trie per starting character). All tries were nested dict structures

timid sphinx
#

Omg

meager siren
#

That 5+ minute runtime for tries was with the OOP deserialization on a heavily sharded set of tries (I had done additional sharding to keep file size down so that the deserialization could go faster)

timid sphinx
#

It's so overengineered

#

You should have a vocabulary of words, a single trie. Loaded once.

#

Everything else should be plain arrays

#

Documents should be processed once and converted into a weighted list of word ids

meager siren
timid sphinx
#

Before any queries ever happen

#

How large?

meager siren
#

Well over 2M last I checked

#

gimme a sec to confirm

timid sphinx
#

That's at most 30 MB stored, no? Maybe 100 MB

#

Peanuts

meager siren
#

I was wrong. It's 42M unique words captured

#

(with stemming & lemmatizing)

timid sphinx
#

That's a bit worse

#

Do you want it to run on commodity hardware?

meager siren
#

I kinda did

#

But after working on this for over 6 months, I'm not sure.

#

My server has done a lot of the compute work because it'll give results faster

timid sphinx
#

I am sure it is possible

meager siren
#

especially preprocessing

timid sphinx
#

I bet it can run on commodity hardware but the way you are doing it in python...

meager siren
#

This is sort of where I was hoping something like category grouping could help streamline search results down better than the plain inverted index.

meager siren
timid sphinx
#

If you want things fast, you can't read from disk so much, you can't use so many tries (bad memory access patterns), especially made out of python dicts, you shouldn't use heavyweight serialization libraries and heavyweight data structures (e.g. dataframes)

#

Disk access is extremely slow compared to memory

#

Especially with hdds

meager siren
#

I'm still following except for the "read from disk" aspect. Commodity hardware these days is still floating at around 8 - 16 GB RAM (even though it's not expensive to upgrade unless you have a laptop)

timid sphinx
#

Mmmm well at least don't do heavy deserialization. Difficult in python I guess

#

Idk how realistic it is to have a mmaped data structure in python

meager siren
#

I wonder how software like elasticsearch does it

#

(first step probably is 1: use C or rust)

#

Oh damn, their repo is 99% Java

meager siren
# timid sphinx Mmmm well at least don't do heavy deserialization. Difficult in python I guess

So effectively, here's what I'm thinking from the feedback.

  1. Precompute EVERYTHING you can. Because all you want to do is just read those values (because diskIO is going to be your first and biggest bottleneck when working around commodity hardware's memory restraints).
  2. Load only what you need and do it to the base data structures of the language.
  3. Swap for rust (or other faster languages) where you can
#

I'm still unsure what to do for the inverted index as that's still tricky given its scale (size of vocab and I guess the documents)

timid sphinx
#

You are using some insane data structures for it

#

Serialized as an array it should be much smaller

#

Not python array btw

#

I am pretty sure it is an array of pointers underneath the hood

meager siren
#

Ideas on an effective way to store this to disk?

#

42M words, 40M docs... Keep it all as strings and the mappings as pointers...

#

That would save on in-memory. But I'm just failing to grasp how to serialize this to disk

#

with the intent of loading the data from disk once and leaving it in memory

timid sphinx
#

You can split a trie into multiple tries actually. One for most common terms

#

See if there is a mmap abstraction that allows reading various sized types at different locations

#

Encode the whole datastructure as an array of bytes

meager siren
#

I have the precompute script running right now and I'm amazed at how long it's expected to handle computing the TF-IDF and BM25 for just the one file (out of 200)

#

This was probably another reason why I optied for runtime computation of the values over precomputing....

#

I think I need to just jump head first into the rust implementation right now.

meager siren
#

I also just found out that polars does not play well with multithreading from the main python branch (probably because it already is multithreaded underneath by rust)

meager siren
#

I found out my implementation for generating the precomputed TF-IDF and BM25 was really stupid, hence the super slow ETAs. Fixed everything by leveraging pandas dataframes to pick up the slack and things are going MUCH faster. Going to bed soon but the work should be done long before I get up tomorrow. Data files are currently at around 43 GB but I suspect the dataframe is containing everything at the moment (file, document, word, term frequency, inverse document frequency, TF-IDF, BM25), hence the bloat (the raw document to word frequency mappings were at around 23 GB in msgpack).

#

Still fumbling around on the inverted index. Experimenting with ways to not use OOP in python and/or leverage rust for faster reads.

meager siren
#

54 GB for the who thing of precomputed sparse vectors. But it also includes the TF, IDF, document length, etc on top of the final calculations.

#

Plus 800MB of additional misc data

meager siren
#

I asked GPT for some ideas on how to implement or go about abstracting optimizations for the inverted index. Got a lot of answers that shouldn't surprise me.

#

Will probably go back to looking at an over-engineered sharded Trie for storing the inverted index.

#

Only this time, no OOP abstraction for it.

meager siren
#

Posting my own message link here because BERT got an upgrade: #📚・articles-and-other-ai-resources message

meager siren
#

@timid sphinx Thoughts in on this design with the inverted index?:

  1. Use prefix trees (trie)
  2. 1 trie per starting character (26 alphabetical + 1 for "other")
  3. multiply across all files 200 files

Reasons for doing this:

  1. keeps trie files smaller (should mean smaller memory footprint)
  2. currently trying to compute tries is taking way too long (1 trie per starting char but sharded such that 1 trie has max vocab of 100K words).

Cons:

  1. that's A LOT of files to keep track of.
  2. would mean each trie for each file (200 files) would have to be queried, so a scaled up time at inference?
timid sphinx
#

Can you do a napkin calculation for sharded trie for vocabulary->word id mapping plus inverted index based on just word ids?

#

Split vocabulary based on frequency of occurrence

#

So the top 100k words go into the first trie

#

Then the next 100k words by frequency etc

#

If you split inverted index, split it across documents

#

The inverted index split is just to have easy multithreading

#

Actually... maybe split inverted indices by word frequency as well?

#

I don't understand why you insist on having a trie for inverted index rather than splitting them like above

#

I can show what I mean in C

meager siren
#

Trie offers compression for storage

#

That’s the main benefit

#

I also may have to adjust that number of 100K to 500K because the vocab is 42M large

timid sphinx
#

can you give me some more numbers? |V| = 42M, |D|=?, mean(#words_per_document)=?, mean(#documents_per_word)=?

#

also that's an insane vocabulary size. does it include numbers or something?

#

I guess it's multilingual?

#

|D|=40M

#

what's mean(#documents_per_word)=?

#

actually, I can assume on the order of magnitude of 1 document per word

meager siren
#

It's partially multi lingual. There are some vocab or symbols that were not removed during processing. Also did my best to remove any URLs from there as well.

#

Give me a sec to get you mean # doc per word

timid sphinx
#

and can you also calculate mean chars per word in the vocabulary?

meager siren
#

mean number of docs per word is 64

timid sphinx
#

some pretty crazy outliers probably like is or are

meager siren
#

those should have been picked up by nltk stopwords and filtered

#

but maybe www or other small nouns like cat or dog are throwing things off

#

mean word character length was 10.3

meager siren
#

I tried doing your way of having 1 simple inverted index as a basic map (had an ETA of maybe 6 hours). OOMed at around 50/200 files. Looks like sharding is needed.

#

This was also after I converted all document IDs from full path (str) to int.

#

Maybe I'll do 1 inverted index per file (200 files) and see how that goes. Fewer files and all that. Plus, I can still take advantage of mmap on loading.

timid sphinx
#

here is what I would use for the inverted index

#

I obviously haven't quite finished it (no tf-idf compute etc) and the search is a bit suboptimal

#

you need 4 + (|V| + 1) * 4 + A * |V| * 8 bytes for the whole index, where A ~= 64 (the mean docs per word)

#

20.1GB index

#

(4 + (42000000 + 1) * 4 + 64 * 42000000 * 8) / 1024 / 1024 / 1024 = 20.1

#

the largest adjustable factor is 64 * 42000000 * 8

#

so we need to pack (doc_id, freq) pair

#

frequency is a small number

#

we can probably pack things into ~5 bytes, maybe 6

#

which would reduce the size down to 15 GB

#

this should be exceptionally fast

#

like I would expect milliseconds per query

meager siren
#

This will need some adjustment. For instance you have the word2int mapping and vice versa but doc2int (and vice versa) is not tracked

timid sphinx
#

Yeah for sure

meager siren
#

And just a quick point of clarification, by document, you're referring to the individual articles right? not the physical files?

timid sphinx
#

yes

#

or whatever is the unit containing words for the purposes of searching

meager siren
#

Not sure why token_freq needed to be a thing when you can use a set() to create the list of unique words for word2id.

timid sphinx
#

It needs to be

#

Because you want to sort them by frequency

timid sphinx
#

Because if you only touch files with more frequent terms, you don't need to load as much data into memory. Plus it might help with cache locality, since you will be mostly accessing the same memory regions.

#

It's not an obvious improvement, especially considering query distribution of word frequencies usually != document distribution of word frequencies.

#

But I doubt it hurts anything

#

If you use mmap, the OS will likely cache with some sort of LRU policy, which means that most of the beginning of the file will be in memory all the time. You can think of mmap as OS-controlled caching mechanism.

meager siren
#

Managed to get 22GB with file- sorted words with msgpack. Working on implementing your term frequency sorted example.

#

Still have to chunk to multiple files because of OOM but I think I'll start with 1M terms per file.

timid sphinx
#

Why do you get OOM again?

meager siren
#

I'm not sure

#

I just recall when I was trying to put all in one file, I got OOM after 1 hour

#

Probably a bunch of crap also loaded in memory (this is all a part of the same larger preprocessing script)

#

Shit, I was playing with multiprocessing vs multithreading to speed my read up but I just hit OOM

meager siren
# timid sphinx Why do you get OOM again?

Running the processing to align with your design. ETA is about 82 hours (so maybe 4 days?). RAM usage holding steady at around 35GB with a peak of around 42 GB at the moment. I chunked the vocab to a size of 1M per file and have it sorted by document frequency.

#

Hope it doesn't OOM after an hour or anything like that.

meager siren
#

Had to restart because of a bug in my code (I wasnt returning the sorted vocab list properly)

meager siren
#

I'm also noticing the first file is using A LOT of RAM now that it's correct (hitting 44GB and holding but I'm only on file 18/195)

#

I may have to drop my chunk size from 1M to 500K and see how that goes if this thing does OOM.

meager siren
#

ETA is looking to be a lo longer than 4 days

meager siren
#

Ok, I tried 500K chunk size. Still got OOM. Pushing it down to 300K

meager siren
#

300K still got OOM. Now down to 100K. If I dont see any improvement, IDK what to do. at 100K words per file, that puts the number of files generated at around 430 (for a 43M word vocab).

#

Maybe have an uneven chunking? Smaller chunks for more frequent words and larger chunks otherwise?

west pagoda
#

In python any multithreading is gonna get OOM

#

All data in threads is a copy not a refrence or pass... So yeah 100k multiplied by No of threads is possibly OOM level

meager siren
#

and I'm waking up to find that 100K is causing OOM.

west pagoda
#

100k of string data

#

Magbe do some code memory profiler if there is one

meager siren
#

And it's sorted most common (number of documents the word appears in) to least

#

@west pagoda To give some context, this is the top 20 vocab terms (yes, they've lemmatized/stemmed) and how many documents they appear in

west pagoda
#

Thats a lot of documents so are these categories going to have duplicates as in one and two are two categories but would a document with one also have two and two have one

meager siren
#

Assuming 32bit int (Python doesn't do that but let's just assume for simplicity), at 81K docs per the top 100K words, that's 32,400,000,000 bytes

#

or 32 GB

#

Just for the data in the map, not including other data that's loaded to help compute it.

#

@timid sphinx Thoughts on this?

Given the rough napkin math to assume the amount of space it takes just for the document IDs to be stored, would it make sense to chunk in a non-uniform way? ie chunk 1 is 10K, chunk 2 is 50K, chunk 3 is 100K, chunk 4 is 500K, chunk 5 is 1M, and then we keep 1M for the rest?

#

This does hinder the original idea because it requires more file reads at runtime BUT I'm hitting that OOM on just the top 100K words is concerning

meager siren
#

Ok, I refuse to believe this thing is OOMing on 50K most use words.

#

Each chunk size is OOM'ing on the same file too (49/195). I can confirm the variable I am using for the chunk size is changing....

timid sphinx
#

Lol

timid sphinx
meager siren
#

I'm not sure. I made the assumptions on my napkin math because Python handles data like ints differently and even the sys.sizeof() isn't recursive to the whole object you measure with it.

timid sphinx
#

I mean the Wikipedia files

meager siren
#

Oh

#

Raw data in its xml format is like 95GB

timid sphinx
#

Where do I get them and can I try running your scripts?

meager siren
#

One sec

timid sphinx
#

I have a machine with 128gb of memory

timid sphinx
#

It is possible to do all of this using disk but the algorithms are annoying to implement.

#

E.g. you can do merge sort recursively to sort

meager siren
meager siren
#

I am not sure of the runtime that would take though

timid sphinx
#

I'll try running it later today once I am at my pc.

#

The script is in the repo as well?

meager siren
#

Yup

meager siren
#

script is precompute_sparse_vectors.py

#

Program OOM'ed on chunk size of 10,000 but it did make it to file 59/195 before hitting that limit

meager siren
#

OOM on 5K vocab chunk size. Same limit as 10K...

meager siren
#

Bunkered it down to 1K chunk size. 1.33 hours in on that first chunk (the biggest) and it's still going (89/195 files) with 56/64 GB used.

#

Really hopeful that it makes it all the way through without OOM.

meager siren
#

It OOM'ed on 1,000 words

#

stopped at 61% of files (119/195)

#

that's the farthest it's ever gone but still very concerning.

#

I think I'm going to get up tomorrow and try and chart the word frequencies in matplotlib and see how that looks.

meager siren
#

I got some interesting results from graphing out the top 500, 1000, and 5000 most frequent terms in the vocab.

@timid sphinx I think the appropriate way to go is to chunk the vocab based on aggregate number of documents (ie max_num_aggr_docs = 50_000. This should help avoid the OOM issue I think (again, at the cost of possibly more files "up front" with the more frequent vocab terms).

#

Also, I had some dataframe variable up in the program before this and deleting it + calling garbage collection helped drop the memory down from 22 GB to 10 GB before it even got to calling the inverted index code. Or not...

timid sphinx
meager siren
#

XD

meager siren
#

Running with max aggregate number of documents at 500K. So far so good. Unfortunately this does result in the generation of 123 files and the first file generated was hovering at around 350MB in size.

I wonder if I should adjust that threshold from 500K to 1M documents.

meager siren
#

123 files x 2 hours per file puts this at around 10 days... Not looking forward to that. Hope this is everything I need.

#

Especially considering that my break is about just as many days.

meager siren
#

Came back from my gf parents' house. Happy to see my inverted index is at file 75/123 (around 60% completion)

meager siren
#

Inverted index finished just before midnight for new years. 123 msgpack files at 13 GB (each file averaged at 400 MB in the beginning).

#

All these metadata files adding to more than the storage that is being used for the original data in xml form (125GB vs 95 GB).

#

8 ish days aint so bad.

meager siren
#

I'm rerunning the script again, this time with a aggregate chunking of 1M documents instead of 500K. Memory overhead looks pretty much the same so far. Number of expected files is down from 123 to 62. May explore more sizes (1.5M, 2M, 2.5M) depending on the resulting maximum file sizes. Would be ideal to keep files under 1.5 or 2 GB due to latency from file IO.

#

ETA per chunk is still around 1.5 to 2 hours to cover all 200 (parquet) files.

meager siren
#

@timid sphinx So the sorted inverted index worked much better than the file level inverted indices. I aggregated words such that the max total number of documents contained within a file was 1M (ie sum of all word to docs was 1M docs total).

This gave me 62 files and ran in around 4 days (compared to the above aggregation limit of 500K docs which took 8 days and gave 123 files) and it still takes up the same amount of space.

Comparison of the file-level inverted index vs sorted level inverted index:

Total storage usage
File-level: 15 GB, 195 files, max file size 130MB
Sorted: 13 GB, 62 files, max file size ~800MB

Runtimes:
File-level (used parallelization/concurrency):
# 4 processors
# - Small sample: 4.5 minute mean time (360s or ~6 minutes mean total)
# - Medium sample: 7.5 minute mean time (550s or ~11 minutes mean total)
# - Large sample: 15 minute mean time (1000s or ~20 minutes mean total)
Sorted (serial only):
# - Small sample: 4 minute mean time (240s or ~4 minutes mean total)
# - Medium sample: 4.75 minute mean time (280s or ~4.5 minutes mean total)
# - Large sample: 6.25 minute mean time (380s or ~6 minutes mean total)

#

My conclusion is that the reduced file IO and sorting definitely helped bring down those runtimes. Also, the sample sizes were a random sampling of 10, 50, and 150 words from the vocab (unweighted/even weight).

meager siren
#

That said, I'm thinking of winding down/shuttering the project. I still don't feel anywhere closer to the level of speed I feel this needs to make it even semi viable (ie answer full query in under 5 - 10 minutes). I also feel like I've maximized all avenues of algorithms and methods that combine speed and compression to allows this project to even exist on a consumer level PC. And last but not least, it's coming up to about a year since I started the project. I kinda wanted to cut down the time I spent on projects from years (literally) to under 1 year, starting with this one. Sometimes a hard part is knowing when to call it quits.

I'm going time 1 full run with this system up to the Vector search stage (hell, I may even do the vector search stage if I feel good enough). That will be Inverted Index + TF-IDF/BM25.

meager siren
#

Still working on getting that single run going. Inverted index hovering at around 4.5 to 5 minutes to run. Returns 12M documents for one of my queries. Takes about 1.5 to 2 minutes to organize which parquet files and article hashes to target (rather than reading all 195 parquet files) which could definitely use some acceleration with rust but I don't want to think about that now. Still waiting on how long it takes to run through all the targeted parquet files (using multithreading with 8 thread workers).

#

It is interesting to see that it's taking a lot of time and I'm not seeing any indicators showing once moving onto the run through the parquet files. No progress bar or cprofiler details after what is definitely been more than five minutes. Wonder if it's something about the multithreading that is affecting the reporting.

meager siren
#

With what I have now, I'm somewhat optimistic about the condition of the sparse vector search.
Inverted Index search -> 4 to 5 minutes
Organized retrieved documents -> 1 to 2.5 minutes
Aggregate and sort TF-IDF for retrieved documents -> 1.5 hours (single thread/processor; mean of 30s per file for 195 files)

That last part I'm looking at in terms of parallelization and concurrency. I figure it's about time to go full mask-off and maximize the usage of the 64GB RAM as that is technically within consumer hardware limits (albeit on the higher end).

meager siren
#

Alright, I'm calling it. TF-IDF is taking around 1 hour regardless of single processor/thread or multithreading (and multiprocessing is going to OOM even at small numbers). I'd say the only thing that could save this would be to do a total rewrite of the engine in rust. But seeing as I'm still very much learning rust and there is A LOT of code to go over a rewrite, that's just not in the cards for me at this time.

I'll be updating any notes to the project at this point as well as updating all metadata files in the huggingface repo. I'll also be sharing the respective links (repo included) in #1145719762438074368 .

Overall, it was a fun project that taught me a lot. It was really interesting to tackle search at scale with limited resources (still much more than what most people have). I think I'll probably be cannibalizing a lot of the scripts and lessons over for smaller RAG projects, as well as use this for a downstream "research" project (ie just have a lazy/general search engine that does keyword search for the top N articles from wikipedia around a certain topic, where retrieval time is not an important factor as the records are just being pulled as part of a reference DB for another project).

dusty vector
#

Hello buddy, hows it going?

Have you looked anything at vector optimization? If the compute limit is hard to pass by, why not focus on making the vectors smaller?

Its not required for a better mathematical vector computation, just a better way to re-organize ans structure the vectors in a smaller size

dusty vector
timid sphinx
dusty vector
#

but how do we make them dynamic per search is not covered by any algorithm

#

?

#

that is we keep changing the centre terms and values per search

timid sphinx
#

I am not sure what you mean. You search based on a fixed vector and preprocess only the documents

#

The documents get put into an efficient index

#

The query doesn't, because it is already small

dusty vector
timid sphinx
#

There are two pieces of information here, query and document embeddings. There is some function m(q, d) that measures similarity.

#

m doesn't change with q, d doesn't change with q, yet m(q, d) changes when q changes

#

You don't adjust gigabytes of data for each query, you structure your calculations such that you don't need to

dusty vector
#

the similarity function happens always on the same vector space

if the vector space was re-structured it could make the processing faster? some way of assigning priority instead of re-calculating the vectors

timid sphinx
#

Yes, that's what approximate vector search does

dusty vector
#

is it similar to ScaNN?

timid sphinx
#

It restructures / indexes the document vectors such that search is faster

#

Yes

#

The indexing they do doesn't depend on any one query, it depends only on the documents themselves, so you don't have to do it more than once

dusty vector
#

And while creating the vectors, is there any method they follow to condense the word vector value?
Like say for example i can initially group all synonyms as one same value

And then next step these condensed values are used for creating vector soace

timid sphinx
#

The methods used do not look at individual words, but at embeddings of the entire sentence / paragraphs / chunks

dusty vector
#

oh yes

timid sphinx
#

You can do it at word level, but it makes little sense

dusty vector
#

lets consider a sentence

#

similar sentences grouped as one value

timid sphinx
#

Most methods do exactly that

#

You build a bunch of nested buckets such that similar items are in the same bucket

#

How you do it is a different question

#

You can probably do agglomerative clustering as you are describing, but it is expensive

#

Since it requires finding 2 nearest neighbors repeatedly, which is n log n in any fixed dimension count but probably slower in higher dimensions

#

I might be off on the exact complexity

#

Just read the FAISS paper

#

They describe a lot of techniques they use

dusty vector
#

the embedding part - to make it also efficient, the chunking of sentences is done in some way that it can be summarized as a single sentence/word?

that is the chunks are again reduced

dusty vector
timid sphinx
#

In practice often not and that hurts the performance

dusty vector
#

but all these reductions are very expensive like you mentioned

dusty vector
#

though it is only one time compute expensive - we only need the space constructed once at start

timid sphinx
#

Yes. And creating a good vector index is not always a matter of just splitting strings and computing embeddings

#

Though that is exactly how it is portrayed in most blogs lol

#

Vector search database company money at work

#

Indexing (as in optimizing vector search) is the least important part of most applications and will have the least impact on performance (assuming there is an LLM in the pipeline). Coming up with good vectors to begin with will have a much greater effect on quality.

dusty vector
timid sphinx
#

Quality - primarily you care about some sort of Recall@k or MAP@k or MRR@k metrics

#

Recall@k - is your recall of the correct information if you take top-k results

#

MRR@k is the mean reciprocal rank of the first correct answer (@k adds a restriction to top-k items)

#

MAP@k measures average precision within the set of top-k results and then computes its average across all queries

#

Personally I like MRR and Recall the best since usually you care more about finding the right information and not making sure that all top results are relevant

#

In other words, there is usually one correct document/chunk for a query, not multiple.

#

But obviously depends on the task

timid sphinx
#

But you have many other quality metrics for the whole system. For example, are you chunks semantically coherent or are you pulling extraneous information / cutting things at inappropriate boundaries? Are you pulling too many irrelevant chunks? Are you pulling too many duplicate/almost duplicate chunks? Does the whole system answer questions correctly?

#

There are other metrics used for search quality as well btw, I just listed the top 3 that came to mind.

dusty vector
timid sphinx
#

I don't think intent matters when evaluating a query in isolation, but if you are aiming for user satisfaction, it does.

#

It's just much harder to capture good training data for this.

#

You have to ask people to query the system and then stop them and ask what they actually meant

#

maybe with an LLM you could do that semi-automatically

#

like have an LLM come up with plausible user intents and then ask the user "Do you want to: a) ... b) ..."

broken summit
#

Why not using the Elastic or OpenSearch Docker images? You'd be spinning them as language-independent REST endpoints, which means that you can communicate with them with any tool (e.g. curl) or library (e.g. requests) supporting HTTP(S). You just need to install Docker (and assign >= 4GB of RAM to the images, depending on the number of vectors you want to store, and their dimensions).