#Best Vector DB for Local Use
2390 messages · Page 3 of 3 (latest)
uh i meant at an OS level
second, the incrementation of the db size is scaling accordingly to the number of checkpoints
Why at the OS level? lance is written in rust and should not be unsafe at all.
they clearly mentioned they cant do writes safe beyond a limit
Yeah but my data is well beneath that limit
hmm
with my data, (assume 6 kb per entry), it would be 325.000 entries per table.add()
Kinda
I just add up all the new batches into a big chunk and when I reach a safe point (100,000 for my program), I call table.add() to that data and reset the stuff after before repeating again.
how yo know if any embedding is corrupted
No fancy parallelization with multiprocessing/multithreading but it's the best of what I can do
Two ways. 1) "bad" embeddings cause the program to crash if they try to get added to the table. So I check for them and skip them/prevent them from getting into the big batch chunk
other way is on the reading of the table which is more manual and inspecting the embedding returned
oh ok so there is clearly a difference when you read a bad embedding
so basically you could do a check intermediatelly for bad embeddings and clean it up
No need to clean if it doesnt go into the db
right
But in those cases i have found a bad embedding, the problem was in the string itself
uh
ie it was an empty string or a string of white space
and the produced embedding would be NaN as a result
are you doing a check before creating embedding now
Im preprocessing/remove such strings from the categories list before we even search the vector db for existng entries.
thats odd
and also, like I mentioned before, in the case that NaNs are detected, I just skip adding that batch to the big batch chunk
stuff that wasn't preprocessed out of the text dataset. Didn't come from the wikipedia tree download
@inland lynx Should I tempt fate and try and get my embedding script to try and parallelize? The idea is that for the big embedding loop:
- chunk categories into sizes of 100,000
- pass those chunks to processors to get 100,000 embeddings (per processor)
- Wait for processors to finish their chunks
- upon aggregation (back on the main thread/processor), iterate over each processors 100,000 category, embedding pairs and write each one to table (one chunk at a time).
- move to the next set of chunks and repeat starting from step 2.
Unfortunately, I cannot apply the same thing to running the global search to count which categories have already been embedding (so that is going to be VERY slow -> I was getting 30 minutes with a semi populated table).
Cuz at this point, I'm at around 48 hours into the run time and only 2M out of 4M categories have been embedded.
Back when I was running multiprocessing, I was done in under 24 hours.
maybe you could dial the notch down and reduce the batch size for parallel by a half for each processor
I mean what is the best load for your h/w? For all the processors to be 80% running at the same time. We can target max 80% load on each processor
Best load is 256 to 386 texts at a time on the card (puts it at around 14GB VRAM)
So you either do 16 processors x 16 batch size or some variation that gets you 256
I got the 16 x 16 (processors x batch size) to run in around the 21 hours I had before.
So far I don’t think it’s a bad run
So another note to myself is that searches will default to a limit of 10 results. Gotta specify the limit to what you want/need. Found this out the hard way when running my search for existing keys/categories and I only kept getting back 10 results instead of 100,000 (based on my SQL query parameters).
Also, that check after the DB is supposedly full takes around 9 hours with no parallelization (multithreading/processing). Oof.
how much time is it going to take for the 100k result?
I’m off for a week to go home but iirc, it was about 2 min on my server CPU
IDK. I'm worried that 100K will be too much. To be fair, the vector retrieval itself should be sparse for most other parts of the actual search inference.
right, we ll start in smaller batches and see...
we may need to optimize the way we are having the graph of categories tbh...its not optimal storage design
oh cool enjoy buddy
I figure the search through the table would be more along the lines of pulling 10K instead of 100K
why are there so many deleted messages here? it's hard to read 😭
I havent deleted much. But I figure this thread is still good to carry the convo around this work so I keep contributing to it.
I'm back from all the school and vacations to work on this project. I have added parameters to scale the category processing and embedding as necessary but I still need to append the "missed" categories to the downloaded category tree.
Fine. I've got the code for that but now I'm running into the issue from my vector DB such that my search is returning nothing. Specifically, the query to the vector DB for the nearest embedding given the query embedding and the restriction on categories.
Code for this section:
for node in tqdm(missing_categories):
# Get current category node embedding.
_, embedding = table.search()\
.where(f"category = '{node}'")\
.limit(1)\
.to_list()
# Search for "closest" node embedding from the graph.
best_category, _ = table.search(embedding)\
.where(f"category IN ({nodes_for_query})")\
.limit(1)\
.to_list()
# Create an edge in the downloaded graph connecting the "best
# matching" category (from the original downloaded graph) to
# the current category node from the "missing" categories list.
downloaded_graph.add_edge(best_category, node)
I'm away from my server for a bit today but I'm thinking that, once I have the vector DB initialize and populated, I train/create the ann_index with the table.create_index() function and see if I can query the vector DB appropriately and get back some result.
Ok, so querying after initializing the index helped but the .where(SQL) line is still over filtering and returning nothing.
Current plan is to build a second table based on only items from the downloaded graph (query from main table, no recomputing embeddings) and have that be the “filtered” table. I then use the filtered table to help build similarity between missing categories and those in the downloaded graph.
Why do this? Well, because I can’t find a way to properly filter the table in a few commands so this will have to suffice for now (at the cost of doing the same steps as part of the initial table read).
I'm starting to wonder if the 10 deep category tree downloaded was the right idea. It is taking an immense amount of processing/compute time to do the following:
- embed all categories (local and downloaded).
- verify which and all categories are embedded.
- create/isolate a filtered table of embeddings from the downloaded category tree.
- perform ANN between local categories and downloaded ones.
To my understanding, there are around 4.6M total categories under consideration with around 3M being from the local dump and the rest from the 10 deep category tree.
I have category trees for depths of 2, 3, and 5 as well as depth 10. Those trees will have considerably less nodes. The cost of this is a significantly flatter tree with more documents being returned (not as fine tooth comb as compared to the depth 10 tree). Maybe it's worth considering over the depth 10 model.
Running on depth 5, there seems to be 3.8M total categories vs depth 10 4.6M.
Maybe it would be a good idea to revisit the problem of identifying the minimum number of categories needed for full document coverage.
@inland lynx can I get your thoughts on this?
Building the category tree by connecting the downloaded tree to the categories from the dumped files is taking a long time. It takes a day just to embed all 4.6M categories (depth 10 tree + dump). From there, I have to map around 3M "dump" categories to the downloaded tree categories. Even with parallelization and caching, this is taking a really long time for each time I run the script and get farther into the program.
So I was thinking of some alternatives.
- Cluster the dump categories to reduce the number way down from 3M
Pros:
- Reduces number of clusters to map from the downloaded category tree
Cons: - Requires there be another map for groups to category names
- Build my own category tree with just the dump data
Pros:
- Allows me to disregard the categories from the downloaded tree (benefits scale with that tree's size)
Cons: - Still very compute expensive to deal with 3M categories
For the second option, I was considering clustering the categories based on their computed semantic vectors with k means clustering (maybe implement the algo myself). I also considered implementing HNSW too which would be a "fast" abstraction of the tree. While researching what I'd need for each algo, I just realized something: Doesn't LanceDB already do a lot of this?
And that brought my to option 3:
3) Store all dump categories + embeddings to lancedb. Query with a high K limit (ie k = 100 or 1,000)
Pros:
- Drastically simplifies things. No need to look at downloaded category tree or even category trees in general
- LanceDB already does the work.
- Disk based thanks to Lancedb
Cons: - Not as "audit-able" as the tree structure (but still does the job)
- Don't have any metrics on query time.
Looks like this is pretty good with existing categories (no brainer) BUT using raw text does not translate well in this system. Going to have to see if keyword extraction can help with this.
Also, storage is only costing 13GB so that's nice
The direct retrieval of the categories via the vectors is pretty fast, even for high number for top_n (ie 1,000). Downside is that I'm unable to translate this success to raw queries (ie "what does the fox say?" will not give good categories around foxes nor speaking).
To try and fix this, I'm thinking of doing some form of key word extraction from the query and try to run the vector search that way. Maybe I'll get something better.
@timid sphinx I think I found one of the issues. I have an inverted index formatted as a prefix tree that is initialized as a class (tree class and node class) with the load/save being recursive (stored in msgpack files).
Profiler is giving me around 5+ minutse for load trie + retrieve documents tied to word.
That's one part (even with multihreading) is taking too much time
Hehe I am just glad I guessed roughly right
Welp, now I'm on to figure out a new inverted index structure/storage solution
Was pleasantly surprised to see that parquet files saved 3GB of storage (total) comapred to msgpack (20GB vs 23GB). That's nice
I ran the profiler again using pandas. Good news is that the whole process of
- Load parquet data
- filter "redirect" documents
- compute TF-IDF
- compute cosine similarity
- update top-n heap
is coming in really fast at a low-low time of 165s (that's 1/2 the time it took just to initialize the prefix trees in the inverted index). Downside is that it's still way too slow for that section and that is per file (with ~200 in the database I have).
Well parquet is not exactly lightweight. And you are still reading from disk in a loop?
Unfortunately
I'm still shopping for a solution like lancedb that has low fileIO cost
1 file is read once. 1 file contains around 200K articles iirc
Nope
I'm not surprised that most of these ops are around the pandas dataframe operations
Current rate puts things at around 10 minutes per query just for TF-IDF
Substantially better compared to the 2+ hours I had before but still not great
How much time do you think I can cut down if I precompute TF-IDF for each doc, word pair?
Effectively making this a reading task?
I think it's these lines here:
doc_vectors = df_doc2words.groupby("doc")\
.apply(lambda group: create_aligned_tfidf_vector(group, words))
# Compute the document cosine similarity.
# doc_vectors["cosine similarity"] = doc_vectors.apply(
# lambda vec: cosine_similarity(vec, query_tfidf_vector)
# )
results = doc_vectors.reset_index(name="tfidf_vector")
results["cosine_similarity"] = results["tfidf_vector"].apply(
lambda vec: cosine_similarity(vec, query_tfidf_vector)
)
where doc_vectors is getting the full TF-IDF vector for a document
and doesn't return a full dataframe. <- this is where I think I get the most trouble,
I'm looking at my code and I'm not sure why I didn't just precompute the word level TF-IDF/BM25 values and store that in parquet files. It would probably be 2x as fast right now to just "read" and aggregate the target values for each method instead of compute it dynamically. I think the only reason I wanted to have that compute on inference time was in case values had change or something.
Yeah, you should precompute them and put them directly into the index
Or separately I guess. Just don't use pandas so much. Convert them to numpy vectors at least.
I can do TF-IDF but now I am realizing why I had BM25 computed at runtime instead of precomputed. BM25 has two hyperparameters k1 and b that have an affect in the formula.
Should I just put those hparams as part of the config.json file I use and let it run from there?
You just need to get rid of any non-O(1) or just slow parts of this
E.g. indexing into pandas is slow - use numpy array
You have a hot path - the part that takes too long to run (preferably also the part that runs in a loop). On that path you need to eliminate all excessive complexity. That could be pandas (and polars), any complex computations - precompute as much as feasible, any file access or complex file parsing. Anything from the top of the profiler results has to go. But get a flame graph visualizer
Thinking aloud, even with optimizations of numpy array, the sorting is going to take n log n runtime because you have to sort the vector values. At a scale of 40M values to compare....
This reminds me again why I was looking at different inverted index methods to try and filter that down.
Adding the extra steps means more storage and compute... Ugh, I'm not sure how to sort this out.
What are you sorting?
Sorting the cosine similarity/aggr BM25 value
Current performance has me considering expanding the scope of Minimum hardware specs for the project. 4 threads/processors should yield final results in around 2.5 hours, 8 thread/processors in around 1.25 hours, so on with 16 threads/processors and so forth (2.5M articles per worker at that point).
Why are you sorting and not selecting top N?
what happened so far? so many messages
wdym? need to return the most relevant documents for the query
yeah this. why is there no numpy arr
looks like not a h/w issue just the code
It’s n log n regardless of sorting or utilizing a heap with a capped length
I think my problem is more of the n aspect
40M articles to scan through is A LOT in general.
Even if the insert/sort is log k for the heap
Well what do you do in that loop?
I am still confused how it can be so slow. Plus if you have an inverted index, the number should drop drastically based on the search terms. Cosine similarity is trickier.
But then again, it's python. Even just porting to Java I used to be able to get 50x over python
The for loop is just a blind iteration through all files (200 files x 200K docs per file) + the inference compute of TF-IDF/BM25 given TF(doc, word) and IDF(word)
I had the inverted index initially connected BUT I took it out because it was in that OOP trie structure that was taking ages to deserialize back to its full form.
I'm looking at my options for what I can do with just a simple {term, [doc, doc, doc]} structure without using a compression technique like tries
Best I can come up with is another parquet table with columns term, file, docs (SHA1 hashes)
I was pondering additional compression by taking the doc or file strings and converting them to ints and keeping a decoding map for each one. It would keep memory down compared to strings but takes additional costs to lookup in the the map (even if it's O(1) lookup, for M (M < N) docs you'd still get an O(M) overall runtime).
The original format was a sharded trie (1 trie per starting character). All tries were nested dict structures
Omg
That 5+ minute runtime for tries was with the OOP deserialization on a heavily sharded set of tries (I had done additional sharding to keep file size down so that the deserialization could go faster)
It's so overengineered
You should have a vocabulary of words, a single trie. Loaded once.
Everything else should be plain arrays
Documents should be processed once and converted into a weighted list of word ids
The vocab is too large
I kinda did
But after working on this for over 6 months, I'm not sure.
My server has done a lot of the compute work because it'll give results faster
I am sure it is possible
especially preprocessing
I bet it can run on commodity hardware but the way you are doing it in python...
This is sort of where I was hoping something like category grouping could help streamline search results down better than the plain inverted index.
The python stuff is also why I'm learning rust.
If you want things fast, you can't read from disk so much, you can't use so many tries (bad memory access patterns), especially made out of python dicts, you shouldn't use heavyweight serialization libraries and heavyweight data structures (e.g. dataframes)
Disk access is extremely slow compared to memory
Especially with hdds
I'm still following except for the "read from disk" aspect. Commodity hardware these days is still floating at around 8 - 16 GB RAM (even though it's not expensive to upgrade unless you have a laptop)
Mmmm well at least don't do heavy deserialization. Difficult in python I guess
Idk how realistic it is to have a mmaped data structure in python
I wonder how software like elasticsearch does it
(first step probably is 1: use C or rust)
Oh damn, their repo is 99% Java
So effectively, here's what I'm thinking from the feedback.
- Precompute EVERYTHING you can. Because all you want to do is just read those values (because diskIO is going to be your first and biggest bottleneck when working around commodity hardware's memory restraints).
- Load only what you need and do it to the base data structures of the language.
- Swap for rust (or other faster languages) where you can
I'm still unsure what to do for the inverted index as that's still tricky given its scale (size of vocab and I guess the documents)
You are using some insane data structures for it
Serialized as an array it should be much smaller
Not python array btw
I am pretty sure it is an array of pointers underneath the hood
Ideas on an effective way to store this to disk?
42M words, 40M docs... Keep it all as strings and the mappings as pointers...
That would save on in-memory. But I'm just failing to grasp how to serialize this to disk
with the intent of loading the data from disk once and leaving it in memory
You can split a trie into multiple tries actually. One for most common terms
See if there is a mmap abstraction that allows reading various sized types at different locations
Encode the whole datastructure as an array of bytes
I have the precompute script running right now and I'm amazed at how long it's expected to handle computing the TF-IDF and BM25 for just the one file (out of 200)
This was probably another reason why I optied for runtime computation of the values over precomputing....
I think I need to just jump head first into the rust implementation right now.
I also just found out that polars does not play well with multithreading from the main python branch (probably because it already is multithreaded underneath by rust)
I found out my implementation for generating the precomputed TF-IDF and BM25 was really stupid, hence the super slow ETAs. Fixed everything by leveraging pandas dataframes to pick up the slack and things are going MUCH faster. Going to bed soon but the work should be done long before I get up tomorrow. Data files are currently at around 43 GB but I suspect the dataframe is containing everything at the moment (file, document, word, term frequency, inverse document frequency, TF-IDF, BM25), hence the bloat (the raw document to word frequency mappings were at around 23 GB in msgpack).
Still fumbling around on the inverted index. Experimenting with ways to not use OOP in python and/or leverage rust for faster reads.
54 GB for the who thing of precomputed sparse vectors. But it also includes the TF, IDF, document length, etc on top of the final calculations.
Plus 800MB of additional misc data
I asked GPT for some ideas on how to implement or go about abstracting optimizations for the inverted index. Got a lot of answers that shouldn't surprise me.
Will probably go back to looking at an over-engineered sharded Trie for storing the inverted index.
Only this time, no OOP abstraction for it.
Posting my own message link here because BERT got an upgrade: #📚・articles-and-other-ai-resources message
@timid sphinx Thoughts in on this design with the inverted index?:
- Use prefix trees (trie)
- 1 trie per starting character (26 alphabetical + 1 for "other")
- multiply across all files 200 files
Reasons for doing this:
- keeps trie files smaller (should mean smaller memory footprint)
- currently trying to compute tries is taking way too long (1 trie per starting char but sharded such that 1 trie has max vocab of 100K words).
Cons:
- that's A LOT of files to keep track of.
- would mean each trie for each file (200 files) would have to be queried, so a scaled up time at inference?
Can you do a napkin calculation for sharded trie for vocabulary->word id mapping plus inverted index based on just word ids?
Split vocabulary based on frequency of occurrence
So the top 100k words go into the first trie
Then the next 100k words by frequency etc
If you split inverted index, split it across documents
The inverted index split is just to have easy multithreading
Actually... maybe split inverted indices by word frequency as well?
I don't understand why you insist on having a trie for inverted index rather than splitting them like above
I can show what I mean in C
Trie offers compression for storage
That’s the main benefit
I also may have to adjust that number of 100K to 500K because the vocab is 42M large
can you give me some more numbers? |V| = 42M, |D|=?, mean(#words_per_document)=?, mean(#documents_per_word)=?
also that's an insane vocabulary size. does it include numbers or something?
I guess it's multilingual?
|D|=40M
what's mean(#documents_per_word)=?
actually, I can assume on the order of magnitude of 1 document per word
It's partially multi lingual. There are some vocab or symbols that were not removed during processing. Also did my best to remove any URLs from there as well.
Give me a sec to get you mean # doc per word
and can you also calculate mean chars per word in the vocabulary?
mean number of docs per word is 64
some pretty crazy outliers probably like is or are
those should have been picked up by nltk stopwords and filtered
but maybe www or other small nouns like cat or dog are throwing things off
mean word character length was 10.3
I tried doing your way of having 1 simple inverted index as a basic map (had an ETA of maybe 6 hours). OOMed at around 50/200 files. Looks like sharding is needed.
This was also after I converted all document IDs from full path (str) to int.
Maybe I'll do 1 inverted index per file (200 files) and see how that goes. Fewer files and all that. Plus, I can still take advantage of mmap on loading.
here is what I would use for the inverted index
I obviously haven't quite finished it (no tf-idf compute etc) and the search is a bit suboptimal
you need 4 + (|V| + 1) * 4 + A * |V| * 8 bytes for the whole index, where A ~= 64 (the mean docs per word)
20.1GB index
(4 + (42000000 + 1) * 4 + 64 * 42000000 * 8) / 1024 / 1024 / 1024 = 20.1
the largest adjustable factor is 64 * 42000000 * 8
so we need to pack (doc_id, freq) pair
frequency is a small number
we can probably pack things into ~5 bytes, maybe 6
which would reduce the size down to 15 GB
this should be exceptionally fast
like I would expect milliseconds per query
This will need some adjustment. For instance you have the word2int mapping and vice versa but doc2int (and vice versa) is not tracked
Yeah for sure
And just a quick point of clarification, by document, you're referring to the individual articles right? not the physical files?
Not sure why token_freq needed to be a thing when you can use a set() to create the list of unique words for word2id.
Because if you only touch files with more frequent terms, you don't need to load as much data into memory. Plus it might help with cache locality, since you will be mostly accessing the same memory regions.
It's not an obvious improvement, especially considering query distribution of word frequencies usually != document distribution of word frequencies.
But I doubt it hurts anything
If you use mmap, the OS will likely cache with some sort of LRU policy, which means that most of the beginning of the file will be in memory all the time. You can think of mmap as OS-controlled caching mechanism.
Managed to get 22GB with file- sorted words with msgpack. Working on implementing your term frequency sorted example.
Still have to chunk to multiple files because of OOM but I think I'll start with 1M terms per file.
Why do you get OOM again?
I'm not sure
I just recall when I was trying to put all in one file, I got OOM after 1 hour
Probably a bunch of crap also loaded in memory (this is all a part of the same larger preprocessing script)
Shit, I was playing with multiprocessing vs multithreading to speed my read up but I just hit OOM
Running the processing to align with your design. ETA is about 82 hours (so maybe 4 days?). RAM usage holding steady at around 35GB with a peak of around 42 GB at the moment. I chunked the vocab to a size of 1M per file and have it sorted by document frequency.
Hope it doesn't OOM after an hour or anything like that.
Had to restart because of a bug in my code (I wasnt returning the sorted vocab list properly)
I'm also noticing the first file is using A LOT of RAM now that it's correct (hitting 44GB and holding but I'm only on file 18/195)
I may have to drop my chunk size from 1M to 500K and see how that goes if this thing does OOM.
ETA is looking to be a lo longer than 4 days
Ok, I tried 500K chunk size. Still got OOM. Pushing it down to 300K
300K still got OOM. Now down to 100K. If I dont see any improvement, IDK what to do. at 100K words per file, that puts the number of files generated at around 430 (for a 43M word vocab).
Maybe have an uneven chunking? Smaller chunks for more frequent words and larger chunks otherwise?
In python any multithreading is gonna get OOM
All data in threads is a copy not a refrence or pass... So yeah 100k multiplied by No of threads is possibly OOM level
It's a single threaded/process operation
and I'm waking up to find that 100K is causing OOM.
It's 100K of words (mean length was something like 10 characters) but the kicker is that it's mapping to document IDs (int).
And it's sorted most common (number of documents the word appears in) to least
@west pagoda To give some context, this is the top 20 vocab terms (yes, they've lemmatized/stemmed) and how many documents they appear in
Thats a lot of documents so are these categories going to have duplicates as in one and two are two categories but would a document with one also have two and two have one
Assuming 32bit int (Python doesn't do that but let's just assume for simplicity), at 81K docs per the top 100K words, that's 32,400,000,000 bytes
or 32 GB
Just for the data in the map, not including other data that's loaded to help compute it.
@timid sphinx Thoughts on this?
Given the rough napkin math to assume the amount of space it takes just for the document IDs to be stored, would it make sense to chunk in a non-uniform way? ie chunk 1 is 10K, chunk 2 is 50K, chunk 3 is 100K, chunk 4 is 500K, chunk 5 is 1M, and then we keep 1M for the rest?
This does hinder the original idea because it requires more file reads at runtime BUT I'm hitting that OOM on just the top 100K words is concerning
Ok, I refuse to believe this thing is OOMing on 50K most use words.
Each chunk size is OOM'ing on the same file too (49/195). I can confirm the variable I am using for the chunk size is changing....
Lol
What's the size of the raw data?
I'm not sure. I made the assumptions on my napkin math because Python handles data like ints differently and even the sys.sizeof() isn't recursive to the whole object you measure with it.
I mean the Wikipedia files
Where do I get them and can I try running your scripts?
One sec
I have a machine with 128gb of memory
It is possible to do all of this using disk but the algorithms are annoying to implement.
E.g. you can do merge sort recursively to sort
https://huggingface.co/datasets/dmmagdal/enwiki-2024-04-20-metadata/tree/main/bag_of_words
You should only need the word_to_docs and doc_to_words folders in this one to run this precompute_sparse_vectors.py script
I do have inverted index computed file-wise. Maybe worth a shot to merge data so that it's organized word-wise
I am not sure of the runtime that would take though
I'll try running it later today once I am at my pc.
The script is in the repo as well?
Yup
script is precompute_sparse_vectors.py
Program OOM'ed on chunk size of 10,000 but it did make it to file 59/195 before hitting that limit
OOM on 5K vocab chunk size. Same limit as 10K...
Bunkered it down to 1K chunk size. 1.33 hours in on that first chunk (the biggest) and it's still going (89/195 files) with 56/64 GB used.
Really hopeful that it makes it all the way through without OOM.
It OOM'ed on 1,000 words
stopped at 61% of files (119/195)
that's the farthest it's ever gone but still very concerning.
I think I'm going to get up tomorrow and try and chart the word frequencies in matplotlib and see how that looks.
I got some interesting results from graphing out the top 500, 1000, and 5000 most frequent terms in the vocab.
@timid sphinx I think the appropriate way to go is to chunk the vocab based on aggregate number of documents (ie max_num_aggr_docs = 50_000. This should help avoid the OOM issue I think (again, at the cost of possibly more files "up front" with the more frequent vocab terms).
Also, I had some dataframe variable up in the program before this and deleting it + calling garbage collection helped drop the memory down from 22 GB to 10 GB before it even got to calling the inverted index code. Or not...
zipf law empirically obtained :P
XD
Running with max aggregate number of documents at 500K. So far so good. Unfortunately this does result in the generation of 123 files and the first file generated was hovering at around 350MB in size.
I wonder if I should adjust that threshold from 500K to 1M documents.
123 files x 2 hours per file puts this at around 10 days... Not looking forward to that. Hope this is everything I need.
Especially considering that my break is about just as many days.
Came back from my gf parents' house. Happy to see my inverted index is at file 75/123 (around 60% completion)
Inverted index finished just before midnight for new years. 123 msgpack files at 13 GB (each file averaged at 400 MB in the beginning).
All these metadata files adding to more than the storage that is being used for the original data in xml form (125GB vs 95 GB).
8 ish days aint so bad.
I'm rerunning the script again, this time with a aggregate chunking of 1M documents instead of 500K. Memory overhead looks pretty much the same so far. Number of expected files is down from 123 to 62. May explore more sizes (1.5M, 2M, 2.5M) depending on the resulting maximum file sizes. Would be ideal to keep files under 1.5 or 2 GB due to latency from file IO.
ETA per chunk is still around 1.5 to 2 hours to cover all 200 (parquet) files.
@timid sphinx So the sorted inverted index worked much better than the file level inverted indices. I aggregated words such that the max total number of documents contained within a file was 1M (ie sum of all word to docs was 1M docs total).
This gave me 62 files and ran in around 4 days (compared to the above aggregation limit of 500K docs which took 8 days and gave 123 files) and it still takes up the same amount of space.
Comparison of the file-level inverted index vs sorted level inverted index:
Total storage usage
File-level: 15 GB, 195 files, max file size 130MB
Sorted: 13 GB, 62 files, max file size ~800MB
Runtimes:
File-level (used parallelization/concurrency):
# 4 processors
# - Small sample: 4.5 minute mean time (360s or ~6 minutes mean total)
# - Medium sample: 7.5 minute mean time (550s or ~11 minutes mean total)
# - Large sample: 15 minute mean time (1000s or ~20 minutes mean total)
Sorted (serial only):
# - Small sample: 4 minute mean time (240s or ~4 minutes mean total)
# - Medium sample: 4.75 minute mean time (280s or ~4.5 minutes mean total)
# - Large sample: 6.25 minute mean time (380s or ~6 minutes mean total)
My conclusion is that the reduced file IO and sorting definitely helped bring down those runtimes. Also, the sample sizes were a random sampling of 10, 50, and 150 words from the vocab (unweighted/even weight).
That said, I'm thinking of winding down/shuttering the project. I still don't feel anywhere closer to the level of speed I feel this needs to make it even semi viable (ie answer full query in under 5 - 10 minutes). I also feel like I've maximized all avenues of algorithms and methods that combine speed and compression to allows this project to even exist on a consumer level PC. And last but not least, it's coming up to about a year since I started the project. I kinda wanted to cut down the time I spent on projects from years (literally) to under 1 year, starting with this one. Sometimes a hard part is knowing when to call it quits.
I'm going time 1 full run with this system up to the Vector search stage (hell, I may even do the vector search stage if I feel good enough). That will be Inverted Index + TF-IDF/BM25.
Still working on getting that single run going. Inverted index hovering at around 4.5 to 5 minutes to run. Returns 12M documents for one of my queries. Takes about 1.5 to 2 minutes to organize which parquet files and article hashes to target (rather than reading all 195 parquet files) which could definitely use some acceleration with rust but I don't want to think about that now. Still waiting on how long it takes to run through all the targeted parquet files (using multithreading with 8 thread workers).
It is interesting to see that it's taking a lot of time and I'm not seeing any indicators showing once moving onto the run through the parquet files. No progress bar or cprofiler details after what is definitely been more than five minutes. Wonder if it's something about the multithreading that is affecting the reporting.
With what I have now, I'm somewhat optimistic about the condition of the sparse vector search.
Inverted Index search -> 4 to 5 minutes
Organized retrieved documents -> 1 to 2.5 minutes
Aggregate and sort TF-IDF for retrieved documents -> 1.5 hours (single thread/processor; mean of 30s per file for 195 files)
That last part I'm looking at in terms of parallelization and concurrency. I figure it's about time to go full mask-off and maximize the usage of the 64GB RAM as that is technically within consumer hardware limits (albeit on the higher end).
Alright, I'm calling it. TF-IDF is taking around 1 hour regardless of single processor/thread or multithreading (and multiprocessing is going to OOM even at small numbers). I'd say the only thing that could save this would be to do a total rewrite of the engine in rust. But seeing as I'm still very much learning rust and there is A LOT of code to go over a rewrite, that's just not in the cards for me at this time.
I'll be updating any notes to the project at this point as well as updating all metadata files in the huggingface repo. I'll also be sharing the respective links (repo included) in #1145719762438074368 .
Overall, it was a fun project that taught me a lot. It was really interesting to tackle search at scale with limited resources (still much more than what most people have). I think I'll probably be cannibalizing a lot of the scripts and lessons over for smaller RAG projects, as well as use this for a downstream "research" project (ie just have a lazy/general search engine that does keyword search for the top N articles from wikipedia around a certain topic, where retrieval time is not an important factor as the records are just being pulled as part of a reference DB for another project).
Hello buddy, hows it going?
Have you looked anything at vector optimization? If the compute limit is hard to pass by, why not focus on making the vectors smaller?
Its not required for a better mathematical vector computation, just a better way to re-organize ans structure the vectors in a smaller size
Is there some method where we can organize vectors in a certain clustering hierarchy,
The terms vector values all are calculated in terms of the centre term. And all centre terms form their own hierachy , based on relation with one other
This should reduce the vector size I am thinking?
That's very roughly how most approximate vector search algorithms work
but how do we make them dynamic per search is not covered by any algorithm
?
that is we keep changing the centre terms and values per search
I am not sure what you mean. You search based on a fixed vector and preprocess only the documents
The documents get put into an efficient index
The query doesn't, because it is already small
all the documents are converted to static vectors
everytime a search happens, how can the values be dynamically changed or structured in a way with some priority that is relevant to the query
i think this is what i was asking about
https://research.google/blog/announcing-scann-efficient-vector-similarity-search/
There are two pieces of information here, query and document embeddings. There is some function m(q, d) that measures similarity.
m doesn't change with q, d doesn't change with q, yet m(q, d) changes when q changes
You don't adjust gigabytes of data for each query, you structure your calculations such that you don't need to
the similarity function happens always on the same vector space
if the vector space was re-structured it could make the processing faster? some way of assigning priority instead of re-calculating the vectors
Yes, that's what approximate vector search does
is it similar to ScaNN?
It restructures / indexes the document vectors such that search is faster
Yes
The indexing they do doesn't depend on any one query, it depends only on the documents themselves, so you don't have to do it more than once
And while creating the vectors, is there any method they follow to condense the word vector value?
Like say for example i can initially group all synonyms as one same value
And then next step these condensed values are used for creating vector soace
The methods used do not look at individual words, but at embeddings of the entire sentence / paragraphs / chunks
oh yes
You can do it at word level, but it makes little sense
Most methods do exactly that
You build a bunch of nested buckets such that similar items are in the same bucket
How you do it is a different question
You can probably do agglomerative clustering as you are describing, but it is expensive
Since it requires finding 2 nearest neighbors repeatedly, which is n log n in any fixed dimension count but probably slower in higher dimensions
I might be off on the exact complexity
Just read the FAISS paper
They describe a lot of techniques they use
the embedding part - to make it also efficient, the chunking of sentences is done in some way that it can be summarized as a single sentence/word?
that is the chunks are again reduced
thanks
Ideally
In practice often not and that hurts the performance
but all these reductions are very expensive like you mentioned
got it, if thinking on paper so many levels of reducation could be done
though it is only one time compute expensive - we only need the space constructed once at start
Yes. And creating a good vector index is not always a matter of just splitting strings and computing embeddings
Though that is exactly how it is portrayed in most blogs lol
Vector search database company money at work
Indexing (as in optimizing vector search) is the least important part of most applications and will have the least impact on performance (assuming there is an LLM in the pipeline). Coming up with good vectors to begin with will have a much greater effect on quality.
By quality what do you mean exactly?
for a general topic knowledge base - how can we say that the vector s computed were high quality?
If it was a closed single domain, accuracy, similarity - many ways to say there is quality in the vectors
I am kind of mixing up two different ideas there. Performance and quality. Performance should be clear.
Quality - primarily you care about some sort of Recall@k or MAP@k or MRR@k metrics
Recall@k - is your recall of the correct information if you take top-k results
MRR@k is the mean reciprocal rank of the first correct answer (@k adds a restriction to top-k items)
MAP@k measures average precision within the set of top-k results and then computes its average across all queries
Personally I like MRR and Recall the best since usually you care more about finding the right information and not making sure that all top results are relevant
In other words, there is usually one correct document/chunk for a query, not multiple.
But obviously depends on the task
So to answer your question - vectors are good if within your domain you find the right data (Recall/MRR) and you saturate the top-k with correct data (MAP; applicable if there are many relevant chunks)
But you have many other quality metrics for the whole system. For example, are you chunks semantically coherent or are you pulling extraneous information / cutting things at inappropriate boundaries? Are you pulling too many irrelevant chunks? Are you pulling too many duplicate/almost duplicate chunks? Does the whole system answer questions correctly?
There are other metrics used for search quality as well btw, I just listed the top 3 that came to mind.
if i capture user-intent , then for a given query multiple answers may be correct ( that is semantically people were asking different things )
so are we going to quantify quality based on the highest matching user-intent?
I don't think intent matters when evaluating a query in isolation, but if you are aiming for user satisfaction, it does.
It's just much harder to capture good training data for this.
You have to ask people to query the system and then stop them and ask what they actually meant
maybe with an LLM you could do that semi-automatically
like have an LLM come up with plausible user intents and then ask the user "Do you want to: a) ... b) ..."
Why not using the Elastic or OpenSearch Docker images? You'd be spinning them as language-independent REST endpoints, which means that you can communicate with them with any tool (e.g. curl) or library (e.g. requests) supporting HTTP(S). You just need to install Docker (and assign >= 4GB of RAM to the images, depending on the number of vectors you want to store, and their dimensions).