Best Vector DB for Local Use | Learn AI Together | Page 3

meager siren Nov 14, 2024, 1:09 PM

#

For one, I have validated this thing on small data

inland lynx Nov 14, 2024, 1:09 PM

#

uh i meant at an OS level

meager siren Nov 14, 2024, 1:10 PM

#

second, the incrementation of the db size is scaling accordingly to the number of checkpoints

meager siren Nov 14, 2024, 1:10 PM

#

inland lynx uh i meant at an OS level

Why at the OS level? lance is written in rust and should not be unsafe at all.

inland lynx Nov 14, 2024, 1:10 PM

#

meager siren Why at the OS level? lance is written in rust and should not be unsafe at all.

they clearly mentioned they cant do writes safe beyond a limit

meager siren Nov 14, 2024, 1:12 PM

#

inland lynx they clearly mentioned they cant do writes safe beyond a limit

Yeah but my data is well beneath that limit

inland lynx Nov 14, 2024, 1:13 PM

#

hmm

meager siren Nov 14, 2024, 1:13 PM

#

2GB per write (table.add() call).

#

and about 3 TB for the whole table

inland lynx Nov 14, 2024, 1:13 PM

#

How many concurrent writes?

#

woah is that all parallel?

meager siren Nov 14, 2024, 1:14 PM

#

with my data, (assume 6 kb per entry), it would be 325.000 entries per table.add()

meager siren Nov 14, 2024, 1:14 PM

#

inland lynx woah is that all parallel?

Kinda

#

I just add up all the new batches into a big chunk and when I reach a safe point (100,000 for my program), I call table.add() to that data and reset the stuff after before repeating again.

inland lynx Nov 14, 2024, 1:16 PM

#

meager siren I just add up all the new batches into a big chunk and when I reach a safe point...

how yo know if any embedding is corrupted

meager siren Nov 14, 2024, 1:16 PM

#

No fancy parallelization with multiprocessing/multithreading but it's the best of what I can do

meager siren Nov 14, 2024, 1:18 PM

#

inland lynx how yo know if any embedding is corrupted

Two ways. 1) "bad" embeddings cause the program to crash if they try to get added to the table. So I check for them and skip them/prevent them from getting into the big batch chunk

#

other way is on the reading of the table which is more manual and inspecting the embedding returned

inland lynx Nov 14, 2024, 1:18 PM

#

oh ok so there is clearly a difference when you read a bad embedding

#

so basically you could do a check intermediatelly for bad embeddings and clean it up

meager siren Nov 14, 2024, 1:19 PM

#

No need to clean if it doesnt go into the db

inland lynx Nov 14, 2024, 1:19 PM

#

right

meager siren Nov 14, 2024, 1:20 PM

#

But in those cases i have found a bad embedding, the problem was in the string itself

inland lynx Nov 14, 2024, 1:20 PM

#

uh

meager siren Nov 14, 2024, 1:20 PM

#

ie it was an empty string or a string of white space

#

and the produced embedding would be NaN as a result

inland lynx Nov 14, 2024, 1:21 PM

#

meager siren ie it was an empty string or a string of white space

are you doing a check before creating embedding now

meager siren Nov 14, 2024, 1:22 PM

#

Im preprocessing/remove such strings from the categories list before we even search the vector db for existng entries.

inland lynx Nov 14, 2024, 1:22 PM

#

thats odd

meager siren Nov 14, 2024, 1:22 PM

#

and also, like I mentioned before, in the case that NaNs are detected, I just skip adding that batch to the big batch chunk

inland lynx Nov 14, 2024, 1:22 PM

#

how did it come in the first place

#

i havent seen any categories like that

meager siren Nov 14, 2024, 1:23 PM

#

inland lynx how did it come in the first place

stuff that wasn't preprocessed out of the text dataset. Didn't come from the wikipedia tree download

meager siren Nov 15, 2024, 4:52 PM

#

@inland lynx Should I tempt fate and try and get my embedding script to try and parallelize? The idea is that for the big embedding loop:

chunk categories into sizes of 100,000
pass those chunks to processors to get 100,000 embeddings (per processor)
Wait for processors to finish their chunks
upon aggregation (back on the main thread/processor), iterate over each processors 100,000 category, embedding pairs and write each one to table (one chunk at a time).
move to the next set of chunks and repeat starting from step 2.

#

Unfortunately, I cannot apply the same thing to running the global search to count which categories have already been embedding (so that is going to be VERY slow -> I was getting 30 minutes with a semi populated table).

#

Cuz at this point, I'm at around 48 hours into the run time and only 2M out of 4M categories have been embedded.

#

Back when I was running multiprocessing, I was done in under 24 hours.

inland lynx Nov 16, 2024, 6:21 AM

#

meager siren <@456226577798135808> Should I tempt fate and try and get my embedding script to...

maybe you could dial the notch down and reduce the batch size for parallel by a half for each processor

I mean what is the best load for your h/w? For all the processors to be 80% running at the same time. We can target max 80% load on each processor

meager siren Nov 16, 2024, 1:02 PM

#

inland lynx maybe you could dial the notch down and reduce the batch size for parallel by a ...

Best load is 256 to 386 texts at a time on the card (puts it at around 14GB VRAM)

#

So you either do 16 processors x 16 batch size or some variation that gets you 256

meager siren Nov 16, 2024, 2:47 PM

#

I got the 16 x 16 (processors x batch size) to run in around the 21 hours I had before.

#

So far I don’t think it’s a bad run

meager siren Nov 16, 2024, 4:56 PM

#

So another note to myself is that searches will default to a limit of 10 results. Gotta specify the limit to what you want/need. Found this out the hard way when running my search for existing keys/categories and I only kept getting back 10 results instead of 100,000 (based on my SQL query parameters).

#

Also, that check after the DB is supposedly full takes around 9 hours with no parallelization (multithreading/processing). Oof.

inland lynx Nov 17, 2024, 5:13 PM

#

meager siren So another note to myself is that searches will default to a limit of 10 results...

how much time is it going to take for the 100k result?

meager siren Nov 17, 2024, 6:31 PM

#

I’m off for a week to go home but iirc, it was about 2 min on my server CPU

meager siren Nov 18, 2024, 1:03 AM

#

inland lynx how much time is it going to take for the 100k result?

IDK. I'm worried that 100K will be too much. To be fair, the vector retrieval itself should be sparse for most other parts of the actual search inference.

inland lynx Nov 18, 2024, 5:34 AM

#

meager siren IDK. I'm worried that 100K will be too much. To be fair, the vector retrieval it...

right, we ll start in smaller batches and see...
we may need to optimize the way we are having the graph of categories tbh...its not optimal storage design

inland lynx Nov 18, 2024, 5:34 AM

#

meager siren I’m off for a week to go home but iirc, it was about 2 min on my server CPU

oh cool enjoy buddy

meager siren Nov 18, 2024, 1:53 PM

#

inland lynx right, we ll start in smaller batches and see... we may need to optimize the way...

I figure the search through the table would be more along the lines of pulling 10K instead of 100K

tacit mica Nov 18, 2024, 2:50 PM

#

why are there so many deleted messages here? it's hard to read 😭

meager siren Nov 18, 2024, 3:45 PM

#

I havent deleted much. But I figure this thread is still good to carry the convo around this work so I keep contributing to it.

meager siren Nov 30, 2024, 4:06 PM

#

I'm back from all the school and vacations to work on this project. I have added parameters to scale the category processing and embedding as necessary but I still need to append the "missed" categories to the downloaded category tree.

Fine. I've got the code for that but now I'm running into the issue from my vector DB such that my search is returning nothing. Specifically, the query to the vector DB for the nearest embedding given the query embedding and the restriction on categories.

Code for this section:

    for node in tqdm(missing_categories):
        # Get current category node embedding.
        _, embedding = table.search()\
            .where(f"category = '{node}'")\
            .limit(1)\
            .to_list()

        # Search for "closest" node embedding from the graph.
        best_category, _ = table.search(embedding)\
            .where(f"category IN ({nodes_for_query})")\
            .limit(1)\
            .to_list()

        # Create an edge in the downloaded graph connecting the "best
        # matching" category (from the original downloaded graph) to
        # the current category node from the "missing" categories list.
        downloaded_graph.add_edge(best_category, node)

#

I'm away from my server for a bit today but I'm thinking that, once I have the vector DB initialize and populated, I train/create the ann_index with the table.create_index() function and see if I can query the vector DB appropriately and get back some result.

meager siren Dec 1, 2024, 4:21 PM

#

Ok, so querying after initializing the index helped but the .where(SQL) line is still over filtering and returning nothing.

Current plan is to build a second table based on only items from the downloaded graph (query from main table, no recomputing embeddings) and have that be the “filtered” table. I then use the filtered table to help build similarity between missing categories and those in the downloaded graph.

#

Why do this? Well, because I can’t find a way to properly filter the table in a few commands so this will have to suffice for now (at the cost of doing the same steps as part of the initial table read).

meager siren Dec 4, 2024, 4:31 AM

#

I'm starting to wonder if the 10 deep category tree downloaded was the right idea. It is taking an immense amount of processing/compute time to do the following:

embed all categories (local and downloaded).
verify which and all categories are embedded.
create/isolate a filtered table of embeddings from the downloaded category tree.
perform ANN between local categories and downloaded ones.

To my understanding, there are around 4.6M total categories under consideration with around 3M being from the local dump and the rest from the 10 deep category tree.

I have category trees for depths of 2, 3, and 5 as well as depth 10. Those trees will have considerably less nodes. The cost of this is a significantly flatter tree with more documents being returned (not as fine tooth comb as compared to the depth 10 tree). Maybe it's worth considering over the depth 10 model.

meager siren Dec 4, 2024, 12:45 PM

#

Running on depth 5, there seems to be 3.8M total categories vs depth 10 4.6M.

#

Maybe it would be a good idea to revisit the problem of identifying the minimum number of categories needed for full document coverage.

meager siren Dec 4, 2024, 6:17 PM

#

@inland lynx can I get your thoughts on this?

Building the category tree by connecting the downloaded tree to the categories from the dumped files is taking a long time. It takes a day just to embed all 4.6M categories (depth 10 tree + dump). From there, I have to map around 3M "dump" categories to the downloaded tree categories. Even with parallelization and caching, this is taking a really long time for each time I run the script and get farther into the program.

So I was thinking of some alternatives.

Cluster the dump categories to reduce the number way down from 3M
Pros:

Reduces number of clusters to map from the downloaded category tree
Cons:
Requires there be another map for groups to category names

Build my own category tree with just the dump data
Pros:

Allows me to disregard the categories from the downloaded tree (benefits scale with that tree's size)
Cons:
Still very compute expensive to deal with 3M categories

For the second option, I was considering clustering the categories based on their computed semantic vectors with k means clustering (maybe implement the algo myself). I also considered implementing HNSW too which would be a "fast" abstraction of the tree. While researching what I'd need for each algo, I just realized something: Doesn't LanceDB already do a lot of this?

And that brought my to option 3:
3) Store all dump categories + embeddings to lancedb. Query with a high K limit (ie k = 100 or 1,000)
Pros:

Drastically simplifies things. No need to look at downloaded category tree or even category trees in general
LanceDB already does the work.
Disk based thanks to Lancedb
Cons:
Not as "audit-able" as the tree structure (but still does the job)
Don't have any metrics on query time.

meager siren Dec 5, 2024, 8:58 PM

#

Looks like this is pretty good with existing categories (no brainer) BUT using raw text does not translate well in this system. Going to have to see if keyword extraction can help with this.

#

Also, storage is only costing 13GB so that's nice

meager siren Dec 7, 2024, 4:40 AM

#

The direct retrieval of the categories via the vectors is pretty fast, even for high number for top_n (ie 1,000). Downside is that I'm unable to translate this success to raw queries (ie "what does the fox say?" will not give good categories around foxes nor speaking).

#

To try and fix this, I'm thinking of doing some form of key word extraction from the query and try to run the vector search that way. Maybe I'll get something better.

meager siren Dec 8, 2024, 4:52 PM

#

@timid sphinx I think I found one of the issues. I have an inverted index formatted as a prefix tree that is initialized as a class (tree class and node class) with the load/save being recursive (stored in msgpack files).

#

Profiler is giving me around 5+ minutse for load trie + retrieve documents tied to word.

#

That's one part (even with multihreading) is taking too much time

timid sphinx Dec 8, 2024, 5:41 PM

#

Hehe I am just glad I guessed roughly right

meager siren Dec 8, 2024, 5:42 PM

#

Welp, now I'm on to figure out a new inverted index structure/storage solution

meager siren Dec 11, 2024, 5:16 PM

#

Was pleasantly surprised to see that parquet files saved 3GB of storage (total) comapred to msgpack (20GB vs 23GB). That's nice

meager siren Dec 12, 2024, 4:37 AM

#

timid sphinx Hehe I am just glad I guessed roughly right

I ran the profiler again using pandas. Good news is that the whole process of

Load parquet data
filter "redirect" documents
compute TF-IDF
compute cosine similarity
update top-n heap

is coming in really fast at a low-low time of 165s (that's 1/2 the time it took just to initialize the prefix trees in the inverted index). Downside is that it's still way too slow for that section and that is per file (with ~200 in the database I have).

timid sphinx Dec 12, 2024, 4:39 AM

#

Well parquet is not exactly lightweight. And you are still reading from disk in a loop?

meager siren Dec 12, 2024, 4:39 AM

#

Unfortunately

timid sphinx Dec 12, 2024, 4:39 AM

#

How many times do you read files each loop?

#

It's not 10M, is it?

meager siren Dec 12, 2024, 4:39 AM

#

I'm still shopping for a solution like lancedb that has low fileIO cost

meager siren Dec 12, 2024, 4:39 AM

#

timid sphinx How many times do you read files each loop?

1 file is read once. 1 file contains around 200K articles iirc

timid sphinx Dec 12, 2024, 4:40 AM

#

Hmm so what is the hottest path?

#

In profiler results

#

Do you have a flame graph?

meager siren Dec 12, 2024, 4:41 AM

#

#

Nope

#

I'm not surprised that most of these ops are around the pandas dataframe operations

#

📎 message.txt

#

Current rate puts things at around 10 minutes per query just for TF-IDF

#

Substantially better compared to the 2+ hours I had before but still not great

#

How much time do you think I can cut down if I precompute TF-IDF for each doc, word pair?

#

Effectively making this a reading task?

timid sphinx Dec 12, 2024, 6:33 AM

#

I presume that most of pandas time?

#

Also why are you indexing frames

meager siren Dec 12, 2024, 1:51 PM

#

timid sphinx Also why are you indexing frames

I think it's these lines here:

            doc_vectors = df_doc2words.groupby("doc")\
                .apply(lambda group: create_aligned_tfidf_vector(group, words))
            
            # Compute the document cosine similarity.
            # doc_vectors["cosine similarity"] = doc_vectors.apply(
            #     lambda vec: cosine_similarity(vec, query_tfidf_vector)
            # )
            results = doc_vectors.reset_index(name="tfidf_vector")
            results["cosine_similarity"] = results["tfidf_vector"].apply(
                lambda vec: cosine_similarity(vec, query_tfidf_vector)
            )

#

where doc_vectors is getting the full TF-IDF vector for a document

#

and doesn't return a full dataframe. <- this is where I think I get the most trouble,

meager siren Dec 12, 2024, 4:34 PM

#

I'm looking at my code and I'm not sure why I didn't just precompute the word level TF-IDF/BM25 values and store that in parquet files. It would probably be 2x as fast right now to just "read" and aggregate the target values for each method instead of compute it dynamically. I think the only reason I wanted to have that compute on inference time was in case values had change or something.

timid sphinx Dec 12, 2024, 6:11 PM

#

Yeah, you should precompute them and put them directly into the index

#

Or separately I guess. Just don't use pandas so much. Convert them to numpy vectors at least.

meager siren Dec 13, 2024, 3:36 PM

#

timid sphinx Yeah, you should precompute them and put them directly into the index

I can do TF-IDF but now I am realizing why I had BM25 computed at runtime instead of precomputed. BM25 has two hyperparameters k1 and b that have an affect in the formula.

#

Should I just put those hparams as part of the config.json file I use and let it run from there?

#

timid sphinx Dec 14, 2024, 4:04 AM

#

You just need to get rid of any non-O(1) or just slow parts of this

#

E.g. indexing into pandas is slow - use numpy array

#

You have a hot path - the part that takes too long to run (preferably also the part that runs in a loop). On that path you need to eliminate all excessive complexity. That could be pandas (and polars), any complex computations - precompute as much as feasible, any file access or complex file parsing. Anything from the top of the profiler results has to go. But get a flame graph visualizer

meager siren Dec 14, 2024, 5:20 AM

#

Thinking aloud, even with optimizations of numpy array, the sorting is going to take n log n runtime because you have to sort the vector values. At a scale of 40M values to compare....

#

This reminds me again why I was looking at different inverted index methods to try and filter that down.

meager siren Dec 14, 2024, 6:16 AM

#

Adding the extra steps means more storage and compute... Ugh, I'm not sure how to sort this out.

timid sphinx Dec 14, 2024, 6:31 AM

#

What are you sorting?

meager siren Dec 14, 2024, 6:35 AM

#

Sorting the cosine similarity/aggr BM25 value

meager siren Dec 14, 2024, 6:54 AM

#

Current performance has me considering expanding the scope of Minimum hardware specs for the project. 4 threads/processors should yield final results in around 2.5 hours, 8 thread/processors in around 1.25 hours, so on with 16 threads/processors and so forth (2.5M articles per worker at that point).

timid sphinx Dec 14, 2024, 8:17 AM

#

meager siren Sorting the cosine similarity/aggr BM25 value

Why are you sorting and not selecting top N?

inland lynx Dec 14, 2024, 11:40 AM

#

what happened so far? so many messages

inland lynx Dec 14, 2024, 11:41 AM

#

timid sphinx Why are you sorting and not selecting top N?

wdym? need to return the most relevant documents for the query

inland lynx Dec 14, 2024, 11:42 AM

#

timid sphinx E.g. indexing into pandas is slow - use numpy array

yeah this. why is there no numpy arr

#

looks like not a h/w issue just the code

meager siren Dec 14, 2024, 1:05 PM

#

timid sphinx Why are you sorting and not selecting top N?

It’s n log n regardless of sorting or utilizing a heap with a capped length

timid sphinx Dec 14, 2024, 4:40 PM

#

It's n log k for a k sized heap

#

The constant can be smaller as well

meager siren Dec 14, 2024, 5:29 PM

#

I think my problem is more of the n aspect

#

40M articles to scan through is A LOT in general.

#

Even if the insert/sort is log k for the heap

timid sphinx Dec 15, 2024, 8:50 AM

#

Well what do you do in that loop?

#

I am still confused how it can be so slow. Plus if you have an inverted index, the number should drop drastically based on the search terms. Cosine similarity is trickier.

#

But then again, it's python. Even just porting to Java I used to be able to get 50x over python

meager siren Dec 16, 2024, 2:29 PM

#

timid sphinx Well what do you do in that loop?

The for loop is just a blind iteration through all files (200 files x 200K docs per file) + the inference compute of TF-IDF/BM25 given TF(doc, word) and IDF(word)

#

I had the inverted index initially connected BUT I took it out because it was in that OOP trie structure that was taking ages to deserialize back to its full form.

#

I'm looking at my options for what I can do with just a simple {term, [doc, doc, doc]} structure without using a compression technique like tries

#

Best I can come up with is another parquet table with columns term, file, docs (SHA1 hashes)

#

I was pondering additional compression by taking the doc or file strings and converting them to ints and keeping a decoding map for each one. It would keep memory down compared to strings but takes additional costs to lookup in the the map (even if it's O(1) lookup, for M (M < N) docs you'd still get an O(M) overall runtime).

timid sphinx Dec 16, 2024, 2:51 PM

#

Omg

#

What's the format of the index files?

meager siren Dec 16, 2024, 2:58 PM

#

The original format was a sharded trie (1 trie per starting character). All tries were nested dict structures

timid sphinx Dec 16, 2024, 2:58 PM

#

Omg

meager siren Dec 16, 2024, 2:59 PM

#

That 5+ minute runtime for tries was with the OOP deserialization on a heavily sharded set of tries (I had done additional sharding to keep file size down so that the deserialization could go faster)

timid sphinx Dec 16, 2024, 2:59 PM

#

It's so overengineered

#

You should have a vocabulary of words, a single trie. Loaded once.

#

Everything else should be plain arrays

#

Documents should be processed once and converted into a weighted list of word ids

meager siren Dec 16, 2024, 3:02 PM

#

timid sphinx You should have a vocabulary of words, a single trie. Loaded once.

The vocab is too large

timid sphinx Dec 16, 2024, 3:02 PM

#

Before any queries ever happen

#

How large?

meager siren Dec 16, 2024, 3:02 PM

#

Well over 2M last I checked

#

gimme a sec to confirm

timid sphinx Dec 16, 2024, 3:03 PM

#

That's at most 30 MB stored, no? Maybe 100 MB

#

Peanuts

meager siren Dec 16, 2024, 3:07 PM

#

I was wrong. It's 42M unique words captured

#

(with stemming & lemmatizing)

timid sphinx Dec 16, 2024, 3:08 PM

#

That's a bit worse

#

Do you want it to run on commodity hardware?

meager siren Dec 16, 2024, 3:08 PM

#

I kinda did

#

But after working on this for over 6 months, I'm not sure.

#

My server has done a lot of the compute work because it'll give results faster

timid sphinx Dec 16, 2024, 3:09 PM

#

I am sure it is possible

meager siren Dec 16, 2024, 3:09 PM

#

especially preprocessing

timid sphinx Dec 16, 2024, 3:11 PM

#

I bet it can run on commodity hardware but the way you are doing it in python...

meager siren Dec 16, 2024, 3:11 PM

#

This is sort of where I was hoping something like category grouping could help streamline search results down better than the plain inverted index.

meager siren Dec 16, 2024, 3:11 PM

#

timid sphinx I bet it can run on commodity hardware but the way you are doing it in python...

The python stuff is also why I'm learning rust.

timid sphinx Dec 16, 2024, 3:12 PM

#

If you want things fast, you can't read from disk so much, you can't use so many tries (bad memory access patterns), especially made out of python dicts, you shouldn't use heavyweight serialization libraries and heavyweight data structures (e.g. dataframes)

#

Disk access is extremely slow compared to memory

#

Especially with hdds

meager siren Dec 16, 2024, 3:14 PM

#

I'm still following except for the "read from disk" aspect. Commodity hardware these days is still floating at around 8 - 16 GB RAM (even though it's not expensive to upgrade unless you have a laptop)

timid sphinx Dec 16, 2024, 3:15 PM

#

Mmmm well at least don't do heavy deserialization. Difficult in python I guess

#

Idk how realistic it is to have a mmaped data structure in python

meager siren Dec 16, 2024, 3:18 PM

#

I wonder how software like elasticsearch does it

#

(first step probably is 1: use C or rust)

#

Oh damn, their repo is 99% Java

meager siren Dec 16, 2024, 3:26 PM

#

timid sphinx Mmmm well at least don't do heavy deserialization. Difficult in python I guess

So effectively, here's what I'm thinking from the feedback.

Precompute EVERYTHING you can. Because all you want to do is just read those values (because diskIO is going to be your first and biggest bottleneck when working around commodity hardware's memory restraints).
Load only what you need and do it to the base data structures of the language.
Swap for rust (or other faster languages) where you can

#

I'm still unsure what to do for the inverted index as that's still tricky given its scale (size of vocab and I guess the documents)

timid sphinx Dec 16, 2024, 3:28 PM

#

You are using some insane data structures for it

#

Serialized as an array it should be much smaller

#

Not python array btw

#

I am pretty sure it is an array of pointers underneath the hood

meager siren Dec 16, 2024, 3:31 PM

#

Ideas on an effective way to store this to disk?

#

42M words, 40M docs... Keep it all as strings and the mappings as pointers...

#

That would save on in-memory. But I'm just failing to grasp how to serialize this to disk

#

with the intent of loading the data from disk once and leaving it in memory

timid sphinx Dec 16, 2024, 3:46 PM

#

You can split a trie into multiple tries actually. One for most common terms

#

See if there is a mmap abstraction that allows reading various sized types at different locations

#

Encode the whole datastructure as an array of bytes

meager siren Dec 16, 2024, 4:27 PM

#

I have the precompute script running right now and I'm amazed at how long it's expected to handle computing the TF-IDF and BM25 for just the one file (out of 200)

#

#

This was probably another reason why I optied for runtime computation of the values over precomputing....

#

I think I need to just jump head first into the rust implementation right now.

meager siren Dec 16, 2024, 6:20 PM

#

I also just found out that polars does not play well with multithreading from the main python branch (probably because it already is multithreaded underneath by rust)

meager siren Dec 17, 2024, 5:37 AM

#

I found out my implementation for generating the precomputed TF-IDF and BM25 was really stupid, hence the super slow ETAs. Fixed everything by leveraging pandas dataframes to pick up the slack and things are going MUCH faster. Going to bed soon but the work should be done long before I get up tomorrow. Data files are currently at around 43 GB but I suspect the dataframe is containing everything at the moment (file, document, word, term frequency, inverse document frequency, TF-IDF, BM25), hence the bloat (the raw document to word frequency mappings were at around 23 GB in msgpack).

#

Still fumbling around on the inverted index. Experimenting with ways to not use OOP in python and/or leverage rust for faster reads.

meager siren Dec 17, 2024, 1:33 PM

#

54 GB for the who thing of precomputed sparse vectors. But it also includes the TF, IDF, document length, etc on top of the final calculations.

#

Plus 800MB of additional misc data

meager siren Dec 18, 2024, 3:34 PM

#

I asked GPT for some ideas on how to implement or go about abstracting optimizations for the inverted index. Got a lot of answers that shouldn't surprise me.

📎 message.txt

#

Will probably go back to looking at an over-engineered sharded Trie for storing the inverted index.

#

Only this time, no OOP abstraction for it.

meager siren Dec 19, 2024, 7:25 PM

#

Posting my own message link here because BERT got an upgrade: #📚・articles-and-other-ai-resources message

meager siren Dec 19, 2024, 8:10 PM

#

@timid sphinx Thoughts in on this design with the inverted index?:

Use prefix trees (trie)
1 trie per starting character (26 alphabetical + 1 for "other")
multiply across all files 200 files

Reasons for doing this:

keeps trie files smaller (should mean smaller memory footprint)
currently trying to compute tries is taking way too long (1 trie per starting char but sharded such that 1 trie has max vocab of 100K words).

Cons:

that's A LOT of files to keep track of.
would mean each trie for each file (200 files) would have to be queried, so a scaled up time at inference?

timid sphinx Dec 19, 2024, 8:17 PM

#

Can you do a napkin calculation for sharded trie for vocabulary->word id mapping plus inverted index based on just word ids?

#

Split vocabulary based on frequency of occurrence

#

So the top 100k words go into the first trie

#

Then the next 100k words by frequency etc

#

If you split inverted index, split it across documents

#

The inverted index split is just to have easy multithreading

#

Actually... maybe split inverted indices by word frequency as well?

#

I don't understand why you insist on having a trie for inverted index rather than splitting them like above

#

I can show what I mean in C

meager siren Dec 19, 2024, 8:27 PM

#

Trie offers compression for storage

#

That’s the main benefit

#

I also may have to adjust that number of 100K to 500K because the vocab is 42M large

timid sphinx Dec 19, 2024, 8:44 PM

#

can you give me some more numbers? |V| = 42M, |D|=?, mean(#words_per_document)=?, mean(#documents_per_word)=?

#

also that's an insane vocabulary size. does it include numbers or something?

#

I guess it's multilingual?

#

|D|=40M

#

what's mean(#documents_per_word)=?

#

actually, I can assume on the order of magnitude of 1 document per word

meager siren Dec 19, 2024, 9:03 PM

#

It's partially multi lingual. There are some vocab or symbols that were not removed during processing. Also did my best to remove any URLs from there as well.

#

Give me a sec to get you mean # doc per word

timid sphinx Dec 19, 2024, 9:18 PM

#

and can you also calculate mean chars per word in the vocabulary?

meager siren Dec 19, 2024, 9:22 PM

#

mean number of docs per word is 64

timid sphinx Dec 19, 2024, 9:22 PM

#

some pretty crazy outliers probably like is or are

meager siren Dec 19, 2024, 9:23 PM

#

those should have been picked up by nltk stopwords and filtered

#

but maybe www or other small nouns like cat or dog are throwing things off

#

mean word character length was 10.3

meager siren Dec 20, 2024, 2:43 AM

#

I tried doing your way of having 1 simple inverted index as a basic map (had an ETA of maybe 6 hours). OOMed at around 50/200 files. Looks like sharding is needed.

#

This was also after I converted all document IDs from full path (str) to int.

#

Maybe I'll do 1 inverted index per file (200 files) and see how that goes. Fewer files and all that. Plus, I can still take advantage of mmap on loading.

timid sphinx Dec 20, 2024, 3:42 AM

#

https://gist.github.com/sir-wabbit/d992fc1e3ecc9df29e114df1be797913

Gist

search.py

GitHub Gist: instantly share code, notes, and snippets.

#

here is what I would use for the inverted index

#

I obviously haven't quite finished it (no tf-idf compute etc) and the search is a bit suboptimal

#

you need 4 + (|V| + 1) * 4 + A * |V| * 8 bytes for the whole index, where A ~= 64 (the mean docs per word)

#

20.1GB index

#

(4 + (42000000 + 1) * 4 + 64 * 42000000 * 8) / 1024 / 1024 / 1024 = 20.1

#

the largest adjustable factor is 64 * 42000000 * 8

#

so we need to pack (doc_id, freq) pair

#

frequency is a small number

#

we can probably pack things into ~5 bytes, maybe 6

#

which would reduce the size down to 15 GB

#

this should be exceptionally fast

#

like I would expect milliseconds per query

meager siren Dec 20, 2024, 4:09 AM

#

This will need some adjustment. For instance you have the word2int mapping and vice versa but doc2int (and vice versa) is not tracked

timid sphinx Dec 20, 2024, 4:09 AM

#

Yeah for sure

meager siren Dec 20, 2024, 4:11 AM

#

And just a quick point of clarification, by document, you're referring to the individual articles right? not the physical files?

timid sphinx Dec 20, 2024, 4:13 AM

#

yes

#

or whatever is the unit containing words for the purposes of searching

meager siren Dec 20, 2024, 4:20 AM

#

Not sure why token_freq needed to be a thing when you can use a set() to create the list of unique words for word2id.

timid sphinx Dec 20, 2024, 10:49 AM

#

It needs to be

#

Because you want to sort them by frequency

timid sphinx Dec 20, 2024, 11:18 AM

#

Because if you only touch files with more frequent terms, you don't need to load as much data into memory. Plus it might help with cache locality, since you will be mostly accessing the same memory regions.

#

It's not an obvious improvement, especially considering query distribution of word frequencies usually != document distribution of word frequencies.

#

But I doubt it hurts anything

#

If you use mmap, the OS will likely cache with some sort of LRU policy, which means that most of the beginning of the file will be in memory all the time. You can think of mmap as OS-controlled caching mechanism.

meager siren Dec 20, 2024, 8:45 PM

#

Managed to get 22GB with file- sorted words with msgpack. Working on implementing your term frequency sorted example.

#

Still have to chunk to multiple files because of OOM but I think I'll start with 1M terms per file.

timid sphinx Dec 20, 2024, 9:52 PM

#

Why do you get OOM again?

meager siren Dec 20, 2024, 9:52 PM

#

I'm not sure

#

I just recall when I was trying to put all in one file, I got OOM after 1 hour

#

Probably a bunch of crap also loaded in memory (this is all a part of the same larger preprocessing script)

#

Shit, I was playing with multiprocessing vs multithreading to speed my read up but I just hit OOM

meager siren Dec 21, 2024, 5:41 AM

#

timid sphinx Why do you get OOM again?

Running the processing to align with your design. ETA is about 82 hours (so maybe 4 days?). RAM usage holding steady at around 35GB with a peak of around 42 GB at the moment. I chunked the vocab to a size of 1M per file and have it sorted by document frequency.

#

Hope it doesn't OOM after an hour or anything like that.

meager siren Dec 21, 2024, 5:41 PM

#

Had to restart because of a bug in my code (I wasnt returning the sorted vocab list properly)

meager siren Dec 21, 2024, 6:08 PM

#

I'm also noticing the first file is using A LOT of RAM now that it's correct (hitting 44GB and holding but I'm only on file 18/195)

#

I may have to drop my chunk size from 1M to 500K and see how that goes if this thing does OOM.

meager siren Dec 21, 2024, 7:14 PM

#

ETA is looking to be a lo longer than 4 days

meager siren Dec 22, 2024, 5:53 AM

#

Ok, I tried 500K chunk size. Still got OOM. Pushing it down to 300K

meager siren Dec 22, 2024, 7:37 AM

#

300K still got OOM. Now down to 100K. If I dont see any improvement, IDK what to do. at 100K words per file, that puts the number of files generated at around 430 (for a 43M word vocab).

#

Maybe have an uneven chunking? Smaller chunks for more frequent words and larger chunks otherwise?

west pagoda Dec 22, 2024, 9:46 AM

#

In python any multithreading is gonna get OOM

#

All data in threads is a copy not a refrence or pass... So yeah 100k multiplied by No of threads is possibly OOM level

meager siren Dec 22, 2024, 2:35 PM

#

west pagoda All data in threads is a copy not a refrence or pass... So yeah 100k multiplied ...

It's a single threaded/process operation

#

and I'm waking up to find that 100K is causing OOM.

west pagoda Dec 22, 2024, 2:39 PM

#

100k of string data

#

Magbe do some code memory profiler if there is one

meager siren Dec 22, 2024, 3:04 PM

#

west pagoda 100k of string data

It's 100K of words (mean length was something like 10 characters) but the kicker is that it's mapping to document IDs (int).

#

And it's sorted most common (number of documents the word appears in) to least

#

@west pagoda To give some context, this is the top 20 vocab terms (yes, they've lemmatized/stemmed) and how many documents they appear in

Screenshot_2024-12-22_at_10.08.47_AM.png

west pagoda Dec 22, 2024, 3:11 PM

#

Thats a lot of documents so are these categories going to have duplicates as in one and two are two categories but would a document with one also have two and two have one

meager siren Dec 22, 2024, 3:12 PM

#

Assuming 32bit int (Python doesn't do that but let's just assume for simplicity), at 81K docs per the top 100K words, that's 32,400,000,000 bytes

#

or 32 GB

#

Just for the data in the map, not including other data that's loaded to help compute it.

#

@timid sphinx Thoughts on this?

Given the rough napkin math to assume the amount of space it takes just for the document IDs to be stored, would it make sense to chunk in a non-uniform way? ie chunk 1 is 10K, chunk 2 is 50K, chunk 3 is 100K, chunk 4 is 500K, chunk 5 is 1M, and then we keep 1M for the rest?

#

This does hinder the original idea because it requires more file reads at runtime BUT I'm hitting that OOM on just the top 100K words is concerning

meager siren Dec 22, 2024, 4:29 PM

#

Ok, I refuse to believe this thing is OOMing on 50K most use words.

#

Each chunk size is OOM'ing on the same file too (49/195). I can confirm the variable I am using for the chunk size is changing....

timid sphinx Dec 22, 2024, 4:49 PM

#

Lol

timid sphinx Dec 22, 2024, 4:50 PM

#

meager siren <@179561959933476864> Thoughts on this? Given the rough napkin math to assume t...

What's the size of the raw data?

meager siren Dec 22, 2024, 4:55 PM

#

I'm not sure. I made the assumptions on my napkin math because Python handles data like ints differently and even the sys.sizeof() isn't recursive to the whole object you measure with it.

timid sphinx Dec 22, 2024, 4:55 PM

#

I mean the Wikipedia files

meager siren Dec 22, 2024, 4:55 PM

#

Oh

#

Raw data in its xml format is like 95GB

timid sphinx Dec 22, 2024, 4:56 PM

#

Where do I get them and can I try running your scripts?

meager siren Dec 22, 2024, 4:56 PM

#

One sec

timid sphinx Dec 22, 2024, 4:56 PM

#

I have a machine with 128gb of memory

meager siren Dec 22, 2024, 4:57 PM

#

https://huggingface.co/datasets/dmmagdal/enwiki-2024-04-20-xml/tree/main

dmmagdal/enwiki-2024-04-20-xml at main

timid sphinx Dec 22, 2024, 4:57 PM

#

It is possible to do all of this using disk but the algorithms are annoying to implement.

#

E.g. you can do merge sort recursively to sort

meager siren Dec 22, 2024, 4:57 PM

#

https://huggingface.co/datasets/dmmagdal/enwiki-2024-04-20-metadata/tree/main/bag_of_words
You should only need the word_to_docs and doc_to_words folders in this one to run this precompute_sparse_vectors.py script

dmmagdal/enwiki-2024-04-20-metadata at main

meager siren Dec 22, 2024, 4:58 PM

#

timid sphinx It is possible to do all of this using disk but the algorithms are annoying to i...

I do have inverted index computed file-wise. Maybe worth a shot to merge data so that it's organized word-wise

#

I am not sure of the runtime that would take though

timid sphinx Dec 22, 2024, 6:15 PM

#

I'll try running it later today once I am at my pc.

#

The script is in the repo as well?

meager siren Dec 22, 2024, 8:11 PM

#

Yup

meager siren Dec 22, 2024, 10:31 PM

#

script is precompute_sparse_vectors.py

#

Program OOM'ed on chunk size of 10,000 but it did make it to file 59/195 before hitting that limit

meager siren Dec 22, 2024, 11:39 PM

#

OOM on 5K vocab chunk size. Same limit as 10K...

meager siren Dec 23, 2024, 3:27 AM

#

Bunkered it down to 1K chunk size. 1.33 hours in on that first chunk (the biggest) and it's still going (89/195 files) with 56/64 GB used.

#

Really hopeful that it makes it all the way through without OOM.

meager siren Dec 23, 2024, 4:57 AM

#

It OOM'ed on 1,000 words

#

stopped at 61% of files (119/195)

#

that's the farthest it's ever gone but still very concerning.

#

I think I'm going to get up tomorrow and try and chart the word frequencies in matplotlib and see how that looks.

meager siren Dec 23, 2024, 2:59 PM

#

I got some interesting results from graphing out the top 500, 1000, and 5000 most frequent terms in the vocab.

@timid sphinx I think the appropriate way to go is to chunk the vocab based on aggregate number of documents (ie max_num_aggr_docs = 50_000. This should help avoid the OOM issue I think (again, at the cost of possibly more files "up front" with the more frequent vocab terms).

#

Also, I had some dataframe variable up in the program before this and deleting it + calling garbage collection helped drop the memory down from 22 GB to 10 GB before it even got to calling the inverted index code. Or not...

timid sphinx Dec 23, 2024, 3:22 PM

#

meager siren I got some interesting results from graphing out the top 500, 1000, and 5000 mos...

zipf law empirically obtained :P

meager siren Dec 23, 2024, 3:23 PM

#

XD

meager siren Dec 23, 2024, 10:46 PM

#

Running with max aggregate number of documents at 500K. So far so good. Unfortunately this does result in the generation of 123 files and the first file generated was hovering at around 350MB in size.

I wonder if I should adjust that threshold from 500K to 1M documents.

meager siren Dec 24, 2024, 3:54 AM

#

123 files x 2 hours per file puts this at around 10 days... Not looking forward to that. Hope this is everything I need.

#

Especially considering that my break is about just as many days.

meager siren Dec 29, 2024, 1:57 AM

#

Came back from my gf parents' house. Happy to see my inverted index is at file 75/123 (around 60% completion)

meager siren Jan 1, 2025, 3:02 AM

#

Inverted index finished just before midnight for new years. 123 msgpack files at 13 GB (each file averaged at 400 MB in the beginning).

#

All these metadata files adding to more than the storage that is being used for the original data in xml form (125GB vs 95 GB).

Screenshot_2024-12-31_at_10.03.54_PM.png

#

8 ish days aint so bad.

meager siren Jan 1, 2025, 4:18 AM

#

I'm rerunning the script again, this time with a aggregate chunking of 1M documents instead of 500K. Memory overhead looks pretty much the same so far. Number of expected files is down from 123 to 62. May explore more sizes (1.5M, 2M, 2.5M) depending on the resulting maximum file sizes. Would be ideal to keep files under 1.5 or 2 GB due to latency from file IO.

#

ETA per chunk is still around 1.5 to 2 hours to cover all 200 (parquet) files.

meager siren Jan 5, 2025, 6:11 PM

#

@timid sphinx So the sorted inverted index worked much better than the file level inverted indices. I aggregated words such that the max total number of documents contained within a file was 1M (ie sum of all word to docs was 1M docs total).

This gave me 62 files and ran in around 4 days (compared to the above aggregation limit of 500K docs which took 8 days and gave 123 files) and it still takes up the same amount of space.

Comparison of the file-level inverted index vs sorted level inverted index:

Total storage usage
File-level: 15 GB, 195 files, max file size 130MB
Sorted: 13 GB, 62 files, max file size ~800MB

Runtimes:
File-level (used parallelization/concurrency):
# 4 processors
# - Small sample: 4.5 minute mean time (360s or ~6 minutes mean total)
# - Medium sample: 7.5 minute mean time (550s or ~11 minutes mean total)
# - Large sample: 15 minute mean time (1000s or ~20 minutes mean total)
Sorted (serial only):
# - Small sample: 4 minute mean time (240s or ~4 minutes mean total)
# - Medium sample: 4.75 minute mean time (280s or ~4.5 minutes mean total)
# - Large sample: 6.25 minute mean time (380s or ~6 minutes mean total)

#

My conclusion is that the reduced file IO and sorting definitely helped bring down those runtimes. Also, the sample sizes were a random sampling of 10, 50, and 150 words from the vocab (unweighted/even weight).

meager siren Jan 5, 2025, 6:51 PM

#

That said, I'm thinking of winding down/shuttering the project. I still don't feel anywhere closer to the level of speed I feel this needs to make it even semi viable (ie answer full query in under 5 - 10 minutes). I also feel like I've maximized all avenues of algorithms and methods that combine speed and compression to allows this project to even exist on a consumer level PC. And last but not least, it's coming up to about a year since I started the project. I kinda wanted to cut down the time I spent on projects from years (literally) to under 1 year, starting with this one. Sometimes a hard part is knowing when to call it quits.

I'm going time 1 full run with this system up to the Vector search stage (hell, I may even do the vector search stage if I feel good enough). That will be Inverted Index + TF-IDF/BM25.

meager siren Jan 6, 2025, 4:01 AM

#

Still working on getting that single run going. Inverted index hovering at around 4.5 to 5 minutes to run. Returns 12M documents for one of my queries. Takes about 1.5 to 2 minutes to organize which parquet files and article hashes to target (rather than reading all 195 parquet files) which could definitely use some acceleration with rust but I don't want to think about that now. Still waiting on how long it takes to run through all the targeted parquet files (using multithreading with 8 thread workers).

#

It is interesting to see that it's taking a lot of time and I'm not seeing any indicators showing once moving onto the run through the parquet files. No progress bar or cprofiler details after what is definitely been more than five minutes. Wonder if it's something about the multithreading that is affecting the reporting.

meager siren Jan 6, 2025, 7:34 PM

#

With what I have now, I'm somewhat optimistic about the condition of the sparse vector search.
Inverted Index search -> 4 to 5 minutes
Organized retrieved documents -> 1 to 2.5 minutes
Aggregate and sort TF-IDF for retrieved documents -> 1.5 hours (single thread/processor; mean of 30s per file for 195 files)

That last part I'm looking at in terms of parallelization and concurrency. I figure it's about time to go full mask-off and maximize the usage of the 64GB RAM as that is technically within consumer hardware limits (albeit on the higher end).

meager siren Jan 7, 2025, 5:26 AM

#

Alright, I'm calling it. TF-IDF is taking around 1 hour regardless of single processor/thread or multithreading (and multiprocessing is going to OOM even at small numbers). I'd say the only thing that could save this would be to do a total rewrite of the engine in rust. But seeing as I'm still very much learning rust and there is A LOT of code to go over a rewrite, that's just not in the cards for me at this time.

I'll be updating any notes to the project at this point as well as updating all metadata files in the huggingface repo. I'll also be sharing the respective links (repo included) in #1145719762438074368 .

Overall, it was a fun project that taught me a lot. It was really interesting to tackle search at scale with limited resources (still much more than what most people have). I think I'll probably be cannibalizing a lot of the scripts and lessons over for smaller RAG projects, as well as use this for a downstream "research" project (ie just have a lazy/general search engine that does keyword search for the top N articles from wikipedia around a certain topic, where retrieval time is not an important factor as the records are just being pulled as part of a reference DB for another project).

dusty vector Feb 4, 2025, 9:26 PM

#

Hello buddy, hows it going?

Have you looked anything at vector optimization? If the compute limit is hard to pass by, why not focus on making the vectors smaller?

Its not required for a better mathematical vector computation, just a better way to re-organize ans structure the vectors in a smaller size

dusty vector Feb 4, 2025, 9:33 PM

#

timid sphinx Can you do a napkin calculation for sharded trie for vocabulary->word id mapping...

Is there some method where we can organize vectors in a certain clustering hierarchy,

The terms vector values all are calculated in terms of the centre term. And all centre terms form their own hierachy , based on relation with one other

This should reduce the vector size I am thinking?

timid sphinx Feb 4, 2025, 9:33 PM

#

dusty vector Is there some method where we can organize vectors in a certain clustering hiera...

That's very roughly how most approximate vector search algorithms work

dusty vector Feb 4, 2025, 9:34 PM

#

but how do we make them dynamic per search is not covered by any algorithm

#

?

#

that is we keep changing the centre terms and values per search

timid sphinx Feb 4, 2025, 9:35 PM

#

I am not sure what you mean. You search based on a fixed vector and preprocess only the documents

#

The documents get put into an efficient index

#

The query doesn't, because it is already small

dusty vector Feb 4, 2025, 10:45 PM

#

timid sphinx I am not sure what you mean. You search based on a fixed vector and preprocess o...

all the documents are converted to static vectors

everytime a search happens, how can the values be dynamically changed or structured in a way with some priority that is relevant to the query

#

i think this is what i was asking about

https://research.google/blog/announcing-scann-efficient-vector-similarity-search/

Announcing ScaNN: Efficient Vector Similarity Search

Posted by Philip Sun, Software Engineer, Google Research Suppose one wants to search through a large dataset of literary works using queries that r...

timid sphinx Feb 4, 2025, 10:50 PM

#

There are two pieces of information here, query and document embeddings. There is some function m(q, d) that measures similarity.

#

m doesn't change with q, d doesn't change with q, yet m(q, d) changes when q changes

#

You don't adjust gigabytes of data for each query, you structure your calculations such that you don't need to

dusty vector Feb 4, 2025, 10:53 PM

#

the similarity function happens always on the same vector space

if the vector space was re-structured it could make the processing faster? some way of assigning priority instead of re-calculating the vectors

timid sphinx Feb 4, 2025, 10:54 PM

#

Yes, that's what approximate vector search does

dusty vector Feb 4, 2025, 10:54 PM

#

is it similar to ScaNN?

timid sphinx Feb 4, 2025, 10:54 PM

#

It restructures / indexes the document vectors such that search is faster

#

Yes

#

The indexing they do doesn't depend on any one query, it depends only on the documents themselves, so you don't have to do it more than once

dusty vector Feb 4, 2025, 10:58 PM

#

And while creating the vectors, is there any method they follow to condense the word vector value?
Like say for example i can initially group all synonyms as one same value

And then next step these condensed values are used for creating vector soace

timid sphinx Feb 4, 2025, 10:59 PM

#

The methods used do not look at individual words, but at embeddings of the entire sentence / paragraphs / chunks

dusty vector Feb 4, 2025, 10:59 PM

#

oh yes

timid sphinx Feb 4, 2025, 11:00 PM

#

You can do it at word level, but it makes little sense

dusty vector Feb 4, 2025, 11:00 PM

#

lets consider a sentence

#

similar sentences grouped as one value

timid sphinx Feb 4, 2025, 11:00 PM

#

Most methods do exactly that

#

You build a bunch of nested buckets such that similar items are in the same bucket

#

How you do it is a different question

#

You can probably do agglomerative clustering as you are describing, but it is expensive

#

Since it requires finding 2 nearest neighbors repeatedly, which is n log n in any fixed dimension count but probably slower in higher dimensions

#

I might be off on the exact complexity

#

Just read the FAISS paper

#

They describe a lot of techniques they use

dusty vector Feb 4, 2025, 11:07 PM

#

the embedding part - to make it also efficient, the chunking of sentences is done in some way that it can be summarized as a single sentence/word?

that is the chunks are again reduced

dusty vector Feb 4, 2025, 11:08 PM

#

timid sphinx Just read the FAISS paper

thanks

timid sphinx Feb 4, 2025, 11:08 PM

#

dusty vector the embedding part - to make it also efficient, the chunking of sentences is don...

Ideally

#

In practice often not and that hurts the performance

dusty vector Feb 4, 2025, 11:09 PM

#

but all these reductions are very expensive like you mentioned

dusty vector Feb 4, 2025, 11:09 PM

#

timid sphinx In practice often not and that hurts the performance

got it, if thinking on paper so many levels of reducation could be done

#

though it is only one time compute expensive - we only need the space constructed once at start

timid sphinx Feb 4, 2025, 11:14 PM

#

Yes. And creating a good vector index is not always a matter of just splitting strings and computing embeddings

#

Though that is exactly how it is portrayed in most blogs lol

#

Vector search database company money at work

#

Indexing (as in optimizing vector search) is the least important part of most applications and will have the least impact on performance (assuming there is an LLM in the pipeline). Coming up with good vectors to begin with will have a much greater effect on quality.

dusty vector Feb 6, 2025, 5:35 PM

#

timid sphinx Indexing (as in optimizing vector search) is the least important part of most ap...

By quality what do you mean exactly?

for a general topic knowledge base - how can we say that the vector s computed were high quality?

If it was a closed single domain, accuracy, similarity - many ways to say there is quality in the vectors

timid sphinx Feb 6, 2025, 5:59 PM

#

dusty vector By quality what do you mean exactly? for a general topic knowledge base - how c...

I am kind of mixing up two different ideas there. Performance and quality. Performance should be clear.

#

Quality - primarily you care about some sort of Recall@k or MAP@k or MRR@k metrics

#

Recall@k - is your recall of the correct information if you take top-k results

#

MRR@k is the mean reciprocal rank of the first correct answer (@k adds a restriction to top-k items)

#

MAP@k measures average precision within the set of top-k results and then computes its average across all queries

#

Personally I like MRR and Recall the best since usually you care more about finding the right information and not making sure that all top results are relevant

#

In other words, there is usually one correct document/chunk for a query, not multiple.

#

But obviously depends on the task

timid sphinx Feb 6, 2025, 6:05 PM

#

dusty vector By quality what do you mean exactly? for a general topic knowledge base - how c...

So to answer your question - vectors are good if within your domain you find the right data (Recall/MRR) and you saturate the top-k with correct data (MAP; applicable if there are many relevant chunks)

#

But you have many other quality metrics for the whole system. For example, are you chunks semantically coherent or are you pulling extraneous information / cutting things at inappropriate boundaries? Are you pulling too many irrelevant chunks? Are you pulling too many duplicate/almost duplicate chunks? Does the whole system answer questions correctly?

#

There are other metrics used for search quality as well btw, I just listed the top 3 that came to mind.

#

e.g. https://www.evidentlyai.com/ranking-metrics/ndcg-metric

dusty vector Feb 7, 2025, 3:22 PM

#

timid sphinx Quality - primarily you care about some sort of Recall@k or MAP@k or MRR@k metri...

if i capture user-intent , then for a given query multiple answers may be correct ( that is semantically people were asking different things )
so are we going to quantify quality based on the highest matching user-intent?

timid sphinx Feb 7, 2025, 3:23 PM

#

I don't think intent matters when evaluating a query in isolation, but if you are aiming for user satisfaction, it does.

#

It's just much harder to capture good training data for this.

#

You have to ask people to query the system and then stop them and ask what they actually meant

#

maybe with an LLM you could do that semi-automatically

#

like have an LLM come up with plausible user intents and then ask the user "Do you want to: a) ... b) ..."

broken summit Mar 5, 2025, 8:43 PM

#

Why not using the Elastic or OpenSearch Docker images? You'd be spinning them as language-independent REST endpoints, which means that you can communicate with them with any tool (e.g. curl) or library (e.g. requests) supporting HTTP(S). You just need to install Docker (and assign >= 4GB of RAM to the images, depending on the number of vectors you want to store, and their dimensions).

#Best Vector DB for Local Use