#Best Vector DB for Local Use

1 messages · Page 2 of 1

meager siren
#

It should be

west pagoda
#

So how will it be queried on words?

meager siren
#

You have substantial overlap between groups

#

Computers overlap with history and math

#

History can be literally anything with a date

west pagoda
#

example?

inland lynx
meager siren
#

What algo keeps this fair given the corpus?

#

Yes

west pagoda
#

Hey @meager siren there are two sets of categrories in the xml. one is main category like computers then you get subset mainframe as an example and then you have articles with titles. You could extract those and make a trie of it?

#

yep

meager siren
west pagoda
#

I saw them when scrolling through the data but not confirmed i'll check and see

#

haven't looked too much myself ghost would know I am busy elsewhere

#

true...

meager siren
#

I gotta go back and check too

#

Invert index words to categories and then use that to get a set of documents

west pagoda
#

index words = nodes

#

I think there were sub categories as I said computer is main and mainframe computer is sub

meager siren
#

Not confident about generating additional data but I follow the rest

west pagoda
#

so sub sub will be generated if needed

#

I understand I'm giving example

#

mainframe itself has many sub catog I presume but I dont think I saw anything in metadata.

meager siren
#

Not sure TBH

#

You’d have to peak the dataset that I’m using in general

west pagoda
#

[[Category:Kansas State University academic buildings]]
[[Category:Architecture schools in Kansas]]
[[Category:Landscape architecture schools]]
[[Category:1908 establishments in Kansas]]
[[Category:University and college buildings completed in 1908]]

#

this is how it is.

meager siren
#

It’s structured XML data with the documents inside a tag

west pagoda
#

abakysis?

#

I see

meager siren
#

More blind meta meta data analysis

west pagoda
#

I was wondering the same

meager siren
#

Hard to visualize in a notebook with 95 GB of data

#

Looking at just the meta meta data like the category words takes a while because the script has to load so many files

#

Kinda

#

195 files but each one contains 150,000 documents

#

Yup but not by category

#

Just chunked by whatever they were when i downloaded them from these big bz compressed files

west pagoda
#

okay so some of them have categories in title tag
<title>
Category:Mark Eitzel albums
</title>

#

words

meager siren
#

Yeah, I’m also assuming lexicographically sorted

#

It’s not

#

But that’s how they came

west pagoda
#

tf-idf was calculated for corpus and then you make them as nodes and add documents under the respective nodes.

#

No

#

data was downloaded

#

in chunks of files

#

one file has 150k articles

#

no idea about catog here

#

just downloaded it

#

then extract articles

#

find tf-ifd

#

idf*

#

yep.

#

in what sense?

#

Ahaan good idea.

#

That's why I asked ghost to keep a higher threshold for tf-idf values any word below it will discarded

#

that was one way of doing it

#

now catog are introduced they are better

#

nope

#

I am talking about the words

#

you only discard the words with which the trie will be made

meager siren
#

Already built tries for the inverted index, word frequency for each document, document frequency for each word, and word IDF for all words. And again, scale of the data would make any sort of analysis difficult to verify.

#

Verify, no build

west pagoda
#

I know hence the new method of using catogs

meager siren
#

…. No it’s probably in the order of a few GB depending on what you’re looking at

#

Word IDF when stored in msg pack is like 4 GB

west pagoda
#

can you limit the byte size of the stream?

#

python should have it. many other langs normally do.

#

So @meager siren limit how many bytes are read

#

and output to terminal

#

seems to be for online stuff

#

if you dont limit the read bytes it will load everything in memory

#

he said 4gb in msgpack unpacked would be much bigger in ram.

#

not by a lot

#

but still

#

what are we talking about here lol. Outputting metadata to stdout?

#

batch load would be slower then just loading the file.

#

well I am gonna leave and install some different OS now. Cause I feel like arch is getting old

#

will meet up with you guys some time later.

#

what data is being generated?

#

site unreachable.

#

hey @meager siren @inland lynx I see short description tag as well

#

which gives little know how about the article

#

here is the dataset

#

@inland lynx ^

meager siren
#

It's not my dataset, it's just a copy of the data I pulled from Wikipedia (which is updated frequently) and I am just storing one copy I pulled from April.

meager siren
#

The FULL wiki data though?

meager siren
#

English wiki, but yes

#

Create search engine to run in reasonable time (ie 5 minutes per query) to allow me

  1. get experience building search engines at scale rather than using langchain or llama index
  2. plug in as part of a RAG system and replicate experiments done in WikiChat paper from Meta AI using TinyLlama 1.1B instead of Llama 2 7B
#

If you can use small LLM to replicate wikichat, it emphasizes you dont even need 7B models to be effective at simple tasks like QnA.

Leave it to the nerds in Meta or OpenAI to figure out how to make LLMs "smarter"

#

I'm just trying to get it to not hallucinate with knowledge that can be stored offline

#

Not really

#

it's about hallucination reduction using a generally decent ground truth dataset

#

Well, that's why I said "generally"

#

Are you saying that the generative part of RAG is going to obscure or destroy the truth that is retrieved from the knowledge database or are you looking at a more philosophical question?

#

look, let's put the thought experiment aside

#

This project is primarily for fun and for hands on learning

#

I have yet to add in optimizations for speed such as using Rust or multithreading

#

I'm assuming single core processor for the sake of min specs (consumer level PCs/laptops)

#

Dont want the system to take up all cores and RAM

#

Systems can have limited number of cores

#

ie raspberry pi is only 4 core ARM processor

#

I could have designed this thing to just run on my server

#

but then no one else could go over and replicate my results, build upon them, or even tested the program

#

It's not but it's an example of an "offline" usecase

#

Imagine something like a robot that has a 1TB HDD with this program and data on it. You have a mobile and semi competant machine that is portable

#

They're all relatively the same documents

#

But using a hierarchical grouping system does seem like an interesting approach. Downside is that not all articles have explicit categories like mentioned before

#

On the plus side, RAM usage is pretty low ATM for search

#

No

#

It's a server, but like I said, single thread single core ops

#

It shouldn't be

#

It's just a mail thing on Ubuntu I set up for easy stuff'

#

small batch high variance doesn'ts speak to much confidence

#

Did you find the algo that would be good to build the category trees?

meager siren
#

Another thing I found. Some of the data (articles/documents) are just category data. In other words, they have a “Category:Some subject” in the title and the rest of the text body is just “[[Category:Category data]]”. I gotta filter those out now too.

#

Not sure if I can use them as a way to aggregate different category labels under a few larger categories and build a category tree that way.

meager siren
#

@west pagoda Setting and IDF filter threshold to 1.0 did seem to help cut back on some of the documents returned from the trie inverted index. Went from around 10M to 9.5M. I'm still not sure what I can be doing with the timetraveler's idea around using categories to construct the inverted index. Also perusing the links he sent above

meager siren
#

500K articles removed does help but I agree that it’s not great

#

I’m looking through them now

#

Remind me again how I go from bag of words to inverted index with the category tree?

meager siren
#

Right but what about queries that are not key words but rather an actual text that was just preprocessed into a bag of words?

meager siren
#

How is he using a different TF-IDF? The formula is the same regardless.

#

Is he weighing documents based on the cosine similarity between TF-IDF vectors of the query and document?

#

If so, then this is fundamentally what my current search engine does

#

Nevermind, I just read the paper

main cargo
#

FAISS probably for local use

meager siren
#

Ok, let me see if I understand the general idea of what you've been proposing @inland lynx :

  1. Create a clustering algorithm to group words with different categories
  2. Use a category tree to organize the categories.
  3. Our inverted index now maps words to categories
  4. Once we have the categories/sub-categories isolated, we get the mapping of categories to document IDs
  5. We aggregate the results into a list and then run TF-IDF on those documents and then get the top best results from there.

does that about sum it up?

#

And for the word -> Category mapping, we have some limit/threshold so that we don't accidentally pull ALL categories for words that are common across many articles and we do the same for Category -> documents?

#

Why is the tf-idf weighed at step 5?

#

I assume so

#

do a set union and use tf-idf would be my guess

#

for straight TF-IDF, we vectorize the query and cosine similarity the query against the doc vectors

#

for BM25, we still have parts of TF-IDF but BM25 does not require cosine similarity nor vectorizing the query

#

The query still influences the search because of how it still uses the query term TF and IDF

#

you want to remap query to just the terms in the (sub) categories?

#

yes

#

TBH, I have both implemented for the sake of comparing results

#

But the flow of data is expected to be similar across both

#

The flow of data is as follows: the query should map to a list of documents (via the inverted index) and we perform the algo search (TF-IDF or BM25) for which documents are relevant to the search

#

Current issue is the inverted index is returning too many docs and so we're investigating a mapping to categories to shorten the results

#

BUT we are finding that the mappings may not be effective in theory.

#

well, we already have trie index mapping

#

straight up word to document ID inverted index with tries

#

Currently working on figuring out how to classify words to categories

#

Right but how to take a word and search through a category to get the sub categories

#

unlike the trie inverted index in which I can organize things to be roughly O(1)-ish per word, I'd have to traverse each category trie O(n) for every word to get the correct categories

#

It's not about the levels

#

It's about how many categories I have to sift through in general per word

#

I can build a trie that maps word to categories/sub-categories. But that doesnt help me much more than my current inverted index

#

Just like I mapped each word to a document, I just pull the category data from the document

#

thing is that still doesnt help eliminate options as efficiently

#

particularly with words that span a large chunk of (unrelated) documents

#

It's building right now

#

the trie for word to category

#

No, it doesnt

#

tf-idf is just a sparse vector scheme for sparse vector comparisons

#

nothing has pulled actual key words from the query

#

calculation is based on the words in the query vs words in the document. Take the matching words and create a vector to perform cosine similarity

#

It doesnt

#

the inverted index doesnt prioritize words over others

#

it treats all words equally and aggregates the document IDs of each word in a set union

#

No weight to documents based on return frequency across words or weight based on word IDF

#

that's kind of the hope with the ideas for the inverted index

#

to figure out how to prioritize words in a way that is fair and wont accidentlally kill valid documents from the search

#

yes

#

we need to short list the inverted index without using calculations form there

#

the edge case is the inverted index step

#

ie the word "plot" expands to 10M articles around stories and movies and land and schemes

#

when the original query was "who started the gunpowder plot"

#

depends on the context of the query my guy

#

that's why it's a problematic word

#

right

#

but when we look at the documents returned, we get documetns for "gunpowder", "plot", "start", etc

#

hence, we get A TON of documents (like half the corpus)

#

because we do a simple aggregation of set union

#

I told you, the trie for tha tis still being built

#

and this was more of an example

#

You can run the code yourself man

#

links to repo and data are in the thread

#

It does

#

taking a raw query and putting it into bag of words, and doing a word level inverted index is returning a lot of documents

#

Well no kidding, the tree is still being built

#

Check the actual example I have in searchwiki.py

#

That's downstream

#

I have a class called rerank that handles exact passage lookup with vector search

#

But yes, we also need something simple up front

meager siren
#

Hope you got enough storage space

#

?

#

Huggingface hub for backup

#

but I have a 1TB external hdd as another backup

#

and my server storage is 1Tb large

#

Server wise?

#

56 cores total (2x Xeon CPUs), 68 GB RAM, and 1 TB storage

#

Idea is to get it to run on any consumer level system. Server offers maximum parallelization for preprocessing scripts

#

You're fine to scale it as needed

#

Just have more than 8GB regular RAM

#

Anyways, just get an idea for the number of texts in the corpus vs the number of texts are returned from the inverted index

#

and then you see the problem I'm having

#

It's hard for python to scale the TF-IDF or BM25 compute across all files quickly

meager siren
#

@west pagoda I think I found an interesting issue with one of the modules I'm using, specifically msgpack. I noticed that there was A LOT of cache/buffer memory usage after I had finished running scripts. So I went through all the packages I used and settled on msgpack. I assumed msgpack was implemented in C given the speed and since it's a third party package that is used a lot. And I found an open issue on GitHub related to this.

https://github.com/msgpack/msgpack-python/issues/612

GitHub

Hello, I think I've found a second Python 3.12 memory leak (having read the issue for the other one, and tried upgrading to msgpack 1.0.8 to fix). This issue occurs with dicts, and key uniquene...

west pagoda
#

Ahh I see...

#

yeah I have had issues with python 3.11 and 3.12

#

so I settled with 3.10 for now. Cause there are some packages that don't work in 3.12 due to inter package dependency and one package not being upto date cause issues in another.

#

3.11 had issues with llama index not running properly

west pagoda
meager siren
#

Yup. And it’s even on 3.10 Python

#

Have you been implementing stuff in rust? I wonder if it would be better loading in that data with rust instead of the map hack pypi package

west pagoda
#

msgpack doesnt seem to be maintained last commit is 3 months ago

#

and there are 2 PR still un-merged

#

I would say try doing this in rust if possible. do file manipulation in python like extracting metadata or articles from xml

#

etc

#

then for preprocessing you can move onto rust.

#

it'll be faster as well

meager siren
#

I figured that the search functions can be converted to rust and called from Python

#

Things like the inverted index and computing TF-IDF or BM25 and returning the best scoring entries

west pagoda
#

ahh

#

sorry

#

I get it

meager siren
#

Yup

west pagoda
#

so rust binding for python

meager siren
#

Keep to Python but offload the heavy compute stuff to rust

#

Cuz Python can call rust code

#

With maturin or something like that

west pagoda
#

got it. yeah send msgpack to rust as well. I just don't see python msgpack being a good package

#

i saw the msgs for category issue. You can't grab the category from the xml file?

meager siren
#

If I’m understanding timetraveler correctly, the bag of words component allows the current inverted index to return too broad of a search and we can’t figure out how using the categories would help the issue

west pagoda
#

instead of words being the node the category would represent nodes and then you can add relevant documents under each category

#

although you may use sub categories within categories

west pagoda
#

so you most likely wont have 10 million under each category

meager siren
#

Right but figuring out the logic for that is a bit hard for me to grasp

#

Turning words to categories isn’t too hard

#

Could just build a map for that

west pagoda
#

and if you make more sub categories then you'll have even better grouping hence even less articles at once

meager siren
west pagoda
meager siren
#

Question is to turn categories to documents without returning too many

west pagoda
meager siren
#

Recall that using set union for the word to doc tries inverted index returned a lot of documents

west pagoda
meager siren
#

Any suggestions?

west pagoda
#

one minute i'm thinking on it

#

here's the parser btw

#

this is basically what I am thinking

#

it uses tf-idf method to create vectors

meager siren
#

I’m not sure how I like the amount of ads and popups on the site

west pagoda
#

lol.

#

I use brave so I don't see them

#

@meager siren what was the dimension size you were using for embeddings?

meager siren
west pagoda
#

when using encoders

west pagoda
#

were you using the whole articles or chunked?

meager siren
#

Chunked

#

512 tokens max

west pagoda
#

so the max size was 512 tokens and 768 sized dimensions

meager siren
#

Yup

#

Vectors are not a compressed representation of data but an abstract one that just makes it easier to operate on instead of words

west pagoda
#

I am reading an articles that can scale the dimension size by 32x smaller

#

so you get lots of memory benefits

#

I feel we might be able to go back to the vector db version of this project

#

lemme read further

#

okkk

#

this is it

#

@meager siren

#

42 million vector embeddings how much space would those take? take a guess

meager siren
#

At how many dims?

west pagoda
#

same wikipedia articles that you have.

#

1024

meager siren
#

Hang on

#

Let me guess

west pagoda
#

yep yep go on

meager siren
#

4 bytes x 1024 x 42,000,000 so probably around 42 TB?

#

4bytes from one float32 value per dim

west pagoda
#

wait

#

wrong calc

#

it's 160gb

#

are you dividing by (1024*3) for GBs?

#

ohh sorry not

#

that

meager siren
#

Float 32 -> 4 bytes

west pagoda
#

yep

meager siren
#

Per number

west pagoda
#

4*1024 = 4096

meager siren
#

X 1024 dims is 4 KB

west pagoda
#

4096* 42000000

#

172032000000

#

172032000000/1024 = kbs

#

168000000

#

168000000/1024 = mbs

#

164062.5mbs

#

164062 / 1024 = gbs

meager siren
#

That doesn’t seem right

west pagoda
#

160.217

meager siren
#

I think you double divided the KB

west pagoda
#

nope

#

it's correct atleast on the calculator im using

#

check on yours

#

i'll use online version just in case

meager siren
#

I just checked

#

You’re right

west pagoda
#

lol

#

I was like 42tb

meager siren
#

But you are also assuming 1 embedding per article

west pagoda
#

yep for now

#

that's not the point

#

this is

#

so 42 million takes 160gb

#

the stats here are for 250 million vector embeddings

#

which is:

#

~937gb

#

now

#

if you use binary quantization you can scale this 250 million embeddings to ~29gb

#

29

#

just 29

#

from 937

#

lol

#

Here is the steps:

Combining binary and scalar quantization is possible to get the best of both worlds: the extreme speed from binary embeddings and the great performance preservation of scalar embeddings with rescoring. See the demo below for a real-life implementation of this approach involving 41 million texts from Wikipedia. The pipeline for that setup is as follows:

The query is embedded using the mixedbread-ai/mxbai-embed-large-v1 SentenceTransformer model.
The query is quantized to binary using the quantize_embeddings function from the sentence-transformers library.
A binary index (41M binary embeddings; 5.2GB of memory/disk space) is searched using the quantized query for the top 40 documents.
The top 40 documents are loaded on the fly from an int8 index on disk (41M int8 embeddings; 0 bytes of memory, 47.5GB of disk space).
The top 40 documents are rescored using the float32 query and the int8 embeddings to get the top 10 documents.
The top 10 documents are sorted by score and displayed.

Through this approach, we use 5.2GB of memory and 52GB of disk space for the indices. This is considerably less than normal retrieval, requiring 200GB of memory and 200GB of disk space. Especially as you scale up even further, this will result in notable reductions in latency and costs.

meager siren
#

we've lost the plot though.

meager siren
#

sparse vector retrieval via BM25 and TF-IDF reduces the amount of storage and compute you would have to do in comparison to full vector search

west pagoda
#

yes and nope. since they both produce vectors you would still have float32

#

in this case vectors are int8 or binary

#

most likely int8 for results

#

so even with less vector count in tf-idf the resulting sizes are roughly equivalent but one will give better results in top k

meager siren
#

Right but the first vectors are dependent on the length of the search query vs the dense vectors are dependend on the model output dims

#

I'm just saying, if you can use sparse vectors to isolate the correct articles of interest, the amount of compute to do dense vector retrieval is drastically reduced

#

that's why people have been looking at this two stage rerank system for RAG

meager siren
#

In this Video we discuss advanced retrieval techniques from Vector Databases with Query Expansion and Two-Stage Retrieval with a Cross Encoder. We use another Reranker on top to take the famous "lost in the middle" problem into consideration.

Code: https://github.com/Coding-Crashkurse/Applied-Advanced-RAG

Timestamps:
0:00 Two-Stage Retrieval i...

▶ Play video
west pagoda
meager siren
meager siren
#

It's seconds for a handful of documents

#

More like hours/days for all of wikipedia

west pagoda
#

doubt that hours part.

#

which vector db bench marks are you quoting?

#

lemme send some links

meager siren
#

Looking at lancedb since it's using disk storage

west pagoda
#

milvus can do 10248 queries per second with 1 million vectors

#

standalone is 7522

#

which is local

#

1.2 seconds for 1000 retreival queries among 1 million vectors 128 dimensions

#

for one query it's 0.0361 seconds

#

doubt your single user on a laptop can bunch out 1000 queries a second which will cause 1 second delay in his life

#

also I don't know if these are float32 most likely they are maybe... cannot confirm. but that means binary quantization will work faster

meager siren
#

I mean, be my guest if you can scale this. I wanted to do a vector search benchmark but I would have to clean up the data by remove redirect and category documents from the dataset

west pagoda
#

soka.

meager siren
#

One thing that I love about open sourcing this project here is that I've gotten a ton of feedback from you and other devs on how to approach this problem given the scale

#

everyone can build their own version given the data I have consolidated and made available

west pagoda
#

we can still technically go the trie based way and it still would be faster than using vectordb. However the main use of vectordb is where you dont get metadata such as category/short description etc

#

then you can't rely on much other than vector search by using vector embeddings

meager siren
#

I mean, right tool for the right job

#

if yo can filter our the target documents the vector DB can do its job on the limited amount of data quickly

#

dont forget that vector DBs take time to process all that data into vectors

meager siren
#

forcasted about 2 months to store it all but that was unoptimized

#

and I ran out of space very quickly so....

#

that's when I decided that the compute of vectors on the fly at inference time, as long as the number of vectors was limted, makes more sense

west pagoda
#

yep that's why the binary quantization would help in memory.

meager siren
#

Does transformers support binary quantization?

#

if the vectors are dynamic and limited, I could use faiss which does support binary quantized vectors

west pagoda
west pagoda
#

and yes faiss supports quantized vectors and a few mainstream vector dbs as well

meager siren
west pagoda
#

ok then, I have to go to sleep will talk when I wake up.

west pagoda
#

ghost can

west pagoda
#

the vectors made by them become very large

#

that was the conversation I had with ghost

#

he can use quantized vectors to scale the memory consumption down by 32x.

#

and that will also increase the lookup speed

#

he does want a trie based method.

#

so we're going with that

#

@inland lynx ^

meager siren
west pagoda
#

quantization of vectors would not loose information to such an extent

#

quantization is just using lesser bytes to represent the value with less precision

#

so yeah loss of data but not much given the vector can be safely quantized to such a level with good retrieval.

west pagoda
#

so what you said was transformations.

#

quantization of vectors is more like scaling.

#

so you represent the same vector on a small scale.

#

say if the vector was (10.010201, 10.103020) then we can represent that as say (10.01, 10.10) so we reduced the precision. But the information in 3 significant digits will have similar accuracy to 8 significant digits

#

the benefit is mainly memory, if memory size is reduced than the speed is automatically increased in terms of vector calculations.

#

i'll definitely look into this.

meager siren
#

@inland lynx can you give a step by step super simplified diagram of the logic flow and mappings that are needed for the category based inverted index for the bag-of-words search (TF-IDF, BM25)? I think something like that would be tremendously helpful (at least for me) wrapping my head around what you've been trying to communicate

west pagoda
#

I'm not being confused. I said what you explained "If you quantize the data, we loose Accuracy as we are using data in an encoded" <-This is exactly what I said but you are wrong to say encoded. It's not encoded. How is using float32 vs float64 encoding. I see that as precision scaling. If say you had a vector represented by both fp32 and fp64 and plot it on a graph you would see no difference at all in their location. however one is computationaly faster. But scale the float point even less say fp4 which is float point using 4 bit then the graph looks somwhat same but with large number of vectors it's going to be harder to distinguish since only so many values can be represented in 4 bits. That's why I said precision scaling.

#

"As its binary quant, the vector is similar to a Boolean yes or no, in terms of one characteristic" <- For this i'm not too sure. I haven't acually run the code to see if it outputs binary yes and no, even if the name implies it. But if that's the case we can simply use int8.

#

And yes int8 is a encoding since you went from decimal points to whole number. And yes you loose quite some accuracy but in terms of vector angle difference it isn't a whole lot. say you have 10 million vectors and you try to find top-k = 10 meaning top 10 closest values to some query vector X, then you can good retrieval but it doesn't scale all too well after a few 100 million vectors

meager siren
#

Problem with using pure transformers is that you have to consider the scale of the KNN search on consumer hardware. I think we already had a discussion on the time it would take to embed all vectors as well as performance of retrieval from the vector DB. From what I recall, we pretty much said "yeah, the metrics look good but it's untested on the wikipedia data with regular hardware so.... 🤷 "

#

The goal of the project was to narrow down the list of possible valid articles to some top K with the sparse vectors like TF-IDF and then find exact passages with BERT (that should keep cost down). I'm still re-reading the paper to see how the weighted TF-IDF would work/look

west pagoda
#

@meager siren

#

you are doing tf-idf in rust correct?

#

what type are you using? float32, float64 etc

#

also how many articles were left after duplicates?

meager siren
#

Float 32 should be sufficient

#

I did use 64 bit int for counting integers

meager siren
west pagoda
#

I suggest to use even lower floating point

meager siren
#

They have lower than float32?

west pagoda
#

or some sort of quantization that's int 8 based

#

that should be some good optimization on it's own

west pagoda
#

otherwise you can use another thing

#

they have pytorch/libtorch binding lib you could use the fp8 that pytorch provides

meager siren
#

Not sure how int 8 helps given the TF IDF of a character is a float

west pagoda
#

basically tensors

west pagoda
#

but I am just assuming there might e a way of quantization on tf-idf

meager siren
#

I think it’s good to keep memory low

#

And most vectors are pretty different by the first 6 digits after the decimal

west pagoda
#

lol

meager siren
#

Int8 and float32 are the smallest size dtypes in the vanilla rust library

west pagoda
#

wow

#

@meager siren just found this

#

idk how useful it is

#

since im working on my own thing

#

check it out and tell me

meager siren
#

@inland lynx I finished the paper on utilizing the categories, keyword extraction, and TF-IDF to give a more streamlined search. It's pretty interesting though that it heavily relies on the categorical mappings. I see a lot of the logic they used for the weighing functions and it sort of makes sense (I wonder if I can swap out TF-IDF with BM25 to compare results).

meager siren
#

I am working on building the category to category map from scratch. The only reason I'm not using the repo you suggested is because I'm not sure if key categories overlap or are missing from my current dataset. Current script takes a few hours (primarily because it reads the original xml article files and those files take time to load (more time to load than it takes to actually process the article data).

#

I can speed this up with multiprocessing but it will take some time to work out. I still have to break up the current graph to remove cycles and identify starter nodes

meager siren
#

Ok this may actually be much more helpful than doing it by hand

#

Still have to build the TF-IDF data in such a way that data is properly propagated up the graph/tree(s)

meager siren
#

The paper you shared earlier had used TF-IDF to determine which categories to go down. You need word freq & number of documents per category in order to determine this. The wikicat repo does organize the categories into a tree/DAG structure but you still need that metadata propagated up to all the categories (my code only has that data at the leaf nodes of the tree/DAG).

#

No, we are

#

but the metadata for that child has to get propagated up to the parents otherwise there is no way for the algorithm to know which category to go to

#

@inland lynx hang on let me break it down

#

wikicat gives us the categories structured in a tree right?

#

And the paper you shared, it used a modified TF-IDF to determine which category & sub category to traverse in the tree

#

It does not

#

wikicat does not have that metadata

#

Or at least, I'm not seeing it

#

right, that's not the problem

#

the problem is that, in order for the TF-IDF to be used, each level of the tree needs the metadata useful to compute that vector

#

we have the metadata (for TF-IDF) in the leaves

#

that metadata has to be propagated up so that the next level (parent categories) have it aggregated

#

the paper does a BFS implementation but yes

#

kinda

#

there is nothing wrong with the algo from the paper nor the structure provided from wikicat

#

We have to populate the metadata to the tree bottom up (leaf to root)

#

Yo have to

#

because it's cheaper to precompute that before inference

#

than to do this at every inference time

#

You read the same paper as I did, right?

#

kinda

#

it talked about document but we can do the same for query

#

right, that's true

#

But the paper builds that vector based on the data from the children/leaf nodes

#

that's wat I'm saying the problem is

#

right now, I'm trying to give the nodes enough data to build the vector from the child nodes

#

and store that to file

#

I understand that

#

you have to populate parent categories with vectors from the metadata. That metadata is coming from the sub-categories

#

You are assuming that wikicat already has that vector info precomputed

#

Your description doesnt line up with the resources you've provided

#

It's not that it's computationally efficient

#

It's that I'm having trouble finding a way to build the tree they did from the bottom up

#

they gloss over that heavily

#

I'm not talking about the relations

#

the graph structure is already made with wikicat

#

I'm talking populating the structure such that the traversal described in the paper is ready with minimal compute

#

stop talking about relationships. We already know the relationships have been established between categories

#

yes

#

but you have to move the data up the tree to parent nodes in order for that vector to be created

#

Information has to be moved bottom up for you to create vectors that go top down

#

aggregated vectors

#

Ok, but how do you take a query like "who ran for president in 1992 in the United states?" and go category by category down the tree?

#

That is not how the paper did things

#

If i were to do what you just suggested, you know what I'd do? I'd use word2vec to get the category label that closely matched my query bag of words.

#

My point is that strategy is tangential to the paper you had me read

#

I guess I'd just do a BFS traversal similar to the paper and use the highest cos similarity of a word in a bag of words against the category label

#

no

#

whatever

#

yeah, it's fine

meager siren
#

Summary of inverted index schemas that have been proposed thus far:

  1. Plain inverted index using a trie to store the information/mappings in a compressed manner. For each word, get the documents that the word is found in. Aggregate the documents in a set union operation
  2. Plain inverted index (like above) with a filter against words with low IDF scores (given an arbitrary threshold).
  3. Weighted inverted index. Similar to 1, each document gets a weight that is computed from the sum of the IDF values for all words that returned that document.
  4. Category search - Word2Vec. Do a BFS style search on the category tree where you choose the branch that is "most similar" to the query based on aggregated Word2Vec scores between the query and the category for that node
  5. Category search - modified TF-IDF. Similar to 4, except the comparison uses a modified TF-IDF for the vectors instead of Word2Vec.
meager siren
meager siren
#

@inland lynx Tried using wikicat.... Downloading the data dump has not been fruitful

meager siren
meager siren
#

@inland lynx I just had a thought. Since the original files are broken up into 195 files, each with 150,000 articles, why not have the search engine run on each file instead of building monolithic stores that contain metadata per some arbitrary chunk size?

Pros:
 - Conceptually easy to parallelize
 - No worries about load time in fileIO
 - Possible reduction in memory overhead
Cons:
 - More fileIO operations
 - Possible (likely) data redundancy
#

And before you ask, no I still have not implemented categorical hierarchy inverted index yet. It's something that's taking a long time to hack together.

meager siren
#

More like if we have 1 file for a search engine to process, You lose the compression of storing the inverted index in a trie because you would have 1 trie per file rather than the 50 or so tries that have coverage for the entire corpus.

Alternatively, we can still keep 1 trie per starting character found in the words for a file but I'm not sure if having extra files would help or hurt.

#

I also agree that searching the entire file is not ideal. But also consider that the files are pretty small (1.6GB token max but most are 700MB). Loading will be slow but just as slow as all the other file loading that's been done so far (ie load tries, redirect files, IDF files, etc)

#

Simplicity

#

Right, but it's also been compute (RAM) intensive

#

Running on my server, we used up 24 GB RAM easily when doing search

#

that's with iterating through tries that account for the corpus

#

virtual memory readings are unreliable due to the memory leak in the python msgpack module

#

On a run my cache memory is like 40GB

#

2GB MAX, mean is more like 700MB

#

The 24 GB was also including a few other files

#

7GB can be accounted for via the doc to int /int to doc mappings

#

IDK

#

remember, the metadata files I've made are for the entire corpus

#

I figured that sharding the search engine could help keep the overhead load and help reduce the number of documents I have to read into memory

#

entire file is only needed for final retrieval

#

No, sharding based on the file splits of the data (the 195 files)

#

The only metadata file that wont be sharded would be the IDF metadata, since that is corpus wide information (overhead/file size is still around 2 GB)

meager siren
#

It's actually kinda against TOS to use it for metadata storage

#

I wont tell if you dont though

meager siren
#

No hard limit but I would suggest keeping it below 500GB to avoid detection

#

Also, i think that still qualifies as metadata

meager siren
#

what's the threshold for high weights? What are the weights based on? IDF?

meager siren
#

Sounds like word2vec

#

Or TF-IDF with some extra steps. Either way, the term extraction/isolation from the query is a bit iffy without an algo.

crude walrus
#

i just startet my RAG project 2 days ago, should i search for some workarounds or just change VectorDB?

meager siren
#

Use TF-IDF to create an inverted index to filter out/reduce how many TF-IDF operations you have to do on the actual corpus

crude walrus
#

😕 wait correct me if im wrong
to run RAG app with vector search i need running Embbeded model every time i perform a query, not only when prepared data to populate vector DB
🤔

meager siren
#

because you query also needs to be converted to vector space so the vector DB can do KNN and return the closest results

crude walrus
#

thanks 🥹 i would try to find small embedding on HF model to run locally on CPU to finish my project

meager siren
#

It was a joke. That being we're using TF-IDF twice but one is for the inverted index to filter/narrow down our search

inland lynx
white bay
#

At this point you guys should consider making your own vectorDB and retrieving algorithms

white bay
#

No I know idea must be good thats why this thread is still alive

meager siren
#

That's one of the reasons why the repo is public. If someone can do it better, they have a place to start from

meager siren
#

I switched the TF-IDF to do a file level search (in other words, you have multiple threads where each thread processes a chunk of the 195 files). Spawned 32 threads for the process and waiting on results.

I had a few bugs which got my hopes up (I thought the return speeds I was getting of 100s per query were good but there were entire sections being skipped). Still hoping that the multithreading and chunking brings the search time really far down from 2 hours per query.

#

Inverted index trie was also split by file (and then by starting character). It reduces the savings on storage but keeps the inverted index still relatively compressed.

meager siren
#

File search

#

Figured that threads processing around 100,000 articles per thread would be more fruitful than doing the inverted index to get 10M articles and then sharing the TF-IDF process to threads

#

Preprocessed articles in each files into tries. Preprocessing skipped redirect and category articles

#

Same stuff I used to build the Tries previously

#

Still using bag of words atm

#

I am doing per article

#

I’m just allocating threads by file

#

195 files splits across 32 threads puts 7 files per thread. For each file, I get the relevant files via the inverted index trie for that file and pass that on to the TF-IDF/BM25 algo for scoring. Scores are kept in heap that has a max size.

#

Heap from each thread is aggregated and again, that has a max size.

#

That final heap is returned/passed to a function to retrieve article text.

meager siren
#

Note to self, num-workers value for threadpool is capped by the memory alloted to the PC. Each trie folder is about 200MB on average and doc_to_words map is about 400MB on average which gives overhead of around 600MB to as much as 1GB per thread.

#

Fewer threads means less parallelization (and thus slower inference/search) BUT that is all dependent on hardware so there's not much more I can do in terms of optimization (until I do the Rust conversion of the source code).

meager siren
#

OOM issue due to worker count

#

Each thread has to process a chunk of files in serial. Overhead for that is about 1GB memory. Across 32 workers, thats possibly 32GB memory usages.

#

No risk of deadlock since each worker gets its own unique files

#

32 workers means each thread has to do max 7 files. BUT because of the OOM I'm trying 8 workers right now (puts the count to around 25 files).

#

I'll also try with 16 and 12 workers on my macbook and see how that goes. I'm sure my server will be able to handle this just fine (maybe even handle it as multiprocessing). Kind of a downer because the whole point was that "any" PC could run this reasonably but you do have to consider the scale of the data and.... yeah, it's just a lot.

#

Not sure

#

Assuming rust provides something wild like a 10x speedup and we take the 2 hours per query (from my previous runs under serial load/no parallelization) as a benchmark.... That would be around 10 - 15 minutes per query...

#

Oh, I forgot to mention to you is that the category tree stuff failed to build on my system. Ran for so long and I think my system was rebooted after a few weeks from an update

#

No, I built from the category data in the dataset

#

Heap for my bag of words does this

#

stdlib from python

#

heap entries are the following tuple (TF-IDF score * -1, document filename + article hash)

#

Pretty much. I did everything I could to cut back redundancy and reads

#

Even put in calls to delete variables and use the garbage collection to clean them up.

#

Only memory wastage is the memory leak from the msgpack module

#

and the TF-IDF score for ordering

#

document name is going to take up some space, but because the heap has a set limit before it starts poppushing, it should be fine

#

Because I'd have to spend more time on getting category key working

#

I'll give the thing a try, but remember that 1) this is a hobby project and 2) I have other things to do so progress will be slow.

#

Since I'm away from my server, I wont be able to do any have processing for the category tree, so it would have to be a prebuilt tree that gets downloaded.

meager siren
#

I’d starts where I got the wiki dump.

meager siren
#

Downloading enwiki-latest-category.sql.gz-rss and enwiki-latest-categorylinks.sql.gz to try and see what I can get. Downside is that I have to use sqlite3 to explore it. I may consider swapping to sqlite instead of xml because I can interface easier on python. Not sure about the speed though.

#

Code/links?

#

Did you get the entire category trie?

meager siren
#

Not bad. Can parse it since it’s XML format

meager siren
#

Well, if I can implement the TF-IDF search via the category tree, it may eliminate the trie all together

#

Also, I just finished downloading the two sql files from the wikipedia hub. the latest category sql file is about 200MB but the category links is at abou 30GB

meager siren
#
#

But TBH, with the shear scale we're looking at... I dunno if I wanna do it

#

Primarily for the ability to take a dump and parse it when it updates.

#

The categories oerall wont shift from year to year but there are new ones that might be added in accordance with major historical events or scientific discoveries

#

in which case, that static graph would be out of date

#

ok then, let me look again at the repo one more time

#

Alrigh alright

#

Just give me today to look it over. I'm not by my server today and I have to drive back home later (It's a 2 hour drive from my current location)

#

I'm afraid that is gonna be compute intensive

#

since you have to work bottom up from the tree to build the TF-IDf word vectors

#

For their scale, it's not compute intensive

#

they have a budget for CDN but I imagine they have an entire server dedicated to search

#

It's interesting to ponder

#

most people entere wikipedia via Google search, not via the main wikipedia homepage

#

so most people dont search wikipedia via wikipedia search engine

#

No no

#

I’m doing a dumber search engine

#

One that can take in raw queries like goog but outputs the Wikipedia articles

#

Via that max depth parameter in your graph builder?

meager siren
#

Mess with the max depth to get a slightly taller tree. Something either 4 deep or 3 deep. Smaller depth means more articles per leaf

#

Each category maps to a set of articles

#

Higher categories are just aggregations of sub category articles

#

Well I. That case, let’s see what happens

#

If the category tree can return something like a few 100K articles l, then we’re in business

#

I’m just saying. If we can filter out from 10M articles down to 100K, it would be nice

meager siren
#

We still doing the TF IDF BFS search on the category tree, right?

meager siren
#

Thinking about it, the vocab is also huge…..

#

Top level category is gonna have a HUGE vocab

#

That cleanup is no big deal

#

I’m just thinking of once each category’s aggregated word frequency is done it needs to be stored and pulled into memory for searching through the tree

#

Group subcategories by number of articles then

#

If a sub category has less than 10K, group as one

#

More like throw something out and seeing what sticks

#

What about this

#

Ignore overlapping terms words at a level?

#

Nah, no go for that. If the query consists of overlapping terms, then you get nowhere

#

New thought: train a Word2Vec model on the data and use cosine similarity between the words and category labels?

#

This only falls apart at sub categories that are multi word

meager siren
#

No you’d just train 1 model on the whole corpus to get the relative vector embeddings

#

Yeah

#

And the problem is that the multi word query and categories also won’t work well

#

Maybe BERT embedding the query against the category?

#

Thought our graph was relevant categories

#

Not sure

#

Didn’t you have networkx in the code? You could print it out to see the tree

#

Kinda. I think they did more NER or classical methods

#

Ok

#

So to recap

#

Query and the graph labels (categories)

#

Why not vector embed the query and categories with BERT and traverse the graph (category tree) until the cosine similarity becomes worse

#

Think about it, it’s handling multi word texts

#

No pretraining on our end

#

How “many” are we talking? Is my question

#

If we can get shorter than 500K, then we’re golden

#

If not, then we need further refinement

#

Give mean, median, and mode

#

Most give mean when they do this

#

Just thought I’d specify

meager siren
#

Running at max depth 5 right now to see what happens. There should be a conceivable level where max depth returns "too fine grain" of search. I guess we're just finding it at this point

#

To be fair, the DFS algo in your code should be able to handle that, no?

#

And I did see in my own code (when trying to build the tree from the actual data dump) that indeed cycles were detected.

#

Then encountering cycles should not be a problem I think

#

🤷‍♂️

#

Probably 20 levels is all we'll want to explore IMO. Thinking from a high level, there should not be too many ways to categorize one article that isn't already covered by other categories

#

Have you been able to save the graph to a png instead of a pyplot?

#

I'm currently doing that for my depth 5 run right now

#

I have it set on mine

#

Is there a save to svg then?

#

OR something that wont blow out the canvas limit?

#

How long did it take for you to get a max depth 5 graph?

#

Damn

#

Well, unless chat jippity gives a source, I'm not inclined to believe it for something like this

#

Should do incremental runs. Depth of 5, 10, 25, 50 and compare results. I can leave the script on my server to run.

#

Plus, chat gpt was only talking about articles, when we're talking about the categories

inland lynx
meager siren
#

Overlap though?

#

What is F?

meager siren
#

So what we know now:
C = category
P = page (articles)
F = ?

#

Some categories overlap/crisscross and there are probably cycles in the graph

#

Got HTTP error on my run for depth = 5

#

Could not tell

#

Side note: running search on multithread for pages was not as fast as I was hoping (8 threads)
Search returned in 17078.927761 seconds
Also RAM jumped to 18GB for some reason (probably the memory leak - cache is at 47GB)

#

Let me try

#

It's the msgpack module

#

Gonna convert that loading in the search engine to rust

#

It's not, but some of the msgpack items that are serialized (ie tries) have to be handled in case we use them

#

I dont want to even deal with that. I'd have to submit a PR to the msgpack module repo

#

Running script again for depth = 2

#

Ran no problems. WOW that's a lot of text/fields

meager siren
#

Rerunning with max depth = 5 on server

meager siren
#

Thinking about it out loud, we probably wont get a better depth lower than 10/20 because of how much RAM you'd need to process the number of categories.

meager siren
#

The size of the English Wikipedia can be measured in terms of the number of articles, number of words, number of pages, and the size of the database, among other ways. As of 16 September 2024, there are 6,883,665 articles in the English Wikipedia containing over 4.6 billion words (giving an average of about 681 words per article). The total numb...

#

Only around 7M articles out of 61M possible pages.

#

C = Category
D = Draft
F = File
P = Portal
Template
TimedText
User
Wikipedia
Help
MediaWiki

meager siren
#

@inland lynx I think we need to look at parsing the damn sql file for category links

#

Running your script for depth 5 and it's still going

#

It would probably be easier to parse the sql tables

meager siren
#

Not sure. I have Max depth at 5 and it's still running. Memory usage is at around 28GB (with 38 GB cache). I feel like it should have stopped by now. Part of me wants to suspect network issues (since we're using API calls) but those would have given timeout errors.

meager siren
#

Surprisingly, yes

#

Memory is now down to 7GB with 40 GB in cache

meager siren
#

Not yet

#

I wanted to keep the run going rather than restart

#

For depth 2, I get a completion time of maybe 10 to 15 minutes (or less)

meager siren
#

I'm killing the depth 5 run. 24 hours is long enough

meager siren
#

So the categorylinks and all other sql files are MySQL format. Kinda dont want to have to install mysql on every machine so i need a way to convert it to something like SQLite

meager siren
#

I think it’s worth going back and exploring deriving the category tree from the articles. We have some category to category data so it may be worth exploring.

meager siren
#

@inland lynx I have an idea for how to build a (mostly complete) category tree. It will combine both the wikipedia API AND the categories from the article data dump.

  1. Compile a list of all unique categories mentioned in the data dump.
  2. We use the categories listed in the data dump to seed queries to the API. Each category is queried from the API, and should give us a subtree shard that we can use to build the bigger tree (we should be able to see subcategories and parent categories).
  3. We attempt to stitch as many shards as possible with the tree we get from querying the Main Topic Articles from the API
#

This comes from the fact that querying the APi is taking way too long and the data dump is somewhat incomplete when it comes to category data just from the XML containing articles\

meager siren
#

Also, I (think I) figured out why the download category script was taking so long. It was because of the saving the graph to png component that was taking forever

meager siren
#

So far I got remote connection error after some time

#

Also, current file size of category graphs
depth 2: 1.8MB
depth 3: 14MB
depth 5: 112MB

meager siren
#

depth 10: 490MB

meager siren
#

Took around 24 hours for gathering the depth 10 information. Should I try for depth 25?

meager siren
#

Number of unique categories captured by wikipedia API at depth 10: ~1.6M.
Number of unique categories captured by dump only: ~4.6M

I think I'll just run the depth 25 run and see what happens. Fingers crossed it only takes 2.5 to 3 days.

inland lynx
meager siren
#

Already kicked it off... might as well see what comes of it.

meager siren
#

I scrapped the sql dump idea because I'd have to set up a docker instance to host mySQL and do the querying. Effectively, it would be a lot of work to set up and maintain for when newer dumps came out.

meager siren
#

How do you propose we do that?

meager siren
#

Also, depth 25 category tree download failed (got HTTP 443 timeout exception - not surprised). I guess I'll just look at the depth 10 to see how many categories from the text fit.

inland lynx
meager siren
meager siren
#

Also thinking out loud here. Would it make sense to do unsupervised clustering on the articles and build a tree that way?

meager siren
#

Had another thought. We have the categories of each article in the dump. We have the category tree/map to at most a depth of 10 from the wikipedia API. What if we used BERT (since we cannot directly use word embeddings for multi word texts) to create links between the article categories from a dump and the leaf categories from the retrieved category tree? We have some sort of distance threshold so that you can have multiple mappings but there is a limit to that number. After that, prune the tree of all leaf categories that have no correlation to a category from a dump.

#

This is sort of like creating a weak connection between categories to "fill in the blanks" where the tree was unable to be populated.

meager siren
#

This may be an interesting thing, since there are around 1.5M categories in the tree but are ~4M categories in the dump (not included category to category articles).

#

So around 6M embeddings total (1 embedding per category, 4KB per embedding.... gonna be a lot of data)

meager siren
#

Agreed. I'm kinda wary of doing that too. Lot of resources look at TF-IDF to create the vectors but the vocab is gonna make the vectors large as you compare more and more articles and groupings.

#

I'm saying more like we use BERT to map categories from the tree we made from the API with the categories from articles that were not captured in the tree.

#

I had a script running take inventory of what was covered and what wasnt by the category tree (API, depth 10).

#

When you consider there are 4 million categories, having a gap coverage of 2M is not bad (especially when you consider the tree also had around 2M categories). Also, only 1M articles being completely unreachable is crazy good all things considered.

meager siren
#

Of course, that would be ideal but there is a lot of processing that would have to be done to gather those categories

#

A lot of waiting for gathering the entirety of the category tree

#

I thought about using abstracts to aid the search but the information within them can be either sparse or dense relative to the topic

#

They do but like I said, the info is sparse

#

even metadata is minimal

#

Not even that

#

From what I recall, you get title, abstract text, and maybe one other metadata bit but that is it

#

And the abstract text itself was something like 1 paragrah tops

#

For persons, it was 1 short line like date of birth and region of birth

#

So I dont think those texts will provide any meaningful assistance

inland lynx
#

i feel like actual search engines have multiple layers of condensed information before the actual article
starting from tags -> more metadata -> summary -> node link of actual data

meager siren
#

We could have an LLM like llama 3.2 1B try and summarize the articles to create our own abstracts. See if that 128K context length is true.

#

It's just a thought

#

Can we maintain category weights separate from an LLM summary?

#

More like summary based on the article text. Ignore categories

#

We keep category map but we confine our TF-IDF/BoW analysis to the summary instead of the whole text

#

Unfortunately, I dont think anyone will have an answer for that.

#

Honestly, the lack of relative control and traceability is always a good reason against using LLMs

#

yeah exactly

#

Lowering parameters like temp can help

#

I can do that, no problem

#

Should I use the instruct model?

#

we want to minimize "creative liberties" the model may take in the summary

#

I'll just generate 5 summaries and send you the original articles as well.

#

Also sharing the screenshot of my idea on "filling in the blanks" for the category tree.

#

I know we just discussed it but I'm leaving it here for notes

meager siren
#

I got something stiched together for a few articles

meager siren
#

took 50 seconds to generate for 6 articles

#

not great when you have to scale up to like 7M

#

that' the summary generating

#

there is no article that has an explicit map to that category

#

they are from the category tree

#

and make up the tree for traversal

#

yeah

#

like I said, we discussed this

#

we can prune some but not all. Prune away all non-direct reports andthe tree falls apart

#

well, I dont know what you mean by "short" but depth of 10 is pretty short to me. Depth of 2 will have no overlapping categories

#

Most immediate idea is lack of batching

#

gotcha

#

It was just for proof of concept, so no multiprocessing/batching

#

let's evaluate the summaries first

#
#

Damn, you gotta parse through it a bit. But it's not that bad

meager siren
#

This isn’t search, just summary. I can send you the code in a bit

meager siren
meager siren
#

I think with the way the LLM is prompted for the summary (and things such as the temperature of the generation), it is staying as close to the original material as possible. It does not seem to do anything with the classification categories that are attached at the bottom of the original article

meager siren
meager siren
#

I was looking back at this Medium article on the people who did a similar project (Wikipedia search engine). They operated on a much more limited scope, choosing to grab articles from the Health category and diving depth 10 down the tree. They ended up with about 650 articles and filtered it down to 450 after some additional processing around document similarity (used Jaccard Index + some arbitrary threshold for similarity and removed documents that were too similar).

It honestly got me thinking about how "easy" they got it for their project not having to deal with the vast scale of the corpus of millions of articles....

meager siren
#

It’s more of a comment of envy at the simplicity of their dataset

#

Just putting it out there as another alternative

inland lynx
# meager siren Just putting it out there as another alternative

so where are you with your summaries?
I was thinking,

  1. you generate summaries for articles in a few catgeories
  2. do a cosine similarity between query and summary vs query and article
    Q) is there any difference? If we do similarity with summary vs article itself
meager siren
inland lynx
meager siren
#

I'll generate ~100 summaries for that then

#

Is this a dense vector cos-similarity or just BoW (TF-IDF) cosine similarity?

inland lynx
meager siren
#

What is the expected gain you're hoping out of this compared to how we're trying to set up the search engine?

meager siren
#

My ask is more around how does this tackle the scaling problem we're having

inland lynx
#

We just want to be sure that summaries are enough

#

summaries will reduce data load by a ton

meager siren
#

How so?

#

Through a seemingly reduced vocab per article?

inland lynx
#

We are doing vector comparison between query and article also right

#

so yes less tokens

meager siren
#

i feel like for sparse vectors, the cost difference is minimal

inland lynx
#

we started with how much load everything was on yout servers

#

also think of 50 articles at a time right?

meager siren
#

More like threads of 8 to 16 articles at a time

#

64 if you have the memory for it

inland lynx
#

right, we want to reduce the load by introducing a condensed version of the articles

#

atleast thats what I thought we were discussing before

meager siren
#

I thought we were looking at the document filtering based on query.

inland lynx
#

hmm yes

#

so your on the same page that we need the summaries right?

meager siren
#

I'm not sure about the necessity but I already have the script to generate them.

#

So I should do the following:

  1. Generate summaries for all articles and store them
  2. Compute Word Freq of all article summaries and store them
  3. Use the summary Word Freq for...... TF-IDF inverted index?
#

And we are going to tackle part 3 via the method covered in the category tree traversal from that one white paper. Am I getting that right?

meager siren
#

Rust has me convinced. Preprocessing took 30 seconds on something that was going to take tens of thousands of hours

meager siren
#

Not sure if it’s doing what I want it to do but it was damn fast compared to Python

meager siren
#

I am going to try to minimize the number of categories used while still ensuring total document coverage

#

I abstracted that problem (least category for full doc coverage) into a state space problem

#

Downside is that generating a space is compute intensive. There are two sort operations (to give preference to certain properties)

#

And the number of missed categories and documents are around 3M and 1M respectively

#

Essentially solving the problem would have been very slow for python without some high performance library written in C or Rust because the algorithm I had implemented (while efficient) was going to take 10,000s or 1,000s of hours to compute

meager siren
#

Notes for myself:

I was able to implement code in rust that would search for the minimal number of "missing" categories to get complete document coverage for "missing" documents. I converted the problem to a state space problem and implemented a modified BFS algorithm to return a solution. Even after several optimizations and greedy strategies, the search space was so wide and expansive, OOM issues were persisting even on my server.

I am currently investigating the option of breaking down the problem into sub-problems and iteratively building the solution. I've already made some optimizations by taking the category to documents map and removing/ignoring categories that are not missing (only focus on "missed" categories) and for those values for those remaining categories, I removed the documents that are not missing as well (only focus on the "missed" documents).

Now, rather than try to solve the whole problem "fewest categories to cover all documents", what if I broke the problem down to "fewest categories to maximize document coverage". The difference is that in the sub-problem, we look at subsections of missed categories, while still keeping the missed documents list the same AND we strategically start our iterations by:

  1. sorting the missed categories list by number of documents covered in each category
  2. chunk the sorted missed categories list such that each chunk generates a solution
  3. for each chunk, we pass in the previous solution state and the use the modified BFS to solve the modified problem statement ("fewest categories to maximize document coverage")
  4. continue until either the end of the chunked list or we have found a solution to the greater problem statement ("fewest categories to cover all documents").
#

This will probably be costly in terms of compute (even with rust speeding things up) but hopefully the OOM issues will die down.

meager siren
#

@inland lynx I have my incremental script up and running. It took a bit but I got a depth 15 graph together (it should be cleaned). Funny enough, the size seems to be roughly the same as the depth 10 graph, so I have to audit the two and see what the difference is. It's over here https://github.com/dmmagdal/WikipediaEnSearch/blob/main/download_category_tree.py

GitHub

Contribute to dmmagdal/WikipediaEnSearch development by creating an account on GitHub.

inland lynx
#

depth 10

#

depth 15 - its failing a lot on the cluster / i am looking some alternate means

meager siren
#

Your original implementation was able to work at depth 10. It appears that anything above that depth causes issues of some kind

inland lynx
#

my suggestion would be to run from depth 10 to depth 11 and first verify anything is there beyond that level

meager siren
#

I’ll test it in a few days. Trying to get a depth 20 run finished right now

#

Every 100,000 nodes I have a 4 hour pause to give the API time to cool down

inland lynx
meager siren
#

Yeah. Just FYI, 4hours between every 100,000 subtrees seems to be the magic number

meager siren
#

And an inconsistent LLM. Yeah

meager siren
#

Ran mine for depth 11 and 13. Both came out to be the same size. Ran a diff on the two files, also got no changes between the two. So there is something about my script that's not saving the data properly.

#

Did you get the bugs out of yours for depth > 10 ?

inland lynx
# meager siren And an inconsistent LLM. Yeah

what?
That is not the point
We are labelling the answers and collectively weighing the correct web results
Thats how atleast google gen AI works
You can try to replicate the same by asking for a code snippet or answer to a topic which is rarely searched such as adding a ceritifcate to a request since nowadays everything is handled by some SaaS or kube

inland lynx
#

I will be running for depth 10 to 11 and see if anything is there
seems fishy
we can verify manually by going and fetching in a code snippet pages of categories with depth=10

meager siren
#

I think I found the snippet that's causing the issue on my script.

                subcategories = get_all_subcategories_bfs(
                    category, intermediate_max_depth - depth
                )

The depth that it is looking at is intermediate_max_depth - depth which would be a negative value if depth > intermediate_max_depth.

meager siren
#

I'm re-running my script again but with just depth - intermediate_max_depth for a depth 11 graph.

meager siren
#

Saw that the "for" loop wasnt running when I ran my stuff so I think I found another bug in my code

meager siren
#

4.9 GB to store category tree data (depth 10) as vectors. Despite being very abstract and in numerical format, the dense representation is costly in terms of space.

inland lynx
meager siren
inland lynx
inland lynx
meager siren
#

I plan on keeping the graph nodes pre computer in a vector DB

#

Right now trying to find the fastest way to process the remaining vectors

#

It took 2.5 hours at 48 cores to run the depth 10 graph

#

But my server is showing 13 hours to finish the remaining categories that map to documents.

#

I’m investigating batching the inputs on top of the multiprocessing now

meager siren
#

Wow, batching is a wild fast at least in terms of processing data and keeps memory usage low compared to batch size 1 + muliprocessing

#

GPU 0 is working with the 48 processors + batch size 1 job (~40 GB RAM + 16 GB VRAM usage)
GPU 1 is working with the 8 processors + batch size 32 job (~16 GB RAM + 3 GB VRAM usage)

meager siren
#

I restarted it and am running ike 16 processors + 64 batch size. It is blazing

inland lynx
meager siren
#

Not yet. I'm hitting CUDA OOM errors from the amount I'm trying to parallelize

meager siren
#

I think I over-estimated the "power" of the GPUs. I saw all of that speed increase that I was happy about was primarily under the condition that there are no additional embeddings to generate. When starting from 0 (no embeddings), the resource usage becomes more apparent. I am effectively limited to some combination that puts, at most, 256 text categories on a GPU at a time. This means 16 processors + 16 batch size, or 8 processors + 32 batch size are going to be the cap for what my machine is able to handle right now (I cannot distribute on all 3 GPUs without some extra work).

I know 8 processors + 16 batch size was able to do the first half of embeddings but the second half I may not know. This also is putting embedding times for each half at around 8.5 - 10 hours and around 13 hours respectively (again, assuming you have to generate all new embeddings).

meager siren
#

yes

#

that's the categories that map directly to documents. The ones that were not captured by the download from wikipedia

meager siren
#

I'd still have to generate all embeddings because I would have to group the categories from the second half into the graph downloaded in the first half. The pruning would have to be done post "merge".

inland lynx
meager siren
meager siren
#

I had an interesting thing come up with the embedding for the last "half" of the categories. Apparently, there is a category detected that exceeded the 512 BERT token limit and it caused an error for the program (didn't find out until after it was supposed to finish). It's probably a line that got caught up on accident from the main body of an article but interesting occurence nonetheless.

meager siren
#

@inland lynx I figured it out with the GPU. I had some stuff in the code to try and save intermediately. Additionally, I had lines in the embedding function to verify that a category wasn't already in the DB table. So the reality was that the check was probably going to cost a lot of time BUT it probably wasn't going to be 120 hours.

meager siren
#

I did the following:

  1. consolidate node processing. Combine node list from downloaded graph and category-to-documents map. Eliminate duplicates by converting it to a set.
  2. check table in vector db. Remove nodes from the list that appear in the table.
  3. pass the filtered node list to the embedding function. Also added an argument to bypass the category check in the embedding function too.

This is much more straight forward I think and should bring the process down to maybe around 6 hours if I'm right.

meager siren
#

Ok, I really over estimated on the 6 hours. It's looking more like 21/22 hours. But since everything is on one loop (for the embedding), I only have to wait once.

inland lynx
meager siren
#

Stage 1: load all nodes and clean the texts. Use set operations to get first and second half node list. Combine to one list. Iterate through and identify nodes that already exist in the table. Remove those words from the node list.

Stage 2: embed node list and update table. There are 4M nodes to embed. This stage will take 21 hours if starting from nothing.

meager siren
# inland lynx damn, 21 hours for what exactly? Can you breadown the timing for what processes

After 1 full run yesterday, I found out that there is a limit to what pyarrow is able to handle. Apparently, objects over 2GB in size will cause it to error out. So I think I’m kinda forced to do an iterative saving along the way. That kinda sucks because it will still add time (to stage 1) if the program was interrupted or were to crash and I don’t know how much that will affect performance.

inland lynx
meager siren
#

Well, I might as well write it to the DB where it's going anyway

#

Stage 1 should filter out already stored values.

meager siren
#

That said, it will probably take much longer to complete stage 1 as it goes up

meager siren
#

this is the (category, embedding) pairs

inland lynx
#

oh

inland lynx
#

do it like a hashmap
Store the embedding in its suitable form and categories in a suitable form

inland lynx
#

any thoughts ghost?

meager siren
#

It's about the same

#

lancedb is the vector DB I'm using right now

meager siren
inland lynx
meager siren
#

Yeah, I let lancedb do all the work because it also does handles stuff at a disk level, not just memory.

inland lynx
meager siren
#

If you do one big write to the db, you'll get a errors (if the data is over 2GB large)

#

I did the math

#

assume 6 kb per pair, I would have a max of 250,000 pairs per batch write to the db

inland lynx
#

We dont want to write MAX limit if it s costing efficiency/time

meager siren
inland lynx
#

there is upsert, its labelled as merge insert

#

and insert

meager siren
#

Ah I found it

inland lynx
#

yeah I hope that makes everything more faster

inland lynx
# meager siren Ah I found it

Note there are many insert types, which does the checking if embedding exists
Try to use all the native methods instead of your code

meager siren
#

Have you seen my code

inland lynx
#

no

meager siren
#

I mean, I'm not denying it's probably inefficient

#

but I wonder how this will handle the scale

meager siren
#

We have 4M entries to put into the db

inland lynx
#

Your doing everything in a batch

inland lynx
#

isnt it supposed to scale to that level?

#

I mean otherwise using that db makes not much sense tbh

meager siren
#

Right but what I'm saying is "how efficient is that upsert if it has to check by category name ?"

#

embeddings can be organized into a hierarchical tree based on similarity

#

but if you store 1M pairs into the db, and you have 3M left, how performant will that upsert be?

inland lynx
#

i mean they would be internally managing the data in a different way thats why your search is fast

meager siren
#

Right but if you are using the category name as the primary key, you dont get that benefit

#

Youd have to do something else to account for that variable in the db (though I guess they probably would be able to figure something out since they can infer/use the types defined in the schema)

inland lynx
#

why dont u just try it and see

inland lynx
#

so that embedding is stored efficiently

#

its probably just interfaced as a db which you can query out like SQL

meager siren
#

Already defined the schema

#

they have the option to infer it based on the data or define it.

meager siren
#

depending on the initialization method one uses

meager siren
#

one problem with upsert

#

if you upsert, you have a constant runtime of 21 hours since you would be updating entries already seen/stored

inland lynx
#

not upsert

#

i just mentioned upsert in the case we want a scenario where we update Catgeories, like the label changes

meager siren
#

Ah, I see

inland lynx
#

hmm actually we should have stored some page id

meager siren
#

Why?

inland lynx
#

we dont have anything to uniquely identify an article

#

it makes it hard to update existing article nodes

meager siren
#

this is just the vector db to allow for semantic search across the category graph.

inland lynx
#

yeah as time passes wiki changes

#

the Category name may change for the same article

#

anyways … carry on

meager siren
#

As it stands, the DB would only get bigger since it uses the categories found in the tree and category-to-document mapping. Writing a "clean up" function to remove categories not found in either should be easy

inland lynx
#

thats mean we end up doing pre processing multiple times for the same article

meager siren
#

Not true at all

inland lynx
#

But if the article itself changed then it makes sense

inland lynx
#

today my article was 1977 War tmrw its The history of 1977 and war of an era

meager siren
#

consider this. The data (wikipedia and the dump files) are updated periodically

inland lynx
#

yeah but we end up doing un necessary inserts when the name only changes
Do you get what Im trying to say
We identify by a category name which may change

meager siren
#

if the category changes, then the embedding changes because the embedding is tied to the category, NOT the article

#

If there is a new category, it has to be added to the db with its own embedding.

inland lynx
meager siren
#

the embedding model is tied to the category because it was generated as a function of the category string

#

embedding = model(category_string)

inland lynx
#

that is true

meager siren
#

new category -> new embedding

inland lynx
#

uh ok makes sense

meager siren
#

update the category to document map, those new categories from there will be captured and stored to the vector db

inland lynx
#

but we need to update only the category node and upsert

inland lynx
#

did u add any id or metadata

meager siren
#

no

inland lynx
#

uh ok its possible needs a little more extra effort
We query the article history and do search and see what has changed

#

or whatever stores category audit history

meager siren
#

insert categories unique to the category to documents map into the downloaded category tree via the semantic embeddings to match the categories

inland lynx
#

that wont work

#

semantically many will be similar

inland lynx
#

its there actually whn you hit the API

meager siren
inland lynx
#

lol calm down

meager siren
#

and we have no good way to download the whole graph

inland lynx
#

this is only when we want to update some categories

meager siren
#

It's interesting. The vectorDB is operating at a pace of around a little over 1 query per second when trying to search the table with SQL querying for known categories. I've tried what I can to speed things up, including mutliprocessing and even batching items into one query but still I get long estimates for this function (after there have been a large number of category, embedding pairs committed to the DB).

#

I even looked at using lancedb rust package to create a helper function in rust but ran into some issues with building and installing dependencies.

meager siren
#

Update to that, I figured a way around that. Setting a bigger batch size for that (ie 1000) works well. Downsize is that the RAM usage really goes up a lot

meager siren
#

@inland lynx Something I noticed is that I was getting really inconsistent results with my vector db. Like, why is it that, after 1 full run, I still have vectors that need to be embedded? Why is is it that each time I run a search on the db to get vectors already seen, I get different results? My program logic is right.

I'm thinking that perhaps multiprocessing could be the cause behind this issue. So I am running the script again, this time with single processor, to see how many vectors it finds in the db.

#

Could be the situation that, since lancedb is written in rust, it cannot handle multiprocessing well.

#

I use "spawn" instead of "fork" in my code but maybe it's just safer to go single-thread/single-processor

meager siren
#

Ok, so after running that part 3 times, I was getting some solid consistency.

#

The number is concerning because that means I have around 4000 unique embeddings out of the entire 13 GB stored there which allows me to deduce that there are A LOT of either duplicates or corrupt vectors in the DB

#

So that means, for the sake of stability, I should run the program on single processor (and single thread I guess) and see what happens.

inland lynx
#

if your dealing with a ton of read/write hazards, we need to handle the number of threads optimized to lance
instead of multiple threads failing and we need to again spawn a new thread for all the data lost by the previous one s

inland lynx
inland lynx
meager siren
meager siren
# inland lynx makes sense, might be a LOT of WRITE hazards

I haven't checked multi-threading yet and despite the info I saw on the FAQ saying certain kinds of multiprocessing are supported, I dont think it's work the risk right now. Downside right now to running single thread/processor is that it is taking A LONG time to run the program (create and save embeddings) even with batching/chunk insertion.

#

No more 21 hour runs.

inland lynx
meager siren
#

Yes

inland lynx
#

hmm how are you checking that