Best Vector DB for Local Use | Learn AI Together | Page 2

meager siren Aug 5, 2024, 5:59 PM

#

It should be

west pagoda Aug 5, 2024, 5:59 PM

#

So how will it be queried on words?

meager siren Aug 5, 2024, 5:59 PM

#

You have substantial overlap between groups

#

Computers overlap with history and math

#

History can be literally anything with a date

west pagoda Aug 5, 2024, 6:00 PM

#

example?

inland lynx Aug 5, 2024, 6:01 PM

#

meager siren Aug 5, 2024, 6:02 PM

#

What algo keeps this fair given the corpus?

#

Yes

west pagoda Aug 5, 2024, 6:03 PM

#

Hey @meager siren there are two sets of categrories in the xml. one is main category like computers then you get subset mainframe as an example and then you have articles with titles. You could extract those and make a trie of it?

#

yep

meager siren Aug 5, 2024, 6:04 PM

#

west pagoda Hey <@400827393415053333> there are two sets of categrories in the xml. one is m...

You know the tag where that’s at?

west pagoda Aug 5, 2024, 6:04 PM

#

I saw them when scrolling through the data but not confirmed i'll check and see

#

haven't looked too much myself ghost would know I am busy elsewhere

#

true...

meager siren Aug 5, 2024, 6:06 PM

#

I gotta go back and check too

#

Invert index words to categories and then use that to get a set of documents

west pagoda Aug 5, 2024, 6:07 PM

#

meager siren Invert index words to categories and then use that to get a set of documents

categories has sub categories if needed we generate sub sub categrories and they have documents is that the plan?

#

index words = nodes

#

I think there were sub categories as I said computer is main and mainframe computer is sub

meager siren Aug 5, 2024, 6:08 PM

#

Not confident about generating additional data but I follow the rest

west pagoda Aug 5, 2024, 6:08 PM

#

so sub sub will be generated if needed

#

I understand I'm giving example

#

mainframe itself has many sub catog I presume but I dont think I saw anything in metadata.

meager siren Aug 5, 2024, 6:10 PM

#

Not sure TBH

#

You’d have to peak the dataset that I’m using in general

west pagoda Aug 5, 2024, 6:10 PM

#

[[Category:Kansas State University academic buildings]]
[[Category:Architecture schools in Kansas]]
[[Category:Landscape architecture schools]]
[[Category:1908 establishments in Kansas]]
[[Category:University and college buildings completed in 1908]]

#

this is how it is.

meager siren Aug 5, 2024, 6:11 PM

#

It’s structured XML data with the documents inside a tag

meager siren Aug 5, 2024, 6:11 PM

#

west pagoda [[Category:Kansas State University academic buildings]] [[Category:Architecture ...

Jackpot

west pagoda Aug 5, 2024, 6:14 PM

#

abakysis?

#

I see

meager siren Aug 5, 2024, 6:15 PM

#

More blind meta meta data analysis

west pagoda Aug 5, 2024, 6:15 PM

#

I was wondering the same

meager siren Aug 5, 2024, 6:15 PM

#

Hard to visualize in a notebook with 95 GB of data

#

Looking at just the meta meta data like the category words takes a while because the script has to load so many files

#

Kinda

#

195 files but each one contains 150,000 documents

#

Yup but not by category

#

Just chunked by whatever they were when i downloaded them from these big bz compressed files

west pagoda Aug 5, 2024, 6:19 PM

#

okay so some of them have categories in title tag
<title>
Category:Mark Eitzel albums
</title>

#

words

meager siren Aug 5, 2024, 6:19 PM

#

Yeah, I’m also assuming lexicographically sorted

#

It’s not

#

But that’s how they came

west pagoda Aug 5, 2024, 6:20 PM

#

tf-idf was calculated for corpus and then you make them as nodes and add documents under the respective nodes.

#

No

#

data was downloaded

#

in chunks of files

#

one file has 150k articles

#

no idea about catog here

#

just downloaded it

#

then extract articles

#

find tf-ifd

#

idf*

#

yep.

#

in what sense?

#

Ahaan good idea.

#

That's why I asked ghost to keep a higher threshold for tf-idf values any word below it will discarded

#

that was one way of doing it

#

now catog are introduced they are better

#

nope

#

I am talking about the words

#

you only discard the words with which the trie will be made

meager siren Aug 5, 2024, 6:30 PM

#

Already built tries for the inverted index, word frequency for each document, document frequency for each word, and word IDF for all words. And again, scale of the data would make any sort of analysis difficult to verify.

#

Verify, no build

west pagoda Aug 5, 2024, 6:31 PM

#

I know hence the new method of using catogs

meager siren Aug 5, 2024, 6:32 PM

#

…. No it’s probably in the order of a few GB depending on what you’re looking at

#

Word IDF when stored in msg pack is like 4 GB

west pagoda Aug 5, 2024, 6:34 PM

#

can you limit the byte size of the stream?

#

python should have it. many other langs normally do.

#

So @meager siren limit how many bytes are read

#

and output to terminal

#

seems to be for online stuff

#

if you dont limit the read bytes it will load everything in memory

#

he said 4gb in msgpack unpacked would be much bigger in ram.

#

not by a lot

#

but still

#

what are we talking about here lol. Outputting metadata to stdout?

#

batch load would be slower then just loading the file.

#

well I am gonna leave and install some different OS now. Cause I feel like arch is getting old

#

will meet up with you guys some time later.

#

what data is being generated?

#

site unreachable.

#

hey @meager siren @inland lynx I see short description tag as well

#

which gives little know how about the article

#

here is the dataset

#

@inland lynx ^

west pagoda Aug 5, 2024, 7:18 PM

#

wrong link sorry

#

@inland lynx https://huggingface.co/datasets/dmmagdal/enwiki-2024-04-20-xml

dmmagdal/enwiki-2024-04-20-xml · Datasets at Hugging Face

meager siren Aug 5, 2024, 7:36 PM

#

It's not my dataset, it's just a copy of the data I pulled from Wikipedia (which is updated frequently) and I am just storing one copy I pulled from April.

meager siren Aug 5, 2024, 8:13 PM

#

The FULL wiki data though?

meager siren Aug 5, 2024, 8:14 PM

#

west pagoda [[Category:Kansas State University academic buildings]] [[Category:Architecture ...

Clarification here. This is specially formatted from the document/article text. It's not its own metadata tag from xml

#

English wiki, but yes

#

Create search engine to run in reasonable time (ie 5 minutes per query) to allow me

get experience building search engines at scale rather than using langchain or llama index
plug in as part of a RAG system and replicate experiments done in WikiChat paper from Meta AI using TinyLlama 1.1B instead of Llama 2 7B

#

If you can use small LLM to replicate wikichat, it emphasizes you dont even need 7B models to be effective at simple tasks like QnA.

Leave it to the nerds in Meta or OpenAI to figure out how to make LLMs "smarter"

#

I'm just trying to get it to not hallucinate with knowledge that can be stored offline

#

Not really

#

it's about hallucination reduction using a generally decent ground truth dataset

#

Well, that's why I said "generally"

#

Are you saying that the generative part of RAG is going to obscure or destroy the truth that is retrieved from the knowledge database or are you looking at a more philosophical question?

#

look, let's put the thought experiment aside

#

This project is primarily for fun and for hands on learning

#

I have yet to add in optimizations for speed such as using Rust or multithreading

#

I'm assuming single core processor for the sake of min specs (consumer level PCs/laptops)

#

Dont want the system to take up all cores and RAM

#

Systems can have limited number of cores

#

ie raspberry pi is only 4 core ARM processor

#

I could have designed this thing to just run on my server

#

but then no one else could go over and replicate my results, build upon them, or even tested the program

#

It's not but it's an example of an "offline" usecase

#

Imagine something like a robot that has a 1TB HDD with this program and data on it. You have a mobile and semi competant machine that is portable

#

They're all relatively the same documents

#

But using a hierarchical grouping system does seem like an interesting approach. Downside is that not all articles have explicit categories like mentioned before

#

On the plus side, RAM usage is pretty low ATM for search

#

#

No

#

It's a server, but like I said, single thread single core ops

#

It shouldn't be

#

It's just a mail thing on Ubuntu I set up for easy stuff'

#

small batch high variance doesn'ts speak to much confidence

#

Did you find the algo that would be good to build the category trees?

meager siren Aug 5, 2024, 11:40 PM

#

Another thing I found. Some of the data (articles/documents) are just category data. In other words, they have a “Category:Some subject” in the title and the rest of the text body is just “[[Category:Category data]]”. I gotta filter those out now too.

#

Not sure if I can use them as a way to aggregate different category labels under a few larger categories and build a category tree that way.

meager siren Aug 7, 2024, 11:47 PM

#

@west pagoda Setting and IDF filter threshold to 1.0 did seem to help cut back on some of the documents returned from the trie inverted index. Went from around 10M to 9.5M. I'm still not sure what I can be doing with the timetraveler's idea around using categories to construct the inverted index. Also perusing the links he sent above

meager siren Aug 8, 2024, 1:19 AM

#

500K articles removed does help but I agree that it’s not great

#

I’m looking through them now

#

Remind me again how I go from bag of words to inverted index with the category tree?

meager siren Aug 8, 2024, 1:45 AM

#

Right but what about queries that are not key words but rather an actual text that was just preprocessed into a bag of words?

meager siren Aug 8, 2024, 3:50 AM

#

How is he using a different TF-IDF? The formula is the same regardless.

#

Is he weighing documents based on the cosine similarity between TF-IDF vectors of the query and document?

#

If so, then this is fundamentally what my current search engine does

#

Nevermind, I just read the paper

main cargo Aug 8, 2024, 3:59 PM

#

FAISS probably for local use

meager siren Aug 8, 2024, 4:45 PM

#

Ok, let me see if I understand the general idea of what you've been proposing @inland lynx :

Create a clustering algorithm to group words with different categories
Use a category tree to organize the categories.
Our inverted index now maps words to categories
Once we have the categories/sub-categories isolated, we get the mapping of categories to document IDs
We aggregate the results into a list and then run TF-IDF on those documents and then get the top best results from there.

does that about sum it up?

#

And for the word -> Category mapping, we have some limit/threshold so that we don't accidentally pull ALL categories for words that are common across many articles and we do the same for Category -> documents?

#

Why is the tf-idf weighed at step 5?

#

I assume so

#

do a set union and use tf-idf would be my guess

#

for straight TF-IDF, we vectorize the query and cosine similarity the query against the doc vectors

#

for BM25, we still have parts of TF-IDF but BM25 does not require cosine similarity nor vectorizing the query

#

The query still influences the search because of how it still uses the query term TF and IDF

#

you want to remap query to just the terms in the (sub) categories?

#

yes

#

TBH, I have both implemented for the sake of comparing results

#

But the flow of data is expected to be similar across both

#

The flow of data is as follows: the query should map to a list of documents (via the inverted index) and we perform the algo search (TF-IDF or BM25) for which documents are relevant to the search

#

Current issue is the inverted index is returning too many docs and so we're investigating a mapping to categories to shorten the results

#

BUT we are finding that the mappings may not be effective in theory.

#

well, we already have trie index mapping

#

straight up word to document ID inverted index with tries

#

Currently working on figuring out how to classify words to categories

#

Right but how to take a word and search through a category to get the sub categories

#

unlike the trie inverted index in which I can organize things to be roughly O(1)-ish per word, I'd have to traverse each category trie O(n) for every word to get the correct categories

#

It's not about the levels

#

It's about how many categories I have to sift through in general per word

#

I can build a trie that maps word to categories/sub-categories. But that doesnt help me much more than my current inverted index

#

Just like I mapped each word to a document, I just pull the category data from the document

#

thing is that still doesnt help eliminate options as efficiently

#

particularly with words that span a large chunk of (unrelated) documents

#

It's building right now

#

the trie for word to category

#

No, it doesnt

#

tf-idf is just a sparse vector scheme for sparse vector comparisons

#

nothing has pulled actual key words from the query

#

calculation is based on the words in the query vs words in the document. Take the matching words and create a vector to perform cosine similarity

#

It doesnt

#

the inverted index doesnt prioritize words over others

#

it treats all words equally and aggregates the document IDs of each word in a set union

#

No weight to documents based on return frequency across words or weight based on word IDF

#

that's kind of the hope with the ideas for the inverted index

#

to figure out how to prioritize words in a way that is fair and wont accidentlally kill valid documents from the search

#

yes

#

we need to short list the inverted index without using calculations form there

#

the edge case is the inverted index step

#

ie the word "plot" expands to 10M articles around stories and movies and land and schemes

#

when the original query was "who started the gunpowder plot"

#

depends on the context of the query my guy

#

that's why it's a problematic word

#

right

#

but when we look at the documents returned, we get documetns for "gunpowder", "plot", "start", etc

#

hence, we get A TON of documents (like half the corpus)

#

because we do a simple aggregation of set union

#

I told you, the trie for tha tis still being built

#

and this was more of an example

#

You can run the code yourself man

#

links to repo and data are in the thread

#

It does

#

taking a raw query and putting it into bag of words, and doing a word level inverted index is returning a lot of documents

#

Well no kidding, the tree is still being built

#

Check the actual example I have in searchwiki.py

#

That's downstream

#

I have a class called rerank that handles exact passage lookup with vector search

#

But yes, we also need something simple up front

#

https://github.com/dmmagdal/WikipediaEnSearch

GitHub

GitHub - dmmagdal/WikipediaEnSearch

Contribute to dmmagdal/WikipediaEnSearch development by creating an account on GitHub.

meager siren Aug 8, 2024, 6:04 PM

#

meager siren https://github.com/dmmagdal/WikipediaEnSearch

#

Hope you got enough storage space

#

?

#

Huggingface hub for backup

#

but I have a 1TB external hdd as another backup

#

and my server storage is 1Tb large

#

Server wise?

#

56 cores total (2x Xeon CPUs), 68 GB RAM, and 1 TB storage

#

Idea is to get it to run on any consumer level system. Server offers maximum parallelization for preprocessing scripts

#

You're fine to scale it as needed

#

Just have more than 8GB regular RAM

#

Anyways, just get an idea for the number of texts in the corpus vs the number of texts are returned from the inverted index

#

and then you see the problem I'm having

#

It's hard for python to scale the TF-IDF or BM25 compute across all files quickly

meager siren Aug 9, 2024, 12:39 PM

#

@west pagoda I think I found an interesting issue with one of the modules I'm using, specifically msgpack. I noticed that there was A LOT of cache/buffer memory usage after I had finished running scripts. So I went through all the packages I used and settled on msgpack. I assumed msgpack was implemented in C given the speed and since it's a third party package that is used a lot. And I found an open issue on GitHub related to this.

https://github.com/msgpack/msgpack-python/issues/612

GitHub

Dict memory leak in Python 3.12.x / msgpack 1.0.x · Issue #612 · ms...

Hello, I think I've found a second Python 3.12 memory leak (having read the issue for the other one, and tried upgrading to msgpack 1.0.8 to fix). This issue occurs with dicts, and key uniquene...

west pagoda Aug 9, 2024, 7:25 PM

#

Ahh I see...

#

yeah I have had issues with python 3.11 and 3.12

#

so I settled with 3.10 for now. Cause there are some packages that don't work in 3.12 due to inter package dependency and one package not being upto date cause issues in another.

#

3.11 had issues with llama index not running properly

west pagoda Aug 9, 2024, 7:30 PM

#

meager siren <@1198576392300933287> I think I found an interesting issue with one of the modu...

this issue is from may... that's a long time ago. It means it's still not fixed given you are facing it.

meager siren Aug 9, 2024, 7:31 PM

#

Yup. And it’s even on 3.10 Python

#

Have you been implementing stuff in rust? I wonder if it would be better loading in that data with rust instead of the map hack pypi package

west pagoda Aug 9, 2024, 7:32 PM

#

msgpack doesnt seem to be maintained last commit is 3 months ago

#

and there are 2 PR still un-merged

#

I would say try doing this in rust if possible. do file manipulation in python like extracting metadata or articles from xml

#

etc

#

then for preprocessing you can move onto rust.

#

it'll be faster as well

meager siren Aug 9, 2024, 7:34 PM

#

I figured that the search functions can be converted to rust and called from Python

#

Things like the inverted index and computing TF-IDF or BM25 and returning the best scoring entries

west pagoda Aug 9, 2024, 7:35 PM

#

meager siren I figured that the search functions can be converted to rust and called from Pyt...

what does this mean converted to rust then called from python?

#

ahh

#

sorry

#

I get it

meager siren Aug 9, 2024, 7:36 PM

#

Yup

west pagoda Aug 9, 2024, 7:36 PM

#

so rust binding for python

meager siren Aug 9, 2024, 7:36 PM

#

Keep to Python but offload the heavy compute stuff to rust

#

Cuz Python can call rust code

#

With maturin or something like that

west pagoda Aug 9, 2024, 7:37 PM

#

got it. yeah send msgpack to rust as well. I just don't see python msgpack being a good package

#

i saw the msgs for category issue. You can't grab the category from the xml file?

meager siren Aug 9, 2024, 7:44 PM

#

west pagoda i saw the msgs for category issue. You can't grab the category from the xml file...

It’s some extra parsing. Just having the problem of figuring out how it would best work in the inverted index component

#

If I’m understanding timetraveler correctly, the bag of words component allows the current inverted index to return too broad of a search and we can’t figure out how using the categories would help the issue

west pagoda Aug 9, 2024, 7:45 PM

#

instead of words being the node the category would represent nodes and then you can add relevant documents under each category

#

although you may use sub categories within categories

west pagoda Aug 9, 2024, 7:46 PM

#

meager siren If I’m understanding timetraveler correctly, the bag of words component allows t...

categories will help better filter and group articles

#

so you most likely wont have 10 million under each category

meager siren Aug 9, 2024, 7:46 PM

#

Right but figuring out the logic for that is a bit hard for me to grasp

#

Turning words to categories isn’t too hard

#

Could just build a map for that

west pagoda Aug 9, 2024, 7:47 PM

#

and if you make more sub categories then you'll have even better grouping hence even less articles at once

meager siren Aug 9, 2024, 7:47 PM

#

west pagoda and if you make more sub categories then you'll have even better grouping hence ...

According to time travelers research, Wikipedia has shallow category trees

west pagoda Aug 9, 2024, 7:48 PM

#

meager siren According to time travelers research, Wikipedia has shallow category trees

yep they do. given that's how they structure their data.

meager siren Aug 9, 2024, 7:48 PM

#

Question is to turn categories to documents without returning too many

west pagoda Aug 9, 2024, 7:49 PM

#

have you looked into this?

https://pypi.org/project/Wikipedia-API/

PyPI

Wikipedia-API

Python Wrapper for Wikipedia

meager siren Aug 9, 2024, 7:49 PM

#

Recall that using set union for the word to doc tries inverted index returned a lot of documents

west pagoda Aug 9, 2024, 7:49 PM

#

meager siren Question is to turn categories to documents without returning too many

that would be to group them effeciently, use a cluster algo

meager siren Aug 9, 2024, 7:49 PM

#

Any suggestions?

west pagoda Aug 9, 2024, 7:50 PM

#

one minute i'm thinking on it

#

https://pypi.org/project/wikitextparser/

PyPI

wikitextparser

A simple parsing tool for MediaWiki's wikitext markup.

#

here's the parser btw

#

this is basically what I am thinking

#

https://iq.opengenus.org/text-classification-using-k-nearest-neighbors/

OpenGenus IQ: Learn Algorithms, DL, System Design

Text classification using K Nearest Neighbors (KNN)

In this article, we will demonstrate how we can use K-Nearest Neighbors (KNN) algorithm for classifying input text into different categories. We used 20 news groups for a demo.

#

it uses tf-idf method to create vectors

meager siren Aug 9, 2024, 8:08 PM

#

I’m not sure how I like the amount of ads and popups on the site

west pagoda Aug 9, 2024, 8:10 PM

#

lol.

#

I use brave so I don't see them

#

@meager siren what was the dimension size you were using for embeddings?

meager siren Aug 9, 2024, 8:21 PM

#

west pagoda <@400827393415053333> what was the dimension size you were using for embeddings?

768 for BERT vector emebeddings

west pagoda Aug 9, 2024, 8:21 PM

#

when using encoders

west pagoda Aug 9, 2024, 8:22 PM

#

meager siren 768 for BERT vector emebeddings

and the size of text?

#

were you using the whole articles or chunked?

meager siren Aug 9, 2024, 8:23 PM

#

Chunked

#

512 tokens max

west pagoda Aug 9, 2024, 8:24 PM

#

so the max size was 512 tokens and 768 sized dimensions

meager siren Aug 9, 2024, 8:25 PM

#

Yup

#

Vectors are not a compressed representation of data but an abstract one that just makes it easier to operate on instead of words

west pagoda Aug 9, 2024, 8:26 PM

#

I am reading an articles that can scale the dimension size by 32x smaller

#

so you get lots of memory benefits

#

I feel we might be able to go back to the vector db version of this project

#

lemme read further

#

okkk

#

this is it

#

@meager siren

#

42 million vector embeddings how much space would those take? take a guess

meager siren Aug 9, 2024, 8:29 PM

#

At how many dims?

west pagoda Aug 9, 2024, 8:29 PM

#

same wikipedia articles that you have.

#

1024

meager siren Aug 9, 2024, 8:29 PM

#

Hang on

#

Let me guess

west pagoda Aug 9, 2024, 8:30 PM

#

yep yep go on

meager siren Aug 9, 2024, 8:31 PM

#

4 bytes x 1024 x 42,000,000 so probably around 42 TB?

#

4bytes from one float32 value per dim

west pagoda Aug 9, 2024, 8:33 PM

#

wait

#

wrong calc

#

it's 160gb

#

are you dividing by (1024*3) for GBs?

#

ohh sorry not

#

that

meager siren Aug 9, 2024, 8:34 PM

#

Float 32 -> 4 bytes

west pagoda Aug 9, 2024, 8:34 PM

#

yep

meager siren Aug 9, 2024, 8:34 PM

#

Per number

west pagoda Aug 9, 2024, 8:35 PM

#

4*1024 = 4096

meager siren Aug 9, 2024, 8:35 PM

#

X 1024 dims is 4 KB

west pagoda Aug 9, 2024, 8:35 PM

#

4096* 42000000

#

172032000000

#

172032000000/1024 = kbs

#

168000000

#

168000000/1024 = mbs

#

164062.5mbs

#

164062 / 1024 = gbs

meager siren Aug 9, 2024, 8:36 PM

#

That doesn’t seem right

west pagoda Aug 9, 2024, 8:36 PM

#

160.217

meager siren Aug 9, 2024, 8:36 PM

#

I think you double divided the KB

west pagoda Aug 9, 2024, 8:36 PM

#

nope

#

it's correct atleast on the calculator im using

#

check on yours

#

i'll use online version just in case

meager siren Aug 9, 2024, 8:37 PM

#

I just checked

#

You’re right

west pagoda Aug 9, 2024, 8:38 PM

#

lol

#

I was like 42tb

meager siren Aug 9, 2024, 8:38 PM

#

But you are also assuming 1 embedding per article

west pagoda Aug 9, 2024, 8:38 PM

#

yep for now

#

that's not the point

#

this is

#

so 42 million takes 160gb

#

the stats here are for 250 million vector embeddings

#

which is:

#

~937gb

#

now

#

if you use binary quantization you can scale this 250 million embeddings to ~29gb

#

29

#

just 29

#

from 937

#

lol

#

rolf

#

Here is the steps:

Combining binary and scalar quantization is possible to get the best of both worlds: the extreme speed from binary embeddings and the great performance preservation of scalar embeddings with rescoring. See the demo below for a real-life implementation of this approach involving 41 million texts from Wikipedia. The pipeline for that setup is as follows:

The query is embedded using the mixedbread-ai/mxbai-embed-large-v1 SentenceTransformer model.
The query is quantized to binary using the quantize_embeddings function from the sentence-transformers library.
A binary index (41M binary embeddings; 5.2GB of memory/disk space) is searched using the quantized query for the top 40 documents.
The top 40 documents are loaded on the fly from an int8 index on disk (41M int8 embeddings; 0 bytes of memory, 47.5GB of disk space).
The top 40 documents are rescored using the float32 query and the int8 embeddings to get the top 10 documents.
The top 10 documents are sorted by score and displayed.

Through this approach, we use 5.2GB of memory and 52GB of disk space for the indices. This is considerably less than normal retrieval, requiring 200GB of memory and 200GB of disk space. Especially as you scale up even further, this will result in notable reductions in latency and costs.

meager siren Aug 9, 2024, 8:41 PM

#

we've lost the plot though.

west pagoda Aug 9, 2024, 8:41 PM

#

here's the full article:

https://huggingface.co/blog/embedding-quantization

Binary and Scalar Embedding Quantization for Significantly Faster &...

meager siren Aug 9, 2024, 8:41 PM

#

sparse vector retrieval via BM25 and TF-IDF reduces the amount of storage and compute you would have to do in comparison to full vector search

west pagoda Aug 9, 2024, 8:42 PM

#

yes and nope. since they both produce vectors you would still have float32

#

in this case vectors are int8 or binary

#

most likely int8 for results

#

so even with less vector count in tf-idf the resulting sizes are roughly equivalent but one will give better results in top k

meager siren Aug 9, 2024, 8:43 PM

#

Right but the first vectors are dependent on the length of the search query vs the dense vectors are dependend on the model output dims

#

I'm just saying, if you can use sparse vectors to isolate the correct articles of interest, the amount of compute to do dense vector retrieval is drastically reduced

#

that's why people have been looking at this two stage rerank system for RAG

west pagoda Aug 9, 2024, 8:44 PM

#

meager siren that's why people have been looking at this two stage rerank system for RAG

any link?

meager siren Aug 9, 2024, 8:46 PM

#

https://www.youtube.com/watch?v=3w_D1L0F-uE&pp=ygUNdHdvIHN0YWdlIHJhZw%3D%3D

YouTube

Coding Crash Courses

LangChain Advanced RAG - Two-Stage Retrieval with Cross Encoder (BERT)

In this Video we discuss advanced retrieval techniques from Vector Databases with Query Expansion and Two-Stage Retrieval with a Cross Encoder. We use another Reranker on top to take the famous "lost in the middle" problem into consideration.

Code: https://github.com/Coding-Crashkurse/Applied-Advanced-RAG

Timestamps:
0:00 Two-Stage Retrieval i...

▶ Play video

west pagoda Aug 9, 2024, 8:46 PM

#

meager siren I'm just saying, if you can use sparse vectors to isolate the correct articles o...

but most vector retreival are already in secs range with this approach it would be 32x faster

meager siren Aug 9, 2024, 8:47 PM

#

https://www.youtube.com/watch?v=0FQ2WmM0t3w

YouTube

James Briggs

SPLADE: the first search model to beat BM25

AI-powered search is heating up, but it's nothing new. Google, Netflix, Amazon, and many more big tech companies have all powered their search and recommendation systems with “vector search”.

In this video, we'll talk about sparse and dense vector search, their pros and cons, and how the latest sparse embedding model called SPLADE can help us e...

▶ Play video

meager siren Aug 9, 2024, 8:47 PM

#

west pagoda but most vector retreival are already in secs range with this approach it would ...

Sort of. Depends on the scale of the data

#

It's seconds for a handful of documents

#

More like hours/days for all of wikipedia

west pagoda Aug 9, 2024, 8:49 PM

#

doubt that hours part.

#

which vector db bench marks are you quoting?

#

lemme send some links

meager siren Aug 9, 2024, 8:50 PM

#

Looking at lancedb since it's using disk storage

west pagoda Aug 9, 2024, 8:54 PM

#

milvus can do 10248 queries per second with 1 million vectors

#

standalone is 7522

#

which is local

#

1.2 seconds for 1000 retreival queries among 1 million vectors 128 dimensions

#

for one query it's 0.0361 seconds

#

doubt your single user on a laptop can bunch out 1000 queries a second which will cause 1 second delay in his life

#

also I don't know if these are float32 most likely they are maybe... cannot confirm. but that means binary quantization will work faster

meager siren Aug 9, 2024, 8:59 PM

#

I mean, be my guest if you can scale this. I wanted to do a vector search benchmark but I would have to clean up the data by remove redirect and category documents from the dataset

west pagoda Aug 9, 2024, 8:59 PM

#

soka.

meager siren Aug 9, 2024, 9:01 PM

#

One thing that I love about open sourcing this project here is that I've gotten a ton of feedback from you and other devs on how to approach this problem given the scale

#

everyone can build their own version given the data I have consolidated and made available

west pagoda Aug 9, 2024, 9:02 PM

#

we can still technically go the trie based way and it still would be faster than using vectordb. However the main use of vectordb is where you dont get metadata such as category/short description etc

#

then you can't rely on much other than vector search by using vector embeddings

meager siren Aug 9, 2024, 9:05 PM

#

I mean, right tool for the right job

#

if yo can filter our the target documents the vector DB can do its job on the limited amount of data quickly

#

dont forget that vector DBs take time to process all that data into vectors

west pagoda Aug 9, 2024, 9:06 PM

#

meager siren dont forget that vector DBs take time to process all that data into vectors

yep very long

meager siren Aug 9, 2024, 9:06 PM

#

forcasted about 2 months to store it all but that was unoptimized

#

and I ran out of space very quickly so....

#

that's when I decided that the compute of vectors on the fly at inference time, as long as the number of vectors was limted, makes more sense

west pagoda Aug 9, 2024, 9:07 PM

#

yep that's why the binary quantization would help in memory.

meager siren Aug 9, 2024, 9:08 PM

#

Does transformers support binary quantization?

#

if the vectors are dynamic and limited, I could use faiss which does support binary quantized vectors

west pagoda Aug 9, 2024, 9:09 PM

#

meager siren Does transformers support binary quantization?

nope an extra step does. it'a quantiation function in the transformers library by hugging face

meager siren Aug 9, 2024, 9:09 PM

#

west pagoda https://iq.opengenus.org/text-classification-using-k-nearest-neighbors/

west pagoda Aug 9, 2024, 9:09 PM

#

and yes faiss supports quantized vectors and a few mainstream vector dbs as well

meager siren Aug 9, 2024, 9:09 PM

#

west pagoda one minute i'm thinking on it

meager siren Aug 9, 2024, 9:09 PM

#

west pagoda https://pypi.org/project/wikitextparser/

meager siren Aug 9, 2024, 9:10 PM

#

west pagoda have you looked into this? https://pypi.org/project/Wikipedia-API/

west pagoda Aug 9, 2024, 9:11 PM

#

ok then, I have to go to sleep will talk when I wake up.

west pagoda Aug 11, 2024, 8:31 AM

#

ghost can

west pagoda Aug 11, 2024, 11:10 AM

#

the vectors made by them become very large

#

that was the conversation I had with ghost

#

he can use quantized vectors to scale the memory consumption down by 32x.

#

and that will also increase the lookup speed

#

he does want a trie based method.

#

so we're going with that

#

@inland lynx ^

meager siren Aug 11, 2024, 6:36 PM

#

west pagoda Aug 11, 2024, 9:52 PM

#

quantization of vectors would not loose information to such an extent

#

quantization is just using lesser bytes to represent the value with less precision

#

so yeah loss of data but not much given the vector can be safely quantized to such a level with good retrieval.

west pagoda Aug 12, 2024, 9:03 AM

#

so what you said was transformations.

#

quantization of vectors is more like scaling.

#

so you represent the same vector on a small scale.

#

say if the vector was (10.010201, 10.103020) then we can represent that as say (10.01, 10.10) so we reduced the precision. But the information in 3 significant digits will have similar accuracy to 8 significant digits

#

the benefit is mainly memory, if memory size is reduced than the speed is automatically increased in terms of vector calculations.

#

i'll definitely look into this.

meager siren Aug 12, 2024, 2:11 PM

#

@inland lynx can you give a step by step super simplified diagram of the logic flow and mappings that are needed for the category based inverted index for the bag-of-words search (TF-IDF, BM25)? I think something like that would be tremendously helpful (at least for me) wrapping my head around what you've been trying to communicate

west pagoda Aug 14, 2024, 9:24 AM

#

I'm not being confused. I said what you explained "If you quantize the data, we loose Accuracy as we are using data in an encoded" <-This is exactly what I said but you are wrong to say encoded. It's not encoded. How is using float32 vs float64 encoding. I see that as precision scaling. If say you had a vector represented by both fp32 and fp64 and plot it on a graph you would see no difference at all in their location. however one is computationaly faster. But scale the float point even less say fp4 which is float point using 4 bit then the graph looks somwhat same but with large number of vectors it's going to be harder to distinguish since only so many values can be represented in 4 bits. That's why I said precision scaling.

#

"As its binary quant, the vector is similar to a Boolean yes or no, in terms of one characteristic" <- For this i'm not too sure. I haven't acually run the code to see if it outputs binary yes and no, even if the name implies it. But if that's the case we can simply use int8.

#

And yes int8 is a encoding since you went from decimal points to whole number. And yes you loose quite some accuracy but in terms of vector angle difference it isn't a whole lot. say you have 10 million vectors and you try to find top-k = 10 meaning top 10 closest values to some query vector X, then you can good retrieval but it doesn't scale all too well after a few 100 million vectors

meager siren Aug 14, 2024, 4:11 PM

#

Problem with using pure transformers is that you have to consider the scale of the KNN search on consumer hardware. I think we already had a discussion on the time it would take to embed all vectors as well as performance of retrieval from the vector DB. From what I recall, we pretty much said "yeah, the metrics look good but it's untested on the wikipedia data with regular hardware so.... 🤷 "

#

The goal of the project was to narrow down the list of possible valid articles to some top K with the sparse vectors like TF-IDF and then find exact passages with BERT (that should keep cost down). I'm still re-reading the paper to see how the weighted TF-IDF would work/look

west pagoda Aug 14, 2024, 4:42 PM

#

meager siren The goal of the project was to narrow down the list of possible valid articles t...

so I say

#

@meager siren

#

you are doing tf-idf in rust correct?

#

what type are you using? float32, float64 etc

#

also how many articles were left after duplicates?

meager siren Aug 14, 2024, 5:00 PM

#

Float 32 should be sufficient

#

I did use 64 bit int for counting integers

meager siren Aug 14, 2024, 5:01 PM

#

west pagoda also how many articles were left after duplicates?

Not sure. I’d have to remove the articles that are just categories mapping to other (sub) categories

west pagoda Aug 14, 2024, 5:02 PM

#

I suggest to use even lower floating point

meager siren Aug 14, 2024, 5:02 PM

#

They have lower than float32?

west pagoda Aug 14, 2024, 5:02 PM

#

or some sort of quantization that's int 8 based

#

that should be some good optimization on it's own

west pagoda Aug 14, 2024, 5:03 PM

#

meager siren They have lower than float32?

I think they might otherwise you can try using int8 they have very good ML libs

#

otherwise you can use another thing

#

they have pytorch/libtorch binding lib you could use the fp8 that pytorch provides

meager siren Aug 14, 2024, 5:03 PM

#

Not sure how int 8 helps given the TF IDF of a character is a float

west pagoda Aug 14, 2024, 5:04 PM

#

basically tensors

west pagoda Aug 14, 2024, 5:04 PM

#

meager siren Not sure how int 8 helps given the TF IDF of a character is a float

Yeah I'm not sure either.

#

but I am just assuming there might e a way of quantization on tf-idf

meager siren Aug 14, 2024, 5:05 PM

#

I think it’s good to keep memory low

#

And most vectors are pretty different by the first 6 digits after the decimal

west pagoda Aug 14, 2024, 5:06 PM

#

meager siren And most vectors are pretty different by the first 6 digits after the decimal

multiply those by 1000_000 and you got ints.

#

lol

meager siren Aug 14, 2024, 5:12 PM

#

Int8 and float32 are the smallest size dtypes in the vanilla rust library

west pagoda Aug 14, 2024, 6:44 PM

#

wow

#

@meager siren just found this

#

idk how useful it is

#

since im working on my own thing

#

check it out and tell me

#

https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#

scikit-learn

Topic extraction with Non-negative Matrix Factorization and Latent ...

This is an example of applying NMF and LatentDirichletAllocation on a corpus of documents and extract additive models of the topic structure of the corpus. The output is a plot of topics, each repr...

meager siren Aug 15, 2024, 1:52 AM

#

@inland lynx I finished the paper on utilizing the categories, keyword extraction, and TF-IDF to give a more streamlined search. It's pretty interesting though that it heavily relies on the categorical mappings. I see a lot of the logic they used for the weighing functions and it sort of makes sense (I wonder if I can swap out TF-IDF with BM25 to compare results).

meager siren Aug 16, 2024, 1:04 AM

#

I am working on building the category to category map from scratch. The only reason I'm not using the repo you suggested is because I'm not sure if key categories overlap or are missing from my current dataset. Current script takes a few hours (primarily because it reads the original xml article files and those files take time to load (more time to load than it takes to actually process the article data).

#

I can speed this up with multiprocessing but it will take some time to work out. I still have to break up the current graph to remove cycles and identify starter nodes

meager siren Aug 18, 2024, 3:42 PM

#

Ok this may actually be much more helpful than doing it by hand

#

Still have to build the TF-IDF data in such a way that data is properly propagated up the graph/tree(s)

meager siren Aug 18, 2024, 4:02 PM

#

The paper you shared earlier had used TF-IDF to determine which categories to go down. You need word freq & number of documents per category in order to determine this. The wikicat repo does organize the categories into a tree/DAG structure but you still need that metadata propagated up to all the categories (my code only has that data at the leaf nodes of the tree/DAG).

#

No, we are

#

but the metadata for that child has to get propagated up to the parents otherwise there is no way for the algorithm to know which category to go to

#

@inland lynx hang on let me break it down

#

wikicat gives us the categories structured in a tree right?

#

And the paper you shared, it used a modified TF-IDF to determine which category & sub category to traverse in the tree

#

It does not

#

wikicat does not have that metadata

#

Or at least, I'm not seeing it

#

right, that's not the problem

#

the problem is that, in order for the TF-IDF to be used, each level of the tree needs the metadata useful to compute that vector

#

we have the metadata (for TF-IDF) in the leaves

#

that metadata has to be propagated up so that the next level (parent categories) have it aggregated

#

the paper does a BFS implementation but yes

#

kinda

#

there is nothing wrong with the algo from the paper nor the structure provided from wikicat

#

We have to populate the metadata to the tree bottom up (leaf to root)

#

Yo have to

#

because it's cheaper to precompute that before inference

#

than to do this at every inference time

#

You read the same paper as I did, right?

#

kinda

#

it talked about document but we can do the same for query

#

right, that's true

#

But the paper builds that vector based on the data from the children/leaf nodes

#

that's wat I'm saying the problem is

#

right now, I'm trying to give the nodes enough data to build the vector from the child nodes

#

and store that to file

#

I understand that

#

you have to populate parent categories with vectors from the metadata. That metadata is coming from the sub-categories

#

You are assuming that wikicat already has that vector info precomputed

#

Your description doesnt line up with the resources you've provided

#

It's not that it's computationally efficient

#

It's that I'm having trouble finding a way to build the tree they did from the bottom up

#

they gloss over that heavily

#

I'm not talking about the relations

#

the graph structure is already made with wikicat

#

I'm talking populating the structure such that the traversal described in the paper is ready with minimal compute

#

stop talking about relationships. We already know the relationships have been established between categories

#

yes

#

but you have to move the data up the tree to parent nodes in order for that vector to be created

#

Information has to be moved bottom up for you to create vectors that go top down

#

aggregated vectors

#

Ok, but how do you take a query like "who ran for president in 1992 in the United states?" and go category by category down the tree?

#

That is not how the paper did things

#

If i were to do what you just suggested, you know what I'd do? I'd use word2vec to get the category label that closely matched my query bag of words.

#

My point is that strategy is tangential to the paper you had me read

#

I guess I'd just do a BFS traversal similar to the paper and use the highest cos similarity of a word in a bag of words against the category label

#

no

#

whatever

#

yeah, it's fine

meager siren Aug 19, 2024, 12:13 PM

#

Summary of inverted index schemas that have been proposed thus far:

Plain inverted index using a trie to store the information/mappings in a compressed manner. For each word, get the documents that the word is found in. Aggregate the documents in a set union operation
Plain inverted index (like above) with a filter against words with low IDF scores (given an arbitrary threshold).
Weighted inverted index. Similar to 1, each document gets a weight that is computed from the sum of the IDF values for all words that returned that document.
Category search - Word2Vec. Do a BFS style search on the category tree where you choose the branch that is "most similar" to the query based on aggregated Word2Vec scores between the query and the category for that node
Category search - modified TF-IDF. Similar to 4, except the comparison uses a modified TF-IDF for the vectors instead of Word2Vec.

meager siren Aug 20, 2024, 2:35 PM

#

west pagoda https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extractio...

meager siren Aug 20, 2024, 2:37 PM

#

west pagoda https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extractio...

Looks promising but given the scale of the data (10 - 20 million articles with a corpus vocab of around 40 million) I dont think the sklearn vectorizer is equipped to handle that much data.

west pagoda Aug 21, 2024, 8:49 PM

#

meager siren Looks promising but given the scale of the data (10 - 20 million articles with a...

possibly yes.

meager siren Aug 22, 2024, 1:38 PM

#

@inland lynx Tried using wikicat.... Downloading the data dump has not been fruitful

meager siren Aug 22, 2024, 4:57 PM

#

miyazaki_pain

#

k_lonely

meager siren Aug 27, 2024, 11:28 PM

#

@inland lynx I just had a thought. Since the original files are broken up into 195 files, each with 150,000 articles, why not have the search engine run on each file instead of building monolithic stores that contain metadata per some arbitrary chunk size?

Pros:
 - Conceptually easy to parallelize
 - No worries about load time in fileIO
 - Possible reduction in memory overhead
Cons:
 - More fileIO operations
 - Possible (likely) data redundancy

#

And before you ask, no I still have not implemented categorical hierarchy inverted index yet. It's something that's taking a long time to hack together.

meager siren Aug 28, 2024, 2:01 PM

#

More like if we have 1 file for a search engine to process, You lose the compression of storing the inverted index in a trie because you would have 1 trie per file rather than the 50 or so tries that have coverage for the entire corpus.

Alternatively, we can still keep 1 trie per starting character found in the words for a file but I'm not sure if having extra files would help or hurt.

#

I also agree that searching the entire file is not ideal. But also consider that the files are pretty small (1.6GB token max but most are 700MB). Loading will be slow but just as slow as all the other file loading that's been done so far (ie load tries, redirect files, IDF files, etc)

#

Simplicity

#

Right, but it's also been compute (RAM) intensive

#

Running on my server, we used up 24 GB RAM easily when doing search

#

that's with iterating through tries that account for the corpus

#

virtual memory readings are unreliable due to the memory leak in the python msgpack module

#

On a run my cache memory is like 40GB

#

2GB MAX, mean is more like 700MB

#

The 24 GB was also including a few other files

#

7GB can be accounted for via the doc to int /int to doc mappings

#

IDK

#

remember, the metadata files I've made are for the entire corpus

#

I figured that sharding the search engine could help keep the overhead load and help reduce the number of documents I have to read into memory

#

entire file is only needed for final retrieval

#

No, sharding based on the file splits of the data (the 195 files)

#

The only metadata file that wont be sharded would be the IDF metadata, since that is corpus wide information (overhead/file size is still around 2 GB)

meager siren Aug 28, 2024, 3:06 PM

#

It's actually kinda against TOS to use it for metadata storage

#

I wont tell if you dont though

meager siren Aug 28, 2024, 4:00 PM

#

No hard limit but I would suggest keeping it below 500GB to avoid detection

#

Also, i think that still qualifies as metadata

meager siren Aug 28, 2024, 4:57 PM

#

what's the threshold for high weights? What are the weights based on? IDF?

meager siren Aug 29, 2024, 4:19 AM

#

Sounds like word2vec

#

Or TF-IDF with some extra steps. Either way, the term extraction/isolation from the query is a bit iffy without an algo.

crude walrus Aug 29, 2024, 8:55 PM

#

hey im confused about LangChain and Weaviate support
Weaviate release v4 client api and make v3 depreciated
But

from langchain_community.vectorstores import Weaviate

Doesn't work with v4 client
i found topic from June that is not resolved yet
https://github.com/langchain-ai/langchain/issues/23088#issuecomment-2268069252

#

i just startet my RAG project 2 days ago, should i search for some workarounds or just change VectorDB?

#

update: i will test this solution:
https://github.com/langchain-ai/langchain/pull/20443#issuecomment-2080199607

meager siren Aug 29, 2024, 9:35 PM

#

Use TF-IDF to create an inverted index to filter out/reduce how many TF-IDF operations you have to do on the actual corpus

#

crude walrus Aug 29, 2024, 10:22 PM

#

😕 wait correct me if im wrong
to run RAG app with vector search i need running Embbeded model every time i perform a query, not only when prepared data to populate vector DB
🤔

meager siren Aug 29, 2024, 10:27 PM

#

crude walrus 😕 wait correct me if im wrong to run RAG app with vector search i need running...

Correct

#

because you query also needs to be converted to vector space so the vector DB can do KNN and return the closest results

crude walrus Aug 29, 2024, 10:35 PM

#

thanks 🥹 i would try to find small embedding on HF model to run locally on CPU to finish my project

meager siren Aug 30, 2024, 3:38 AM

#

It was a joke. That being we're using TF-IDF twice but one is for the inverted index to filter/narrow down our search

inland lynx Aug 30, 2024, 9:15 AM

#

white bay Aug 30, 2024, 2:12 PM

#

At this point you guys should consider making your own vectorDB and retrieving algorithms

white bay Aug 30, 2024, 3:27 PM

#

No I know idea must be good thats why this thread is still alive

meager siren Aug 30, 2024, 4:30 PM

#

That's one of the reasons why the repo is public. If someone can do it better, they have a place to start from

meager siren Sep 14, 2024, 1:55 PM

#

I switched the TF-IDF to do a file level search (in other words, you have multiple threads where each thread processes a chunk of the 195 files). Spawned 32 threads for the process and waiting on results.

I had a few bugs which got my hopes up (I thought the return speeds I was getting of 100s per query were good but there were entire sections being skipped). Still hoping that the multithreading and chunking brings the search time really far down from 2 hours per query.

#

Inverted index trie was also split by file (and then by starting character). It reduces the savings on storage but keeps the inverted index still relatively compressed.

meager siren Sep 14, 2024, 2:35 PM

#

File search

#

Figured that threads processing around 100,000 articles per thread would be more fruitful than doing the inverted index to get 10M articles and then sharing the TF-IDF process to threads

#

Preprocessed articles in each files into tries. Preprocessing skipped redirect and category articles

#

Same stuff I used to build the Tries previously

#

Still using bag of words atm

#

I am doing per article

#

I’m just allocating threads by file

#

195 files splits across 32 threads puts 7 files per thread. For each file, I get the relevant files via the inverted index trie for that file and pass that on to the TF-IDF/BM25 algo for scoring. Scores are kept in heap that has a max size.

#

Heap from each thread is aggregated and again, that has a max size.

#

That final heap is returned/passed to a function to retrieve article text.

meager siren Sep 14, 2024, 3:34 PM

#

Note to self, num-workers value for threadpool is capped by the memory alloted to the PC. Each trie folder is about 200MB on average and doc_to_words map is about 400MB on average which gives overhead of around 600MB to as much as 1GB per thread.

#

Fewer threads means less parallelization (and thus slower inference/search) BUT that is all dependent on hardware so there's not much more I can do in terms of optimization (until I do the Rust conversion of the source code).

meager siren Sep 14, 2024, 5:46 PM

#

OOM issue due to worker count

#

Each thread has to process a chunk of files in serial. Overhead for that is about 1GB memory. Across 32 workers, thats possibly 32GB memory usages.

#

No risk of deadlock since each worker gets its own unique files

#

32 workers means each thread has to do max 7 files. BUT because of the OOM I'm trying 8 workers right now (puts the count to around 25 files).

#

I'll also try with 16 and 12 workers on my macbook and see how that goes. I'm sure my server will be able to handle this just fine (maybe even handle it as multiprocessing). Kind of a downer because the whole point was that "any" PC could run this reasonably but you do have to consider the scale of the data and.... yeah, it's just a lot.

#

Not sure

#

Assuming rust provides something wild like a 10x speedup and we take the 2 hours per query (from my previous runs under serial load/no parallelization) as a benchmark.... That would be around 10 - 15 minutes per query...

#

Oh, I forgot to mention to you is that the category tree stuff failed to build on my system. Ran for so long and I think my system was rebooted after a few weeks from an update

#

No, I built from the category data in the dataset

#

Heap for my bag of words does this

#

stdlib from python

#

heap entries are the following tuple (TF-IDF score * -1, document filename + article hash)

#

Pretty much. I did everything I could to cut back redundancy and reads

#

Even put in calls to delete variables and use the garbage collection to clean them up.

#

Only memory wastage is the memory leak from the msgpack module

#

and the TF-IDF score for ordering

#

document name is going to take up some space, but because the heap has a set limit before it starts poppushing, it should be fine

#

Because I'd have to spend more time on getting category key working

#

I'll give the thing a try, but remember that 1) this is a hobby project and 2) I have other things to do so progress will be slow.

#

Since I'm away from my server, I wont be able to do any have processing for the category tree, so it would have to be a prebuilt tree that gets downloaded.

meager siren Sep 14, 2024, 6:52 PM

#

I’d starts where I got the wiki dump.

meager siren Sep 14, 2024, 7:10 PM

#

Downloading enwiki-latest-category.sql.gz-rss and enwiki-latest-categorylinks.sql.gz to try and see what I can get. Downside is that I have to use sqlite3 to explore it. I may consider swapping to sqlite instead of xml because I can interface easier on python. Not sure about the speed though.

#

Code/links?

#

Did you get the entire category trie?

inland lynx Sep 14, 2024, 8:26 PM

#

meager siren Did you get the entire category trie?

https://github.com/rx-112358/Wiki-Category-Graph/blob/main/wiki_categories.graphml

meager siren Sep 14, 2024, 8:38 PM

#

Not bad. Can parse it since it’s XML format

meager siren Sep 15, 2024, 1:30 PM

#

Well, if I can implement the TF-IDF search via the category tree, it may eliminate the trie all together

#

Also, I just finished downloading the two sql files from the wikipedia hub. the latest category sql file is about 200MB but the category links is at abou 30GB

meager siren Sep 15, 2024, 2:03 PM

#

sigh looks like I'm on the right track with this dump:
https://stackoverflow.com/questions/27279649/how-to-build-wikipedia-category-hierarchy
https://stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps
https://kodingnotes.wordpress.com/2014/12/03/parsing-wikipedia-page-hierarchy/

Stack Overflow

How to build wikipedia category hierarchy?

I'm trying to build the treegraph of wikipedia articles and its categories. What do I need to do that?

From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:
enwiki-latest-p...

Stack Overflow

Wikipedia Category Hierarchy from dumps

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can'...

koding notes

kanesee

Parsing Wikipedia Page Hierarchy

Wikipedia is a set of pages that has relationships with one another. For example, the Anarchism page has a set of categories (located at the bottom of the page). Categories group related pages toge…

#

But TBH, with the shear scale we're looking at... I dunno if I wanna do it

#

Primarily for the ability to take a dump and parse it when it updates.

#

The categories oerall wont shift from year to year but there are new ones that might be added in accordance with major historical events or scientific discoveries

#

in which case, that static graph would be out of date

#

ok then, let me look again at the repo one more time

#

Alrigh alright

#

Just give me today to look it over. I'm not by my server today and I have to drive back home later (It's a 2 hour drive from my current location)

#

I'm afraid that is gonna be compute intensive

#

since you have to work bottom up from the tree to build the TF-IDf word vectors

#

For their scale, it's not compute intensive

#

they have a budget for CDN but I imagine they have an entire server dedicated to search

#

It's interesting to ponder

#

most people entere wikipedia via Google search, not via the main wikipedia homepage

#

so most people dont search wikipedia via wikipedia search engine

#

No no

#

I’m doing a dumber search engine

#

One that can take in raw queries like goog but outputs the Wikipedia articles

#

Via that max depth parameter in your graph builder?

meager siren Sep 15, 2024, 2:41 PM

#

Mess with the max depth to get a slightly taller tree. Something either 4 deep or 3 deep. Smaller depth means more articles per leaf

#

Each category maps to a set of articles

#

Higher categories are just aggregations of sub category articles

#

Well I. That case, let’s see what happens

#

If the category tree can return something like a few 100K articles l, then we’re in business

#

I’m just saying. If we can filter out from 10M articles down to 100K, it would be nice

meager siren Sep 15, 2024, 3:07 PM

#

We still doing the TF IDF BFS search on the category tree, right?

meager siren Sep 15, 2024, 6:44 PM

#

Thinking about it, the vocab is also huge…..

#

Top level category is gonna have a HUGE vocab

#

That cleanup is no big deal

#

I’m just thinking of once each category’s aggregated word frequency is done it needs to be stored and pulled into memory for searching through the tree

#

Group subcategories by number of articles then

#

If a sub category has less than 10K, group as one

#

More like throw something out and seeing what sticks

#

What about this

#

Ignore overlapping terms words at a level?

#

Nah, no go for that. If the query consists of overlapping terms, then you get nowhere

#

New thought: train a Word2Vec model on the data and use cosine similarity between the words and category labels?

#

This only falls apart at sub categories that are multi word

meager siren Sep 15, 2024, 7:25 PM

#

No you’d just train 1 model on the whole corpus to get the relative vector embeddings

#

Yeah

#

And the problem is that the multi word query and categories also won’t work well

#

Maybe BERT embedding the query against the category?

#

Thought our graph was relevant categories

#

Not sure

#

Didn’t you have networkx in the code? You could print it out to see the tree

#

Kinda. I think they did more NER or classical methods

#

Ok

#

So to recap

#

Query and the graph labels (categories)

#

Why not vector embed the query and categories with BERT and traverse the graph (category tree) until the cosine similarity becomes worse

#

Think about it, it’s handling multi word texts

#

No pretraining on our end

#

How “many” are we talking? Is my question

#

If we can get shorter than 500K, then we’re golden

#

If not, then we need further refinement

#

Give mean, median, and mode

#

Most give mean when they do this

#

Just thought I’d specify

meager siren Sep 16, 2024, 3:50 PM

#

Running at max depth 5 right now to see what happens. There should be a conceivable level where max depth returns "too fine grain" of search. I guess we're just finding it at this point

#

To be fair, the DFS algo in your code should be able to handle that, no?

#

And I did see in my own code (when trying to build the tree from the actual data dump) that indeed cycles were detected.

#

Then encountering cycles should not be a problem I think

#

🤷‍♂️

#

Probably 20 levels is all we'll want to explore IMO. Thinking from a high level, there should not be too many ways to categorize one article that isn't already covered by other categories

#

Have you been able to save the graph to a png instead of a pyplot?

#

I'm currently doing that for my depth 5 run right now

#

I have it set on mine

#

Is there a save to svg then?

#

OR something that wont blow out the canvas limit?

#

How long did it take for you to get a max depth 5 graph?

#

Damn

#

Well, unless chat jippity gives a source, I'm not inclined to believe it for something like this

#

Should do incremental runs. Depth of 5, 10, 25, 50 and compare results. I can leave the script on my server to run.

#

Plus, chat gpt was only talking about articles, when we're talking about the categories

inland lynx Sep 16, 2024, 4:38 PM

#

meager siren Sep 16, 2024, 4:42 PM

#

Overlap though?

#

What is F?

meager siren Sep 16, 2024, 5:02 PM

#

So what we know now:
C = category
P = page (articles)
F = ?

#

Some categories overlap/crisscross and there are probably cycles in the graph

#

Got HTTP error on my run for depth = 5

#

#

📎 message.txt

#

Could not tell

#

Side note: running search on multithread for pages was not as fast as I was hoping (8 threads)
Search returned in 17078.927761 seconds
Also RAM jumped to 18GB for some reason (probably the memory leak - cache is at 47GB)

#

Let me try

#

It's the msgpack module

#

Gonna convert that loading in the search engine to rust

#

It's not, but some of the msgpack items that are serialized (ie tries) have to be handled in case we use them

#

I dont want to even deal with that. I'd have to submit a PR to the msgpack module repo

#

Running script again for depth = 2

#

Ran no problems. WOW that's a lot of text/fields

meager siren Sep 16, 2024, 5:46 PM

#

Rerunning with max depth = 5 on server

meager siren Sep 16, 2024, 6:53 PM

#

Thinking about it out loud, we probably wont get a better depth lower than 10/20 because of how much RAM you'd need to process the number of categories.

meager siren Sep 16, 2024, 9:15 PM

#

Breakdown of the data: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Wikipedia:Size of Wikipedia

The size of the English Wikipedia can be measured in terms of the number of articles, number of words, number of pages, and the size of the database, among other ways. As of 16 September 2024, there are 6,883,665 articles in the English Wikipedia containing over 4.6 billion words (giving an average of about 681 words per article). The total numb...

#

Only around 7M articles out of 61M possible pages.

#

C = Category
D = Draft
F = File
P = Portal
Template
TimedText
User
Wikipedia
Help
MediaWiki

meager siren Sep 17, 2024, 3:31 AM

#

@inland lynx I think we need to look at parsing the damn sql file for category links

#

Running your script for depth 5 and it's still going

#

It would probably be easier to parse the sql tables

meager siren Sep 17, 2024, 12:50 PM

#

Not sure. I have Max depth at 5 and it's still running. Memory usage is at around 28GB (with 38 GB cache). I feel like it should have stopped by now. Part of me wants to suspect network issues (since we're using API calls) but those would have given timeout errors.

meager siren Sep 18, 2024, 2:34 AM

#

Surprisingly, yes

#

Memory is now down to 7GB with 40 GB in cache

#

Screenshot_2024-09-17_at_10.35.05_PM.png

meager siren Sep 18, 2024, 2:58 AM

#

Not yet

#

I wanted to keep the run going rather than restart

#

For depth 2, I get a completion time of maybe 10 to 15 minutes (or less)

meager siren Sep 18, 2024, 1:54 PM

#

I'm killing the depth 5 run. 24 hours is long enough

meager siren Sep 18, 2024, 3:16 PM

#

So the categorylinks and all other sql files are MySQL format. Kinda dont want to have to install mysql on every machine so i need a way to convert it to something like SQLite

meager siren Sep 19, 2024, 2:38 AM

#

I think it’s worth going back and exploring deriving the category tree from the articles. We have some category to category data so it may be worth exploring.

meager siren Sep 19, 2024, 2:57 PM

#

@inland lynx I have an idea for how to build a (mostly complete) category tree. It will combine both the wikipedia API AND the categories from the article data dump.

Compile a list of all unique categories mentioned in the data dump.
We use the categories listed in the data dump to seed queries to the API. Each category is queried from the API, and should give us a subtree shard that we can use to build the bigger tree (we should be able to see subcategories and parent categories).
We attempt to stitch as many shards as possible with the tree we get from querying the Main Topic Articles from the API

#

This comes from the fact that querying the APi is taking way too long and the data dump is somewhat incomplete when it comes to category data just from the XML containing articles\

meager siren Sep 19, 2024, 5:53 PM

#

Also, I (think I) figured out why the download category script was taking so long. It was because of the saving the graph to png component that was taking forever

meager siren Sep 20, 2024, 2:59 PM

#

meager siren Also, I (think I) figured out why the download category script was taking so lon...

You dont think so @inland lynx ? I managed to get the depth = 5 to run in under a day on my machine. I'm currently working on depth = 10 to see how long that takes.

#

So far I got remote connection error after some time

#

Also, current file size of category graphs
depth 2: 1.8MB
depth 3: 14MB
depth 5: 112MB

meager siren Sep 21, 2024, 10:32 PM

#

depth 10: 490MB

meager siren Sep 22, 2024, 12:35 AM

#

Took around 24 hours for gathering the depth 10 information. Should I try for depth 25?

meager siren Sep 22, 2024, 6:04 PM

#

Number of unique categories captured by wikipedia API at depth 10: ~1.6M.
Number of unique categories captured by dump only: ~4.6M

I think I'll just run the depth 25 run and see what happens. Fingers crossed it only takes 2.5 to 3 days.

inland lynx Sep 23, 2024, 9:00 AM

#

meager siren You dont think so <@456226577798135808> ? I managed to get the depth = 5 to run ...

im running the code w/o drawing the graph and it takes equally long I think the time is spent on loading the entire page
page = something
This part needs to be optimized

meager siren Sep 23, 2024, 2:46 PM

#

inland lynx im running the code w/o drawing the graph and it takes equally long ...

I thought it was only grasping links to other categories, not (article) pages?

#

Already kicked it off... might as well see what comes of it.

meager siren Sep 23, 2024, 3:57 PM

#

I scrapped the sql dump idea because I'd have to set up a docker instance to host mySQL and do the querying. Effectively, it would be a lot of work to set up and maintain for when newer dumps came out.

meager siren Sep 24, 2024, 12:45 PM

#

How do you propose we do that?

meager siren Sep 24, 2024, 1:44 PM

#

Also, depth 25 category tree download failed (got HTTP 443 timeout exception - not surprised). I guess I'll just look at the depth 10 to see how many categories from the text fit.

inland lynx Sep 25, 2024, 5:01 PM

#

meager siren How do you propose we do that?

need to use a different wiki api (or) figure out how to create a new WIKI APi which is more efficient

meager siren Sep 25, 2024, 5:59 PM

#

inland lynx need to use a different wiki api (or) figure out how to create a new WIKI APi wh...

Would it be too much for me to ask you to explore this part?

meager siren Sep 26, 2024, 9:48 PM

#

Also thinking out loud here. Would it make sense to do unsupervised clustering on the articles and build a tree that way?

meager siren Sep 27, 2024, 3:33 AM

#

Had another thought. We have the categories of each article in the dump. We have the category tree/map to at most a depth of 10 from the wikipedia API. What if we used BERT (since we cannot directly use word embeddings for multi word texts) to create links between the article categories from a dump and the leaf categories from the retrieved category tree? We have some sort of distance threshold so that you can have multiple mappings but there is a limit to that number. After that, prune the tree of all leaf categories that have no correlation to a category from a dump.

#

This is sort of like creating a weak connection between categories to "fill in the blanks" where the tree was unable to be populated.

meager siren Sep 27, 2024, 4:15 AM

#

This may be an interesting thing, since there are around 1.5M categories in the tree but are ~4M categories in the dump (not included category to category articles).

#

So around 6M embeddings total (1 embedding per category, 4KB per embedding.... gonna be a lot of data)

meager siren Sep 27, 2024, 12:37 PM

#

Agreed. I'm kinda wary of doing that too. Lot of resources look at TF-IDF to create the vectors but the vocab is gonna make the vectors large as you compare more and more articles and groupings.

#

I'm saying more like we use BERT to map categories from the tree we made from the API with the categories from articles that were not captured in the tree.

#

I had a script running take inventory of what was covered and what wasnt by the category tree (API, depth 10).

#

#

When you consider there are 4 million categories, having a gap coverage of 2M is not bad (especially when you consider the tree also had around 2M categories). Also, only 1M articles being completely unreachable is crazy good all things considered.

meager siren Sep 27, 2024, 3:40 PM

#

Of course, that would be ideal but there is a lot of processing that would have to be done to gather those categories

#

A lot of waiting for gathering the entirety of the category tree

#

I thought about using abstracts to aid the search but the information within them can be either sparse or dense relative to the topic

#

They do but like I said, the info is sparse

#

even metadata is minimal

#

Not even that

#

From what I recall, you get title, abstract text, and maybe one other metadata bit but that is it

#

And the abstract text itself was something like 1 paragrah tops

#

For persons, it was 1 short line like date of birth and region of birth

#

So I dont think those texts will provide any meaningful assistance

inland lynx Sep 27, 2024, 3:47 PM

#

i feel like actual search engines have multiple layers of condensed information before the actual article
starting from tags -> more metadata -> summary -> node link of actual data

meager siren Sep 27, 2024, 3:48 PM

#

We could have an LLM like llama 3.2 1B try and summarize the articles to create our own abstracts. See if that 128K context length is true.

#

It's just a thought

#

Can we maintain category weights separate from an LLM summary?

#

More like summary based on the article text. Ignore categories

#

We keep category map but we confine our TF-IDF/BoW analysis to the summary instead of the whole text

#

Unfortunately, I dont think anyone will have an answer for that.

#

Honestly, the lack of relative control and traceability is always a good reason against using LLMs

#

yeah exactly

#

Lowering parameters like temp can help

#

I can do that, no problem

#

Should I use the instruct model?

#

we want to minimize "creative liberties" the model may take in the summary

#

I'll just generate 5 summaries and send you the original articles as well.

#

Also sharing the screenshot of my idea on "filling in the blanks" for the category tree.

#

I know we just discussed it but I'm leaving it here for notes

#

Screenshot_2024-09-27_at_12.01.38_PM.png

meager siren Sep 27, 2024, 5:59 PM

#

I got something stiched together for a few articles

meager siren Sep 27, 2024, 6:15 PM

#

took 50 seconds to generate for 6 articles

#

not great when you have to scale up to like 7M

#

that' the summary generating

#

there is no article that has an explicit map to that category

#

they are from the category tree

#

and make up the tree for traversal

#

yeah

#

like I said, we discussed this

#

we can prune some but not all. Prune away all non-direct reports andthe tree falls apart

#

well, I dont know what you mean by "short" but depth of 10 is pretty short to me. Depth of 2 will have no overlapping categories

#

Most immediate idea is lack of batching

#

gotcha

#

It was just for proof of concept, so no multiprocessing/batching

#

let's evaluate the summaries first

#

📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xml9mhbv5bb1ouszwwbe7tnk4cmndmhtcx_summary.txt 📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xml77dc02ek2x5qlza7dplrjdgi4smmbso_summary.txt 📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xmlalpwsbm4z7vc9d1f18qe92c2sf5tpqz_summary.txt 📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xmlf792aqy9mdauyvzbihq89rg2n05xa72_summary.txt 📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xmlfbc9mind2xijdhy7hzwm32ne7larvfz_summary.txt 📎 pages-articles-multistream10_288ec473bfbc27d7686cfee3396c7ea4db31c2f7b2b967778e1e3a21c08a583b.xmlqfvznvvpcm96bch35yagim92oetkgx2_summary.txt

#

Damn, you gotta parse through it a bit. But it's not that bad

meager siren Sep 27, 2024, 10:13 PM

#

This isn’t search, just summary. I can send you the code in a bit

meager siren Sep 27, 2024, 11:33 PM

#

📎 summarize_articles.py

meager siren Sep 28, 2024, 2:35 PM

#

I think with the way the LLM is prompted for the summary (and things such as the temperature of the generation), it is staying as close to the original material as possible. It does not seem to do anything with the classification categories that are attached at the bottom of the original article

meager siren Sep 30, 2024, 9:34 PM

#

Saw this article (Use freedium to access it). What if we trained a document embedding model? https://freedium.cfd/https://medium.com/@evertongomede/doc2vec-understanding-document-embeddings-for-natural-language-processing-ba244e55eba3

Doc2Vec: Understanding Document Embeddings for Natural Language Pro...

Introduction

meager siren Sep 30, 2024, 11:25 PM

#

I was looking back at this Medium article on the people who did a similar project (Wikipedia search engine). They operated on a much more limited scope, choosing to grab articles from the Health category and diving depth 10 down the tree. They ended up with about 650 articles and filtered it down to 450 after some additional processing around document similarity (used Jaccard Index + some arbitrary threshold for similarity and removed documents that were too similar).

It honestly got me thinking about how "easy" they got it for their project not having to deal with the vast scale of the corpus of millions of articles....

#

https://medium.com/@akleber/developing-a-search-engine-framework-for-wikipedia-articles-81fcbd95a928

Medium

Developing a Search Engine Framework for Wikipedia Articles

André Kleber, Trevor McCalmont, Julien W

meager siren Oct 1, 2024, 11:58 AM

#

It’s more of a comment of envy at the simplicity of their dataset

#

Just putting it out there as another alternative

inland lynx Oct 1, 2024, 12:26 PM

#

meager siren Just putting it out there as another alternative

so where are you with your summaries?
I was thinking,

you generate summaries for articles in a few catgeories
do a cosine similarity between query and summary vs query and article
Q) is there any difference? If we do similarity with summary vs article itself

meager siren Oct 1, 2024, 12:27 PM

#

inland lynx so where are you with your summaries? I was thinking, 1) you generate summaries ...

Probably will be a slight but noticeable difference

inland lynx Oct 1, 2024, 12:28 PM

#

meager siren Probably will be a slight but noticeable difference

I want to know does it affect the query result
That’s all. It should not be that summary returns different set of x documents

meager siren Oct 1, 2024, 12:29 PM

#

I'll generate ~100 summaries for that then

#

Is this a dense vector cos-similarity or just BoW (TF-IDF) cosine similarity?

inland lynx Oct 1, 2024, 12:32 PM

#

meager siren Is this a dense vector cos-similarity or just BoW (TF-IDF) cosine similarity?

tf idf
If its possible, can you do both?
Want to know how much difference in using different vector similarity also

meager siren Oct 1, 2024, 12:33 PM

#

What is the expected gain you're hoping out of this compared to how we're trying to set up the search engine?

meager siren Oct 1, 2024, 1:01 PM

#

My ask is more around how does this tackle the scaling problem we're having

inland lynx Oct 1, 2024, 1:02 PM

#

meager siren My ask is more around how does this tackle the scaling problem we're having

oh right

#

We just want to be sure that summaries are enough

#

summaries will reduce data load by a ton

meager siren Oct 1, 2024, 1:02 PM

#

How so?

#

Through a seemingly reduced vocab per article?

inland lynx Oct 1, 2024, 1:03 PM

#

We are doing vector comparison between query and article also right

#

so yes less tokens

meager siren Oct 1, 2024, 1:03 PM

#

i feel like for sparse vectors, the cost difference is minimal

inland lynx Oct 1, 2024, 1:04 PM

#

meager siren i feel like for sparse vectors, the cost difference is minimal

hmm but you dont havr the processing memory right

#

we started with how much load everything was on yout servers

#

also think of 50 articles at a time right?

meager siren Oct 1, 2024, 1:05 PM

#

More like threads of 8 to 16 articles at a time

#

64 if you have the memory for it

inland lynx Oct 1, 2024, 1:06 PM

#

right, we want to reduce the load by introducing a condensed version of the articles

#

atleast thats what I thought we were discussing before

meager siren Oct 1, 2024, 1:07 PM

#

I thought we were looking at the document filtering based on query.

inland lynx Oct 1, 2024, 1:07 PM

#

hmm yes

#

so your on the same page that we need the summaries right?

meager siren Oct 1, 2024, 1:08 PM

#

I'm not sure about the necessity but I already have the script to generate them.

#

So I should do the following:

Generate summaries for all articles and store them
Compute Word Freq of all article summaries and store them
Use the summary Word Freq for...... TF-IDF inverted index?

#

And we are going to tackle part 3 via the method covered in the category tree traversal from that one white paper. Am I getting that right?

inland lynx Oct 1, 2024, 1:19 PM

#

meager siren And we are going to tackle part 3 via the method covered in the category tree tr...

which paper? Sorry to ask

meager siren Oct 1, 2024, 1:25 PM

#

inland lynx which paper? Sorry to ask

This one https://www.researchgate.net/publication/283227199_A_method_for_automated_document_classification_using_Wikipedia-derived_weighted_keywords

meager siren Oct 5, 2024, 1:15 AM

#

Rust has me convinced. Preprocessing took 30 seconds on something that was going to take tens of thousands of hours

meager siren Oct 5, 2024, 1:36 AM

#

Not sure if it’s doing what I want it to do but it was damn fast compared to Python

inland lynx Oct 5, 2024, 2:49 PM

#

meager siren Rust has me convinced. Preprocessing took 30 seconds on something that was going...

what did you do?

meager siren Oct 5, 2024, 3:04 PM

#

inland lynx what did you do?

Remember how I was talking about filling in the gaps with the remaining categories by loosely using BERT embeddings?

#

I am going to try to minimize the number of categories used while still ensuring total document coverage

#

I abstracted that problem (least category for full doc coverage) into a state space problem

#

Downside is that generating a space is compute intensive. There are two sort operations (to give preference to certain properties)

#

And the number of missed categories and documents are around 3M and 1M respectively

#

Essentially solving the problem would have been very slow for python without some high performance library written in C or Rust because the algorithm I had implemented (while efficient) was going to take 10,000s or 1,000s of hours to compute

meager siren Oct 8, 2024, 3:36 AM

#

Notes for myself:

I was able to implement code in rust that would search for the minimal number of "missing" categories to get complete document coverage for "missing" documents. I converted the problem to a state space problem and implemented a modified BFS algorithm to return a solution. Even after several optimizations and greedy strategies, the search space was so wide and expansive, OOM issues were persisting even on my server.

I am currently investigating the option of breaking down the problem into sub-problems and iteratively building the solution. I've already made some optimizations by taking the category to documents map and removing/ignoring categories that are not missing (only focus on "missed" categories) and for those values for those remaining categories, I removed the documents that are not missing as well (only focus on the "missed" documents).

Now, rather than try to solve the whole problem "fewest categories to cover all documents", what if I broke the problem down to "fewest categories to maximize document coverage". The difference is that in the sub-problem, we look at subsections of missed categories, while still keeping the missed documents list the same AND we strategically start our iterations by:

sorting the missed categories list by number of documents covered in each category
chunk the sorted missed categories list such that each chunk generates a solution
for each chunk, we pass in the previous solution state and the use the modified BFS to solve the modified problem statement ("fewest categories to maximize document coverage")
continue until either the end of the chunked list or we have found a solution to the greater problem statement ("fewest categories to cover all documents").

#

This will probably be costly in terms of compute (even with rust speeding things up) but hopefully the OOM issues will die down.

meager siren Oct 23, 2024, 4:19 PM

#

@inland lynx I have my incremental script up and running. It took a bit but I got a depth 15 graph together (it should be cleaned). Funny enough, the size seems to be roughly the same as the depth 10 graph, so I have to audit the two and see what the difference is. It's over here https://github.com/dmmagdal/WikipediaEnSearch/blob/main/download_category_tree.py

GitHub

WikipediaEnSearch/download_category_tree.py at main · dmmagdal/Wiki...

Contribute to dmmagdal/WikipediaEnSearch development by creating an account on GitHub.

inland lynx Oct 25, 2024, 4:06 PM

#

meager siren <@456226577798135808> I have my incremental script up and running. It took a bit...

if its the same size, I doubt it ran beyond depth = 10
https://github.com/rx-112358/Wiki-Category-Graph/blob/main/createGraph_append
use this its easier

meager siren Oct 25, 2024, 4:08 PM

#

inland lynx if its the same size, I doubt it ran beyond depth = 10 https://github.com/rx-112...

How far did you run this?

inland lynx Oct 25, 2024, 4:11 PM

#

depth 10

#

depth 15 - its failing a lot on the cluster / i am looking some alternate means

meager siren Oct 27, 2024, 1:26 AM

#

Your original implementation was able to work at depth 10. It appears that anything above that depth causes issues of some kind

inland lynx Oct 27, 2024, 5:55 AM

#

my suggestion would be to run from depth 10 to depth 11 and first verify anything is there beyond that level

meager siren Oct 27, 2024, 3:36 PM

#

I’ll test it in a few days. Trying to get a depth 20 run finished right now

#

Every 100,000 nodes I have a 4 hour pause to give the API time to cool down

inland lynx Oct 27, 2024, 7:23 PM

#

meager siren Every 100,000 nodes I have a 4 hour pause to give the API time to cool down

so there are some properties which someone can set at the server side to stop x number of requests coming in a short span of time
i think better is if you add a sleep in between the parsing

meager siren Oct 27, 2024, 9:05 PM

#

Yeah. Just FYI, 4hours between every 100,000 subtrees seems to be the magic number

meager siren Oct 28, 2024, 12:15 PM

#

And an inconsistent LLM. Yeah

meager siren Oct 28, 2024, 1:24 PM

#

Ran mine for depth 11 and 13. Both came out to be the same size. Ran a diff on the two files, also got no changes between the two. So there is something about my script that's not saving the data properly.

#

Did you get the bugs out of yours for depth > 10 ?

inland lynx Oct 28, 2024, 8:46 PM

#

meager siren And an inconsistent LLM. Yeah

what?
That is not the point
We are labelling the answers and collectively weighing the correct web results
Thats how atleast google gen AI works
You can try to replicate the same by asking for a code snippet or answer to a topic which is rarely searched such as adding a ceritifcate to a request since nowadays everything is handled by some SaaS or kube

inland lynx Oct 28, 2024, 8:47 PM

#

meager siren Ran mine for depth 11 and 13. Both came out to be the same size. Ran a `diff` on...

so the inference is there is nothing beyond depth 10?

#

I will be running for depth 10 to 11 and see if anything is there
seems fishy
we can verify manually by going and fetching in a code snippet pages of categories with depth=10

meager siren Oct 29, 2024, 2:38 PM

#

I think I found the snippet that's causing the issue on my script.

                subcategories = get_all_subcategories_bfs(
                    category, intermediate_max_depth - depth
                )

The depth that it is looking at is intermediate_max_depth - depth which would be a negative value if depth > intermediate_max_depth.

meager siren Oct 29, 2024, 3:00 PM

#

I'm re-running my script again but with just depth - intermediate_max_depth for a depth 11 graph.

meager siren Oct 30, 2024, 6:10 PM

#

Saw that the "for" loop wasnt running when I ran my stuff so I think I found another bug in my code

meager siren Oct 30, 2024, 10:54 PM

#

4.9 GB to store category tree data (depth 10) as vectors. Despite being very abstract and in numerical format, the dense representation is costly in terms of space.

inland lynx Nov 1, 2024, 6:47 AM

#

meager siren 4.9 GB to store category tree data (depth 10) as vectors. Despite being very abs...

you can try some condensed version of each word
I have noticed in some Microsoft services they dont store words directly, they store it as a long int
This saves a lot of space

meager siren Nov 1, 2024, 3:44 PM

#

inland lynx you can try some condensed version of each word I have noticed in some Microsoft...

No, the issue isnt the word storage (categories are like sub 50 characters). It's just that the vector embeddings are going to be heavy because you are dealing with around 1024 float32 (while each float 32 is 4 bytes) puts you at around 4KB per embedding (multiply by around 1.6M embeddings or 1.6 categories in the depth 10 graph).

inland lynx Nov 1, 2024, 7:22 PM

#

meager siren No, the issue isnt the word storage (categories are like sub 50 characters). It'...

Its the same design for the vector storage
Ideally the vectors should be in such a way that 1922,1922 War are not separate but structured in a way words with same prefix/suffix are related and not stored as seperate vectors

inland lynx Nov 3, 2024, 3:16 PM

#

meager siren No, the issue isnt the word storage (categories are like sub 50 characters). It'...

we can explore run time vectorization, keeping only some level nodes as pre-computed vectors
we discussed about a possibility that we match by metadata till last level and in last level only use the comparison

meager siren Nov 3, 2024, 3:32 PM

#

I plan on keeping the graph nodes pre computer in a vector DB

#

Right now trying to find the fastest way to process the remaining vectors

#

It took 2.5 hours at 48 cores to run the depth 10 graph

#

But my server is showing 13 hours to finish the remaining categories that map to documents.

#

I’m investigating batching the inputs on top of the multiprocessing now

meager siren Nov 3, 2024, 4:25 PM

#

Wow, batching is a wild fast at least in terms of processing data and keeps memory usage low compared to batch size 1 + muliprocessing

#

Screenshot_2024-11-03_at_11.26.21_AM.png

Screenshot_2024-11-03_at_11.26.30_AM.png

Screenshot_2024-11-03_at_11.26.38_AM.png

#

GPU 0 is working with the 48 processors + batch size 1 job (~40 GB RAM + 16 GB VRAM usage)
GPU 1 is working with the 8 processors + batch size 32 job (~16 GB RAM + 3 GB VRAM usage)

#

So batching really helps (though I found something from Huggingface that shows that there is a limit https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipeline-batching)

Pipelines

meager siren Nov 3, 2024, 7:03 PM

#

I restarted it and am running ike 16 processors + 64 batch size. It is blazing

inland lynx Nov 4, 2024, 8:23 PM

#

meager siren Wow, batching is a wild fast at least in terms of processing data and keeps memo...

P.S remember RDD
Thats how data should be processed - utilize as many parallel compute

meager siren Nov 4, 2024, 8:35 PM

#

Not yet. I'm hitting CUDA OOM errors from the amount I'm trying to parallelize

meager siren Nov 5, 2024, 3:59 PM

#

I think I over-estimated the "power" of the GPUs. I saw all of that speed increase that I was happy about was primarily under the condition that there are no additional embeddings to generate. When starting from 0 (no embeddings), the resource usage becomes more apparent. I am effectively limited to some combination that puts, at most, 256 text categories on a GPU at a time. This means 16 processors + 16 batch size, or 8 processors + 32 batch size are going to be the cap for what my machine is able to handle right now (I cannot distribute on all 3 GPUs without some extra work).

I know 8 processors + 16 batch size was able to do the first half of embeddings but the second half I may not know. This also is putting embedding times for each half at around 8.5 - 10 hours and around 13 hours respectively (again, assuming you have to generate all new embeddings).

meager siren Nov 5, 2024, 5:18 PM

#

yes

#

that's the categories that map directly to documents. The ones that were not captured by the download from wikipedia

meager siren Nov 5, 2024, 7:11 PM

#

I'd still have to generate all embeddings because I would have to group the categories from the second half into the graph downloaded in the first half. The pruning would have to be done post "merge".

inland lynx Nov 5, 2024, 7:19 PM

#

meager siren I'd still have to generate all embeddings because I would have to group the cate...

Is there a way to generate the embeddings for a partial tree?
That is what I am trying to say

meager siren Nov 5, 2024, 10:32 PM

#

inland lynx Is there a way to generate the embeddings for a partial tree? That is what I am ...

Not without doing more extensive computing (ie recall my efforts to trim the number of categories that map to documents while still maintaining 100% coverage. That effort was computationally intractable). Embedding both "halves" is just faster.

meager siren Nov 6, 2024, 2:44 PM

#

I had an interesting thing come up with the embedding for the last "half" of the categories. Apparently, there is a category detected that exceeded the 512 BERT token limit and it caused an error for the program (didn't find out until after it was supposed to finish). It's probably a line that got caught up on accident from the main body of an article but interesting occurence nonetheless.

meager siren Nov 8, 2024, 3:35 AM

#

@inland lynx I figured it out with the GPU. I had some stuff in the code to try and save intermediately. Additionally, I had lines in the embedding function to verify that a category wasn't already in the DB table. So the reality was that the check was probably going to cost a lot of time BUT it probably wasn't going to be 120 hours.

meager siren Nov 8, 2024, 3:51 AM

#

I did the following:

consolidate node processing. Combine node list from downloaded graph and category-to-documents map. Eliminate duplicates by converting it to a set.
check table in vector db. Remove nodes from the list that appear in the table.
pass the filtered node list to the embedding function. Also added an argument to bypass the category check in the embedding function too.

This is much more straight forward I think and should bring the process down to maybe around 6 hours if I'm right.

meager siren Nov 8, 2024, 3:56 PM

#

Ok, I really over estimated on the 6 hours. It's looking more like 21/22 hours. But since everything is on one loop (for the embedding), I only have to wait once.

inland lynx Nov 9, 2024, 9:58 AM

#

meager siren Ok, I really over estimated on the 6 hours. It's looking more like 21/22 hours. ...

damn, 21 hours for what exactly?
Can you breadown the timing for what processes

meager siren Nov 9, 2024, 12:05 PM

#

Stage 1: load all nodes and clean the texts. Use set operations to get first and second half node list. Combine to one list. Iterate through and identify nodes that already exist in the table. Remove those words from the node list.

Stage 2: embed node list and update table. There are 4M nodes to embed. This stage will take 21 hours if starting from nothing.

meager siren Nov 9, 2024, 12:27 PM

#

inland lynx damn, 21 hours for what exactly? Can you breadown the timing for what processes

After 1 full run yesterday, I found out that there is a limit to what pyarrow is able to handle. Apparently, objects over 2GB in size will cause it to error out. So I think I’m kinda forced to do an iterative saving along the way. That kinda sucks because it will still add time (to stage 1) if the program was interrupted or were to crash and I don’t know how much that will affect performance.

inland lynx Nov 9, 2024, 2:42 PM

#

meager siren After 1 full run yesterday, I found out that there is a limit to what pyarrow is...

yeah you can do some checkpointing and save the obj part by part at a timestamp

meager siren Nov 9, 2024, 2:42 PM

#

Well, I might as well write it to the DB where it's going anyway

#

Stage 1 should filter out already stored values.

inland lynx Nov 9, 2024, 2:43 PM

#

meager siren Well, I might as well write it to the DB where it's going anyway

hmm partial xml s

meager siren Nov 9, 2024, 2:43 PM

#

That said, it will probably take much longer to complete stage 1 as it goes up

meager siren Nov 9, 2024, 2:43 PM

#

inland lynx hmm partial xml s

Why XMLs?

#

this is the (category, embedding) pairs

inland lynx Nov 9, 2024, 2:44 PM

#

oh

inland lynx Nov 9, 2024, 2:44 PM

#

meager siren this is the (category, embedding) pairs

uh dont store it as a pair together

#

do it like a hashmap
Store the embedding in its suitable form and categories in a suitable form

inland lynx Nov 9, 2024, 2:48 PM

#

meager siren That said, it will probably take much longer to complete stage 1 as it goes up

https://www.timescale.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/amp/
Have a look at this

#

https://www.analyticsvidhya.com/blog/2024/08/compute-and-store-vector-embeddings-with-langchain/

Analytics Vidhya

How to Compute and Store Vector Embeddings with LangChain?

Learn to efficiently find content similar to queries using vector embeddings and LangChain. Understand embeddings, implement LangChain models.

inland lynx Nov 9, 2024, 2:51 PM

#

inland lynx https://www.analyticsvidhya.com/blog/2024/08/compute-and-store-vector-embeddings...

I think you should try using some existing frameworks for your hashing and embedding management, the way your doing rn isnt that efficient

#

any thoughts ghost?

meager siren Nov 9, 2024, 2:54 PM

#

It's about the same

#

lancedb is the vector DB I'm using right now

meager siren Nov 9, 2024, 2:56 PM

#

inland lynx I think you should try using some existing frameworks for your hashing and embed...

The interface and setup for lancedb is pretty straight forward. Search seems to be pretty fast too (on my mac, I'm getting 200 items/second with lancedb) for looking up already stored categories

inland lynx Nov 9, 2024, 2:56 PM

#

meager siren lancedb is the vector DB I'm using right now

so what is the best way to manage embedding s? Any existing frameworkd

meager siren Nov 9, 2024, 2:57 PM

#

Yeah, I let lancedb do all the work because it also does handles stuff at a disk level, not just memory.

inland lynx Nov 9, 2024, 3:03 PM

#

meager siren Well, I might as well write it to the DB where it's going anyway

too many writes per pair may not work efficiently maybe
Try keeping a temp batch where you push x pairs
Once the total max count is reached write at once to db

meager siren Nov 9, 2024, 3:07 PM

#

inland lynx too many writes per pair may not work efficiently maybe Try keeping a temp batch...

Way ahead of you there. Doing a batched writes periodically during the embedding process in Stage 2

#

If you do one big write to the db, you'll get a errors (if the data is over 2GB large)

#

I did the math

#

assume 6 kb per pair, I would have a max of 250,000 pairs per batch write to the db

inland lynx Nov 9, 2024, 3:12 PM

#

meager siren assume 6 kb per pair, I would have a max of 250,000 pairs per batch write to the...

I would suggest you do some dummy testing to test the limits with your h/w to find the optimal write batch size

#

We dont want to write MAX limit if it s costing efficiency/time

#

@meager siren is this what your doing?
https://lancedb.github.io/lance/read_and_write.html
Batching and merging is natively supported in lance

meager siren Nov 9, 2024, 3:42 PM

#

inland lynx <@400827393415053333> is this what your doing? https://lancedb.github.io/lance/r...

I see bulk update but nothing for add

inland lynx Nov 9, 2024, 3:43 PM

#

there is upsert, its labelled as merge insert

#

and insert

meager siren Nov 9, 2024, 3:43 PM

#

Ah I found it

inland lynx Nov 9, 2024, 3:44 PM

#

yeah I hope that makes everything more faster

inland lynx Nov 9, 2024, 3:45 PM

#

meager siren Ah I found it

Note there are many insert types, which does the checking if embedding exists
Try to use all the native methods instead of your code

meager siren Nov 9, 2024, 3:46 PM

#

Have you seen my code

inland lynx Nov 9, 2024, 3:46 PM

#

no

meager siren Nov 9, 2024, 3:46 PM

#

I mean, I'm not denying it's probably inefficient

#

but I wonder how this will handle the scale

inland lynx Nov 9, 2024, 3:46 PM

#

meager siren but I wonder how this will handle the scale

?

meager siren Nov 9, 2024, 3:46 PM

#

We have 4M entries to put into the db

inland lynx Nov 9, 2024, 3:46 PM

#

Your doing everything in a batch

inland lynx Nov 9, 2024, 3:47 PM

#

meager siren We have 4M entries to put into the db

hmm true

#

isnt it supposed to scale to that level?

#

I mean otherwise using that db makes not much sense tbh

meager siren Nov 9, 2024, 3:48 PM

#

Right but what I'm saying is "how efficient is that upsert if it has to check by category name ?"

#

embeddings can be organized into a hierarchical tree based on similarity

#

but if you store 1M pairs into the db, and you have 3M left, how performant will that upsert be?

inland lynx Nov 9, 2024, 3:49 PM

#

meager siren embeddings can be organized into a hierarchical tree based on similarity

internally does it not do all this?

#

i mean they would be internally managing the data in a different way thats why your search is fast

meager siren Nov 9, 2024, 3:50 PM

#

Right but if you are using the category name as the primary key, you dont get that benefit

#

Youd have to do something else to account for that variable in the db (though I guess they probably would be able to figure something out since they can infer/use the types defined in the schema)

inland lynx Nov 9, 2024, 3:51 PM

#

why dont u just try it and see

inland lynx Nov 9, 2024, 3:52 PM

#

meager siren Right but if you are using the category name as the primary key, you dont get th...

ideally they should be storing as some sort of hash maps internally

#

so that embedding is stored efficiently

#

its probably just interfaced as a db which you can query out like SQL

inland lynx Nov 9, 2024, 3:54 PM

#

meager siren Youd have to do something else to account for that variable in the db (though I ...

yes you can pass out a schema

meager siren Nov 9, 2024, 3:54 PM

#

Already defined the schema

#

they have the option to infer it based on the data or define it.

inland lynx Nov 9, 2024, 3:55 PM

#

meager siren they have the option to infer it based on the data or define it.

dont infer

meager siren Nov 9, 2024, 3:55 PM

#

depending on the initialization method one uses

meager siren Nov 9, 2024, 3:55 PM

#

inland lynx dont infer

I know. I have the schema explicitly defined

#

one problem with upsert

#

if you upsert, you have a constant runtime of 21 hours since you would be updating entries already seen/stored

inland lynx Nov 9, 2024, 3:56 PM

#

meager siren one problem with upsert

use merge insert

#

not upsert

#

i just mentioned upsert in the case we want a scenario where we update Catgeories, like the label changes

meager siren Nov 9, 2024, 3:57 PM

#

Ah, I see

inland lynx Nov 9, 2024, 3:58 PM

#

hmm actually we should have stored some page id

meager siren Nov 9, 2024, 3:59 PM

#

Why?

inland lynx Nov 9, 2024, 3:59 PM

#

we dont have anything to uniquely identify an article

#

it makes it hard to update existing article nodes

meager siren Nov 9, 2024, 3:59 PM

#

this is just the vector db to allow for semantic search across the category graph.

inland lynx Nov 9, 2024, 3:59 PM

#

yeah as time passes wiki changes

#

the Category name may change for the same article

#

anyways … carry on

meager siren Nov 9, 2024, 4:00 PM

#

As it stands, the DB would only get bigger since it uses the categories found in the tree and category-to-document mapping. Writing a "clean up" function to remove categories not found in either should be easy

inland lynx Nov 9, 2024, 4:01 PM

#

meager siren As it stands, the DB would only get bigger since it uses the categories found in...

its un necessary and in efficient in the case ppl are allowed to change name

#

thats mean we end up doing pre processing multiple times for the same article

meager siren Nov 9, 2024, 4:01 PM

#

Not true at all

inland lynx Nov 9, 2024, 4:01 PM

#

But if the article itself changed then it makes sense

inland lynx Nov 9, 2024, 4:02 PM

#

meager siren Not true at all

if the category name changed what do you find in the graph?

#

today my article was 1977 War tmrw its The history of 1977 and war of an era

meager siren Nov 9, 2024, 4:02 PM

#

consider this. The data (wikipedia and the dump files) are updated periodically

inland lynx Nov 9, 2024, 4:03 PM

#

yeah but we end up doing un necessary inserts when the name only changes
Do you get what Im trying to say
We identify by a category name which may change

meager siren Nov 9, 2024, 4:04 PM

#

if the category changes, then the embedding changes because the embedding is tied to the category, NOT the article

#

If there is a new category, it has to be added to the db with its own embedding.

inland lynx Nov 9, 2024, 4:04 PM

#

meager siren if the category changes, then the embedding changes because the embedding is tie...

but the category is the same right why change the embedding
They probably made it sound fancier or some synonym crap

meager siren Nov 9, 2024, 4:05 PM

#

inland lynx but the category is the same right why change the embedding They probably made ...

No, that's not correct at all

#

the embedding model is tied to the category because it was generated as a function of the category string

#

embedding = model(category_string)

inland lynx Nov 9, 2024, 4:05 PM

#

that is true

meager siren Nov 9, 2024, 4:05 PM

#

new category -> new embedding

inland lynx Nov 9, 2024, 4:06 PM

#

uh ok makes sense

meager siren Nov 9, 2024, 4:06 PM

#

update the category to document map, those new categories from there will be captured and stored to the vector db

inland lynx Nov 9, 2024, 4:06 PM

#

but we need to update only the category node and upsert

meager siren Nov 9, 2024, 4:06 PM

#

inland lynx but we need to update only the category node and upsert

Exactly!

inland lynx Nov 9, 2024, 4:07 PM

#

meager siren Exactly!

but how do you map to the original category?

#

did u add any id or metadata

meager siren Nov 9, 2024, 4:07 PM

#

no

inland lynx Nov 9, 2024, 4:07 PM

#

uh ok its possible needs a little more extra effort
We query the article history and do search and see what has changed

#

or whatever stores category audit history

meager siren Nov 9, 2024, 4:08 PM

#

insert categories unique to the category to documents map into the downloaded category tree via the semantic embeddings to match the categories

inland lynx Nov 9, 2024, 4:08 PM

#

that wont work

#

semantically many will be similar

meager siren Nov 9, 2024, 4:08 PM

#

inland lynx uh ok its possible needs a little more extra effort We query the article history...

we dont have article history

inland lynx Nov 9, 2024, 4:08 PM

#

its there actually whn you hit the API

meager siren Nov 9, 2024, 4:09 PM

#

inland lynx its there actually whn you hit the API

We have TRIED to hit the API

inland lynx Nov 9, 2024, 4:09 PM

#

lol calm down

meager siren Nov 9, 2024, 4:09 PM

#

and we have no good way to download the whole graph

inland lynx Nov 9, 2024, 4:09 PM

#

this is only when we want to update some categories

meager siren Nov 12, 2024, 4:46 AM

#

It's interesting. The vectorDB is operating at a pace of around a little over 1 query per second when trying to search the table with SQL querying for known categories. I've tried what I can to speed things up, including mutliprocessing and even batching items into one query but still I get long estimates for this function (after there have been a large number of category, embedding pairs committed to the DB).

#

I even looked at using lancedb rust package to create a helper function in rust but ran into some issues with building and installing dependencies.

meager siren Nov 12, 2024, 1:14 PM

#

Update to that, I figured a way around that. Setting a bigger batch size for that (ie 1000) works well. Downsize is that the RAM usage really goes up a lot

meager siren Nov 13, 2024, 3:11 PM

#

@inland lynx Something I noticed is that I was getting really inconsistent results with my vector db. Like, why is it that, after 1 full run, I still have vectors that need to be embedded? Why is is it that each time I run a search on the db to get vectors already seen, I get different results? My program logic is right.

I'm thinking that perhaps multiprocessing could be the cause behind this issue. So I am running the script again, this time with single processor, to see how many vectors it finds in the db.

#

Could be the situation that, since lancedb is written in rust, it cannot handle multiprocessing well.

#

I use "spawn" instead of "fork" in my code but maybe it's just safer to go single-thread/single-processor

Screenshot_2024-11-13_at_10.26.06_AM.png

meager siren Nov 13, 2024, 6:10 PM

#

Ok, so after running that part 3 times, I was getting some solid consistency.

#

#

The number is concerning because that means I have around 4000 unique embeddings out of the entire 13 GB stored there which allows me to deduce that there are A LOT of either duplicates or corrupt vectors in the DB

#

So that means, for the sake of stability, I should run the program on single processor (and single thread I guess) and see what happens.

inland lynx Nov 14, 2024, 9:25 AM

#

meager siren I use "spawn" instead of "fork" in my code but maybe it's just safer to go singl...

makes sense, might be a LOT of WRITE hazards

#

if your dealing with a ton of read/write hazards, we need to handle the number of threads optimized to lance
instead of multiple threads failing and we need to again spawn a new thread for all the data lost by the previous one s

inland lynx Nov 14, 2024, 9:28 AM

#

meager siren I use "spawn" instead of "fork" in my code but maybe it's just safer to go singl...

so basically what is the commit limit? are you following the same?
I feel there is something fishy going on with your h/w also, is it supposed to handle these many concurrent writes?

inland lynx Nov 14, 2024, 9:32 AM

#

meager siren So that means, for the sake of stability, I should run the program on single pro...

so they mentioned they have a certain limit to how they handle write hazards
you need to follow the same, hope its not a huge limitation in terms of number of threads

meager siren Nov 14, 2024, 12:49 PM

#

inland lynx so basically what is the commit limit? are you following the same? I feel there ...

I went back and ran diagnostics. H/W was fine. I forgot the specifics but the problem lie in the way the code was executing. After I changed the workflow of the program, haven’t had the same issue since.

meager siren Nov 14, 2024, 1:07 PM

#

inland lynx makes sense, might be a LOT of WRITE hazards

I haven't checked multi-threading yet and despite the info I saw on the FAQ saying certain kinds of multiprocessing are supported, I dont think it's work the risk right now. Downside right now to running single thread/processor is that it is taking A LONG time to run the program (create and save embeddings) even with batching/chunk insertion.

#

No more 21 hour runs.

inland lynx Nov 14, 2024, 1:08 PM

#

meager siren I went back and ran diagnostics. H/W was fine. I forgot the specifics but the pr...

Uh but does that mean your number of write hazards reduced?

meager siren Nov 14, 2024, 1:09 PM

#

Yes

inland lynx Nov 14, 2024, 1:09 PM

#

hmm how are you checking that

#Best Vector DB for Local Use