#Embedding questions


lilac gyro

Before I do a whole lot of embedding I want some advice.

What I'm trying to do: I have a data set of a large number of written documents (100k+ segments pre-processed so far). Some of the segments are much larger than the token limit, while others are short replies agreeing or disagreeing with other people. I'm generally throwing the short ones out: a 10-word segment doesn't usually encapsulate a useful opinion on its own, but 100 to 100,000 words can, with degrees of nuance. I want to be able to get an embedding for each segment regardless of size.
I have many segments from the same people at different points in time, and I want to be able to encapsulate the sum total of a person or an era. I'd like to be able to calculate the average stance of one person, the average opinion of everyone in one time frame, and the same for other potential clusters of embeddings. Perhaps I could work out whether one person is a thought leader, or whether there are groupings of people that seem to be working on the same things or in similar ways.
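
To make the "average stance" idea concrete, here is a minimal sketch of per-group statistics. It assumes each segment is already embedded and tagged with a group key (a person, an era, or any other cluster label); the `group_stats` name and the toy 2-dimensional vectors are made up for illustration, and real embeddings would have far more dimensions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_stats(records):
    """Compute a per-dimension mean and standard deviation for each group.

    records: iterable of (group_key, embedding) pairs, where every
    embedding is a list of floats of the same length.
    Returns {group_key: (mean_vector, std_vector)}.
    """
    groups = defaultdict(list)
    for key, vec in records:
        groups[key].append(vec)

    stats = {}
    for key, vecs in groups.items():
        columns = list(zip(*vecs))  # one tuple of values per dimension
        mean = [fmean(col) for col in columns]
        std = [pstdev(col) for col in columns]  # population std; 0.0 for one sample
        stats[key] = (mean, std)
    return stats


# Hypothetical toy data: (author, embedding) pairs.
records = [
    ("alice", [1.0, 0.0]),
    ("alice", [0.0, 1.0]),
    ("bob", [1.0, 1.0]),
]
stats = group_stats(records)
alice_mean, alice_std = stats["alice"]
```

The same function works for eras by swapping the key, for example `(year // 10 * 10, embedding)` to group by decade.
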
I want to be able to see whether there's a trend line for a particular cluster over time, and I'd like to be able to interpolate data that doesn't exist: for example, the opinions people or groups might give in the future, assuming the trends continue, or a guess at their opinions on topics they never wrote about in a given era.
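
A rough sketch of what such a trend line could look like, assuming one centroid per time period and a straight-line trend in each dimension. Both assumptions are mine for illustration; nothing guarantees that opinions drift linearly in embedding space, and `fit_linear_trend` and `predict` are hypothetical helper names.

```python
from statistics import fmean


def fit_linear_trend(times, centroids):
    """Fit an ordinary least-squares line per dimension.

    Models centroid[d](t) ~ intercepts[d] + slopes[d] * t.
    times: list of numbers (e.g. years); centroids: one vector per time.
    """
    t_mean = fmean(times)
    t_var = sum((t - t_mean) ** 2 for t in times)
    if t_var == 0:
        raise ValueError("need at least two distinct time points")

    slopes, intercepts = [], []
    for column in zip(*centroids):  # iterate over dimensions
        c_mean = fmean(column)
        cov = sum((t - t_mean) * (c - c_mean) for t, c in zip(times, column))
        slope = cov / t_var
        slopes.append(slope)
        intercepts.append(c_mean - slope * t_mean)
    return slopes, intercepts


def predict(slopes, intercepts, t):
    """Evaluate the fitted line at time t (inside or beyond the observed range)."""
    return [b + m * t for m, b in zip(slopes, intercepts)]


# Hypothetical toy data: a cluster centroid per decade.
years = [2000, 2010, 2020]
centroids = [[0.0, 1.0], [0.1, 0.8], [0.2, 0.6]]
slopes, intercepts = fit_linear_trend(years, centroids)
forecast_2030 = predict(slopes, intercepts, 2030)  # extrapolated centroid
```

Evaluating `predict` at a year between observations gives an interpolated centroid; evaluating it past the last year is extrapolation and gets less trustworthy the further out it goes.
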
I'd like to take the calculated embeddings, give them back to GPT, and say: "Write me a statement about some topic which is within some distance of this embedding."

So, a few questions about embeddings:

  1. Embeddings are like multidimensional scatter plots, right? But do the individual dimensions have meaning, so that one end of a dimension is clearly more of something and the other end less of it (even if we don't know what that something is)? Or is it more that zones of the plot space correspond to specific things, so that things humans consider conceptually close could be miles apart in the embedding, with no overlap in the dimensions they use? (For example, a physics paper might have no dimensions in common with a chemistry paper.)
  2. How can I combine embeddings? If I have a piece of text that's longer than 8192 tokens, can I split it into chunks and add the chunk embeddings? Average them? Take a weighted average? And can I wait for a future model and expect its embeddings to be compatible with the ones I compute now?
  3. Should I be able to do the sorts of averaging I'm describing to find the location of an average opinion? I'm thinking a mean and standard deviation for each axis might be more useful than a simple average. And what if I used Principal Component Analysis to reduce the dimensionality of the space? I think that should speed up processing, but I'm not sure how it would affect the resulting analysis.
  4. Is there a way to feed an embedding (real or calculated) back into the API, the way I mentioned above, to get a text response?
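
For question 2, this is a minimal sketch of the weighted-average option, assuming the long text has already been split into chunks under the token limit and each chunk has been embedded separately. Weighting by token count and re-normalising to unit length are illustrative choices, not something I know to be the recommended approach, and `combine_chunks` is a made-up name.

```python
import math


def combine_chunks(chunk_embeddings, chunk_token_counts):
    """Token-count-weighted average of chunk embeddings, scaled to unit length.

    chunk_embeddings: list of equal-length float vectors, one per chunk.
    chunk_token_counts: number of tokens in each chunk (used as weights).
    """
    if len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("need exactly one token count per chunk")
    total = sum(chunk_token_counts)
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")

    dims = len(chunk_embeddings[0])
    combined = [0.0] * dims
    for vec, n_tokens in zip(chunk_embeddings, chunk_token_counts):
        weight = n_tokens / total
        for i, x in enumerate(vec):
            combined[i] += weight * x

    # Re-normalise so the result can be compared by cosine similarity
    # on the same footing as a single-chunk embedding.
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined] if norm > 0 else combined


# Hypothetical toy data: two chunks of one long segment.
chunks = [[1.0, 0.0], [0.0, 1.0]]
token_counts = [300, 100]
segment_embedding = combine_chunks(chunks, token_counts)
```

Passing equal token counts reduces this to a plain average, so the same helper covers both variants in the question.
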

I understand that the inferred or interpolated responses may be significantly different from reality. But here are some examples of what this type of analysis might be able to do:

Read all papers within a given discipline and track scientific consensus over time. Ask what the most popular avenues of exploration were, or what the average position of the scientific community was, at a given point in time. Ask what Einstein might have said about Calabi-Yau space, an area of mathematics largely developed after his death. Find out whether psychology has a trajectory of acceptance for various mental health diagnoses, and predict how future editions of the DSM might talk about those that are currently least developed.

The avenues of application for this might well be limitless, and I have a particular test bed in mind that I haven't mentioned so far here.

little wagon