#data-science-and-ml
If I dont add that , i get an error
it's probably there because of broadcasting rules
One more thing have you heard of cross validation score?
In that we divide the training set into train-test folds and then compute the accuracies right?
What does a fold mean?
I think it just means split
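A minimal sketch of what a "fold" is, assuming plain NumPy instead of scikit-learn's cross-validation helpers. `k_fold_indices` is a hypothetical helper written for illustration, not a library function: the data is cut into k chunks, and each chunk takes one turn as the validation set while the rest is used for training.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)  # n_folds roughly equal chunks
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "validate:", np.sort(val_idx))
```

The cross-validation score is then the model's score on each validation chunk, usually reported as one number per fold plus the mean.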
scalers (like StandardScaler() or MinMaxScaler()) expect their input to be 2D
single output regressors (which are most of them, like LinearRegression() or RandomForestRegressor()) yield their .predict output as 1D
so one must reshape the 1D prediction output to be 2D to be inverse-transformable via those scalers
so you reshape a 1D input of shape (N,) to be (N, 1)
.reshape(-1, 1) is one of the ways, another is [:, None] or [:, np.newaxis]
in your code, I presume you'll find a similar sort of reshaping to fit sc2 to your training target values in the first place
because again, they expect a 2D input, your target is 1D
can also do .reshape(N, 1) but why be error prone when you can put -1 to make it infer
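A small sketch of the three reshaping spellings mentioned above, using only NumPy. `pred` is a made-up stand-in for the 1D output of a single-output `.predict`; the scaler itself is left out.

```python
import numpy as np

pred = np.array([1.5, 2.0, 3.25])   # shape (3,), like a single-output .predict result

a = pred.reshape(-1, 1)             # -1 lets NumPy infer the number of rows
b = pred[:, None]
c = pred[:, np.newaxis]             # np.newaxis is an alias for None

print(pred.shape, a.shape)          # (3,) (3, 1)
assert np.array_equal(a, b) and np.array_equal(a, c)

flat = a.ravel()                    # back to 1D once the scaler is done with it
```

All three produce the same `(N, 1)` array, which is the 2D shape a scaler's `inverse_transform` expects.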
I usually use unsqueeze
Isnt the input already in 2D?
by the use of double square brackets
[[6.5]] is in 2D right
what's the error? error messages are there to inform you what the problem is. you are supposed to read them, not treat them as an opaque blob of red text.
sometimes error messages are unhelpful, but even knowing where the error came from (the "traceback" part) is useful
and it's especially useful when asking for help, because otherwise you're forcing other people to guess at what the problem might be
is there a meaningful difference between a) df.drop(index = 50) and b) df = df.loc[~(df.index == 50)] also, should i focus less on bracket notation?
This is the error I am getting
yes. the latter is very wasteful. it constructs an entire boolean array
it doesn't help here, but don't forget that ~(x == y) is just x != y
ya i thought about that after the fact
the error message looks pretty clear to me. what's the issue?
regarding 50 though, im not working with unique indexes
is generating a series still overkill?
for starters, don't use non-unique indexes
in the case of a non-unique index then you might just get an error with .drop, you'd have to see what happens
i don't have a choice in that, it's the coursework
no it's just the instructors preference i guess but thats kind of besides the point
!e ```python
import pandas as pd
df = pd.DataFrame({"i": [1, 2, 2], "y": [4, 5, 6]}).set_index("i")
df = df.drop(index=1)
print(df)
```
@desert oar :white_check_mark: Your 3.12 eval job has completed with return code 0.
   y
i
2  5
2  6
!e ```python
import pandas as pd
df = pd.DataFrame({"i": [1, 2, 2], "y": [4, 5, 6]}).set_index("i")
df = df.drop(index=2)
print(df)
```
@desert oar :white_check_mark: Your 3.12 eval job has completed with return code 0.
   y
i
1  4
hmm think i will, just out of curiosity do you happen to know if pandas handles different index types differently? if it were a rangeindex it could obviously jump straight to it
eh nvm, i should practice reading docs
the input to .inverse_transform is the output of .predict right?
no way I actually need that sqrt right ? there's no floor nor anything, so there's an assumption that the result is always an integer, which means that I can simplify it from wtv properties guarantee that assumption
that's the 1D array I meant
the result is not only an integer, it is also odd
otherwise the -1 and division by 2 wouldnt be an integer
(odd number) = sqrt(x), what can I say about x?
anyone know why my acc is constant for all epochs (ive just shown 2 but it stays constant after that) when i use binary cross entropy as a loss function? seems to calculate just fine when i switch to normal non-binary cross entropy
has anyone else experienced the training loss and inference results becoming strange after a tensorflow version upgrade?
it's very weird, there is no warning or deprecation notice
is pytorch more safe?
actually, im gonna do the reasonable thing and just keep two extra arrays to use as a lookup table
it's allocated once when the model is instantiated and then reused across all layers
how many classes does your model output have ?
good idea to double check your code and then see if someone has opened an issue on their github repo
most times double checking your code works, in my experience at least
2 classes, flawless and flawed, 1 output neuron
shouldn't you have 2 output neurons ?
one per class
i read its better to have one output neuron for binary classifiers but idk
yeah im reading it too
so the loss goes down, but acc remains stable with repeated values
that is kinda odd
maybe the API for the binary cross entropy has something different to it
it's optimizing something other than acc
that is, you may be using binary wrong
maybe, ill check the docs
funny thing is when i run this model on my test data it performs the best in acc
and it is 1D, hence the error
SO for inverse_transform we need a 2d array
indeed
For polynomial regression, how do we decide the degree of the polynomial?
heavily depends on what your plot looks like, if it's a line you go for linear, if it's a parabola you go for x**2, if it does an S kinda thing you go for cubed, if the shape is complicated you go for a high degree, you're fitting a Taylor approximation
dont go too high tho, cuz after a certain power floating point stops working at certain ranges
Okay so I did with degree 4 and here is the results
that looks like an x**2 or x**3
Oh so it would look like this?
it works the same as when you're fitting a network
the degree is a hyperparameter
and you gotta check for overfitting
I don't know, which points were used for fitting and which are being used for validation/test ?
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape(-1, 1)
plt.scatter(X, Y, color='red')
# poly was presumably fitted on X already, so transform is enough here
plt.plot(X_grid, lin_reg2.predict(poly.transform(X_grid)), color='blue')
plt.title("Polynomial Regression Results")
plt.xlabel("Position Level")
plt.ylabel("Salaries")
plt.show()
Only difference is the degree
One is degree 4 and the other 24
you gotta select a certain percentage of points at random
and not use them during fitting
then calculate the error on those points after the fit
if that's high, you got an overfit
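A rough sketch of that hold-out check, assuming NumPy's `polyfit` in place of the scikit-learn pipeline used in the chat. The data is made up (a noisy quadratic) and `mse` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = x**2 + rng.normal(scale=0.05, size=x.size)   # made-up quadratic data

# Hold out 25% of the points at random; they are never used for fitting.
idx = rng.permutation(x.size)
val_idx, train_idx = idx[:10], idx[10:]

def mse(degree, fit_idx, eval_idx):
    """Fit a polynomial on fit_idx and return its mean squared error on eval_idx."""
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], degree)
    residual = np.polyval(coeffs, x[eval_idx]) - y[eval_idx]
    return float(np.mean(residual ** 2))

for degree in (1, 2, 9):
    print(degree, mse(degree, train_idx, train_idx), mse(degree, train_idx, val_idx))
```

The training error can only go down as the degree goes up, so it says little on its own. The held-out error is the one to watch: once it starts climbing while the training error keeps shrinking, the fit is overfitting.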
Whats the problem if it is overfitting?
Or if a model is overfitting?
it doesn't predict unseen data with accuracy
there's like, an infinite number of curves you can pass through those points
you want the one that'll be most useful to you
im surprised you could fit a 24 degree tho
I think it's doing that cuz you didn't give it enough time to fit
24 degrees is a lot, most of physics happens in the 1st 2nd and 3rd
oh, one way to improve it within the same number of iterations would be to do an x -> x - 5 substitution, so that the center of the approx is at x = 5
Thing is, when i visualize the results normally, the graph shows me a straight line to each point
plt.scatter(X,Y,color = 'red')
plt.plot(X,lin_reg2.predict(X_poly),color = 'blue')
plt.title("Polynomial Regression Results")
plt.xlabel("Position Level")
plt.ylabel("Salaries")
plt.show()
This is for normal visualization
makes sense
24 is too high
you gotta do the train/val split too
try like, 5 or 6 degrees
but when I divide the x -area into more specific values it becomes this.
Is it necessary for a validation set?
I mean you can gauge it by looking at the graph, but the correct way is using a split yeah
It's also a good idea to normalize the data
Divide the x axis by 10 and the y axis by 1e6
It helps preventing overflow
chat gpt almost nailed this transcription, I think if I'm more careful with my scribbles I can get it to do all my latex
awesome
this is the math for the cuda kernel
I already caught an error when writing this
chat gpt was also not very good, might as well just code it right away
dirac should be lower lower on the third eqn
4 to 5 is actually wrong, first term is running over too many l's
i gotta limit it to those where k = k'
there's also one more symmetry consideration besides Mkk' = Mk'k, which is about the coordinates of the two vectors being dotted: there's also symmetry in cc', since the dot product itself is commutative, that is, qncc' = qnc'c
i have 3 classes represented by 0, 1 and 8, when i use tf's to_categorical on them all i get are 1 0 and 0 1 which is 2 classes, i checked the documentation and from what i understand it should work with multiple classes, i even changed my classes to 0 1 and 2 but in that case i only get a single class, 1
nvm its working now, had to explicitly tell it i had 3 classes
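A sketch of one way to one-hot encode labels like 0, 1 and 8 with the class count stated explicitly, assuming plain NumPy instead of `to_categorical`. As far as I know, `to_categorical` infers the class count from the largest label plus one when `num_classes` is not passed, so non-contiguous labels are usually remapped first.

```python
import numpy as np

labels = np.array([0, 8, 1, 8, 0])   # made-up labels using the values 0, 1 and 8

# Map arbitrary label values to contiguous indices 0..n_classes-1.
classes, class_idx = np.unique(labels, return_inverse=True)
n_classes = classes.size             # stated explicitly rather than inferred

one_hot = np.eye(n_classes, dtype=np.float32)[class_idx]
print(classes)                       # [0 1 8]
print(one_hot)
```

`classes[class_idx]` recovers the original label values, so predictions can be mapped back to 0, 1 and 8 afterwards.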
this is gonna require some faith
or maybe a drawing to convince myself it makes sense
but again, the idea is that the metric tensor is symmetric so I save half the operations on a given calculation of the dot product, and the dot product is commutative, so I save half the operations cuz the resulting scores matrix is then also symmetric
half at each step
Hey, everyone
I am currently in 2nd year AI & DS field and I need a mini project that I can represent at my University also which I can use for my portfolio any ideas??
@final kiln what times have you been getting on your instance+container startups?
am trying to see if I want to go instance -> container -> task or just instance -> task
am gonna run a couple of tests later
if you have the image locally it should be like a couple seconds
usually better after you've initted it at least once before
yeah i plan to upload the image as an artifact and spin up the instance from an already-made-for-the-task image
a custom minimal arch image or something
but it feels like it might be less efficient on a large scale
I pre-prepare an AMI that contains the image, it does work, but it still takes 2min to startup
it's a lot worse if I don't do it
haven't found a good solution for this
what's the image?
i'll keep you up to date on my times
just a py image with pip install torch >.>
the nvidia stuff is heavy
I almost wonder if I should yield and just use the base AMI
but I don't want vendor lockin
did you compare the nvidia build with the non-nvidia build?
because virtualization can cause fuckups here
yeah torch is huge with nvidia
like really big in comparison with the cpu version
i mean in terms of speed
hm i'll be running these benchmarks in the coming week
yeah the virtualization can mess things up
there's no virtualization tho
unless you got a mismatch between the image arch and the machine arch
you're not on cloud?
I am, you mean they virtualize the underlying machine?
yeah the instance is a virtual machine, it's not bare metal
idk if they virtualize the gpu
I recall doing something like having multiple processes use the gpu at the same time, and there were no guard rails anywhere
might as well just do a 1:1 mounting of the gpu into the VM's right
“If you want to go fast, go alone. If you want to go far, go together”
Hey there! I'm currently self-studying statistics as a prerequisite for artificial intelligence. So, I'm looking to join a community of like-minded individuals. If you're also starting to learn prerequisites for AI, I'd love to connect. We could share knowledge, update each other on the topics we're covering each day, and discuss our plans for tomorrow or the week ahead. Let me know if you're interested in teaming up to support each other's learning journey!
Can someone help with part b?
mixing Taylor and Fourier features I see
yes
i did part a) and it was massively overfitted
enough that the dots fully overlapped
so isn't it the same problem but with L1 instead of L2
yes, but idk how to do the penalty term in sklearn
Encoding 2 categorical variables stored in one pandas data frame made up of 2 columns. Each column holds labels built from strings of alphabetic characters, no whitespace.
A one-hot encoder is used, instantiated with the arguments handle_unknown='ignore', sparse_output=False
The encoder delivers a data frame whose column labels are a RangeIndex, i.e. numeric.
The expectation is that the labels of the new features are a concatenation of the original feature name and the encoded category. But the encoder delivers data frame columns labeled numerically. Instantiating the encoder with the argument feature_name_combiner='concat' doesn't help.
What am I missing?
If I understand the sklearn.preprocessing.OneHotEncoder API documentation correctly (constructor parameter feature_name_combiner), the encoder should in these circumstances deliver concatenated labels for the encoded features: old feature name plus encoded category.
Well, get_feature_names_out() helps. Called on the encoder object, it returns the names, which can be stored as the columns of the data frame with the encoded categories.
it sounds like they just want you to swap out L1 distance in place of L2, no?
that is, | y_predicted - y_actual | instead of ( y_predicted - y_actual )^2
can you share the code that they provided in the python notebook?
what are you trying to do?
Is this the correct way to feed data in cnn?
I have data like each row represent an Image
I want to capture spatial features as well as temporal features
Each array is of 1000*1000
this is not readable.
if you're going to post a screenshot, make sure it's only exactly what you need to share.
Oh I'm sorry for the overlook
I only wanted to share the screenshot; I took it from my PC and sent it to my home email from my college mail
I will explain it
The first two columns are date and time, and the other columns are latitude, longitude, Imgtir1, imgtir2, imgvis and imgswir; each of these is an array of 1000*1000 in each row
what are you using for the neural network
For extracting spatial features from my data
"Imgtir1 imgtir2 imgvis imgswir" -- what do these mean?
I'm asking what library you're using to create the neural network. not why you're making it.
"Imgtir1 imgtir2 imgvis imgswir" -- what do these mean?
They are short wave infrared electromagnetic light and visible light
I am working on satellite data
so you probably need your data as a tensor with this shape
(num_rows, 1000, 1000, 4)
so basically, each image is a 3d array, two dimensions for height and width, and one dimension for those four... parts?
spectra, maybe?
dataframes are strictly two-dimensional. so you don't want it as that.
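A sketch of building that tensor with NumPy, using tiny made-up arrays in place of the real 1000x1000 satellite bands. The band names are shortened stand-ins for the column names in the question.

```python
import numpy as np

num_rows, height, width = 3, 8, 8   # tiny stand-ins for the real 1000x1000 images
band_names = ("tir1", "tir2", "vis", "swir")

# One (num_rows, height, width) array per band, filled with made-up values.
bands = {name: np.random.rand(num_rows, height, width) for name in band_names}

# Stack the bands along a new last axis: channels-last layout.
images = np.stack([bands[name] for name in band_names], axis=-1)
print(images.shape)                 # (3, 8, 8, 4)
```

Each `images[i]` is then one multi-channel image, and `images[..., 2]` recovers the visible band.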
I used
from_records
So it stored the arrays in it as they are
Just for you to get an idea of how my data is
I then stacked lat and lon in one array and then stacked the other 4 as channels
So it was like this: 26,1000,1000,6
I was confused about whether I should pass lat/lon as features or not, as coordinates are important for catching spatial features, right?
Afterwards I also wanted to capture temporal features, so what I did was first pass it through the cnn without dates; afterwards I will set date and time as the index and pass it through an lstm
As labels I have two arrays, each of 1000*1000
They are named as flash and count
So in flash there will be a 1 wherever we observe a flash, and in count the number of flashes at that particular area
Ngl, never a good sign when you're solving eqns in latex and a tilde just floats off to another letter
now I have to redo everything
deltas on u should be up up too
and there you go, the cuda kernel
looks complicated but it's not, the deltas are if statements, and only one term is computed for a given (u, l)
and the super and subscripts are just indexes, so they're like M^n_l = matrix_array[n][l]
so each term will be computed in parallel and used to fill a matrix shaped (n, u, l) which I can then reduce sum along l, no space is wasted because every position in this matrix is filled. The f and g mappings are easily constructed using one of numpy's or pytorch's triu functions
I gotta do the gradient, but it's quite easy to calculate since it's all just simple multiplication
here's the draft for the full treatment
(excuse the phrasing, i'm still learning) when someone is writing a neural network or other similar kind of ai, and it comes to the programming of the model itself, i seem to see a lot of people writing nn's comprised of only dense layers like in the image attached? can someone explain to me what the purpose or point is, how they compare with writing a neural network comprised of conv2d or LSTM layers (i do understand these kinds of nn's are used for different purposes but thats beside my point)
a lot of it is trial and error mixed with heuristics, but sometimes you have good motivation for the choice
if you have shift invariance, convolutions make sense
if you have slow variance along one axis, then LSTMs make sense
so there is actually logic behind it? cause ive only ever written a nn totally alone and with zero guidance so i assumed it was supposed to be a component of a bigger whole, not it's own standalone layer
dense layers are just general affine transformations. the network can learn to make them shift invariant, but also not. you can use these if you know nothing at all about the problem
(as in yes it is A standalone layer, but that it was supposed to work in conjunction with the "actual" nn)
also, do different nn layers require specific inputs? cause i assumed i could just add a different kind of layer (for example an LSTM layer) but i started to get an error (this was a few days ago and i just undid it for the time being)
wdym by "specific inputs"
as long as the shape is correct, a layer won't care what the input is and will enforce its special conditions
i guess i meant "specifically shaped inputs" in that case
same as a sine function doesn't care what number you give as an input, it'll always treat it as an angle in radians and return a number between -1 and 1
then yes, each layer requires a specifically shaped input
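A small NumPy sketch of the two points above: a dense layer is just an affine map, and it only cares that the input shape lines up. `dense` is a hypothetical helper and the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Affine map x @ W + b; the last axis of x must match W's first axis."""
    return x @ W + b

W = rng.normal(size=(4, 3))   # in_features=4, out_features=3
b = np.zeros(3)

out = dense(rng.normal(size=(5, 4)), W, b)
print(out.shape)              # (5, 3)

# An input whose feature axis has the wrong size fails at the matmul,
# which is the kind of shape error a mismatched layer raises.
try:
    dense(rng.normal(size=(5, 7, 6)), W, b)
    shape_error = False
except ValueError:
    shape_error = True
print(shape_error)            # True
```

Recurrent and convolutional layers add their own expectations on top, such as a time axis or height, width and channel axes, so swapping layer types usually means reshaping the input as well.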
i'm going to go and try running it again and see if i can get the shape bit right 👀
Petition to use Q-learning with self driving cars
Y do you need a petition ?
It's a joke dude
Uhm, I didn't get it
Reinforcement learning
I'm not very knowledgeable about reinforcement learning. Gotta get on it eventually
Is it possible to train a T5 model WITHOUT teacher forcing?
I don't understand any of this. I want my model to NEVER generate a specific token at the beginning. I'm so desperate that I've given it an extra penalty... and yes, the model doesn't use Teacher Forcing.
Morning DS/ML
I have a data science portfolio project idea. I got this tortilla price dataset from Mexico from Kaggle. My hypothesis is that supermarkets offer lower prices for tortillas compared to convenience stores and traditional markets in Mexico. Is this a good portfolio project to have?
I was also thinking of creating a website to showcase all my data science projects
I just don’t know what impacts the project itself is going to have. Like if it’s impressive enough for an employer.
I’ll do it anyways I guess?
honestly just do it, once you get into the meat of the problem you'll know how to layer it more and more cuz you'll start having endless ideas
I need less ideas rn
have too many and there's not enough time for all
it's also a very reasonable hypothesis
more expensive, and significantly lower quality cuz they just buy it from the same supplier at lower quantities but closer to expire, or they just buy it from the supermarket and resell it
ig stuff specific to the place, like here they produce rice, so ig their rice is gonna be cheaper and super natural
whereas super market will be processed
Gotcha, thanks for the advice!!
I need to plot two graphs into one plot, but each having their own axes, e.g. like the following 2 gaussians, with one being rotated 90 degrees cw. Is there an easy way to do this? I've tried for some time and didn't manage it, the only thing that I could imagine working is that I plot the second one on the y axis and scale all the values to fit in the range of the first?
you'd just need to swap the axes
!e
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.arange(N)
y = np.exp(-N/1000*(x - N//2)**2)*50
plt.plot(x, y)
plt.plot(y, x)
plt.savefig("biggest_oof.png")
plt.show()
@wooden sail :white_check_mark: Your 3.12 eval job has completed with return code 0.
like so? if you swap the x and y axis you get a 90° rotation of a plot
Hmm, could work if I scale the data to fit the ranges, I just would have thought that there might be a option to maybe have 2 subplots overlapping. Thanks for the help!
this is also possible https://matplotlib.org/stable/gallery/subplots_axes_and_figures/two_scales.html
also tried this before, but they always have a shared axis, whereas if I rotate one of the plots they don't: x and y are swapped
Maybe just normalize the whole data to [0,1] and then use the approach from before, just removing the ticks, and that should work
OneHot Encoder acting on pandas data frame. How to prevent fit_transform method from placing old index labels in new column?
any time you fit an encoder, you're basically resetting it. so fit_transform shouldn't do that.
Are there any guides on converting pandas-on-pyspark code to pyspark SQL? I have to convert a few thousand LoC next week 😓
Thanks for input from you - it’s good to know that.
can someone please explain why my model gives the exact same prediction every time. it works for single predictions but when i mass predict it gives the same output for all predictions.
import random
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# combined_data, model and benign_or_malignant are defined elsewhere.
def process_image(img):
    resized = np.zeros((50, 50, 3))
    # Use the img argument. The original read read_img, a name not defined
    # in this function, so the argument that was passed in got ignored.
    resized[:, :, :3] = img
    img_tensor = tf.convert_to_tensor(resized)
    img_tensor = tf.expand_dims(img_tensor, axis=0)
    return img_tensor

img_label_arr = random.choices(combined_data, k=4)
print(img_label_arr)
for il in img_label_arr:
    label = il[1]
    read_image = il[0]
    plt.imshow(read_image)
    plt.title(f"Label: {label}")
    plt.axis("off")
    plt.show()
    image_tensor = process_image(read_image)
    predictions = model.predict(image_tensor)
    print("Predictions:", predictions)
    prediction = benign_or_malignant(predictions[0][0])
    print("Prediction:", prediction)
almost there
may i ask what your motivation behind doing this is?
it halves the number of floating point calculations, and also halves the amount of memory
not that, that i get
just the idea of learning metric tensors is common, so i would expect this to already exist in several flavors
couldn't even find anyone using quadratic forms
like, xMx.T
the transformer does it implicitly
you might've looked with the wrong key words
uhm, it is possible
every single time you read mahalanobis distance, this is what they're doing
let me see
.wa s mahalanobis distance
Failed to get response.
right, but I haven't seen this concept used in ML
it's used everywhere
where ?
anywhere you read "maximum likelihood" or under the name "mahalanobis distance" just as above
this is how optimization has been done for the past 100 years
I'm not sure I follow
but maybe you needed to look under statistical methods
this is an attention mechanism
machine learning people call this by the statistical name
this is why many problems in optimization have several names: the different communities don't talk with each other
I have not seen quadratic forms be used explicitly in deeplearning NLP
the statistical mahalanobis distance squared is the same as using a metric tensor on a point of a manifold
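A short NumPy sketch of that equivalence on made-up data: the squared Mahalanobis distance is the quadratic form (x - mu) M (x - mu).T with M taken as the inverse covariance, and with M set to the identity it collapses to the squared Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.3, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.7]])
data = rng.normal(size=(500, 3)) @ mixing   # made-up correlated samples

mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

M = np.linalg.inv(cov)        # plays the role of the metric tensor
x = data[0]
d2 = (x - mu) @ M @ (x - mu)  # squared Mahalanobis distance as a quadratic form

# With M = identity the same form reduces to the squared Euclidean distance.
d2_euclid = (x - mu) @ np.eye(3) @ (x - mu)
print(d2, d2_euclid)
```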
you don't have to call it by name for it to be equivalent though
I would've at least seen the equation I imagine
I do recall seeing something in computer vision
any term that involves any sort of mean squared error or maximum likelihood or maximum a posteriori or bayesian methods
anything that builds up a covariance matrix of some sort or a hessian matrix
they're all doing this, with a different name
I'm confused because the stuff you mention seems to be related to loss right, this is a layer
no, not necessarily only loss
if you look up deep unfolding algorithms, they treat iterations of older algorithms as layers of a neural network
I think it's very telling that they dont mention quadratic form on the 2017 paper
any unfolding of a 2nd order or higher method (e.g. quasi newton), or even of linear methods, involves products with gramian matrices (another name they go under)
they're using a quadratic form as a layer and don't mention it
i would never take "they're not using my preferred terminology" as a sign
it's not a matter of preferred terminology
quadratic forms are super standard and everyone expects the rest to take them for granted
it's a no mention of it
they're using this whole CS Inspired terminology to talk about something that is just xMy.T
all newton methods do that too and they won't call it metric tensor nor quadratic form
I would imagine you'd use the simple eqn tho
you'd be surprised
I was
you also just wrote like 2 or 3 pages on the same product 😛 as you can see
wdym ?
it's less than a page and it's not just doing one product, I'm calculating a bunch of stuff at the same time
like, im not explaining a layer
im placing it in a form to feed it to the gpu
at any rate, the area you're looking for under statistical optimization is called "information geometry" and it's all about learning manifolds and metric tensors to do parallel transport
all right
the actual motivation is that it is intuitive, in fact, even people who are not math savvy find it to be an interesting concept. our brains like geometry so I see it as the way to make networks interpretable
ig the layer can exist somewhere under some other name, but it still wouldn't clash with my objective
Anyone have specific tips to debug the location of tensors and/or memory related issues with torch?
I have a 70k parameter neural network that should be on my GPU (I literally call .to('cuda') on it) and when I call .to('cuda') on it closer to where I do inference it goes oom trying to allocate 60GB... For a 70k parameter network
can only be the data right
are you doing backprop ? the more batches, the more gradients it stores, I've also found that loss functions tend to be inefficient with their allocations
I wrote a determinant calculator, for dimension 200, it takes around 10-11 secs
written in Go
the most relatable thing
i compared numpy solution with mine
numpy takes 1.2 secs while mine takes 6 secs
numpy literally takes less than a second ughh
how does your algorithm compute the determinant?
i find the row-echelon form and then use this
I wrote the code for REF in Go
func RowEchelonForm(A [][]float64) ([][]float64, int) {
	matrix := Copy(A)
	var determinantFactor = 1
	rows := NumberOfRows(A)
	for i := 0; i < rows; i++ {
		r_idx, c_idx := LeftMostColumnWithNonZeroEntry(matrix, i)
		// r_idx = represents the row_index of the entry which is non-zero; it needs to be same as "i"; or else swap it.
		// c_idx = represents the col_index of the entry which is non-zero; on that
		if r_idx == -1 || c_idx == -1 {
			break
		}
		if r_idx != i {
			matrix, _ = RowSwitch(r_idx, i, matrix)
			determinantFactor *= -1
		}
		column, _ := GetColumnAt(c_idx, matrix)
		for j := r_idx + 1; j < rows; j++ {
			scalar := -1 * (float64(column[j]) / float64(column[i]))
			matrix, _ = RowAddition(scalar, j, i, matrix)
		}
	}
	return matrix, determinantFactor
}
func LeftMostColumnWithNonZeroEntry(A [][]float64, currentRow int) (int, int) {
	for i := 0; i < NumberOfCols(A); i++ {
		for j := currentRow; j < NumberOfRows(A); j++ {
			if A[j][i] != 0 {
				return j, i
			}
		}
	}
	return -1, -1
}
The code is a mess I know 😅
looks about right, i think LU decomposition is the most common, which is the same as row reducing
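For comparison, a minimal Python sketch of the same row-reduction idea. It is not a port of the Go code above: it copies the matrix once, then mutates it in place, and it picks the largest entry in each column as the pivot (partial pivoting). `det_by_elimination` is a hypothetical name.

```python
import numpy as np

def det_by_elimination(a):
    """Determinant via in-place Gaussian elimination with partial pivoting."""
    m = np.array(a, dtype=float)   # one copy up front, then mutate in place
    n = m.shape[0]
    sign = 1.0
    for i in range(n):
        pivot = i + np.argmax(np.abs(m[i:, i]))   # largest entry in column i
        if m[pivot, i] == 0.0:
            return 0.0                            # singular matrix
        if pivot != i:
            m[[i, pivot]] = m[[pivot, i]]         # a row swap flips the sign
            sign = -sign
        factors = m[i + 1:, i] / m[i, i]
        m[i + 1:] -= np.outer(factors, m[i])      # eliminate below the pivot
    return sign * np.prod(np.diag(m))

a = np.array([[2.0, 1.0], [1.0, 3.0]])
print(det_by_elimination(a))   # 5.0
```

The determinant is the product of the diagonal of the reduced matrix times the sign from the row swaps. Doing each row update in place, with no fresh matrix per operation, is likely where most of the time would be saved.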
google does seem to say C is a little faster than go, which would explain the difference
1:5 ?
depends on the actual code, but people in google searches claim 3 to 20x factor
C is a factor, but my code is creating a lot of variables, and I am not mutating the original matrix (tryna do functional programming, tho using for loops lmao)
that certainly makes it slower
see how it performs over several matrix sizes
but usually BLAS/LAPACK is hard to beat
whats BLAS, LAPACK ??
the libraries (usually written in c or fortran) that numpy wraps. they're libraries optimized for special linear algebra operations, also optimized for specific processor architectures
quite vexing because it automatically implements SIMD and parallelization
you can usually only beat it in cases of special composite matrices where you can split the total action of a matrix into smaller ones... for which you use BLAS/LAPACK 😛 instead of doing it naively for the whole matrix
I'm using lightning and training on GPU. The interesting thing is that it doesn't go OOM while training 🤷
I'll look again tomorrow
Maybe I just needed sleep
personally i liked the interpretation angle it provided
i definitely encouraged him to pursue it 😆
you know about the manifold learning stuff and i don't, so maybe this has already been investigated and i didn't realize
i'm with Edd though, this does show up everywhere all the time. when i said nobody has done it before, i specifically meant that i wasn't aware of anyone looking into this particular restatement of the attention mechanism. sorry if i wasn't clearer about that before.
it could be that no one has done it for the attention mechanism in particular, i have never read an NLP paper. the idea is overall standard and you find it in any book on optimization though
You're both missing the point tho, I'm iterating on an existing system, totally irrelevant if the underlying math is used everywhere else
It's like arguing against matrix mul
"the idea" meaning quadratic forms and their usefulness/interpretation in general, right?
yeah
from what little i recall of what is done in attention, the matrix is neither symmetric nor square
i think that was the point of the investigation: if you rearrange terms, you get something that looks like it could arise from the decomposition of a square symmetric matrix, so let's see what happens if you actually impose that constraint
that's interesting in its own right
a reasonable mapping where this even makes sense is nice to think about
wdym by a reasonable mapping?
how to make it so that the vectors participating in the bilinear form are in the same vector space
one where this metric makes sense
They all go through the same projection
this i think was the derivation:
Q = X @ Wq
K = X @ Wk
V = X @ Wv
Q @ K.T == (X @ Wq) @ (X @ Wk).T
== (X @ Wq) @ (Wk.T @ X.T)
== X @ (Wq @ Wk.T) @ X.T
Before being dotted
what's what here
and then the idea was to set M = (Wq @ Wk.T) and impose that M is symmetric and square, right?
that's a pretty strong constraint
Yeah, like, I think the ideal thing to do is quadratic, but I wanted to explore metric just for the sake of it, in early experiments I found it was actually way easier to interpret what it was doing
X is the sequence, and Wq Wk Wv are the attention projection matrices. then Q K and V are the query key and value matrices. following the attention-is-all-you-need notation
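A quick numerical check of the derivation above, with made-up sizes and random matrices standing in for the sequence and the learned projections. The symmetrisation step is just one illustrative way to impose the constraint being discussed.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_proj = 5, 8, 4    # made-up sizes

X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_proj))
Wk = rng.normal(size=(d_model, d_proj))

scores = (X @ Wq) @ (X @ Wk).T
M = Wq @ Wk.T                         # (d_model, d_model), generally not symmetric
assert np.allclose(scores, X @ M @ X.T)

# Imposing symmetry on M makes the score matrix symmetric as well.
M_sym = 0.5 * (M + M.T)
scores_sym = X @ M_sym @ X.T
assert np.allclose(scores_sym, scores_sym.T)
```

In this sketch M is square by construction, but its rank is at most `d_proj`, because it is the product of two rectangular projections.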
without any special justification, this completely changes the data manifold
there's no reason why this should be better. it's gonna be faster due to structure and have nice geometric properties
not necessarily useful ones
nice to investigate though
right. i was very curious what that would do to the model 😆
It can also have a regularization effect maybe
yeah, that'd be the case
The early experiment was an array sorter
So it would take a sequence 2, 3, 1 and output 1, 2, 3
With metric I could look at the distances between embeddings and see that they were actually being sorted along an axis
I have been posting everything here
But it was way back idk if I can retrieve it
it looks like they are expecting you to implement the optimization routine yourself using cvxopt, not use scikit-learn. the example even uses L1 in computeL1LeastSquares. so i'd start there instead of fishing around in a library that you aren't expected to be using.
i thought the rectangularness of the matrices in attention stuff is what lets you deal with sequences of irregular lengths
i wouldn't know though
maybe that just comes from the choice of tokenization and embedding
i've barely touched this application
which, the projection matrices? i thought they were rectangular in order to get a kind of bottleneck effect, so if your word vectors are 50 or 100 dimensional then the attention mechanism is only operating on 10 or 20 dimensional vectors
i think you're right about the tokenization thing. stelercus is the NLP expert though
The same projection is applied to every embedding and then every embedding "dots" with each other embedding, the sequence length is accounted for by the cross dotting
and from what i gathered, you have some pair of sets of input vectors that you elementwise inner product using this symmetric (positive definite???) matrix to get a new vector whose entries are the products?
or you have a different one per pair of vectors in the set
If you're asking about 2017 scaled dot product, you have 3 projections. The results from two of them are used to produce the matrix whose entries are the "cross dots". Then you use that matrix as a transformation on the third projection
So it's like the matrix of dot products is an MLP layer constructed on the fly
There's softmax applied to that matrix before using it as a transformation, and the values are scaled by 1/sqrt(dim of proj space), so they called it scaled dot product
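In code, the description above comes out roughly like this. A minimal single-head NumPy sketch with no masking or batching; the shapes and random weights are made up for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max first for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # the three projections
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # "cross dots", scaled by 1/sqrt(proj dim)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weights act as a transformation on V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 tokens, embedding dim 8 (assumed)
Wq, Wk, Wv = (rng.normal(size=(8, 3)) for _ in range(3))
out, weights = scaled_dot_product_attention(X, Wq, Wk, Wv)
```

`weights` is the n x n matrix that gets "constructed on the fly", and `out` has one mixed vector per token.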
In my case I threw away the three projections and use only one, followed by a quadratic form
i'm just trying to figure out how much efficiency one can squeeze out
compared to, say, letting M for one pair of vectors be L + L^T for a lower triangular L, which makes it easier to guarantee symmetry
though that's not a problem if you compute the gradients by hand considering the symmetry explicitly
Yeah I went the custom kernel route, that's what I was calculating earlier
But there's another layer to this
Which is that the particulars of the attention mechanism might not even be important
There's a study where they substituted the attention for an avg pooling and it still worked out fine
Caveat is
It was for vision and it's a paper on arxiv
So I'm reproducing their results but for NLP, and also exploring this other side with the metric tensor condition thing
I got a bunch of layers done, scaled dot product, quadratic, avg pooling, etc etc. Now I'm finishing up this one
But early results
For sentiment analysis it doesn't care one bit what you use for attention
if you have a deep enough network and enough data, the architecture doesn't matter much 😛 idk
fun times implementing the layers, but why don't you test it on what's already there? other than the street cred
Well it's stranger than that cuz the rest of the network doesn't really make the embeddings interact
fwiw i suspect that this is because sentiment analysis is mostly a matter of finding useful n-grams, you don't have a lot of sentiment encoded in long-range high-order relationships in text that isn't also apparent from word choice
So like, I even used identity and it worked out
Especially if the architecture has little inductive bias
Yes I agree with your take
my intuition for vision is that you have much less of a requirement for long-range relationships between video frames because so much more information is available already in each frame. which is why the "token mixing" mechanism matters a lot less for video
Wdym by what's already there ?
that you can do all of this with pytorch or smth
i'm looking forward to the first 1T parameter fully-connected MLP that's competitive with GPT 2
I am doing it on torch. Just not pytorch cuz I needed to brush up on my low level programming
ok, that answers the question
new paper coming up: money is all you need
"we replaced everything with dense layers and threw as many A100's as we could at it. it just works."
"Hyperparameters? We have graduate students exploring different sets as their master's thesis"
the reality of that hurts
The famed grad student descent
I hope you don't mind that I screenshot this hilarious exchange
I had a bit of an evil thought today
The results of my modelling is beating the one of the prev batch by 10-20 %
The previous ones weren't bad either. Maybe an idea would be to present $client with middling results the first time and then show better ones the second 😩
that way, you always end the project on a positive note (I wouldn't actually do this)
you kinda rediscovered academia tbh
Isn't this salami publishing and frowned upon
Hi everyone
I just started working on a regression task that uses data from IoT devices, each device has a different sampling rate.
The question is how can I create samples for my dataset if all the sensor readings (my features) have a different time stamp.
And also, when it comes to putting the model in production. How can I present the data for inference if all the data points are not available at the same time. Is there some kind of buffering technique that I could use?
I know there are tools in the cloud to stream data, but I am working on a problem where I won’t have access to the cloud, except for training, so I need to rely on open source.
Any feedback helps, thank you!
do you know anything about the measurement data? if you want to use the latest data and you know the data varies slowly, you can think of extrapolation methods
if you're ok with introducing a delay, storing some n previous samples makes sense, and then you could interpolate the data of all sensors to the timestamp of the sensor that was updated last
Yes, the data comes from an industrial machine. The data points are: temperatures, motor speeds, pressure, etc. The sample rates are between 1 to 8 seconds.
I think I could try both methods and see which one gets the most accurate model! Thank you for the advice.
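If it helps to compare the two, here is a small pandas sketch of both ideas. The sensor names, timestamps and values are invented for the example; the real sample rates and columns will differ.

```python
import pandas as pd

start = pd.Timestamp("2024-01-01 00:00:00")

# Two sensors with different sampling times (seconds after start).
temperature = pd.Series(
    [20.0, 22.0, 24.0],
    index=start + pd.to_timedelta([0, 4, 8], unit="s"),
    name="temperature",
)
pressure = pd.Series(
    [1.0, 1.2, 1.4, 1.6],
    index=start + pd.to_timedelta([1, 3, 5, 7], unit="s"),
    name="pressure",
)

# Outer-join both sensors on the union of all timestamps.
frame = pd.concat([temperature, pressure], axis=1).sort_index()

# Option 1 (no delay): hold the last value seen from each sensor.
held = frame.ffill()

# Option 2 (needs a small buffer): interpolate in time, but only between
# two real readings, then sample at the timestamps of one reference sensor.
interpolated = frame.interpolate(method="time", limit_area="inside")
samples = interpolated.loc[pressure.index].dropna()
```

Each row of `samples` is one training example with every feature aligned to the same timestamp. At inference time option 1 only needs the latest reading per sensor, while option 2 has to wait for the next reading of the slower sensors.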
Greetings Community:
I am a JavaScript developer and I have a task of getting text out of PDFs and images
I searched Google and found that pytesseract and PaddleOCR are very good OCR tools
Any suggestions on which Python libraries I can use for this task and get at least 80% accuracy?
extracting text from PDFs isn't bad if it's just plain text. but PDFs with lots of images and stuff are hell on earth
tabula is also a great tool!!
Now my particles will follow the gradient as they drop heat. This causes them to have pseudo collisions and bounce off of other particles
Hello guys, when merging two dataframes and returning selected columns: if the key values are NaN, it matches with all the NaNs and returns duplicates instead of returning NaN. What would be the ideal way to resolve this?
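pandas treats NaN keys as equal to each other in a merge, which is where the duplicates come from. One possible fix, sketched on made-up frames (the column names `key`, `left_val` and `right_val` are placeholders), is to drop the NaN-keyed rows from the lookup side before merging:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": [1.0, np.nan], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [1.0, np.nan, np.nan], "right_val": ["x", "y", "z"]})

# Naive merge: the NaN key on the left matches both NaN keys on the right,
# so the NaN row is duplicated.
naive = left.merge(right, on="key", how="left")

# Dropping NaN keys from the right side first means a NaN key on the left
# finds no match and gets NaN in the right-hand columns.
fixed = left.merge(right.dropna(subset=["key"]), on="key", how="left")
```

`naive` has three rows and `fixed` has two, with `right_val` missing for the NaN-keyed row.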
That looks awesome.
is there a possible way to run a python script in ios?
with apple, anything could be impossible
woah 
But btw, it was only one layer, don't recall the embedding dimension but it was likely not very large either. I used the IMDb dataset.
I'll get the full results once I'm done with this and prep a series of datasets
the network as a whole isn't just the one layer though, is it?
(i'm asking, i don't know)
well by one layer, I mean one transformer block, which is equal to the attention module followed by the MLP, and the MLP doesn't let information flow from one embedding to the other
if you have enough layers after the transformer block, you can do whatever
I didn't see much sense in having an avg pooling attention split into several heads, so there's also nothing additional there like projections
after that, I average the embeddings to get one embedding, and then project that to the number of classes
simple average too if im not mistaken, so no weights there
my reasoning for why it works is that it's just counting good words and bad words and using that to decide if it's a positive or negative review
the interesting part is gonna be when I get it to do next token prediction, which I suspect it won't work, would make no sense if it did
no wait it's the other way around, I first project them and then avg
there's only one projection matrix for all embeddings tho so the point still stands
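Side note on why the order doesn't matter here: with one shared linear projection and a plain mean, projecting then averaging gives the same result as averaging then projecting. A tiny check with made-up shapes (7 tokens, dim 16, 2 classes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))    # token embeddings after the block (assumed shape)
W = rng.normal(size=(16, 2))    # single shared projection to 2 classes
b = rng.normal(size=2)          # bias, also shared

project_then_avg = (X @ W + b).mean(axis=0)
avg_then_project = X.mean(axis=0) @ W + b

orders_agree = np.allclose(project_then_avg, avg_then_project)
```

This only holds because both steps are linear (affine); a nonlinearity between them would break it.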
it's not the capacity of the output layer that is doing it
that explanation would also have been the first thing I would've thought
Does anyone know if what can be done in R can be done in Python easily?
I'm not sure if it's worth investing time to learn R
Never seen it be used outside of school
There are many methods that aren't as accessible in Python as in R
Whether you should care about them is a different question 😄
What if we use numpy?
for the most part, what can be done in R can also be accomplished in python quite easily.
but it's worth noting that R has first-class time series support and there exist implementations of some niche models like multinomial logistic regression, where python might not offer as much out-of-the-box support (though statsforecast in python has been a thing for some time, maybe R's time series capabilities are already matched in python?)
They haven't been matched yet imo
For niche things at least
Also models like GAMs aren't as great in python. I think the valid question is, should you care about GAMs, maybe not
I don't think most people should learn it though
R has two great niches, it's easier to do descriptive statistics, plotting, ... etc. in it because dplyr, ggplot2 and co have a way more intuitive API than Pandas and matplotlib. The first niche is let's say social science researchers that don't want to dive deep into coding but want to do data analysis and ML
The second niche is cutting edge statistics. Some implementations are only in R land. I think the same applies for MATLAB.
The vast majority of people aren't in both scenarios so I'd just say: learn python 😉
my take is that you cant go wrong with learning more things and languages in particular always have some new ways of thinking about things, tho I personally dont like R, py covers all my bases
I think that if you learn enough languages with similar paradigms, learning an additional one within that paradigm is not very hard, but learning languages that work very differently take time to get used to
Most people don't learn R tho, I never learnt it. I learnt how to do things in it
I haven't touched it in a year but I'm probably still faster at certain things with it than python
there's also different phases right, most modern languages will be optimized so you can learn the basics quickly and be able to use it, but it might take a long time to master them
Tbh i'm not sure really how true that is anymore
I think outside of some super specific industry thing which maybe is bundled up with a bunch of legacy stuff
It seems in general Python can go beyond what R can, especially with the number crunching related tasks, whether that be with Numba + Numpy or even Torch and tensors
sticking with py is a good strategic choice career wise, cuz it is used in a variety of industry scenarios, not just ML and data science
For my thesis I wanted to do a statistical test that had no credible python implementations
For R the implementation was from the author
Again, it's a question of whether or not you care about the advantages R can give you, because they exist but my entire point is that they're niche enough the vast majority of people shouldn't
forward kernel is done (probably), i should throw a party
I'm also hoping Polars becomes better integrated in the ecosystem because what I never got over was how incoherent pandas was
And matplotlib
I wonder if the spreadsheet+python synergy could substitute pandas
The whole idea of an index in pandas is just ... idk
It makes the library deeply strange
I really like excel, and I really like py, so both of them together, could work
I think there's a company that used spreadsheets as their production db for an unreasonably long time
That's just already Pandas/Polars?
I suppose the difference would be that I like using one over the other
if I do like it
Speaking of legacy, I don't know how it is abroad but here Java and C# dominated so most Python projects are Greenfields and data projects
I'm curious how it'll look like 5 and 10 years down the line 😅
i've seen takes that in 5 to 10 years legacy will be much bigger problem because of tools like copilot
well, in excel they're just using the dataframes API, so no improvement
the amount of ways I can shoot myself in the foot in cpp
it's kind of a radical design approach, compared to most "data frame" libraries and frameworks that actively eschew row labels
Do you like it?
i think so. i've gotten used to it, at any rate. the API for multi-indexes is still very much lacking though
i think the ideal scenario would be to not have separated indexes and data, but to have built-in "indexes" in the sense of a database index that can be attached to a data frame
data.table has that, but it's a bit auto-magical, compared to a proper database where you have a relatively wide range of control over what kind of index to use
the polars devs have expressed that they don't want to add that to polars, because they felt like it wasn't necessary (which they are totally wrong about, but it's their library and their choice), but they also said it should be easy enough to build a sidecar index thing that works with polars (not totally wrong)
for example in data.table you get to specify exactly one column as the primary index, which it uses to sort the rows and uses binary search for lookup and join operations. and you get to specify more columns as secondary indexes, but i don't remember offhand what kind of data structure it maintains for those and what kinds of optimizations it provides.
but unlike in pandas the index column does not become separated from the data, it's just the sorting key
at the opposite end of the spectrum is xarray, which is basically pandas but multi-dimensional and fully embracing the separation of "coordinates" (the xarray equivalent of a pandas index) and "values" (data)
so it's kind of use-case-dependent
And I love them for this tbh
i think pandas indexes are great when you have clear obvious choices. it enables you to get much better performance than you otherwise might be able to do with a "dumb" data frame library that lacks a query execution engine
They won performance in different places
and it does help keep things organized by separating "id" columns from everything else, which i find appealing and convenient when "exporting" data to numpy or torch for use in ML. it also makes equi-joins on keys and as-of joins on timestamps very convenient.
so i like it in its own time and place. but i'm not exactly clamoring for the index/data separation in other data frame libraries either. i really just want the option to designate certain columns as "coordinate" columns (without physically separating them from the data) and to define database-style indexes for performance as needed. but at that point maybe i just want to start embedding duckdb.
I also don't like .loc and .iloc and the general error message that goes something like "setting a copy on a slice on a dataframe..."
i think the distinction between loc and iloc is important if you are going to be using the indexes
the error messages and UX/API... yeah
(My obligatory: I just do everything in sql -because- of pandas's ridiculous api)
i think it makes perfect sense if you believe that it makes sense to separate coordinates from data
Just have .filter or .where be how you operate with data
it's not a ridiculous API if you buy into the index-data separation
but if you don't like the index-data separation then frankly i think pandas itself is just not the tool for you. it's a core part of the pandas design, like it or not
that, or you ignore indexes and suffer with slow linear scans for every filtering operation. but you probably should just use polars then.
That's the thing - it's tightly integrated with most of the DS stack you can't get around it
I use polars wherever I can
is it? you could probably just skip pandas entirely except with seaborn and statsmodels
it's very integrated into the community though
so it will be hard to find a DS team that doesn't use it at some point
and it wouldn't be fair to prohibit colleagues from using it
I have my interns using Pandas of course
Pandas is just at the tail end of many of our pipelines for sure, but duckdb is most of the rest of it. If it weren't for duckdb, we'd use polars.
tldr: it makes sense if you don't like the index/coordinate-data separation, but i maintain that the loc/iloc distinction makes sense and is necessary when that separation exists.
UX/API design for interacting with that distinction is another matter
Oh, I didn't argue loc/iloc being different is bad
oh, maybe that was just billy 😆
One is for labels and the other for positions, that much is clear
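A tiny illustration of the labels vs positions point, on a made-up frame with a non-unique index:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[50, 7, 50])

by_label = df.loc[50]       # every row whose index label is 50 (two rows here)
by_position = df.iloc[0]    # the first row, whatever its label is
```

`df.loc[50]` returns a two-row DataFrame because the label appears twice, while `df.iloc[0]` always returns exactly one row as a Series.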
I'm arguing that the idea of having indexes on a df is bad yes
If you have them, then yeah, you need both
Indexes in Pandas' way
yeah, fair. it's like how maybe R didn't need to also be a radical lisp to be a useful stats language
I think my problem is that there are many analytical cases where there's no clear index/data separation
I think being able to add your index ad-hoc / when reading data would've made more sense. Kind of like how data.table does it
Or, more particularly, that the indices vary depending on the query
group_by().reset_index() 😩
(Aside from Pandas I have them doing dbt + duckdb)
My judgement call when selecting their stack was that SQL, dbt and Pandas will go a longer way for them than Polars early career wise (even though it's no secret that's my fav)
wait you can definitely do this
are you doing this for "local" work in projects? i've been thinking about trying it
normally i do DVC + ad-hoc scripts
I mean, polars style API where you could add a DB style index on top if you needed it
local as in not in the cloud/on a VM?
I've refined this a bit more. I'm not sure if I can call x_i the thing itself or the coordinates of the thing in a given basis. I also should preface the 4th paragraph with why I'm defining so much new stuff; the motivation is that I'm defining the memory layout for the tensors
I'm learning data science and my teacher send me this code,
But the accuracy seems to be at 100% which I find impossible. Is there anything wrong with this code?
Oh I can't send a file
import pandas as pd
import numpy as np
import sys
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import learning_curve
from sklearn import metrics
import scikitplot as skplt
from sklearn.model_selection import LearningCurveDisplay, ShuffleSplit
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv',sep=';')
X = df.drop(['k'], axis=1)
y = df['k']
X_train, X_other, Y_train, Y_other = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, Y_test, Y_val = train_test_split(X_other, Y_other, test_size=0.5, random_state=42)
x = X_train['w']
y = X_train['l1']
k = Y_train
for i,row in X_train.iterrows():
    if k[i]==1:
        plt.plot(x[i],y[i],'rx')
    else:
        plt.plot(x[i],y[i],'gx')
plt.xlabel("Weight")
plt.ylabel("First")
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
ac=pd.DataFrame({'k':[],'Accuracy':[]},index = [])
for k in [1,3,5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,Y_train)
    y_pred=knn.predict(X_val)
    accuracy = accuracy_score(Y_val, y_pred)
    row=pd.DataFrame({'k':[k],'Accuracy':[accuracy]})
    ac = pd.concat([ac, row],ignore_index=True,axis=0)
data_p = scaler.transform([[70,16,15]])
knn.predict(data_p)
data_p = scaler.transform([[60,16,15]])
knn.predict(data_p)
ac['Accuracy'].max()
ac[ac['Accuracy']==ac['Accuracy'].max()]
k=int(ac[ac['Accuracy']==ac['Accuracy'].max()]['k'].values[0])
knn_super = KNeighborsClassifier(n_neighbors=k)
knn_super.fit(X_train,Y_train)
y_calculated_class=knn_super.predict(X_test)
y_probability_classs=knn_super.predict_proba(X_test)
knn_super.fit(X_train,Y_train)
y_calculated_class=knn_super.predict(X_train)
y_probability_classs=knn_super.predict_proba(X_train)
mac_pom=confusion_matrix(Y_train,y_calculated_class)
mac_pom
#PPV
for i in range(len(mac_pom)):
    print('PPV',i, mac_pom[i,i]/mac_pom[:,i].sum())
    print('TPR',i,mac_pom[i,i]/(mac_pom[i].sum()))
    print('TNR',i,(mac_pom.sum()-mac_pom[i].sum()-mac_pom[:,i].sum()+mac_pom[i,i])/(mac_pom.sum()-mac_pom[i].sum()))
print(metrics.classification_report(Y_train, y_calculated_class))
fig, ax = plt.subplots(figsize=(2, 2))
cmd = metrics.ConfusionMatrixDisplay.from_predictions(Y_train, y_calculated_class,ax=ax)
plt.savefig('Knn_macierz_pomylek.pdf')
what debugging steps have you tried so far ?
I checked if the test group, validation group and training group have different data and they do. I added more data to 'data.csv'. Even extremely weird and out of order data, and accuracy was still 100%
Honestly I had like one theoretical lesson of data science and no practice and I don't know if the code is correct or not
Have you tried printing out some results ?
Like, take the model and feed it some X and see if it is the correct Y
And also, print the Y
Usually in these cases theres something off with the data and the model converges to outputting some constant
I used a different data set, much larger, and it still gave me a 100% accuracy
honestly I'm still trying to figure out how this exactly works because like I said I had only one theoretical lesson of data science
And I think this was a good data set because I got it from kaggle and it seemed ok
Try printing out the output, it will be clearer from there
I don't want to tell my teacher that his code is wrong unless I'm absolutely sure it is 😅
And 100% accuracy is possible if the dataset is artificially constructed for it
I used two different data sets so it's not likely
What does the output of the model print out ?
While I was looking through it I found that the accuracy for every k is the same as well. My teacher said that it was because of the data given, but I changed and expanded the data. Can the problem lie here?
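Two things in the script as posted look worth checking, and either could plausibly explain this. First, the final report is computed from knn_super.predict(X_train), so it scores the model on the same rows it was fitted on. Second, X_val is never passed through scaler.transform while X_train is, so the validation accuracies are computed on unscaled inputs, which may be why every k scores the same (and then the first k, which is 1, gets picked). Here is a minimal sketch of the first effect on synthetic data; make_classification only stands in for data.csv, which I don't have.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately noisy data (flip_y randomizes 20% of the labels).
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# With k=1 every training point is its own nearest neighbour.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_s, y_train)
train_acc = accuracy_score(y_train, knn.predict(X_train_s))
test_acc = accuracy_score(y_test, knn.predict(X_test_s))
```

`train_acc` comes out at exactly 1.0 even though a fifth of the labels are noise, while `test_acc` is lower. So a 100% score measured on the training rows says nothing about the data or the model; the number to look at is the one computed on the scaled test set.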
either/or. i mean "ad-hoc adding things as you build up a project" rather than in some kind of production pipeline
My resume be like "I use Ricci btw"
Guys, what to do with the dataset that have the number of positive sample way more than the other. Should I downsample it or ignore
Weighted cross entropy loss
Assuming you're using cross entropy loss
Or ... do nothing and look at your metrics to decide (ROC, DET, ...) an operating point
that's my preferred choice over downsampling, upsampling and the like
that. you already have 2000 samples in the smallest case. that's probably enough.
start with not worrying about it
imbalance is less bad than outright not having enough data to make a good decision about one particular class
Whatever you decide to do, avoid using SMOTE. Usually, I prefer tuning the class_weight or scale_pos_weight parameter.
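For reference, this is roughly what the class_weight route looks like in scikit-learn. The 900/100 label split is invented for the example, and LogisticRegression is only one of many estimators that accept class_weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 900 positives, 100 negatives.
y = np.array([1] * 900 + [0] * 100)

# "balanced" uses n_samples / (n_classes * count_of_class) for each class.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))

# The same weighting can be requested directly on the estimator.
clf = LogisticRegression(class_weight="balanced")
```

Here the rare class gets weight 5.0 and the common one about 0.56, so mistakes on the rare class cost roughly nine times more in the loss.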
From the image, it appears you're likely working on sentiment analysis.
If your explanatory variable(s) are text data, you can simply perform data augmentation using TextAttack (the same library used for performing adversarial text attack in NLP)
Im running a mistral 7B instruct model on my tesla p40 (24gb vram) and im getting a CUDA out of memory error for my gpu?
Anyone got an efficient way of turning a Sparse Polars df into the short arrays that can define a CSR array? I tried turning every column into a 1-element list of Structs (each containing a non-zero value and its corresponding Row) but I kept getting weird errors around duplicate columns (apparently this is a known bug with Structs atm)
In the future, please give text as actual text. Not as screenshots.
The error message tells you what the problem is. Something else appears to be using up a bunch of GPU memory.
Fixed a bug. Turns out int(1.9-2)==0 not -1
I was wondering why my stuff was sticking to the 0 line
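For anyone hitting the same thing: int() truncates toward zero, it does not round down. A small sketch of the difference:

```python
import math

value = 1.9 - 2                 # a small negative number, roughly -0.1

truncated = int(value)          # truncates toward zero
floored = math.floor(value)     # rounds toward negative infinity
floor_div = value // 1          # floor division also rounds down (returns a float)
```

`truncated` is 0 while `floored` is -1, so math.floor (or `//`) is the one to use when negative values should land in the bucket below zero.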
What is happening under the hood here? Is this the solution of a DE with fancy boundary conditions?
SQL and pandas are fundamental skills for all data professions, and dbt helps make SQL development more manageable. But I guess this means you're preparing people either to be "full stack" jacks of all trades, or even to move away from data science itself towards analytic engineering, data engineering etc.
import matplotlib.pyplot as plt
import numpy as np
from io import BytesIO
plt.clf()
uber = 8
forge_frag = 1600
nitro_value = 175
for uber in range(8, 12):
    i = 0
    x = 1000
    x_axis = []
    y_axis = []
    lol = False
    lol2 = False
    for z in range(100):
        nitro = (nitro_value * (z + 1))
        i += x
        i += 1500 / (uber - 7) - forge_frag * (10 * (uber - 7) - 10)
        f_p_n = i / nitro
        x_axis.append(f_p_n)
        y_axis.append(nitro)
        x += 2000
    plt.plot(y_axis, x_axis, label=f"Uber {uber}")
plt.axhline(y = 0, color = 'b', linestyle = 'dashed', label = "0 ea")
plt.axhline(y = 160, color = 'r', linestyle = 'dashed', label = "160 ea")
plt.legend()
plt.title("Flux cost per nitro (Vaults)")
plt.xlabel('Nitro')
plt.ylabel('Flux per nitro')
data = BytesIO()
plt.savefig(data, format="png")
data.seek(0)
I have this bit of code, however it's missing another X axis which is supposed to show 0 to 100 in intervals of 5
I wanted that axis to be on top...and grid the whole graph based on it
been looking on Stack Overflow with no real results; it seems to end up swallowing the whole graph and creating a new one over it
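One way to get a second X axis on top without replacing the plot is Axes.secondary_xaxis, which takes a pair of conversion functions. The sketch below assumes the top axis should count steps (nitro divided by 175), which is my reading of the loop; the curve is a placeholder, not the real flux formula.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from io import BytesIO

NITRO_VALUE = 175                # same constant as nitro_value in the script

def nitro_to_step(nitro):
    return np.asarray(nitro) / NITRO_VALUE

def step_to_nitro(step):
    return np.asarray(step) * NITRO_VALUE

fig, ax = plt.subplots()
steps = np.arange(1, 101)
nitro = step_to_nitro(steps)
ax.plot(nitro, 1000 * steps / nitro, label="placeholder curve")
ax.set_xlabel("Nitro")
ax.set_ylabel("Flux per nitro")

# Top axis: same data, relabelled in steps, ticks at 0, 5, ..., 100.
tick_steps = np.arange(0, 101, 5)
top = ax.secondary_xaxis("top", functions=(nitro_to_step, step_to_nitro))
top.set_ticks(tick_steps)
top.set_xlabel("Step")

# A secondary axis does not draw its own grid, so draw the vertical
# grid lines on the main axes at the matching nitro positions.
for position in step_to_nitro(tick_steps):
    ax.axvline(position, color="0.85", linewidth=0.8, zorder=0)
ax.grid(axis="y", linestyle=":")
ax.legend()

data = BytesIO()
fig.savefig(data, format="png")
data.seek(0)
```

Using `fig, ax = plt.subplots()` instead of the bare `plt.` calls is what makes this manageable, since the secondary axis is attached to a specific Axes object.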
the full scaled dot product attention from 2017, I'm gonna use these to show the equivalence to the quadratic form, and then argue in favor of just using the quadratic form and then try to make the case for the further restriction of it being a metric
N_k is not a matrix, probly not the best convention now that I think about it but N acts on an index to produce its maximum range
pandas DataFrame.to_numpy( )
Did this method never go through a deprecation stage? In pandas 2.2 it seems to be no longer available: my code runs into an "object has no attribute" error, and the API reference no longer lists it. In pandas 2.1.2 the method is still present, with no deprecation warning.
Huh, I see what you mean. Actually even weirder, I only see that in the docs - it's in the code still: https://github.com/pandas-dev/pandas/blob/main/pandas/core/frame.py#L1857
are you sure you're actually getting an error accessing it?
You're right, it seems I need to check that my own code is correct first. Thanks for the hints.
tomorrow is gonna be intense, I was supposed to have finished this stuff by yesterday, and this deadline was already a pushback from mid last week
there's actually a good reason for doing it the way they did it: it reduces the number of parameters. when you do Wq@Wk.T you're getting back a dxd matrix, so with their way, as long as you choose a k such that k < d/2, you're using fewer parameters to make a mathematically equivalent layer
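The counting behind that, as a quick sketch (d = 512 and k = 64 are just example sizes; biases are ignored):

```python
def factored_params(d, k):
    # Wq and Wk are each d x k.
    return 2 * d * k

def full_params(d):
    # A single unconstrained d x d matrix M.
    return d * d

d, k = 512, 64
factored = factored_params(d, k)     # 2 * 512 * 64
full = full_params(d)                # 512 * 512
break_even = factored_params(d, d // 2) == full_params(d)
```

With these sizes the factored form uses 65,536 parameters against 262,144 for the full matrix, and the two counts meet exactly at k = d/2. Note that for k < d the product Wq @ Wk.T has rank at most k, so the factored layer matches the score formula but cannot represent every d x d matrix.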
it is possible however, to cook a better and still mathematically equivalent layer with even less parameters if you were to stick with the quadratic form, Im gonna include a small proof of this on my report thing
Guys I need some help. I'm working on a cars dataset to predict the price based on car info. My problem is I don't know how to get insight from the car's Model feature, since there are 2736 unique values and my dataset has 27k rows
what do you mean by car's model feature ?
this how my dataset looks
I think that there will some relation between the Model and the Price
but the problem is there 2736 unique values and as you can see the dataset is very large
@final kiln
that doesn't sound that large I think, isn't it possible to encode 2.7k classes efficiently ?
maybe using embeddings: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
maybe but i want to visualize to spot any pattern
ah you are doing EDA
yeah
maybe PCA can help somehow
for the make, for example, I spotted that even though some makes occur less frequently, they affect the price a lot
like Ferrari, Rolls-Royce
which is logical
but there are only 58 Make
so it was easy
you can likely show this very well with a color coded histogram
x axis would be price bins, y axis would be count
and then each bar would have an assortment of colors showing the percentage that goes to each class

no
this
hmmm
2k classes might still look cluttered even in my proposed plot tho
maybe get the statistics for each class and plot that
like avg, std, etc
then a scatter plot with some error bars
labeled with the car model
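Something along these lines, as a rough sketch. The column names "Model" and "Price" and the toy rows are assumptions; with 2.7k models you would probably also want to plot only the top or bottom N.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real dataset.
df = pd.DataFrame({
    "Model": ["A", "A", "B", "B", "B", "C"],
    "Price": [10.0, 12.0, 20.0, 24.0, 22.0, 100.0],
})

# Per-model summary statistics.
stats = df.groupby("Model")["Price"].agg(["mean", "std", "count"]).sort_values("mean")

# Models with a single listing have no meaningful spread, so filter them out.
reliable = stats[stats["count"] >= 2]

fig, ax = plt.subplots()
ax.errorbar(reliable["mean"], range(len(reliable)), xerr=reliable["std"], fmt="o")
ax.set_yticks(range(len(reliable)))
ax.set_yticklabels(reliable.index)
ax.set_xlabel("Mean price (error bars: 1 std)")
```

Sorting by mean before plotting makes the pattern much easier to read than alphabetical order.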
I want to build unsupervised, semantic-based cluster grouping of key information in uploaded files. I am deciding between ChromaDB and Qdrant for the vector db, and between density-based and centroid-based clustering algorithms. For example, I have 10 uploaded hedge fund files, and I want to add them to the vector db and surface information such as: two advisers said stock X will go up, one said it will go down, etc. Did you solve a similar problem? If yes, or if you have any suggestions for starting out and choosing an optimal tech stack, please lmk. ty
How much data is there?
Can you not fit this into a system locally? If so, I would use Neither and use PyNNDescent with SKlearn to create your clustering pipelines
A) It is faster to search B) less hassle and C) faster construction
Is there anyone online here?
Hey
Yes
I think my teacher is crazy
I'm not that good with Python but sure
Im working on a project
Who doesn't?
Say
but k means clustering doesnt have accuracy for predictions
I mean that question must be wrong
hi guys, is full stack development required to become an ML/AI engineer?
Nah
Hey guys I have a list of similar words, they are similar by typos i.e. INDONAKANO COMPANY, INDOKSANO COMPNY..., I would like to group these similar typo words together. What is the common practice for doing this?
Guys, is it bad to clean the samples for bert-base-uncased for a classification problem like sentiment analysis?
I think Levenshtein distance will do the work just fine here - this will give the measure of similarity
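A minimal pure-Python sketch of that idea: a standard dynamic-programming Levenshtein distance plus a simple greedy grouping. The threshold of 4 edits and the extra "ACME LTD" entry are arbitrary choices for the example, and the greedy pass is just one simple way to form groups.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def group_by_typos(names, max_distance):
    """Greedy grouping: a name joins the first group whose first member is close enough."""
    groups = []
    for name in names:
        for group in groups:
            if levenshtein(name, group[0]) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

names = ["INDONAKANO COMPANY", "INDOKSANO COMPNY", "ACME LTD"]
groups = group_by_typos(names, max_distance=4)
```

For long lists, normalizing the distance by string length and using a dedicated library would scale better than this quadratic loop.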
Hey guys! I am building a NLP virtual assistant. Currently so far I have build till semantic analysis where the machine can understand if my given text is positive or negative. I am trying on how machine can understand the entity and open the applications that I give the query to open. Example = “can you open calculator please?”
By NER I have such output: (S Can/MD you/PRP open/VB calculator/NN please/NN ?/.)
I'm using the NLTK libraries but now idk how I would make a function that will make the machine understand that it has to open the calculator. I was thinking of pattern matching but again it gets very tedious. Am I going about this correctly, or is there anything I should consider for entity recognition that I'm currently lacking? Thank you for your help :D
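The output shown is part-of-speech tagging rather than NER, but it is already enough for a simple rule: take the first verb and the first noun after it. A sketch that works directly on (token, tag) pairs like the ones above; KNOWN_APPS and the function names are hypothetical, and nothing is actually launched here.

```python
# The tagged sentence from the example, as (token, tag) pairs.
tagged = [
    ("Can", "MD"), ("you", "PRP"), ("open", "VB"),
    ("calculator", "NN"), ("please", "NN"), ("?", "."),
]

# Hypothetical mapping from spoken names to whatever launches the app.
KNOWN_APPS = {"calculator": "calc", "notepad": "notepad"}

def extract_intent(tagged_tokens):
    """Return (verb, object) using the first verb and the first noun after it."""
    verb = None
    for token, tag in tagged_tokens:
        if verb is None and tag.startswith("VB"):
            verb = token.lower()
        elif verb is not None and tag.startswith("NN"):
            return verb, token.lower()
    return None

def resolve_command(intent):
    """Map an ("open", app) intent to a launch command, or None if unknown."""
    if intent is None:
        return None
    verb, target = intent
    if verb == "open" and target in KNOWN_APPS:
        return KNOWN_APPS[target]
    return None

intent = extract_intent(tagged)
command = resolve_command(intent)
```

This avoids writing one pattern per sentence, since "open the calculator", "please open calculator" and similar phrasings produce the same verb and noun pair.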
also Im using datasets as nltk corpora
import pandas as pd
Import "pandas" could not be resolved from source
what does that mean 😿
Are you sure pandas is installed? Are you in the right virtual environment?
You can always double check by opening the terminal and running pip list | grep pandas
i just fixed it, i did pip install in the terminal and then closed vscode and reopened it.
when I use knn algorithm do I have to have a validation group?
what is "the validation group" according to your understanding?
Basically my teacher said that when you're using the knn algorithm you should have:
- training group
- Validation group where you check which k is best
- Test group - to test final accuracy of the model
And the validation group messed up my model, so I was thinking if I could check which k is best just on the test group
how did the validation group "mess up your model"?
I'm new to this so I basically have zero experience
Accuracy dropped from 97% to 60%
that doesn't mean that the validation set "messed up your model". the only set that has any influence on the model's behavior is the training set.
if your instructor told you that you need to have a validation set, then you do
but in general, to train a model, you only need a training set. (but if you don't have a test set, then you'll have no way of knowing how well it performs.)
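For reference, here is a sketch of the three-way split workflow the instructor described, using the iris dataset bundled with scikit-learn as a stand-in. One common cause of a large accuracy drop on a new split is forgetting to apply the fitted scaler to it, so the sketch transforms all three sets with the scaler fitted on the training set.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test.
X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=42, stratify=y_other
)

# Fit the scaler on the training set only, then apply it to every split.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (scaler.transform(part) for part in (X_train, X_val, X_test))

# Choose k on the validation set.
val_scores = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    val_scores[k] = accuracy_score(y_val, model.predict(X_val_s))
best_k = max(val_scores, key=val_scores.get)

# Report the final number on the test set, which played no part in choosing k.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test_s))
```

Picking k on the test set instead would make the reported test accuracy optimistic, because the test rows would then have influenced a modelling choice.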
so stel i read this medium article and it said to not use seaborn for "default visualizations"
bc it doesn't generate the most impressive ones. what should i use instead?
idk. I hate making data visualizations
matplotlib sucks
matplotlib looks so ass
but don't take anything on medium for granted. most of the content on there is written by wannabe influencers.
isn't seaborn built on top of mpl?
they all are.
idk i wanna create some more impressive data visualizations for my portfolio
maybe tableau is the answer?
no ofc not, but i think the source i found is onto something
should i put the link here? this guy put it exclusive to medium ppl only
If you want
i think he's right tbh
hi..
can anyone please teach me how to calculate the inter quartile range..
show the code for what you've tried so far.
i was reading book and there came this..i didn't write any code for it
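The interquartile range is just Q3 minus Q1, the width of the middle 50% of the sorted data. A small sketch using only the standard library (the numbers are made up):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 2.75 6.25 3.5
```

Books and libraries use slightly different conventions for where the quartiles fall, so a textbook answer may differ a little from this one. NumPy's `np.percentile(data, [25, 75])` should give the same two numbers with its default settings.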
is this didn't find fair to you?
seaborn looks better than mpl by default
Another option is plotly
there are some weird characters here when i open the file in excel, but when i put it in a pandas dataframe it's fine
also there's missing values... do i impute the values e.g fill in the missing ones with an average? or do i just drop the missing entirely?
Everyone knows they must replace missing values in their dataset before training a machine learning model.
Most people, however, miss one critical step.
This video will show you what you are missing and how to do it better.
this isn't a school project or anything if anyone is concerned about helping me
just my own curiosity
if you always fill in your missing values with imputed data, wouldn't your analysis be skewed?
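One concrete way the analysis gets skewed: mean imputation leaves the mean alone but shrinks the spread, because every gap gets the same value. A tiny sketch with made-up prices:

```python
import statistics

# Hypothetical column with missing entries marked as None.
prices = [9.0, 10.0, None, 12.0, None, 15.0, None, 14.0]

observed = [p for p in prices if p is not None]
mean = statistics.mean(observed)

# Mean imputation: every gap is filled with the same number.
imputed = [mean if p is None else p for p in prices]

print(statistics.mean(observed), statistics.mean(imputed))      # the mean is unchanged
print(statistics.pstdev(observed), statistics.pstdev(imputed))  # the spread shrinks
```

So anything that depends on variance (error bars, KDE shape, correlations) looks tighter than the real data supports, and the more values you fill in, the stronger the effect.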
Math is hard
the old kanye would've imputed everything with the mean
i miss the old kanye
but now i fr don't know what to do
if i drop the values too, there's a problem
I'm out of context here, who's Kanye
i was trying to be funny, i meant me
it's a song, look it up
It could be that the character encoding of Excel isn't what it's supposed to be
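A common cause, assuming the file is UTF-8: pandas reads UTF-8 by default, while Excel tends to guess a legacy encoding unless the file starts with a byte-order mark. Re-saving with `utf-8-sig` adds that mark. The rows and file name below are invented sample data:

```python
import csv
import os
import tempfile

rows = [["State", "City"], ["Nuevo León", "Monterrey"], ["Michoacán", "Morelia"]]

path = os.path.join(tempfile.mkdtemp(), "states.csv")

# "utf-8-sig" writes a byte-order mark first, which Excel uses to detect UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes)  # b'\xef\xbb\xbf'
```

With pandas the equivalent would be `df.to_csv(path, encoding="utf-8-sig")`.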
Ah Kanye west
yea
I have a symmetric matrix, and for some reason my brain can't think of a way to optimize the matrix mul
@wooden sail ?
The matrix is laid out as a 1d array
this is the context
it worked out great when it was M_kk' cuz the result was a number and I could partition the sum, but now Im stuck
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/Users/rahuldas/Desktop/Tortilla Dataset/tortilla_prices.csv")
print(df.head())
print(df.info())
print(df.shape)
print(df.columns)
print(df.dtypes)
sns.distplot(df["Price per kilogram"])
plt.show()
price_per_kilogram_missing = df["Price per kilogram"].isna().sum()
print(price_per_kilogram_missing)
print("hello world")
s is symmetric in cc', and u=F(c,c'), F is the way Im flattening it
!pastein
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
does anyone know why nothing is being outputted?
not even the hello world works
what's the error that it throws
no error either
not even seg fault
nada
that's suss
i posted it in the pastebin up top
i changed distplot to displot
you gonna have to move the print line by line to try to figure out the guilty line
something is crashing the program silently
and wtv it is kill it with fire cuz programs shouldnt do that
yea my thoughts exactly
2024-03-24 11:24:54.258 Python[61800:5006159] WARNING: Secure coding is automatically enabled for restorable state! However, not on all supported macOS versions of this application. Opt-in to secure coding explicitly by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState:.
im just gonna take the L on this one, primary mission objective has been achieved anyway
my guy what in the world of fuck is that
looks like a warning, shouldnt be stopping the program
but also, mac strikes again
I think I might prefer remaining unemployed than being forced to use one again
nah
ok i think i got the number. 6390.
there are 6390 rows of data missing for that specific column of Price Per Kilogram
based on that, is there any way i can conclude i should be imputing?
fuck it we ball, i'll impute anyways
i don't get it
how strong is your statistical background?
very basic
yea i would recommend augmenting that with some extra work w a course or two
that book really isn't beginner friendly
if you know the matrix ahead of time and it's manageably small, you can think of diagonalizing it. aside from that, i think computing either y^T(Mx) or (y^TM)x is much more efficient than looping for every single element as naive einstein notation would suggest. if you're coding it yourself, you might consider using strassen's algorithm for whichever product you associate with the matrix
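For the flattened symmetric case, here is a sketch of computing y^T M x while reading each stored entry once. The flattening F from the chat isn't shown, so this assumes a row-major packed upper triangle. `packed_index` and `bilinear_packed` are names invented for the example, and it is plain Python rather than a GPU kernel.

```python
def packed_index(i, j, n):
    """Position of M[i][j] (with i <= j) in a row-major packed upper triangle."""
    return i * n - i * (i - 1) // 2 + (j - i)

def bilinear_packed(packed, x, y, n):
    """Compute y^T M x, reading each stored entry of the symmetric M once."""
    total = 0.0
    for i in range(n):
        total += packed[packed_index(i, i, n)] * x[i] * y[i]
        for j in range(i + 1, n):
            # M[i][j] == M[j][i], so one load covers both off-diagonal terms.
            total += packed[packed_index(i, j, n)] * (y[i] * x[j] + y[j] * x[i])
    return total

# Example: M = [[2, 1, 0], [1, 3, 4], [0, 4, 5]] stored as its upper triangle.
packed = [2.0, 1.0, 0.0, 3.0, 4.0, 5.0]
x, y = [1.0, 2.0, 3.0], [1.0, 0.0, -1.0]
print(bilinear_packed(packed, x, y, 3))  # -19.0
```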
i pinged the wrong person
@final kiln
no but im getting these through the gpu, im coding my own kernels so im not actually looping
im taking the L on it tho, I've already got what I wanted to do done
actually, that does give me the idea, I could just like, not care about memory and tank the repeated calculations, I'd still be squeezing out performance cuz I'd be doing two matrix mul in one operation, I'd just take the same amount of memory instead of half
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/Users/rahuldas/Desktop/Tortilla Dataset/tortilla_prices.csv")
sns.set(rc={
"figure.figsize" : (11.7, 8.27)
})
print(df.head())
print(df.info())
print(df.shape)
print(df.columns)
print(df.dtypes)
print("hello world")
price_per_kilogram_missing = df["Price per kilogram"].isna().sum()
print(price_per_kilogram_missing)
price_per_kilogram_missing_mean = df["Price per kilogram"].mean()
print(price_per_kilogram_missing_mean)
df["Price per kilogram"] = df["Price per kilogram"].fillna(price_per_kilogram_missing_mean)
print(df["Price per kilogram"].isna().sum())
sns.kdeplot(df["Price per kilogram"], fill=True)
#plt.show()
sns.barplot(df, x = "Price per kilogram", y = "State", hue = "State")
plt.show()
anyone have any ideas on how to make this visualization more eye-friendly?
i might remove the error bars
but when i did i still have one
aguascalientes
weird ash
what would you say i should do?
just curious
i mean there's 32 cities here
maybe show the top 5 values?
i think im making this more complicated than it needs to be, I can take the derivative with respect to one of the symbols already present, cuz it's symmetric with respect to the choice of either p
let me see
here's the OG plot
the bars def occupy too much space and the rainbow doesn't look useful
that's what i'm thinking too
i always run into this problem with data visualizations
potentially normalize the height with respect to the largest bar
whipped up a quick graph of what it could look like instead
the names are the problem tho
remove the gradient, just a solid color is standard
@final kiln GIGACHAD NAME bro
like, if it has no information it is not useful
is gigachad good? im getting old I cant keep up with the memeology no more
"cool name"
ty
this was not the case, I can't just pick one of the symbols cuz then I lose the case where they are equal
this is likely correct, when c = c' it becomes the derivative of x**2, which is 2x
kind of a crazy way to express it tho
better?
weird
i can't get it to have the highest value at the top
they should have the same color, like blue or black
and try dividing by the height of the largest bar
not sure what that means
price per kg = (price per kg) / max(price per kg )
im gonna assume my crazy equations are correct and move on, cuz I gotta get stuff done
I'll have the opportunity to refine them once I have a unit test on the entire layer
that did something weird
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import colorcet as cc
df = pd.read_csv("/Users/rahuldas/Desktop/Tortilla Dataset/tortilla_prices.csv")
sns.set(rc={
"figure.figsize" : (11.7, 8.27)
})
print(df.head())
print(df.info())
print(df.shape)
print(df.columns)
print(df.dtypes)
print("hello world")
price_per_kilogram_missing = df["Price per kilogram"].isna().sum()
print(price_per_kilogram_missing)
price_per_kilogram_missing_mean = df["Price per kilogram"].mean()
print(price_per_kilogram_missing_mean)
df = df.sort_values("Price per kilogram", ascending = False)
df["Price per kilogram"] = df["Price per kilogram"].fillna(price_per_kilogram_missing_mean)
print(df["Price per kilogram"].isna().sum())
df["Price per kilogram"] = (df["Price per kilogram"])/(df["Price per kilogram"].max())
sns.set_style("whitegrid")
sns.kdeplot(df["Price per kilogram"], fill=True);
#plt.show()
sns.barplot(df,estimator=np.median, x = "Price per kilogram", y = "State", color = "blue");
sns.despine(left = True
);
plt.show()
what in the world is that
at the top part of the graph
very odd behavior
i don't think it likes it
if none of the bars hit the value of 1, you did something wrong
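A likely reason no bar reaches 1: the script divides every raw row by the global maximum and only then takes a per-state median, so each bar is a median of values below 1. Aggregating first and normalising afterwards fixes that, and sorting the aggregated values also puts the highest one on top. A sketch on a made-up sample with the same column names:

```python
import pandas as pd

# Small invented sample in the same shape as the tortilla data.
df = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C", "C"],
    "Price per kilogram": [10.0, 12.0, 20.0, 18.0, 5.0, 7.0],
})

# One number per state first, sorted so the largest comes first.
medians = (
    df.groupby("State")["Price per kilogram"]
    .median()
    .sort_values(ascending=False)
)

# Normalise the aggregated values, so the top bar is exactly 1.
relative = medians / medians.max()
print(relative)

# For plotting, something like:
# sns.barplot(x=relative.values, y=relative.index, color="steelblue")
```

If you would rather keep plotting the raw dataframe, passing `order=medians.index` to `sns.barplot` should give the same top-to-bottom ordering.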
draft
and now im gonna curl up and cry cuz I still didnt get everything done and that means im gonna be working on this til nite
there's something that doesn't sit quite right with me tho, even if they are equivalent, isn't the one that performs a projection to a lower dimensional space still a bottleneck ? like, if the embedding dimension is 1000000, and the space where it's projected to has 10 dimensions, like, there should be a higher risk of loss of information
even tho I can multiply both weights and get a 1000000x1000000 matrix
I'm willing to accept that in the ideal math world I can funnel and recover any information regardless of how tight the bottleneck is, but I think there is gonna be some sort of limit IRL, if nothing else, just coming out of the fact that IRL we dont operate on the reals, we operate on the lattice of floating point numbers
im sure someone has already figured out this stuff
i thought scipy optimize minimize is some insane math but all its actually doing is changing every single parameter and seeing how it affects the loss and that is what all solvers do apparently
it just changes them one by one
i mean, it depends on the algorithm
anyone have good recommendation for resources on time series forecasting in python? I have a course by jm portilla on udemy but that one is kinda outdated now.
i tried all of ones that dont require gradient, i made it recreate an image, and then train a pytorch model, so i tried "Nelder-Mead", "Powell", "CG","BFGS","TNC","COBYLA" and "SLSQP", and i logged the image it outputs after each time it evaluates it, it changes each pixel one at a time and then puts it back to original value and does the next pixel
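What that log is probably showing, at least for the gradient-based methods like BFGS and CG: when no gradient function (`jac`) is passed, the gradient is estimated by finite differences, which means bumping one parameter at a time and re-evaluating the loss. A minimal sketch of that estimate (the helper and the toy loss are invented, this is not SciPy's actual code):

```python
def finite_difference_gradient(f, x, eps=1e-6):
    """Forward-difference estimate of the gradient of f at x."""
    base = f(x)
    grad = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps  # nudge one parameter...
        grad.append((f(bumped) - base) / eps)
        # ...and the next iteration starts again from the original x.
    return grad

def loss(p):
    return (p[0] - 3.0) ** 2 + 2.0 * (p[1] + 1.0) ** 2

print(finite_difference_gradient(loss, [0.0, 0.0]))  # close to [-6.0, 4.0]
```

Each gradient estimate costs one loss evaluation per parameter plus one, which is why it crawls on an image. The actual math is in what the solver does with that gradient afterwards. Nelder-Mead and Powell are derivative-free and probe coordinates in their own ways, so they can look similar in a log.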
hey how can i handle a memory error? it says unable to allocate 2.01 gigs
but i have 16 gigs
and i was monitoring on task manager enough memory was there
Can anyone help with conda installation? I have two different ones installed and idk which one to keep
I think any will do
Unless you've installed stuff that took you a long time to install or anything like that
I have found that everything can be put into words that make the thing sound simple, but when you get into the meat of it, you realize it's actually pretty complicated
How are you guys doing
Im getting a Out of Memory error, im trying to run Mistral-7B-Instruct-v0.2 model and i have a tesla p40 (24gb vram):
OutOfMemoryError Traceback (most recent call last)
Cell In[7], line 2
1 model_inputs = encodeds.to(device)
----> 2 model.to(device)
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.87 GiB of which 47.00 MiB is free. Including non-PyTorch memory, this process has 23.82 GiB memory in use. Of the allocated memory 23.68 GiB is allocated by PyTorch, and 1.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Run nvidia-smi, see if any other process is eating away at the memory
Its only the current task im doing
When dealing with the gini index for the purposes of deciding a split in a decision tree, you compute the gini index of samples on either side of the split, then take a weighted average of the indexes. However, a gini index is supposed to be the probability of a sample being misclassified - that is, the probability of random.choice(samples).class_ != random.choice(samples).class_ - the correct way to compute this for the two splits would be a different formula entirely. Why is the weighted average used?
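One way to read the weighted average, offered as an interpretation rather than the textbook justification: after the split, a sample only ever gets compared with samples in its own child. Pick a sample at random (it lands in a child with probability proportional to that child's size), draw a second sample from the same child, and the chance that their classes differ is exactly the size-weighted average of the child Gini values. A small sketch with made-up labels:

```python
from collections import Counter

def gini(labels):
    """Probability that two samples drawn with replacement have different classes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted average of the child impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

left, right = [0, 0, 0, 1], [1, 1]
before = gini(left + right)      # impurity of the parent node
after = split_gini(left, right)  # impurity after the split
print(round(before, 3), round(after, 3))
```

The split is then scored by how much `after` drops below `before`.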
wait, how do you deal with the gini index
features = ["SEX","AGE","YEAR","EDUC","INCWAGE","WKSWORK2"]
sampling_weight = 'ASECWT'
df = sampled_df[features + [sampling_weight]]
df.isna().sum()
df = df[
(df['AGE'].between(21, 64)) &
(df['INCWAGE'].between(0, 99999998)) &
((df['WKSWORK2'] >= 1) & (df['WKSWORK2'] <= 6))
]
df['EDUC'] = df['EDUC']
df = df.drop('ASECWT',axis=1)
import numpy as np
df.info()
df.isna().sum()
a = 0.10
df['EDUC'] = df['EDUC'] * a
df['LOG_INCWAGE'] = np.log1p(df['INCWAGE'])
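# keep the mean in a variable so the lambda below can actually see it
mean_hours_worked = df['WKSWORK2'].mean()
df['mean_hours_worked'] = mean_hours_worked
df['WKSWORK2'] = df['WKSWORK2'].apply(lambda x: mean_hours_worked if 40 <= x <= 46 else x)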
print(df)
df['LOG_INCWAGE_PER_WEEK'] = df['LOG_INCWAGE'] - np.log1p(df['WKSWORK2'])
df['MARKET_EXP'] = df['AGE'] - df['EDUC'] - 6
print(df)
print(df.dtypes)
df['YEAR_OF_EXP_SQUAREd'] = df['MARKET_EXP']**2
Weird, I think I've been able to run that model even on my CPU
With 8gb of memory that also is occupied with the rest of the system
Tho I did make a larger a
Swap file
should i try this?
Do you have more code before model.to(device)?
yh one second
No, the swap file was for normal ram to dump memory
heres all the code:
from transformers import AutoTokenizer, AutoModelForCausalLM
device = 'cuda'
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)
model.to(device)
Ah Ive never used the transformers lib. But, aren't you potentially loading the same model twice ?
which library do you use?
I've been coding transformers from scratch, but also, I've used Ollama to experiment with the open source models
Not sure if you can use Ollama to finetune
I've heard a lot about Ollama, I haven't tried it yet.
by loading twice do you mean the .to(device) lines?
I mean the from_pretrained calls, don't they load models directly to the GPU?
Try to debug this line by line, start with loading only one model to the GPU
And see the effect of that on the memory
There's gonna be a line where it gets to 20gb, which I think it shouldn't right
Thanks ill give it a go
Yeah Mistral 7b should be like 4gb on the gpu
Just running the tokenizer line it takes up 150mb lol. And it successfully encodes my message to a tensor
is this with any quantisation? im probably mistaken but i thought the formula was that a 7b parameter model at 2 bytes per parameter would mean (7*2) 14gb memory
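The back-of-the-envelope rule is parameters times bytes per parameter, and that covers the weights only (activations and the KV cache come on top). A quick sketch, rounding the model to 7e9 parameters as in the chat:

```python
def weights_gib(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 2**30

n_params = 7e9  # "7B", rounded as in the chat

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:8s} {weights_gib(n_params, nbytes):5.1f} GiB")
```

Assuming `from_pretrained` loaded the weights in float32 (the default in the transformers versions I know of when no dtype is given), that is roughly 26 GiB, which does not fit in the 23.87 GiB the error reports. Passing `torch_dtype=torch.float16` to `from_pretrained` would be worth trying. The "like 4gb" figure matches a 4-bit quantised model.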
dont know, im just following this table
how do you add weights to variables in python
this is true neither in math nor irl
especially in the linear case, the recoverability conditions are well known
Must be the case when you can decompose a matrix into two, making a bottleneck
the link between unique recoverability of high dimensional vectors from low dimensional ones is through so-called "sparse recovery", where the constraint is that the projection matrix needs to satisfy special identifiability conditions and the vectors you're looking for in high dimensions are sparse or have a sparse linear representation
the property can be thought of as approximately preserving distances between the vectors in the original vector space even after projecting them to a lower dimensional one
a popular formulation is via the "restricted isometry property" using the johnson-lindenstrauss lemma
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv("/Users/rahuldas/Desktop/Tortilla Dataset/tortilla_prices.csv")
print(df.head())
print(df.info())
print(df.shape)
print(df.columns)
print(df.dtypes)
print("hello world")
price_per_kilogram_missing = df["Price per kilogram"].isna().sum()
print(price_per_kilogram_missing)
price_per_kilogram_missing_mean = df["Price per kilogram"].mean()
print(price_per_kilogram_missing_mean)
df["Price per kilogram"] = df["Price per kilogram"].fillna(price_per_kilogram_missing_mean)
print(df["Price per kilogram"].isna().sum())
sns.set_style("whitegrid")
sns.kdeplot(df["Price per kilogram"], fill=True);
#plt.show()
fig, ax = plt.subplots(figsize=(6, 6))
# drawing the plot
sns.boxplot(data=df, x = "Store type", y = "Price per kilogram", color = "lightblue", ax=ax);
plt.xticks(rotation=90)
sns.despine(left=True, right=True, top=True, bottom=True)
#plt.show()
df["Date"] = pd.to_datetime(df[["Year", "Month", "Day"]])
print(df.columns)
print(df.head())
sns.lineplot(x = "Date", y = "Price per kilogram", data=df)
plt.show()
Traceback (most recent call last):
File "/Users/rahuldas/Desktop/Tortilla Dataset/Tortilla Data Analysis.py", line 34, in <module>
sns.lineplot(x = "Date", y = "Price per kilogram", data=df)
File "/Users/rahuldas/Library/Python/3.9/lib/python/site-packages/seaborn/relational.py", line 508, in lineplot
p._attach(ax)
File "/Users/rahuldas/Library/Python/3.9/lib/python/site-packages/seaborn/_base.py", line 1135, in _attach
converter.update_units(seed_data)
File "/Library/Python/3.9/site-packages/matplotlib/axis.py", line 1717, in update_units
self._update_axisinfo()
File "/Library/Python/3.9/site-packages/matplotlib/axis.py", line 1729, in _update_axisinfo
info = self.converter.axisinfo(self.units, self)
File "/Library/Python/3.9/site-packages/matplotlib/dates.py", line 1882, in axisinfo
return self._get_converter().axisinfo(*args, **kwargs)
File "/Library/Python/3.9/site-packages/matplotlib/dates.py", line 1799, in axisinfo
majloc = AutoDateLocator(tz=tz,
File "/Library/Python/3.9/site-packages/matplotlib/dates.py", line 1333, in __init__
super().__init__(tz=tz)
File "/Library/Python/3.9/site-packages/matplotlib/dates.py", line 1132, in __init__
self.tz = _get_tzinfo(tz)
File "/Library/Python/3.9/site-packages/matplotlib/dates.py", line 236, in _get_tzinfo
raise TypeError(f"tz must be string or tzinfo subclass, not {tz!r}.")
TypeError: tz must be string or tzinfo subclass, not <matplotlib.category.UnitData object at 0x1291503a0>.
(base) rahuldas@Das ~ %
what does this error mean?
some kind of type mismatch
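More specifically, my best guess from the traceback: the boxplot was drawn on `ax` with a categorical x-axis ("Store type"), `plt.show()` is commented out, and `sns.lineplot` without an `ax=` draws on that same current axes. Matplotlib then tries to set up a date axis on top of category units, which is where the `UnitData` in the message comes from. Giving the date plot its own figure and axes should avoid it. A sketch with plain matplotlib and made-up rows (the store names and values are invented):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows with the same column names as in the script above.
df = pd.DataFrame({
    "Store type": ["Small shop", "Supermarket", "Small shop", "Supermarket"],
    "Price per kilogram": [10.0, 9.5, 11.0, 10.2],
    "Date": pd.to_datetime(["2007-01-10", "2007-01-10", "2007-02-10", "2007-02-10"]),
})

# First chart: categorical x-axis.
fig_cat, ax_cat = plt.subplots(figsize=(6, 6))
ax_cat.bar(df["Store type"], df["Price per kilogram"])

# Second chart: the dates get their own figure and axes.
fig_date, ax_date = plt.subplots(figsize=(8, 4))
ax_date.plot(df["Date"], df["Price per kilogram"])
```

With seaborn the same idea is `sns.lineplot(x="Date", y="Price per kilogram", data=df, ax=ax_date)`.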
I thought there was an isometry between any Rn to any other Rm
Not isometry, wait
Ah idk, I thought you could say they have the same cardinality
even with n = m, general matrices are not invertible
with n != m, they cannot be invertible
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State 278886 non-null object
1 City 278886 non-null object
2 Year 278886 non-null int64
3 Month 278886 non-null int64
4 Day 278886 non-null int64
5 Store type 278886 non-null object
6 Price per kilogram 278886 non-null float64
7 Date 278886 non-null datetime64[ns]
one looks for special conditions under which left invertibility is possible
I might've not expressed what I meant correctly
If I have a linear map from Rn to Rm
seemingly it doesn't
That's a matrix right, and I can decompose it into two matrix that multiply into the original
the cardinality of a vector space is card(field)^dimension, so also the cardinality is not the same btw
So that means I have a map like
Rn -> Rk -> Rm
yeah, and neither of the two will be invertible
If k is very small, I don't find it intuitive that this composition could represent the original one
Because there's information being compressed, whereas in the original, there was not
that's what i'm telling you, the operation is not invertible in general
I'm saying I didn't express what I meant correctly
This
There's compression going on right, so the second transformation must be recovering something from the first
for one, it cannot be done linearly
Not sure I follow
the dimension of the intermediate R^k also has to satisfy special properties
how's your linear algebra?
No like, n -> k and then k-> m
if the dimension of the vector space the data is in originally is larger than k, then you irremediably lose data and can't do anything about it
The second dimension matches the first from the second matrix
yes, that's what i'm telling you
what you're asking is exactly what the johnson-lindenstrauss lemma discusses
So again, I don't find it intuitive that the composition can fully represent the n -> m
But it's possible cuz it's a matrix mul
you'll need to review your linear algebra
I disagree, understanding the math is different from having an intuitive picture of it
i gave you an intuitive explanation in terms of isometry too but ok
at any rate, if you look up sparse recovery you'll find any level of abstraction and detail you like about the topic
Different people will have different thresholds for what is or what is not intuitive
I think the explanation is gonna be that the set of maps that you can build, which come from a composition of linear maps like this one, and where k is much smaller than both m and n, are not gonna be very complicated from the get go, so there's not a lot of information that needs to flow from one side to the other
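Both sides of this exchange fit into one rank bound. The notation (A, B, W) is mine, not from the chat: write the composed map Rn -> Rk -> Rm as a product of two matrices.

```latex
W = AB, \qquad A \in \mathbb{R}^{m \times k}, \quad B \in \mathbb{R}^{k \times n}
\;\Longrightarrow\;
\operatorname{rank}(W) \le \min\bigl(\operatorname{rank}(A), \operatorname{rank}(B)\bigr) \le k

\dim \ker W \;\ge\; n - k
\qquad\text{(rank-nullity)}

\underbrace{k\,(m + n)}_{\text{parameters in } A,\, B}
\quad\text{vs.}\quad
\underbrace{m\,n}_{\text{parameters in a full } W}
```

So a composition through a small k cannot represent an arbitrary n -> m map, only the ones of rank at most k. Those maps already ignore an (n - k)-dimensional set of input directions, which is the "not a lot of information needs to flow" intuition, and it is also why nothing lost in the first step can be recovered linearly in the second.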
Regarding this question, I built a pipeline that is not giving me the results I was hoping for. Pipeline is following:
Input:
List of file IDs for document body extraction.
Steps:
1) get_document_sentences()
- Aggregates sentences within documents by semantic similarity.
- Outputs a list of lists, where the inner list contains semantically grouped sentences, and the outer list aggregates these groups across documents.
- Sentences are initially split using '. ' (dot_split).
- Semantically similar sentences within a document are grouped using vectorization and clustering (semantic_split).
2) cluster_sentences()
- Further clusters the semantically grouped sentences across all documents to identify broader themes or contexts.
- Takes as input a list of lists, where each list represents a context, and clusters these contexts across documents using specified clustering techniques (e.g., agglomerative, DBSCAN, kMeans).
- Outputs a list of lists, with each inner list containing sentences from various documents that share a similar context.
Output:
List of semantically grouped lists.
Problem is that I get one dense "centroid" cluster and the rest are very sparse, and I dont have optimal number of clusters. I fine-tune it for one example group, but I overfit it and cant generalize
These are hyperparams I use:
methods = [
('agglomerative', {'distance_threshold': 1.2, 'linkage': 'ward'}),
('dbscan', {'eps': 4.0, 'min_samples': 2, 'n_neighbors': 50}),
('kmeans', {'n_clusters': 30})
]
def cluster_sentences(sentences, cluster_method='agglomerative', **kwargs):
    sentence_vectors = vectorize_text(sentences)
    if cluster_method == 'agglomerative':
        model = AgglomerativeClustering(n_clusters=None, **kwargs)
    elif cluster_method == 'dbscan':
        n_neighbors = min(len(sentences), kwargs.pop('n_neighbors'))
        nn_descent = NNDescent(sentence_vectors, n_neighbors=n_neighbors, metric='euclidean')
        distances, indices = nn_descent.neighbor_graph
        n_samples = sentence_vectors.shape[0]
        indptr = np.arange(0, n_samples * n_neighbors + 1, n_neighbors)
        precomputed_distance_matrix = csr_matrix((distances.ravel(), indices.ravel(), indptr), shape=(n_samples, n_samples))
        precomputed_distance_matrix = sort_graph_by_row_values(precomputed_distance_matrix)
        model = DBSCAN(metric='precomputed', **kwargs)
    elif cluster_method == 'kmeans':
        n_clusters = min(len(sentences), kwargs.get('n_clusters'))
        kwargs.pop('n_clusters', None)
        model = KMeans(n_clusters=n_clusters, **kwargs)
    else:
        raise ValueError("Unsupported clustering method.")
    if cluster_method != 'dbscan':
        model.fit(sentence_vectors)
        labels = model.labels_
    else:
        labels = model.fit_predict(precomputed_distance_matrix)
    return labels
I mean I can experiment with hyperparam tuning like gridsearch,random search, kfolds etc, but I am not sure how to establish validation metrics for unsupervised learning
If anyone did something similiar to what I am trying to build, please let me know if i fkd up pipeline logic
How to find an optimal number of clusters
How to evaluate clustering?
Everything I found about evaluation and hyperparam optimization is about supervised learning
Do I start labeling data?
elbow method
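To spell that out a bit: the elbow method plots KMeans inertia against k and looks for the bend, and the silhouette score is another label-free metric you can maximise directly. A sketch on synthetic blobs standing in for the sentence vectors (the data and the range of k are made up, and real embeddings will be much messier than this):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the sentence vectors: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is what the elbow method plots; silhouette needs no labels either.
    scores[k] = silhouette_score(X, model.labels_)
    print(k, round(model.inertia_, 1), round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

The same `silhouette_score(X, labels)` call works on the labels from agglomerative clustering or DBSCAN, so it can serve as the validation metric for a hyperparameter search without labeling anything.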
from analysis.inception import InceptionV3 isnt working in my python script:
PS C:\users\zayga\VATr-pp-main> python .\generate.py text --text hello
Traceback (most recent call last):
File "C:\users\zayga\VATr-pp-main\generate.py", line 2, in <module>
from generate import generate_text, generate_authors, generate_fid, generate_page, generate_ocr, generate_ocr_msgpack
File "C:\users\zayga\VATr-pp-main\generate\__init__.py", line 1, in <module>
from generate.text import generate_text
File "C:\users\zayga\VATr-pp-main\generate\text.py", line 5, in <module>
from generate.writer import Writer
File "C:\users\zayga\VATr-pp-main\generate\writer.py", line 15, in <module>
from models.model import VATr
File "C:\users\zayga\VATr-pp-main\models\model.py", line 7, in <module>
from analysis.inception import InceptionV3
ModuleNotFoundError: No module named 'analysis.inception'
can sb help
dm me if you can because im exhausted ty sm if you can
hi guys so im working on a chatbot project and this is the error that i got:
```
You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface.
Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`
A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742
```
i tried doing the pip install openai==0.28 but it didnt work, and the tutorial on github is hard to understand, can someone explain to me
right, with n and m > k, a low dimensional projection is performed. that also means the overall matrix is rank-deficient and non invertible, even if square
it would also mean that you have to work with a positive semidefinite metric tensor instead of a positive definite one. otherwise you lose the low dimensional projection, which arguably is what the authors intended to have
you had mentioned an approach where the matrices are identities and then they use max pooling. that's the same as a low dimensional projection
Hi everybody, I have a question. I have to buy a new laptop. I want to do some deep learning and some CNN for computer vision. Do you think it make sense to buy one with an Nvidia GPU such as RTX 4050 ? Would that be enough to train models ? or is it better to have a dedicated server or a googlecolab with GPU to do that ?
I'm gonna re read their paper, but from what I recall they don't motivate their choices, tho their code and choice of hyperparameters is very telling, they always use the k that make the projection more efficient than using a quadratic directly - I even have a suspicion they started with quadratic and came up with this, but my impression is that they were just building on top of previous approaches and "accidentally" stumbled upon this
A low end gpu, nvidia 8gb vram, is very handy to have around for smaller models and general proof of concept work
but no reason to go overboard and you can get by with not having it, I don't and my setup is super efficient, I must spend on average less than 5 cents per day on gpu, some days I use more than others ofc, but avg it out and if you use it mindfully with a good setup, it's much cheaper
furthermore, and this is my personal take, the trend is gonna be towards decentralized ML training
if nothing else, I'll personally make it happen since I've had the idea in the back of my mind for while, but there's quite a lot of smart people pushing for it already
do they say anything about "embedding" or "subspace" or "low dimension"? in any case, you achieve the same effect by making your tensor low rank
thanks for your reply. That's what I was thinking. Also, GPU comes in with gaming laptop, which are huge and heavy. I agree on the decentralized ML. That was my first thought
im gonna check
embedding they mention for sure ofc
the word space appears once when talking about the 1/sqrt()
"Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality" they do mention it
yeah in fact I think the "multi-headed" part was also central to the transformer innovation, which makes sense
I gotta re read the entire thing, they actually do motivate some of these things quite well
how much vRAM do you get with a gaming laptop ?
6 to 8 GB from what I saw
that's pretty small in the context of ML, but still useful, and there's also cool tricks like gradient accumulation that let you train on more data than what fits in the gpu
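A toy illustration of why gradient accumulation works, in plain Python with a one-parameter model and made-up data: averaging the gradients of equally sized micro-batches gives the same gradient as the full batch, but only one micro-batch has to be in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.0

# Full batch: needs all four samples "in memory" at once.
full = grad_mse(w, data)

# Accumulation: go through two micro-batches of equal size, average their
# gradients, and only then take the optimizer step.
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(full, accumulated)
```

In PyTorch this corresponds to calling `loss.backward()` on several micro-batches (with the loss scaled down accordingly) before a single `optimizer.step()`.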
ok, did not know that
I will keep investigating, see if I can make my mind
thanks for your help 😄
hello people, im trying to get more practical experience with pytorch and I have a question about the conventional way to select a row of a tensor. I have a StandardScaler which has been fit on a dataset with (x,y) rows,cols. I also have a dataset with a getitem. It gets a row from my torch tensor, but in order to scale it, I need to transform the selected "row" from shape (y) into (1,y). (If im doing something weird here let me know)
My question then is, whats the more standard way to do it?
x = X_train[i : i+1], or
x = X_train[i].reshape(1, -1)
the latter is IMO more readable.
I need to know how they got these calculations.
Can somebody help me?