#data-science-and-ml
1 messages Β· Page 134 of 1
If the 2000 data points are a representative sample, then it can work
There's studies using 5-10 data points out there
[38400] per sample
The size is a concern but not everything
ok but in context
the data is 50 text columns
each with a bert encoding of 768
so [2000,50,768]
[2000,38400]
Do you know what internal and external validity is?
hmm but i think a bigger dataset would be more valid over a smaller one
i might not know the terms can you talk about it?
yea pretty much
It's worth googling because if you're writing a report I'd use those specific terms to reason about whether or not your dataset is big enough
yea it works great on validation set at only 200 epoch
ill look it up thanks
At the end of the day, this part isn't an exact science and you have to just make solid arguments
yeah but it is outside of training data
Internal validity refers to the degree of confidence that the causal relationship being tested is trustworthy and not influenced by other factors or variables. External validity refers to the extent to which results from a study can be applied (generalized) to other situations, groups, or events.
This is a general concept in science / stats
You can apply it here
Can you generalize the results on 20 data points to other data?
ok let me explain
i have a 2000 sample dataset
i test train on it
then when the model has learnt, i validate it externally on 20 samples as it would be if it were a product
the external 20 samples are not present in the 2000 sample dataset and cant be learnt or overfit
What does "I test train on it" mean?
the usual test/train cycle with backprop
So the regular training cycle
You mean that 500 data points are used as a validation set for early stopping or so?
ok i guess im not clear
Yeah, you're using a bit of non-standard lingo which makes it a bit hard to understand
But we'll get there
in this train is 1500
val is 500
test is 20
all samples are exclusive to their set
Okay this is clear now
This is still valid then (external validity)
Do you believe the result holds for other situations?
the data is self generated mostly and it is difficult to extract
The larger the dataset the more confidence in terms of external validity
i do believe it
that is the issue
Then you should motivate why
That's all, you should motivate why you believe the results are valid
is my split effiecient
Maybe the 20 data points are really representative for the population? Unlikely but possible π
i dont want the test set to be larger than it is. am i right for thinking that?
hmm
Actually look at those 20 data points
and?
the 20 datapoint are very diverse in terms of the model
Just look at them, qualitatively
And ask yourself if they're representable for your entire population
That gives you a basis to reason about external validity
if some one wants data science 50 tb drive dm me
it contains a whole lot of cool stuff
don't advertise shady stuff. we discourage dms here as well
yuh
Ok I'll bite, what's in the 50 tb drive?
inb4 "a single picture of your mom"
lmao
hey so I finished training and testing my model but it still needs alot of work I think. the AP ( Average Precision ) metric score it returns is 30, which seems quite bad but when I test it against random images it works very well and not in an overfitting sense as the outlining for the image predictions of facial areas isn't super rigged but rather a little abstract
mostly for @vagrant root and @hearty depot
or anyone if anyone can help with this π
Is your rcnn for face detection?
Precision is 30 for which data?
test or val?
yoo this is soo cool
when working with LSTMs does it treat every sequence independently so if there's a pattern that is happening over several consequences it won't be able to captures it?
what does each sequence represent? a sentence of text?
if there's a pattern that exists consistently across training instances, the model is supposed to learn that. but the order in which the model sees each training instance shouldn't make a difference.
Short answer is yes; how* you want the LSTM to notice the patterns is a multiple choice
https://stackoverflow.com/questions/44147827/can-i-reinforce-good-patterns-recognition-in-lstm
consecutive timesteps (chart candlesticks)
idk then. I've never used LSTMs for timeseries data.
oh ok
LSTMs is really only used for Timeseries, so you'll be fine
What did you use for timeseries
Noted
Rn I am facing a data leakage issue so can't even evaluate it properly xd
Pinescript V5
All the popular libraries in Python aren't that great compared to just Numpy/T-Flow
Noted but I don't get how you used pinescript instead of LSTMs
Oh sorry um, there is kindof a lot of Time Series Analysis to do on time series data before you get to LSTMs,
While LSTMs are powerful, they can be complex and computationally expensive. Here are some simpler time series algorithms that might be suitable for your project:
- Simple Moving Average (SMA): Calculate the average value of the last
nobservations to forecast the next value.
Example: forecast = (sum(last_n_values) / n)
- Exponential Smoothing (ES): A combination of a simple moving average and a smoothing factor to reduce the impact of noise.
Example: forecast = alpha * (last_value) + (1 - alpha) * forecast
- Autoregressive (AR): Model the current value as a linear combination of past values.
Example: forecast = a * last_value + b * last_last_value + ...
- Autoregressive Integrated Moving Average (ARIMA): A combination of AR and ES, which can handle non-stationarity.
Example: forecast = a * last_value + b * last_last_value + c * error
- Seasonal Decomposition: Break down the time series into trend, seasonality, and residuals using techniques like STL decomposition or seasonal decomposition.
Example: forecast = trend + seasonality + residuals
- ** Prophet**: A simple and interpretable algorithm that models time series as a piecewise linear function with seasonal trends.
Example: forecast = piecewise_linear_function(trend) + seasonality
These algorithms are relatively easy to implement and can provide good results for simple time series forecasting tasks. However, keep in mind that they may not perform as well as LSTMs on more complex or non-linear time series data.
Remember to evaluate your model's performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Mean Absolute Percentage Error (MAPE) to determine its effectiveness.
^ except this is kindof wrong because you need a fully functioning LSTM before you can feed that (the LSTM) to an ARIMA model.
The ARIMA model with the added auto-correlation test is: what? Who knows?
Anyone..?
oh
also , if anyone is free to help pls check this #1261308168185712650 message
Uh, let's slow down on the GPT content (slow down = don't post it)
I don't follow your arima question though, what are you asking?
The origin of the algorithms
.
You mean you have a multivariate time series?
Yes
sorry wasn't familiar with the term , had to google it
It can find patterns that occur across different series yes. The way you should view any kind of recurrent neural network is that you have a latent variable that is a "summary" of all that happened in previous timesteps
This is also the case for multivariate series
So what difference does the size of the sequence make?
hi. Is there a way to make Pandas treat absent row header values in CSV file like they belong to the previous row header value instead of making it a new NaN header?
what I mean is basically:
data looks like this in Excel:
parameter1 parameter2 2010 2011 2012
A B foo foo foo
C foo foo foo
D foo foo foo
M N bar bar bar
O bar bar bar
but after exporting to CSV and importing to Pandas it looks like:
2010 2011 2012
A B foo foo foo
NaN C foo foo foo
D foo foo foo
M N bar bar bar
NaN O bar bar bar
Just confirming, do you mean the length or the amount of sequences?
Hello guys, sorry to interrupt but is it better to start learning matplotlib/pandas/numpy/scipy along with linear algebra/calculus? or only math first?
the length
If the sequence is very long you run the risk of the latent variable not being able to "remember" what happened in the beginning, hence why LSTMs are used over vanilla RNNs. At least that's some of the intuition.
focus on learning concepts and applying them. don't try to "learn pandas" or "learn scipy".
(Aside from vanishing gradients)
I was planning on buying pandas course
on Coursera
I don't think you need to pay to learn pandas. you can do the pandas kaggle tutorial for free.
Hm
Try using Pandas' documentation instead, it's really good
I see, thank you
People selling the courses summarize that and sell it to you
What do you want to do?
Also like, when to start andrew ng course?
pandas is probably the best documented library in all of python. (and it better be, because it would be incomprehensible otherwise.)
What are the pre requisites
Can you share what you'd like it to look like
I see
In practice? Numpy. Mostly because the Pandas user guide makes references to Numpy and some concepts that exist in Numpy without really explaining them
So numpy would give me a better understanding?
documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well```
what does this comment here mean (it is from chromadb docs)
and what does tokenization mean in llm contexxt
I think so? But you could try with Pandas first and see if you need to try Numpy before you do so.
the data science world sort of has its own dialect of python, and a lot of the distinguishing features of that dialect started with numpy.
Btw @past meteor if you are free can you give #1261308168185712650 message a look, I am really struggling to continue due to it
I will go with numpy then
really good way to put it π
I see
!e
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print('This is going to do element-wise addition without a for loop.')
print(arr1 + arr2)
:white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | This is going to do element-wise addition without a for loop.
002 | [5 7 9]
Also does andrew ng cover the math required?
Yea
I'm currently watching gilbert strang but like its super long
I feel like alot of it isn't actulaly necessary?
I can't commit to a help thread right now sorry
thank you π
I have to make things about linguistics as much as I can
What does arr1*arr2 do? does it throw out an error because matrix multiplication doesn't work or does that work differently
multiplies 2 arrays
No worries man , Tysm for the help anyways
!e
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print('This is going to do element-wise multiplication without a for loop.')
print(arr1 * arr2)
:white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | This is going to do element-wise multiplication without a for loop.
002 | [ 4 10 18]
oh
basically I have tabular data that has column headers, but also has row headers. The row headers have 2 levels, you can think of them of as main_group, and sub_group:
X Y Z
main_group sub_group
A foo
bar
foobar
B asdf
qwert
uiop
When I export this tabular data to CSV and import it back to Pandas instead of the above structure I get additional NaN headers in the main_group seems like pandas' read_csv treats empty CSV values in this column as a NaN, even though it is specified as col_index
I need basically to retrieve from pandas an original structure, meaning that it know that, bar, and foobar also belong to the A main_group, and not to some NaN main group, which pandas seem to produce when reading csv
there's matrix multiplication, and there's elementwise multiplication
elementwise multiplication is also called the Hadamard product, but I hate when mathematical terms have uninformative names.
there is other way to do dot(?) product
hm
you can use @ to do matmul
and there's np.dot(a, b)
same as *
arr1@arr2?
yes
but the two arrays have to be valid for a matmul
so the shapes have to be (a, b), (b, c)
(a can equal c)
am i asking in wrong channel?
answers from google are going over my head
it might have treated the arrays as shapes (1, 3) and (3, 1)
which would reduce to an array of shape (1, 1), which is effectively a scalar.
without informing us
how can it transpose without any notification
i feel like this would cause huge problems somehow
np does alot of things very differently lel
you won't even be able to spot where the error is coming from
hm
same for pandas
write tests? manually
so we have to define our own function to do that
can we get the order of a matrix using np
no there r built in functions for it
I really wanna watch this but I don't know if its for beginners
hm
is this the one where he creates a very basic NN
im not sure
i think it is for begginers
i watched the 'essence of calculus' and it was really good
that really helped me atleast and i am very much begginer
hm
I have not started multivariable calc at all π¨
3b1b has videos on it on khanacademy but idk how much of it is necessary
try dropping the sub classification column, fillna and then add it again
Yo yo yo, after 2 and half weeks
OpenGL?
pygame!
mm
is that pong ai
yeah
reinforcement learning?
awesome
yeah!
how long did it train
yeah!
based
220k episodes
hm
does it play better/worse at a higher ball speed?
it is train on current speed !
I can train it on higher!
because I was just finding correct hyperparameters
try increasing speed with this model
does it completely crumble?
crumble?
I just trained it on 220k and tested for 2 minutes
so don't know need to test more
oh
Could add a calculation for possible deferred velocity after bounce so it slows down and speeds up randomly.
-or not randomly
decision tree, binary struct, whatever
yeah but this, the model will train , just need more episodes I think!
Looks good man.
yeah thanks!
but how would I achieve this?
I think this will be a great thing to try and replicate from scratch to improve my knowledge
Dont go directly to RL , it literally took me 4 weeks to fully understand
Oh okay xdd
I know what u are doing that's why I told you this
Just try some algo lime you aredoing then move to this!
Reinforcement Learning and Deep Learning are their own entire industries, or will be very soon. Lot of applications and typically the work of an organized department; not one person. So well done. RL is very useful thing.
First appreciation by you , ohh God !!! Thanks !
I fixed all the errors just to run into this during the eval stage. smh..```Epoch 1/3: 100%|βββββββββββββββββββββββββββββββββββββββββββββ| 351/351 [6:02:06<00:00, 61.90s/batch, Batch Loss=3.48e-5]
Evaluation: 0%| | 0/88 [00:00<?, ?it/s]Input to VideoEncoder: batch_size=16, num_frames=16, channels=3, height=128, width=128
After view reshape: torch.Size([16, 48, 128, 128])
After conv2d_layers: torch.Size([16, 512, 128, 128])
After view reshape before fc: torch.Size([16, 8388608])
Input to VideoDecoder: torch.Size([16, 512])
After fc layer: torch.Size([16, 131072])
After view reshape: torch.Size([16, 512, 16, 16])
After conv_reduce: torch.Size([16, 512, 16, 16])
After conv2d_transpose_layers: torch.Size([16, 48, 128, 128])
Channels: 3, Expected size: 12582912, Actual size: 12582912
Final output shape: torch.Size([16, 16, 3, 128, 128])
Evaluation: 0%| | 0/88 [00:00<?, ?it/s]
Error in epoch 1: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Epoch 1/3: 0%| | 0/351 [00:00<?, ?batch/s]
Error in epoch 2: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Epoch 1/3: 0%| | 0/351 [00:00<?, ?batch/s]
Error in epoch 3: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.```
Hey can anyone help me with tensorflow and keras error in my project ?
basically,
- lasso only avoids multicollinearity if a large number of the attributes are already linearly independent
- I know nothing about lasso as a "maximum a posteriori estimator" following a "laplace distribution," but the point is that lasso does not eliminate the need for VIF analysis
- adding features in excess is still bad, even for lasso, because lasso still performs poorly assuming low kruskal rank
correct?
yes, except i didn't get what you meant by "attributes" in the first point
help me plz
Are CNNs just normal neural networks besides how the initial data is prepped for input? Cause it just seems like CNNs are about how the initial image is separated and condensed and downscaled for higher performance with regard to the NN. Is that correct?
the main difference is not that, it's that the convolution operation contains extra information. nowadays we call this "model-based machine learning"
isnt the convolution operation merely producing a condensed version of aspects of the image.
a convolution has fewer parameters than a regular matrix multiplication, so it's easier to train. it also has the nice property of "spatial invariance", which is often what you want when processing images. these two things together give the cnn its power
what do you mean by "condensed version of aspects of the image"
like, here
you have the input image which is condensed into the convolution
tell me in words, i still don't know what you mean
and then condenced even more into the pooling
you have the original image. then the image itself is condensed when you apply the filter because it does the dot product between the values specified in the dimension of the filter. so that itself is inherently smaller. then that data that is produced is pooled. which is condensed even more because it takes the max value out of a part of that convolution (I think thats the term), which condenses the data even more. so then you have these pools which are now very condensed relative to the original CNN. then it gets fed into the NN for classification.
thats what I mean by condensed
nothing about convolutions inherently yields smaller dimensions
in fact, the standard definition of convolution yields a larger array than both the original image and the filter
pooling is also a separate operation and you can build CNNs without it, but you can roughly interpret it as keeping the most "salient" results in that they're the ones with largest magnitude
why? isnt data from each channel of the image condensed with the dot product on specified dimensions of that channel?
that's not how convolutions work
oh
convolutions are equivalent to matrix multiplication with a toeplitz matrix
you can vectorize your image, turn the convolution kernel into a huge toeplitz matrix, and multiply the two. you get an output the same size as the original image
the matrix, however, has few unique entries and is spatially invariant
oh
in image processing (e.g. in CNNs) convolutions are often done with a stride larger than 1, in which case it does make the result smaller
alright ty
the pooling part and setting up your convolutions to reduce dimensions does have the effect of projecting onto a lower dimensional vector space, maybe that's what you meant by "condensing"
"bottlenecking" the network so that the input is represented by a small vector. you can do that without CNNs though so i would treat that as a separate concept
Hi,
Hope u are doing well,
I am working on time series forecasting using multiple models (CNN-LSTM-Attention, CNN-LSTM,GRU-attnetion, Nbeats, ARIMA,Prophet). The three first algorithms produces good results compared to the two last ones, but when trying to plot curves, i noticed that the model is just shifting the last point time of the input and consider it as output. which means that models didn't learn in reality. Please any solution to this problem ?
Instead of feeding the true value as input, feed the previous output of the model, and see what the plot looks like @lapis sequoia
i am doing a one step ahead forecasting how to fed it in that case ?
model_predictions = []
x = ... # the value at t=0
while ...:
y = model(x)
model_predictions.append(y)
x = y
# Reciprocate the sub count
dislikes['uploader_sub_count_recip'] = 1 / (dislikes['uploader_sub_count'] + 1)
np.isinf(dislikes['uploader_sub_count_recip']).sum() #64000
May be a dumb question, but why am I getting infinite values when applying f(x) = 1/(x+1) to my column? Uploader subscriber counts are integer values >= 0.
nvm, I just found out that 64000 observations somehow had a subscriber count of -1
and of course f(-1) = 1 / (-1 + 1) = 1 / 0 so it checks out
Did you try this?
aight i did some more research and i think i got it now
so, the filters are produced from the training to take apart features that are then sent to the rest of the nn which dont use filters, but instead standard neural networks to process those high level features?
I think I found the issue with how the results were too good
it wasn't really data leakage but I was showing it the test data at once so it wasn't really predicting but marking up the patterns
somehow I have to make it view it one candle at a time and see if there's a viable trade or not then take it
what is the best way to get into data science as a high school student? I am just stuck following tutorials but they don't really help much. Any tips?
im in highschool too jm just sharpening my python skills
yeah i guess that's what you gotta do
What do you mean 'get into'?
'Don't really help much': how so? Tell us more?
well i have a decent understanding of python and different data structures etc. I also have watched seminars, and read papers on neural networks, even got a copy of the nnfs book for free from github. But i don't know where i can take it from there. Obviously im not looking for a job at the moment but just wondering who, what and how? Thanks.
Yeah
how to install torchviz in conda ?
If you really like data, spend some time in Kaggle. Theres all sorts of datasets and problems and challenges. Find one .
If you want to work on hard skills, learn SQL. Or, pick a specific ML technology and do a project. Maybe do something with opencv?
what's opencv
Computer vision
come up with a problem statement first
write a project charter
learn the soft skills/PM side of it too
at least that's what i'd do
yeah, i've made a whole bunch of github repos
ah, perfect.
yeah
nah i was just making sure.
mhm
but good stuff dude!
thanks
whats the best activation method for a transformer based model ReLU, SwiLU, GeLU or GeGLU
the nnfs book i find is written quite well
the what now?
oh yeah
its neural networks from scratch
here is a link to the book of it
free pdf
anyone here tried positron ide?
i wanted to ask if R implementation is as good as r studio?
with the column name autocompletes and such
Anybody using DataSpell? Why is my jupyter output so wieeerd
if u did not specify the color scheme of the plot then this may be the default colors, seem to be the pastel palette
yep its the viridis
yep that is the correct viridis palette
but let us try, delete the argument from 5th line in cell
i think the issue is the ide converts the image to the negative color values
look at the background, its now black since the color is converted
there should be an option to disable this
"Does disabling Invert image outputs for dark themes in Settings/Preferences | Jupyter help?" -https://intellij-support.jetbrains.com/hc/en-us/community/posts/4414926365842-Cell-output-and-plot-background-colors-in-DataSpell-notebook
try this
Hahah why it is default on
Thanks man really helped me π
I wanted to test the professional jetbrains products since I've received them free as student
Mixed feeling tbh
hello
has anyone done the andrew ng CNN course i wanted to ask some things
i am getting this error i have no idea what it is ???
Do you understand how assert statements work? I haven't done the course so I'm not sure if he teaches them or not
https://realpython.com/python-assert-statement/
@ember pawn that error message means that your code does not pass the tests. The error message gives you a hint for how you can change the code.
i honestly feel like tensorflow is too unintuitional, i like pytorch more (alot)
π€‘
something is wrong with this
i submitted and i got 100/100 and every other fucntion works idk what is thsi error
then its prob a version error
somthing changed between the versions of the packages and now it aint working
shows the course is outdated
a little
idk
it works honestly lost my mind with it ahahha i will do the next assingment
In my project, I developed a pretty good RandomForestClassifier model that's giving me great results. I have a dataset with 20k labeled records, and I also have around 200k more unlabeled ones. Should I use my current model to classify the rest 200k unlabeled records to create some baseline labels, which would help me get more labeled data to build an even better model. Or I should stop here? What are ur experience w it? Thanks π
Hi everyone
I was planning to do a classification task where an entire dataset ( has many measurements ) has one label (positive negative) and i have many of these datasets around a 100.
Any ideas on how to work through this or if anyone has experience with such a dataset
Any good resources on deep learning ? I been looking through random stuff online and looking for a more structured approach
udacity is good if ur willing to drop the buck but honestly just get any udemy starter course and the best way to learn is just doing ML competitions imo
What order should I study for ML (Which order is most efficient)?
Linear algebra
Calculus
Probability & Statistics
I heard Calculus and then Linear Algebra. And the the Statistics Stuff
Wouldn't linear algebra be easier?
Depends, I just heard it from a youtube video.
mhmm right
π I haven't learned this tbh!!, just learn parallely along with building some project !
but for basics like vectors, derivatives, partial and all
Using prebuilt models?
It depends!
well I don't want the "universal answer"
I am asking, did you make your own models, or did you used APIs?
I mean what can I say !!
it depends on projects sir!
if you wanna build a image classifier? what will you use? just create your own model
but if project is way more more more than this , then ofc used pre-trained!
HI guys
I was working on the mnist classification data
then I saw this line
model.predict(X_test[12].reshape(1,28,28)).argmax(axis=1)
Why do we need to reshape the X_test ? Isnt it already in the format of (1,28,28)?
So X_test will be of shape (nr_samples, 28, 28) or (nr_samples, 784) I assume @river cape
if you do X_test[i], you will get (28, 28) or (784,) but a model will always require shape (batch_size, 28, 28)
So you need to make it a batch, with a size of 1, and this you do with reshaping
You could think of it like this: The model wants a list of samples, but you only give it a single sample, so reshaping makes it a single element list.
So it makes it a list of 1 single element having the shape of 28,28?
Jup, a list with 1 grayscale image that is 28x28
But then a tensor ofcourse (not a Python list)
The list is just an analogy
I think you misunderstood.
I meant to say "You learnt math alongside building projects, so did you made your own model or used pre-existing ones as a beginner?"
do both , as per usecases -_-
does lstm model have catastrophic forgetting issue?
also what exactly is changed on the NN for continual learning
http://www.gepperth.net/alexander/papers/schak2019b.pdf yes, catastrophic forgetting is just a general phenomenon of models, particularly large models.
Hey mates, we are a team building an AI learning platform:
https://cone.ai
Need insights and reviews for it. Can you please check and provide me with your feedback or suggest something innovative you want in any learning platform...
Hi,
We don't do ads on this server
at least, not that blatant
Sorry mate, It wasn't intended to be an ad, I just wanted fellows to have review and insights on my startup
I understand
hey guys! i wanted to create a telegram bot to which i could send photo and it would recognise from photos of db. but im facing troubles with converting photos. pls is there anyone who could share some repos??
Hello, do you see any problem with this sorting algorithm?
def grow_buble():
global test_list, loop
for index, item in enumerate(test_list):
try:
test_list[index + loop]
test_list[index + 1]
pass
except IndexError:
break
if test_list[index] > test_list[index + loop]:
test_list[index], test_list[index + loop] = test_list[index + loop], test_list[index]
if test_list[index] > test_list[index + 1]:
test_list[index], test_list[index + 1] = test_list[index + 1], test_list[index]
all this ML is so confusing
I can agree π€£
hi guys , does any one here has experience with sdmx api ?
if its sorting, then its ok
the speed is just luxury
uhm isnt "loop" undefined? is it something outside the class?
what is it
need more context to see
newbie?
Yeah
when did u start
why
what's the motive
Curiousty xd
okay, any prior exp in coding?
School work etc.. but I try to allocate time for it daily
Yeah , did some stuff with python before
im learning it all
for a big reason
a cause
a mission
a project (already been developed from 5 months by few ppl)
what's the goto way u r following @deep sleet
rn I am reading a mathematics book that is pinned resources and working on forex ai project for fun to learn more about neaural networks
actually its based on GenAI
we were successful to build the thing
but 3 months down the line working on genAI i understood
I could go from A to D or E F maybe without ML and stuff
but the best way is to go learn ml then deep learning then some nlp into it
then go learn gen ai
a chat system
just like gpt
but for personalized marketing data
ohh
Yeah I can barely imagine xdd
There was a course on the basics of sci kit learn and ml
gimme a sec
This course is a practical and hands-on introduction to Machine Learning with Python and Scikit-Learn for beginners with basic knowledge of Python and statistics.
It is designed and taught by Aakash N S, CEO and co-founder of Jovian. Check out their YouTube channel here: https://youtube.com/@jovianhq
We'll start with the basics of machine lear...
why youtube tutorial
kill the boy, be the man
as if that's a bad thing
not sure
but as a backend developer maybe
Scikit learn has really good documentation
@past meteor how about this resource
So just read that compared watching a video of someone that just read the docs
whres micro avg
Yeah I started reading it when you told me that , but the guy doesn't only explain how to use it , he give alot of tips from experience
and applies it with projects
Well, you can certainly do what you want to do
but videos give the illusion of learning
There are way more effective ways, for instance doing specific kaggle competitions yourself individually
and then reading top performing solutions
how could one do it with 0 knowledge
Reading + doing are way more "active learning" compared to watching videos, it's very passive and lets you zone out
And once you finished the 10h video you're like "okay I learnt x, y and z" when it's not true at all π
practice, practice, practice
A book that teaches it from scratch
Makes sense, Will do that!
okay xD
this does't have accuracy, the other one does
also the code is like, rather different, not sure why you're expecting the same output
Well I classified it with the same classification method = LR
Vectorizer is the onlt difference
And as far as I know for the confusion matrix param it does not influence
Reading a text like this gives you a lot of the "finesse" you need for data science. Arguably it has a prerequisite (standard university statistics) but I think you can wing it if you pay close attention
Also this is from scikit docs
Gallery examples: Recognizing hand-written digits Faces recognition example using eigenfaces and SVMs Pipeline ANOVA SVM Custom refit strategy of a grid search with cross-validation Restricted Bolt...
Micro average (averaging the total true positives, false negatives and false positives) is only shown for multi-label or multi-class with a subset of classes, because it corresponds to accuracy otherwise and would be the same for all metrics. See also precision_recall_fscore_support for more details on averages.
what about the second edition?
Anybody who has worked/working on LDA and topic modelling ?
Hi, does this channel also cover the less advanced topic of Data Analysis (Streamlit, Pandas, etc.)? I didn't see any in the comments above and it's pretty huge part of Python.
people talk about pandas in this channel. streamlit is more about web development than it is about data science, even if making dashboards is a popular streamlit use case.
is it a smart idea to try to build transformer in mid-low spec personal pc?
i have 8gb ram total, i7, RTX2000 something
or i should just accept my spec limitation and give up on making transformer?
yea, you can definitely do a lot with Streamlit!
Has there ever been talk about doing a 'Data Jam', similar to 'Code Jam'?
Do you mean something like a Kaggle competition?
yes, just Python focused
Don't ask a question to ask question.
Always ask your question with the intent that someone who knows it will answer you without having to pry for additional detail / ask for full context before they can be able to answer your question.
I looked at Kaggle again and remembered why I didn't try those - mostly ML and Data Science focused, I'm looking for Data Analysis focused.
I haven't heard anything about a "Data Jam" like that, would love that kind of thing myself
We haven't had anything like that. It's unlikely that we will, as the code jam already requires a lot of labor.
yes, it would take a critical mass of people to make it happen. I'm scouring the web for something like this, can't seem to find a whole lot yet.
The answer is, it depends. It depends on what you wanna do. Do you wanna train a model with transformer or finetune, or?
If your GPU (RTX2000 has a VRAM >= 12GB), then I think you're good to go; so long as what you wanna do isn't beyond your GPU card.
I usually recommend using RTX 3060 which has 12GB VRAM or the RTX A4000 which has 16GB VRAM.
Anything beyond what these cards can handle (e.g. task that requires A6000, RTX 4000 series, A100s) is gonna be an overkill for you to attempt that on your pc (instead, rent a GPU online)
When it comes to RAM The usual rule of thumb is 2x your VRAM, though I think 16GB - 32GB of RAM is probably okay.
yes, I want to train transformer and fine tune. it's have 8GB VRAM, what are the main problem if I'm lacking ram? can't I do some work around?
Can anyone share a data source for automobile 'registrations'? It's more definitive than 'auto sales' (like for looking into the details of Tesla's sold, which they don't provide).
anyone worked on a classification task where we gotta classify datasets as 0 or 1 instead of a row in the dataset
What would be the difference between a data sample and a data set for you?
The dataset contains an irregular number of data samples (and all data samples have the same shape over all datasets)?
@nova matrix
Oh, someone mentioned a data challenge that was similar to advent of code... what was it?
hmm... interesting on first glance, I'll have to look deeper, thanks!
Do neural networks also discern patterns? From what I can tell there are just a bunch of computations that are reliant on each other and changing the variables of all of the compounded calculations to the least error is what training does. Is that inherently finding patterns?
you can think of all of machine learning as "discerning patterns". what you said about interreliant computations is how neural networks do it.
Ic. So is it similar to the patterns that we can visualize which CNNs produced post training?
"the patterns that we can visualize which CNNs produced"
idk what you're referring to here.
I saw a visualized process of a CNN, and noticed how the resulting data passed into the final layer, if visualized, represents the patterns that those base images can contain. Idk if Iβm conveying that properly but whatever.
can you show an example?
Yeah. IIRC they are fed into a nn afterwards which then comes out with the final prediction
a fully connected layer I think
Training and fine-tuning transformers, especially large models like BERT, GPT, can be resource-intensive.
If you attempt it, your pc might start heating up real bad (this could fry your RAM), your pc will also start lagging in the process, you could run into out of memory error while training. Another annoying part is that, it could take forever to finish training.
A walk around might be, using Mixed precision, reducing batch size, or using PEFT techniques like LoRA.
Or better still, just use Colab, Kaggle's free tier GPU, or rent from companies like AWS, SaturnCloud, or Vast etc.
I see... thx β€οΈ
might I suggest paperspace, they're pretty cool
tryin to make a text detector from random text, and have to generate a 128GiB text file
why do i do this to myself (;_;)
Show your code?
!code
It's easier for everyone when you give code as text.
import random
import string
k = 131072 # Size of the file you want to generate in MiB. Warning: going past 1024 can cause issues.
# Define characters to choose from
characters = string.ascii_lowercase
# Define the file path
file_path = 'E:/file.txt'
# Generate random text
while k > 0:
random_text = ''.join(random.choices(characters, k=1048576))
k = k - 1
print(str(k) + "MiB left to generate")
with open(file_path, 'a') as file:
file.write(random_text)
print(f"Random text has been generated and saved to {file_path}.")
^ here
i've got around a 100 datasets
300 x 64
and each one has a classification assigned to it 0 or 1
@mild dirge
Hey, I'm having some difficulty figuring out how to optimize this:
def setZdepth(self, depth):
self.depth[2]=depth
self.arrZ= cp.arange(self.shape.N())//(self.shape.Nx*self.shape.Ny)%self.shape.Nz==depth
def viewZ(self, data):
return data[self.arrZ]
This viewZ function takes almost all the time of my program, presumably because this slicing operation is really slow... There has to be a better way!!
(This provides a 2D slice of a 3D block of data)
Also I'm using cProfiler, so I'm guessing maybe it's misattributing the time to the wrong function
finally generated my insanely large text file, now time to do some science with it
β¬οΈ
I figured people did import cupy as np
so it's 134gb. what is it?
and how large is your ram?
random lowercase characters a through z
how is that interesting?
i'm trying to see if i can find words in pure randomnes
it's not for a summer course or anything, i'm just bored and have nothing to do
cool, though you probably can do it without creating a file, by using a buffer
if you were curious about it
it's pretty funny π
also, i got a 4TB external SSD for my birthday and needed something to use it for
nice
I'm not sure I'd consider this data science, but if it interests you and motivates you to practice programming, I guess that's cool
i already wrote all the scripts for it
i did it for 1GB and it took 40hrs
so for 128gb
I bet you can write a faster script
40 hrs * 128 = way too long lol
For example, you could make a Trie datastructure
and instead of loading the file into memory, you can control the file pointer manually
eh im not too good at programming
everyone starts somewhere π If you were bored, those are a few things you could try to make it faster
i stole permanently borrowed most of the code from stackoverflow
this is my "i'm bored" project
Oooh that's cool
im tryin to understand what all of that means
Thanks π It's a Lattice Boltzman Method fluid simulation
The blue view is the speed view, the orange view is the density view
I have a tool I can play with and switch views on the fly, and it records the session to .mp4 so I can post it on discord
nice!
cool :D
if its not data science, then what is it lol
file IO and string manipulation.
depends: they can do data science on it, but probably it won't be especially interesting
like... how many 3 letter words appear? 4 letter words? etc
yeah that's what im tryin to do
it looks for how many of each word in a 370K word list appears in the random textfile
then outputs that number to a textfile
at the end it gives me a bunch of data
128GB data in + 4.1MB data in -> ~10.5MB data out.
Woooah
I will make a prediction ahead of time:
||I suspect that some 3 letter words will appear more often than other 3 letter words. I'm thinking because of this problem Edit: oops that link isn't to the right problem ||
You can make it available, but I suspect not many people will want to use it. But I like putting things on github even for my personal use, because it also helps if I decide I want to go back to a previous version
i kinda want to bc running this code on my machine would take over HALF A YEAR
and i want to distribute it among more machines
so im gonna split up the work into 740 chunks, then anyone can do them and send the results back to me
So, uploading it to github just means other people can see the code, it doesn't mean they would run it for you π More likely, you might find someone who is interested in help you make it run faster
i dont mean random people per se, more like my friends and/or family who have better machines
It's also good to learn and practice git and GitHub
It's a very useful skill
one of my friends has a ryzen 9 7000 something
GPU won't make this faster, because the bottleneck will be the disk
Gpus won't help you here
reading the data from the drive will be slower than the processing time
the ryzen is a cpu, not a gpu lol
not with a fast enough disk
I don't think there's a disk in existence that's faster than a CPU
CPU is "fast" but only can have a limited amount of data in it at a time.
The disk is slower, but can have a lot of data
so the time would all be the data transfer back and forth between CPU and disk
I would recommend:
- Post your code to github
- Fix your code so that 1 gigabyte takes more like... probably 10 minutes instead of 40 hours.
If it takes 40 hours, I can tell you your code is inefficient
use more words
Most important skill when asking for help is explaining your problem
what does "makke the plot bigger so the subplots dont overlap" mean
don't forget that that means cycling through a 1GB text file 370,104 times (number of words in wordlist) and is equivalent to processing 370 TB of information
Try commenting out that line (put a # at the start of the line) and see what happens if you don't do it
ok ill try
You don't need to do that, though.
Consider the beginning of the file is applejdogejb...
Ok we can see apple is there right?
When the algorithm starts, you only need to check the words starting with a you don't need to check the words starting with z
This cuts the time by 26x~
how do i do this in code tho
So, here's what I think your algorithm you wrote it (without looking at your code, but you can correct me if I'm guessing wrong)
there is no obsevable difference
what does subplot
mean
n = 0 # do not change, should be at 0
a = 0 # do not change, should be at 0
k = 0 # which word of the list to start from. 0 means start from first word. 500 would start from 501st word.
t = 499 # how many words you want to process, minus 1. Useful if you have a large dataset that is >8GiB.
with open("E:/file.txt", "r") as file:
file_as_string = file.read().replace("\n", "")
# opening the file in read mode
my_file = open("R:/pythonProject/wordlist.txt", "r")
# reading the file
data = my_file.read()
# replacing end splitting the text
# when newline ('\n') is seen.
data_into_list = data.split("\n")
my_file.close()
results = open("R:/pythonProject/results-" + str(k) + "-" + str(k + t), "w")
while n <= t:
print("Searching file for the word: " + data_into_list[n + k] + " - #" + str(n+1))
a = file_as_string.count(data_into_list[n + k])
results.write(data_into_list[n + k] + " - " + str(a) + " appearances" + "\n")
n = n + 1
here
here's my code
- Read the 1 gigabyte file into memory
- For each word in the dictionary
- Scan the entire 1 gigabyte for that word.
You can think of it differently:
- read it into memory
- For each position in the file
- Check if there are any words in the dictionary that exactly match the current position of the file
by switching 2 and 3, you unlock the ability to narrow down the search space: you know the first letter is 'a' so you can skip all the other letters. If no 'a' word is first, you can move to the second letter of the file and try 'p' words etc
assuming the input was applejdog...
so after you finish with 'a' you only have
pplejdog...
so you check the 'p' words.
what are we doing here again?
Not datascience, just algorithms really
you came in about 30-45 minutes late.
yeah so it'd be nice if someone could catch me up, seems interesting
read the past couple hundred messages or so, then you'll be caught up
#data-science-and-ml message <- this message is where it started
https://pynative.com/python-file-seek/
You can manually control the position of the file using seek(), this lets you avoid reading the whole file at once. You can for example, read the first ~20 characters, check if any words start with that letter, and if not, seek() to the next position
They made a 134 gigabyte file of random a-z characters. They want to search the random letters for words, but their code is very slow (and obviously they don't have 134 gigabytes of ram)
I think I see
if you're up for a challenge, may I introduce to you the AC automaton
In computer science, the AhoβCorasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm ...
yup. also my pronouns are he/him
which matches a string (your 134gb file in this case) against a list of words (your word list)
Ah, yeah that makes sense. Seems more complicated than their current coding level, but that's a better algorithm than I was suggesting
im not too good at programming, i don't know how to implement something like that lol
My algorithm is just like, step 1 of optimization
don't worry about it too much then
right now it will take 3,000-5,500 hrs of computation time to search through the file for all words
as for memory, you can specify how much to read in file.read(num_of_characters)
But the thinking is based on the same idea:
If you know the file starts with a you only need to check a words.
Take that to the next level
If it starts with ap you only need to check words starting with ap
let's stick to one letter for now.
so do i split up the wordlist into 26 lists each corresponding to one letter?
sure, and put it in a dictionary for easy access
Kinda like if you think about a phone book right.
Say I type "555-124..." What number comes up next in my autofill? Only the numbers that start wit h 555-124-.
https://www.geeksforgeeks.org/implement-a-phone-directory/
A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.
yes
(And you might imagine on step 2, you can split the a list by each second letter)
but that's later
that uses a Trie though which isn't too simple
Trie is the general solution, but the first level is simple enough
just 1 level can be done by a novice, and cuts the search time by 26x
so i have to only look for words starting with one letter instead of all words?
so it's "for each letter in the file, search for all words starting with that letter, go to the next letter in the file, go back to step 1"
Say you read the whole file in:
file = open('myfile.txt', 'r')
then you keep track of where you are in the file
pointer = 0
Then you check the letter
letter = file[pointer]
Now you have a dictionary of all the words, sorted by their letters
dictionary['a'] = {...}
So you can now run through all the words that start with that letter
for word in dictionary[letter]:
and now you just want to know if the letters starting at pointer match that word
if word == file[pointer:pointer+len(word)]:
when you finish, you can increase pointer
pointer+=1
and then you're ready to check all the words starting on the second letter of the file
well that only checks whether the word is present, yes or no
i want how MANY times it appears
keep another dictionary that's {word: count}, start the counts at 0, add 1 when you see a match
If the word appears starting at pointer then file[pointer:pointer+len(word)] will match that word. You can then record that in some record
yeah like purplys said
if word not in FoundDict:
FoundDict[word]=1
else:
FoundDict[word]+=1
With this, your 40 hour run should be more like 1-2 hours, I think
also how high can the variable go in python
integers are only limited by your ram basically
cuz with this pointer will have to go to around 130,000,000,000
pretty much forever, python doesn't use a specific nuber of bits
floats can go to
>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
>>> sys.float_info.max
1.7976931348623157e+308
>>>
that amount
oh big number
it stores all numbers basically as the literal strings like "130,000,000,000" so it can go basically to more numbers than the atoms in the universe
ok good
that's if you read the entire file at once
you can just keep on doing file.read(10000) or smthn, to read 10000 characters at a time
well the problem is the .read() trying to put 134gb worth of file stuff into your ram
Yup, you don't have 134 gigabytes of RAM π
To do any calculation, it first has to read from the disk into RAM, then send from the RAM to the cpu
so yeah, you'll want to only read part of the file at a time
how much RAM do you have?
it'd work something like
file_contents = file.read(10000)
while file_contents != '': # it will be '' once file has nothing more to read
... # do stuff with the 10000 characters
file_contents = file.read(10000)
32gb ddr4, 2666mhz
100k is not even 1 megabyte. You can go bigger if you want
1 gigabyte is 1 billion letters
what about 1073741824, that should be good
every letter is 1 byte
it reads in 1 GiB at a time
you don't have to think too hard
literally just try it, see how much % ram it takes up, go bigger / lower accordingly (if you overshoot it'll just MemoryError anyways)
tbh I don't think it'll impact performance too much, most of the exec time's gonna go to the matching anyway
it honestly annoys me how many people i meet don't know the difference between GB and GiB
I do coding for work, and the only thing that annoys me is whenever I see anyone, ever, try to write a regular expression. Because everyone I've met is terrible at it lol
I just look it up if I need it
alright i did dictionary[a], and it just returned the letter "a".
You need to add all the words to the dictionary, starting with their starting letter
i thought it would return all words starting with a
yeah my dictionary contains 370K+ words, all in alphabetical order
for word in originalDict:
newDictionary[word[0]].append(word)
maybe something like this
what's your dictionary like
we're thinking something like
d = {
'a': ['apple', 'abduct', 'abort', ...], # all words starting with a
'b': ['bad', 'bard', ...] # all words starting with b
... # etc etc
}
yeah it won't know you're trying to do a prefix matching. Dictionary does an exact match only. Trie does a prefix match, but it's usually not provided by default.
no my dictionary is just a giant list imported from a text file
the first little bit of my dictionary .txt file looks like this
Try this:
newDictionary={}
for letter in "abcdefghijklmnopqrstuvwxyz":
newDictionary[letter]=[]
now each letter will have it's own list
when you read in the file, add each word to the correct list in the NewDictionary
and how do i do that
Give it a try, let me know if you get stuck
alright so i didn't use your method, but i did find a way to split up the wordlist.txt file into 26 text files each labeled by their starting letter in the alphabet
(e.g. dict_a.txt, dict_b.txt, etc...)
@mild grotto im not sure if that will work or not
that'll work
If you ask for help in #algos-and-data-structs you'll find others that can likely help (since this is getting off topic from AI stuff)
wtf howd you get that
sys has that?
you can look for type info/ wtf
current dataset tree structure :-
βββ Cloudy
βββ Rain
βββ Shine
βββ Sunrise
so each dir (e.g cloudy ) has nearly 300 images approx..
and wanna train this all images on CNN
so should I train sepearately ( which I should I think ) , like first train for cloudy ,
also consider that , like in each dir there are only images, no labels nothing!, that's why I though to train seperately...
ok? so you do have "labels" then, cause images in the Cloudy folder are cloudy, then that could be their label
Yeah !
then... just train the CNN as usual? I don't see the problem
For each classes??
Or I can train on all 4 classes
wdym for each class?
it's not like NNs only work for binary classification so you need to merge them with ovo or something
hello i wanted to ask where can i learn about transformers
same ^
Guys is there any well known models for music embedding?
I want to create a web app to organize my music collection.
Hey everyone, I need a reference for Bi directional LSTM. Does anyone have the original paper for it?
hey so I have been using google colab and been running out of their free GPU run time lately and was wondering if there was a way to use the free $300 worth of google enterprise credits to pay for more computational units and GPU's to run with google colab
oh, from https://cloud.google.com/free/docs/free-cloud-features#free-trial
You can't add GPUs to your VM instances.
https://youtu.be/R67XuYc9NQ4?si=Oz-ThlRLalwzA5cB
is this a good project? i am currently making this one
In this video kaggle grandmaster Rob Mulla takes you through an economic data analysis project with python pandas. We walk through the process of pulling down the data for different economic indicators, cleaning and joining the data. Using the Fred api you can pull up to date data and compare, analyze and explore.
Copy and edit the notebook fro...
I use different accounts! ( 5 )
has anyone used positron ide? i couldnt find the changelogs in github, is this normal?
I ended up just purchasing 100 computing units because I can't be asked to be switching from account to account icl but that hilarious
Can u mark me TP FP TN FN
You can do it per class, but not over all classes I guess
Your lines are also not matching the squares
I made this one recently ^^
If you want to minimize, for example, a model's incorrect classification of class BE to any other class, then you would try to minimize False Positives, right?
If you missclasify a BE as something else (True=BE, prediction=CP f.e) then that would be a False Negative with respect to BE.
Because you did not catch the BE
And you look for this row to minimize?
The diagonals are all correct classifications
The rest is missclassification so that row shows all the missclassifications for samples that were actually BE
But TP/FP/FN/TN only makes sense for binary classification
Sorry to butt in, what's BE?
Some random class
Birch tree (Berk in dutch)
Oh thanks
So if you want to talk about TP/FP/FN/TN you can think of the problem as BE or not BE
Ohh yeah, makes sense
And then you have those measures for BE
But you'd do that for each class
So each class has their own TP/TN/FN/FP
Do anyone know a good place to get simple dataset? I made a linear regression model, which seems to work, but I want to try it on a larger dataset. Is Kaggle a good place to find them? Thanks!
kaggle is a good spot, https://datasetsearch.research.google.com/ google has an entire search engine for datasets as well
OH thanks
Nice, google have a search engine for everything
Huh life expectancy data, that's intresting
May I ask
Some data set have string as data, for example, for car data, there's gas, diseal etc.
Would it be sutible to change thoose tag into unique integer for linear regression?
For example:
oil as 1
gas as 2
Or do I want other algrithom?
Thanks!
it can work but generally we like to encode data in different ways, like one-hot encoding. This is because the model has no way to know that 1 and 2 are discrete values, it will see 2 as being "twice as impactful" as 1, or something along those lines. So instead we might use something like [1, 0] and [0, 1]
Oh, so for example```
oil: [0,1]
gas: [1,0]
something else: [1,1]
Would that work?
sure, but there are lots of other ways of encoding data like this https://www.bigdataelearning.com/blog/7-data-encoding-techniques I don't have a great resource for this but this one at least goes over a few other techniques you could consider
Thanks, I am completely new, I'll have a look into it
merging all 4 directories ( cloudy, sunny...... ) into one dir??
and then again I have to label this data then?
Hi. Is there a way to make Pandas read headers and subheaders from a CSV correctly?
For example tabular data like:
CategoryA CategoryB CategoryC
X Y X Y X Y
1 (data)...................................................
2 ...
3 ...
I want to be able to read a CSV exported data of this kind, in such a way that the it is known that X and Y are subcategories of the main categories CategoryN in example, I want to be able to do:
df['CategoryA']['X']
I tried doing this with pandas, but got main columns in the MultiIndex labeled as unnamed
How exactly is it formatted? (commas, spaces, something else)
commas
just this?```
A,B
X,Y,X,Y
1,2,3,4
5,6,7,8
or```
A,A,B,B
X,Y,X,Y
1,2,3,4
5,6,7,8
this is just a tabullar example of it, but normally it would be exported to CSV, and look something like:
,CategoryA,,CategoryB,,CategoryC,,
,X,Y,X,Y,X,X,Y
(...data)
from what I could see when exported to CSV looks like this at least
hmm, for a,a,b,b it works like ```py
import io
import pandas as pd
file = io.StringIO(
"""A,A,B,B
X,Y,X,Y
1,2,3,4
5,6,7,8"""
)
df = pd.read_csv(file, header=[0, 1])
print(df)
not sure for a,,b,
notice, that you have A, and B twice in the main category
yes, to indicate it is a.x a.y, rather than just ?.y
yes, the problem I have, is that the CSV i'll get might look more like the a,,b, version
yeah you might have to just parse it yourself
I'm not sure what spreadsheet program exports how, but the ones I've used so far export the aforementioned tabular data with multiheaders to CSV in this way:
a,,b,
which kind of makes sense, if you think of it since the main CategoryA, CategoryB take multiple cells
just read the first n header rows, construct the multi-index, then pass it to read_csv
!d pandas.MultiIndex.from_tuples
classmethod MultiIndex.from_tuples(tuples, sortorder=None, names=None)```
Convert list of tuples to MultiIndex.
When it comes to data science and machine learning in general, what would be the distinction between using SQL over Pandas or other libraries, I mean I'm not sure what would be the roles of each, because technically you can do everything from manipulation, collection and so on with either of them. What be the role of each in a machine learning project? From my understanding SQL is used for collection and storage and obviously used to define the schema, insert the data and so on. Also if you would like to extend the data base you would use sql to insert new rows, but when would you actually load the dataset into pandas and start cleaning and preprocessing or would that be done using sql? How would this work?
SQL and pandas are both for tabular data (rows and columns). But SQL databases exist on the hard drive, and dataframes only exist in memory while a python program is running.
@echo mesa ^
I'm sorry but I don't really understand what you mean, or how this answers my question
do you know the difference between hard drive (also known as disk) and RAM (also known as memory)?
Pandas is like a bread knife, and relational database is like a chainsaw. You would probably not use a chainsaw to cut bread, and not use a knife to cut down a tree.
SQL is just an interface language used by many relational databases (the standard).
Also Pandas is a Python specific thing that is useful in Python as a way to manipulate tabular data in general.
sql is a language your sql server can parse and process. pandas is a high level api povided in python. both are not mutually exclusive (see for example https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)
direct sql is very well suited for some kind of job, while pandas for others. usually it would depend a bit on the role you want to endorse. if your goal is to manipulate "raw data and tables", sql look like good. if your role is to query the data in order to perform a ds job, maybe pandas will suite better. but it's hard to really categorize everything here
I fear that people are answering the question from too many angles and making it more confusing for OP
The TLDR is that you want a database when you are getting serious / storing a lot / need to do queries fast. Cleaning and preprocessing will probably be done elsewhere. SQL is for querying, not running a complex preprocessing step.
It's mostly convenient / fine for anything not huge.
And you don't need to learn SQL.
Pola.rs is like something in between.
I do not use it for performance reasons too, i'm impatient with these things. I don't like something taking a day when it can take 3 hours.
Epoch [1/10], Loss: 0.4800
Validation Loss: 0.2547, Accuracy: 88.44%
Epoch [2/10], Loss: 0.3913
Validation Loss: 0.2687, Accuracy: 89.78%
Epoch [3/10], Loss: 0.2884
Validation Loss: 0.2423, Accuracy: 88.00%
Epoch [4/10], Loss: 0.2128
Validation Loss: 0.1990, Accuracy: 92.44%
Epoch [5/10], Loss: 0.1172
Validation Loss: 0.1681, Accuracy: 92.89%
Epoch [6/10], Loss: 0.0621
Validation Loss: 0.2270, Accuracy: 93.33%
Epoch [7/10], Loss: 0.0730
Validation Loss: 0.2494, Accuracy: 92.44%
Epoch [8/10], Loss: 0.0330
Validation Loss: 0.1599, Accuracy: 93.78%
Epoch [9/10], Loss: 0.0247
Validation Loss: 1.6280, Accuracy: 93.78%
Epoch [10/10], Loss: 0.0449
Validation Loss: 0.3310, Accuracy: 90.22%
is this good?
Probably the game development experience, it's all about iteration speed there, so waiting on something to process for really long is pain. (And also kind of one of the selling points of using Python in the first place, I don't want to compile for an hour)
or need more accuracy
is this where I can ask questions about regex in python?
Akshually sql is a standard so itβs up for the creator to decide on the implementation in the backend π€
I know that. But this isn't about showing what I know. It's about giving the asker the information that would be most helpful for them.
Dw I was joking ik what u mean
Use polars it sm quicker
Uh, in terms of my performance rank tiers... pandas is pretty low, in modern tools.
Polars is really where it's at... or pyarrow if my problems are simple enough.
(well, fine, you know I'll say duckdb
)
just wanted to give a quick appreciation to stelercus, etrotta, zeal, billybobby and other people that helped me not too long ago the suggestions were very well received at work
i just feel like it would be wrong to not give explicit thanks so again just wanted to say thank you for the suggestions π
lol kinda funny polars is being discussed again
i'm a bit late but was lazy loading buggy for your project or just in general? sorry if i'm interrupting btw
This touches on my main complaint of the dataframe libraries: having to learn yet another syntax. Maybe it was you, maybe it was polars, but it's still yet-another-data-api.
I'm just starting to learn Pandas and here I learn it's not great π΅βπ«
Do you think the skills transfer well across Pandas, polars and similar tools? You're basically working with and manipulating tabular data from what I understand
Pandas is required knowledge. We're just complaining about advanced-level problems after you learn Pandas.
Oh ok, thanks. I'll definitely keep learning Pandas
The problem is, the skills don't transfer as nicely as you'd hope. Polars is a very different API. SQL too. This is the crux of my complaint.
yea I wouldn't worry too much I learnt polars 3weeks ago and everyone on my team(5 members) picked it up easily had some issues like people trying to 1..1 pandas but besides that simple learning
it helps somewhat in transferring skills for some things, like Spark was ez to pick up after i was somewhat ok w pandas
Thank you everyone for your answers!
I try to use pyarrow a lot more for loading issues. The whole point of arrow tables is zero copying to pandas/polars/duckdb/etc.
wait wait why the f correlation values are not 1 for same columns?
Hey guys, does anyone have familiarity with PyTorch backpropagation? I have this code and have no idea which tensor the gradients are stored in that are created by loss_value.backward()?
https://paste.pythondiscord.com/XVLQ https://discord.com/channels/267624335836053506/1262534150766858311
@lapis sequoia if you ask for help in more than one place, please link to the thread
np
Hello. I am trying to create a vqa type model with possibly video and audio inputs. I was wondering if anyone could give me some advice regarding this? Because say I have a streaming speech-to-text algorithm. I wouldn't have the entire input at once, so how would I say, perform positional embedding, or something like self-attention, when I don't have the entire input yet.
I'm quite new to this transformer-type architechture, so I hope somebody with more experience might be able to point me in relatively the right direction. Thanks for any advice anybody might be able to give.
-# Note: Though, I wouldn't consider myself a complete newbie to AI/ML, I'm not any "pro" either, so please don't be too harsh is what I'm saying is nonsensical or unfeasible!
@unborn hemlock my advice is to not use anaconda or any of its variations. I've been doing DS/AI/ML for five years and have never used or needed it.
I mean how would you manage multiple pythons?
There is no reason to be using a different system than the rest of the python community, unless you for some reason want to paint yourself into a corner where you can't use the majority of guides about managing environments
you can have different virtual environments.
that ability comes with python.
without conda
You mean venv ?
yes
I thought it's the same. they are all just env managers and conda has more feature
am i right ?
what features does conda have that you think you need?
Maybe the built-in package that come from its repo i used venv before and it was very hard to manage builtin package
the built-in package?
i mean like how it can handle its dependencies installing nonpython package.
which non-python packages do you need?
like Numpy
you don't need conda to install numpy.
I know ... but conda is easier to use
how does conda make it easier to install numpy?
with regular venvs, you just do pip install numpy
When you use pip, some dependencies might be missing. Conda handle that easily.
I have never encountered this.
Maybe it's just me, but I have seen a lot of developers on GitHub use it too, and I feel like I have to have it too in order to follow their guideline :\
those people have been gaslit into thinking that they need conda
Usually there are requirements.txt for a lot of projects so u can just install those in regular venv
Itβs not that diff
Yeah, like I said I used to only use venv but many of them recommend Conda. I did research on the advantages, and yes I'm not sure if I read it right. just trying it
I mean the only real advantage is that it comes w some preloaded packages
that is what i was trying to say.. it has preloaded packages.. I said 'builtin package' that confused @serene scaffold
Hi @serene scaffold , are you able to help me with something PySpark related please?
Be sure to never "ask to ask". Always ask your actual question in your first message.
Be sure also to not post screenshots of text. Copy and paste actual text into the chat.
is this ok or too long? didnt want to take up too much space in chat
It's better to take up more space in the chat if it's content that people can use. If someone turns up who can help you answer this, they will probably need to google the error message.
hi
anyone has expertise in pinceone
docsearch = pec.from_texts([t.page_content for t in text_chunks], embeddings, index_name="test")
AttributeError Traceback (most recent call last)
Cell In[62], line 1
----> 1 docsearch = pec.from_texts([t.page_content for t in text_chunks], embeddings, index_name="test")
File ~\anaconda3\envs\vectordb\Lib\site-packages\pinecone\control\pinecone.py:590, in Pinecone.from_texts(*args, **kwargs)
588 @staticmethod
589 def from_texts(*args, **kwargs):
--> 590 raise AttributeError(_build_langchain_attribute_error_message("from_texts"))
AttributeError: from_texts is not a top-level attribute of the Pinecone class provided by pinecone's official python package developed at https://github.com/pinecone-io/pinecone-python-client. You may have a name collision with an export from another dependency in your project that wraps Pinecone functionality and exports a similarly named class. Please refer to the following knowledge base article for more information: https://docs.pinecone.io/troubleshooting/pinecone-attribute-errors-with-langchain
Selection deleted
help me with this code
Hi guys,
I'm very new to python and can't get this forked project top work properly as I keep running into this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices
I am running torch 2.0.1 due to some compatibility issues with torchvision and torchaudio.
I have been running it on a remote SSH on RunPod with the following hardware:
- 12 x RTX A4000
- 128 vCPU 250 GB RAM
Obviously I'm using RunPod to speed up the model-training process but I can't seem to get python to take advantage of the extra GPU-processing power.
CODE:
import json
import torch
import torch.nn as nn
from config import eval_interval, learn_rate, max_iters
from src.model import GPTLanguageModel
from src.utils import current_time, estimate_loss, get_batch
def model_training(update: bool) -> None:
"""
Trains or updates a GPTLanguageModel using pre-loaded data.
This function either initializes a new model or loads an existing model based
on the `update` parameter. It then trains the model using the AdamW optimizer
on the training and validation data sets. Finally the trained model is saved.
:param update: Boolean flag to indicate whether to update an existing model.
"""
# LOAD DATA -----------------------------------------------------------------
train_data = torch.load("assets/output/train.pt")
valid_data = torch.load("assets/output/valid.pt")
with open("assets/output/vocab.txt", "r", encoding="utf-8") as f:
vocab = json.loads(f.read())
# INITIALIZE / LOAD MODEL ---------------------------------------------------
if update:
try:
model = torch.load("assets/models/model.pt")
print("Loaded existing model to continue training.")
except FileNotFoundError:
print("No existing model found. Initializing a new model.")
model = GPTLanguageModel(vocab_size=len(vocab))
else:
print("Initializing a new model.")
model = GPTLanguageModel(vocab_size=len(vocab))
# Utilize all available GPUs if available
if torch.cuda.device_count() > 1:
print(f"Using {torch.cuda.device_count()} GPUs.")
model = nn.DataParallel(model)
...
CODE CONT...
# Move model to CUDA devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learn_rate)
# number of model parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters to be optimized: {n_params}\n", )
# MODEL TRAINING ------------------------------------------------------------
for i in range(max_iters):
# evaluate the loss on train and valid sets every 'eval_interval' steps
if i % eval_interval == 0 or i == max_iters - 1:
train_loss = estimate_loss(model, train_data).to(device)
valid_loss = estimate_loss(model, valid_data).to(device)
time = current_time()
print(f"{time} | step {i}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}")
# sample batch of data
x_batch, y_batch = get_batch(train_data).to(device)
# evaluate the loss
logits, loss = model(x_batch, y_batch).to(device)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
torch.save(model, "assets/models/model.pt")
print("Model saved")
TERMINAL:
root@db6e42fc7512:~/lad-gpt# python run.py train
Initializing a new model.
Using 12 GPUs.
Parameters to be optimized: 7041970
Traceback (most recent call last):
File "/root/lad-gpt/run.py", line 20, in <module>
main()
File "/root/lad-gpt/run.py", line 15, in main
train.model_training(args.update)
File "/root/lad-gpt/src/train.py", line 65, in model_training
train_loss = estimate_loss(model, train_data).to(device)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/lad-gpt/src/utils.py", line 23, in estimate_loss
logits, loss = model(X, Y)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
TERMINAL CONT...
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lad-gpt/src/model.py", line 151, in forward
pos_emb = self.pos_embedding(torch.arange(T)) # (T, C)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
root@db6e42fc7512:~/lad-gpt#
Hey guys, I've been training a coco-model on image detection lately in google colab. I was wondering if anyone had a link oe two that would explain a way in which I can somehow download this model I've trained so that I can import it to a new file and just plug in an image to be detected rather than re-run the enitre model on colab for it to be used everytime I restart my PC.
Popular opinion: the organizations currently developing AI are evil and should be stopped
Does anyone have a good resource to learn about transformers?
nah i'm jk.
π
We have put together the complete Transformer model, and now we are ready to train it for neural machine translation. We shall use a training dataset for this purpose, which contains short English and German sentence pairs. We will also revisit the role of masking in computing the accuracy and loss metrics during the training [β¦]
ML Mastery is goated.
Tysm!
you know the math behind it all?
behind transformers ? nope
hmmm. i'll let someone else weigh in too.
any place to learn about it?
yea man. 3blue1brown on YT? professor leonard as well for calc up to multivar calc. stats too.
Oh ok
yea man. check out our resources pinned too. you got this.
Tysm man!
Yeah , I started in the math book
any time, homie!!
finished the linear algebra section
keep active in this server. you will learn so much.
try gilbert strang for lin alg?
are you a CS major?
No man , I am a highschool student
oh wow. hella ambitious.
anyways, make sure your math fundamentals are up to par.
they're pardon the pun, integral.
Will do boss!
Repost from help channel:
I have a rather complicated problem
I am trying to set up this repo here
https://github.com/mala-lab/InCTRL
and got to the last step of testing the visa dataset
but when I try to run it I get a permission denied error even though I have full admin rights
My assumption is that it has something to do with CUDA and my GPU
So my question is
How do I give the code access to my GPU
Please show the "permission denied error"
you have device = torch.device("cuda" if torch.cuda.is_available() else "cpu"). Chances are that torch.cuda.is_available() is false, meaning that you're setting the device as cpu. and then the logs say Using 12 GPUs. but that probably actually means that you're using 12 CPU cores, and the logging statement just assumes that torch.cuda.is_available() would have always been true.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
there's also this
so make sure that anywhere you have device = , it's locked in to being only cuda:0.
its just the path to the checkpoint
as you can see, the function returns true
how should that be possible, when the project has >>50 modules
I was not talking to you in that message.
ah mb
Any recommendations on how to learn python to lean into pandas?
W3school or DataCamp
worth the sub? Or just get a book instead
@loud plank use the kaggle pandas tutorial
don't use w3schools no matter what
lol
Kaggle is very good
I mean I need to start with a basic understanding of Python.
I learned my lesson trying to skip to pandas
every w3schools article is at least slightly incorrect. and sometimes just blatantly wrong. and there are so many resources that are actually good that there's no reason to settle for w3schools.
I wasnβt sure if a book or data camp was good
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
Iβve been looking at data camp for a long time now
If you're willing to make a financial commitment, I think it's also a good place to learn
Have you tried it yourself? If not what did you use?
Yeah when I was in my former company every staff was subscribed to DataCamp. And I like it. Although I remember some of my colleagues not having same experience. Some preffered DataQuest.
Iβll take a look at it. Thanks.
These days I learn new stuff mostly on YouTube or from colleagues at office
Yo guys, do you guys know any way to get access to the twitter api for free?
I know there are ways around it, but no idea how
if the official API requires that you pay, then you have to pay.
I tried a few modules like tweepy but they all want api creds with costs 5k a mo
It's a ridiclous price tho π
But fair enough
then you'll have to come up with a different project.
Is there a way to guarantee an RL model converges on the best possible score assuming is has the information needed and the score is possible at the cost of performance/time?
I don't believe it's possible to guarantee that any neural model will converge on the best possible set of weights.
Why
the best possible weights might be in some very small, obscure valley somewhere that's very far away from where you randomly initialize, and which is in a different direction from the one your training data pulls the model.
How to get around this?
you don't.
for most neural architectures, you can never be certain that the model you have is the best possible model
You can if it reaches the maximum calculated score
I also donβt understand how itβs possible for an agent to converge when there is a minimum random exploration chance. If the maximum score of an environment required 100 moves and the chance of random exploration is 1% it will only be a success about 37% of the time which is not converging (after infinite amounts of training). In this example the complexity of the environment (100) is very small and the minimum chance of random exploration is very low (1%). This example is generously in favor of convergence yet it doesnβt converge. So how do big complex models converge?
And in the case where there is not a minimum exploration chance the model is highly unlikely to find the optimal score before it no longer randomly explores
I wanna fetch a following list of a set list of X users
And run it as a py script, is that possible with those sources?
if you want data specifically from Twitter no, these are alternative platforms that follow mostly the same format but completely separately from twitter itself
Oh yeah that's fine, as long as I can do this
Hey guys
Hello and welcome to our wonderful data science and AI chat

There's a lot of interesting writing about this in Sutton & Barto's book.
You can't guarantee convergence to the optimal policy if your algorithm is off policy, uses a function approximator (e.g., a neural network) and uses bootstrapping (TD learning)
We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several alg...
If your problem's state action space is very simple you can sidestep the problem by using a tabular method instead of a function approximator
I got a question about linear regression and LSTM output's. If I have a predicted set the same size as the testing labels, with the predicted dataset as the coefficient and the testing labels as the dependent variable, is there anything useful I can extract out of using a linear regression model in that manor?
can anyone explain this? the classifier is KNN. the vectors are video-image embeddings: " For the classifier
training, we select 1000 query from the training data of VQAv2, for each query we run the GRiT model to extarct ground
truth clips. Then we label the each concatenated query-chunk
embedding vector as 1 if the chunk contains clips from ground
truth, other wise give a 0. Then we train KNN classifier on
this. After the KNN is trained, we test it on 1000 queries from
the validation samples from VQA-v2 dataset to report results."
KNN is trained to do what here?
@serene scaffold
idk what VQAv2 or GRiT are. or what a "query from the training data" is.