#data-science-and-ml


vagrant root
#

text

past meteor
#

If the 2000 data points are a representative sample, then it can work

#

There's studies using 5-10 data points out there

vagrant root
#

[38400] per sample

past meteor
#

The size is a concern but not everything

vagrant root
#

ok but in context

#

the data is 50 text columns

#

each with a bert encoding of 768

#

so [2000,50,768]

#

[2000,38400]

past meteor
#

Do you know what internal and external validity is?

vagrant root
vagrant root
#

yea pretty much

past meteor
vagrant root
#

yea it works great on validation set at only 200 epoch

past meteor
#

At the end of the day, this part isn't an exact science and you have to just make solid arguments

vagrant root
#

3 sets

#

1 is train 1500
test 500

validation after training 20

past meteor
#

The one with 20 is a concern

#

It's tiny

vagrant root
#

yeah but it is outside of training data

past meteor
#

Internal validity refers to the degree of confidence that the causal relationship being tested is trustworthy and not influenced by other factors or variables. External validity refers to the extent to which results from a study can be applied (generalized) to other situations, groups, or events.

#

This is a general concept in science / stats

#

You can apply it here

#

Can you generalize the results on 20 data points to other data?

past meteor
#

Then you should basically just look at your data

#

And say if you can or can't

vagrant root
#

i have a 2000 sample dataset
i test train on it

then when the model has learnt, i validate it externally on 20 samples as it would be if it were a product

#

the external 20 samples are not present in the 2000 sample dataset and cant be learnt or overfit

past meteor
vagrant root
past meteor
#

So the regular training cycle

#

You mean that 500 data points are used as a validation set for early stopping or so?

vagrant root
#

ok i guess im not clear

past meteor
#

Yeah, you're using a bit of non-standard lingo which makes it a bit hard to understand

#

But we'll get there

vagrant root
#

in this train is 1500
val is 500
test is 20
all samples are exclusive to their set

past meteor
#

Okay this is clear now

past meteor
#

Do you believe the result holds for other situations?

vagrant root
#

the data is self generated mostly and it is difficult to extract

past meteor
#

The larger the dataset the more confidence in terms of external validity
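A rough way to put numbers on the point about test-set size: the sketch below uses the normal-approximation interval for a proportion to compare how uncertain an accuracy estimate is on 20 samples versus 500. The observed accuracy of 0.90 and the helper name `accuracy_margin` are hypothetical, chosen only for illustration, and the approximation itself is crude at n = 20.

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for an accuracy measured on n samples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

observed = 0.90  # hypothetical accuracy, not a number from the conversation

margin_20 = accuracy_margin(observed, 20)    # the 20-sample external set
margin_500 = accuracy_margin(observed, 500)  # the 500-sample validation set

for n, margin in ((20, margin_20), (500, margin_500)):
    print(f"n={n}: {observed:.2f} +/- {margin:.3f}")
```

Under these assumptions the 20-sample estimate is uncertain by roughly 0.13 in either direction, against roughly 0.03 for 500 samples, which is one concrete reason the small set is a concern.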

vagrant root
past meteor
#

Then you should motivate why

vagrant root
#

its still less

#

how should i split the data

#

train/val/test

past meteor
#

That's all, you should motivate why you believe the results are valid

vagrant root
past meteor
#

Maybe the 20 data points are really representative of the population? Unlikely but possible 🙂

vagrant root
#

i dont want the test set to be larger than it is. am i right for thinking that?

past meteor
#

Actually look at those 20 data points

vagrant root
#

the 20 datapoint are very diverse in terms of the model

past meteor
#

Just look at them, qualitatively

#

And ask yourself if they're representative of your entire population

#

That gives you a basis to reason about external validity

vagrant root
#

hmm, ok ill do that later today

#

thanks 🙂

lapis sequoia
#

if some one wants data science 50 tb drive dm me

#

it contains a whole lot of cool stuff

wooden sail
ember pawn
#

yuh

lapis sequoia
#

this is not any kind of shady stuff

#

it is genuinely helpful

past meteor
#

Ok I'll bite, what's in the 50 tb drive?

wooden sail
#

inb4 "a single picture of your mom"

vagrant root
wild loom
#

hey so I finished training and testing my model but it still needs a lot of work I think. the AP (Average Precision) score it returns is 30, which seems quite bad, but when I test it against random images it works very well, and not in an overfitting sense, as the outlining for the image predictions of facial areas isn't super rigid but rather a little abstract

#

mostly for @vagrant root and @hearty depot

wild loom
#

or anyone if anyone can help with this 😂

vagrant root
#

Precision is 30 for which data?

#

test or val?

narrow tiger
#

yoo this is soo cool

deep sleet
#

when working with LSTMs does it treat every sequence independently, so if there's a pattern that is happening over several sequences it won't be able to capture it?

serene scaffold
#

if there's a pattern that exists consistently across training instances, the model is supposed to learn that. but the order in which the model sees each training instance shouldn't make a difference.

severe hare
deep sleet
serene scaffold
deep sleet
#

oh ok

severe hare
#

LSTMs are really only used for time series, so you'll be fine

deep sleet
#

What did you use for timeseries

deep sleet
#

Rn I am facing a data leakage issue so can't even evaluate it properly xd

severe hare
deep sleet
#

?

#

isn't that the tradingview programming lang?

severe hare
#

All the popular libraries in Python aren't that great compared to just Numpy/T-Flow

deep sleet
#

Noted but I don't get how you used pinescript instead of LSTMs

severe hare
#

Oh sorry um, there is kindof a lot of Time Series Analysis to do on time series data before you get to LSTMs,

#

While LSTMs are powerful, they can be complex and computationally expensive. Here are some simpler time series algorithms that might be suitable for your project:

  1. Simple Moving Average (SMA): Calculate the average value of the last n observations to forecast the next value.

Example: forecast = (sum(last_n_values) / n)

  2. Exponential Smoothing (ES): A weighted average of the latest observation and the previous forecast, with a smoothing factor alpha that reduces the impact of noise.

Example: forecast = alpha * (last_value) + (1 - alpha) * previous_forecast

  3. Autoregressive (AR): Model the current value as a linear combination of past values.

Example: forecast = a * last_value + b * last_last_value + ...

  4. Autoregressive Integrated Moving Average (ARIMA): A combination of AR, differencing (the "integrated" part), and a moving-average term on past forecast errors, which can handle non-stationarity.

Example: forecast = a * last_value + b * last_last_value + c * error

  5. Seasonal Decomposition: Break down the time series into trend, seasonality, and residuals using techniques like STL decomposition or classical seasonal decomposition.

Example: forecast = trend + seasonality + residuals

  6. Prophet: A simple and interpretable algorithm that models time series as a piecewise linear function with seasonal trends.

Example: forecast = piecewise_linear_function(trend) + seasonality

These algorithms are relatively easy to implement and can provide good results for simple time series forecasting tasks. However, keep in mind that they may not perform as well as LSTMs on more complex or non-linear time series data.

Remember to evaluate your model's performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Mean Absolute Percentage Error (MAPE) to determine its effectiveness.
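The first two formulas in the list above can be sketched in plain Python. This is a minimal illustration rather than a production forecaster: the function names, the sample series, and the choice to seed the smoothed forecast with the first observation are all assumptions made for the example.

```python
def sma_forecast(values, n):
    """Simple moving average: mean of the last n observations."""
    window = values[-n:]
    return sum(window) / len(window)

def exp_smoothing_forecast(values, alpha):
    """Simple exponential smoothing: blend each new observation with the running forecast."""
    forecast = values[0]  # assumption: start from the first observation
    for value in values[1:]:
        forecast = alpha * value + (1 - alpha) * forecast
    return forecast

series = [10.0, 12.0, 11.0, 13.0, 14.0]  # made-up data

sma = sma_forecast(series, 3)            # mean of 11, 13, 14
ses = exp_smoothing_forecast(series, 0.5)
print(sma, ses)
```

A larger `alpha` makes the smoothed forecast follow recent values more closely, while a smaller one averages over a longer history.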

#

^ except this is kindof wrong because you need a fully functioning LSTM before you can feed that (the LSTM) to an ARIMA model.

#

The ARIMA model with the added auto-correlation test is: what? Who knows?

#

Anyone..?

deep sleet
#

oh

deep sleet
left tartan
left tartan
severe hare
#

The origin of the algorithms

#

George Udny Yule, CBE, FRS (18 February 1871 – 26 June 1951), usually known as Udny Yule, was a British statistician, particularly known for the Yule distribution and proposing the preferential attachment model for random graphs.

past meteor
deep sleet
#

sorry wasn't familiar with the term , had to google it

past meteor
#

It can find patterns that occur across different series yes. The way you should view any kind of recurrent neural network is that you have a latent variable that is a "summary" of all that happened in previous timesteps

#

This is also the case for multivariate series

deep sleet
#

So what difference does the size of the sequence make?

amber sequoia
#

hi. Is there a way to make Pandas treat absent row header values in CSV file like they belong to the previous row header value instead of making it a new NaN header?

what I mean is basically:

data looks like this in Excel:

parameter1 parameter2  2010 2011 2012
A          B           foo  foo  foo
           C           foo  foo  foo
           D           foo  foo  foo
M          N           bar  bar  bar
           O           bar  bar  bar

but after exporting to CSV and importing to Pandas it looks like:

            2010  2011  2012
A      B    foo   foo   foo
NaN    C    foo   foo   foo
       D    foo   foo   foo
M      N    bar   bar   bar
NaN    O    bar   bar   bar
past meteor
autumn heron
#

Hello guys, sorry to interrupt but is it better to start learning matplotlib/pandas/numpy/scipy along with linear algebra/calculus? or only math first?

past meteor
# deep sleet the length

If the sequence is very long you run the risk of the latent variable not being able to "remember" what happened in the beginning, hence why LSTMs are used over vanilla RNNs. At least that's some of the intuition.

serene scaffold
past meteor
#

(Aside from vanishing gradients)

deep sleet
#

oh

#

What is vanishing gradients?

autumn heron
#

on Coursera

serene scaffold
autumn heron
#

Hm

past meteor
autumn heron
#

I see, thank you

past meteor
#

People selling the courses summarize that and sell it to you

autumn heron
#

Also like, when to start andrew ng course?

serene scaffold
#

pandas is probably the best documented library in all of python. (and it better be, because it would be incomprehensible otherwise.)

autumn heron
#

What are the pre requisites

vagrant root
past meteor
autumn heron
#

So numpy would give me a better understanding?

narrow tiger
#
```
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
```
what does this comment here mean (it is from chromadb docs)
and what does tokenization mean in llm context
past meteor
serene scaffold
deep sleet
past meteor
serene scaffold
#

!e

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print('This is going to do element-wise addition without a for loop.')
print(arr1 + arr2)
arctic wedgeBOT
autumn heron
#

Also does andrew ng cover the math required?

vagrant root
autumn heron
#

I'm currently watching gilbert strang but like its super long

#

I feel like a lot of it isn't actually necessary?

past meteor
serene scaffold
autumn heron
severe hare
#

multiplies 2 arrays

deep sleet
serene scaffold
#

!e

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print('This is going to do element-wise multiplication without a for loop.')
print(arr1 * arr2)
arctic wedgeBOT
autumn heron
amber sequoia
# vagrant root What do you want to do?

basically I have tabular data that has column headers, but also has row headers. The row headers have 2 levels, you can think of them of as main_group, and sub_group:

                     X     Y      Z
main_group sub_group 
A          foo
           bar
           foobar
B          asdf
           qwert
           uiop

When I export this tabular data to CSV and import it back into Pandas, instead of the above structure I get additional NaN headers in main_group. It seems that pandas' read_csv treats empty CSV values in this column as NaN, even though the column is specified via index_col.
Basically I need to recover the original structure in pandas, meaning that it knows that bar and foobar also belong to the A main_group and not to some NaN main group, which pandas seems to produce when reading the CSV.
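One common way to recover that structure is to read the CSV without an index first, forward-fill the group column, and only then build the two-level index. The sketch below assumes the export leaves the repeated `main_group` cells empty; the CSV text and the numeric values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV export: merged Excel cells come out as empty fields.
csv_text = """main_group,sub_group,X,Y,Z
A,foo,1,2,3
,bar,4,5,6
,foobar,7,8,9
B,asdf,10,11,12
,qwert,13,14,15
,uiop,16,17,18
"""

df = pd.read_csv(io.StringIO(csv_text))        # read without index_col first
df["main_group"] = df["main_group"].ffill()    # copy the last seen label downwards
df = df.set_index(["main_group", "sub_group"])  # rebuild the two-level row index

print(df.loc["A"])  # rows foo, bar, foobar
```

Forward-filling is only safe when an empty cell always means "same as the row above"; if a group label can be truly missing, the fill would assign it to the wrong group.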

serene scaffold
autumn heron
#

there is other way to do dot(?) product

serene scaffold
#

and there's np.dot(a, b)

autumn heron
#

im familar with the latter

#

how do you use @

serene scaffold
#

same as *

autumn heron
#

arr1@arr2?

serene scaffold
#

yes

autumn heron
#

it shouldn't work right? (3x1 3x1)

serene scaffold
#

but the two arrays have to be valid for a matmul
so the shapes have to be (a, b), (b, c)

#

(a can equal c)

autumn heron
#

yes but what is that output

#

32, why does it show no error

narrow tiger
autumn heron
serene scaffold
#

it might have treated the arrays as shapes (1, 3) and (3, 1)

#

which would reduce to an array of shape (1, 1), which is effectively a scalar.
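A quick check of what happens here, using the same arrays as the earlier snippets. The arrays are 1-D with shape `(3,)`, not `(3, 1)`; for two 1-D operands `@` computes the inner product and returns a scalar, while explicit column vectors do raise an error.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Both operands are 1-D, so @ returns the inner product: 1*4 + 2*5 + 3*6.
result = arr1 @ arr2
print(result, np.shape(result))

# Explicit (3, 1) column vectors are not compatible with each other.
col1 = arr1.reshape(3, 1)
col2 = arr2.reshape(3, 1)
try:
    col1 @ col2
    raised = False
except ValueError as exc:
    raised = True
    print("shape mismatch:", exc)

# Transposing the first operand gives (1, 3) @ (3, 1) -> (1, 1).
outer_ok = col1.T @ col2
print(outer_ok.shape)
```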

autumn heron
#

without informing us

#

how can it transpose without any notification

#

i feel like this would cause huge problems somehow

narrow tiger
#

np does a lot of things very differently lel
you won't even be able to spot where the error is coming from

autumn heron
#

hm

narrow tiger
#

same for pandas

autumn heron
#

is there a verification method

#

like to check whether you can multiply 2 matrices

narrow tiger
#

write tests? manually

autumn heron
#

so we have to define our own function to do that

#

can we get the order of a matrix using np
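NumPy exposes the dimensions of an array through its `.shape` attribute, so a compatibility check is a one-liner. The helper name `can_matmul` below is hypothetical and only covers the plain 2-D case.

```python
import numpy as np

def can_matmul(a, b):
    """True when 2-D arrays a and b have shapes (m, k) and (k, n)."""
    return a.ndim == 2 and b.ndim == 2 and a.shape[1] == b.shape[0]

a = np.ones((2, 3))
b = np.ones((3, 4))

print(a.shape, b.shape)   # the "order" of each matrix
print(can_matmul(a, b))   # inner dimensions agree: 3 and 3
print(can_matmul(b, a))   # inner dimensions differ: 4 and 2
```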

narrow tiger
autumn heron
#

I really wanna watch this but I don't know if its for beginners

autumn heron
narrow tiger
autumn heron
#

im not sure

narrow tiger
#

i think it is for beginners

autumn heron
#

i watched the 'essence of calculus' and it was really good

narrow tiger
#

that really helped me at least and i am very much a beginner

autumn heron
#

hm

#

I have not started multivariable calc at all 😨

#

3b1b has videos on it on khanacademy but idk how much of it is necessary

vagrant root
unkempt apex
severe hare
#

OpenGL?

unkempt apex
#

pygame!

severe hare
#

mm

autumn heron
#

is that pong ai

unkempt apex
#

yeah

vagrant root
autumn heron
#

awesome

unkempt apex
#

yeah!

autumn heron
#

how long did it train

unkempt apex
vagrant root
#

based

unkempt apex
autumn heron
#

hm

vagrant root
unkempt apex
#

I can train it on higher!

#

because I was just finding correct hyperparameters

vagrant root
#

does it completely crumble?

unkempt apex
#

crumble?

#

I just trained it on 220k and tested for 2 minutes

#

so don't know need to test more

vagrant root
#

oh

unkempt apex
#

current speed is 4 pixels!

#

and dim are 800x400

severe hare
#

Could add a calculation for possible deferred velocity after bounce so it slows down and speeds up randomly.

#

-or not randomly

#

decision tree, binary struct, whatever

unkempt apex
#

yeah but this, the model will train , just need more episodes I think!

severe hare
#

Looks good man.

unkempt apex
#

yeah thanks!

amber sequoia
deep sleet
unkempt apex
#

Dont go directly to RL , it literally took me 4 weeks to fully understand

deep sleet
#

Oh okay xdd

unkempt apex
#

I know what u are doing that's why I told you this

#

Just try some algo like you are doing then move to this!

severe hare
#

Reinforcement Learning and Deep Learning are their own entire industries, or will be very soon. Lot of applications and typically the work of an organized department; not one person. So well done. RL is very useful thing.

unkempt apex
#

First appreciation by you , ohh God !!! Thanks !

rich moth
#

I fixed all the errors just to run into this during the eval stage. smh..```Epoch 1/3: 100%|█████████████████████████████████████████████| 351/351 [6:02:06<00:00, 61.90s/batch, Batch Loss=3.48e-5]
Evaluation: 0%| | 0/88 [00:00<?, ?it/s]Input to VideoEncoder: batch_size=16, num_frames=16, channels=3, height=128, width=128
After view reshape: torch.Size([16, 48, 128, 128])
After conv2d_layers: torch.Size([16, 512, 128, 128])
After view reshape before fc: torch.Size([16, 8388608])
Input to VideoDecoder: torch.Size([16, 512])
After fc layer: torch.Size([16, 131072])
After view reshape: torch.Size([16, 512, 16, 16])
After conv_reduce: torch.Size([16, 512, 16, 16])
After conv2d_transpose_layers: torch.Size([16, 48, 128, 128])
Channels: 3, Expected size: 12582912, Actual size: 12582912
Final output shape: torch.Size([16, 16, 3, 128, 128])
Evaluation: 0%| | 0/88 [00:00<?, ?it/s]
Error in epoch 1: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Epoch 1/3: 0%| | 0/351 [00:00<?, ?batch/s]
Error in epoch 2: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Epoch 1/3: 0%| | 0/351 [00:00<?, ?batch/s]
Error in epoch 3: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.```
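The shapes printed in the log allow a back-of-envelope memory estimate. The sketch below is plain arithmetic and assumes float32 tensors. It also assumes that the step from `[16, 8388608]` to `[16, 512]` is a single dense layer, which the log does not confirm since it shows shapes only. Separately, running evaluation under `torch.no_grad()` is a common way to stop activations being kept for backpropagation.

```python
BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3

# Shapes taken from the log: conv output is [16, 512, 128, 128].
batch, channels, height, width = 16, 512, 128, 128
flat_features = channels * height * width  # matches the 8388608 "before fc"
latent = 512

# Memory for one copy of the conv output for the whole batch.
conv_activation_gib = batch * flat_features * BYTES_PER_FLOAT32 / GIB

# Assumption: the fc layer is one dense weight matrix of shape (flat_features, latent).
fc_weight_gib = flat_features * latent * BYTES_PER_FLOAT32 / GIB

print(f"conv activation per batch: {conv_activation_gib} GiB")
print(f"hypothetical fc weights:   {fc_weight_gib} GiB")
```

If the dense-layer assumption holds, the flattened feature size is the dominant cost, and shrinking the feature map before flattening would help far more than lowering the batch size.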

glad harness
#

Hey can anyone help me with tensorflow and keras error in my project ?

ionic valley
#

basically,

  • lasso only avoids multicollinearity if a large number of the attributes are already linearly independent
  • I know nothing about lasso as a "maximum a posteriori estimator" following a "laplace distribution," but the point is that lasso does not eliminate the need for VIF analysis
  • adding features in excess is still bad, even for lasso, because lasso still performs poorly assuming low kruskal rank

correct?

wooden sail
#

yes, except i didn't get what you meant by "attributes" in the first point

glad harness
#

help me plz

harsh sun
#

Are CNNs just normal neural networks besides how the initial data is prepped for input? Cause it just seems like CNNs are about how the initial image is separated and condensed and downscaled for higher performance with regard to the NN. Is that correct?

wooden sail
#

the main difference is not that, it's that the convolution operation contains extra information. nowadays we call this "model-based machine learning"

harsh sun
wooden sail
#

a convolution has fewer parameters than a regular matrix multiplication, so it's easier to train. it also has the nice property of "spatial invariance", which is often what you want when processing images. these two things together give the cnn its power

#

what do you mean by "condensed version of aspects of the image"

harsh sun
#

you have the input image which is condensed into the convolution

wooden sail
#

tell me in words, i still don't know what you mean

harsh sun
#

and then condensed even more into the pooling

harsh sun
# wooden sail tell me in words, i still don't know what you mean

you have the original image. then the image itself is condensed when you apply the filter because it does the dot product between the values specified in the dimension of the filter. so that itself is inherently smaller. then that data that is produced is pooled. which is condensed even more because it takes the max value out of a part of that convolution (I think thats the term), which condenses the data even more. so then you have these pools which are now very condensed relative to the original CNN. then it gets fed into the NN for classification.

#

thats what I mean by condensed

wooden sail
#

nothing about convolutions inherently yields smaller dimensions

#

in fact, the standard definition of convolution yields a larger array than both the original image and the filter

#

pooling is also a separate operation and you can build CNNs without it, but you can roughly interpret it as keeping the most "salient" results in that they're the ones with largest magnitude

harsh sun
wooden sail
#

that's not how convolutions work

harsh sun
#

oh

wooden sail
#

convolutions are equivalent to matrix multiplication with a toeplitz matrix

#

you can vectorize your image, turn the convolution kernel into a huge toeplitz matrix, and multiply the two. you get an output the same size as the original image

#

the matrix, however, has few unique entries and is spatially invariant
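The Toeplitz view can be checked numerically. The sketch below uses a 1-D signal for brevity (for 2-D images the matrix becomes block-Toeplitz); `conv_matrix` is a hypothetical helper and the signal and kernel values are made up.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Build the (n + k - 1) x n Toeplitz matrix T with T @ x == np.convolve(x, kernel)."""
    k = len(kernel)
    T = np.zeros((n + k - 1, n))
    for col in range(n):
        T[col:col + k, col] = kernel  # every column is the same kernel, shifted down by one
    return T

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

T = conv_matrix(kernel, len(x))
full = T @ x                                # "full" convolution, longer than the input
offset = (len(kernel) - 1) // 2
same = full[offset:offset + len(x)]         # cropped back to the input length

print(T.shape, np.unique(T).size)           # many entries, few distinct values
print(np.allclose(full, np.convolve(x, kernel)))
print(np.allclose(same, np.convolve(x, kernel, mode="same")))
```

The matrix has 24 entries but only three distinct values (the kernel weights 1 and -1, plus zeros), which is the parameter sharing described above.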

harsh sun
#

oh

tidal bough
#

in image processing (e.g. in CNNs) convolutions are often done with a stride larger than 1, in which case it does make the result smaller
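The effect of stride on output size follows the usual formula floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p and stride s (no dilation). The function name below is hypothetical, and 128 is used only because it matches a common image size.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension for a convolution without dilation."""
    return (n + 2 * padding - kernel) // stride + 1

same_size = conv_output_size(128, kernel=3, stride=1, padding=1)  # padding keeps the size
halved = conv_output_size(128, kernel=3, stride=2, padding=1)     # stride 2 roughly halves it
shrunk = conv_output_size(128, kernel=3, stride=1, padding=0)     # no padding trims the border

print(same_size, halved, shrunk)
```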

harsh sun
#

alright ty

wooden sail
#

the pooling part and setting up your convolutions to reduce dimensions does have the effect of projecting onto a lower dimensional vector space, maybe that's what you meant by "condensing"

#

"bottlenecking" the network so that the input is represented by a small vector. you can do that without CNNs though so i would treat that as a separate concept

lapis sequoia
#

Hi,
Hope u are doing well,
I am working on time series forecasting using multiple models (CNN-LSTM-Attention, CNN-LSTM, GRU-Attention, N-BEATS, ARIMA, Prophet). The first three algorithms produce good results compared to the last two, but when plotting the curves I noticed that the model is just shifting the last point of the input and using it as the output, which means the models didn't really learn. Is there any solution to this problem?

mild dirge
#

Instead of feeding the true value as input, feed the previous output of the model, and see what the plot looks like @lapis sequoia

lapis sequoia
mild dirge
#
model_predictions = []
x = ...  # the value at t=0
while ...:
  y = model(x)
  model_predictions.append(y)
  x = y
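A complementary check for the "shifted copy" symptom is to score the model against a naive persistence baseline that always predicts the previous value. The series and the lagging `model_pred` below are made up to show what a copying model looks like; they are not results from the models discussed.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

series = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0])  # made-up data

y_true = series[1:]        # targets: the value at t
naive_pred = series[:-1]   # persistence baseline: the value at t-1

# Hypothetical model output that mostly echoes the previous value.
model_pred = naive_pred + 0.1

baseline_mae = mae(y_true, naive_pred)
model_mae = mae(y_true, model_pred)
copy_gap = mae(model_pred, naive_pred)  # how far the model is from pure copying

print(baseline_mae, model_mae, copy_gap)
```

If the model's error is no better than the baseline and `copy_gap` is close to zero, the model has probably learned little beyond repeating its last input.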
ionic valley
#
# Reciprocate the sub count 
dislikes['uploader_sub_count_recip'] = 1 / (dislikes['uploader_sub_count'] + 1)

np.isinf(dislikes['uploader_sub_count_recip']).sum() #64000

May be a dumb question, but why am I getting infinite values when applying f(x) = 1/(x+1) to my column? Uploader subscriber counts are integer values >= 0.

#

nvm, I just found out that 64000 observations somehow had a subscriber count of -1

#

and of course f(-1) = 1 / (-1 + 1) = 1 / 0 so it checks out
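The same effect can be reproduced on a tiny array, along with one way to guard against it. The values are made up, and treating `-1` as a "missing" sentinel is an assumption about what those 64000 rows represent.

```python
import numpy as np

sub_counts = np.array([0, 9, -1, 99])  # made-up counts; -1 assumed to mean "unknown"

# Naive transform: -1 + 1 == 0, so the reciprocal becomes inf.
with np.errstate(divide="ignore"):
    recip = 1 / (sub_counts + 1)
n_inf = int(np.isinf(recip).sum())

# Guarded transform: only transform valid counts and leave the rest as NaN.
valid = sub_counts >= 0
safe_recip = np.full(sub_counts.shape, np.nan)
safe_recip[valid] = 1 / (sub_counts[valid] + 1)

print(n_inf, safe_recip)
```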

harsh sun
#

so, the filters are produced from the training to take apart features that are then sent to the rest of the nn which dont use filters, but instead standard neural networks to process those high level features?

deep sleet
#

I think I found the issue with how the results were too good

#

it wasn't really data leakage but I was showing it the test data at once so it wasn't really predicting but marking up the patterns

#

somehow I have to make it view it one candle at a time and see if there's a viable trade or not then take it

alpine nest
#

what is the best way to get into data science as a high school student? I am just stuck following tutorials but they don't really help much. Any tips?

drifting mango
alpine nest
#

yeah i guess that's what you gotta do

left tartan
#

'Don't really help much': how so? Tell us more?

alpine nest
#

well i have a decent understanding of python and different data structures etc. I also have watched seminars, and read papers on neural networks, even got a copy of the nnfs book for free from github. But i don't know where i can take it from there. Obviously im not looking for a job at the moment but just wondering who, what and how? Thanks.

#

Yeah

unborn sapphire
#

how to install torchviz in conda ?

left tartan
left tartan
left tartan
alpine nest
#

oh ok

#

i'll have a look

hollow sentinel
#

write a project charter

#

learn the soft skills/PM side of it too

#

at least that's what i'd do

alpine nest
hollow sentinel
#

ah, perfect.

alpine nest
#

yeah

hollow sentinel
#

nah i was just making sure.

alpine nest
#

mhm

hollow sentinel
#

but good stuff dude!

alpine nest
#

thanks

lucid tide
#

whats the best activation method for a transformer based model ReLU, SwiLU, GeLU or GeGLU

alpine nest
hollow sentinel
alpine nest
#

oh yeah

#

its neural networks from scratch

#

here is a link to the book of it

#

free pdf

cedar tusk
#

anyone here tried positron ide?

#

i wanted to ask if R implementation is as good as r studio?

#

with the column name autocompletes and such

toxic mortar
#

Anybody using DataSpell? Why is my jupyter output so wieeerd

cedar tusk
#

yep its the viridis

toxic mortar
#

U mean this?

cedar tusk
#

can u delete "cmap='viridis'"

#

and try that way?

toxic mortar
#

this is the same file opened in vs

cedar tusk
#

yep that is the correct viridis palette

#

but let us try, delete the argument from 5th line in cell

toxic mortar
#

Either it is IDE specific setting or jupyter

cedar tusk
#

i think the issue is the ide converts the image to the negative color values

#

look at the background, its now black since the color is converted

#

there should be an option to disable this

#

try this

toxic mortar
#

Thanks man really helped me 😄

cedar tusk
#

np, i really dont like intellij

#

vscode for the win

#

or rstudio

toxic mortar
#

I wanted to test the professional jetbrains products since I've received them free as student

#

Mixed feeling tbh

ember pawn
#

hello
has anyone done the andrew ng CNN course i wanted to ask some things

#

i am getting this error i have no idea what it is ???

serene grail
#

Do you understand how assert statements work? I haven't done the course so I'm not sure if he teaches them or not
https://realpython.com/python-assert-statement/

In this tutorial, you'll learn how to use Python's assert statement to document, debug, and test code in development. You'll learn how assertions might be disabled in production code, so you shouldn't use them to validate data. You'll also learn about a few common pitfalls of assertions in Python.

serene scaffold
#

@ember pawn that error message means that your code does not pass the tests. The error message gives you a hint for how you can change the code.

cedar tusk
#

i honestly feel like tensorflow is too unintuitive, i like pytorch more (a lot)

ember pawn
#

🤡
something is wrong with this
i submitted and i got 100/100 and every other function works idk what is this error

cedar tusk
#

something changed between the versions of the packages and now it ain't working

#

shows the course is outdated

#

a little

ember pawn
#

idk
it works honestly lost my mind with it ahahha i will do the next assignment

toxic mortar
#

In my project, I developed a pretty good RandomForestClassifier model that's giving me great results. I have a dataset with 20k labeled records, and I also have around 200k more unlabeled ones. Should I use my current model to classify the remaining 200k unlabeled records to create some baseline labels, which would give me more labeled data to build an even better model? Or should I stop here? What is your experience with it? Thanks 😄

nova matrix
#

Hi everyone
I was planning to do a classification task where an entire dataset ( has many measurements ) has one label (positive negative) and i have many of these datasets around a 100.
Any ideas on how to work through this or if anyone has experience with such a dataset

deep sleet
#

Any good resources on deep learning ? I been looking through random stuff online and looking for a more structured approach

nova matrix
hallow sphinx
#

What order should I study for ML (Which order is most efficient)?

Linear algebra
Calculus
Probability & Statistics

lapis sequoia
hallow sphinx
lapis sequoia
#

Depends, I just heard it from a youtube video.

hallow sphinx
#

mhmm right

unkempt apex
#

but for basics like vectors, derivatives, partial and all

unkempt apex
hallow sphinx
#

I am asking, did you make your own models, or did you used APIs?

unkempt apex
river cape
#

HI guys

#

I was working on the mnist classification data

#

then I saw this line

#

model.predict(X_test[12].reshape(1,28,28)).argmax(axis=1)

#

Why do we need to reshape the X_test ? Isnt it already in the format of (1,28,28)?

mild dirge
#

So X_test will be of shape (nr_samples, 28, 28) or (nr_samples, 784) I assume @river cape

#

if you do X_test[i], you will get (28, 28) or (784,) but a model will always require shape (batch_size, 28, 28)

#

So you need to make it a batch, with a size of 1, and this you do with reshaping

#

You could think of it like this: The model wants a list of samples, but you only give it a single sample, so reshaping makes it a single element list.
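The shape bookkeeping can be seen without a model. The array below is a zero-filled stand-in for the MNIST test images (an assumption about the layout, here 100 samples of 28x28), and it shows two alternatives to `reshape` that give the same batch of one.

```python
import numpy as np

X_test = np.zeros((100, 28, 28))  # stand-in for the real test images

single = X_test[12]                        # indexing drops the sample axis
batch_a = single.reshape(1, 28, 28)        # what the line in question does
batch_b = X_test[12:13]                    # slicing keeps the sample axis
batch_c = np.expand_dims(single, axis=0)   # adds a leading axis of length 1

print(single.shape, batch_a.shape, batch_b.shape, batch_c.shape)
```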

river cape
mild dirge
#

Jup, a list with 1 grayscale image that is 28x28

#

But then a tensor of course (not a Python list)

#

The list is just an analogy

hallow sphinx
unkempt apex
#

do both , as per usecases -_-

haughty cradle
#

does lstm model have catastrophic forgetting issue?

#

also what exactly is changed on the NN for continual learning

small wedge
uneven locust
#

Hey mates, we are a team building an AI learning platform:
https://cone.ai
Need insights and reviews for it. Can you please check and provide me with your feedback or suggest something innovative you want in any learning platform...

worldly dawn
#

at least, not that blatant

uneven locust
bright scroll
#

hey guys! i wanted to create a telegram bot to which i could send photo and it would recognise from photos of db. but im facing troubles with converting photos. pls is there anyone who could share some repos??

stuck flax
#

Hello, do you see any problem with this sorting algorithm?

def grow_buble():
    global test_list, loop
    for index, item in enumerate(test_list):
        try:
            test_list[index + loop]
            test_list[index + 1]
            pass
        except IndexError:
            break
        if test_list[index] > test_list[index + loop]:
            test_list[index], test_list[index + loop] = test_list[index + loop], test_list[index]
        if test_list[index] > test_list[index + 1]:
            test_list[index], test_list[index + 1] = test_list[index + 1], test_list[index]

peak ridge
#

all this ML is so confusing

deep sleet
loud violet
#

hi guys , does any one here has experience with sdmx api ?

cedar tusk
#

the speed is just luxury

cedar tusk
#

what is it

#

need more context to see

peak ridge
deep sleet
#

Yeah

peak ridge
deep sleet
#

about 3 weeks ago

#

not super consistent tho

peak ridge
peak ridge
deep sleet
peak ridge
deep sleet
deep sleet
peak ridge
#

im learning it all

#

for a big reason
a cause

a mission

#

a project (already been developed from 5 months by few ppl)

#

what's the goto way u r following @deep sleet

deep sleet
#

rn I am reading a mathematics book that is in the pinned resources and working on a forex ai project for fun to learn more about neural networks

deep sleet
#

may I know what is it?

peak ridge
#

actually its based on GenAI

#

we were successful to build the thing

#

but 3 months down the line working on genAI i understood

I could go from A to D or E F maybe without ML and stuff

#

but the best way is to go learn ml then deep learning then some nlp into it
then go learn gen ai

peak ridge
deep sleet
#

ohh

peak ridge
#

yes
complex shit

#

but cool

deep sleet
#

Yeah I can barely imagine xdd

peak ridge
#

so

#

ml is to start

#

how do u do ml @deep sleet

deep sleet
#

There was a course on the basics of sci kit learn and ml

#

gimme a sec

#

This course is a practical and hands-on introduction to Machine Learning with Python and Scikit-Learn for beginners with basic knowledge of Python and statistics.

It is designed and taught by Aakash N S, CEO and co-founder of Jovian. Check out their YouTube channel here: https://youtube.com/@jovianhq

We'll start with the basics of machine lear...

peak ridge
#

why youtube tutorial

#

kill the boy, be the man

spring field
peak ridge
past meteor
peak ridge
toxic mortar
#

Why do I get full report here

#

and here not?

past meteor
#

So just read that, compared to watching a video of someone that just read the docs

toxic mortar
#

where's micro avg

peak ridge
#

im just trying to learn
and m unable to learn

#

xD

deep sleet
#

and applies it with projects

past meteor
#

Well, you can certainly do what you want to do

#

but videos give the illusion of learning

#

There are way more effective ways, for instance doing specific kaggle competitions yourself individually

#

and then reading top performing solutions

peak ridge
past meteor
#

Reading + doing are way more "active learning" compared to watching videos, it's very passive and lets you zone out

#

And once you finished the 10h video you're like "okay I learnt x, y and z" when it's not true at all 😅

spring field
#

practice, practice, practice

past meteor
deep sleet
#

Makes sense, Will do that!

peak ridge
#

@deep sleet

#

carry me along

deep sleet
#

okay xD

peak ridge
#

yes

#

that's how we play

past meteor
peak ridge
#

done

#

next?

past meteor
#

It's not done

#

read it first and then ask me

spring field
toxic mortar
#

Vectorizer is the only difference

#

And as far as I know for the confusion matrix param it does not influence

past meteor
toxic mortar
#

Also this is from scikit docs

spring field
toxic mortar
#

Aaaaaaa yeye

#

makes sense. thank u very much

spring field
past meteor
#

Ah, I linked the wrong one

#

2nd ed is a better choice

wintry grail
#

Anybody who has worked/working on LDA and topic modelling ?

strange cradle
#

Hi, does this channel also cover the less advanced topic of Data Analysis (Streamlit, Pandas, etc.)? I didn't see any in the comments above and it's pretty huge part of Python.

serene scaffold
haughty cradle
#

is it a smart idea to try to build transformer in mid-low spec personal pc?

#

i have 8gb ram total, i7, RTX2000 something

#

or i should just accept my spec limitation and give up on making transformer?

strange cradle
#

Has there ever been talk about doing a 'Data Jam', similar to 'Code Jam'?

serene grail
strange cradle
odd meteor
strange cradle
serene grail
#

I haven't heard anything about a "Data Jam" like that, would love that kind of thing myself

serene scaffold
strange cradle
odd meteor
# haughty cradle or i should just accept my spec limitation and give up on making transformer?

The answer is, it depends. It depends on what you wanna do. Do you wanna train a model with transformer or finetune, or?

If your GPU (the RTX 2000-something) has VRAM >= 12GB, then I think you're good to go, so long as what you wanna do isn't beyond your GPU card.

I usually recommend using RTX 3060 which has 12GB VRAM or the RTX A4000 which has 16GB VRAM.

Anything beyond what these cards can handle (e.g. task that requires A6000, RTX 4000 series, A100s) is gonna be an overkill for you to attempt that on your pc (instead, rent a GPU online)

When it comes to RAM The usual rule of thumb is 2x your VRAM, though I think 16GB - 32GB of RAM is probably okay.

haughty cradle
strange cradle
#

Can anyone share a data source for automobile 'registrations'? It's more definitive than 'auto sales' (like for looking into the details of Teslas sold, which they don't provide).

nova matrix
#

anyone worked on a classification task where we gotta classify datasets as 0 or 1 instead of a row in the dataset

mild dirge
#

The dataset contains an irregular number of data samples (and all data samples have the same shape over all datasets)?

#

@nova matrix

left tartan
strange cradle
harsh sun
#

Do neural networks also discern patterns? From what I can tell there are just a bunch of computations that are reliant on each other and changing the variables of all of the compounded calculations to the least error is what training does. Is that inherently finding patterns?

serene scaffold
harsh sun
serene scaffold
harsh sun
serene scaffold
harsh sun
#

Highest layer there

serene scaffold
# harsh sun

looks like those represent the outputs of the convolutional layers

harsh sun
#

a fully connected layer I think

odd meteor
#

A workaround might be using mixed precision, reducing batch size, or using PEFT techniques like LoRA.

Or better still, just use Colab, Kaggle's free tier GPU, or rent from companies like AWS, SaturnCloud, or Vast etc.

haughty cradle
spring field
fiery stump
#

tryin to make a text detector from random text, and have to generate a 128GiB text file

#

why do i do this to myself (;_;)

fiery stump
#

the file is now 26GB in size

#

i still have over 100,000MiB to generate

storm valve
#

Show your code?

fiery stump
serene scaffold
arctic wedgeBOT
#
Formatting code on Discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold
#

It's easier for everyone when you give code as text.

fiery stump
#
import random
import string
k = 131072  # Size of the file you want to generate in MiB. Warning: going past 1024 can cause issues.

# Define characters to choose from
characters = string.ascii_lowercase
# Define the file path
file_path = 'E:/file.txt'
# Generate random text
while k > 0:
    random_text = ''.join(random.choices(characters, k=1048576))
    k = k - 1
    print(str(k) + "MiB left to generate")
    with open(file_path, 'a') as file:
        file.write(random_text)

print(f"Random text has been generated and saved to {file_path}.")
#

^ here

nova matrix
mild grotto
#

Hey, I'm having some difficulty figuring out how to optimize this:

def setZdepth(self, depth):
    self.depth[2] = depth
    self.arrZ = cp.arange(self.shape.N()) // (self.shape.Nx * self.shape.Ny) % self.shape.Nz == depth

def viewZ(self, data):
    return data[self.arrZ]

This viewZ function takes almost all the time of my program, presumably because this slicing operation is really slow... There has to be a better way!!

#

(This provides a 2D slice of a 3D block of data)

#

Also I'm using cProfiler, so I'm guessing maybe it's misattributing the time to the wrong function
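
One thing that might be worth trying, assuming `data` is a flat array of length Nx*Ny*Nz with x varying fastest and z slowest (which is what the `arange(...)//(Nx*Ny)%Nz` mask implies): reshaping and indexing picks the same elements without building or applying a boolean mask. The sketch uses NumPy so it runs without a GPU; CuPy mirrors `reshape` and basic indexing. The sizes are made up for illustration.

```python
import numpy as np

Nx, Ny, Nz = 4, 3, 5          # made-up sizes, not from the original program
depth = 2
data = np.arange(Nx * Ny * Nz, dtype=np.float32)   # stand-in for the flat 3D block

# Boolean-mask version, as in the message above (copies the selected elements)
mask = np.arange(data.size) // (Nx * Ny) % Nz == depth
via_mask = data[mask]

# Reshape version: same elements, but basic indexing returns a view (no copy)
via_view = data.reshape(Nz, Ny * Nx)[depth]

print(np.array_equal(via_mask, via_view))   # same values
print(np.shares_memory(via_view, data))     # the slice is a view of data
```

Whether this is the real bottleneck would still need to be confirmed by profiling, since GPU calls can be attributed to the wrong function.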

serene scaffold
#

@mild grotto this is with cupy, or what?

#

What is cp?

mild grotto
#

cupy

#

it's the standard abbreviation for cupy

fiery stump
#

finally generated my insanely large text file, now time to do some science with it
⬇️

serene scaffold
serene scaffold
#

and how large is your ram?

fiery stump
#

random lowercase characters a through z

serene scaffold
#

how is that interesting?

fiery stump
#

i'm trying to see if i can find words in pure randomness

#

it's not for a summer course or anything, i'm just bored and have nothing to do

mild grotto
#

if you were curious about it

fiery stump
#

eh true

#

but just the sight of a 134GB text file is so cool for some reason

mild grotto
#

it's pretty funny 🙂

fiery stump
#

also, i got a 4TB external SSD for my birthday and needed something to use it for

mild grotto
#

nice

serene scaffold
#

I'm not sure I'd consider this data science, but if it interests you and motivates you to practice programming, I guess that's cool

fiery stump
#

i already wrote all the scripts for it

#

i did it for 1GB and it took 40hrs

#

so for 128gb

mild grotto
fiery stump
#

40 hrs * 128 = way too long lol

mild grotto
#

For example, you could make a Trie datastructure

#

and instead of loading the file into memory, you can control the file pointer manually

fiery stump
#

eh im not too good at programming

mild grotto
#

everyone starts somewhere 🙂 If you were bored, those are a few things you could try to make it faster

fiery stump
#

i ~~stole~~ permanently borrowed most of the code from stackoverflow

mild grotto
serene grail
fiery stump
#

im tryin to understand what all of that means

mild grotto
#

Thanks 🙂 It's a Lattice Boltzmann Method fluid simulation

fiery stump
#

i just see blue with ripples

#

why did it switch to gray at the end

mild grotto
#

The blue view is the speed view, the orange view is the density view

#

I have a tool I can play with and switch views on the fly, and it records the session to .mp4 so I can post it on discord

fiery stump
#

nice!

mild grotto
#

but yeah it mostly does ripples

#

velocity view showing vortexes

fiery stump
#

cool :D

fiery stump
serene scaffold
mild grotto
#

depends: they can do data science on it, but probably it won't be especially interesting

#

like... how many 3 letter words appear? 4 letter words? etc

fiery stump
#

yeah that's what im tryin to do

#

it looks for how many of each word in a 370K word list appears in the random textfile

#

then outputs that number to a textfile

#

at the end it gives me a bunch of data

#

128GB data in + 4.1MB data in -> ~10.5MB data out.

serene grail
mild grotto
#

I will make a prediction ahead of time:
||I suspect that some 3 letter words will appear more often than other 3 letter words. I'm thinking because of this problem Edit: oops that link isn't to the right problem ||

fiery stump
#

also

#

should i make my code available on github

#

or is it too bad

mild grotto
fiery stump
#

i kinda want to bc running this code on my machine would take over HALF A YEAR

#

and i want to distribute it among more machines

#

so im gonna split up the work into 740 chunks, then anyone can do them and send the results back to me

mild grotto
#

So, uploading it to github just means other people can see the code, it doesn't mean they would run it for you 🙂 More likely, you might find someone who is interested in helping you make it run faster

fiery stump
#

i dont mean random people per se, more like my friends and/or family who have better machines

serene grail
#

It's also good to learn and practice git and GitHub
It's a very useful skill

fiery stump
#

one of my friends has a ryzen 9 7000 something

mild grotto
#

GPU won't make this faster, because the bottleneck will be the disk

serene scaffold
#

Gpus won't help you here

mild grotto
#

reading the data from the drive will be slower than the processing time

fiery stump
#

the ryzen is a cpu, not a gpu lol

fiery stump
serene scaffold
#

I don't think there's a disk in existence that's faster than a CPU

mild grotto
#

CPU is "fast" but only can have a limited amount of data in it at a time.

#

The disk is slower, but can have a lot of data

#

so the time would all be the data transfer back and forth between CPU and disk

reef spade
#

i dont understand

#

why

#

to do this

mild grotto
# fiery stump not with a fast enough disk

I would recommend:

  1. Post your code to github
  2. Fix your code so that 1 gigabyte takes more like... probably 10 minutes instead of 40 hours.

If it takes 40 hours, I can tell you your code is inefficient

mild grotto
#

Most important skill when asking for help is explaining your problem

reef spade
#

what does "makke the plot bigger so the subplots dont overlap" mean

fiery stump
mild grotto
mild grotto
#

This cuts the time by 26x~

fiery stump
#

how do i do this in code tho

mild grotto
#

So, here's what I think your algorithm is (without looking at your code, but you can correct me if I'm guessing wrong)

reef spade
#

what does subplot

#

mean

fiery stump
#
n = 0  # do not change, should be at 0
a = 0  # do not change, should be at 0
k = 0  # which word of the list to start from. 0 means start from first word. 500 would start from 501st word.
t = 499  # how many words you want to process, minus 1. Useful if you have a large dataset that is >8GiB.

with open("E:/file.txt", "r") as file:
    file_as_string = file.read().replace("\n", "")

# opening the file in read mode
my_file = open("R:/pythonProject/wordlist.txt", "r")


# reading the file
data = my_file.read()

# replacing and splitting the text
# when newline ('\n') is seen.
data_into_list = data.split("\n")
my_file.close()

results = open("R:/pythonProject/results-" + str(k) + "-" + str(k + t), "w")
while n <= t:
    print("Searching file for the word: " + data_into_list[n + k] + " - #" + str(n+1))
    a = file_as_string.count(data_into_list[n + k])
    results.write(data_into_list[n + k] + " - " + str(a) + " appearances" + "\n")
    n = n + 1
# close the results file so everything is flushed to disk
results.close()
#

here

#

here's my code

mild grotto
#
  1. Read the 1 gigabyte file into memory
  2. For each word in the dictionary
  3. Scan the entire 1 gigabyte for that word.

You can think of it differently:

  1. read it into memory
  2. For each position in the file
  3. Check if there are any words in the dictionary that exactly match the current position of the file
#

by switching 2 and 3, you unlock the ability to narrow down the search space: you know the first letter is 'a' so you can skip all the other letters. If no 'a' word is first, you can move to the second letter of the file and try 'p' words etc

#

assuming the input was applejdog...

#

so after you finish with 'a' you only have

#

pplejdog...

#

so you check the 'p' words.

jaunty helm
#

what are we doing here again?

mild grotto
fiery stump
jaunty helm
fiery stump
#

read the past couple hundred messages or so, then you'll be caught up

mild grotto
# fiery stump ```py n = 0 # do not change, should be at 0 a = 0 # do not change, should be a...

https://pynative.com/python-file-seek/
You can manually control the position of the file using seek(), this lets you avoid reading the whole file at once. You can for example, read the first ~20 characters, check if any words start with that letter, and if not, seek() to the next position

PYnative

Learn to use the seek() method to move the file handle/pointer ahead or backward from the current position, beginning or end of the file

mild grotto
jaunty helm
#

I think I see
if you're up for a challenge, may I introduce to you the AC automaton

In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm ...

fiery stump
jaunty helm
#

which matches a string (your 134gb file in this case) against a list of words (your word list)

mild grotto
#

Ah, yeah that makes sense. Seems more complicated than their current coding level, but that's a better algorithm than I was suggesting

fiery stump
#

im not too good at programming, i don't know how to implement something like that lol

mild grotto
#

My algorithm is just like, step 1 of optimization

jaunty helm
#

don't worry about it too much then

fiery stump
#

right now it will take 3,000-5,500 hrs of computation time to search through the file for all words

jaunty helm
#

as for memory, you can specify how much to read in file.read(num_of_characters)

mild grotto
#

But the thinking is based on the same idea:
If you know the file starts with a you only need to check a words.
Take that to the next level
If it starts with ap you only need to check words starting with ap

fiery stump
#

let's stick to one letter for now.

#

so do i split up the wordlist into 26 lists each corresponding to one letter?

jaunty helm
mild grotto
mild grotto
#

(And you might imagine on step 2, you can split the a list by each second letter)

#

but that's later

jaunty helm
mild grotto
#

just 1 level can be done by a novice, and cuts the search time by 26x

fiery stump
#

so i have to only look for words starting with one letter instead of all words?

#

so it's "for each letter in the file, search for all words starting with that letter, go to the next letter in the file, go back to step 1"

mild grotto
#

Say you read the whole file in as one string:
file = open('myfile.txt', 'r').read()
then you keep track of where you are in the file
pointer = 0
Then you check the letter
letter = file[pointer]
Now you have a dictionary of all the words, grouped by their first letter
dictionary['a'] = [...]
So you can now run through all the words that start with that letter
for word in dictionary[letter]:
and now you just want to know if the letters starting at pointer match that word
if word == file[pointer:pointer+len(word)]:

#

when you finish, you can increase pointer
pointer+=1

#

and then you're ready to check all the words starting on the second letter of the file

fiery stump
#

well that only checks whether the word is present, yes or no

#

i want how MANY times it appears

jaunty helm
mild grotto
#

If the word appears starting at pointer then file[pointer:pointer+len(word)] will match that word. You can then record that in some record

#

yeah like purplys said

#
if word not in FoundDict:
  FoundDict[word]=1
else:
  FoundDict[word]+=1
#

With this, your 40 hour run should be more like 1-2 hours, I think
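
Put together, the idea above could look roughly like this. It is only a sketch: the function name `count_words` and the tiny word list are made up for illustration, and it counts every starting position (so overlapping hits are counted too, which `str.count` would not do).

```python
from collections import defaultdict

def count_words(text, words):
    """Count how often each word starts at each position of text."""
    # Group the word list by first letter so only plausible words get checked
    buckets = defaultdict(list)
    for word in words:
        if word:                      # skip empty lines from the word list
            buckets[word[0]].append(word)

    found = {}
    for pointer in range(len(text)):
        # Only look at words starting with the letter at this position
        for word in buckets.get(text[pointer], ()):
            if text.startswith(word, pointer):
                found[word] = found.get(word, 0) + 1
    return found

counts = count_words("applejdog", ["apple", "app", "dog", "log", "zebra"])
print(counts)  # {'apple': 1, 'app': 1, 'dog': 1}
```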

fiery stump
#

also how high can the variable go in python

jaunty helm
fiery stump
#

cuz with this pointer will have to go to around 130,000,000,000

mild grotto
jaunty helm
#

floats can go to

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
>>> sys.float_info.max
1.7976931348623157e+308
>>>
#

that amount

fiery stump
#

oh big number

mild grotto
#

python ints are arbitrary precision: an int just uses as many digits as the number needs instead of a fixed size, so 130,000,000,000 is no problem and it can go basically to more numbers than the atoms in the universe
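
For a counter like `pointer`, the float limit is not really the relevant one, since `pointer` is an int and Python ints grow as needed. A quick check, with values picked only for illustration:

```python
pointer = 130_000_000_000          # roughly the number of characters in the file
pointer += 1                       # still an exact int, no overflow
squared = pointer * pointer        # far beyond 64 bits, still exact

big = 2 ** 100                     # 31-digit number, stored exactly
print(type(pointer).__name__, pointer)
print(squared)
print(big)
```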

fiery stump
#

ok good

jaunty helm
fiery stump
#

uh oh...

#

it didn't even try

#

it just gave up after less than a second

jaunty helm
mild grotto
#

To do any calculation, it first has to read from the disk into RAM, then send from the RAM to the cpu

#

so yeah, you'll want to only read part of the file at a time

fiery stump
#

how big of a part

#

100,000? 1,000,000?

mild grotto
#

how much RAM do you have?

jaunty helm
fiery stump
#

32gb ddr4, 2666mhz

mild grotto
#

1 gigabyte is 1 billion letters

fiery stump
#

what about 1073741824, that should be good

mild grotto
#

every letter is 1 byte

fiery stump
#

it reads in 1 GiB at a time

jaunty helm
# fiery stump 100,000? 1,000,000?

you don't have to think too hard
literally just try it, see how much % ram it takes up, go bigger / lower accordingly (if you overshoot it'll just MemoryError anyways)
tbh I don't think it'll impact performance too much, most of the exec time's gonna go to the matching anyway
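
Reading in pieces has one catch: a word can straddle the boundary between two pieces. A possible way around it, sketched below, is to keep the last few characters of each piece and prepend them to the next one. The function name and the default chunk size are assumptions for illustration, and the demo uses an in-memory `io.StringIO` instead of the real file.

```python
import io

def count_words_streaming(fh, words, chunk_size=1 << 20):
    """Count word occurrences while reading fh chunk_size characters at a time."""
    buckets = {}
    for w in words:
        if w:
            buckets.setdefault(w[0], []).append(w)
    max_len = max((len(w) for w in words if w), default=1)

    counts = {}
    buf = ""
    while True:
        block = fh.read(chunk_size)
        buf += block
        # While more data may follow, only scan positions where the longest
        # word would still fit; at end of file, scan everything that is left.
        limit = len(buf) if not block else max(len(buf) - max_len + 1, 0)
        for pos in range(limit):
            for w in buckets.get(buf[pos], ()):
                if buf.startswith(w, pos):
                    counts[w] = counts.get(w, 0) + 1
        if not block:
            break
        buf = buf[limit:]   # keep the unscanned tail for the next round
    return counts

demo = io.StringIO("applejdogapple")
result = count_words_streaming(demo, ["apple", "dog", "app", "le"], chunk_size=4)
print(result)
```

Each position of the file is scanned exactly once, so nothing is double counted even though the chunks overlap.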

fiery stump
#

it honestly annoys me how many people i meet don't know the difference between GB and GiB

mild grotto
#

I do coding for work, and the only thing that annoys me is whenever I see anyone, ever, try to write a regular expression. Because everyone I've met is terrible at it lol

serene grail
#

I just look it up if I need it

fiery stump
#

alright i did dictionary[a], and it just returned the letter "a".

mild grotto
fiery stump
#

i thought it would return all words starting with a

#

yeah my dictionary contains 370K+ words, all in alphabetical order

mild grotto
#
for word in originalDict:
  newDictionary[word[0]].append(word)

maybe something like this

jaunty helm
mild grotto
fiery stump
#

no my dictionary is just a giant list imported from a text file

#

the first little bit of my dictionary .txt file looks like this

mild grotto
#

Try this:

newDictionary={}
for letter in "abcdefghijklmnopqrstuvwxyz":
  newDictionary[letter]=[]
#

now each letter will have its own list

#

when you read in the file, add each word to the correct list in the NewDictionary

fiery stump
#

and how do i do that

mild grotto
#

Give it a try, let me know if you get stuck

fiery stump
#

alright so i didn't use your method, but i did find a way to split up the wordlist.txt file into 26 text files each labeled by their starting letter in the alphabet

#

(e.g. dict_a.txt, dict_b.txt, etc...)

#

@mild grotto im not sure if that will work or not

mild grotto
#

that'll work

#

If you ask for help in #algos-and-data-structs you'll find others that can likely help (since this is getting off topic from AI stuff)

flint grail
#

sys has that?

#

you can look for type info/ wtf

unkempt apex
#

current dataset tree structure :-

├── Cloudy
├── Rain
├── Shine
└── Sunrise

so each dir (e.g. Cloudy) has approx. 300 images
and I wanna train a CNN on all these images

so should I train separately (which I think I should), like first train for cloudy?
also consider that in each dir there are only images, no labels, nothing! that's why I thought to train separately...

jaunty helm
jaunty helm
unkempt apex
#

For each classes??
Or I can train on all 4 classes
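
With a layout like this the folder name can serve as the label, so one model can be trained on all four classes together. A minimal stdlib sketch of turning the folders into (path, label) pairs is below; the helper name `list_labeled_images` is hypothetical, and the demo builds a throwaway directory with empty files rather than using the real dataset.

```python
import tempfile
from pathlib import Path

def list_labeled_images(root):
    """Return ([(image_path, class_index), ...], {class_name: class_index})."""
    root = Path(root)
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).iterdir())
        if path.is_file()
    ]
    return samples, class_to_idx

# Demo on a temporary directory that mimics the tree above
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Cloudy", "Rain", "Shine", "Sunrise"]:
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "example.jpg").touch()
    samples, class_to_idx = list_labeled_images(tmp)

print(class_to_idx)
```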

jaunty helm
ember pawn
#

hello i wanted to ask where can i learn about transformers

haughty cradle
#

same ^

sweet harness
#

Guys is there any well known models for music embedding?

#

I want to create a web app to organize my music collection.

drifting tide
#

Hey everyone, I need a reference for Bi directional LSTM. Does anyone have the original paper for it?

wild loom
#

hey so I have been using google colab and been running out of their free GPU run time lately and was wondering if there was a way to use the free $300 worth of google enterprise credits to pay for more computational units and GPU's to run with google colab

wild loom
#

NOOOOO i needed that

#

thank you though

orchid forge
#

https://youtu.be/R67XuYc9NQ4?si=Oz-ThlRLalwzA5cB

is this a good project? i am currently making this one

In this video kaggle grandmaster Rob Mulla takes you through an economic data analysis project with python pandas. We walk through the process of pulling down the data for different economic indicators, cleaning and joining the data. Using the Fred api you can pull up to date data and compare, analyze and explore.

Copy and edit the notebook fro...

unkempt apex
cedar tusk
#

has anyone used positron ide? i couldnt find the changelogs in github, is this normal?

wild loom
toxic mortar
#

Can u mark me TP FP TN FN

mild dirge
#

Your lines are also not matching the squares

#

I made this one recently ^^

toxic mortar
mild dirge
#

If you misclassify a BE as something else (e.g. True=BE, prediction=CP) then that would be a False Negative with respect to BE.

#

Because you did not catch the BE

toxic mortar
#

And you look for this row to minimize?

mild dirge
#

The diagonals are all correct classifications

#

The rest are misclassifications, so that row shows all the misclassifications for samples that were actually BE

#

But TP/FP/FN/TN only makes sense for binary classification

serene grail
#

Sorry to butt in, what's BE?

toxic mortar
#

Some random class

mild dirge
#

Birch tree (Berk in dutch)

serene grail
#

Oh thanks

mild dirge
#

So if you want to talk about TP/FP/FN/TN you can think of the problem as BE or not BE

toxic mortar
mild dirge
#

And then you have those measures for BE

#

But you'd do that for each class

#

So each class has their own TP/TN/FN/FP
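
The "BE or not BE" view can be computed for every class straight from the confusion matrix. The sketch below assumes rows are the true class and columns the predicted class (as in the description above); the numbers and the third label `OTHER` are made up for illustration.

```python
def per_class_counts(cm, labels):
    """One-vs-rest TP/FP/FN/TN for each class of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    out = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]                            # diagonal: correctly classified
        fn = sum(cm[i]) - tp                     # rest of the row: missed samples of this class
        fp = sum(row[i] for row in cm) - tp      # rest of the column: wrongly predicted as this class
        tn = total - tp - fn - fp                # everything not involving this class
        out[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return out

labels = ["BE", "CP", "OTHER"]
cm = [
    [5, 1, 0],   # true BE
    [2, 6, 1],   # true CP
    [0, 1, 4],   # true OTHER
]
counts = per_class_counts(cm, labels)
print(counts["BE"])  # {'TP': 5, 'FP': 2, 'FN': 1, 'TN': 12}
```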

toxic mortar
#

🐐

#

Thanks man. Got it

ocean pawn
#

Does anyone know a good place to get simple datasets? I made a linear regression model, which seems to work, but I want to try it on a larger dataset. Is Kaggle a good place to find them? Thanks!

small wedge
ocean pawn
#

OH thanks

ocean pawn
#

Huh life expectancy data, that's interesting

#

May I ask

#

Some datasets have strings as data, for example, for car data, there's gas, diesel etc.

#

Would it be suitable to change those tags into unique integers for linear regression?

#

For example:
oil as 1

#

gas as 2

#

Or do I want another algorithm?

#

Thanks!
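
On the integer idea: with oil as 1 and gas as 2, a linear model treats gas as "twice" oil, which is usually not what is meant for categories that have no order. A common alternative is one-hot encoding, where each category gets its own 0/1 column. A plain-Python sketch, with the helper name `one_hot` and the sample values chosen only for illustration:

```python
def one_hot(values):
    """Return (sorted category names, one 0/1 row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

fuel = ["gas", "diesel", "gas", "oil"]
categories, rows = one_hot(fuel)
print(categories)  # ['diesel', 'gas', 'oil']
for value, row in zip(fuel, rows):
    print(value, row)
```

Libraries such as pandas and scikit-learn ship their own versions of this, so in practice it rarely needs to be written by hand.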

small wedge
ocean pawn
#

Would that work?

small wedge
ocean pawn
unkempt apex
#

and then again I have to label this data then?

amber sequoia
#

Hi. Is there a way to make Pandas read headers and subheaders from a CSV correctly?

For example tabular data like:

   CategoryA              CategoryB             CategoryC
   X Y                    X  Y                  X Y
1  (data)...................................................
2  ...
3  ...

I want to be able to read CSV-exported data of this kind in such a way that it is known that X and Y are subcategories of the main categories (CategoryN in the example). I want to be able to do:

df['CategoryA']['X']

I tried doing this with pandas, but got main columns in the MultiIndex labeled as unnamed

agile cobalt
#

How exactly is it formatted? (commas, spaces, something else)

amber sequoia
#

commas

agile cobalt
#

just this?
```
A,B
X,Y,X,Y
1,2,3,4
5,6,7,8
```
or
```
A,A,B,B
X,Y,X,Y
1,2,3,4
5,6,7,8
```
amber sequoia
#

this is just a tabular example of it, but normally it would be exported to CSV, and look something like:

,CategoryA,,CategoryB,,CategoryC,,
,X,Y,X,Y,X,X,Y
(...data)
#

from what I could see when exported to CSV looks like this at least

agile cobalt
#

hmm, for a,a,b,b it works like
```py
import io
import pandas as pd
file = io.StringIO(
"""A,A,B,B
X,Y,X,Y
1,2,3,4
5,6,7,8"""
)

df = pd.read_csv(file, header=[0, 1])
print(df)
```

#

not sure for a,,b,

amber sequoia
#

notice, that you have A, and B twice in the main category

agile cobalt
#

yes, to indicate it is a.x a.y, rather than just ?.y

amber sequoia
#

yes, the problem I have, is that the CSV i'll get might look more like the a,,b, version

agile cobalt
#

yeah you might have to just parse it yourself

amber sequoia
#

I'm not sure what spreadsheet program exports how, but the ones I've used so far export the aforementioned tabular data with multiheaders to CSV in this way:

a,,b,

which kind of makes sense, if you think of it since the main CategoryA, CategoryB take multiple cells

agile cobalt
#

just read the first n header rows, construct the multi-index, then pass it to read_csv
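
A rough sketch of that approach for the `a,,b,` style export, assuming two header rows, a leading index column, and blank cells meaning "same category as the cell to the left". The sample data is made up, and this variant assigns the MultiIndex to the columns after reading rather than passing it in.

```python
import io
import pandas as pd

raw = """,CategoryA,,CategoryB,
,X,Y,X,Y
1,10,11,12,13
2,20,21,22,23
"""

# Read only the two header rows as plain strings
header = pd.read_csv(io.StringIO(raw), header=None, nrows=2, dtype=str)
top = header.iloc[0].ffill()      # fill blank cells with the category to their left
sub = header.iloc[1]

# Read the data rows without any header, using the first column as the index
df = pd.read_csv(io.StringIO(raw), header=None, skiprows=2, index_col=0)

# Build the two-level column index, skipping the index column itself
df.columns = pd.MultiIndex.from_arrays([top.iloc[1:].tolist(), sub.iloc[1:].tolist()])

print(df["CategoryA"]["X"])
```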

#

!d pandas.MultiIndex.from_tuples

arctic wedgeBOT
echo mesa
#

When it comes to data science and machine learning in general, what is the distinction between using SQL and Pandas or other libraries? I'm not sure what the role of each would be, because technically you can do everything (manipulation, collection and so on) with either of them. What would be the role of each in a machine learning project? From my understanding SQL is used for collection and storage, and obviously to define the schema, insert the data and so on. Also, if you wanted to extend the database you would use SQL to insert new rows. But when would you actually load the dataset into pandas and start cleaning and preprocessing, or would that be done using SQL? How would this work?

serene scaffold
#

SQL and pandas are both for tabular data (rows and columns). But SQL databases exist on the hard drive, and dataframes only exist in memory while a python program is running.

#

@echo mesa ^

echo mesa
serene scaffold
iron basalt
#

SQL is just an interface language used by many relational databases (the standard).

#

Also Pandas is a Python specific thing that is useful in Python as a way to manipulate tabular data in general.

velvet mountain
# echo mesa When it comes to data science and machine learning in general, what would be the...

sql is a language your sql server can parse and process. pandas is a high level api provided in python. the two are not mutually exclusive (see for example https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)

direct sql is very well suited for some kinds of jobs, and pandas for others. usually it depends a bit on the role you want to take on. if your goal is to manipulate "raw data and tables", sql looks good. if your role is to query the data in order to perform a ds job, maybe pandas will suit better. but it's hard to really categorize everything here
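
As a small illustration of the two working together: the data lives in a SQL database, a query selects the rows of interest, and pandas takes over for the analysis in memory. The table name, columns and values below are invented, and an in-memory SQLite database stands in for a real server.

```python
import sqlite3
import pandas as pd

# Storage side: a SQL database (in memory here, normally on disk or a server)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (fuel TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("gas", 10.0), ("diesel", 12.5), ("gas", 11.0)],
)
conn.commit()

# Analysis side: pull only the needed rows into a DataFrame, then use pandas
df = pd.read_sql("SELECT fuel, price FROM cars WHERE price > 10", conn)
mean_by_fuel = df.groupby("fuel")["price"].mean()
conn.close()

print(df)
print(mean_by_fuel)
```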

serene scaffold
#

I fear that people are answering the question from too many angles and making it more confusing for OP

iron basalt
#

It's mostly convenient / fine for anything not huge.

#

And you don't need to learn SQL.

#

Pola.rs is like something in between.

#

I do not use it for performance reasons too, i'm impatient with these things. I don't like something taking a day when it can take 3 hours.

unkempt apex
#
Epoch [1/10], Loss: 0.4800
Validation Loss: 0.2547, Accuracy: 88.44%
Epoch [2/10], Loss: 0.3913
Validation Loss: 0.2687, Accuracy: 89.78%
Epoch [3/10], Loss: 0.2884
Validation Loss: 0.2423, Accuracy: 88.00%
Epoch [4/10], Loss: 0.2128
Validation Loss: 0.1990, Accuracy: 92.44%
Epoch [5/10], Loss: 0.1172
Validation Loss: 0.1681, Accuracy: 92.89%
Epoch [6/10], Loss: 0.0621
Validation Loss: 0.2270, Accuracy: 93.33%
Epoch [7/10], Loss: 0.0730
Validation Loss: 0.2494, Accuracy: 92.44%
Epoch [8/10], Loss: 0.0330
Validation Loss: 0.1599, Accuracy: 93.78%
Epoch [9/10], Loss: 0.0247
Validation Loss: 1.6280, Accuracy: 93.78%
Epoch [10/10], Loss: 0.0449
Validation Loss: 0.3310, Accuracy: 90.22%
#

is this good?

iron basalt
#

Probably the game development experience, it's all about iteration speed there, so waiting on something to process for really long is pain. (And also kind of one of the selling points of using Python in the first place, I don't want to compile for an hour)

unkempt apex
#

or need more accuracy

wide dagger
#

is this where I can ask questions about regex in python?

hearty depot
serene scaffold
hearty depot
#

Use polars, it's so much quicker

left tartan
#

Uh, in terms of my performance rank tiers... pandas is pretty low, in modern tools.

#

Polars is really where it's at... or pyarrow if my problems are simple enough.

#

(well, fine, you know I'll say duckdb ducky_dave )

worldly wagon
#

just wanted to give a quick appreciation to stelercus, etrotta, zeal, billybobby and other people that helped me not too long ago the suggestions were very well received at work

i just feel like it would be wrong to not give explicit thanks so again just wanted to say thank you for the suggestions 🙏

#

lol kinda funny polars is being discussed again

#

i'm a bit late but was lazy loading buggy for your project or just in general? sorry if i'm interrupting btw

left tartan
#

This touches on my main complaint of the dataframe libraries: having to learn yet another syntax. Maybe it was you, maybe it was polars, but it's still yet-another-data-api.

serene grail
left tartan
serene grail
#

Oh ok, thanks. I'll definitely keep learning Pandas

left tartan
#

The problem is, the skills don't transfer as nicely as you'd hope. Polars is a very different API. SQL too. This is the crux of my complaint.

worldly wagon
hearty depot
serene grail
#

Thank you everyone for your answers!

left tartan
#

I try to use pyarrow a lot more for loading issues. The whole point of arrow tables is zero copying to pandas/polars/duckdb/etc.

cedar tusk
lapis sequoia
serene scaffold
#

@lapis sequoia if you ask for help in more than one place, please link to the thread

gleaming osprey
#

Hello. I am trying to create a vqa type model with possibly video and audio inputs. I was wondering if anyone could give me some advice regarding this? Say I have a streaming speech-to-text algorithm: I wouldn't have the entire input at once, so how would I perform, say, positional embedding, or something like self-attention, when I don't have the entire input yet?
I'm quite new to this transformer-type architecture, so I hope somebody with more experience might be able to point me in roughly the right direction. Thanks for any advice anybody might be able to give.
-# Note: Though I wouldn't consider myself a complete newbie to AI/ML, I'm not a "pro" either, so please don't be too harsh if what I'm saying is nonsensical or unfeasible!

frosty fulcrum
#

does anyone know what Probes & Affinities mean in context of ML?

#

this thing...

serene scaffold
#

@unborn hemlock my advice is to not use anaconda or any of its variations. I've been doing DS/AI/ML for five years and have never used or needed it.

unborn hemlock
serene scaffold
#

There is no reason to be using a different system than the rest of the python community, unless you for some reason want to paint yourself into a corner where you can't use the majority of guides about managing environments

serene scaffold
#

that ability comes with python.

#

without conda

unborn hemlock
serene scaffold
#

yes

unborn hemlock
#

I thought it's the same. they are all just env managers and conda has more features

#

am i right ?

serene scaffold
#

what features does conda have that you think you need?

unborn hemlock
#

Maybe the built-in packages that come from its repo. I used venv before and it was very hard to manage built-in packages

serene scaffold
#

the built-in package?

unborn hemlock
#

i mean how it can handle its dependencies by installing non-python packages.

serene scaffold
#

which non-python packages do you need?

unborn hemlock
#

like Numpy

serene scaffold
#

you don't need conda to install numpy.

unborn hemlock
#

I know ... but conda is easier to use

serene scaffold
#

how does conda make it easier to install numpy?
with regular venvs, you just do pip install numpy
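
For reference, creating an environment really does need nothing beyond the standard library. A minimal sketch using Python's own `venv` module (the helper `make_env` and the directory name `demo-env` are made up for illustration; from a shell the usual equivalent is `python -m venv .venv`, then `pip install numpy` inside the activated environment):

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent: Path, name: str = "demo-env") -> Path:
    """Create a bare virtual environment under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False keeps this sketch fast and offline; pass with_pip=True
    # in real use so that `pip install numpy` works inside the environment.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = make_env(Path(tmp))
        # pyvenv.cfg is the marker file that makes a directory a venv.
        print((env / "pyvenv.cfg").read_text())
```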

unborn hemlock
#

When you use pip, some dependencies might be missing. Conda handles that easily.

serene scaffold
#

I have never encountered this.

unborn hemlock
#

Maybe it's just me, but I have seen a lot of developers on GitHub use it too, and I feel like I have to have it too in order to follow their guideline :\

serene scaffold
#

those people have been gaslit into thinking that they need conda

hearty depot
#

It’s not that diff

unborn hemlock
hearty depot
unborn hemlock
rugged tide
#

Hi @serene scaffold , are you able to help me with something PySpark related please?

serene scaffold
#

Be sure also to not post screenshots of text. Copy and paste actual text into the chat.

rugged tide
serene scaffold
tough lantern
#

hi

#

does anyone have expertise in Pinecone?

#

docsearch = pec.from_texts([t.page_content for t in text_chunks], embeddings, index_name="test")

AttributeError Traceback (most recent call last)
Cell In[62], line 1
----> 1 docsearch = pec.from_texts([t.page_content for t in text_chunks], embeddings, index_name="test")

File ~\anaconda3\envs\vectordb\Lib\site-packages\pinecone\control\pinecone.py:590, in Pinecone.from_texts(*args, **kwargs)
588 @staticmethod
589 def from_texts(*args, **kwargs):
--> 590 raise AttributeError(_build_langchain_attribute_error_message("from_texts"))

AttributeError: from_texts is not a top-level attribute of the Pinecone class provided by pinecone's official python package developed at https://github.com/pinecone-io/pinecone-python-client. You may have a name collision with an export from another dependency in your project that wraps Pinecone functionality and exports a similarly named class. Please refer to the following knowledge base article for more information: https://docs.pinecone.io/troubleshooting/pinecone-attribute-errors-with-langchain

#

help me with this code

autumn comet
#

Hi guys,

I'm very new to python and can't get this forked project to work properly as I keep running into this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices

I am running torch 2.0.1 due to some compatibility issues with torchvision and torchaudio.

I have been running it on a remote SSH on RunPod with the following hardware:

  • 12 x RTX A4000
  • 128 vCPU 250 GB RAM

Obviously I'm using RunPod to speed up the model-training process but I can't seem to get python to take advantage of the extra GPU-processing power.

autumn comet
# autumn comet Hi guys, I'm very new to python and can't get this forked project top work prop...

CODE:

import json

import torch
import torch.nn as nn

from config import eval_interval, learn_rate, max_iters
from src.model import GPTLanguageModel
from src.utils import current_time, estimate_loss, get_batch


def model_training(update: bool) -> None:
    """
    Trains or updates a GPTLanguageModel using pre-loaded data.

    This function either initializes a new model or loads an existing model based
    on the `update` parameter. It then trains the model using the AdamW optimizer
    on the training and validation data sets. Finally the trained model is saved.

    :param update: Boolean flag to indicate whether to update an existing model.
    """
    # LOAD DATA -----------------------------------------------------------------

    train_data = torch.load("assets/output/train.pt")
    valid_data = torch.load("assets/output/valid.pt")
        
    with open("assets/output/vocab.txt", "r", encoding="utf-8") as f:
        vocab = json.loads(f.read())

    # INITIALIZE / LOAD MODEL ---------------------------------------------------

    if update:
        try:
            model = torch.load("assets/models/model.pt")
            print("Loaded existing model to continue training.")
        except FileNotFoundError:
            print("No existing model found. Initializing a new model.")
            model = GPTLanguageModel(vocab_size=len(vocab))
        
    else:
        print("Initializing a new model.")
        model = GPTLanguageModel(vocab_size=len(vocab))

    # Utilize all available GPUs if available
    if torch.cuda.device_count() > 1:
        print(f"Using {torch.cuda.device_count()} GPUs.")
        model = nn.DataParallel(model)

...

autumn comet
# autumn comet Hi guys, I'm very new to python and can't get this forked project top work prop...

CODE CONT...

    # Move model to CUDA devices
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    
    # initialize optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learn_rate)

    # number of model parameters
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Parameters to be optimized: {n_params}\n", )

    # MODEL TRAINING ------------------------------------------------------------

    for i in range(max_iters):

        # evaluate the loss on train and valid sets every 'eval_interval' steps
        if i % eval_interval == 0 or i == max_iters - 1:
            train_loss = estimate_loss(model, train_data).to(device)
            valid_loss = estimate_loss(model, valid_data).to(device)

            time = current_time()
            print(f"{time} | step {i}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}")

        # sample batch of data
        x_batch, y_batch = get_batch(train_data).to(device)

        # evaluate the loss
        logits, loss = model(x_batch, y_batch).to(device)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    torch.save(model, "assets/models/model.pt")
    print("Model saved")
autumn comet
# autumn comet Hi guys, I'm very new to python and can't get this forked project top work prop...

TERMINAL:

root@db6e42fc7512:~/lad-gpt# python run.py train
Initializing a new model.
Using 12 GPUs.
Parameters to be optimized: 7041970

Traceback (most recent call last):
  File "/root/lad-gpt/run.py", line 20, in <module>
    main()
  File "/root/lad-gpt/run.py", line 15, in main
    train.model_training(args.update)
  File "/root/lad-gpt/src/train.py", line 65, in model_training
    train_loss = estimate_loss(model, train_data).to(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/lad-gpt/src/utils.py", line 23, in estimate_loss
    logits, loss = model(X, Y)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])

autumn comet
# autumn comet Hi guys, I'm very new to python and can't get this forked project top work prop...

TERMINAL CONT...

File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lad-gpt/src/model.py", line 151, in forward
    pos_emb = self.pos_embedding(torch.arange(T))           # (T, C)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

root@db6e42fc7512:~/lad-gpt# 
wild loom
#

Hey guys, I've been training a COCO model for image detection lately in Google Colab. I was wondering if anyone had a link or two that would explain how I can download this model I've trained, so that I can import it into a new file and just plug in an image to be detected, rather than re-running the entire model on Colab every time I restart my PC.

half bison
#

Popular opinion: the organizations currently developing AI are evil and should be stopped

deep sleet
#

Does anyone have a good resource to learn about transformers?

deep sleet
#

😭

hollow sentinel
# deep sleet 😭

We have put together the complete Transformer model, and now we are ready to train it for neural machine translation. We shall use a training dataset for this purpose, which contains short English and German sentence pairs. We will also revisit the role of masking in computing the accuracy and loss metrics during the training […]

#

ML Mastery is goated.

deep sleet
#

Tysm!

hollow sentinel
deep sleet
hollow sentinel
deep sleet
hollow sentinel
hollow sentinel
deep sleet
#

Tysm man!

deep sleet
hollow sentinel
deep sleet
#

finished the linear algebra section

hollow sentinel
#

keep active in this server. you will learn so much.

hollow sentinel
#

are you a CS major?

deep sleet
#

No man, I am a high school student

hollow sentinel
#

anyways, make sure your math fundamentals are up to par.

#

they're, pardon the pun, integral.

deep sleet
#

Will do boss!

round fjord
#

Repost from help channel:

I have a rather complicated problem

I am trying to set up this repo here
https://github.com/mala-lab/InCTRL

and got to the last step of testing the visa dataset
but when I try to run it I get a permission denied error even though I have full admin rights
My assumption is that it has something to do with CUDA and my GPU

GitHub

Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'. - GitHub - mala-lab/InCTRL: O...

#

So my question is
How do I give the code access to my GPU

serene scaffold
serene scaffold
# autumn comet **CODE CONT...** ``` # Move model to CUDA devices device = torch.device(...

you have device = torch.device("cuda" if torch.cuda.is_available() else "cpu"). Chances are that torch.cuda.is_available() is false, meaning that you're setting the device as cpu. and then the logs say Using 12 GPUs. but that probably actually means that you're using 12 CPU cores, and the logging statement just assumes that torch.cuda.is_available() would have always been true.

#

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
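there's also this. per the traceback it is raised at pos_emb = self.pos_embedding(torch.arange(T)) in src/model.py: torch.arange(T) creates its index tensor on the cpu while the embedding weights are on cuda:0.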

#

so make sure that anywhere you have device = , it's locked in to being only cuda:0.

round fjord
round fjord
round fjord
serene scaffold
round fjord
#

ah mb

loud plank
#

Any recommendations on how to learn python to lean into pandas?

remote hull
loud plank
remote hull
#

W3school is free

#

And some resources on DataCamp too

serene scaffold
#

@loud plank use the kaggle pandas tutorial
don't use w3schools no matter what

loud plank
#

lol

remote hull
#

Kaggle is very good

loud plank
serene scaffold
#

every w3schools article is at least slightly incorrect. and sometimes just blatantly wrong. and there are so many resources that are actually good that there's no reason to settle for w3schools.

loud plank
#

I wasn’t sure if a book or data camp was good

serene scaffold
#

!resources

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

loud plank
#

I’ve been looking at data camp for a long time now

odd meteor
loud plank
odd meteor
loud plank
odd meteor
#

These days I learn new stuff mostly on YouTube or from colleagues at office

warped rapids
#

Yo guys, do you guys know any way to get access to the twitter api for free?

#

I know there are ways around it, but no idea how

serene scaffold
warped rapids
#

I tried a few modules like tweepy but they all want API creds, which cost 5k a month

warped rapids
#

But fair enough

serene scaffold
#

then you'll have to come up with a different project.

violet gull
#

Is there a way to guarantee an RL model converges on the best possible score, assuming it has the information needed and the score is possible, at the cost of performance/time?

serene scaffold
serene scaffold
# violet gull Why

the best possible weights might be in some very small, obscure valley somewhere that's very far away from where you randomly initialize, and which is in a different direction from the one your training data pulls the model.
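
A toy illustration of that point, as a sketch only: plain gradient descent on a made-up 1-D "loss" with one shallow and one deep valley. The function, learning rate, and starting points are all assumptions chosen for the demo, not anything from a real model.

```python
def loss(x: float) -> float:
    """Toy 1-D loss landscape with a shallow valley (x > 0) and a deep one (x < 0)."""
    return x**4 - 3 * x**2 + x


def grad(x: float) -> float:
    """Analytic derivative of `loss`."""
    return 4 * x**3 - 6 * x + 1


def descend(x0: float, lr: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


if __name__ == "__main__":
    for start in (2.0, -2.0):
        end = descend(start)
        print(f"start={start:+.1f} -> x={end:+.3f}, loss={loss(end):.3f}")
```

Starting on the right, the descent settles in the shallow valley and never sees the deeper one on the left; nothing in the gradient tells it a better minimum exists elsewhere.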

serene scaffold
#

you don't.

#

for most neural architectures, you can never be certain that the model you have is the best possible model

violet gull
#

You can if it reaches the maximum calculated score

violet gull
# serene scaffold for most neural architectures, you can never be certain that the model you have ...

I also don’t understand how it’s possible for an agent to converge when there is a minimum random exploration chance. If the maximum score of an environment required 100 moves and the chance of random exploration is 1% it will only be a success about 37% of the time which is not converging (after infinite amounts of training). In this example the complexity of the environment (100) is very small and the minimum chance of random exploration is very low (1%). This example is generously in favor of convergence yet it doesn’t converge. So how do big complex models converge?
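
The 37% figure can be checked directly. This sketch assumes the moves are independent and that a single random exploration move is enough to miss the maximum score (in practice a random move can still pick the right action, so this is a lower bound); the function name is made up.

```python
def p_no_exploration(epsilon: float, n_moves: int) -> float:
    """Probability that none of n_moves independent moves is a random exploration move."""
    return (1.0 - epsilon) ** n_moves


if __name__ == "__main__":
    p = p_no_exploration(0.01, 100)
    print(f"{p:.3f}")  # prints 0.366
```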

#

And in the case where there is not a minimum exploration chance the model is highly unlikely to find the optimal score before it no longer randomly explores

agile cobalt
warped rapids
#

And run it as a py script, is that possible with those sources?

agile cobalt
#

if you want data specifically from Twitter, no. these are alternative platforms that mostly follow the same format but are completely separate from twitter itself

warped rapids
unique spoke
#

Hey guys

serene scaffold
tawdry girder
past meteor
#

If your problem's state-action space is very simple you can sidestep the problem by using a tabular method instead of a function approximator
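
A minimal sketch of what "tabular" means here, assuming a made-up 5-state corridor environment where the only reward is for reaching the right end. The Q-values live in a plain table, one entry per state-action pair, so there is no network to get stuck. All names and hyperparameters are illustrative.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right


def step(state: int, action: int) -> tuple[int, float, bool]:
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
               seed: int = 0) -> list[list[float]]:
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # the whole "model" is this table
    for _ in range(episodes):
        state = 0
        for _ in range(200):                     # cap on episode length
            action = rng.choice(ACTIONS)         # Q-learning is off-policy
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q: list[list[float]]) -> list[int]:
    """Best action per non-terminal state according to the table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

Because every state-action pair keeps being visited, the table values approach the true ones and the greedy policy read off the table is the optimal one for this toy problem. That guarantee is what gets lost once the table is replaced by a function approximator.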

fervent shore
#

I've got a question about linear regression and LSTM outputs. If I have a predicted set the same size as the testing labels, with the predicted dataset as the predictor (independent variable) and the testing labels as the dependent variable, is there anything useful I can extract from a linear regression model fitted in that manner?
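
One thing such a fit can give, sketched below under made-up numbers: a calibration check. Regressing the labels on the predictions, a slope near 1 and an intercept near 0 suggest the predictions are not systematically scaled or shifted, and R^2 summarizes how much of the label variance they explain. `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up numbers standing in for LSTM predictions and test labels.
    predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
    actual = [1.2, 1.9, 3.2, 3.8, 5.1]
    slope, intercept, r2 = fit_line(predicted, actual)
    print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```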

verbal venture
#

can anyone explain this? the classifier is KNN. the vectors are video-image embeddings: "For the classifier training, we select 1000 queries from the training data of VQAv2; for each query we run the GRiT model to extract ground truth clips. Then we label each concatenated query-chunk embedding vector as 1 if the chunk contains clips from ground truth, otherwise give a 0. Then we train KNN classifier on this. After the KNN is trained, we test it on 1000 queries from the validation samples from VQA-v2 dataset to report results."

#

KNN is trained to do what here?
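
Reading the quoted passage literally, the KNN is a binary classifier over concatenated query-chunk embedding vectors: given a (query, chunk) pair it predicts 1 if the chunk contains ground-truth clips for that query and 0 otherwise. A pure-Python sketch with made-up 2-D embeddings follows; the value of k, the Euclidean distance, and the toy data are assumptions, since the quote doesn't specify them.

```python
import math
from collections import Counter


def concat(query_vec: list[float], chunk_vec: list[float]) -> list[float]:
    """Feature vector for one (query, chunk) pair: the two embeddings joined."""
    return query_vec + chunk_vec


def toy_training_set() -> tuple[list[list[float]], list[int]]:
    """Hypothetical pairs: label 1 = chunk contains ground-truth clips, 0 = it does not."""
    pairs = [
        ([1.0, 0.0], [1.0, 0.0], 1), ([0.9, 0.1], [0.8, 0.2], 1),
        ([0.0, 1.0], [0.0, 1.0], 1), ([0.1, 0.9], [0.2, 0.8], 1),
        ([1.0, 0.0], [0.0, 1.0], 0), ([0.9, 0.1], [0.1, 0.9], 0),
        ([0.0, 1.0], [1.0, 0.0], 0), ([0.1, 0.9], [0.9, 0.1], 0),
    ]
    return [concat(q, c) for q, c, _ in pairs], [label for _, _, label in pairs]


def knn_predict(train_x: list[list[float]], train_y: list[int],
                x: list[float], k: int = 3) -> int:
    """Label x by majority vote among its k nearest training vectors."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


if __name__ == "__main__":
    train_x, train_y = toy_training_set()
    new_pair = concat([0.95, 0.05], [0.9, 0.1])
    print(knn_predict(train_x, train_y, new_pair))
```

At test time the same concatenation is built for each validation query and candidate chunk, and the predicted label says whether that chunk is considered relevant.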

verbal venture
#

@serene scaffold

serene scaffold