#data-science-and-ml | Python | Page 404

wooden sail May 19, 2022, 6:29 AM

#

log on with your personal account and use the free version

spare briar May 19, 2022, 6:31 AM

#

-you can accumulate batches and still apply momentum

#

momentum changes the optimizer step

viscid flume May 19, 2022, 6:32 AM

#

Can I upload a folder or something to colab?

wooden sail May 19, 2022, 6:32 AM

#

i know, but you can rewrite momentum as an accumulated gradient with large momentum parameters and resetting the momentum every few steps. you can make them match in degenerate cases

#

yes, you can upload files to it. the easiest way is through google drive, but you can also just upload stuff directly

#

more generally, what i mean is that these two things are just a specific choice of descent step size and momentum

spare briar May 19, 2022, 6:34 AM

#

this just isnt true

viscid flume May 19, 2022, 6:34 AM

#

How do I do that then?

spare briar May 19, 2022, 6:34 AM

#

you would need to reset momentum on every call to step

wooden sail May 19, 2022, 6:35 AM

#

that's a valid choice of momentum

spare briar May 19, 2022, 6:37 AM

#

viscid flume How do I do that then?

if step % acc_steps == 0:

wooden sail May 19, 2022, 6:37 AM

#

what i'm getting at is that you can write a single mathematical expression that does either of these things with a different parameter schedule. of course you get different results. they're flavors of the same thing though, a stochastic gradient descent schedule

viscid flume May 19, 2022, 6:37 AM

#

ok

#

in this?

spare briar May 19, 2022, 6:39 AM

#

wooden sail what i'm getting at is that you can write a single mathematical expression that ...

right i agree this is true when you reset momentum in your schedule
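A minimal sketch of the point being agreed on here, in plain Python with scalar gradients. One update rule, `v = mu * v + g` followed by `w -= lr * v`, reproduces gradient accumulation when the per-step `mu` and `lr` schedules reset the buffer and only apply a step every `k` iterations. The function names and the toy gradients are illustrative assumptions, not anything from the code under discussion.

```python
def run_schedule(grads, lrs, mus, w0=0.0):
    """Heavy-ball style update with a per-step learning rate and momentum."""
    w, v = w0, 0.0
    for g, lr, mu in zip(grads, lrs, mus):
        v = mu * v + g   # mu = 0 resets the buffer, mu = 1 keeps summing
        w -= lr * v      # lr = 0 means "accumulate only, no parameter update"
    return w


def run_accumulation(grads, lr, k, w0=0.0):
    """Plain gradient accumulation: sum k gradients, then take one step."""
    w, acc = w0, 0.0
    for i, g in enumerate(grads, start=1):
        acc += g
        if i % k == 0:
            w -= lr * acc
            acc = 0.0
    return w


grads = [1.0, 2.0, 3.0, 4.0]
k, lr = 2, 0.1

# Schedule that mimics accumulation: reset on the first step of each group
# and only apply the parameter update on the last one.
mus = [0.0 if i % k == 0 else 1.0 for i in range(len(grads))]
lrs = [lr if (i + 1) % k == 0 else 0.0 for i in range(len(grads))]

w_schedule = run_schedule(grads, lrs, mus)
w_accum = run_accumulation(grads, lr, k)
w_momentum = run_schedule(grads, [lr] * 4, [0.9] * 4)  # ordinary momentum, for contrast
```

With the reset schedule the two trajectories end at the same weight, while ordinary momentum (constant `mu = 0.9`, constant `lr`) ends somewhere else, which is the disagreement in the thread.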

wooden sail May 19, 2022, 6:39 AM

#

mhm

spare briar May 19, 2022, 6:40 AM

#

this argument can get pretty pathological though

#

i could argue that optimization schedule + loss is one object

#

and we study the equivalence classes of trajectories

#

when we trade off schedule and loss

viscid flume May 19, 2022, 6:41 AM

#

viscid flume in this?

I think I found it

spare briar May 19, 2022, 6:41 AM

#

not that this is wrong

#

some interesting results from thinking this way like https://arxiv.org/abs/1910.07454

arXiv.org

An Exponential Learning Rate Schedule for Deep Learning

Intriguing empirical evidence exists that deep learning can work well with
exotic schedules for varying the learning rate. This paper suggests that the
phenomenon may be due to Batch Normalization...

viscid flume May 19, 2022, 6:43 AM

#

spare briar if step % acc_steps == 0:

change this:if epoch_current <= cfg.SOLVER.GENERATOR.INIT_EPOCH: # FP _t.tic() real_images_color = real_images_color.to(device,memory_format=torch.channels_last) generated = model_generator(real_images_color) loss_init = init_loss(model_backbone, real_images_color, generated) INIT_FP_time = _t.toc() loss_dict = {"Init_loss": loss_init} # BP _t.tic() if (iteration+1) % 4 == 0 or (iteration+1) == len(data_loader): optimizer_generator.zero_grad(set_to_none=True) loss_init.backward() optimizer_generator.step() scheduler_generator.step() scheduler_discriminator.step() INIT_BP_time = _t.toc() meters.update(INIT_FP_time=INIT_FP_time, INIT_BP_time=INIT_BP_time)
to?

spare briar May 19, 2022, 6:44 AM

#

looks like its already happening

#

if (iteration+1) % 4 == 0

viscid flume May 19, 2022, 6:44 AM

#

Ok

spare briar May 19, 2022, 6:45 AM

#

although the second part after or is sort of funny

#

since they don't properly scale their learning rate for that step
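For reference, a common way to write gradient accumulation in PyTorch looks roughly like the sketch below. This is not the trainer's actual code: the toy model, the random batches, and the `acc_steps` name are assumptions. Dividing each loss by the size of its group keeps the update comparable across groups, including the shorter final group that the `or (iteration+1) == len(data_loader)` condition is meant to flush.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                       # stand-in for the generator
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Ten toy micro-batches of 2 samples each (hypothetical data).
batches = [(torch.randn(2, 3), torch.randn(2, 1)) for _ in range(10)]
acc_steps = 4
num_updates = 0

optimizer.zero_grad(set_to_none=True)
for iteration, (x, y) in enumerate(batches):
    is_last = (iteration + 1) == len(batches)
    # Size of the group this micro-batch belongs to; the final group may be shorter.
    group_start = (iteration // acc_steps) * acc_steps
    group_size = min(acc_steps, len(batches) - group_start)

    loss = loss_fn(model(x), y) / group_size  # average over the group instead of summing
    loss.backward()                           # gradients add up in .grad

    if (iteration + 1) % acc_steps == 0 or is_last:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        num_updates += 1
```

The key ordering is `backward()` on every micro-batch, with `step()` and `zero_grad()` only at the end of each group.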

spare briar May 19, 2022, 6:47 AM

#

viscid flume Ok

(you could make it % 6 and decrease your batch size but make sure to also scale the learning rate to compensate)

viscid flume May 19, 2022, 6:48 AM

#

spare briar (you could make it % 6 and decrease your batch size but make sure to also scale ...

Okay, scale by how much?

spare briar May 19, 2022, 6:48 AM

#

well think about it

#

if you accumulate 6 batches instead of 4

#

you have 6/4 more gradient

viscid flume May 19, 2022, 6:49 AM

#

Actually, I added that myself

#

It wasn't in the code before, and I read an article and did that, wasn't sure how it works back then

viscid flume May 19, 2022, 6:51 AM

#

spare briar you have 6/4 more gradient

So if gradient*n, learning rate/n?

#

Ok
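The arithmetic behind that answer, as a small sketch. It assumes the gradients of the accumulated batches are summed rather than averaged, and the base learning rate is only an illustrative value.

```python
base_lr = 5e-5          # illustrative learning rate
old_acc_steps = 4
new_acc_steps = 6

# Summed gradients grow in proportion to the number of accumulated batches...
gradient_scale = new_acc_steps / old_acc_steps      # 6/4 = 1.5
# ...so dividing the learning rate by the same factor keeps lr * gradient unchanged.
scaled_lr = base_lr / gradient_scale
```

If the loss is instead divided by the number of accumulated batches before `backward()`, the gradient is already averaged and no learning rate change is needed for this reason.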

viscid flume May 19, 2022, 7:23 AM

#

2022-05-19 15:22:50,438 AnimeGan.trainer INFO: eta: 1:22:41 epoch: 1 | 421/6700 batch_time: 0.0062 (0.0049) data_time: 0.0001 (0.0001) Init_loss: 292.7678 (286.7284) lr(G|D): 0.000050|0.000010 max mem: 1857

#

wut

#

it's working!

dusty valve May 19, 2022, 9:21 AM

#

how do i install pytorch without CUDA?

#

i don't have a nvidia gpu btw

ebon ember May 19, 2022, 9:25 AM

#

Hey
Anyone here familiar with xarray?
'''
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()

da.groupby('dt').map(calc_max_std)
'''
It gives me a 'TypeError: 'DataArray' object cannot be interpreted as an integer'
If I do
'''
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()
'''
it works, but that's beside the point as I need to slice the array differently in every group. Does anyone have any idea?
Thanks a lot in advance

dusty valve May 19, 2022, 9:31 AM

#

if anyone knows how to fix this, ping me: "AssertionError: Torch not compiled with CUDA enabled". i don't have an nvidia gpu btw
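This assertion usually appears when code asks for a CUDA device (for example via `.cuda()` or `.to("cuda")`) on a CPU-only PyTorch build. A device-agnostic pattern, sketched here with a toy model as an assumption, avoids hard-coding the GPU:

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)   # toy model, for illustration only
x = torch.randn(8, 4, device=device)
out = model(x)
```

Replacing hard-coded `.cuda()` calls with `.to(device)` lets the same script run on machines with and without an NVIDIA GPU.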

brazen spire May 19, 2022, 9:42 AM

#

Does tensorflow 2.9 support cuda 11.6?

barren wedge May 19, 2022, 10:00 AM

#

Can we ignore words with only one token in dataframe/series?
how to do it?

ebon ember May 19, 2022, 10:04 AM

#

ebon ember Hey Anyone here familiar with xarray? ''' def calc_max_std(gb): max_cent = g...

Found the problem 🙂
had to replace 'max_cent = gb.mean(dim='dt').argmax()' with 'max_cent = gb.mean(dim='dt').argmax().data'
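A small sketch of why that fix works, using made-up data and no `groupby`: `argmax()` on a `DataArray` returns a zero-dimensional `DataArray`, which cannot be used directly as a slice bound. Converting it with `.item()` (shown here as an alternative to `.data`) gives a plain integer. The clamping with `max(..., 0)` is an added assumption to keep the left bound non-negative.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
da = xr.DataArray(rng.random((5, 200)), dims=("dt", "x"))  # hypothetical data

max_cent = da.mean(dim="dt").argmax()   # 0-d DataArray, not a Python int
center = max_cent.item()                # plain int, safe to use in a slice

idx_left = max(center - 30, 0)          # clamp so the slice never starts below 0
idx_right = center + 30
window = da[:, idx_left:idx_right]
result = window.max(dim="x").mean()
```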

willow jasper May 19, 2022, 10:50 AM

#

can anyone tell me how i can add a new column to this dataset?
the main point is that the values for that column should mirror score, but in place of 5 i have to put 1 and in place of 2 i have to put 0

#

can anyone help me out
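One way to do this in pandas, as a sketch. The dataset is not shown in the log, so the column names `score` and `label` and the sample values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"score": [5, 2, 5, 2, 5]})   # hypothetical data

# Map each score to its binary label; any score not in the dict becomes NaN.
df["label"] = df["score"].map({5: 1, 2: 0})
```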

lone yacht May 19, 2022, 11:10 AM

#

Hi everyone, I have a few questions for those with experience in training neural networks in PyTorch. I've recently been trying to train and evaluate a CNN to classify images from the CIFAR10 dataset. The condition is it has to be three hidden layers only, and I've set it up as per the code attached.

#

I have a training and validation set, and I've been evaluating the model with them. I'm using SGD with a learning rate of 0.001 and a momentum of 0.9, with cross entropy loss as the criterion. I was hoping to get a graph that looks like an exponential graph, as per most experiments I've seen, but instead it's come out like this, where the validation accuracy is all over the place.

#

Is there anything I could change to get something more consistent and with higher accuracy?

#

This is what I was hoping it would look more like

mild dirge May 19, 2022, 11:33 AM

#

Well the scale of your graph is quite misleading, the accuracy goes from 48 to 60 there

#

So the training accuracy and validation accuracy aren't as far off as it seems in that graph

#

@lone yacht

#

As of now it seems that you are slightly over-fitting as your validation accuracy is lower than your training accuracy. But equally important is that your training accuracy doesn't seem to increase further.

lone yacht May 19, 2022, 11:38 AM

#

Thanks for the response @mild dirge. Do you know what I could try to get a better accuracy? I also note that the first epoch starts at around 55% which I'm curious about, because I would have expected it to be more like 10% starting off with random weights, given that there are 10 classes, do you know why that might be?

mild dirge May 19, 2022, 11:38 AM

#

It is likely the accuracy after the first epoch

#

It is quite normal that it can jump that high after the first epoch*

lone yacht May 19, 2022, 11:40 AM

#

Ahh, that makes sense

mild dirge May 19, 2022, 11:40 AM

#

And the fact that the accuracy is not going up can be because of many things

#

complex dataset, too small network etc.

#

It's also not that common to only use 1 fc layer I think

chilly sapphire May 19, 2022, 11:42 AM

#

hello everyone

#

just wondering if anyone can help me with my data structure analysis assignment

mild dirge May 19, 2022, 11:42 AM

#

In general 3 hidden layers probably won't give too good of a performance

#

you could probably pull a sneaky and use a convolution layer with a stride above 1

#

that way you reduce the spatial dimensions and get to use another convolutional layer

#

(instead of the pool)

#

@lone yacht

lone yacht May 19, 2022, 11:45 AM

#

Thanks @mild dirge , I'll try swapping out the pooling layer with another convolutional layer

mild dirge May 19, 2022, 11:45 AM

#

With a stride above 1* that's the important bit here
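A quick shape check of that suggestion in PyTorch. The channel counts and the 32x32 input size are assumptions chosen to resemble CIFAR-10 feature maps, not values from the attached model.

```python
import torch
from torch import nn

x = torch.randn(1, 16, 32, 32)   # (batch, channels, height, width), hypothetical sizes

pool = nn.MaxPool2d(kernel_size=2)
strided_conv = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

pooled = pool(x)              # halves height and width, no learnable weights
convolved = strided_conv(x)   # also halves height and width, but with learnable filters
```

Both layers reduce 32x32 to 16x16, so the strided convolution can take the pooling layer's place while adding trainable parameters.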

frigid elk May 19, 2022, 11:48 AM

#

in pyspark.ml.classification.logisticregressionmodel i can call 'evaluate' to get access to the aucroc and precision recall summary. ... what's the equivalent for randomforestclassifier?

willow jasper May 19, 2022, 12:44 PM

#

willow jasper can anyone tell me how can i add a new row in this datasets but the main point i...

@spring valley

willow jasper May 19, 2022, 1:03 PM

#

willow jasper can anyone tell me how can i add a new row in this datasets but the main point i...

@compact barn can u help me out

compact rose May 19, 2022, 1:17 PM

#

Hello guys, I have a question about pyspark. I printed the schema of a dataframe and I get this (check image). However, I want to split the array you see there into columns. How do I do that?

frigid elk May 19, 2022, 1:21 PM

#

is this what you're looking for?

df.select('session_id', 'session_events.*').show()

compact rose May 19, 2022, 1:26 PM

#

I think it's something like that. I just saw that I didn't explain myself well. My idea is to split what is inside "element" into columns. In the image, those are the values inside the row. And I want to slice it to get columns like this |DatetimeOpenApp| OpenApp| DatetimeViewHome| ViewHome|

serene scaffold May 19, 2022, 1:26 PM

#

@willow jasper please do not ping random people (including staff members) asking for help. This is a warning.

frigid elk May 19, 2022, 1:39 PM

#

hmm... i'm in the middle of something at work, but off the top of my head i'm thinking something along the route of exploding the session_events and pivoting the results back into named columns. ...

from pyspark.sql import functions as F
df.select('session_id', F.explode('session_events').alias('ev')).select('session_id', 'ev.datetime', 'ev.event')

there may be some syntax error in that, didn't validate, but hope that sets you on the right path.

compact rose May 19, 2022, 1:40 PM

#

I will try and thanks!

flat ridge May 19, 2022, 2:34 PM

#

hey guys, is there an "inverse prediction" function in sklearn??

#

like what X i need to put for returning an Y

mild dirge May 19, 2022, 2:46 PM

#

There can be multiple inputs that give the same prediction @flat ridge

compact rose May 19, 2022, 2:58 PM

#

Million thanks to dre, u saved my life. But now i need your help again guys xd. I was finishing up my data and now I want to group it in pyspark so that each session ID appears only once. Example: the first session id repeats in 4 rows. I want that value to appear only once, with the other values kept in the same session row. If you look at the columns, you will see that I used dummies for that, and I want it to work like a checklist where we can see which screens the client passed through (if they went to OpenApp it is 1, if they went to ViewHome it is 2)

inland zephyr May 19, 2022, 4:23 PM

#

anyway I have an open question here and I need your suggestions
Let's say I have several evaluation observations (say one hour, thirty minutes, twenty minutes, and finally one minute), and each observation gives a performance figure (accuracy). My hypothesis is that even as the observation window decreases, the deviation in performance should be small enough to conclude that my model is stable across observations. Is it okay to use a standard deviation calculation for this case? The model is the same for each observation

#

i asked the question in the wrong room lmao

brave sand May 19, 2022, 4:52 PM

#

does anyone here have any experience with multi agent reinforcement learning?

serene scaffold May 19, 2022, 4:53 PM

#

brave sand does anyone here have any experience with multi agent reinforcement learning?

don't ask to ask

steady basalt May 19, 2022, 5:36 PM

#

dusty valve how do i install pytorch without CUDA?

use metal on mac

#

its how i run errting on the gpu

#

else, colab is best for smallish projects

cinder matrix May 19, 2022, 5:56 PM

#

guys anyone have a cool machine learning project idea something

#

maybe an interesting way to collect some dataset or a task thats not so common idk

pearl heart May 19, 2022, 6:10 PM

#

yes

#

I have

soft kite May 19, 2022, 6:17 PM

#

would this be the appropriate channel to ask about using pandas to visualize my data?

steady basalt May 19, 2022, 6:38 PM

#

can anyone help me augmenting my images for cnn?

#

if anyone here has xp

#

desperate!

wooden sail May 19, 2022, 6:53 PM

#

if you're using keras or pytorch, they should have built-ins for this

loud cove May 19, 2022, 6:58 PM

#

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)

I'm using sklearn and not sure why not getting the same results each run even though I've set the random state to 0

you can see random forest and log reg changes scores.

data are articles which have been vectorized and lemmatized .
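One possible cause, which the thread never confirms: `random_state` on `train_test_split` only fixes the split, while `RandomForestClassifier` has its own source of randomness. A sketch with toy data (an assumption, not the article dataset) showing that seeding the estimator as well makes repeated runs agree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=300, n_features=10, random_state=0)  # toy data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)


def fit_and_score(seed):
    # random_state here controls the forest's bootstrapping and feature sampling
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(x_train, y_train).score(x_test, y_test)


score_a = fit_and_score(0)
score_b = fit_and_score(0)
```

If the logistic regression scores also move between runs, the randomness may come from an earlier step such as preprocessing, which would need its own seed.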

thin palm May 19, 2022, 7:05 PM

#

Any recommendations in finding and understanding outliers in your Pandas dataset?

wooden sail May 19, 2022, 7:05 PM

#

basic descriptive statistics are a good place to start

#

split up the data into quartiles and see if you can learn something

thin palm May 19, 2022, 7:07 PM

#

wooden sail split up the data into quartiles and see if you can learn something

Yup, was thinking of box plots of each features to see where things are. .describe() gives good details as well
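The quartile idea can be made concrete with the usual box-plot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The salary values below are made up for illustration.

```python
import pandas as pd

salaries = pd.Series([0, 48, 52, 55, 58, 60, 63, 67, 70, 250], name="salary")  # toy values

q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual box-plot whisker limits

outliers = salaries[(salaries < lower) | (salaries > upper)]
```

Here the rule flags both the 0 and the 250, which is the same set of points a box plot would draw beyond its whiskers.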

steady basalt May 19, 2022, 7:17 PM

#

wooden sail if you're using keras or pytorch, they should have built-ins for this

Oh they do

#

I’m just choking on syntax

#

Do u know it by heart?

wooden sail May 19, 2022, 7:21 PM

#

sadly not by heart, but i can tell you what the parameters mean if you show me an example of the syntax

steady basalt May 19, 2022, 7:25 PM

#

so when you use flow

#

and if you say, rotate at angles 10 degrees

#

it will add 36 images to ur training set

#

?

#

@wooden sail

#

well of course you save it as a new variable set

#

example

#

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)

train_dataset = image_generator.flow_from_directory(batch_size=32,
directory='full_dataset',
shuffle=True,
target_size=(280, 280),
subset="training",
class_mode='categorical')

validation_dataset = image_generator.flow_from_directory(batch_size=32,
directory='full_dataset',
shuffle=True,
target_size=(280, 280),
subset="validation",
class_mode='categorical')

#

bad example

#

https://www.tensorflow.org/tutorials/images/data_augmentation

TensorFlow

Data augmentation | TensorFlow Core

#

```python
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=20,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.2)  # val 20%

val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)


train_data = train_datagen.flow_from_directory(train_path,
                                               target_size=(224, 224),
                                               color_mode='rgb',
                                               batch_size=BS,
                                               class_mode='categorical',
                                               shuffle=True,
                                               subset = 'training')

val_data = val_datagen.flow_from_directory(train_path,
                                           target_size=(224, 224),
                                           color_mode='rgb',
                                           batch_size=BS,
                                           class_mode='categorical',
                                           shuffle=False,
                                           subset = 'validation')
```

#

@wooden sail

#

looking at this code here

#

how do you tell how many images train_datagen will create

#

its potentially thousands right?

#

per sample

#

my thought is that this amount of transformation will make many images from one image to go thru epochs

wooden sail May 19, 2022, 7:32 PM

#

no, it won't generate new ones per se. it'll take the same ones you fed and make a modified copy in memory on the fly, then delete it. it will use as many images as you told it to when you specified the number of epochs and total number of images

steady basalt May 19, 2022, 7:32 PM

#

oh

#

waiiiiiit

#

so its like

wooden sail May 19, 2022, 7:32 PM

#

the transformations are applied randomly

steady basalt May 19, 2022, 7:32 PM

#

epoch 1 will use your first version

#

and epoch 2 will use a rotated one?

#

epoch 3 will use a scaled one?

wooden sail May 19, 2022, 7:33 PM

#

randomly

steady basalt May 19, 2022, 7:33 PM

#

it says its a range... so uh where does it specify how many inside that range are made

wooden sail May 19, 2022, 7:33 PM

#

they will very likely never see the original

#

the range is a percentage. you tell it any transformation with a parameter ranging from 0% to 20% shift is valid, for example
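To illustrate what "on the fly" and "a range" mean here, a NumPy-only stand-in follows. It is not how Keras implements `ImageDataGenerator`: the function name, the image sizes, and the use of `np.roll` (which wraps pixels around instead of filling them) are all assumptions made to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((6, 8, 8))          # six hypothetical 8x8 grayscale images
width_shift_range = 0.2                 # shifts of up to 20% of the width are allowed


def augmented_epoch(images):
    """Yield one randomly shifted copy per image; nothing is stored or added."""
    width = images.shape[2]
    max_shift = int(width_shift_range * width)
    for img in images:
        shift = int(rng.integers(-max_shift, max_shift + 1))  # new draw every time
        yield np.roll(img, shift, axis=1)


epoch_1 = list(augmented_epoch(images))
epoch_2 = list(augmented_epoch(images))
```

Each epoch still sees six images. What changes from epoch to epoch is the random shift drawn from the allowed range, so the dataset on disk never grows.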

steady basalt May 19, 2022, 7:34 PM

#

(rescale=1./255, < also what is this?

wooden sail May 19, 2022, 7:34 PM

#

multiply by 1/255

steady basalt May 19, 2022, 7:34 PM

#

so number of versions of one image = number of epochs

#

?

wooden sail May 19, 2022, 7:35 PM

#

as far as i recall, yeah

steady basalt May 19, 2022, 7:35 PM

#

riiight ok

#

im trying to get the code in here and work but struggling

#

Actually

#

I tried adding to a sequential model earlier but i think that was just a singular transform so it was busted

#

shud i expect a nice validation loss reduction?

wooden sail May 19, 2022, 7:36 PM

#

with augmentation, yes

#

the validation loss should be a lot closer to the training one, at least

steady basalt May 19, 2022, 7:36 PM

#

its currently almost 98% accurate before augmentation

wooden sail May 19, 2022, 7:36 PM

#

that's meaningless, all overfitting

steady basalt May 19, 2022, 7:36 PM

#

oh nvm

#

its 73

#

reckon it will break 80?

#

for X-rays, is there any rule for which set of transformations and ranges is optimal? or is it trial and error

wooden sail May 19, 2022, 7:38 PM

#

it would depend on the type of registration error you expect to find

#

i'd expect this type of shift and shear and scaling up and down along one or both axes are common. flipping LR and UD, not really

steady basalt May 19, 2022, 7:38 PM

#

rotate
?

wooden sail May 19, 2022, 7:39 PM

#

small rotations, yeah. very small though.

steady basalt May 19, 2022, 7:39 PM

#

do you know how to account for artifacts?

#

that are biasing my accuracy

#

e.g literal text that label my class

wooden sail May 19, 2022, 7:39 PM

#

artifacts in what, the images? you'd expect them to average out

#

or what are you calling artifacts

steady basalt May 19, 2022, 7:39 PM

#

if you have pictures of cars and pictures of ducks and the cars all have on the image 'car' written on it

#

somewhere

#

small

#

its gona pick up on that

#

and thats kinda not what you want a model to do when you roll it out

wooden sail May 19, 2022, 7:40 PM

#

couldn't really say. you'd kinda wanna remove bad data from the data set

steady basalt May 19, 2022, 7:41 PM

#

Supposedly there’s a workaround for this we’re gona learn but I cannot guess what it’s going to be

wooden sail May 19, 2022, 7:41 PM

#

i can't think of one off the top of my head. admittedly this isn't the type of ML i look at

steady basalt May 19, 2022, 7:42 PM

#

What do u look at

wooden sail May 19, 2022, 7:42 PM

#

so-called "deep unfoldings" are what has my attention atm

steady basalt May 19, 2022, 7:42 PM

#

Industry or academia?

wooden sail May 19, 2022, 7:42 PM

#

academia

steady basalt May 19, 2022, 7:43 PM

#

Which country?

wooden sail May 19, 2022, 7:43 PM

#

germany

steady basalt May 19, 2022, 7:43 PM

#

Nice

#

Post doc?

wooden sail May 19, 2022, 7:43 PM

#

doc

steady basalt May 19, 2022, 7:43 PM

#

I was considering if a PhD is worth it next but I’m not so sure

#

I’d rather work for a company than a research uni

wooden sail May 19, 2022, 7:44 PM

#

if all you wanna do is research, a company is more than ok tbh. you can consider a phd if you think it gives you better job options, you like academia and/or teaching, or something like that

steady basalt May 19, 2022, 7:45 PM

#

I’m not so much into teaching this because I don’t have a math background

#

Are opportunities good in Germany for English speakers with masters?

wooden sail May 19, 2022, 7:45 PM

#

in industry, i'm not sure yet. in academia yes, doctoral programs are often in english and lots of foreigners apply here. masters too, for that matter

steady basalt May 19, 2022, 7:45 PM

#

PhD definitely improved job options because I notice half of the companies say PhD preferred rn which is annoying

#

But 3 years of hell….

thin palm May 19, 2022, 7:50 PM

#

if my skewness of a certain column came out to a normal distribution, why does my box plot show outliers?

Screen_Shot_2022-05-19_at_1.49.36_PM.png

#

But then I plot a histogram and you can kinda see that bins go above

Screen_Shot_2022-05-19_at_1.52.23_PM.png

#

any idea on what to do with outliers with salaries?

steady basalt May 19, 2022, 7:55 PM

#

@wooden sail man im getting so many errors apparently my colab directory doesnt exist

steady basalt May 19, 2022, 7:55 PM

#

thin palm any idea on what to do with outliers with salaries?

if they are too useful to drop theres a scaler that will account for that

thin palm May 19, 2022, 7:56 PM

#

steady basalt if they are too useful to drop theres a scaler that will account for that

what would you do? I personally dont think it makes sense to drop them but ofc scaling would make the most sense

wooden sail May 19, 2022, 7:58 PM

#

do you have it in the correct directory? 😛 you can see your file structure, check around

steady basalt May 19, 2022, 7:58 PM

#

Yeah dude

#

I will literally livestream this to u

#

im stuck augmenting for an hour

wooden sail May 19, 2022, 7:58 PM

#

there's a tmp folder where some stuff lands when downloaded, but other things land in a different folder whose name i can't recall

thin palm May 19, 2022, 7:58 PM

#

steady basalt if they are too useful to drop theres a scaler that will account for that

is histplot a good idea to get an idea of outliers?

steady basalt May 19, 2022, 7:59 PM

#

boxplot

#

@wooden sail wana try get my augmentation working?

wooden sail May 19, 2022, 7:59 PM

#

i'll pass 😛

steady basalt May 19, 2022, 8:00 PM

#

Anyone?

thin palm May 19, 2022, 8:02 PM

#

steady basalt boxplot

gotcha, we can't have a salary of 0 so maybe it makes sense to drop just that one value

steady basalt May 19, 2022, 8:02 PM

#

haha

#

intern

thin palm May 19, 2022, 8:10 PM

#

what's yalls thoughts on this?

Screen_Shot_2022-05-19_at_2.10.31_PM.png

steady basalt May 19, 2022, 8:23 PM

#

thats a lot of outliers

#

keep them

#

its expected irl theres always a bunch earning more

thin palm May 19, 2022, 8:24 PM

#

steady basalt its expected irl theres always a bunch earning more

yeah I'd keep the outliers above the right whisker, but the last one on the left needs to go, because salary can't be $0

steady basalt May 19, 2022, 8:25 PM

#

is it just one data point

thin palm May 19, 2022, 8:25 PM

#

steady basalt is it just one data point

right but there's 5 columns with $0 salaries... imagine a house with negative square foot.. you'd be like that doesn't make sense

steady basalt May 19, 2022, 8:26 PM

#

5 columns?

#

5 companies?

thin palm May 19, 2022, 8:26 PM

#

apologies, 5 rows

steady basalt May 19, 2022, 8:26 PM

#

are they all the same company

thin palm May 19, 2022, 8:27 PM

#

steady basalt are they all the same company

doesn't say company name, just company ID and they're all different

steady basalt May 19, 2022, 8:32 PM

#

god it takes me like 10 minutes to train a tiny neural network

#

how fast do u think a cloud service cud make this

#

80ms per step on m1 pro

steady basalt May 19, 2022, 9:18 PM

#

anyone know why flow_from_directory says error20 directory doesnt exist

#

its a npy file

#

how does this function need data

steady basalt May 19, 2022, 9:50 PM

#

ah damn u need specific folder structure

#

@wooden sail any idea how to create class files is it just numpy saving

loud cove May 19, 2022, 10:11 PM

#

I have a question, so I have created a model and it gave me accuracy of 85% with train test split, when I used it on the full thing, it gives me 98%.
is this right?
What I did at the end is fit + predict on the full dataset

mild dirge May 19, 2022, 10:13 PM

#

on the full thing?

loud cove May 19, 2022, 10:13 PM

#

yea

mild dirge May 19, 2022, 10:13 PM

#

well that would be the training accuracy

#

testing on the data you use for training will give very high accuracy indeed

#

So that accuracy does not tell you anything about how good the model would perform on new data
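A small scikit-learn sketch of the gap being described, on synthetic data with deliberately noisy labels (an assumption, not the article dataset). A fully grown decision tree memorizes its training rows, so scoring it on those rows says little about new data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)  # toy data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

train_accuracy = model.score(x_train, y_train)  # scored on data the model has already seen
test_accuracy = model.score(x_test, y_test)     # scored on held-out data
```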

loud cove May 19, 2022, 10:14 PM

#

yeah makes sense

loud cove May 19, 2022, 10:15 PM

#

mild dirge well that would be the training accuracy

- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).

it is just that these are the requirement of the project, but I'm confused about the second one.

mild dirge May 19, 2022, 10:16 PM

#

What about that is saying to train/test on the full data?

#

Or is this another problem?

loud cove May 19, 2022, 10:17 PM

#

mild dirge What about that is saying to train/test on the full data?

nothing, but I'm confused on how would i output anything if they only gave me one dataset around, 2500 articles to classify

#

unless I manually split them it doesn't make sense

mild dirge May 19, 2022, 10:17 PM

#

the second point is not about classifying, it's about clustering

loud cove May 19, 2022, 10:18 PM

#

yeah there is no clustering, it is just 3 columns, article, article name and category (three types)

#

or am i missing something

mild dirge May 19, 2022, 10:19 PM

#

loud cove unless I manually split them it doesn't make sense

Alright, let's take it one problem at a time, what do you mean with this?

#

This is still about the classification right?

loud cove May 19, 2022, 10:19 PM

#

for example, make the training the first 2000 and test last 500

#

yes it is a classification project

mild dirge May 19, 2022, 10:19 PM

#

Yes, that is an often-used method

#

Sometimes the prof provides an already split data, but if it is not split yet, you can split it yourself

loud cove May 19, 2022, 10:20 PM

#

mild dirge Yes, that is an often-used method

yeah but there is train test split thing and it does it for me it feels weird.

mild dirge May 19, 2022, 10:20 PM

#

You do need to put a bit of thought into how you split it

#

Like you want to test all the classes, so you could f.e. try to get an equal distribution of each class in the testing set

loud cove May 19, 2022, 10:20 PM

#

well, it is articles and they don't seem to be ordered, so i was thinking of shuffling, then taking the bottom 500 and moving on

mild dirge May 19, 2022, 10:21 PM

#

Well if you plan on looking at the accuracy, you want the classes to be equally distributed for the test set

#

Otherwise you'd normally use different measures like f1 score, macro accuracy, confusion matrix etc.
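One way to keep the class mix under control when splitting is the `stratify` argument of `train_test_split`, which preserves each class's share of the data in both parts (it does not equalize the classes). The labels below are toy values chosen to be imbalanced.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

y = [0] * 60 + [1] * 30 + [2] * 10     # imbalanced toy labels
x = [[i] for i in range(len(y))]       # placeholder features

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=20, stratify=y, random_state=0
)
test_counts = Counter(y_test)
```

With 20 test rows the split keeps the 60/30/10 proportions, so every class is represented in the test set even when it is rare.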

loud cove May 19, 2022, 10:21 PM

#

I was mainly looking at f score, but the project info says they mostly care about accuracy for this one.

mild dirge May 19, 2022, 10:22 PM

#

Did you look at the class distribution?

loud cove May 19, 2022, 10:22 PM

#

yes

mild dirge May 19, 2022, 10:22 PM

#

and?

loud cove May 19, 2022, 10:22 PM

#

#

mild dirge May 19, 2022, 10:22 PM

#

alright, it's not balanced

#

how did you approach that?

loud cove May 19, 2022, 10:23 PM

#

wel

#

well, in score I just used weighted

#

but I honestly don't know much and just getting the project flipped lol

mild dirge May 19, 2022, 10:23 PM

#

weighted?

loud cove May 19, 2022, 10:23 PM

#

for the scores

mild dirge May 19, 2022, 10:24 PM

#

what function is that?

loud cove May 19, 2022, 10:24 PM

#

what I found weird is that engineering (least count) has the highest f score, my theory is that it's due to not much overlap in words

#

def try_model(model_name):
    mdl=''
    if model_name == 'Logistic Regression':
        mdl = LogisticRegression(max_iter=500) #increasing iterations since default 100 wasn't enough
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier()
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB()
    elif model_name == 'Support Vector Classifer':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier()
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()
   


    oneVsRest = OneVsRestClassifier(mdl)
    oneVsRest.fit(x_train, y_train)
    y_pred = oneVsRest.predict(x_test)
    
    
    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 4)
    # Get the weighted precision, recall, f1 scores
    precision, recall, f1score, support = score(y_test, y_pred, average='weighted')

    print(f'Test Accuracy Score of Basic {model_name}: % {accuracy}')
    print(f'Precision : {precision}')
    print(f'Recall    : {recall}')
    print(f'F1-score   : {f1score}')

    # Add performance parameters to list
    perform_list.append(dict([
        ('Model', model_name),
        ('Test Accuracy', round(accuracy, 4)),
        ('Precision', round(precision, 4)),
        ('Recall', round(recall, 4)),
        ('F1', round(f1score, 4))
         ]))

#

it is just a function to try few different models and see the result of each

#

from sklearn.metrics import precision_recall_fscore_support as score

mild dirge May 19, 2022, 10:26 PM

#

Okay, so a weighted accuracy would be a solution I suppose

#

Were there any hyper-parameters to tune for your model?

loud cove May 19, 2022, 10:26 PM

#

yeah i used weighted as you can see
precision, recall, f1score, support = score(y_test, y_pred, average='weighted')

mild dirge May 19, 2022, 10:26 PM

#

did you do a grid-search?

loud cove May 19, 2022, 10:27 PM

#

I kept defaults other than the estimators, but decided against it.

mild dirge May 19, 2022, 10:27 PM

#

Did you train/test the model and decide to change stuff to get a higher accuracy afterwards?

arctic wedgeBOT May 19, 2022, 10:27 PM

#

Hey @loud cove!

It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.

Feel free to ask in #community-meta if you think this is a mistake.

loud cove May 19, 2022, 10:28 PM

#

mild dirge Did you train/test the model and decided to change stuff to get a higher accurac...

no because it took more time to get less than 1% extra

mild dirge May 19, 2022, 10:28 PM

#

alright, that's good

#

So if you didn't change your model and choose all the hyper parameters beforehand, then just splitting the data into training and testing is fine

loud cove May 19, 2022, 10:29 PM

#

you can see the thing here, just change csv to ipynb since bot didn't let me upload

📎 articles_classifier.csv

mild dirge May 19, 2022, 10:29 PM

#

If you wanted to find good parameters for your model, you could have used a validation set to check the model accuracy

#

You understand why you split it into training and testing right?

loud cove May 19, 2022, 10:30 PM

#

I don't really care about the accuracy, I think 80 something is good enough.
yes to prevent overfitting

mild dirge May 19, 2022, 10:30 PM

#

So why does it seem "weird" to you to split the data like that?

loud cove May 19, 2022, 10:31 PM

#

no what I find weird is that they requested clusters, I'm assuming my predictions for each of the three categories, which is weird

#

I'd only be able to get that if I split them manually

mild dirge May 19, 2022, 10:31 PM

#

Is there more explanation on clusters?

loud cove May 19, 2022, 10:31 PM

#

nope

mild dirge May 19, 2022, 10:31 PM

#

JSON file for the clusters output (to each category) This is literally everything in the assignment mentioning clusters

#

?

loud cove May 19, 2022, 10:32 PM

#

Feel free to choose your classification algorithm and all the pre-processing needed on the data.

The team shares with you this JSON file (Note: "JSON" text is clickable) for a group of categorized articles as you will divide those articles into 3 groups: training data, validating data, and testing data.

To measure the accuracy of each algorithm, at this level you will measure the accuracy by the percent of matching only.

What we expected:
A GitHub repository includes:
- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).

Avoid plagiarism at all costs. All submissions will undergo plagiarism checks before being graded.

loud cove May 19, 2022, 10:32 PM

#

mild dirge `JSON file for the clusters output (to each category)` This is literally everyth...

yes

mild dirge May 19, 2022, 10:32 PM

#

It seems that they wanted you to split into training/validation/test

#

you did not make a validation dataset?

steady basalt May 19, 2022, 10:32 PM

#

why bother when u can cross validate training data on itself

loud cove May 19, 2022, 10:33 PM

#

nah I just did validation with grid search

mild dirge May 19, 2022, 10:33 PM

#

steady basalt why bother when u can cross validate training data on itself

A separate validation set is more time efficient; cross validation requires multiple training cycles

loud cove May 19, 2022, 10:33 PM

#

I'm hoping they won't bother with that part lol

mild dirge May 19, 2022, 10:33 PM

#

but grid search with what set?

loud cove May 19, 2022, 10:34 PM

#

I don't really know much; I just took a GitHub notebook from a somewhat similar project and adapted it to my needs.

loud cove May 19, 2022, 10:34 PM

#

mild dirge but grid search with what set?

training

steady basalt May 19, 2022, 10:35 PM

#

i miss stats tasks

mild dirge May 19, 2022, 10:35 PM

#

alright, so you used cross validation, that's fine

steady basalt May 19, 2022, 10:35 PM

#

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2, rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)
y_train = keras.utils.to_categorical(y_train, 3)
model.compile(optimizer = opt , loss = 'categorical_crossentropy' , metrics=['categorical_accuracy'])
y_train = keras.utils.np_utils.to_categorical(y_train, 3)
y_test = keras.utils.np_utils.to_categorical(y_test, 3)
datagen = ImageDataGenerator(rescale=1/255, featurewise_center=True, featurewise_std_normalization=True, rotation_range=20,
                             width_shift_range=0.2, height_shift_range=0.2,
                             horizontal_flip=True, validation_split=0.2)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)
# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
          steps_per_epoch=len(x_train) / 32, epochs=100)

mild dirge May 19, 2022, 10:35 PM

#

It also uses validation sets, but it splits your training data into multiple folds

steady basalt May 19, 2022, 10:35 PM

#

brainmon its gona work this time

loud cove May 19, 2022, 10:35 PM

#

yeah, but I didn't do it as one split, like 60/10/30 training/validation/test

mild dirge May 19, 2022, 10:35 PM

#

Concerning the second point of your requirements, I do not know what is meant

#

I would ask for clarification if I were you

#

I can only guess at some stuff that could be wanted from you

loud cove May 19, 2022, 10:36 PM

#

mild dirge It also uses validation sets, but it splits your training data into multiple fol...

It also uses validation sets, but it splits your training data into multiple folds
What is it?

mild dirge May 19, 2022, 10:36 PM

#

the cross validation grid search function

loud cove May 19, 2022, 10:37 PM

#

mild dirge I would ask for clarification if I were you

It is just one of those MOOC things, I honestly don't care much and just want to get it over with, they don't have much info but was just checking

loud cove May 19, 2022, 10:37 PM

#

mild dirge the cross validation grid search function

yeah that's what I found when I googled

steady basalt May 19, 2022, 10:37 PM

#

@mild dirge i need help with this augmentation give 5 mins to train first

loud cove May 19, 2022, 10:37 PM

#

it was between that and random and i just went with this one.

#

honestly it is stupid, why ask me for pdf (I know in real life you'd need to summarize outside of your code, but just annoying).

mild dirge May 19, 2022, 10:38 PM

#

That's just convention

#

For pretty much every one of my courses I have to write reports and send as pdfs too

loud cove May 19, 2022, 10:39 PM

#

Yeah I'm just being lazy, I don't really care much ML career wise but just took it since it is free

steady basalt May 19, 2022, 10:39 PM

#

fit generator broke my kernel

loud cove May 19, 2022, 10:39 PM

#

just gluing together few things here and there and it works

steady basalt May 19, 2022, 10:41 PM

#

92ms/step lemon_clown

mild dirge May 19, 2022, 10:41 PM

#

I mean you can get pretty far gluing stuff together with ML, but when you don't fully understand what you are doing you might hit a brick wall at some point

steady basalt May 19, 2022, 10:41 PM

#

I have 42

mild dirge May 19, 2022, 10:41 PM

#

steady basalt 92ms/step lemon_clown

Not that bad right?

steady basalt May 19, 2022, 10:41 PM

#

images

#

jesus

mild dirge May 19, 2022, 10:41 PM

#

per image?

steady basalt May 19, 2022, 10:42 PM

#

let me check

mild dirge May 19, 2022, 10:42 PM

#

or batch

steady basalt May 19, 2022, 10:42 PM

#

ok its

#

yeah

loud cove May 19, 2022, 10:42 PM

#

mild dirge I mean you can get pretty far gluing stuff together with ML, but when you don't ...

yeah I think domain knowledge and a little stats is enough for most cases.

steady basalt May 19, 2022, 10:42 PM

#

1500 img

#

so thats 42 uhhh steps

#

per image

#

thats batch size?

mild dirge May 19, 2022, 10:44 PM

#

ehhhh

steady basalt May 19, 2022, 10:44 PM

#

didnt set anything

#

defaults 32...

mild dirge May 19, 2022, 10:44 PM

#

if you didn't set a batch size it is probably some default value in pytorch

loud cove May 19, 2022, 10:44 PM

#

https://github.com/MAmr21/EGYFWD/blob/KO---Articles-Classification/articles classifier V1.ipynb

alright what you think?

GitHub

EGYFWD/articles classifier V1.ipynb at KO---Articles-Classification...

EgyFWD Course. Contribute to MAmr21/EGYFWD development by creating an account on GitHub.

mild dirge May 19, 2022, 10:44 PM

#

or keras*

loud cove May 19, 2022, 10:45 PM

#

wait nvm wrong file

steady basalt May 19, 2022, 10:45 PM

#

32 is meant to be default

mild dirge May 19, 2022, 10:45 PM

#

Looks interesting @loud cove

steady basalt May 19, 2022, 10:45 PM

#

anything over 100ms/step would be so horrible for big data

#

i wonder if a tesla cud reduce to like 20

mild dirge May 19, 2022, 10:46 PM

#

steady basalt i wonder if a tesla cud reduce to like 20

I don't think cars work very well as image classification models

steady basalt May 19, 2022, 10:46 PM

#

xd

#

nvidia tesla

loud cove May 19, 2022, 10:46 PM

#

mild dirge Looks interesting @loud cove

this one https://github.com/MAmr21/EGYFWD/blob/main/KO/Article Classification/articles classifier.ipynb

loud cove May 19, 2022, 10:47 PM

#

mild dirge

oh it does that on your computer too? I thought mine just glitched

steady basalt May 19, 2022, 10:48 PM

#

cruising on 13.5GB ram and 98% GPU usage

#

just from python3

#

shudda got 32gbn

#

wow 14gb ram used

mild dirge May 19, 2022, 10:49 PM

#

steady basalt shudda got 32gbn

yeah same

steady basalt May 19, 2022, 10:49 PM

#

what causes it

mild dirge May 19, 2022, 10:49 PM

#

what causes what?

steady basalt May 19, 2022, 10:49 PM

#

just the calculations?

mild dirge May 19, 2022, 10:49 PM

#

loud cove this one https://github.com/MAmr21/EGYFWD/blob/main/KO/Article%20Classification/...

I'm looking through it btw

mild dirge May 19, 2022, 10:50 PM

#

steady basalt just the calculations?

Yeah the forward and backwards pass mainly

steady basalt May 19, 2022, 10:50 PM

#

backprop is the main ram hog?

loud cove May 19, 2022, 10:50 PM

#

I don't think 32 would make much difference

steady basalt May 19, 2022, 10:50 PM

#

im getting close to my 16gb

#

2 left

loud cove May 19, 2022, 10:50 PM

#

that's more than enough

steady basalt May 19, 2022, 10:51 PM

#

its a tiny project haha

#

wow the laptops so hot now

loud cove May 19, 2022, 10:51 PM

#

windows/os keeps a backup memory so your apps don't lag

steady basalt May 19, 2022, 10:51 PM

#

I thought these new macs stayed cool

loud cove May 19, 2022, 10:51 PM

#

steady basalt I thought these new macs stayed cool

I don't think any machine will stay cool when you have 98% gpu utilization

loud cove May 19, 2022, 10:51 PM

#

mild dirge I'm looking through it btw

thanks, I copied most of it from this guy https://github.com/deepak0437/natural_language_processing/blob/main/News_Article/BBC_News_Classification.ipynb

GitHub

natural_language_processing/BBC_News_Classification.ipynb at main ·...

Contribute to deepak0437/natural_language_processing development by creating an account on GitHub.

steady basalt May 19, 2022, 10:52 PM

#

the air coming out the sides is cool

#

the keyboards hoy

mild dirge May 19, 2022, 10:52 PM

#

Did you copy a lot?

steady basalt May 19, 2022, 10:52 PM

#

hot

mild dirge May 19, 2022, 10:52 PM

#

Could be an idea to reference it if you think you copied maybe a bit too much

loud cove May 19, 2022, 10:52 PM

#

mild dirge Did you copy a lot?

I downloaded his then changed it to fit mine

steady basalt May 19, 2022, 10:52 PM

#

underside of the laptops BURNING hot

#

shud i cancel this

#

cnn

#

yikes

loud cove May 19, 2022, 10:52 PM

#

mild dirge Could be an idea to reference it if you think you copied maybe a bit too much

I would if it was an academia thing, but most of it is just basic stuff

mild dirge May 19, 2022, 10:53 PM

#

high gpu usage is good though

#

you want the gpu to work hard

loud cove May 19, 2022, 10:54 PM

#

loud cove I would if it was an academia thing, but most of it is just basic stuff

but they might scream plagiarism so I'll just let it go
most of these projects follow same stuff anyways

mild dirge May 19, 2022, 10:54 PM

#

So one thing that you did is choosing the model that gives the highest test accuracy @loud cove

loud cove May 19, 2022, 10:55 PM

#

mild dirge So one thing that you did is choosing the model that gives the highest test accu...

that's because that's what they wanted

mild dirge May 19, 2022, 10:55 PM

#

Which means you implicitly use the test data to make choices

loud cove May 19, 2022, 10:55 PM

#

mild dirge May 19, 2022, 10:55 PM

#

the testing data is only for testing your final model

#

Nothing about choosing your model/training/parameter-search etc. should include anything from your test data

loud cove May 19, 2022, 10:56 PM

#

the closest thing to it is the log reg, but I think random forest would make more sense

loud cove May 19, 2022, 10:56 PM

#

mild dirge the testing data is only for testing your **final** model

yeah makes sense

thin palm May 19, 2022, 10:57 PM

#

anyone have a good understanding of a matrix correlation??
1.) Before Plotting and calculating a matrix correlation do I need to OHE some feature columns first?
2.) Right now my dataset only has 3 numeric values, the rest are of types objects. The 3rd numeric value being the target variable. So if I build my matrix it's going to be small, and how can I determine which feature is most important, high correlation, etc. with really only 2 values?
3.) If I decide to OHE my object types to become numerical values will this be valuable for my correlation matrix?

loud cove May 19, 2022, 10:57 PM

#

but I'd have to split manually right?

mild dirge May 19, 2022, 10:57 PM

#

It might be that your test data has some weird pattern in it and that random forest might be able to learn that very luckily

#

Which would make the final accuracy that you find not really a fair indicator of performance on new data

loud cove May 19, 2022, 10:57 PM

#

mild dirge It might be that your test data has some weird pattern in it and that random for...

it changes between log and it for the best if i disable the rng

steady basalt May 19, 2022, 10:58 PM

#

Hot app is tellingme its only 63C

#

yet it burns my hand

loud cove May 19, 2022, 10:58 PM

#

#

different runs

mild dirge May 19, 2022, 10:58 PM

#

Doesn't really matter which one shows up best, the point is that the test data shouldn't be used for this

#

validation data should have been used
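
A rough sketch of what a three-way split could look like, using only the standard library. The helper name `three_way_split` and the 60/10/30 proportions are illustrative assumptions (the proportions echo the ones mentioned earlier in the chat), not something from the assignment or the notebook.

```python
import random


def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.3, seed=0):
    """Shuffle the indices once and cut them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))

# Intended workflow:
# 1. fit every candidate model on the rows in train_idx
# 2. compare the candidates on val_idx and pick one
# 3. score only the chosen model on test_idx, once
```

The point of the third step is that the test rows never influence which model gets picked, so the final number stays an honest estimate.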

loud cove May 19, 2022, 10:58 PM

#

but random forests seems most reasonable with free text

mild dirge May 19, 2022, 10:59 PM

#

And I would maybe place this plot at the data analysis part

#

Not the results

loud cove May 19, 2022, 10:59 PM

#

mild dirge validation data should have been used

in a real world scenario I'd statistically separate a bit of the data for testing

steady basalt May 19, 2022, 11:00 PM

#

loud cove

nice thats plenty of results

loud cove May 19, 2022, 11:00 PM

#

mild dirge And I would maybe place this plot at the data analysis part

yeah It used to be at start with word count, but moved it

#

to keep all visuals in one place

#

the eda have the counts anyways

mild dirge May 19, 2022, 11:00 PM

#

Don't know how deep this project goes, but you could go more in-depth on this

loud cove May 19, 2022, 11:00 PM

#

steady basalt nice thats plenty of results

it is the same thing, I just turned off the rng

mild dirge May 19, 2022, 11:01 PM

#

Checking if there is a lot of overlap between words, more than the other two categories f.e.

loud cove May 19, 2022, 11:01 PM

#

mild dirge Don't know how deep this project goes, but you could go more in-depth on this

that's actually the reason i moved it to bottom

steady basalt May 19, 2022, 11:01 PM

#

also u shud be getting AUROC its better than accuracy

loud cove May 19, 2022, 11:01 PM

#

I could find the correlation, but it is a MOOC project with interest in ML so I didn't really dive deep

mild dirge May 19, 2022, 11:02 PM

#

The word clouds at the bottom are pretty cool, but you don't say anything about them

loud cove May 19, 2022, 11:02 PM

#

steady basalt also u shud be getting AUROC its better than accuracy

thanks, I'll google that

loud cove May 19, 2022, 11:03 PM

#

mild dirge The word clouds at the bottom are pretty cool, but you don't say anything about ...

yeah i just kept it to show the "overlap" because everything is related, a distribution chart would been better, but people eat that shit, it isn't like it matters.

mild dirge May 19, 2022, 11:03 PM

#

Think all-in-all it looks pretty good, just be careful with using the test data for making decisions about the model. You could have a super interesting model and project with good results, but when you use the test data for training/choosing stuff, it could invalidate any results you get.

loud cove May 19, 2022, 11:04 PM

#

mild dirge Think all-in-all it looks pretty good, just be careful with using the test data ...

how would you split them manually with respect to distribution? I don't think this is needed given that it is free text, but with numbers it might

steady basalt May 19, 2022, 11:04 PM

#

ValueError: Shapes (None, None, None, None, None) and (None, 3) are incompatible

#

can someone whos good with TF

#

help me

#

b4 i sleep

mild dirge May 19, 2022, 11:05 PM

#

you'd want the same distribution of categories in the test set as the dataset

steady basalt May 19, 2022, 11:05 PM

#

     18           steps_per_epoch=len(x_train) / 32, epochs=100)

#

dont even get how this will work without validation data added

#

added it and still get value error

loud cove May 19, 2022, 11:07 PM

#

mild dirge you'd want the same distribution of categories in the test set as the dataset

I'm guessing there is probably a lib for that, or a function.
I'd probably do something like get 30% of the data, filter for each distribution and get their weight.

mild dirge May 19, 2022, 11:08 PM

#

Think it's called stratified

loud cove May 19, 2022, 11:08 PM

#

im sorry if this is basic, i tried googling but no luck, what do you do with validation dataset exactly? I understand what its use is, but how is it different from cross validation functions?

mild dirge May 19, 2022, 11:08 PM

#

train_test_split has an argument for it

loud cove May 19, 2022, 11:08 PM

#

mild dirge Think it's called stratified

it is there by default though

loud cove May 19, 2022, 11:08 PM

#

mild dirge train_test_split has an argument for it

yea exactly

#

oh nvm the shuffle is the one on by default
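
For reference, scikit-learn's `train_test_split` shuffles by default, while stratification only happens when labels are passed through its `stratify` argument. What stratifying does can be sketched in plain Python as below. The function `stratified_split` and the class names are made up for illustration; this is a stand-in, not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 60% "sport", 30% "tech", 10% "politics".
labels = ["sport"] * 60 + ["tech"] * 30 + ["politics"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
test_labels = [labels[i] for i in test_idx]
print({c: test_labels.count(c) for c in ("sport", "tech", "politics")})
```

Each class contributes 30% of its own rows to the test set, so the test set keeps the 60/30/10 class mix of the full dataset.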

mild dirge May 19, 2022, 11:09 PM

#

loud cove im sorry if this is basic, i tried googling but no luck, what do you do with val...

Validation dataset is for checking the accuracy of your model while in the process of making decisions and training etc.

steady basalt May 19, 2022, 11:09 PM

#

@loud cove where u from G

mild dirge May 19, 2022, 11:09 PM

#

And cross validation splits training data into train/validation multiple times

#

Each time having a different slice of the data be validation and the rest train
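
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how k-fold cross validation carves up the training indices. The name `k_fold_indices` is hypothetical; in practice scikit-learn's `KFold` and `GridSearchCV` handle this for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

With k folds the model is trained k times, which is where the extra cost compared to a single held-out validation set comes from.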

loud cove May 19, 2022, 11:10 PM

#

mild dirge Validation dataset is for checking the accuracy of your model while in the proce...

Yeah i understand, but if i do cross validation doesn't this mean it isn't needed?

loud cove May 19, 2022, 11:10 PM

#

steady basalt @loud cove where u from G

Egypt

mild dirge May 19, 2022, 11:10 PM

#

loud cove Yeah i understand, but if i do cross validation doesn't this mean it isn't neede...

correct

#

they accomplish the same thing

#

cross validation is probably better as it uses all of your training data for training and for validation at some point

loud cove May 19, 2022, 11:11 PM

#

yeah that's what i was confused about, that they wanted a standalone validation

mild dirge May 19, 2022, 11:11 PM

#

But that also means you need to train/validate multiple times

steady basalt May 19, 2022, 11:11 PM

#

Total params: 608,771
Trainable params: 607,939

#

ducky_angel

mild dirge May 19, 2022, 11:11 PM

#

And when it takes a few hours to train that might be something you are willing to give up on in order to get quicker results

loud cove May 19, 2022, 11:11 PM

#

yeah like i did for that 1% accuracy 😛

steady basalt May 19, 2022, 11:12 PM

#

ValueError: Shapes (None, None, None, None, None, None, None, None, None, None) and (None, 3) are incompatible

#

save me

mild dirge May 19, 2022, 11:12 PM

#

steady basalt Total params: 608,771 Trainable params: 607,939

I had 5 million for my previous model 😛

loud cove May 19, 2022, 11:12 PM

#

local?

mild dirge May 19, 2022, 11:12 PM

#

yeah

steady basalt May 19, 2022, 11:12 PM

#

#

why ERROR

#

AAAAAAA

mild dirge May 19, 2022, 11:12 PM

#

steady basalt save me

And I don't know how to help sorry

steady basalt May 19, 2022, 11:13 PM

#

whys my shape like that

mild dirge May 19, 2022, 11:13 PM

#

interesting tensor shape though

loud cove May 19, 2022, 11:13 PM

#

mild dirge yeah

what for?

mild dirge May 19, 2022, 11:13 PM

#

that must be wrong

mild dirge May 19, 2022, 11:13 PM

#

loud cove what for?

CNN, classifying 400 different bird species

steady basalt May 19, 2022, 11:13 PM

#

my y train is encoded

loud cove May 19, 2022, 11:14 PM

#

cnn the channel?

steady basalt May 19, 2022, 11:14 PM

#

dude i just printed and

#

printed shape again

#

and it changed

mild dirge May 19, 2022, 11:14 PM

#

loud cove cnn the channel?

convolutional neural network

steady basalt May 19, 2022, 11:14 PM

#

every time i press .shape

#

it adds a 3??

#

WAT

loud cove May 19, 2022, 11:15 PM

#

ah another thing.
https://www.mikulskibartosz.name/how-to-set-the-global-random_state-in-scikit-learn/
np.random.seed(31415) this doesn't actually globally set seed status or whatever right?

Bartosz Mikulski

How to set the global random_state in Scikit Learn

What to do if you keep forgetting to set the random_state?

#

might have been an old thing, I think

mild dirge May 19, 2022, 11:16 PM

#

no clue srr

steady basalt May 19, 2022, 11:16 PM

#

YES

#

YES

#

YES

#

ITS FIXED

#

AAAAAAAAAA

#

6 HOURS LATER

loud cove May 19, 2022, 11:16 PM

#

on its own?

steady basalt May 19, 2022, 11:16 PM

#

no i just reset my variables lol
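
One plausible explanation for the growing shape, assuming the labels went through `to_categorical` more than once (the pasted snippet calls it on `y_train` twice, and re-running a notebook cell has the same effect): one-hot encoding appends a class axis to whatever it is given. The sketch below uses a small pure-Python stand-in, `one_hot`, rather than Keras itself, and made-up labels.

```python
def one_hot(y, num_classes):
    """One-hot encode integer labels; nested lists gain one trailing axis."""
    if isinstance(y, list):
        return [one_hot(item, num_classes) for item in y]
    return [1 if i == int(y) else 0 for i in range(num_classes)]


def shape(nested):
    """Shape of a regular nested list, e.g. [[0, 1], [1, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


once = one_hot([0, 2, 1, 1], 3)   # made-up integer class labels
twice = one_hot(once, 3)          # the same encoding applied a second time
print(shape(once))    # (4, 3)
print(shape(twice))   # (4, 3, 3)
```

Encoding once from the raw integer labels, in a cell that also reloads them, keeps the target at `(n_samples, 3)`, which is what a 3-unit output layer expects.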

mild dirge May 19, 2022, 11:17 PM

#

I'm going to sleep rn, I wish you both best of luck with your projects, gn!

steady basalt May 19, 2022, 11:17 PM

#

gnnn

loud cove May 19, 2022, 11:18 PM

#

me too, thanks for the help.
have a good night guys!

molten bluff May 20, 2022, 12:16 AM

#

hi all, it's been a while since I touched stats or probability. Can someone tell me what sampling each row in a dataset uniformly at random with probability p means?

serene scaffold May 20, 2022, 12:34 AM

#

molten bluff hi all, it's been a while since I touched stats or probability. Can someone tell...

is there any possibility that it just means "randomly pick 30% of the rows" (where p is .3)?
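
The two readings differ slightly, and a small standard-library sketch may help. Under one common reading, every row is kept independently with probability p, so the sample size varies from run to run; under the other, exactly a fraction p of the rows is drawn. Both function names below are made up for illustration.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p (sample size varies)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


def fixed_fraction_sample(rows, p, seed=None):
    """Pick exactly round(p * len(rows)) rows without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, round(p * len(rows)))


rows = list(range(100))
print(len(bernoulli_sample(rows, 0.3, seed=1)))       # around 30, not always 30
print(len(fixed_fraction_sample(rows, 0.3, seed=1)))  # always 30
```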

wild pagoda May 20, 2022, 2:24 AM

#

hey everyone, when i'm running this:

        data_frame["lap"] = data_frame["lap"].astype(int)

I got this error because there is null data, how do i ignore null data?

#

i tried to add ignore, but it doesn't change dtype float to int
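
A possible way around this, sketched with made-up data: plain `int` columns cannot hold missing values, and `errors='ignore'` simply hands back the column unchanged when the cast fails. pandas' nullable `"Int64"` dtype (capital I) can store integers alongside nulls; dropping the null rows first is the other option.

```python
import pandas as pd

# Made-up data: "lap" is float64 because of the missing value.
data_frame = pd.DataFrame({"lap": [1.0, 2.0, None, 4.0]})

# Option 1: the nullable integer dtype keeps the nulls as <NA>.
data_frame["lap_nullable"] = data_frame["lap"].astype("Int64")

# Option 2: drop the rows with nulls first, then cast to plain int.
cleaned = data_frame.dropna(subset=["lap"]).copy()
cleaned["lap"] = cleaned["lap"].astype(int)

print(data_frame.dtypes)
print(cleaned["lap"].tolist())
```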

clever owl May 20, 2022, 3:31 AM

#

if i have a list [1,2,3,4] is 1 the head or the tail?

wild pagoda May 20, 2022, 3:33 AM

#

clever owl if i have a list [1,2,3,4] is 1 the head or the tail?

head

clever owl May 20, 2022, 3:33 AM

#

and is that the beginning or the end of the list

#

the beginning right?

wild pagoda May 20, 2022, 3:34 AM

#

beginning

clever owl May 20, 2022, 3:34 AM

#

cool haha ty

wild pagoda May 20, 2022, 3:34 AM

#

for example you can use list pop

wild pagoda May 20, 2022, 4:18 AM

#

how do i convert "" to null boolean in pandas?

rose agate May 20, 2022, 4:30 AM

#

wild pagoda how do i convert "" to null boolean in pandas?

you can do the whole dataframe with something like df.replace('', np.NaN, inplace=True)

#

or one col like df.loc[df['col'] == '', 'col'] = np.NaN

wild pagoda May 20, 2022, 4:31 AM

#

rose agate you can do the whole dataframe with something like `df.replace('', np.NaN, inpla...

if I convert to np.nan, then when I use df.to_sql it automatically converts to true

#

not null

rose agate May 20, 2022, 4:31 AM

#

wild pagoda if i convert to np.nan, when i'm using pf.to_sql it automaticly convert to true

try using None instead of np.NaN then perhaps

wild pagoda May 20, 2022, 4:33 AM

#

rose agate try using `None` instead of `np.NaN` then perhaps

nope, that too; it automatically converts to false

rose agate May 20, 2022, 4:37 AM

#

wild pagoda nope too, it automaticly convert to false

You'll probably need to read the documentation for that method then. This thread also seems to have solutions that may help you: https://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values

Stack Overflow

Python Pandas write to sql with NaN values

I'm trying to read a few hundred tables from ascii and then write them to mySQL. It seems easy to do with Pandas but I hit an error that doesn't make sense to me:

I have a data frame of 8 columns.

flat fern May 20, 2022, 4:49 AM

#

wild pagoda nope too, it automaticly convert to false

that's tomori😋

wild pagoda May 20, 2022, 4:49 AM

#

flat fern that's tomori😋

yes!

wild pagoda May 20, 2022, 4:50 AM

#

rose agate You'll probably need to read the documentation for that method then. This thread...

maybe I'll just use SQL, since I still don't know how

thin palm May 20, 2022, 5:05 AM

#

Any ideas of creating correlations and heatmaps of categorical variables?

coral nimbus May 20, 2022, 5:30 AM

#

Anyone tried web scraping with python before

#

I tried web scraping with the HTML element, but encountered problems

#

I'm trying to scrape a live updating number(my code executes every 1 min to get the updated number), but my code doesnt work for some reason

odd meteor May 20, 2022, 7:47 AM

#

coral nimbus I tried web scraping with the HTML element, but encountered problems

What's the error message you got? Are you using Playwright or Selenium or Bs4 or?

coral nimbus May 20, 2022, 7:49 AM

#

Bs4

#

Lemme run it again

coral nimbus May 20, 2022, 8:08 AM

#

"TypeError: 'NoneType' object is not subscriptable"

copper jetty May 20, 2022, 8:50 AM

#

Good day... Am new here

copper jetty May 20, 2022, 8:51 AM

#

odd meteor What's the error message you got? Are you using Playwright or Selenium or Bs4 o...

Please can you help me with web scraping with playwright+scrapy?

#

I did all the setup, but I get an error whenever I try to use it to scrape a JavaScript-heavy site

thorn halo May 20, 2022, 9:33 AM

#

There’s a game called 8 Ball Pool (a billiards game developed by Miniclip; you probably all know it).
I am developing a program that predicts the trajectories of the billiard balls and where they will go after a collision. Many apps like this exist, and they charge a heavy price. They simulate collisions the same way the game does. I am not sure how they got the parameters (mostly trial and error), but the result is perfect.

My question is: which physics engine should I use to simulate the ball collisions?
I have fair knowledge of C++ and Python.
The link below shows what I am trying to do :

https://stackoverflow.com/questions/69155760/how-to-get-direction-after-cueball-collide-with-any-ball-in-unity3d-like-in-8-b?noredirect=1&lq=1

Stack Overflow

How to get Direction after cueBall collide with any ball in unity3d...

Hi!
i am making something like 8 ball pool in 3d with unity 3d C#.
A is que ball and I know Dir1. I want to calculate Dir2.I am using Raycast to i can get point of contact.

odd meteor May 20, 2022, 9:33 AM

#

copper jetty Please can you help me with web scraping with playwright+scrapy?

I don't use scrapy tho. What's the error message you're getting? Are you able to see what you're doing? Add headless=False inside your chromium (if that's the browser you're using with Playwright)

odd meteor May 20, 2022, 9:35 AM

#

copper jetty I did all set up but am getting error whenever I tried to use it to scrape a Jav...

So, it's not bringing out any error message right? It's just not working as you want it to?

odd meteor May 20, 2022, 9:38 AM

#

coral nimbus "TypeError: 'NoneType' object is not subscriptable"

Can you share your code?

copper jetty May 20, 2022, 9:41 AM

#

odd meteor I don't use scrapy tho. What's the error message you're getting? Are you able to...

I just started using Playwright... I normally use Scrapy + Splash for scraping JavaScript sites, but that combo did not work when I tried it on a JavaScript-heavy website

#

So while trying to look for a solution to my problem, I came across the Playwright integration with Scrapy

#

I tried it but nothing worked

#

So which tool(s) would you advise me to use when it comes to web scraping?

#

Especially Dynamic websites

copper jetty May 20, 2022, 9:45 AM

#

odd meteor Can you share your code?

Yeah but am not with my laptop

coral nimbus May 20, 2022, 10:09 AM

#

@odd meteor here you go

#

import bs4
import requests
from bs4 import BeautifulSoup

#may need to change variable according to previous components
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url='https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch'
r= requests.get(url)


web_content=bs4.BeautifulSoup(r.text, 'xml')  
elements=web_content.find("div", {'class': "D(ib) Mend(20px)"})


print(r.text)

#

essentially what i'm trying to do is

#

parse the current stock price(VAXX) from this url

#

I picked the correct div(correct me if i'm wrong) but for some reason it keeps showing 4.37 despite the current price being something else

#

so I changed it to print(r.text) and did some digging, apparently the parsed value is hard set to 4.37

#

i'm thinking of scraping another website instead, looks like yfinance really doesnt like to be scraped

main gorge May 20, 2022, 10:41 AM

#

Does anyone know how to open security ports from the Airflow web server to GI EMR, camel EC2 and S3 buckets using Python?

odd meteor May 20, 2022, 10:43 AM

#

coral nimbus ```py import bs4 import requests from bs4 import BeautifulSoup #may need to cha...

First, confirm that what you're trying to scrape from Yahoo Finance isn't disallowed. You can do that by checking the website's robots.txt file.

https://finance.yahoo.com/robots.txt

If what you're trying to scrape isn't prohibited, and you're sure you are selecting the right tag and attributes where the information you want to scrape is, then perhaps try another parser. You could try html.parser or lxml

I'm not on pc at the moment, so unfortunately I can't inspect the HTML of the url you just sent at this time.

However, try this

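import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch"
page = requests.get(url)
# BeautifulSoup needs the response body (page.text), not the Response object
soup = BeautifulSoup(page.text, 'lxml')
# find_all returns a list of matches, so the loop below also works when nothing matches
prices = soup.find_all('You most likely need to tweak this area to include the parent tag before the div tag ')

result = [price.text for price in prices]
result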

coral nimbus May 20, 2022, 10:44 AM

#

wait what do you mean by including the parent tag before the div tag?

odd meteor May 20, 2022, 10:45 AM

#

coral nimbus wait what do you mean by including the parent tag before the div tag?

Try that, yes. Use CSS selector to link them. Then you might wanna use the id attribute instead of the class attribute (that's if the div tag has and id attribute)

#

Unfortunately, I can’t do much since I'm on my mobile phone. But it should work if you're picking the right tag and attribute where the price data you're trying to scrape is.

coral nimbus May 20, 2022, 10:47 AM

#

thanks a lot

#

all hail @odd meteor

#

<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" data-test="qsp-price" data-field="regularMarketPrice" data-trend="none" data-pricehint="4" value="4.09" active="">4.0900</fin-streamer>

#

it looks something like this

#

no id tag, nothing

odd meteor May 20, 2022, 10:57 AM

#

coral nimbus <fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" data-test...

With CSS selector, you can easily select more than one attribute of a tag if you can't easily tell the unique attribute.

Something like this

soup.find('div', class_='write the class here', attrs={'data-test': 'qsp-price', 'value': '4.09'})

coral nimbus May 20, 2022, 11:03 AM

#

ohhhhhhhhh

odd meteor May 20, 2022, 11:07 AM

#

coral nimbus <fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" data-test...

Also, from the above HTML, it doesn't have a div tag. It has fin-streamer tag. So ensure you're calling the appropriate tag as well
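A minimal sketch of that lookup, run against the `fin-streamer` snippet pasted above instead of the live page. It matches on the `data-*` attributes instead of the price, since the price is the part that changes. Whether the HTML that `requests` gets back contains an up-to-date value is an assumption this sketch does not check.

```python
from bs4 import BeautifulSoup

# The <fin-streamer> element quoted earlier in the thread; on a live page
# this string would be requests.get(url).text instead.
html = (
    '<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" '
    'data-test="qsp-price" data-field="regularMarketPrice" data-trend="none" '
    'data-pricehint="4" value="4.09" active="">4.0900</fin-streamer>'
)

soup = BeautifulSoup(html, "html.parser")

# Tag name plus stable attributes; hyphenated attribute names go in attrs={...}
tag = soup.find(
    "fin-streamer",
    attrs={"data-symbol": "VAXX", "data-field": "regularMarketPrice"},
)

# The same element through a CSS selector
same_tag = soup.select_one(
    'fin-streamer[data-symbol="VAXX"][data-field="regularMarketPrice"]'
)

price_text = tag.text       # text inside the element
price_attr = tag["value"]   # the value attribute
print(price_text, price_attr)
```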

mint palm May 20, 2022, 11:33 AM

#

how to track very slight motions ??
this above image shows how a point was fixed then vector was obtained by comparing frames

#

i have the frames

#

but how do i apply this vector approach

#

has anyone done something similar before

coral nimbus May 20, 2022, 12:22 PM

#

cool

#

is this some face recognition stuff

past lion May 20, 2022, 12:29 PM

#

I'm currently doing a project for uni where I've chosen to analyze the accuracy of simulated stock market predictions using a range of models. I'm wondering if ARIMA would even be the right option to go down? So far I've simulated with GBM and was looking into ARIMA & GARCH as well. Anyone have an opinion on what would be most suitable?

rose agate May 20, 2022, 12:44 PM

#

past lion Im currently doing a project for uni that ive chosen to analyze the accuracy of...

From what I've heard, the stock market is notoriously hard to predict, I'm not sure if there's any actual pattern for any time series methods to learn in general, I'd expect most of them to give similar, mediocre results. ARIMA's probably mediocre as well, you could try something with machine learning like a transformer or LSTM, but again, probably mediocre

coral nimbus May 20, 2022, 2:13 PM

#

import requests
from bs4 import BeautifulSoup
!pip install google
try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")
 
# to search
query = "investing.com aapl"
 
for j in search(query, tld="co.in", num=1, stop=1, pause=1):
    print(j)
url=j
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.find('div', {'class': 'instrument-price_instrument-price__3uw25 flex items-end flex-wrap font-bold'}).find_all('span')[0].text
                                         
print(prices)

#

@odd meteor I was able to solve the problem by google searching using python then doing it the traditional way

#

as I intend the query part to be a concat of investing.com + 'TICKER', with the TICKER being parsed from another excel sheet I programmed for optimal stock picking

#

now I need to make a for loop to execute this code per minute and to make an array to store this information, any ideas?
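One possible shape for that loop, as a rough sketch: `poll_prices` and `fetch_price` are hypothetical names, and `fetch_price` stands in for the search and scraping lines above. Each sample is stored as a (timestamp, price) pair in a plain list.

```python
import time
from datetime import datetime

def poll_prices(fetch_price, n_samples, interval_seconds=60.0):
    """Call fetch_price() n_samples times, sleeping interval_seconds in between."""
    samples = []
    for i in range(n_samples):
        samples.append((datetime.now(), fetch_price()))
        if i < n_samples - 1:          # no need to wait after the last sample
            time.sleep(interval_seconds)
    return samples

# Stand-in so the sketch runs offline; a real fetch_price would wrap the
# requests + BeautifulSoup code and return the scraped price.
def fake_fetch_price():
    return "4.09"

history = poll_prices(fake_fetch_price, n_samples=3, interval_seconds=0)
print(len(history))
```

For a once-a-minute run, `interval_seconds=60` would do; the list can later be turned into a DataFrame if needed.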

hollow flare May 20, 2022, 2:17 PM

#

Hi

#

One Question for all 😅

"Skills" or "Degree"

coral nimbus May 20, 2022, 2:18 PM

#

skills

#

hard skills

wooden sail May 20, 2022, 2:36 PM

#

depends what you wanna do/where you apply. some jobs will also require you have some cardboard

thorn halo May 20, 2022, 2:55 PM

#

There’s a game called 8 ball pool(a billiards game developed by miniclip you all must be knowing about it)
I am developing a program which can predict the trajectories and also predict where the balls will go after a collision between billiard balls. Many apps like this exist and they charge a heavy price for it. They simulate the collision the same way the game does; I am not sure how they got the parameters (mostly hit and trial) but it's perfect.

My question is that which physics engine should I use for simulating collision of balls ?
I have fair knowledge in c++ and python.
The link below shows what I am trying to do :

https://stackoverflow.com/questions/69155760/how-to-get-direction-after-cueball-collide-with-any-ball-in-unity3d-like-in-8-b?noredirect=1&lq=1

Stack Overflow

How to get Direction after cueBall collide with any ball in unity3d...

Hi!
i am making something like 8 ball pool in 3d with unity 3d C#.
A is que ball and I know Dir1. I want to calculate Dir2.I am using Raycast to i can get point of contact.

coral nimbus May 20, 2022, 3:09 PM

#

thorn halo There’s a game called 8 ball pool(a billiards game developed by miniclip you all...

no wonder i see people pulling off trickshots that easily

shut phoenix May 20, 2022, 3:40 PM

#

Difference between classifying and regression?

thin palm May 20, 2022, 3:41 PM

#

shut phoenix Difference between classifying and regression?

classification predicts a discrete label, e.g. 0 or 1, think you have a disease or you don't... regression estimates a continuous value, think prices of real estate, cars, etc

#

Anybody have an idea on how to make a correlation matrix with mixed dtypes such as objects and ints?

shut phoenix May 20, 2022, 3:41 PM

#

thin palm classification is 0-1 think you have a disease or you dont... regression is esti...

Ah alr

thorn halo May 20, 2022, 3:46 PM

#

coral nimbus no wonder i see people pulling off trickshots that easily

Yea its called cheto on ios and aim assist on android

vivid jasper May 20, 2022, 4:33 PM

#

Howdy -- trying to bulk change data types in pandas for every column after the 15th to float from object -- is there an easy or preferred way to do this? panda newbie
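One way to do this, as a minimal sketch on a made-up frame (the column names are placeholders): slice `df.columns` to get everything after the 15th column and cast those columns in one `astype` call.

```python
import pandas as pd

# Toy frame: 15 leading columns to leave alone, then 3 object columns
# holding numbers stored as strings.
data = {f"keep_{i}": ["a", "b"] for i in range(15)}
data.update({f"num_{i}": ["1.5", "2"] for i in range(3)})
df = pd.DataFrame(data)

cols = df.columns[15:]                          # every column after the 15th
df = df.astype({col: float for col in cols})    # raises if a value is not numeric

# If some cells may hold non-numeric text, coercing them to NaN is an option:
# df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes.tail(3))
```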

barren wedge May 20, 2022, 4:34 PM

#

Hey guys... I'm quite new to AI. Like, I know some theory, but I haven't done anything practical. All the projects related to AI seem boring, like I don't want to make AI to predict salaries and stuff. Anyone got any interesting ideas?

mint palm May 20, 2022, 4:41 PM

#

barren wedge Hey guys... I'm quite new to AI. Like, I know some theory, but I haven't done an...

delve into Computer Vision

barren wedge May 20, 2022, 5:04 PM

#

Good Idea... Thanks 👍

vagrant pilot May 20, 2022, 5:08 PM

#

i wanna learn neural networks and more ai stuff how can i start?

robust jungle May 20, 2022, 5:28 PM

#

vagrant pilot i wanna learn neural networks and more ai stuff how can i start?

how comfortable are you with python

steady basalt May 20, 2022, 5:48 PM

#

robust jungle how comfortable are you with python

😆

worthy phoenix May 20, 2022, 5:57 PM

#

is maths necessary to use tensorflow or any other machine learning frameworks in that sense?

serene scaffold May 20, 2022, 5:59 PM

#

worthy phoenix is maths necessary to use tensorflow or any other machine learning frameworks in...

they do the math part for you, but you need to understand how neural networks work in order to do anything, so yes.

karmic flicker May 20, 2022, 5:59 PM

#

Have you guys ever seen two different matplotlib plots overlap eachother even though they are on different plots

worthy phoenix May 20, 2022, 6:00 PM

#

serene scaffold they do the math part for you, but you need to understand how neural networks wo...

i mean i kinda know how neural network works, (kinda), so the maths part like linear regression and others are already done in there for u?

serene scaffold May 20, 2022, 6:03 PM

#

worthy phoenix i mean i kinda know how neural network works, (kinda), so the maths part like li...

in the same way that Python will do 2 + 2 for you, yes. but that doesn't eliminate the need to understand algebra. it just means that the literal calculation is done for you. if you understand linear regression, for example, well enough that you could do it by hand if you had to (or learn how to with minimal time investment), then you'll be fine.

worthy phoenix May 20, 2022, 6:04 PM

#

seems ok to me , thanks

odd meteor May 20, 2022, 6:21 PM

#

worthy phoenix seems ok to me , thanks

Welcome to the gang 💪🏿💪🏿. Now, you can easily convince others that ML isn't so hard as some people perceive it to be.

Thanks to Mr. Sterlerlock 🙌🙌

spring marsh May 20, 2022, 7:09 PM

#

Hey everyone so can someone tell me from where can I learn opencv for machine learning ? I already know machine learning looking for a small course to get me started with computer vision.

lapis sequoia May 20, 2022, 7:14 PM

#

spring marsh Hey everyone so can someone tell me from where can I learn opencv for machine le...

how about this one ? https://www.youtube.com/watch?v=qCR2Weh64h4&t=2s

spring marsh May 20, 2022, 7:20 PM

#

lapis sequoia how about this one ? https://www.youtube.com/watch?v=qCR2Weh64h4&t=2s

no bro I don't want a tutorial for just opencv

tacit basin May 20, 2022, 7:42 PM

#

spring marsh no bro I don't want a tutorial for just opencv

Pyimage search is a great resource. https://pyimagesearch.com/

PyImageSearch

Chris Hufnagel

PyImageSearch - You can master Computer Vision, Deep Learning, and ...

Helping developers, students, and researchers master Computer Vision, Deep Learning, and OpenCV.

limber kelp May 20, 2022, 8:47 PM

#

Guys, is it possible to download the 'wrangled' excel file after it has been wrangled using python?

I want to use it in Tableau as Tableau doesn't have good wrangling options.

I want to wrangle it in python and then use it in tableau for visualization.

tacit basin May 20, 2022, 9:17 PM

#

limber kelp Guys, is it possible to download the 'wrangled' excel file after it has been wra...

Maybe this would work for you
Read excel into pandas, save to excel, open in Tableau
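A rough sketch of that round trip. The frame, column names and file name are made up, and an inline DataFrame stands in for `pd.read_excel(...)` so the example runs without an input file. It saves to CSV, which Tableau can also open; `to_excel` is used the same way but needs an Excel engine such as openpyxl installed.

```python
import os
import tempfile
import pandas as pd

# Stand-in for pd.read_excel("raw.xlsx"); file name and columns are hypothetical.
raw = pd.DataFrame({"name": [" Ann ", "Bob", None], "sales": ["10", "20", "30"]})

# Example wrangling: drop incomplete rows, tidy the text, fix the dtype.
clean = raw.dropna(subset=["name"]).copy()
clean["name"] = clean["name"].str.strip()
clean["sales"] = clean["sales"].astype(int)

# Write the wrangled result to a file that Tableau can load.
# clean.to_excel(path, index=False) is the Excel equivalent.
out_path = os.path.join(tempfile.mkdtemp(), "wrangled.csv")
clean.to_csv(out_path, index=False)
print(out_path)
```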

brave sand May 20, 2022, 10:40 PM

#

Do you guys train your networks on a home gpu?

plush glacier May 20, 2022, 10:42 PM

#

brave sand Do you guys train your networks on a home gpu?

i currently do train them on my own pc but I've used google colab before

brave sand May 20, 2022, 10:46 PM

#

plush glacier i currently do train them on my own pc but I've used google colab before

what gpu do you have?

plush glacier May 20, 2022, 10:46 PM

#

i have a 1660super

#

so cloud gpu is a lot faster but i would have to upload my data there and i don't have the best upload speed (it isn't slow at all but for things like 30gb it will take 2 hours to upload the data)

#

@brave sand i just recommend for smaller things to try it out on your own device if possible (and it doesn't take way too long for you) and later maybe go to cloud if you think you will need it

brave sand May 20, 2022, 11:05 PM

#

plush glacier @brave sand i just recommend for smaller things to try it out on your ...

yeah, I have a 3070, but the vram is limiting

#

internship is about reinforcement learning

mild dirge May 20, 2022, 11:12 PM

#

What kinda network are you running that your vram is limiting on a 3070? @brave sand

brave sand May 20, 2022, 11:15 PM

#

mild dirge What kinda network are you running that your vram is limiting on a 3070? @brave sand

oh, I haven’t run any models that vram is a limit but I’m just wondering if my 3070 will ever limit me in the future

mild dirge May 20, 2022, 11:16 PM

#

Depends on the task (and also the network)

brave sand May 20, 2022, 11:23 PM

#

mild dirge Depends on the task (and also the network)

is Multi Reinforcement Learning demanding on vram?

iron basalt May 20, 2022, 11:53 PM

#

brave sand is Multi Reinforcement Learning demanding on vram?

Very.

#

I'm assuming deep learning is used*

#

(backprop)

brave sand May 20, 2022, 11:56 PM

#

iron basalt Very.

I’ll have to see. I’m running my model on Monday

iron basalt May 21, 2022, 12:00 AM

#

brave sand I’ll have to see. I’m running my model on Monday

Yeah, it depends on the task, if it's remotely complicated it gets expensive and memory heavy fast.

#

Depending on what you are doing you may have one agent per GPU...

#

Multi-agent methods get amazing results, but their main downside is that you have to run multiple agents.

#

So either you have really simple agents, or a lot of GPUs.

brave sand May 21, 2022, 12:14 AM

#

iron basalt Depending on what you are doing you may have one agent per GPU...

since the GPU crash, is it worth investing in cheaper gpus?

iron basalt May 21, 2022, 12:19 AM

#

brave sand since the GPU crash, is it worth investing in cheaper gpus?

No idea. Future is pretty uncertain right now regarding chip access.

#

If you are willing to not rely on pytorch, etc, and can make your own system like it, or use non-deep learning methods, then you can buy the much cheaper AMD GPUs.

#

But the tradeoff is a lot of work and learning.

brave sand May 21, 2022, 12:21 AM

#

AMD Gpus? I thought they were a pain to deal with for ml

iron basalt May 21, 2022, 12:22 AM

#

They are because the big libraries for deep learning don't support them.

#

No CUDA.

#

And also because AMD used to not really care about ML.

#

Nvidia has more or less a monopoly on deep learning at this time.

#

(Unless you are willing to take the hard route that most can't be bothered with (someone has to make the pytorch support for AMD GPUs))

iron basalt May 21, 2022, 12:31 AM

#

mint palm has anyone done something similar before

https://en.wikipedia.org/wiki/Optical_flow

Optical flow

Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. Optical flow can also be defined as the distribution of apparent velocities of movement of brightness pattern in an image. The concept of optical flow was introduced by the ...

#

https://www.youtube.com/watch?v=gZS1AiQhqtg

YouTube

William Forfang

Optic Flow Algorithm

Demonstrating the optic flow algorithm via "Polynomial Expansion".
A. Original footage
B. Motion represented by color
C. Motion represented by arrows

Original footage owned/uploaded by the slow-mo guys. See their channel here:

https://www.youtube.com/user/theslowmoguys

▶ Play video

#

Roboticists love this.

thin palm May 21, 2022, 12:41 AM

#

Is this still considered a normal distribution?

Screen_Shot_2022-05-20_at_6.41.04_PM.png

#

Reason I ask is because I'm getting ready to OHE a categorical column and I need to define 3 different types, 1 for entry level, 2 for mid level, and 3 for senior. I was hoping to use my distribution graph to get an idea of how many n types we'd need but i'm not sure.

tidal bough May 21, 2022, 1:32 AM

#

Really doesn't look like one to me. I'm not sure what's up with the oscillation, but the distribution is otherwise pretty much uniform (which is IMO a very unexpected result for "years of experience" in basically any dataset. Maybe it was intentionally normalized that way?..)

thin palm May 21, 2022, 1:39 AM

#

tidal bough Really doesn't look like one to me. I'm not sure what's up with the oscillation,...

Yeah could've been normalized for the sake of this assignment, so in other words safe to say it is? I checked each values of the z score and they all fall between -3 and 3

#

So if I want to create a One Hot Encoder based off years of experience tiers (Entry-level, Intermediate, Mid-level, Senior or executive-level), 4 total, how would I do this? Right now my column is categorical

desert oar May 21, 2022, 1:53 AM

#

it sounds like you don't need it to be gaussian

#

you are talking about binning your data

#

there are several ways to do this, all of them essentially arbitrary
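For example, a minimal sketch with `pd.cut`. The bin edges and tier labels here are an arbitrary, hypothetical choice, which is exactly the point about binning being arbitrary.

```python
import pandas as pd

years = pd.Series([0, 2, 5, 9, 14, 24], name="yearsExperience")

# Hypothetical edges: [0, 3], (3, 8], (8, 15], (15, 100]
tiers = pd.cut(
    years,
    bins=[0, 3, 8, 15, 100],
    labels=["entry", "intermediate", "mid", "senior"],
    include_lowest=True,   # so that 0 years lands in the first bin
)

# One-hot encode the binned column: one indicator column per tier
one_hot = pd.get_dummies(tiers, prefix="tier")
print(one_hot.columns.tolist())
```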

#

do you need to use these 4 categories for the assignment?

#

that density plot is really weird

thin palm May 21, 2022, 1:54 AM

#

desert oar do you _need_ to use these 4 categories for the assignment?

so no I dont, but here's the issue. there's 100,000 lines and this specific column 'companyId' represents 63 unique values. If I were to OHE this it's gonna add 63 additional columns

desert oar May 21, 2022, 1:54 AM

#

this looks like it's more or less uniformly distributed based on that

thin palm May 21, 2022, 1:55 AM

#

so my other thought was it's an important column I don't want to drop it

desert oar May 21, 2022, 1:55 AM

#

thin palm so no I dont, but here's the issue. there's 100,000 lines and this specific colu...

company id doesn't necessarily make sense as a model input anyway - your model won't be able to make predictions for companies outside the training set

#

you might however want to use attributes of the company as input features

#

total number of employees, etc.

#

i don't see any reason to discretize a feature like "years of experience" unless you need to or can think of a specific good reason to

thin palm May 21, 2022, 1:56 AM

#

desert oar company id doesn't necessarily make sense as a model input anyway - your model w...

so let me show you the column names first:
jobId companyId jobType degree major industry yearsExperience milesFromMetropolis salary

it's limited here

#

well the reason why is because the companyId probably represents the company name, and obviously Apple would pay more than other companies.

thin palm May 21, 2022, 1:57 AM

#

desert oar i don't see any reason to discretize a feature like "years of experience" unless...

so that's why I wanted to discretize companyId based off of yearsExperience to split them into some tiers

#

But I'm having a hard time one hot encoding this based off a condition, not sure what to do?

desert oar May 21, 2022, 1:57 AM

#

thin palm so that's why I wanted to discretize companyId based off of yearsExperience to s...

i dont know what you mean by that

thin palm May 21, 2022, 1:57 AM

#

There's 63 unique values

desert oar May 21, 2022, 1:57 AM

#

i think you are overthinking whatever this is that you're doing

thin palm May 21, 2022, 1:58 AM

#

desert oar i dont know what you mean by that

true but am I making sense first of all haha

desert oar May 21, 2022, 1:58 AM

#

no, sorry

#

you have 63 unique companies in the data, what does that have to do with binning years of experience?

thin palm May 21, 2022, 1:58 AM

#

desert oar no, sorry

Okay so I have 63 unique values for a categorical feature

desert oar May 21, 2022, 1:59 AM

#

i wouldnt consider company id a feature though

thin palm May 21, 2022, 2:00 AM

#

desert oar i wouldnt consider company id a feature though

so basically you'd drop this

thin palm May 21, 2022, 2:01 AM

#

desert oar i wouldnt consider company id a feature though

Only reason why I considered it because i know the companyId is an alias for the company name (assumption)

desert oar May 21, 2022, 2:01 AM

#

i would use it to incorporate attributes about the company into your model

#

but if you use company id in the model you can't make predictions about other companies

thin palm May 21, 2022, 2:01 AM

#

desert oar i would use it to incorporate attributes about the company into your model

the goal is to make a prediction on salaries sorry I didn't mention that

#

the goal is to predict salaries

#

using the companyID can help identify which companies pay a bit more for salary... I just dont know how to OHE since it has so many features

desert oar May 21, 2022, 2:27 AM

#

thin palm using the companyID can help identify which companies pay a bit more for salary....

its totally fine to one-hot encode 63 categories. however i still question if that's what you actually want to do

#

what if you need to predict salary for an employee at a company that isn't in the training set?
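A small sketch of what that looks like with `pd.get_dummies`, on made-up company ids. Each category adds one column, so 63 companies means 63 columns. Encoding new data against the training columns gives a company that was never seen an all-zero row, so the model has no company information for it.

```python
import pandas as pd

train = pd.DataFrame({"companyId": ["COMP1", "COMP2", "COMP3"]})
new = pd.DataFrame({"companyId": ["COMP2", "COMP99"]})   # COMP99 is not in train

train_ohe = pd.get_dummies(train["companyId"], prefix="company").astype(int)

# Align the new data to the training columns: unknown companies are dropped
# as columns and show up as rows of zeros.
new_ohe = (
    pd.get_dummies(new["companyId"], prefix="company")
    .reindex(columns=train_ohe.columns, fill_value=0)
    .astype(int)
)
print(new_ohe)
```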

fallen inlet May 21, 2022, 2:47 AM

#

Is there any good resource for python data science?

thin palm May 21, 2022, 2:59 AM

#

desert oar its totally fine to one-hot encode 63 categories. however i still question if th...

very good question.. no idea. Whats your take?

desert oar May 21, 2022, 3:08 AM

#

thin palm very good question.. no idea. Whats your take?

decide what you want/need first

thin palm May 21, 2022, 3:09 AM

#

desert oar decide what you want/need first

But this is where I’m struggling, I’m torn between keeping it or not

thin palm May 21, 2022, 5:48 AM

#

Do I need to transform features into normal distributions?

thin palm May 21, 2022, 6:37 AM

#

I don't think so actually, just the machine learning models that require scaling.

steady basalt May 21, 2022, 6:58 AM

#

I don’t think they’ll care

#

Just encode company

pearl radish May 21, 2022, 7:01 AM

#

Hey everyone, I have a "tf.contrib" error when testing the object detection API.
So I ran the upgrade scripts on my project directory and there were no issues or errors.

I then tested the object detection API installation and I still got the error about "tf.contrib".
I then ran the upgrade scripts on the main directory where the error seems to come from and it was successful. But when I tested the installation, I still got the same "tf.contrib" error.
Is there anything else I can do?

#

wooden sail May 21, 2022, 7:13 AM

#

it seems the issue is the tensorflow version. you'd need a version of tensorflow starting with 1.x.x

#

you can read about it here https://www.tensorflow.org/guide/migrate/upgrade

#

try installing an older TF version in an environment

mint palm May 21, 2022, 7:24 AM

#

iron basalt https://en.wikipedia.org/wiki/Optical_flow

thank you very much, i didn't know what to search for

dim crypt May 21, 2022, 8:29 AM

#

fallen inlet Is there any good resource for python data science?

it looks like most of the pins in this channel are more geared towards ML/AI, if it's other DS subjects that you're after you might find something at https://www.pythondiscord.com/resources/?topics=data-science even if I unfortunately think it was a little bit thin for that as well

pearl radish May 21, 2022, 8:33 AM

#

@wooden sail please the upgrade scripts kinda confuses me.
To be clear, should I uninstall the TF v2.8.0 and install TFv 1.15 before running the upgrade scripts?
And on what directory should I run the upgrade scripts on?

wooden sail May 21, 2022, 8:57 AM

#

pearl radish @wooden sail please the upgrade scripts kinda confuses me. To be clear...

honestly i would just make a virtual environment with TF v1.x.x instead

tacit basin May 21, 2022, 9:20 AM

#

Nice resource for image classification model selection from Timm models https://www.kaggle.com/code/jhoward/which-image-models-are-best
Compares speed and accuracy for a lot of models available from Timm library

Which image models are best?

Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources

pearl radish May 21, 2022, 9:37 AM

#

@wooden sail thanks

loud cove May 21, 2022, 9:45 AM

#

mild dirge Nothing about choosing your model/training/parameter-search etc. should include ...

I'm curious since you seem to do this for living, what do you do after training a model? how do you use it? I know you do fit predict and all that, but what is the typical workflow for saving it?
I read there is pickling (and another source were talking about a better way than pickling), I'm just curious how it happens in a typical professional environment.

mild dirge May 21, 2022, 9:48 AM

#

loud cove I'm curious since you seem to do this for living, what do you do after training ...

Wouldn't really know. I'm just doing a masters in AI atm, haven't done anything professionally...

mild dirge May 21, 2022, 9:50 AM

#

loud cove I'm curious since you seem to do this for living, what do you do after training ...

It's called machine learning model deployment if you wanted to look it up btw

wooden sail May 21, 2022, 9:51 AM

#

normally one stores the model and its parameters in a way that it can be used easily, e.g. by containerizing it or simply keeping it around for yourself. the idea is that the network is simply an architecture, and the parameters you trained are specific to the data you used for training. if the training went well, you can now use the parameters only for inference without having to worry too much about the inference being correct

#

the network itself makes a prediction. the training step tuned the params so that the predictions become accurate. so once it's trained, save the parameters and trust the inference. as an example, once the params are stored, you can trust your network to detect faces in images

#

but yeah, look into deployment as pccamel suggests. containerization is just one approach

mild dirge May 21, 2022, 9:54 AM

#

The only time we had to save a model, we indeed just saved the parameters and loaded it into the program that controlled a robot, and then the model was just called every iteration to check if the camera had spotted any objects

#

Most of the times we just make a model and use some test data to see how well it would work for new data

wooden sail May 21, 2022, 9:55 AM

#

that sounds about right. since the params are a thing of their own independent of implementation, you can just as easily, say, do the training with tensorflow, then make a standalone implementation of a forward pass of the network in c++ that reads the resulting parameters and infers blazingly fast
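A toy illustration of that split, in NumPy instead of TensorFlow and C++. The random weights stand in for trained parameters, and the layer names and sizes are made up. The "training side" only writes the parameters to disk; the "deployment side" loads them and runs its own forward pass.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a tiny one-hidden-layer network.
params = {
    "W1": rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 2)), "b2": np.zeros(2),
}

# Training side: persist only the parameters.
path = os.path.join(tempfile.mkdtemp(), "params.npz")
np.savez(path, **params)

# Deployment side: a standalone forward pass that only needs the saved arrays.
def forward(x, p):
    hidden = np.maximum(x @ p["W1"] + p["b1"], 0.0)   # ReLU activation
    return hidden @ p["W2"] + p["b2"]

loaded = np.load(path)
x = rng.normal(size=(3, 4))      # a batch of 3 inputs with 4 features each
out = forward(x, loaded)
print(out.shape)
```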

#

so you really have a lot of flexibility in deployment

loud cove May 21, 2022, 9:56 AM

#

mild dirge Wouldn't really know. I'm just doing a masters in AI atm, haven't done anything ...

interesting, what's your undergrad?

loud cove May 21, 2022, 9:57 AM

#

wooden sail but yeah, look into deployment as pccamel suggests. containerization is just one...

thanks, I'm just curious, I have no plan on doing any ML anyways.

mild dirge May 21, 2022, 9:57 AM

#

loud cove interesting, what's your undergrad?

AI as well haha

loud cove May 21, 2022, 9:57 AM

#

the heck is AI lol

mild dirge May 21, 2022, 9:57 AM

#

is undergrad the same as bachelor?

loud cove May 21, 2022, 9:57 AM

#

yeah

mild dirge May 21, 2022, 9:58 AM

#

yeah AI then

#

You know what AI is right?

loud cove May 21, 2022, 9:58 AM

#

what's that?

mild dirge May 21, 2022, 9:58 AM

#

artificial intelligence

loud cove May 21, 2022, 9:58 AM

#

i know ai but i don't know bsc in ai

#

it is probably just a new trendy thing unis do now

mild dirge May 21, 2022, 9:58 AM

#

Well it teaches stuff about AI

loud cove May 21, 2022, 9:58 AM

#

is it mostly stats? is it mostly cs?

mild dirge May 21, 2022, 9:58 AM

#

yes

#

that

wooden sail May 21, 2022, 9:59 AM

#

AI is a buzzword anyways. you can boil it down to a few math competencies and optimization for targeted applications

loud cove May 21, 2022, 9:59 AM

#

yeah is it closer to stats or CS though?

mild dirge May 21, 2022, 9:59 AM

#

and psychology, biology a bit

#

jack of all trades

#

Lots of elective courses, I chose mostly cs courses

loud cove May 21, 2022, 9:59 AM

#

wooden sail AI is a buzzword anyways. you can boil it down to a few math competencies and op...

amen

wooden sail May 21, 2022, 9:59 AM

#

AI falls both under stats and CS separately. also overlaps with so-called "signal processing"

loud cove May 21, 2022, 9:59 AM

#

I know but im asking about his degree

#

what faculty provides it?

mild dirge May 21, 2022, 10:00 AM

#

We could choose what we were interested in, I chose mostly cs courses, and robotics

#

science and engineering

loud cove May 21, 2022, 10:00 AM

#

most unis are just capitalizing on the buzzword

loud cove May 21, 2022, 10:00 AM

#

mild dirge science and engineering

ah

mild dirge May 21, 2022, 10:00 AM

#

It has existed for 20-ish years at my uni, I believe

loud cove May 21, 2022, 10:00 AM

#

yeah those existed for a while, but it gained a lot of traction in the past few years.

mild dirge May 21, 2022, 10:01 AM

#

But yeah AI just means its a bit more broad than just machine learning

loud cove May 21, 2022, 10:01 AM

#

yea im familiar

#

but it is mostly just buzzwords

#

once you have an optimization model or program then that isn't AI, that's just math and algorithms.

mild dirge May 21, 2022, 10:02 AM

#

hmm

loud cove May 21, 2022, 10:02 AM

#

but AI is a cooler way to say it.

mild dirge May 21, 2022, 10:02 AM

#

AI is just a name for a field of research with many different sub-fields

loud cove May 21, 2022, 10:02 AM

#

yea

mild dirge May 21, 2022, 10:02 AM

#

It's become a buzz-word lately

#

But it has already existed for a long time

loud cove May 21, 2022, 10:03 AM

#

yea, I know someone from here in Egypt working in AI since the 90s

#

so I'd imagine it was more common in the west.

compact rose May 21, 2022, 10:10 AM

#

Hello guys, hope you guys are having a nice day! I need help with pyspark and I don't know how to solve this. I want to merge duplicated rows but keep the values that were in them. In the first image, you can see the dataframe I'm working on. I'm using this block of code, but it isn't working

loud cove May 21, 2022, 10:15 AM

#

compact rose Hello guys, hope you guys are having a nice day! I need help in pyspark and i'm...

why are you using alias?

#

wouldn't the max return one result only anyways?

#

oh nvm im dumb

loud cove May 21, 2022, 10:24 AM

#

compact rose Hello guys, hope you guys are having a nice day! I need help in pyspark and i'm...

I never used spark, but have you tried adding \ after each comma?

compact rose May 21, 2022, 10:26 AM

#

I didn't, i am gonna try, but i think it won't work! w8 a sec

#

didn't work :/

loud cove May 21, 2022, 10:28 AM

#

https://sparkbyexamples.com/pyspark/pyspark-groupby-explained-with-example/
looking similar to this to me

#

have you tried doing max for all the columns and then renaming them after? might be easier and cleaner.

compact rose May 21, 2022, 10:32 AM

#

yeah, i was basing in that!

#

Just tried and nothing xD

loud cove May 21, 2022, 10:36 AM

#

compact rose Just tried and nothing xD

you're getting the max for everything

#

i think you can max everything instead of writing all this and just rename after

#

try agg(max(*)) and see what it gets you.

compact rose May 21, 2022, 10:42 AM

#

I just made it work. I just swapped max for the sql functions version (f.max) and it worked!

#

thanks anyway !

civic stone May 21, 2022, 11:17 AM

#

i want to use D3.js v4 Force Directed Graph with Labels like the below link
https://bl.ocks.org/heybignick/3faf257bbbbc7743bb72310d03b86ee8

Has anybody used this code? It's not working for me

D3.js v4 Force Directed Graph with Labels

heybignick’s Block 3faf257bbbbc7743bb72310d03b86ee8

celest vine May 21, 2022, 11:44 AM

#

Hi

#

Does anyone know how to speed up pandas iteration?

#

I have a column which contains image URLs and I am downloading all the images using requests.
Right now it downloads around 12000 to 13000 images in an hour.
Is there a way to speed this up?

#

I have 50 Mbps. It is good enough, right?

tidal bough May 21, 2022, 11:57 AM

#

celest vine I have a column which contains image URLs and I am downloading all the images us...

and I am downloading all the images using requests
Using one thread, or multiple via threading?

loud cove May 21, 2022, 12:13 PM

#

celest vine I have 50 Mbps. It is good enough, right?

what's the source? are you sure you aren't getting limited?

#

and no, it depends on the size of each image and the total number of them.
many small files will take longer than fewer large files.

celest vine May 21, 2022, 12:16 PM

#

tidal bough > and I am downloading all the images using requests Using one thread, or multip...

One thread I guess. I have no idea about threading

loud cove May 21, 2022, 12:17 PM

#

celest vine One thread I guess. I have no idea about threading

https://adrien.barbaresi.eu/blog/how-to-download-parallel-politeness-rules-python.html

Bits of Language: corpus linguistics, NLP and text analytics

How to download web pages in parallel and follow politeness rules i...

Optimizing downloads is crucial to gather data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.

tidal bough May 21, 2022, 12:18 PM

#

celest vine One thread I guess. I have no idea about threading

Waiting on multiple requests at the same time should boost the speed massively, then. This can be done either by threading requests, or, somewhat more nicely, by using an asynchronous requests library like aiohttp
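A minimal sketch of the threaded option, assuming the URLs are already in a plain list. The helper names `fetch_with_requests` and `download_all` are made up for illustration, and the worker count of 16 is an arbitrary starting point rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_requests(url, timeout=10):
    """Download one URL and return its bytes (imports requests lazily)."""
    import requests  # third-party; assumed to be installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def download_all(urls, fetch=fetch_with_requests, max_workers=16):
    """Fetch many URLs concurrently; returns ({url: bytes}, {url: exception})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

With a dataframe, something like `download_all(df["image_url"].dropna().unique())` would be the call, where `image_url` is a hypothetical column name. Threads help here because each request spends most of its time waiting on the network, not on the CPU.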

celest vine May 21, 2022, 12:18 PM

#

loud cove and no it depends on the size of each image and the total. of them. many smaller...

Image size is pretty small like 2 KB. But there are over 100k images.

loud cove May 21, 2022, 12:19 PM

#

celest vine Image size is pretty small like 2 KB. But there are over 100k images.

yeah that will just take long

#

all these images will download in less than one second

celest vine May 21, 2022, 12:20 PM

#

loud cove all these images will download in less than one second

Currently, using only requests it's downloading 12k-13k images per hour

loud cove May 21, 2022, 12:21 PM

#

you can do multithreading and it will increase the download rate, but be careful not to get blacklisted
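One way to stay on the polite side is to cap how many requests are in flight and pause briefly between them. The sketch below uses only asyncio; `fetch` is a placeholder for whatever coroutine does the actual download (for example one built on aiohttp), and the limits of 8 concurrent requests and 0.05 s pause are assumptions to tune, not known safe values for any particular site.

```python
import asyncio


async def fetch_politely(urls, fetch, max_concurrent=8, delay=0.05):
    """Run `fetch(url)` for every URL with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            await asyncio.sleep(delay)  # brief pause before freeing the slot
            return url, result

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)
```

It would be driven with `asyncio.run(fetch_politely(urls, my_fetch))`, where `my_fetch` is your own coroutine.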

#

what's the source of these files?

#

and how many files are there? they might have a better option to do this.

celest vine May 21, 2022, 12:22 PM

#

loud cove what's the source of these files?

The image URLs are of Twitter profile pics.

celest vine May 21, 2022, 12:22 PM

#

loud cove and how many files are there? they might have a better option to do this.

Around 150k

loud cove May 21, 2022, 12:23 PM

#

celest vine The image URLs are of Twitter profile pics.

try to see if the api will help

celest vine May 21, 2022, 12:23 PM

#

loud cove try to see if the api will help

Twitter API?

loud cove May 21, 2022, 12:23 PM

#

yea

celest vine May 21, 2022, 12:23 PM

#

loud cove yea

Nope, it's got loads of rate limits.

loud cove May 21, 2022, 12:24 PM

#

yeah your best bet is multi processing, but you'd probably be ip banned for spamming requests

loud cove May 21, 2022, 12:25 PM

#

loud cove https://adrien.barbaresi.eu/blog/how-to-download-parallel-politeness-rules-pytho...

you have both of these https://opensourceoptions.com/blog/use-python-to-download-multiple-files-or-urls-in-parallel/

go wild.

arctic wedgeBOT May 21, 2022, 12:34 PM

#

aiohttp v3.8.1

Async http client/server framework (asyncio)

supple scroll May 21, 2022, 12:36 PM

#

is there any reason why activation functions are just the normal sigmoid/relu/etc, and not something like this?

mild dirge May 21, 2022, 12:39 PM

#

supple scroll is there any reason why activation functions are just the normal sigmoid/relu/et...

Well one reason is that people prefer their gpus below 200 °C

tidal bough May 21, 2022, 12:43 PM

#

this looks like a combination of 4 parts, each of them in the form a (x-b)^k, so 3 parameters per part. Calculating gradients with regards to each parameter is going to be annoying.

#

I'd guess that people studied complex activation functions like that and they weren't any better than the ones normally used while being harder to compute. No sources for that claim, though.

supple scroll May 21, 2022, 12:51 PM

#

what about neurons that take in two inputs and give one output? you could do something weird like have a functioning xor gate with only 3 neurons
something like this:

mild dirge May 21, 2022, 1:02 PM

#

Not sure about that specific example, but a pretty major reason why it is as it is now is because we can easily apply matrix multiplication

#

Which is heavily optimized

serene scaffold May 21, 2022, 1:44 PM

#

mild dirge Well one reason is that people prefer their gpus below 200 °C

that activation function doesn't really look all that complicated, I don't think?

mild dirge May 21, 2022, 1:44 PM

#

serene scaffold that activation function doesn't really look all that complicated, I don't think...

Well compared to something like max(0, x) it would be, but yeah on smaller models maybe it would be okay

#

not really sure how much of the computation time is dependent on the activation function normally

wooden sail May 21, 2022, 1:53 PM

#

that depends on how difficult it is to differentiate it, i.e. do the backwards passes

#

stuff like the derivative being bounded, and being bounded by a small constant at that, provides convergence guarantees and affects how large the learning rate can be while remaining stable
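To make the "bounded derivative" point concrete, here is a small numerical check on two common activations. It only illustrates the bounds (sigmoid's slope never exceeds 0.25, ReLU's slope is 0 or 1); it says nothing about convergence by itself.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


def relu_grad(x):
    return (x > 0).astype(float)  # either 0 or 1


x = np.linspace(-10.0, 10.0, 2001)
print(sigmoid_grad(x).max())  # largest sigmoid slope on the grid
print(relu_grad(x).max())     # largest ReLU slope on the grid
```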

#

there's also a paper discussing the optimality of relu and similar piecewise spline-like functions, but i can't for the life of me remember the title. it must have been in ICASSP 2021 or 2020. but at any rate, any deep enough network should fall in the scope of the universal approximation theorem, so may as well go with well behaved functions

serene scaffold May 21, 2022, 2:06 PM

#

wooden sail there's also a paper discussing the optimality of relu and similar piecewise spl...

please let me know if you remember the paper lemon_hyperpleased

wooden sail May 21, 2022, 2:06 PM

#

i've been looking for like 10 minutes haha, i'll keep trying

#

oof i can't find it, sorry. assume i'm misremembering, or look around yourself... lemon_sweat

rich fiber May 21, 2022, 3:45 PM

#

Hello, I want to pursue AI and started my journey by learning the fundamentals of Python. Now I'm confused about what I should do next, as I tried learning OpenCV but had too much trouble due to the background knowledge required.
Would love to hear your suggestions or journey in AI.

wooden sail May 21, 2022, 4:05 PM

#

i would say that if you seriously intend on working in/with AI in the long term, you need to learn those things. a minimum background in linear algebra, stats, and multivar calc is necessary if you want to understand what you're doing.

#

if you only plan on using APIs and networks other people have designed and possibly trained, you don't need that. but that also limits your options

#

you can alternatively start with classical signal processing/image processing, which is the same stuff i mentioned above but directly looking at its applications

rich fiber May 21, 2022, 4:36 PM

#

wooden sail you can alternatively start with classical signal processing/image processing, w...

Thanks for your help :)

iron basalt May 21, 2022, 6:07 PM

#

supple scroll is there any reason why activation functions are just the normal sigmoid/relu/et...

Many different activation functions are used, depends what you are doing.

steady basalt May 21, 2022, 6:09 PM

#

mild dirge Well one reason is that people prefer their gpus below 200 °C

LOL

steady basalt May 21, 2022, 6:10 PM

#

wooden sail i would say that if you seriously intend on working in/with AI in the long term,...

Im shit at maths btw

#

i just failed a quiz on gaussian substitution for matrices

#

doesnt stop me >:D

wooden sail May 21, 2022, 6:12 PM

#

knowing how to do gaussian elimination by hand is not a big deal. knowing that elementary row and column operations are rank- and solution-preserving, on the other hand, is important
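A tiny numerical illustration of that claim, using a made-up 3x3 system: applying one elementary row operation to both sides leaves the rank and the solution unchanged.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # twice the first row, so A has rank 2
              [1.0, 0.0, 1.0]])
b = np.array([6.0, 12.0, 2.0])

# Elementary row operation on the augmented system [A | b]:
# replace row 1 with (row 1 - 2 * row 0).
A2, b2 = A.copy(), b.copy()
A2[1] -= 2.0 * A2[0]
b2[1] -= 2.0 * b2[0]

rank_before = np.linalg.matrix_rank(A)
rank_after = np.linalg.matrix_rank(A2)
print(rank_before, rank_after)

x = np.array([1.0, 1.0, 1.0])  # one solution of A x = b
print(np.allclose(A @ x, b), np.allclose(A2 @ x, b2))
```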

steady basalt May 21, 2022, 6:12 PM

#

I think that as long as u know the omega basics of the math behind ml, only stats matters anymore

#

I mean, I’m sure everyone knows how to multiply matrices bro

wooden sail May 21, 2022, 6:13 PM

#

sadly you can't cleanly separate the two in multivariate statistics, since you'll be looking at a LOT of covariance and correlation matrices, their rank, and their related spaces

steady basalt May 21, 2022, 6:13 PM

#

I think that beyond what’s needed for backpropagation calculus gets less important

#

And more so general stats

wooden sail May 21, 2022, 6:14 PM

#

if you just wanna use stuff and not make new theoretical results, yes, for sure

#

and honestly that is the case for the vast majority of people

steady basalt May 21, 2022, 6:14 PM

#

I do not think 99% of data scientists are trying to invent the wheel

#

Yes

#

Myself included

#

If I wanted to do that I’d go for a PhD in ml

wooden sail May 21, 2022, 6:14 PM

#

i might be a little out of touch with the real world 😛

steady basalt May 21, 2022, 6:15 PM

#

Even the phds I know have not done so but rather applied it for research

#

The models we have already are fine

wooden sail May 21, 2022, 6:15 PM

#

they're "fine" in that they work. they're "not fine" in that there aren't all that many results providing performance and convergence guarantees

steady basalt May 21, 2022, 6:16 PM

#

Theoretical ml math is like for the actual gods and math nerds. We need them and I appreciate their work but it’s definitely only a tiny fraction of engineers

wooden sail May 21, 2022, 6:16 PM

#

so you go out on a limb trusting your training and validation

#

small percentage of engineers and (applied) mathematicians, yeah.

steady basalt May 21, 2022, 6:16 PM

#

I can’t imagine how hard it was to code tensorflow from scratch

#

Or come up with certain models

#

It’s way beyond my own iq limit

wooden sail May 21, 2022, 6:17 PM

#

the core models are old, i would argue the basics of that is not all that difficult. the stuff is decades old, we just lacked the computational power

steady basalt May 21, 2022, 6:17 PM

#

It’s for the most creative bunch, I just learn and use their creations

#

#

Read?

thin palm May 21, 2022, 7:36 PM

#

steady basalt Just encode company

so encode the companyId? There's 63 features, but I think I may just do this to save myself the time

thin palm May 21, 2022, 9:36 PM

#

Would anybody ordinal encode degree types? Such as High School, Bachelors, Masters, and PhD? A colleague of mine suggested one-hot encoding (OHE) instead but I'm unsure

thin palm May 21, 2022, 10:13 PM

#

perfect, was going to do this anyway. Thanks mate

#

Does "None" for degree obtained mean missing values, or does it just mean no degree was obtained? Originally I thought it meant no degree was obtained, but another colleague is saying it means null values

steady basalt May 21, 2022, 10:26 PM

#

thin palm Would anybody ordinal encode degree types? Such as High School, Bachelors, Maste...

Yes

thin palm May 21, 2022, 10:27 PM

#

steady basalt Yes

How about categorical features that say "None" for degree awarded, does this mean missing data or just none obtained? But now looking at major awarded, some of the values have "None" even though a degree was awarded

steady basalt May 21, 2022, 10:28 PM

#

None is a category

#

Maybe not in the case of grades

thin palm May 21, 2022, 10:29 PM

#

steady basalt Maybe not in the case of grades

Maybe I could do a boolean indexing that finds all degrees awarded and make sure that majors are being awarded

#

because you can't graduate with no Major
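That consistency check could look roughly like this. The column names `degree` and `major` and the level labels are assumptions, since the real dataset isn't shown.

```python
import pandas as pd

# Toy data; the column names "degree" and "major" are assumptions.
df = pd.DataFrame({
    "degree": ["BACHELORS", "NONE", "MASTERS", "HIGH_SCHOOL", "DOCTORAL"],
    "major": ["MATH", "NONE", "NONE", "NONE", "PHYSICS"],
})

college_degrees = ["BACHELORS", "MASTERS", "DOCTORAL"]
has_college_degree = df["degree"].isin(college_degrees)
no_major = df["major"] == "NONE"

# Rows where a college degree is recorded but the major is "NONE":
# these are the suspicious ones worth inspecting.
suspicious = df[has_college_degree & no_major]
print(suspicious)
```

If `suspicious` is empty, "NONE" in the major column is at least consistent with "no degree, so no major"; if it isn't, some of those values may really be missing data.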

main fox May 22, 2022, 1:14 AM

#

thin palm Would anybody ordinal encode degree types? Such as High School, Bachelors, Maste...

I'd make dummy variables for each. I don't know how your data is structured, but it's possible for someone to have a PhD and a Masters, and someone else to have a PhD with no Masters for example.

thin palm May 22, 2022, 1:16 AM

#

main fox I'd make dummy variables for each. I don't know how your data is structured, but...

Good looks on this, but it's not structured that way. How about occupation 'CFO, CEO, Janitor, Manager' would you ordinal encode this or One Hot encode instead?

main fox May 22, 2022, 1:18 AM

#

Personally, I tend to not try and ordinal encode. It imposes an assumption of equal distance between points that might not be true.
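For comparison, here is what the two encodings produce on a toy degree column. The level names and their ordering are assumptions for the sake of the example.

```python
import pandas as pd

# Toy column; the level names and their order are assumptions.
levels = ["High School", "Bachelors", "Masters", "PhD"]
degree = pd.Series(["Masters", "High School", "PhD", "Bachelors"], name="degree")

# Ordinal encoding: one integer column, implies evenly spaced levels.
ordinal = pd.Categorical(degree, categories=levels, ordered=True).codes

# One-hot encoding: one 0/1 column per level, no spacing assumption.
one_hot = pd.get_dummies(degree).reindex(columns=levels).astype(int)

print(list(ordinal))
print(one_hot)
```

The ordinal version tells a linear model that Masters minus Bachelors equals PhD minus Masters; the one-hot version lets each level get its own coefficient at the cost of extra columns.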

thin palm May 22, 2022, 1:18 AM

#

main fox Personally, I tend to not try and ordinal encode. It imposes an assumption of eq...

okay cool, thanks for the help man.

main fox May 22, 2022, 1:19 AM

#

You're welcome.

void granite May 22, 2022, 1:30 AM

#

thin palm How about categorical features that say "None" for degree awarded, does this mea...

this is a general problem in dealing with data, the solution is generally to ask the people who recorded/created the data, or look at any documentation associated with it

thin palm May 22, 2022, 1:31 AM

#

void granite this is a general problem in dealing with data, the solution is generally to ask th...

it's for a take home assignment on a job interview.. I think they just would like to see how I tackle certain problems

void granite May 22, 2022, 1:31 AM

#

oh ok

#

🙂

thin palm May 22, 2022, 1:31 AM

#

void granite oh ok

thank you though!

void granite May 22, 2022, 1:31 AM

#

I hope the job interview goes well!

thin palm May 22, 2022, 1:32 AM

#

void granite I hope the job interview goes well!

thanks so much, I really need it haha

thin palm May 22, 2022, 1:50 AM

#

void granite I hope the job interview goes well!

any idea on making a correlation matrix for something with 92 features?

void granite May 22, 2022, 1:51 AM

#

df.corr() 😉

#

assuming a pandas dataframe

thin palm May 22, 2022, 1:52 AM

#

void granite df.corr() 😉

haha yeah that's one way, the image comes out too blurry because there are so many features, and yes

#

may just do a table instead of viz

void granite May 22, 2022, 1:53 AM

#

it's a lot of features

thin palm May 22, 2022, 1:53 AM

#

void granite it's a lot of features

yeah, will have to just use a table

void granite May 22, 2022, 1:53 AM

#

maybe there's a subset you can look at

#

or you could use some type of dimensionality reduction

#

like MDS or PCA

thin palm May 22, 2022, 1:54 AM

#

Yeah, but I'm not sure on that. I might just make 2 models... one dropping the big features and one with it

#

is it okay to have a correlation of .384 between degree and salary? salary is what we're predicting

main fox May 22, 2022, 1:58 AM

#

thin palm any idea on making a correlation matrix for something with 92 features?

plt.figure(figsize=(), dpi=())
sns.heatmap(df.corr())

thin palm May 22, 2022, 1:58 AM

#

main fox plt.figure(figsize=(), dpi=()) sns.heatmap(df.corr())

it's just too big to fit on a figsize

#


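```
# Assumption: the first line of this snippet was lost; flattening the
# correlation matrix like this would give the three columns used below.
corr_df = df.corr().stack().reset_index()
corr_df.columns = ['feature_1', 'feature_2', 'correlation']  # rename columns
corr_df.sort_values(by="correlation", ascending=False, inplace=True)  # sort by correlation
corr_df = corr_df[corr_df['feature_1'] != corr_df['feature_2']]  # remove self correlation
# corr_df[(corr_df['correlation'] >= 0.5) | (corr_df['correlation'] <= -0.5)]
corr_df
```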

main fox May 22, 2022, 1:59 AM

#

Then yeah drop correlations close to zero I suppose.

thin palm May 22, 2022, 1:59 AM

#

main fox Then yeah drop correlations close to zero I suppose.

But the highest is the .384

#

not sure if that's a good thing or bad

main fox May 22, 2022, 1:59 AM

#

It's relative to the data

thin palm May 22, 2022, 2:00 AM

#

main fox It's relative to the data

ok ok

thin palm May 22, 2022, 2:16 AM

#

main fox It's relative to the data

So then how does the matrix tell us what features we need to drop?? The ones that have high correlation right?

#

If I have 93 features and 62 of them are fairly independent, does that mean we just keep them?
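One common reading, sketched below under assumptions: features that are highly correlated with each other are redundant, so you would look for such pairs and consider dropping one of each, while features that are roughly independent of one another can all stay. The 0.9 threshold and the helper name `correlated_pairs` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),
})
print(correlated_pairs(demo))
```

Correlation with the target is a separate question: a low value there does not by itself mean a feature is useless, especially for one-hot columns.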

void granite May 22, 2022, 2:19 AM

#

thin palm is it okay to have a correlation of .384 based off of degree and salary? salary i...

I wouldn't say this is a strong relationship, but given that salary is going to be influenced by a ton of things, I am not surprised

#

is that r or r^2 ?

#

I guess it's R if you are saying it's correlation

thin palm May 22, 2022, 2:21 AM

#

void granite I wouldn't say this is a strong relationship, but given that salary is going to ...

check this out

Screen_Shot_2022-05-21_at_8.21.08_PM.png

void granite May 22, 2022, 2:22 AM

#

hm hm

#

are there a bunch of features bc there's a bunch of careers in there?

thin palm May 22, 2022, 2:22 AM

#

void granite hm hm

again its for job interview so as long as I'm on the right track...

void granite May 22, 2022, 2:23 AM

#

yeah

thin palm May 22, 2022, 2:23 AM

#

void granite are there a bunch of features bc there's a bunch of careers in there?

because there's 63 unique companyID's

void granite May 22, 2022, 2:23 AM

#

years experience + degree is probably pretty good then 🙂

thin palm May 22, 2022, 2:23 AM

#

I thought about dropping this to see if it made a difference

void granite May 22, 2022, 2:23 AM

#

I bet years experience + degree + field is even better

#

but maybe just the two are enough

thin palm May 22, 2022, 2:24 AM

#

void granite I bet years experience + degree + field is even better

so based off what I showed you, do you think I can probably drop the companyId?

void granite May 22, 2022, 2:24 AM

#

sorry typos

thin palm May 22, 2022, 2:24 AM

#

and that dang companyID has forced me to make 63 extra columns... but the reason I kept it is because company12 (could be apple for example) is known to pay better than company55(which could be HP)

#

does that make sense @void granite ?

void granite May 22, 2022, 2:26 AM

#

yeah just pick a few of the higher ones for the model

thin palm May 22, 2022, 2:26 AM

#

void granite yeah just pick a few of the higher ones for the model

Meaning that I create my features with only a few of the selected columns?

void granite May 22, 2022, 2:28 AM

#

I mean, it's -a- way to do it, if you just want something quick and dirty. there's a lot of advanced techniques you can use if you want to be formal about winnowing down predictors, but uh... it's been over five years since I was doing any sort of advanced statistical work 😄

thin palm May 22, 2022, 2:28 AM

#

void granite I mean, it's -a- way to do it, if you just want something quick and dirty. there...

ahh gotcha gotcha, i'll have to play around a bit more! Thank you though

void granite May 22, 2022, 2:32 AM

#

if you were being fancy about it, you would pick some sort of criteria for evaluating models, make a bunch of models, and then rate them by that criteria... maybe you would also have some parsimony there (preferring models with a lesser number of variables)
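As a rough sketch of that idea, the example below scores every subset of three synthetic predictors with AIC, one criterion that rewards fit and penalizes extra variables. The data, the choice of AIC over other criteria, and the helper name `aic_for_subset` are all assumptions for illustration.

```python
import itertools
import numpy as np


def aic_for_subset(X, y, columns):
    """AIC of an ordinary least squares fit using the given column indices."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, columns]])  # add intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1]
    return n * np.log(rss / n) + 2 * k  # lower is better; 2k penalizes size


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only columns 0 and 1 drive y; column 2 is pure noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

subsets = [list(c) for r in (1, 2, 3) for c in itertools.combinations(range(3), r)]
scores = {tuple(s): aic_for_subset(X, y, s) for s in subsets}
best = min(scores, key=scores.get)
print(best)
```

With 92 features an exhaustive search like this is not feasible, so in practice people use stepwise or regularized approaches; the scoring idea is the same.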

#

there's like entire fields of study about this problem heh

#

and there is, imho, a sense in which it's an art in addition to a science

#

actually, thinking about it

#

you want companyID to be a factor, like a categorical data type

#

"company" should just be one column, in other words

#

http://pandas-docs.github.io/pandas-docs-travis/user_guide/categorical.html
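A minimal sketch of what that looks like in pandas, with made-up company labels and salaries. The column names `companyId` and `salary` are assumptions.

```python
import pandas as pd

# Toy data; the column names "companyId" and "salary" are assumptions.
df = pd.DataFrame({
    "companyId": ["COMP12", "COMP55", "COMP12", "COMP7"],
    "salary": [150, 90, 140, 110],
})

# Store the company as a single categorical column instead of 63 dummies.
df["companyId"] = df["companyId"].astype("category")

print(df["companyId"].cat.categories)  # the distinct companies
print(df["companyId"].cat.codes)       # integer code per row
print(df.groupby("companyId", observed=True)["salary"].mean())
```

Some modelling libraries can consume a categorical column directly, while others still expect explicit dummy columns, so whether this removes the 63 extra columns depends on the model you use.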

fallow herald May 22, 2022, 6:00 AM

#

good morning

#

I am new to pandas, is there a sort of 'mask' functionality?

#

#

Lets say I have a data frame as in the image above

#

would it be possible to have a mask in a Series and run it over the dataframe, returning me all rows matching the mask?

#

what would be the best approach when I have such a task?
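For reference, boolean indexing is the usual pandas way to do this. The frame below is a stand-in, since the image with the real data is missing from the log.

```python
import pandas as pd

# Toy frame standing in for the one in the (missing) image.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [1, 5, 3, 8]})

# A boolean Series with the same index as the frame acts as a mask.
mask = df["value"] > 2
print(df[mask])            # rows where the mask is True

# Masks combine with & (and), | (or), ~ (not); wrap each part in parentheses.
combined = (df["value"] > 2) & (df["name"] != "d")
print(df.loc[combined])
```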

hybrid mica May 22, 2022, 6:09 AM

#

why is the favicon.ico for colab in a different colour, for each of these notebooks?

tacit basin May 22, 2022, 8:50 AM

#

fallow herald would it be possible to have a mask in a Series and run it over the dataframe, re...

What's your desired output DF?

tacit basin May 22, 2022, 8:51 AM

#

hybrid mica why is the favicon.ico for colab in a different colour, for each of these notebo...

Not sure. Maybe one is connected to kernel and the other one not?

wooden forge May 22, 2022, 9:23 AM

#

Hey guys, so I have this plot, and I'd like to make a Gaussian fit. But I don't really know how, so would anyone know by any chance how to do it properly?

#

(the pit in p=0 is totally normal btw that's exactly what I was trying to get)

tidal bough May 22, 2022, 9:27 AM

#

wooden forge Hey guys, so I have this plot, and I'd like to make a Gaussian fit. But I don't ...

Simple way: calculate the mean and std of the data, and draw a gaussian with these parameters.

#

In theory you could instead directly minimize the mean squared error, or perhaps maximize the likelihood, via something like scipy.optimize. But I think the result will be very close, and the latter approach is more complicated and computationally expensive.

wooden forge May 22, 2022, 9:28 AM

#

mmh

#

the mean would be sigma and the std mu ?

tidal bough May 22, 2022, 9:29 AM

#

the opposite

wooden forge May 22, 2022, 9:29 AM

#

hoo

#

let me check whats the std lol

#

I'm not used to english terms

tidal bough May 22, 2022, 9:29 AM

#

!docs numpy.std can be used, say

arctic wedgeBOT May 22, 2022, 9:29 AM

#

numpy.std


```
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)
```
Compute the standard deviation along the specified axis.

Returns the standard deviation, a measure of the spread of a distribution,
of the array elements. The standard deviation is computed for the
flattened array by default, otherwise over the specified axis.

wooden forge May 22, 2022, 9:30 AM

#

haaaaaaa

#

Okay I know what that is

#

thanks !

wooden forge May 22, 2022, 9:37 AM

#

tidal bough Simple way: calculate the mean and std of the data, and draw a gaussian with the...

I might be doing something wrong

wooden sail May 22, 2022, 9:56 AM

#

if you have the data from which you produced that PDF, you could find the mean and std from there. if you only have the plot you showed, you have to find the parameters of a gaussian curve directly

wooden forge May 22, 2022, 9:56 AM

#

wooden sail if you have the data from which you produced that PDF, you could find the mean a...

I have a np array, I just use np.mean and np.std on it to get the parameters

wooden sail May 22, 2022, 9:57 AM

#

an np array of what, though?

wooden forge May 22, 2022, 9:57 AM

#

but I just decided to look directly on the graph to get the value and then get sigma from that

tidal bough May 22, 2022, 9:57 AM

#

a numpy array of what? the original data you used to calculate the histogram?

wooden forge May 22, 2022, 9:57 AM

#

wooden sail an np array of what, though?

ho it's the squared modulus of a state function psi

wooden sail May 22, 2022, 9:57 AM

#

if you have only the samples of the plot you showed, you'll have to do a least squares fit or something similar, and finding the mean and variance won't work
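The reason `np.mean` and `np.std` on the array go wrong in that case is that they describe the heights of the curve, not the distribution of p. A cheap alternative to a full fit, assuming the density is sampled on an evenly spaced grid and is close to a single clean bump, is to take density-weighted moments. The helper name `gaussian_params_from_density` and the synthetic numbers are made up for illustration.

```python
import numpy as np


def gaussian_params_from_density(x, density):
    """Estimate (mu, sigma) from density values sampled on a grid `x`."""
    weights = density / density.sum()          # normalize to sum to 1
    mu = np.sum(x * weights)                   # density-weighted mean
    sigma = np.sqrt(np.sum((x - mu) ** 2 * weights))
    return mu, sigma


# Synthetic check: a Gaussian with mu=5, sigma=60 on an evenly spaced grid.
x = np.linspace(-500.0, 500.0, 2001)
density = np.exp(-0.5 * ((x - 5.0) / 60.0) ** 2)
mu, sigma = gaussian_params_from_density(x, density)
print(mu, sigma)
print(np.mean(density), np.std(density))  # statistics of the heights, not of p
```

Features like the dip at p=0 will bias these moments, so they are probably best used as an initial guess for a least squares fit.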

tidal bough May 22, 2022, 9:57 AM

#

wooden forge but I just decided to look directly on the graph to get the value and then get s...

from the plot sigma should be 50-75 by my eye

wooden forge May 22, 2022, 9:57 AM

#

yeah I find 66

tidal bough May 22, 2022, 9:58 AM

#

hmm, how did you plot the gaussian?

wooden forge May 22, 2022, 9:58 AM

#

I just look at the central value, calculate sigma from that and then plot a gaussian

#

#

worked fine lol

#

idk why the method you gave me didn't

wooden sail May 22, 2022, 9:59 AM

#

that doesn't look like all that good a fit

tidal bough May 22, 2022, 9:59 AM

#

the sigma is clearly slightly off (lower than it should be)

wooden forge May 22, 2022, 9:59 AM

#

Yeah I know

somber prism May 22, 2022, 9:59 AM

#

guys i loaded a dataset using tf.data.Dataset.from_tensor_slices and my map function is like this
```
def preprocessing(x, img_path):
    print(dir(x))
    name1 = str(x[0].numpy())
    name2 = str(x[3].numpy())
    num1 = str(x[1].numpy())
    num2 = str(x[2].numpy())
    target = float(x[4])

    img_name1 = f'{name1}_{add_zeros(num1)}.jpg'
    img_name2 = f'{name2}_{add_zeros(num2)}.jpg'

    img1 = plt.imread(os.path.join(img_path, name1, img_name1))
    img2 = plt.imread(os.path.join(img_path, name2, img_name2))

    return tf.convert_to_tensor([img1, img2, target])
```

wooden forge May 22, 2022, 9:59 AM

#

also I changed the central value to get the maximum divided by 2
that's totally normal

wooden sail May 22, 2022, 10:00 AM

#

surmised as much

tidal bough May 22, 2022, 10:00 AM

#

you might in fact want to go the scipy.optimize way, since you only have the PDF

wooden sail May 22, 2022, 10:00 AM

#

still, you'll wanna do a fit

wooden forge May 22, 2022, 10:01 AM

#

PDF ?

wooden sail May 22, 2022, 10:01 AM

#

since you have the parametric model, you can find the gradient and hessian analytically (or use automatic differentiation) to make it converge quickly

#

PDF = probability density function

tidal bough May 22, 2022, 10:01 AM

#

wooden forge PDF ?

Density plot

wooden forge May 22, 2022, 10:01 AM

#

I also have the state function

#

as a numpy array

tidal bough May 22, 2022, 10:01 AM

#

wooden sail since you have the parametric model, you can find the gradient and hessian analy...

ooh, fancy ways

wooden forge May 22, 2022, 10:02 AM

#

basically I have my system in an initial psi state as a ket, then I apply my operator, yada yada, and get a final state

#

and I'm plotting the probability density

#

which I have as an array

wooden sail May 22, 2022, 10:03 AM

#

so you do have the data. you can just do a maximum likelihood estimate of the mean and variance then, both of which have closed form if the distribution is normal (and is assumed to have AWGN noise)

wooden forge May 22, 2022, 10:03 AM

#

AWGN ?

wooden sail May 22, 2022, 10:03 AM

#

additive white gaussian noise

tidal bough May 22, 2022, 10:04 AM

#

here's an example of gaussian fitting:

```
from scipy import optimize as opt
import scipy.stats as stats
import numpy as np
# X,Y are the points of your PDF

def loss_fun(params):
    mu, sigma = params
    return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()

initial_guess = np.array([0, 60])  # initial guess, just needs to be close-enough to the true values
res = opt.minimize(loss_fun, x0=initial_guess)
```

wooden sail May 22, 2022, 10:04 AM

#

just do what reptile suggests, it seems we're using different nomenclature anyway so it'd take too long to discuss it

wooden forge May 22, 2022, 10:05 AM

#

lmao

#

that's fine Edd ^^

#

is Y the data I have

#

just to be sure

tidal bough May 22, 2022, 10:07 AM

#

Sure, each Y[i] being the value of |\Psi|^2 at the corresponding p=X[i]

wooden forge May 22, 2022, 10:07 AM

#

ha nice!

tidal bough May 22, 2022, 10:08 AM

#

Then you do mu, sigma = res.x and plot a gaussian with these.
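A self-contained variant to try the idea on synthetic data, using `scipy.optimize.curve_fit`, which minimizes the same squared error. One caveat that is an assumption about typical data, not something checked on the real array: when the density values are very small, the mean-squared loss and its gradient are tiny too, and a generic minimizer may stop at the initial guess, while a least squares solver tends to cope better with that scaling. The true values 10 and 66 below are invented for the test.

```python
import numpy as np
from scipy import optimize as opt
from scipy import stats

# Synthetic stand-in for the real data: a Gaussian density plus small noise.
rng = np.random.default_rng(0)
X = np.linspace(-300.0, 300.0, 601)
Y = stats.norm.pdf(X, 10.0, 66.0) + rng.normal(scale=1e-5, size=X.size)


def gaussian(x, mu, sigma):
    return stats.norm.pdf(x, mu, sigma)


# curve_fit does a least squares fit of `gaussian` to the (X, Y) points.
(mu, sigma), _ = opt.curve_fit(gaussian, X, Y, p0=[0.0, 60.0])
print(mu, sigma)
```

If the real data is not normalized to unit area, an extra amplitude parameter in `gaussian` would be needed.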

wooden forge May 22, 2022, 10:09 AM

#

oki ^^

#

thx ! will try !

tidal bough May 22, 2022, 10:09 AM

#

note my edit just now; it should be return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()

wooden forge May 22, 2022, 10:10 AM

#

ho okay

wooden sail May 22, 2022, 10:19 AM

#

did you get it to work? i'm curious how close that solution will come, since vanilla least squares isn't optimal here. seems to me that the noise covariance is not just a scaled identity matrix, so the cost function would ideally have an inverse of the covariance matrix somewhere in there

wooden forge May 22, 2022, 10:21 AM

#

Not done yet

#

I grabbed some snacks that's why

#

wooden forge May 22, 2022, 10:28 AM

#

wooden sail did you get it to work? i'm curious how close that solution will come, since van...

there you go

wooden sail May 22, 2022, 10:36 AM

#

guess i surmised as much. one thing you can try is to split up the signal into chunks of a handful of samples and compute the variance of each
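A sketch of that chunking step. Applying it to the residuals (data minus fitted curve) instead of the raw signal is an assumption made here so that the trend of the curve doesn't inflate the variances; the chunk size and synthetic data are arbitrary.

```python
import numpy as np


def chunk_variances(signal, chunk_size=8):
    """Variance of each consecutive chunk; a ragged tail is dropped."""
    signal = np.asarray(signal, dtype=float)
    n_chunks = signal.size // chunk_size
    chunks = signal[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)
    return chunks.var(axis=1, ddof=1)


# Synthetic residuals whose noise level grows along the signal.
rng = np.random.default_rng(0)
noise_scale = np.linspace(0.1, 1.0, 400)
residuals = rng.normal(size=400) * noise_scale
variances = chunk_variances(residuals, chunk_size=40)
print(variances)
```

If the chunk variances differ a lot, the noise is not uniform along the signal, which would support weighting each point by the inverse of its local variance in the fit.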