#data-science-and-ml


wooden sail
#

log on with your personal account and use the free version

spare briar
#

-you can accumulate batches and still apply momentum

#

momentum changes the optimizer step

viscid flume
#

Can I upload a folder or something to colab?

wooden sail
#

i know, but you can rewrite momentum as an accumulated gradient with large momentum parameters and resetting the momentum every few steps. you can make them match in degenerate cases

#

yes, you can upload files to it. the easiest way is through google drive, but you can also just upload stuff directly

#

more generally, what i mean is that these two things are just a specific choice of descent step size and momentum

spare briar
#

this just isnt true

viscid flume
#

How do I do that then?

spare briar
#

you would need to reset momentum on every call to step

wooden sail
#

that's a valid choice of momentum

spare briar
wooden sail
#

what i'm getting at is that you can write a single mathematical expression that does either of these things with a different parameter schedule. of course you get different results. they're flavors of the same thing though, a stochastic gradient descent schedule
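
A minimal sketch of that claim, assuming plain heavy-ball momentum written as `v = mu_t * v + g` followed by `theta = theta - lr_t * v`. The function names and the toy quadratic loss are made up for illustration. With `mu_t` and `lr_t` chosen per step, the same update rule reproduces gradient accumulation over `k` steps, while a constant `mu` gives ordinary momentum and a different trajectory:

```python
def grad(theta, x):
    # gradient of the toy per-sample loss 0.5 * (theta - x) ** 2 (illustrative assumption)
    return theta - x


def run_schedule(theta, data, mu_schedule, lr_schedule):
    """Generic update: v = mu_t * v + g_t, then theta = theta - lr_t * v."""
    v = 0.0
    for t, x in enumerate(data):
        v = mu_schedule(t) * v + grad(theta, x)
        theta = theta - lr_schedule(t) * v
    return theta


def run_accumulation(theta, data, k, lr):
    """Reference: sum gradients over k samples, then take one step."""
    acc = 0.0
    for t, x in enumerate(data):
        acc += grad(theta, x)
        if (t + 1) % k == 0:
            theta = theta - lr * acc
            acc = 0.0
    return theta


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k, lr = 4, 0.05

# Accumulation as a schedule: reset the buffer at the start of each window
# (mu = 0), keep everything inside it (mu = 1), and only move on the last step.
as_schedule = run_schedule(
    0.0, data,
    mu_schedule=lambda t: 0.0 if t % k == 0 else 1.0,
    lr_schedule=lambda t: lr if (t + 1) % k == 0 else 0.0,
)
reference = run_accumulation(0.0, data, k, lr)

# Ordinary momentum: constant mu, a step on every sample.
with_momentum = run_schedule(0.0, data, lambda t: 0.9, lambda t: lr)
```

Both positions in the thread show up here: the two procedures fit one expression, and they still end at different parameters.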

viscid flume
#

ok

#

in this?

spare briar
wooden sail
#

mhm

spare briar
#

this argument can get pretty pathological though

#

i could argue that optimization schedule + loss is one object

#

and we study the equivalence classes of trajectories

#

when we trade off schedule and loss

viscid flume
spare briar
#

not that this is wrong

viscid flume
# spare briar if step % acc_steps == 0:

change this:if epoch_current <= cfg.SOLVER.GENERATOR.INIT_EPOCH: # FP _t.tic() real_images_color = real_images_color.to(device,memory_format=torch.channels_last) generated = model_generator(real_images_color) loss_init = init_loss(model_backbone, real_images_color, generated) INIT_FP_time = _t.toc() loss_dict = {"Init_loss": loss_init} # BP _t.tic() if (iteration+1) % 4 == 0 or (iteration+1) == len(data_loader): optimizer_generator.zero_grad(set_to_none=True) loss_init.backward() optimizer_generator.step() scheduler_generator.step() scheduler_discriminator.step() INIT_BP_time = _t.toc() meters.update(INIT_FP_time=INIT_FP_time, INIT_BP_time=INIT_BP_time)
to?

spare briar
#

looks like its already happening

#

if (iteration+1) % 4 == 0

viscid flume
#

Ok

spare briar
#

although the second part after or is sort of funny

#

since they dont properly scale their learning rate for that step

spare briar
# viscid flume Ok

(you could make it % 6 and decrease your batch size but make sure to also scale the learning rate to compensate)

spare briar
#

well think about it

#

if you accumulate 6 batches instead of 4

#

you have 6/4 more gradient
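
To make the arithmetic concrete, here is a small sketch with made-up numbers. It assumes gradients are summed across micro-batches, which is what repeated `backward()` calls do in PyTorch, and that every micro-batch yields a similar gradient. The helper name is hypothetical:

```python
def accumulated_step(per_batch_grads, lr, average=False):
    """Return the parameter change from one optimizer step after accumulation."""
    total = sum(per_batch_grads)
    if average:
        # equivalent to dividing each micro-batch loss by the number of steps
        total /= len(per_batch_grads)
    return -lr * total


g = 2.0      # pretend every micro-batch yields the same gradient
lr = 0.01

step_4 = accumulated_step([g] * 4, lr)
step_6 = accumulated_step([g] * 6, lr)
step_6_scaled = accumulated_step([g] * 6, lr * 4 / 6)   # compensate via the learning rate
avg_4 = accumulated_step([g] * 4, lr, average=True)
avg_6 = accumulated_step([g] * 6, lr, average=True)     # or average instead of summing
```

Either scaling the learning rate by 4/6 or averaging the accumulated gradient keeps the step size independent of how many batches are accumulated. The same reasoning applies to a final, shorter accumulation window at the end of an epoch.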

viscid flume
#

Actually, I added that myself

#

It wasn't in the code before, and I read an article and did that, wasn't sure how it works back then

viscid flume
#

Ok

viscid flume
#

2022-05-19 15:22:50,438 AnimeGan.trainer INFO: eta: 1:22:41 epoch: 1 | 421/6700 batch_time: 0.0062 (0.0049) data_time: 0.0001 (0.0001) Init_loss: 292.7678 (286.7284) lr(G|D): 0.000050|0.000010 max mem: 1857

#

wut

#

it's working!

dusty valve
#

how do i install pytorch without CUDA?

#

i don't have a nvidia gpu btw

ebon ember
#

Hey
Anyone here familiar with xarray?
```
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()

da.groupby('dt').map(calc_max_std)
```
It gives me a 'TypeError: 'DataArray' object cannot be interpreted as an integer'
If I do
```
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()
```
it works, but that's beside the point as I need to slice the array differently in every group. Does anyone have any idea?
thanks a lot in advance

dusty valve
#

if anyone knows how to fix this, ping me. The error is: AssertionError: Torch not compiled with CUDA enabled. i don't have an nvidia gpu btw

brazen spire
#

Does tensorflow 2.9 support cuda 11.6?

barren wedge
#

Can we ignore words with only one token in dataframe/series?
how to do it?

ebon ember
willow jasper
#

can anyone tell me how i can add a new row to this dataset?
but the main point is that i have to add a column whose values mirror score, except that in place of 5 i have to put 1 and in place of 2 i have to put 0

#

can anyone help me out

lone yacht
#

Hi everyone, I have a few questions for those with experience in training neural networks in PyTorch. I've recently been trying to train and evaluate a CNN to classify images from the CIFAR10 dataset. The condition is it has to be three hidden layers only, and I've set it up as per the code attached.

#

I have a training and validation set, and I've been evaluating the model with them. I'm using SGD with a learning rate of 0.001 and a momentum of 0.9, with cross entropy loss as the criterion. I was hoping to get a graph that looks like an exponential graph, as per most experiments I've seen, but instead it's come out like this, where the validation accuracy is all over the place.

#

Is there anything I could change to get something more consistent and with higher accuracy?

#

This is what I was hoping it would look more like

mild dirge
#

Well the scale of your graph is quite misleading, the accuracy goes from 48 to 60 there

#

So the training accuracy and validation accuracy aren't as far off as it seems in that graph

#

@lone yacht

#

As of now it seems that you are slightly over-fitting as your validation accuracy is lower than your training accuracy. But equally important is that your training accuracy doesn't seem to increase further.

lone yacht
#

Thanks for the response @mild dirge. Do you know what I could try to get a better accuracy? I also note that the first epoch starts at around 55% which I'm curious about, because I would have expected it to be more like 10% starting off with random weights, given that there are 10 classes, do you know why that might be?

mild dirge
#

It is likely the accuracy after the first epoch

#

It is quite normal that it can jump that high after the first epoch*

lone yacht
#

Ahh, that makes sense

mild dirge
#

And the fact that the accuracy is not going up can be because of many things

#

complex dataset, too small network etc.

#

It's also not that common to only use 1 fc layer I think

chilly sapphire
#

hello everyone

#

just wondering if anyone can help me with my data structure analysis assignment

mild dirge
#

In general 3 hidden layers probably won't give too good of a performance

#

you could probably pull a sneaky and use a convolution layer with a stride above 1

#

that way you reduce the spatial dimensions and get to use another convolutional layer

#

(instead of the pool)

#

@lone yacht

lone yacht
#

Thanks @mild dirge , I'll try swapping out the pooling layer with another convolutional layer

mild dirge
#

With a stride above 1* that's the important bit here
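
A quick sketch of why the stride is the important bit, using the standard output-size formula for convolution and pooling layers without dilation. The kernel sizes and padding below are assumptions, since the actual network was only shared as an attachment:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


cifar = 32  # CIFAR10 images are 32x32

after_pool = conv_output_size(cifar, kernel=2, stride=2)                # 2x2 max-pool
after_strided = conv_output_size(cifar, kernel=3, stride=2, padding=1)  # 3x3 conv, stride 2
after_plain = conv_output_size(cifar, kernel=3, stride=1, padding=1)    # 3x3 conv, stride 1
```

Under these assumptions the stride-2 convolution halves the spatial size just like the pooling layer does, while a stride-1 convolution leaves it unchanged.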

frigid elk
#

in pyspark.ml.classification.LogisticRegressionModel i can call 'evaluate' to get access to the AUC-ROC and precision-recall summary. ... what's the equivalent for RandomForestClassifier?

willow jasper
compact rose
#

Hello guys, I have a question about pyspark. I printed the schema of a dataframe and got this (check image). However, I want to split the array you see there into columns. How do I do that?

frigid elk
#

is this what you're looking for?

df.select('session_id', 'session_events.*').show()
compact rose
#

I think it's something like that. I just saw that I didn't explain myself well. My idea is to split what is inside of "element" into columns. In the image, those are the values inside the row. And I want to slice it to get columns like this |DatetimeOpenApp| OpenApp| DatetimeViewHome| ViewHome|

serene scaffold
#

@willow jasper please do not ping random people (including staff members) asking for help. This is a warning.

frigid elk
#

hmm... i'm in the middle of something at work, but off the top of my head i'm thinking something along the route of exploding the session_events and pivoting the results back into named columns. ...

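from pyspark.sql import functions as F
# explode gives one row per event; alias the struct so its fields can be selected by name
df.select('session_id', F.explode(F.col('session_events')).alias('ev')).select('session_id', F.col('ev.datetime'), F.col('ev.event'))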

there may be some syntax error in that, didn't validate, but hope that sets you on the right path.

compact rose
#

I will try and thanks!

flat ridge
#

hey guys, is there an "inverse prediction" function in sklearn??

#

like what X i need to put for returning an Y

mild dirge
#

There can be multiple inputs that give the same prediction @flat ridge

compact rose
#

A million thanks to dre, you saved my life. But now I need your help again, guys xd. I'm finishing up working on my data and I want to group it in pyspark so that each session ID appears only once. Example: the first session ID repeats in 4 rows. I want that value to appear only once, with the other values staying in the same session. If you look at the columns, you will see that I used dummies for that, and I want it to work like a checklist where we can see which steps the client passed through (if they went to OpenApp it is 1, if they went to ViewHome it is 2)

inland zephyr
#

anyway I have an open question here and I need your suggestions
let's say I have several evaluation windows (say one hour, thirty minutes, twenty minutes, and finally one minute), and each window gives a performance figure (accuracy). My hypothesis is that even as the observation window gets shorter, the deviation in performance should stay small, so that I can conclude my model is stable across windows. Is it okay to use a standard deviation calculation for this case? The model is the same for each observation

#

i asked the question in the wrong room lmao

brave sand
#

does anyone here have any experience with multi agent reinforcement learning?

steady basalt
#

its how i run everything on the gpu

#

else, co lab is best for smallish projects

cinder matrix
#

guys anyone have a cool machine learning project idea something

#

maybe an interesting way to collect some dataset or a task thats not so common idk

pearl heart
#

yes

#

I have

soft kite
#

would this be the appropriate channel to ask about using pandas to visualize my data?

steady basalt
#

can anyone help me augment my images for a cnn?

#

if anyone here has xp

#

desperate!

wooden sail
#

if you're using keras or pytorch, they should have built-ins for this

loud cove
#

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)

I'm using sklearn and not sure why I'm not getting the same results each run even though I've set the random state to 0

you can see the random forest and log reg scores change.

data are articles which have been vectorized and lemmatized.

thin palm
#

Any recommendations in finding and understanding outliers in your Pandas dataset?

wooden sail
#

basic descriptive statistics are a good place to start

#

split up the data into quartiles and see if you can learn something
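
One common way to turn quartiles into an outlier check is the 1.5 x IQR rule that box plots use. A minimal sketch with the standard library, on made-up salary figures. Note that `statistics.quantiles` defaults to the "exclusive" method, so the cut points can differ slightly from what pandas or matplotlib report:

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


salaries = [42, 45, 47, 50, 52, 55, 58, 60, 63, 250]  # made-up numbers, in thousands
flagged = iqr_outliers(salaries)
```

A flagged value is only a candidate for a closer look. Whether it is an error or a genuine high earner is a judgment about the data, not something the rule decides.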

thin palm
steady basalt
#

I’m just choking on syntax

#

Do u know it by heart?

wooden sail
#

sadly not by heart, but i can tell you what the parameters mean if you show me an example of the syntax

steady basalt
#

so when you use flow

#

and if you say, rotate at angles 10 degrees

#

it will add 36 images to ur training set

#

?

#

@wooden sail

#

well of course you save it as a new variable set

#

example

#

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)

train_dataset = image_generator.flow_from_directory(batch_size=32,
                                                    directory='full_dataset',
                                                    shuffle=True,
                                                    target_size=(280, 280),
                                                    subset="training",
                                                    class_mode='categorical')

validation_dataset = image_generator.flow_from_directory(batch_size=32,
                                                         directory='full_dataset',
                                                         shuffle=True,
                                                         target_size=(280, 280),
                                                         subset="validation",
                                                         class_mode='categorical')

#

bad example

#
```
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=20,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.2)  # val 20%

val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)


train_data = train_datagen.flow_from_directory(train_path,
                                               target_size=(224, 224),
                                               color_mode='rgb',
                                               batch_size=BS,
                                               class_mode='categorical',
                                               shuffle=True,
                                               subset='training')

val_data = val_datagen.flow_from_directory(train_path,
                                           target_size=(224, 224),
                                           color_mode='rgb',
                                           batch_size=BS,
                                           class_mode='categorical',
                                           shuffle=False,
                                           subset='validation')
```
#

@wooden sail

#

looking at this code here

#

how do you tell how many images train_datagen will create

#

its potentially thousands right?

#

per sample

#

my thought is that this amount of transformation will make many images from one image to go thru epochs

wooden sail
#

no, it won't generate new ones per se. it'll take the same ones you fed and make a modified copy in memory on the fly, then delete it. it will use as many images as you told it to when you specified the number of epochs and total number of images

steady basalt
#

oh

#

waiiiiiit

#

so its like

wooden sail
#

the transformations are applied randomly

steady basalt
#

epoch 1 will use your first version

#

and epoch 2 will use a rotated one?

#

epoch 3 will use a scaled one?

wooden sail
#

randomly

steady basalt
#

it says its a range... so uh where does it specify how many inside that range are made

wooden sail
#

they will very likely never see the original

#

the range is a percentage. you tell it any transformation with a parameter ranging from 0% to 20% shift is valid, for example

steady basalt
#

(rescale=1./255, < also what is this?

wooden sail
#

multiply by 1/255

steady basalt
#

so number of versions of one image = number of epochs

#

?

wooden sail
#

as far as i recall, yeah
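
A rough sketch of what on-the-fly augmentation amounts to, in plain Python rather than Keras. The parameter names mirror the ImageDataGenerator arguments above, but the sampling function is a made-up stand-in, not the library's implementation. Each time an image is served, a transform is drawn at random from the allowed ranges, so every image is seen once per epoch in a freshly perturbed form and nothing extra is stored:

```python
import random


def sample_augmentation(rng, rotation_range=20, width_shift_range=0.2):
    """Draw one random transform; parameter names mirror the Keras arguments."""
    return {
        "rotation_deg": rng.uniform(-rotation_range, rotation_range),
        "width_shift_frac": rng.uniform(-width_shift_range, width_shift_range),
        "horizontal_flip": rng.random() < 0.5,
    }


rng = random.Random(0)
n_images, epochs = 3, 4
seen = []
for epoch in range(epochs):
    for image_id in range(n_images):
        # a fresh transform is drawn every time an image is served
        seen.append((epoch, image_id, sample_augmentation(rng)))

images_per_epoch = len(seen) // epochs
```

The number of images per epoch stays at the dataset size. What changes from epoch to epoch is which perturbed version of each image the model gets.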

steady basalt
#

riiight ok

#

im trying to get the code in here and work but struggling

#

Actually

#

I tried adding to a sequential model earlier but i think that was just a singular transform so it was busted

#

shud i expect a nice validation loss reduction?

wooden sail
#

with augmentation, yes

#

the validation loss should be a lot closer to the training one, at least

steady basalt
#

its currently almost 98% accurate before augmentation

wooden sail
#

that's meaningless, all overfitting

steady basalt
#

oh nvm

#

its 73

#

reckon it will break 80?

#

for X-rays, is there any rule for which specific set of transformations and ranges is optimal? or is it trial and error

wooden sail
#

it would depend on the type of registration error you expect to find

#

i'd expect shifts, shear, and scaling up and down along one or both axes to be common. flipping LR and UD, not really

steady basalt
#

rotate
?

wooden sail
#

small rotations, yeah. very small though.

steady basalt
#

do you know how to account for artifacts?

#

that are biasing my accuracy

#

e.g. literal text that labels my class

wooden sail
#

artifacts in what, the images? you'd expect them to average out

#

or what are you calling artifacts

steady basalt
#

if you have pictures of cars and pictures of ducks and the cars all have on the image 'car' written on it

#

somewhere

#

small

#

its gona pick up on that

#

and thats kinda not what you want a model to do when you roll it out

wooden sail
#

couldn't really say. you'd kinda wanna remove bad data from the data set

steady basalt
#

Supposedly there’s a workaround for this we’re gona learn but I cannot guess what it’s going to be

wooden sail
#

i can't think of one off the top of my head. admittedly this isn't the type of ML i look at

steady basalt
#

What do u look at

wooden sail
#

so-called "deep unfoldings" are what has my attention atm

steady basalt
#

Industry or academia?

wooden sail
#

academia

steady basalt
#

Which country?

wooden sail
#

germany

steady basalt
#

Nice

#

Post doc?

wooden sail
#

doc

steady basalt
#

I was considering if a PhD is worth it next but I’m not so sure

#

I’d rather work for a company than a research uni

wooden sail
#

if all you wanna do is research, a company is more than ok tbh. you can consider a phd if you think it gives you better job options, you like academia and/or teaching, or something like that

steady basalt
#

I’m not so much into teaching this because I don’t have a math background

#

Are opportunities good in Germany for English speakers with masters?

wooden sail
#

in industry, i'm not sure yet. in academia yes, doctoral programs are often in english and lots of foreigners apply here. masters too, for that matter

steady basalt
#

PhD definitely improves job options because I notice half of the companies say PhD preferred rn which is annoying

#

But 3 years of hell….

thin palm
#

if the skewness of a certain column came out consistent with a normal distribution, why does my box plot show outliers?

#

But then I plot a histogram and you can kinda see that bins go above

#

any idea on what to do with outliers with salaries?

steady basalt
#

@wooden sail man im getting so many errors apparently my colab directory doesnt exist

steady basalt
thin palm
wooden sail
#

do you have it in the correct directory? 😛 you can see your file structure, check around

steady basalt
#

Yeah dude

#

I will literally livestream this to u

#

im stuck augmenting for an hour

wooden sail
#

there's a tmp folder where some stuff lands when downloaded, but other things land in a different folder whose name i can't recall

thin palm
steady basalt
#

boxplot

#

@wooden sail wana try get my augmentation working?

wooden sail
#

i'll pass 😛

steady basalt
#

Anyone?

thin palm
steady basalt
#

haha

#

intern

thin palm
#

what's yalls thoughts on this?

steady basalt
#

thats a lot of outliers

#

keep them

#

its expected irl theres always a bunch earning more

thin palm
steady basalt
#

is it just one data point

thin palm
steady basalt
#

5 columns?

#

5 companies?

thin palm
#

apologies, 5 rows

steady basalt
#

are they all the same company

thin palm
steady basalt
#

god it takes me like 10 minutes to train a tiny neural network

#

how fast do u think a cloud service cud make this

#

80ms per step on m1 pro

steady basalt
#

anyone know why flow_from_directory says error20 directory doesnt exist

#

its a npy file

#

how does this function need data

steady basalt
#

ah damn u need specific folder structure

#

@wooden sail any idea how to create class files is it just numpy saving

loud cove
#

I have a question, so I have created a model and it gave me accuracy of 85% with train test split, when I used it on the full thing, it gives me 98%.
is this right?
What I did at the end is fit + predict on the full dataset

mild dirge
#

on the full thing?

loud cove
#

yea

mild dirge
#

well that would be the training accuracy

#

testing on the data you use for training will give very high accuracy indeed

#

So that accuracy does not tell you anything about how good the model would perform on new data
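
A toy illustration of the gap, assuming a 1-nearest-neighbour classifier that simply memorises its training points. The data and the 20% label noise are made up:

```python
import random


def predict_1nn(train, x):
    """Label of the closest training point (1-nearest-neighbour)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)


rng = random.Random(0)
# toy 1-D data: the label depends on x but is flipped 20% of the time (noise)
points = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    points.append((x, y))

train, test = points[:150], points[150:]
train_acc = accuracy(train, train)   # evaluated on the data it memorised
test_acc = accuracy(train, test)     # evaluated on unseen data
```

Scoring on the training points rewards memorisation, including the noisy labels, which is why only the held-out score says anything about new data.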

loud cove
#

yeah makes sense

loud cove
# mild dirge well that would be the training accuracy
- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).

it is just that these are the requirements of the project, but I'm confused about the second one.

mild dirge
#

What about that is saying to train/test on the full data?

#

Or is this another problem?

loud cove
#

unless I manually split them it doesn't make sense

mild dirge
#

the second point is not about classifying, it's about clustering

loud cove
#

yeah there is no clustering, it is just 3 columns, article, article name and category (three types)

#

or am i missing something

mild dirge
#

This is still about the classification right?

loud cove
#

for example, make the training the first 2000 and test last 500

#

yes it is a classification project

mild dirge
#

Yes, that is an often-used method

#

Sometimes the prof provides already-split data, but if it is not split yet, you can split it yourself

loud cove
mild dirge
#

You do need to put a bit of thought into how you split it

#

Like you want to test all the classes, so you could f.e. try to get an equal distribution of each class in the testing set

loud cove
#

well, it is articles and they don't seem to be ordered, so i was thinking of shuffling, then taking the bottom 500 and moving on

mild dirge
#

Well if you plan on looking at the accuracy, you want the classes to be equally distributed for the test set
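
One reading of this advice is a stratified split, which keeps each class's share the same in the training and test sets. A minimal pure-Python sketch with made-up labels follows. In scikit-learn the same effect comes from the `stratify` argument of `train_test_split`:

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) keeping each class's share roughly equal."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# made-up, imbalanced labels standing in for the three article categories
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
```

A plain shuffle followed by taking the last 500 rows usually lands close to this, but with a small class the test set can end up with very few examples of it.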

#

Otherwise you'd normally use different measures like f1 score, macro accuracy, confusion matrix etc.

loud cove
#

I was mainly looking at f score, but the project info says they mostly care about accuracy for this one.

mild dirge
#

Did you look at the class distribution?

loud cove
#

yes

mild dirge
#

and?

loud cove
mild dirge
#

alright, it's not balanced

#

how did you approach that?

loud cove
#

wel

#

well, in score I just used weighted

#

but I honestly don't know much and just getting the project flipped lol

mild dirge
#

weighted?

loud cove
#

for the scores

mild dirge
#

what function is that?

loud cove
#

what I found weird is that engineering (lowest count) has the highest f score, my theory is that it's due to not much overlap in words

#
def try_model(model_name):
    mdl=''
    if model_name == 'Logistic Regression':
        mdl = LogisticRegression(max_iter=500) #increasing iterations since default 100 wasn't enough
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier()
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB()
    elif model_name == 'Support Vector Classifer':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier()
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()
   


    oneVsRest = OneVsRestClassifier(mdl)
    oneVsRest.fit(x_train, y_train)
    y_pred = oneVsRest.predict(x_test)
    
    
    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 4)
    # Get the weighted precision, recall, f1 scores
    precision, recall, f1score, support = score(y_test, y_pred, average='weighted')

    print(f'Test Accuracy Score of Basic {model_name}: % {accuracy}')
    print(f'Precision : {precision}')
    print(f'Recall    : {recall}')
    print(f'F1-score   : {f1score}')

    # Add performance parameters to list
    perform_list.append(dict([
        ('Model', model_name),
        ('Test Accuracy', round(accuracy, 4)),
        ('Precision', round(precision, 4)),
        ('Recall', round(recall, 4)),
        ('F1', round(f1score, 4))
         ]))
#

it is just a function to try few different models and see the result of each

#

from sklearn.metrics import precision_recall_fscore_support as score

mild dirge
#

Okay, so a weighted accuracy would be a solution I suppose

#

Were there any hyper-parameters to tune for your model?

loud cove
#

yeah i used weighted as you can see
precision, recall, f1score, support = score(y_test, y_pred, average='weighted')
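
For context on what `average='weighted'` changes, here is a small hand-rolled sketch with made-up predictions. It is not scikit-learn's implementation, just the usual definitions: macro averaging treats every class equally, while weighted averaging weights each class's F1 by its support, so the majority class dominates:

```python
def per_class_f1(y_true, y_pred):
    """Map each class to (f1, support), with F1 defined as 0 when undefined."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = (2 * tp / denom if denom else 0.0, tp + fn)
    return scores


def macro_f1(scores):
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# made-up predictions: the majority class is easy, the minority class is half missed
y_true = ["big"] * 8 + ["small"] * 2
y_pred = ["big"] * 8 + ["big", "small"]
scores = per_class_f1(y_true, y_pred)
```

On imbalanced data the weighted figure can look healthy while a small class is doing poorly, so it helps to look at the per-class numbers as well.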

mild dirge
#

did you do a grid-search?

loud cove
#

I kept defaults other than the estimators, but decided against it.

mild dirge
#

Did you train/test the model and decide to change stuff to get a higher accuracy afterwards?

arctic wedgeBOT
#

Hey @loud cove!

It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.

Feel free to ask in #community-meta if you think this is a mistake.

loud cove
mild dirge
#

alright, that's good

#

So if you didn't change your model and chose all the hyper parameters beforehand, then just splitting the data into training and testing is fine

loud cove
mild dirge
#

If you wanted to find good parameters for your model, you could have used a validation set to check the model accuracy

#

You understand why you split it into training and testing right?

loud cove
#

I don't really care about the accuracy, I think 80 something is good enough.
yes to prevent overfitting

mild dirge
#

So why does it seem "weird" to you to split the data like that?

loud cove
#

no what I find weird is that they requested clusters, I'm assuming my predictions for each of the three categories, which is weird

#

I'd only be able to get that if I split them manually

mild dirge
#

Is there more explanation on clusters?

loud cove
#

nope

mild dirge
#

JSON file for the clusters output (to each category) This is literally everything in the assignment mentioning clusters

#

?

loud cove
#
Feel free to choose your classification algorithm and all the pre-processing needed on the data.

The team shares with you this JSON file (Note: "JSON" text is clickable) for a group of categorized articles as you will divide those articles into 3 groups: training data, validating data, and testing data.

To measure the accuracy of each algorithm, at this level you will measure the accuracy by the percent of matching only.

What we expected:
A GitHub repository includes:
- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).

Avoid plagiarism at all costs. All submissions will undergo plagiarism checks before being graded.
mild dirge
#

It seems that they wanted you to split into training/validation/test

#

you did not make a validation dataset?

steady basalt
#

why bother when u can cross validate training data on itself

loud cove
#

nah I just did validation with grid search

mild dirge
loud cove
#

I'm hoping they won't bother with that part lol

mild dirge
#

but grid search with what set?

loud cove
#

I don't really know much and just took a github repo from a somewhat similar project and adapted it to my needs.

loud cove
steady basalt
#

i miss stats tasks

mild dirge
#

alright, so you used cross validation, that's fine

steady basalt
#
```
image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2, rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)
# one-hot encode the labels once (3 classes); converting twice would give the wrong shape
y_train = keras.utils.to_categorical(y_train, 3)
y_test = keras.utils.to_categorical(y_test, 3)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
datagen = ImageDataGenerator(rescale=1/255, featurewise_center=True, featurewise_std_normalization=True, rotation_range=20,
                             width_shift_range=0.2, height_shift_range=0.2,
                             horizontal_flip=True, validation_split=0.2)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)
# fits the model on batches with real-time data augmentation
# (integer division so steps_per_epoch is a whole number of batches)
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
          steps_per_epoch=len(x_train) // 32, epochs=100)
```
mild dirge
#

It also uses validation sets, but it splits your training data into multiple folds
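
A minimal sketch of how k-fold cross-validation carves the training data into folds. The helper name is made up, and real implementations usually shuffle first, which is left out here to keep the output easy to follow:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each fold plays the validation set once while the rest is used for training, so a grid search with cross-validation never needs a separate fixed validation split.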

steady basalt
#

brainmon its gona work this time

loud cove
#

yeah but i didn't split it in one go into like 60/10/30 training/validation/test

mild dirge
#

Concerning the second point of your requirements, I do not know what is meant

#

I would ask for clarification if I were you

#

I can only guess at some stuff that could be wanted from you

loud cove
mild dirge
#

the cross validation grid search function

loud cove
loud cove
steady basalt
#

@mild dirge i need help with this augmentation give 5 mins to train first

loud cove
#

it was between that and random and i just went with this one.

#

honestly it is stupid, why ask me for pdf (I know in real life you'd need to summarize outside of your code, but just annoying).

mild dirge
#

That's just convention

#

For pretty much every one of my courses I have to write reports and send as pdfs too

loud cove
#

Yeah I'm just being lazy, I don't really care much ML career wise but just took it since it is free

steady basalt
#

fit generator broke my kernel

loud cove
#

just gluing together few things here and there and it works

steady basalt
#

92ms/step lemon_clown

mild dirge
#

I mean you can get pretty far gluing stuff together with ML, but when you don't fully understand what you are doing you might hit a brick wall at some point

steady basalt
#

I have 42

mild dirge
steady basalt
#

images

#

jesus

mild dirge
#

per image?

steady basalt
#

let me check

mild dirge
#

or batch

steady basalt
#

ok its

#

yeah

loud cove
steady basalt
#

1500 img

#

so thats 42 uhhh steps

#

per image

#

thats batch size?

mild dirge
#

ehhhh

steady basalt
#

didnt set anything

#

defaults 32...

mild dirge
#

if you didn't set a batch size it is probably some default value in pytorch

loud cove
mild dirge
#

or keras*

loud cove
#

wait nvm wrong file

steady basalt
#

32 is meant to be default
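
For reference, the step count shown during training is the number of batches per epoch, not a per-image figure. A small sketch of the arithmetic, assuming the default batch size of 32 mentioned above (the helper name is made up):

```python
import math


def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


steps = steps_per_epoch(1500)   # 1500 images at batch size 32
implied_low = 41 * 32 + 1       # smallest sample count that gives 42 steps
implied_high = 42 * 32          # largest sample count that gives 42 steps
```

By the same arithmetic, 42 steps at batch size 32 would correspond to somewhere between 1313 and 1344 samples, which is fewer than the full 1500.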

steady basalt
#

anything over 100ms/step would be so horrible for big data

#

i wonder if a tesla cud reduce to like 20

mild dirge
steady basalt
#

xd

#

nvidia tesla

loud cove
# mild dirge

oh it does that on your computer too? I thought mine just glitched

steady basalt
#

cruising on 13.5GB ram and 98% GPU usage

#

just from python3

#

shudda got 32gb

#

wow 14gb ram used

mild dirge
steady basalt
#

what causes it

mild dirge
#

what causes what?

steady basalt
#

just the calculations?

mild dirge
steady basalt
#

backprop is the main ram hog?

loud cove
#

I don't think 32 would make much difference

steady basalt
#

im getting close to my 16gb

#

2 left

loud cove
#

that's more than enough

steady basalt
#

its a tiny project haha

#

wow the laptops so hot now

loud cove
#

windows/os keeps a backup memory so your apps don't lag

steady basalt
#

I thought these new macs stayed cool

loud cove
loud cove
steady basalt
#

the air coming out the sides is cool

#

the keyboards hoy

mild dirge
#

Did you copy a lot?

steady basalt
#

hot

mild dirge
#

Could be an idea to reference it if you think you copied maybe a bit too much

loud cove
steady basalt
#

underside of the laptops BURNING hot

#

shud i cancel this

#

cnn

#

yikes

loud cove
mild dirge
#

high gpu usage is good though

#

you want the gpu to work hard

loud cove
mild dirge
#

So one thing that you did is choosing the model that gives the highest test accuracy @loud cove

loud cove
mild dirge
#

Which means you implicitly use the test data to make choices

loud cove
mild dirge
#

the testing data is only for testing your final model

#

Nothing about choosing your model/training/parameter-search etc. should include anything from your test data

loud cove
#

the closest thing to it is the log reg, but I think random forest would make more sense

loud cove
thin palm
#

anyone have a good understanding of a correlation matrix??
1.) Before plotting and calculating a correlation matrix, do I need to OHE some feature columns first?
2.) Right now my dataset only has 3 numeric columns; the rest are of type object. The 3rd numeric column is the target variable. So if I build my matrix it's going to be small, and how can I determine which feature is most important, highly correlated, etc. with really only 2 features?
3.) If I decide to OHE my object columns into numerical values, will this be valuable for my correlation matrix?

loud cove
#

but I'd have to split manually right?

mild dirge
#

It might be that your test data has some weird pattern in it and that random forest might be able to learn that very luckily

#

Which would make the final accuracy that you find not really a fair indicator of performance on new data

loud cove
steady basalt
#

Hot app is telling me its only 63C

#

yet it burns my hand

loud cove
#

different runs

mild dirge
#

Doesn't really matter which one shows up best, the point is that the test data shouldn't be used for this

#

validation data should have been used
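
A minimal sketch of the split being described here, using only the standard library. The function name and the 60/20/20 fractions are assumptions for illustration, not something from the project being reviewed:

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
# Compare candidate models on `val`; score only the chosen one on `test`.
print(len(train), len(val), len(test))
```

The point is that every choice (model family, hyperparameters) is made by looking at `val`, and `test` is touched once at the very end.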

loud cove
#

but random forests seems most reasonable with free text

mild dirge
#

And I would maybe place this plot at the data analysis part

#

Not the results

loud cove
steady basalt
loud cove
#

to keep all visuals in one place

#

the eda have the counts anyways

mild dirge
#

Don't know how deep this project goes, but you could go more in-depth on this

loud cove
mild dirge
#

Checking if there is a lot of overlap between words, more than the other two categories f.e.

loud cove
steady basalt
#

also u shud be getting AUROC its better than accuracy

loud cove
#

I could find the correlation, but it is a MOOC project with interest in ML so I didn't really dive deep

mild dirge
#

The word clouds at the bottom are pretty cool, but you don't say anything about them

loud cove
loud cove
mild dirge
#

Think all-in-all it looks pretty good, just be careful with using the test data for making decisions about the model. You could have a super interesting model and project with good results, but when you use the test data for training/choosing stuff, it could invalidate any results you get.

loud cove
steady basalt
#

ValueError: Shapes (None, None, None, None, None) and (None, 3) are incompatible

#

can someone whos good with TF

#

help me

#

b4 i sleep

mild dirge
#

you'd want the same distribution of categories in the test set as the dataset

steady basalt
#
```
     18           steps_per_epoch=len(x_train) / 32, epochs=100)
```
#

dont even get how this will work without validation data added

#

added it and still get value error

loud cove
mild dirge
#

Think it's called stratified

loud cove
#

im sorry if this is basic, i tried googling but no luck, what do you do with validation dataset exactly? I understand what its use is, but how is it different from cross validation functions?

mild dirge
#

train_test_split has an argument for it
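
In scikit-learn that argument is `stratify` (e.g. `train_test_split(X, y, stratify=y)`). To show what it does, here is a rough standard-library sketch of the idea; the helper name and the toy labels are made up for illustration:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * test_frac)
        test_idx += idxs[:cut]
        train_idx += idxs[cut:]
    return sorted(train_idx), sorted(test_idx)

# Toy data: 80% class "a", 20% class "b"
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] == "b" for i in test_idx), "of", len(test_idx), "test rows are class b")
```

Because each class is split separately, the test set keeps the same 80/20 class balance as the full dataset.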

loud cove
loud cove
#

oh nvm the shuffle is the one on by default

mild dirge
steady basalt
#

@loud cove where u from G

mild dirge
#

And cross validation splits training data into train/validation multiple times

#

Each time having a different slice of the data be validation and the rest train
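
Roughly, the slicing looks like this. It is a plain-Python sketch of k-fold index generation (the function name is an assumption; scikit-learn's `KFold` and `cross_val_score` do this for you, with shuffling options):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

You would train and score once per fold and average the scores, which is why it costs k times as much as a single hold-out validation set.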

loud cove
loud cove
mild dirge
#

they accomplish the same thing

#

cross validation is probably better as it uses all of your training data for training and for validation at some point

loud cove
#

yeah that's what i was confused about, that they wanted a standalone validation

mild dirge
#

But that also means you need to train/validate multiple times

steady basalt
#

Total params: 608,771
Trainable params: 607,939

mild dirge
#

And when it takes a few hours to train that might be something you are willing to give up on in order to get quicker results

loud cove
#

yeah like i did for that 1% accuracy 😛

steady basalt
#

ValueError: Shapes (None, None, None, None, None, None, None, None, None, None) and (None, 3) are incompatible

#

save me

mild dirge
loud cove
#

local?

mild dirge
#

yeah

steady basalt
#

why ERROR

#

AAAAAAA

mild dirge
steady basalt
#

whys my shape liek that

mild dirge
#

interesting tensor shape though

loud cove
mild dirge
#

that must be wrong

mild dirge
steady basalt
#

my y train is encoded

loud cove
#

cnn the channel?

steady basalt
#

dude i just printed and

#

printed shape again

#

and it changed

mild dirge
steady basalt
#

every time i press .shape

#

it adds a 3??

#

WAT

loud cove
#

might been an old thing i think

mild dirge
#

no clue srr

steady basalt
#

YES

#

YES

#

YES

#

ITS FIXED

#

AAAAAAAAAA

#

6 HOURS LATER

loud cove
#

on its own?

steady basalt
#

no i just reset my variables lol

mild dirge
#

I'm going to sleep rn, I wish you both best of luck with your projects, gn!

steady basalt
#

gnnn

loud cove
#

me too, thanks for the help.
have a good night guys!

molten bluff
#

hi all, it's been a while since I touched stats or probability. Can someone tell me what sampling each row in a dataset uniformly at random with probability p means?
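
One common reading of that phrase (an assumption, since the original wording could be meant differently) is that every row is kept independently with probability p, like flipping a biased coin per row. A small sketch with a hypothetical helper:

```python
import random

def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

rows = list(range(10_000))
sample = bernoulli_sample(rows, p=0.1, seed=42)
# Close to 1000 rows, but not exactly: the sample size is itself random
print(len(sample))
```

This differs from drawing a fixed number of rows: here only the expected sample size (p times the number of rows) is fixed.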

serene scaffold
wild pagoda
#

hey everyone, when i'm running this:

        data_frame["lap"] = data_frame["lap"].astype(int)

I got this error because there is null data, how do i ignore null data?

#

i tried to add ignore, but it doesn't change dtype float to int
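
One possible way around this, assuming the column only holds whole numbers plus missing values, is pandas' nullable integer dtype. The toy frame below is made up; only the `lap` column name comes from the question:

```python
import pandas as pd

data_frame = pd.DataFrame({"lap": [1.0, 2.0, None, 4.0]})
# "Int64" (capital I) is pandas' nullable integer dtype; missing values become <NA>
data_frame["lap"] = data_frame["lap"].astype("Int64")
print(data_frame["lap"].dtype)
print(data_frame)
```

The plain `int` dtype cannot represent a missing value, which is why `astype(int)` raises when nulls are present.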

clever owl
#

if i have a list [1,2,3,4] is 1 the head or the tail?

clever owl
#

and is that the beginning or the end of the list

#

the beginning right?

wild pagoda
#

beginning

clever owl
#

cool haha ty

wild pagoda
#

for example you can use list pop

wild pagoda
#

how do i convert "" to null boolean in pandas?

rose agate
#

or one col like df.loc[df['col'] == '', 'col'] = np.nan

wild pagoda
#

not null

rose agate
wild pagoda
rose agate
# wild pagoda nope too, it automaticly convert to false

You'll probably need to read the documentation for that method then. This thread also seems to have solutions that may help you: https://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values

flat fern
wild pagoda
wild pagoda
thin palm
#

Any ideas of creating correlations and heatmaps of categorical variables?

coral nimbus
#

Anyone tried web scraping with python before

#

I tried web scraping with the HTML element, but encountered problems

#

I'm trying to scrape a live updating number(my code executes every 1 min to get the updated number), but my code doesnt work for some reason

odd meteor
coral nimbus
#

Bs4

#

Lemme run it again

coral nimbus
#

"TypeError: nonetype object is not subscriptable"

copper jetty
#

Good day... Am new here

copper jetty
#

I did all the setup but I am getting an error whenever I try to use it to scrape a Javascript heavy site

thorn halo
#

There’s a game called 8 ball pool(a billiards game developed by miniclip you all must be knowing about it)
I am developing a program that predicts the trajectories of the billiard balls, including where they will go after a collision. Many apps like this exist, and they charge a heavy price for it. They simulate collisions the same way the game does. I am not sure how they got the parameters (mostly trial and error, I guess), but the result is perfect.

My question is that which physics engine should I use for simulating collision of balls ?
I have fair knowledge in c++ and python.
The link below shows what I am trying to do :

https://stackoverflow.com/questions/69155760/how-to-get-direction-after-cueball-collide-with-any-ball-in-unity3d-like-in-8-b?noredirect=1&lq=1

odd meteor
odd meteor
odd meteor
copper jetty
#

So while trying to look for a solution to my problem, I came across the Playwright integration for Scrapy

#

I tried it but got nowhere

#

So which tool(s) would you advise me to use when it comes to web scraping?

#

Especially Dynamic websites

copper jetty
coral nimbus
#

@odd meteor here you go

#
```py
import bs4
import requests
from bs4 import BeautifulSoup

# may need to change variable according to previous components
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = 'https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch'
r = requests.get(url)

web_content = bs4.BeautifulSoup(r.text, 'xml')
elements = web_content.find("div", {'class': "D(ib) Mend(20px)"})

print(r.text)
```
#

essentially what i'm trying to do is

#

parse the current stock price(VAXX) from this url

#

I picked the correct div(correct me if i'm wrong) but for some reason it keeps showing 4.37 despite the current price being something else

#

so I changed it to print(r.text) and did some digging, apparently the parsed value is hard set to 4.37

#

i'm thinking of scraping another website instead, looks like yfinance really doesnt like to be scraped

main gorge
#

Did anyone know how we open security port open from Airflow web server to GI EMR, camel Ec2 and S3 buckets using python

odd meteor
# coral nimbus ```py import bs4 import requests from bs4 import BeautifulSoup #may need to cha...

First, confirm what you're trying to scrape from yfinance isn't disallowed. You can do that by checking the robots.txt file of the website.

https://finance.yahoo.com/robots.txt

If what you're trying to scrape isn't prohibited, and you're sure you are selecting the right tag and attributes where the information you wanna scrape is, then perhaps try using another parser. You could try html.parser or lxml

I'm not on pc at the moment, so unfortunately I can't inspect the HTML of the url you just sent at this time.

However, try this

```py
import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch"
page = requests.get(url)
# BeautifulSoup needs the HTML text, not the Response object
soup = BeautifulSoup(page.text, 'lxml')
# find_all returns a list of matching tags, so the comprehension below works
prices = soup.find_all('You most likely need to tweak this area to include the parent tag before the div tag ')

result = [price.text for price in prices]
result
```
coral nimbus
#

wait what do you mean by including the parent tag before the div tag?

odd meteor
#

Unfortunately, I can’t do much since I'm on my mobile phone. But it should work if you're picking the right tag and attribute where the price data you're trying to scrape is.

coral nimbus
#

thanks a lot

#

all hail @odd meteor

#

<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" data-test="qsp-price" data-field="regularMarketPrice" data-trend="none" data-pricehint="4" value="4.09" active="">4.0900</fin-streamer>

#

it looks something like this

#

no id tag, nothing
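
Even without an id, the tag can be matched on its other attributes. Here is a small sketch using only the standard library's `html.parser` on the exact snippet above; the `PriceFinder` class name is made up, and whether the live page still serves this markup to `requests` is an assumption. With bs4 the equivalent lookup would be something like `soup.find("fin-streamer", {"data-field": "regularMarketPrice"})`.

```python
from html.parser import HTMLParser

class PriceFinder(HTMLParser):
    """Collect the `value` attribute of the regularMarketPrice fin-streamer tag."""

    def __init__(self):
        super().__init__()
        self.price = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "fin-streamer" and attrs.get("data-field") == "regularMarketPrice":
            self.price = float(attrs["value"])

snippet = ('<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" '
           'data-test="qsp-price" data-field="regularMarketPrice" data-trend="none" '
           'data-pricehint="4" value="4.09" active="">4.0900</fin-streamer>')

finder = PriceFinder()
finder.feed(snippet)
print(finder.price)
```

Filtering on `data-symbol` as well would be a reasonable extra guard if a page carries prices for more than one ticker.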

odd meteor
coral nimbus
#

ohhhhhhhhh

odd meteor
mint palm
#

how to track very slight motions ??
this above image shows how a point was fixed then vector was obtained by comparing frames

#

i have the frames

#

but how do i apply this vector approach

#

has anyone done something similar before

coral nimbus
#

cool

#

is this some face recognition stuff

past lion
#

Im currently doing a project for uni that ive chosen to analyze the accuracy of simulated predictions using a range of models for the stock market. Im wondering if arima would even be the right option to go down? Ive so far simulated with GBM, was looking into ARIMA & GARCH as well? Anyone have an opinion on what would be most suitable.

rose agate
coral nimbus
#
```py
import requests
from bs4 import BeautifulSoup
!pip install google
try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")

# to search
query = "investing.com aapl"

for j in search(query, tld="co.in", num=1, stop=1, pause=1):
    print(j)
url = j
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.find('div', {'class': 'instrument-price_instrument-price__3uw25 flex items-end flex-wrap font-bold'}).find_all('span')[0].text

print(prices)
```
#

@odd meteor I was able to solve the problem by google searching using python then doing it the traditional way

#

as I intend the query part to be a concat of investing.com + 'TICKER', with the TICKER being parsed from another excel sheet I programmed for optimal stock picking

#

now I need to make a for loop to execute this code per minute and to make an array to store this information, any ideas?
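
A possible shape for that loop, as a sketch only. `poll` and `fetch_price` are hypothetical names, and the fake price source below stands in for the scraping code so the example runs offline:

```python
import time
from datetime import datetime

def poll(fetch_price, n_samples, interval_s=60.0, sleep=time.sleep):
    """Call fetch_price() n_samples times, waiting interval_s between calls."""
    history = []
    for i in range(n_samples):
        history.append((datetime.now(), fetch_price()))
        if i < n_samples - 1:          # no need to wait after the last sample
            sleep(interval_s)
    return history

# Hypothetical stand-in for the scraping code; replace with a real fetch.
fake_prices = iter([4.09, 4.10, 4.08])
history = poll(lambda: next(fake_prices), n_samples=3, interval_s=0.01)
print([price for _, price in history])
```

Storing `(timestamp, price)` tuples in a list keeps it simple; the list can be turned into a DataFrame later if needed.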

hollow flare
#

Hi

#

One Question for all 😅

"Skills" or "Degree"

coral nimbus
#

skills

#

hard skills

wooden sail
#

depends what you wanna do/where you apply. some jobs will also require you have some cardboard

thorn halo
#

There’s a game called 8 ball pool(a billiards game developed by miniclip you all must be knowing about it)
I am developing a program that predicts the trajectories of the billiard balls, including where they will go after a collision. Many apps like this exist, and they charge a heavy price for it. They simulate collisions the same way the game does. I am not sure how they got the parameters (mostly trial and error, I guess), but the result is perfect.

My question is that which physics engine should I use for simulating collision of balls ?
I have fair knowledge in c++ and python.
The link below shows what I am trying to do :

https://stackoverflow.com/questions/69155760/how-to-get-direction-after-cueball-collide-with-any-ball-in-unity3d-like-in-8-b?noredirect=1&lq=1

coral nimbus
shut phoenix
#

Difference between classifying and regression?

thin palm
#

Anybody have an idea on how to make a correlation matrix with mixed column types such as objects and ints?

thorn halo
vivid jasper
#

Howdy -- trying to bulk change data types in pandas for every column after the 15th to float from object -- is there an easy or preferred way to do this? panda newbie
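
One way this could be done, sketched on a made-up frame (the column names and values are assumptions). It selects columns by position and converts them with `pd.to_numeric`, turning anything unparseable into NaN:

```python
import pandas as pd

# Hypothetical frame: 15 leading columns to leave alone, then 3 more object columns
df = pd.DataFrame({f"c{i}": ["1.5", "2", "x"] for i in range(18)})

cols = df.columns[15:]  # every column after the 15th
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")  # unparseable -> NaN
print(df.dtypes.tail(4))
```

If the values are known to be clean, `df[cols].astype(float)` is a stricter alternative that raises on bad values instead of silently producing NaN.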

barren wedge
#

Hey guys... I'm quite new to AI. Like, I know some theory, but I haven't done anything practical. All the projects related to AI seem boring, like I don't want to make AI to predict salaries and stuff. Anyone got any interesting ideas?

barren wedge
#

Good Idea... Thanks 👍

vagrant pilot
#

i wanna learn neural networks and more ai stuff how can i start?

robust jungle
steady basalt
worthy phoenix
#

is maths necessary to use tensorflow or any other machine learning frameworks in that sense?

serene scaffold
karmic flicker
#

Have you guys ever seen two different matplotlib plots overlap eachother even though they are on different plots

worthy phoenix
serene scaffold
worthy phoenix
#

seems ok to me , thanks

odd meteor
# worthy phoenix seems ok to me , thanks

Welcome to the gang 💪🏿💪🏿. Now, you can easily convince others that ML isn't so hard as some people perceive it to be.

Thanks to Mr. Sterlerlock 🙌🙌

spring marsh
#

Hey everyone, can someone tell me where I can learn opencv for machine learning? I already know machine learning and am looking for a small course to get me started with computer vision.

spring marsh
tacit basin
limber kelp
#

Guys, is it possible to download the 'wrangled' excel file after it has been wrangled using python?

I want to use it in Tableau as Tableau doesn't have good wrangling options.

I want to wrangle it in python and then use it in tableau for visualization.

tacit basin
brave sand
#

Do you guys train your networks on a home gpu?

plush glacier
plush glacier
#

i have a 1660super

#

so cloud gpu is a lot faster but i would have to upload my data there and i don't have the best upload speed (it isn't slow at all but for things like 30gb it will take 2 hours to upload the data)

#

@brave sand i just recommend for smaller things to try it out on your own device if possible (and it doesn't take way too long for you) and later maybe go to cloud if you think you will need it

brave sand
#

internship is about reinforcement learning

mild dirge
#

What kinda network are you running that your vram is limiting on a 3070? @brave sand

brave sand
mild dirge
#

Depends on the task (and also the network)

brave sand
iron basalt
#

I'm assuming deep learning is used*

#

(backprop)

brave sand
iron basalt
#

Depending on what you are doing you may have one agent per GPU...

#

Multi-agent methods get amazing results, but their main downside is that you have to run multiple agents.

#

So either you have really simple agents, or a lot of GPUs.

brave sand
iron basalt
#

If you are willing to not rely on pytorch, etc, and can make your own system like it, or use non-deep learning methods, then you can buy the much cheaper AMD GPUs.

#

But the tradeoff is a lot of work and learning.

brave sand
#

AMD Gpus? I thought they were a pain to deal with for ml

iron basalt
#

They are because the big libraries for deep learning don't support them.

#

No CUDA.

#

And also because AMD used to not really care about ML.

#

Nvidia has more or less a monopoly on deep learning at this time.

#

(Unless you are willing to take the hard route that most can't be bothered with (someone has to make the pytorch support for AMD GPUs))

iron basalt
#

Roboticists love this.

thin palm
#

Is this still considered a normal distribution?

#

Reason I ask is because I'm getting ready to OHE a categorical column and I need to define 3 different types, 1 for entry level, 2 for mid level, and 3 for senior. I was hoping to use my distribution graph to get an idea of how many n types we'd need but i'm not sure.

tidal bough
#

Really doesn't look like one to me. I'm not sure what's up with the oscillation, but the distribution is otherwise pretty much uniform (which is IMO a very unexpected result for "years of experience" in basically any dataset. Maybe it was intentionally normalized that way?..)

thin palm
#

So if I want to one-hot encode based on years-of-experience tiers (Entry-level, Intermediate, Mid-level, Senior or executive-level), 4 total, how would I do this? Right now my column is categorical

desert oar
#

it sounds like you don't need it to be gaussian

#

you are talking about binning your data

#

there are several ways to do this, all of them essentially arbitrary
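
For illustration, one such arbitrary choice in plain Python. The cut points (2, 5, 10 years) and tier labels are assumptions picked for the example, not values from the dataset being discussed; in pandas, `pd.cut` followed by `pd.get_dummies` is the usual route:

```python
from bisect import bisect_right

# Hypothetical cut points in years; where to put them is a judgment call
EDGES = [2, 5, 10]
TIERS = ["entry", "intermediate", "mid", "senior"]

def tier(years):
    """Map years of experience to a tier label."""
    return TIERS[bisect_right(EDGES, years)]

def one_hot(label):
    """One 0/1 column per tier."""
    return [int(label == t) for t in TIERS]

for years in (0, 2, 7, 24):
    print(years, tier(years), one_hot(tier(years)))
```

Moving an edge by a year changes which rows land in which tier, which is the sense in which any binning scheme is arbitrary.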

#

do you need to use these 4 categories for the assignment?

#

that density plot is really weird

thin palm
desert oar
#

this looks like it's more or less uniformly distributed based on that

thin palm
#

so my other thought was it's an important column I don't want to drop it

desert oar
#

you might however want to use attributes of the company as input features

#

total number of employees, etc.

#

i don't see any reason to discretize a feature like "years of experience" unless you need to or can think of a specific good reason to

thin palm
#

well the reason why is because the companyId probably represents the company name, and obviously Apple would pay more than other companies.

thin palm
#

But I'm having a hard time one hot encoding this based off a condition, not sure what to do?

desert oar
thin palm
#

There's 63 unique values

desert oar
#

i think you are overthinking whatever this is that you're doing

thin palm
desert oar
#

no, sorry

#

you have 63 unique companies in the data, what does that have to do with binning years of experience?

thin palm
desert oar
#

i wouldnt consider company id a feature though

thin palm
thin palm
desert oar
#

i would use it to incorporate attributes about the company into your model

#

but if you use company id in the model you can't make predictions about other companies

thin palm
#

the goal is to predict salaries

#

using the companyID can help identify which companies pay a bit more for salary... I just dont know how to OHE since it has so many features

desert oar
#

what if you need to predict salary for an employee at a company that isn't in the training set?

fallen inlet
#

Is there any good resource for python data science?

thin palm
desert oar
thin palm
thin palm
#

Do I need to transform features into normal distributions?

thin palm
#

I don't think so actually, just the machine learning models that require scaling.

steady basalt
#

I don’t think they’ll care

#

Just encode company

pearl radish
#

Hey everyone, I have a "tf.contrib" error when testing the object detection API.
So I ran the upgrade scripts on my project directory and there were no issues or errors.

I then tested the object detection API installation and I still got the error about "tf.contrib".
I then ran the upgrade scripts on the main directory where the error seems to come from and it was successful. But when I tested the installation, I still got the same "tf.contrib" error.
Is there anything else I can do?

wooden sail
#

it seems the issue is the tensorflow version. you'd need a version of tensorflow starting with 1.x.x

#

try installing an older TF version in an environment

mint palm
dim crypt
pearl radish
#

@wooden sail sorry, the upgrade scripts kinda confuse me.
To be clear, should I uninstall TF v2.8.0 and install TF v1.15 before running the upgrade scripts?
And on which directory should I run the upgrade scripts?

wooden sail
tacit basin
pearl radish
#

@wooden sail thanks

loud cove
mild dirge
mild dirge
wooden sail
#

normally one stores the model and its parameters in a way that it can be used easily, e.g. by containerizing it or simply keeping it around for yourself. the idea is that the network is simply an architecture, and the parameters you trained are specific to the data you used for training. if the training went well, you can now use the parameters only for inference without having to worry too much about the inference being correct

#

the network itself makes a prediction. the training step tuned the params so that the predictions become accurate. so once it's trained, save the parameters and trust the inference. as an example, once the params are stored, you can trust your network to detect faces in images

#

but yeah, look into deployment as pccamel suggests. containerization is just one approach

mild dirge
#

The only time we had to save a model, we indeed just saved the parameters and loaded it into the program that controlled a robot, and then the model was just called every iteration to check if the camera had spotted any objects

#

Most of the times we just make a model and use some test data to see how well it would work for new data

wooden sail
#

that sounds about right. since the params are a thing of their own independent of implementation, you can just as easily, say, do the training with tensorflow, then make a standalone implementation of a forward pass of the network in c++ that reads the resulting parameters and infers blazingly fast
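
A toy sketch of that separation, in pure Python rather than c++ to keep it short. The 2-3-1 network, its weights, and the JSON format are all made up for illustration; a real export would come from the training framework:

```python
import json
import math

# Hypothetical trained parameters of a tiny 2-3-1 network (values made up)
params = {
    "W1": [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], "b1": [0.0, 0.1, -0.1],
    "W2": [[1.0, -1.0, 0.5]], "b2": [0.2],
}
saved = json.dumps(params)  # what the training side would write to disk

def dense(W, b, x):
    """Affine layer: W @ x + b, written out with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(p, x):
    """ReLU hidden layer followed by a sigmoid output."""
    h = [max(0.0, v) for v in dense(p["W1"], p["b1"], x)]
    (z,) = dense(p["W2"], p["b2"], h)
    return 1.0 / (1.0 + math.exp(-z))

loaded = json.loads(saved)  # the inference side only needs the numbers
print(forward(loaded, [1.0, 2.0]))
```

Nothing in `forward` depends on the library that produced the weights, which is what makes this kind of reimplementation possible.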

#

so you really have a lot of flexibility in deployment

loud cove
loud cove
mild dirge
loud cove
#

the heck is AI lol

mild dirge
#

is undergrad the same as bachelor?

loud cove
#

yeah

mild dirge
#

yeah AI then

#

You know what AI is right?

loud cove
#

what's that?

mild dirge
#

artificial intelligence

loud cove
#

i know ai but i don't know bsc in ai

#

it is probably just a new trendy thing unis do now

mild dirge
#

Well it teaches stuff about AI

loud cove
#

is it mostly stats? is it mostly cs?

mild dirge
#

yes

#

that

wooden sail
#

AI is a buzzword anyways. you can boil it down to a few math competencies and optimization for targetted applications

loud cove
#

yeah is it closer to stats or CS though?

mild dirge
#

and psychology, biology a bit

#

jack of all trades

#

Lots of elective courses, I chose mostly cs courses

wooden sail
#

AI falls both under stats and CS separately. also overlaps with so-called "signal processing"

loud cove
#

I know but im asking about his degree

#

what faculty provides it?

mild dirge
#

We could choose what we were interested in, I chose mostly cs courses, and robotics

#

science and engineering

loud cove
#

most unis are just capitalizing on the buzzword

loud cove
mild dirge
#

It has existed for 20-ish years at my uni, I believe

loud cove
#

yeah those existed for a while, but it gained a lot of traction in the past few years.

mild dirge
#

But yeah AI just means its a bit more broad than just machine learning

loud cove
#

yea im familiar

#

but it is mostly just buzzwords

#

once you have an optimization model or program then that isn't AI, that's just math and algorithms.

mild dirge
#

hmm

loud cove
#

but AI is a cooler way to say it.

mild dirge
#

AI is just a name for a field of research with many different sub-fields

loud cove
#

yea

mild dirge
#

It's become a buzz-word lately

#

But it has already existed for a long time

loud cove
#

yea, I know someone from here in Egypt working in AI since the 90s

#

so I'd imagine it was more common in the west.

compact rose
#

Hello guys, hope you guys are having a nice day! I need help with pyspark and I don't know how to solve this. I want to merge duplicated rows but keep the values that were in them. In the first image, you can see the dataframe I'm working on. I'm using this block of code, but it isn't working

loud cove
#

wouldn't the max return one result only anyways?

#

oh nvm im dumb

loud cove
compact rose
#

I didn't, i am gonne try, but i think it wont work! w8 a sec

#

didn't work :/

loud cove
#

have you tried doing max for all the columns and then renaming them after? might be easier and cleaner.

compact rose
#

yeah, i was basing in that!

#

Just tried and nothing xD

loud cove
#

i think you can max everything instead of writing all this and just rename after

#

try agg(max(*)) and see what it gets you.

compact rose
#

I just made it work. I just swapped Python's built-in max for the one from pyspark.sql.functions (f.max) and it worked!

#

thanks anyway !

civic stone
celest vine
#

Hi

#

Does anyone know how to speed up pandas iteration?

#

I have a column which contains image URLs and I am downloading all the images using requests.
Right now it downloads around 12000 to 13000 images in an hour.
Is there a way to speed this up?

#

I have 50 Mbps. It is good enough, right?

tidal bough
loud cove
#

and no, it depends on the size of each image and the total number of them.
many small files will take longer than a few large ones.

celest vine
loud cove
tidal bough
celest vine
loud cove
#

all these images will download in less than one second

celest vine
loud cove
#

you can do multithreading and it will increase them, but be careful not to get black listed
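
A minimal sketch of the multithreading idea with the standard library. `download_all` and `fake_fetch` are hypothetical names; the fake fetch stands in for a real request so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a small thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# Hypothetical stand-in; a real fetch might be
# lambda u: requests.get(u, timeout=10).content
def fake_fetch(url):
    return f"bytes of {url}"

urls = [f"https://example.com/img_{i}.jpg" for i in range(5)]
results = download_all(urls, fake_fetch, max_workers=2)
print(results[0])
```

Threads help here because the work is mostly waiting on the network. Keeping `max_workers` small is one way to stay polite to the server and lower the risk of being blocked.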

#

what's the source of these files?

#

and how many files are there? they might have a better option to do this.

celest vine
loud cove
celest vine
loud cove
#

yea

celest vine
loud cove
#

yeah your best bet is multi processing, but you'd probably be ip banned for spamming requests

arctic wedgeBOT
supple scroll
#

is there any reason why activation functions are just the normal sigmoid/relu/etc, and not something like this?

mild dirge
tidal bough
#

this looks like a combination of 4 parts, each of them in the form a (x-b)^k, so 3 parameters per part. Calculating gradients with regards to each parameter is going to be annoying.
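
To make that concrete, taking one part to be $f(x) = a\,(x-b)^k$ as guessed above (the exact form of the plotted function is an assumption), the partial derivatives a training step would need are:

```latex
\begin{aligned}
\frac{\partial f}{\partial a} &= (x-b)^{k}, \\
\frac{\partial f}{\partial b} &= -a\,k\,(x-b)^{k-1}, \\
\frac{\partial f}{\partial k} &= a\,(x-b)^{k}\,\ln(x-b) \quad (x > b), \\
\frac{\partial f}{\partial x} &= a\,k\,(x-b)^{k-1}.
\end{aligned}
```

The logarithm in the $k$ derivative is only defined for $x > b$, and each piece needs its own set of these, compared with ReLU where the derivative is just 0 or 1.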

#

I'd guess that people studied complex activation functions like that and they weren't any better than the ones normally used while being harder to compute. No sources for that claim, though.

supple scroll
#

what about neurons that take in two inputs and give one output? you could do something weird like have a functioning xor gate with only 3 neurons
something like this:

mild dirge
#

Not sure about that specific example, but a pretty major reason why it is as it is now is because we can easily apply matrix multiplication

#

Which is heavily optimized

serene scaffold
mild dirge
#

not really sure how much of the computation time is dependent on the activation function normally

wooden sail
#

that depends on how difficult it is to differentiate it, i.e. do the backwards passes

#

stuff like the derivative being bounded, and being bounded by a small constant at that, provides convergence guarantees and affects how large the learning rate can be while remaining stable

#

there's also a paper discussing the optimality of relu and similar piecewise spline-like functions, but i can't for the life of me remember the title. it must have been in ICASSP 2021 or 2020. but at any rate, any deep enough network should fall in the scope of the universal approximation theorem, so may as well go with well behaved functions

serene scaffold
wooden sail
#

i've been looking for like 10 minutes haha, i'll keep trying

#

oof i can't find it, sorry. assume i'm misremembering, or look around yourself...

rich fiber
#

Hello, I want to pursue AI and started my journey by learning the Fundamentals of Python, now I'm confused on what should I do next as I tried learning OpenCV but had too much trouble due to the background knowledge required.
Would love to hear your suggestions or journey in AI.

wooden sail
#

i would say that if you seriously intend on working in/with AI in the long term, you need to learn those things. a minimum background in linear algebra, stats, and multivar calc is necessary if you want to understand what you're doing.

#

if you only plan on using APIs and networks other people have designed and possibly trained, you don't need that. but that also limits your options

#

you can alternatively start with classical signal processing/image processing, which is the same stuff i mentioned above but directly looking at its applications

iron basalt
steady basalt
#

i just failed a quiz on gaussian elimination for matrices

#

doesnt stop me >:D

wooden sail
#

knowing how to do gaussian elimination by hand is not a big deal. knowing that elementary row and column operations are rank- and solution-preserving, on the other hand, is important

steady basalt
#

I think that as long as u know the omega basics of the math behind ml, only stats matters anymore

#

I mean, I’m sure everyone knows how to multiply matrices bro

wooden sail
#

sadly you can't cleanly separate the two in multivariate statistics, since you'll be looking at a LOT of covariance and correlation matrices, their rank, and their related spaces

steady basalt
#

I think that beyond what’s needed for backpropagation calculus gets less important

#

And more so general stats

wooden sail
#

if you just wanna use stuff and not make new theoretical results, yes, for sure

#

and honestly that is the case for the vast majority of people

steady basalt
#

I do not think 99% of data scientists are trying to invent the wheel

#

Yes

#

Myself included

#

If I wanted to do that I’d go for a PhD in ml

wooden sail
#

i might be a little out of touch with the real world 😛

steady basalt
#

Even the phds I know have not done so but rather applied it for research

#

The models we have already are fine

wooden sail
#

they're "fine" in that they work. they're "not fine" in that there aren't all that many results providing performance and convergence guarantees

steady basalt
#

Theoretical ml math is like for the actual gods and math nerds. We need them and I appreciate their work but it’s definitely only a tiny fraction of engineers

wooden sail
#

so you go off on a limb trusting your training and validation

#

small percentage of engineers and (applied) mathematicians, yeah.

steady basalt
#

I can’t imagine how hard it was to code tensorflow from scratch

#

Or come up with certain models

#

It’s way beyond my own iq limit

wooden sail
#

the core models are old, i would argue the basics of that is not all that difficult. the stuff is decades old, we just lacked the computational power

steady basalt
#

It’s for the most creative bunch, I just learn and use their creations

#

Read?

thin palm
thin palm
#

Would anybody ordinal encode degree types? Such as High School, Bachelors, Masters, and PHD? A colleague of mine mentioned to OHE instead but I'm unsure

thin palm
#

perfect, was going to do this anyway. Thanks mate

#

Does "None" for degree obtained mean missing values, or just that no degree was obtained? Originally I thought it meant no degree was obtained, but another colleague is saying it's null values

thin palm
# steady basalt Yes

How about categorical features that say "None" for degree awarded, does this mean missing data or just none obtained? But now looking at major awarded, some of the values have "None" even though a degree was awarded

steady basalt
#

None is a category

#

Maybe not in the case of grades

thin palm
#

because you can't graduate with no Major

main fox
thin palm
main fox
#

Personally, I tend to not try and ordinal encode. It imposes an assumption of equal distance between points that might not be true.
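A minimal sketch of the difference, assuming a hypothetical `degree` column with these four levels (the data and names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; the column name "degree" and its levels are assumptions.
df = pd.DataFrame({"degree": ["High School", "PhD", "Bachelors", "Masters", "Bachelors"]})

# Ordinal encoding: a single integer column, which implies evenly spaced levels.
levels = ["High School", "Bachelors", "Masters", "PhD"]
ordinal = pd.Categorical(df["degree"], categories=levels, ordered=True).codes

# One-hot encoding: one indicator column per level, no ordering or spacing assumed.
one_hot = pd.get_dummies(df["degree"])

print(ordinal.tolist())
print(one_hot.columns.tolist())
```

The ordinal version tells the model that PhD minus Masters equals Masters minus Bachelors, which may or may not hold for salary. The one-hot version lets each level get its own effect, at the cost of more columns.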

thin palm
main fox
#

You're welcome.

void granite
thin palm
void granite
#

oh ok

#

🙂

thin palm
void granite
#

I hope the job interview goes well!

thin palm
thin palm
void granite
#

df.corr() 😉

#

assuming a pandas dataframe

thin palm
#

may just do a table instead of viz

void granite
#

it's a lot of features

thin palm
void granite
#

maybe there's a subset you can look at

#

or you could use some type of dimensionality reduction

#

like MDS or PCA

thin palm
#

Yeah, but I'm not sure on that. I might just make 2 models... one dropping the big features and one with it

#

is it okay to have a correlation of .384 between degree and salary? salary is what we're predicting

main fox
thin palm
#

```
# corr_df is assumed to be a long-format correlation table built earlier,
# e.g. corr_df = df.corr().stack().reset_index()
corr_df.columns = ['feature_1', 'feature_2', 'correlation']  # rename columns
corr_df.sort_values(by="correlation", ascending=False, inplace=True)  # sort by correlation
corr_df = corr_df[corr_df['feature_1'] != corr_df['feature_2']]  # remove self-correlation
# corr_df[(corr_df['correlation'] >= 0.5) | (corr_df['correlation'] <= -0.5)]
corr_df
```
main fox
#

Then yeah drop correlations close to zero I suppose.

thin palm
#

not sure if that's a good thing or bad

main fox
#

It's relative to the data

thin palm
thin palm
#

If I have 93 features and 62 of them are fairly independent, does that mean we just keep them?

void granite
#

is that r or r^2 ?

#

I guess it's R if you are saying it's correlation

void granite
#

hm hm

#

are there a bunch of features bc there's a bunch of careers in there?

thin palm
void granite
#

yeah

thin palm
void granite
#

years experience + degree is probably pretty good then 🙂

thin palm
#

I thought about dropping this to see if it made a difference

void granite
#

I bet years experience + degree + field is even better

#

but maybe just the two are enough

thin palm
void granite
#

sorry typos

thin palm
#

and that dang companyID has forced me to make 63 extra columns... but the reason I kept it is because company12 (could be apple for example) is known to pay better than company55(which could be HP)

#

does that make sense @void granite ?

void granite
#

yeah just pick a few of the higher ones for the model

thin palm
void granite
#

I mean, it's -a- way to do it, if you just want something quick and dirty. there's a lot of advanced techniques you can use if you want to be formal about winnowing down predictors, but uh... it's been over five years since I was doing any sort of advanced statistical work 😄

thin palm
void granite
#

if you were being fancy about it, you would pick some sort of criteria for evaluating models, make a bunch of models, and then rate them by that criteria... maybe you would also have some parsimony there (preferring models with a lesser number of variables)

#

there's like entire fields of study about this problem heh

#

and there is, imho, a sense in which it's an art in addition to a science

#

actually, thinking about it

#

you want companyID to be a factor, like a categorical data type

#

"company" should just be one column, in other words
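Roughly what that looks like in pandas, as a sketch with made-up data (`companyId` and `salary` are assumed column names, not the real dataset):

```python
import pandas as pd

# Hypothetical data; "companyId" and "salary" are assumed column names.
df = pd.DataFrame({
    "companyId": ["COMP12", "COMP55", "COMP12", "COMP7"],
    "salary": [150, 90, 160, 110],
})

# Keep the company as a single categorical column instead of many indicator columns.
df["companyId"] = df["companyId"].astype("category")

# Per-company mean salary shows which companies pay better without widening the frame.
mean_by_company = df.groupby("companyId", observed=True)["salary"].mean()

print(df.dtypes)
print(mean_by_company)
```

Some modelling libraries can consume a categorical column directly, while others still need it encoded before fitting, so whether this saves you the extra columns depends on the model you pick.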

fallow herald
#

good morning

#

I am new to pandas, is there a sort of 'mask' functionality?

#

Lets say I have a data frame as in the image above

#

would it be possible to have a mask in a Series and run it over the dataframe, returning me all rows matching the mask?

#

what would be the best approach when I have such a task?
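For reference, the usual pandas pattern for this is boolean indexing. A minimal sketch, assuming a hypothetical frame (the column names here are invented, since the image isn't available):

```python
import pandas as pd

# Hypothetical frame; the column names are assumptions.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [1, 5, 3, 8]})

# A comparison on a column yields a boolean Series aligned to the frame's index.
mask = df["value"] > 2

# Indexing with the mask keeps only the rows where it is True.
selected = df[mask]  # equivalent to df.loc[mask]

print(selected)
```

Masks can be combined with `&`, `|` and `~`, with each condition wrapped in parentheses.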

hybrid mica
#

why is the favicon.ico for colab in a different colour, for each of these notebooks?

tacit basin
wooden forge
#

Hey guys, so I have this plot, and I'd like to make a Gaussian fit. But I don't really know how, so would anyone know by any chance how to do it properly?

#

(the pit in p=0 is totally normal btw that's exactly what I was trying to get)

tidal bough
#

In theory you could instead directly minimize the mean squared error, or perhaps maximize the likelihood, via something like scipy.optimize. But I think the result will be very close, and the latter approach is more complicated and computationally expensive.

wooden forge
#

mmh

#

the mean would be sigma and the std mu ?

tidal bough
#

the opposite

wooden forge
#

hoo

#

let me check whats the std lol

#

I'm not used to english terms

tidal bough
#

!docs numpy.std can be used, say

arctic wedgeBOT
#

```
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)
```
Compute the standard deviation along the specified axis.

Returns the standard deviation, a measure of the spread of a distribution,
of the array elements. The standard deviation is computed for the
flattened array by default, otherwise over the specified axis.
wooden forge
#

haaaaaaa

#

Okay I know what that is

#

thanks !

wooden forge
wooden sail
#

if you have the data from which you produced that PDF, you could find the mean and std from there. if you only have the plot you showed, you have to find the parameters of a gaussian curve directly

wooden forge
wooden sail
#

an np array of what, though?

wooden forge
#

but I just decided to look directly on the graph to get the value and then get sigma from that

tidal bough
#

a numpy array of what? the original data you used to calculate the histogram?

wooden forge
wooden sail
#

if you have only the samples of the plot you showed, you'll have to do a least squares fit or something similar, and finding the mean and variance won't work

tidal bough
wooden forge
#

yeah I find 66

tidal bough
#

hmm, how did you plot the gaussian?

wooden forge
#

I just look at the central value, calculate sigma from that and then plot a gaussian

#

worked fine lol

#

idk why the method you gave me didn't

wooden sail
#

that doesn't look like all that good a fit

tidal bough
#

the sigma is clearly slightly off (lower than it should be)

wooden forge
#

Yeah I know

somber prism
#

guys i loaded a dataset using tf.data.Dataset.from_tensor_slices and my map function is like this
```
def preprocessing(x, img_path):
    print(dir(x))
    name1 = str(x[0].numpy())
    name2 = str(x[3].numpy())
    num1 = str(x[1].numpy())
    num2 = str(x[2].numpy())
    target = float(x[4])

    img_name1 = f'{name1}_{add_zeros(num1)}.jpg'
    img_name2 = f'{name2}_{add_zeros(num2)}.jpg'

    img1 = plt.imread(os.path.join(img_path, name1, img_name1))
    img2 = plt.imread(os.path.join(img_path, name2, img_name2))

    return tf.convert_to_tensor([img1, img2, target])
```
wooden forge
#

also I changed the central value to get the maximum divided by 2
that's totally normal

wooden sail
#

surmised as much

tidal bough
#

you might in fact want to go the scipy.optimize way, since you only have the PDF

wooden sail
#

still, you'll wanna do a fit

wooden forge
#

PDF ?

wooden sail
#

since you have the parametric model, you can find the gradient and hessian analytically (or use automatic differentiation) to make it converge quickly

#

PDF = probability density function

tidal bough
wooden forge
#

I also have the state function

#

as a numpy array

wooden forge
#

basically I have my system in an initial psi state as a ket, then I apply my operator, yada yada, and get a final state

#

and I'm plotting the probability density

#

which I have as an array

wooden sail
#

so you do have the data. you can just do a maximum likelihood estimate of the mean and variance then, both of which have closed form if the distribution is normal (and is assumed to have AWGN noise)
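As a sketch of what the closed form looks like, assuming you have raw samples drawn from the distribution (the synthetic `samples` array below is a stand-in, and this does not apply directly if your array holds density values on a grid):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the real samples; true mean 0 and std 60 are assumptions.
samples = rng.normal(loc=0.0, scale=60.0, size=10_000)

# Closed-form maximum likelihood estimates for a normal distribution.
mu_hat = samples.mean()
sigma_hat = samples.std(ddof=0)  # ddof=0 divides by N, which is the ML estimate

print(mu_hat, sigma_hat)
```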

wooden forge
#

AWGN ?

wooden sail
#

additive white gaussian noise

tidal bough
#

here's an example of gaussian fitting:

```
from scipy import optimize as opt
import scipy.stats as stats
import numpy as np
# X,Y are the points of your PDF

def loss_fun(params):
    mu, sigma = params
    return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()

initial_guess = np.array([0, 60])  # initial guess, just needs to be close-enough to the true values
res = opt.minimize(loss_fun, x0=initial_guess)
```
wooden sail
#

just do what reptile suggests, it seems we're using different nomenclature anyway so it'd take too long to discuss it

wooden forge
#

lmao

#

that's fine Edd ^^

#

is Y the data I have

#

just to be sure

tidal bough
#

Sure, each Y[i] being the value of |\Psi|^2 at the corresponding p=X[i]

wooden forge
#

ha nice!

tidal bough
#

Then you do mu, sigma = res.x and plot a gaussian with these.

wooden forge
#

oki ^^

#

thx ! will try !

tidal bough
#

note my edit just now; it should be return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()

wooden forge
#

ho okay

wooden sail
#

did you get it to work? i'm curious how close that solution will come, since vanilla least squares isn't optimal here. seems to me that the noise covariance is not just a scaled identity matrix, so the cost function would ideally have an inverse of the covariance matrix somewhere in there
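A rough sketch of that weighted variant, using `scipy.optimize.curve_fit` with its `sigma` argument. Everything about the noise here is an assumption for illustration (synthetic data, per-point noise that grows with the signal); with real data you would have to estimate the noise level yourself:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(-300, 300, 201)
true_pdf = gaussian_pdf(X, 0.0, 60.0)

# Assumed noise model: standard deviation proportional to the signal plus a small floor.
noise_std = 0.02 * true_pdf + 1e-5
Y = true_pdf + rng.normal(0.0, noise_std)

# curve_fit's `sigma` is the per-point noise std (not the Gaussian width);
# each residual is divided by it, i.e. weighted least squares.
popt, pcov = curve_fit(
    gaussian_pdf, X, Y, p0=[10.0, 50.0], sigma=noise_std, absolute_sigma=True
)
mu_fit, sigma_fit = popt

print(mu_fit, sigma_fit)
```

Passing a 2-D array as `sigma` is interpreted as a full noise covariance matrix, which is the general case described above.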

wooden forge
#

Not done yet

#

I grabbed some snacks that's why

wooden sail
#

guess i surmised as much. one thing you can try is to split up the signal into chunks of a handful of samples and compute the variance of each