#data-science-and-ml

steady basalt
#

there are 1800

#

of 50k test size

desert oar
#

this is exactly why we do stratified sampling. if the rare class is 5% of cases, you want to make sure that it's 5% in both the train and test sets if possible. then you can oversample in test later.
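
A minimal sketch of the stratified split described above, using scikit-learn's `train_test_split` with `stratify=y`. The synthetic data, the roughly 4% positive rate, and the variable names are assumptions for illustration, not the dataset from this conversation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 10,000 rows, roughly 4% positives.
y = (rng.random(10_000) < 0.04).astype(int)
X = rng.normal(size=(10_000, 3))

# stratify=y keeps the class ratio (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"train positive rate: {train_rate:.4f}")
print(f"test positive rate:  {test_rate:.4f}")
```

With a plain random split the two rates are usually close on data this large, but stratifying removes the luck involved.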

steady basalt
#

that isnt a issue

steady basalt
#

i see that as cheating as test data shud be totally unseen

desert oar
#

it's not cheating. it's a valid technique. the test data is meant to be a simulation of unseen data, but your data

steady basalt
#

how do i fix this?!

desert oar
#

1800 / 50_000 = 3.6%

#

what's the % in the training set before oversampling?

steady basalt
#

in the overall dataset its 18k/500+k

#

so um

desert oar
#

ok so about the same

steady basalt
#

15k/45?

#

youd expect so with random split

desert oar
#

yeah with that much data it should be ok. so alright, you've ruled out a pathological case like having literally 5 instances of the rare class in the test set

steady basalt
#

1800 is more than enough

desert oar
#

next question: feature distributions. are there any rare feature values that show up in one split but not the other?

steady basalt
#

is there a quick and ez way to test that

desert oar
#

not in general. but conceptually it's "groupby and compute the distribution"

#

or just compute the distribution in both sets and compare to the baseline
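
One way the "compute the distribution in both sets and compare" idea could look for a single categorical column. The helper name `compare_category_shares` and the toy values are hypothetical; a category that is missing from one split shows up with a share of 0.

```python
import pandas as pd


def compare_category_shares(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Put the share of each category in the two splits side by side."""
    shares = pd.concat(
        [
            train.value_counts(normalize=True).rename("train_share"),
            test.value_counts(normalize=True).rename("test_share"),
        ],
        axis=1,
    ).fillna(0.0)  # a category absent from one split gets share 0
    shares["abs_diff"] = (shares["train_share"] - shares["test_share"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# Toy data (assumption): category "c" never appears in the test split.
train_col = pd.Series(["a"] * 70 + ["b"] * 25 + ["c"] * 5)
test_col = pd.Series(["a"] * 8 + ["b"] * 2)

report = compare_category_shares(train_col, test_col)
print(report)
```

For continuous columns, the same comparison could be done on binned values or with overlaid histograms.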

steady basalt
#

X_train and X_test plots?

#

let me go and look

desert oar
#

how many features do you have anyway? what kind of model is this?

steady basalt
#

ok there was about 10 features but after dummifying its 50

#

a handful of continuous

#

random forest for now, but will change that later once fixed

#

to xgb maybe

desert oar
#

they're all categorical?

steady basalt
#

so, histograms for cont data and countplots for the onehot encoded data?

#

i have 5 numerical

#

altho about ~70% of one of them was imputed w median

desert oar
#

generally it's not necessary to dummify categorical variables in tree-based models. if you do that, you end up skewing your model to over-fitting on high-cardinality features. sometimes it's actively harmful.

steady basalt
#

Oh.........

#

let me change that and see if the problem is fixed

desert oar
steady basalt
#

well spotted

#

i did it on all data

desert oar
#

well that's an obvious source of data leakage

steady basalt
#

so its fillna of both test and train with train median

#

ah man, i imputed modes too way earlier when i first loaded the data

desert oar
#

yeah that's an easy mistake to make. assuming you're using scikit-learn, you'll want to add it as a step in your sklearn pipeline and not at the beginning when loading data
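
A sketch of what "add it as a step in your sklearn pipeline" could look like. Because the imputer sits inside the `Pipeline`, each cross-validation fold learns its medians from that fold's training rows only. The synthetic data and the step names `"impute"` and `"model"` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic data (assumption): the label depends on the first feature.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # knock out roughly 20% of the values

pipe = Pipeline(
    [
        # Medians are re-learned on the training part of every CV fold.
        ("impute", SimpleImputer(strategy="median")),
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

scores = cross_val_score(pipe, X, y, cv=3, scoring="f1")
print(scores.mean())
```

The same fitted pipeline can then be applied to the held-out test set with `pipe.fit(X_train, y_train).predict(X_test)`, so the test rows are filled with training medians.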

steady basalt
#

actually, im not so sure all of it was a problem

#

alot of things had to be imputed manually

#

such as missing variable to a certain category

desert oar
#

yeah i don't want to declare prematurely that fixing the data leakage will solve your problem. but it's one thing you'll want to eliminate anyway.

steady basalt
#

maybe your advice on the balance of the features will work, ill check it out

#

@desert oar

#

first comparison

#

its not too different

#

omfg i have some outlier

#

no nvm

#

outlier shudnt matter

#

looking better with some other params i chose, but still not good enough for class1 f1

#

now, if i grid searched for f1, and it thought the best params are when class0 f1 is 0.9 and class 1 f1 is 0.2, how do i instead try to go for both being somewhere in the middle, say 0.7

#

@desert oar

#

class-specific f1 for class1 would fix it?

desert oar
steady basalt
#

so just 'f1' metric no variant

desert oar
#

right

steady basalt
#

but it keeps thinking best params is when class0 f1 is really high

#

and that means class1 will suck

desert oar
#

isn't "class1 f1" what you want anyway?

steady basalt
#

yeah

desert oar
#

assuming the rare class is class 1

#

so ignore class 0 f1, it's not the thing you care about

steady basalt
#

well, it kinda matters

#

but still, class1 f1 sucks

#

idk why

#

did u see the histograms above

desert oar
#

a false positive for one is a false negative for the other

steady basalt
#

?

#

its best to get a decent f1 for both

#

?

desert oar
#

i am suggesting that you focus on the f1 score for the class that you care about. and that you should convince yourself in the binary classification case that there's no point in looking at both

#

i'm not sure what the histograms are telling me

steady basalt
#

so if recall and precision for the negative class is rly low, im saying my models going to tell people theyre getting x disease when actually they aint?

#

thats bad

#

with predicting diseases its probably quite important to consider both classes

#

i dont want a model that cannot properly predict people who dont have the disease

desert oar
steady basalt
#

Yes

#

thats a further challenge, but for now, maybe i shud optimise to make class1 f1 highest rather than just say scorer=f1

desert oar
#

you can also look at "balanced accuracy" which is the average of sensitivity and specificity
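
A small check of the "average of sensitivity and specificity" definition using scikit-learn. The labels and predictions below are made up for illustration (10 positives out of 100).

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels (assumption): 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10
# 80 true negatives, 10 false positives, 3 true positives, 7 false negatives.
y_pred = [0] * 80 + [1] * 10 + [1] * 3 + [0] * 7

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0
balanced = balanced_accuracy_score(y_true, y_pred)
plain = accuracy_score(y_true, y_pred)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"balanced accuracy={balanced:.3f} plain accuracy={plain:.3f}")
```

Plain accuracy looks fine here because the majority class dominates, while balanced accuracy is pulled down by the weak recall on the rare class. In a grid search it can be requested with `scoring="balanced_accuracy"`.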

desert oar
steady basalt
#

the issue with that is that the stupid grid search will think '0.9' precision and '0.1' recall is best bcs the f1s highest, when i cud be getting something better like 0.65 and 0.65

#

in this case recall>precision

desert oar
#

fair enough. maybe you can penalize extreme values somehow

steady basalt
#

i cud optimise for recall, and have tried, but still got the same bad result

desert oar
#

try balanced accuracy maybe

steady basalt
#

i think this has to be a data problem, i shud step away from grid searching and just use default models until something looks alright

desert oar
#

that's also a good decision

#

start with a baseline model and then optimize after you know it works

steady basalt
#

so how should i go about fixing this weird issue

#

i showed you the distributions arent too different

#

a bit for age, sure but shudnt ruin the model

desert oar
#

right, so you've ruled out another issue

#

now it seems that you need to turn to the imbalance problem itself

steady basalt
#

well i used smote so its 5050 for train

desert oar
#

what are your performance metrics within the train set?

steady basalt
#

i turned 18k/450k into 450k/450k

desert oar
#

it's possible also that the smote results aren't good

steady basalt
#

i dont think undersampling will help me here either as i need the data

desert oar
#

i've heard "mixed reviews" about oversampling and i don't personally use it

steady basalt
#

smote shud work, i mean its just balancing the class while adding nothing new

#

for model purposes

#

i dont think thats the problem

#

my halving grid search seems to be showing test f1 of 0.7 so far maximum

#

but im telling u when i print the actual results of the test set its gona suck for either class

#

especially recall/precision being imbalanced

#

if i dont fix this soon im gona have to call it quits and submit bad results, which doesnt look good for a thesis

#

sure, its not about performance but it helps when the reader sees decent numbers that would be helpful in deployment

desert oar
#

unfortunately this is just how machine learning goes

#

i do need to get back to my own work for now, but you'll have to keep trying things and coming up with reasons why the model might be overfitting to one class or another

steady basalt
#

is there a metric thats balanced f1? so its not maximising by saying 0.99 precision and 0.1 recall and instead tries to find a maximum where both are highest and balanced?

desert oar
#

you could e.g. take the average of both f1 scores

steady basalt
#

does sklearn have it?

#

i need to make it?

desert oar
#

i'm not sure if that's equivalent to plain f1 though... might need to write it out

steady basalt
#

yeah no i meant for class 1 mostly

desert oar
#

im not sure, it would be useful though

steady basalt
#

i meant i prefer 0.7 recall and 0.6 precision over 0.9 and 0.3

desert oar
#

actually it's a harmonic mean, it should discourage extreme values anyway
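
The harmonic-mean point can be checked with the numbers mentioned earlier in the conversation. A tiny worked example; the helper name `f1` is just for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


lopsided = f1(0.9, 0.1)      # 2 * 0.09 / 1.0
balanced = f1(0.65, 0.65)    # equal inputs give the same value back
preferred = f1(0.6, 0.7)     # the 0.7 recall / 0.6 precision case
extreme = f1(0.3, 0.9)       # the 0.9 recall / 0.3 precision case

print(f"{lopsided:.3f} {balanced:.3f} {preferred:.3f} {extreme:.3f}")
```

So a search that maximises F1 for one class should rank 0.65/0.65 well above 0.9/0.1, and 0.6/0.7 above 0.3/0.9. If the selected model still ends up lopsided, the cause is more likely elsewhere than in the metric.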

steady basalt
#

optimising f1 again hasnt done well for class 1

#

its maximising class 0

#

0.07... terrible

#

0.55 recall...

desert oar
#

you don't know that it's maximizing class 0 precision. you only know that it's coming out high

steady basalt
#

thats def not trying to optimise class 1 f1

#

scorer=f1 must be doing average f1

#

f2_score = make_scorer(fbeta_score, beta=2, pos_label=1) shud i do this

desert oar
#

stop and read the docs. f1 score without any further qualifications is precision and recall of the "1" class. and that is how it's almost always used. resist the temptation to guess

steady basalt
#

that just cant be whats going on here

desert oar
#

it has to be, unless there's a bug in your code

steady basalt
#
```python
                             random_state=42,verbose=10,n_jobs=-1,cv=3,scoring='f1').fit(X_train, y_train)
clf=RandomForestClassifier(**search.best_params_)
```
#

```python
y_preds = clf.predict(X_test)
print(accuracy_score(y_test, y_preds))

print(classification_report(y_test, y_preds))
metrics.plot_roc_curve(clf, X_test, y_test)
```
#

maybe i shud try .95 .05 split ....

#

xd

#

900 rare class is still not too bad

desert oar
#

i mean, it's possible that it's using the wrong f1 score i guess. try being explicit, scoring=make_scorer(f1_score, average='binary')
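
To see which number plain F1 reports, it can help to compute the variants side by side. In scikit-learn, `f1_score` defaults to `average='binary'` with `pos_label=1`, which is what the `'f1'` scoring string uses for binary targets. The toy labels below are assumptions.

```python
from sklearn.metrics import f1_score, make_scorer

# Toy labels (assumption): 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 80 + [1] * 10 + [1] * 3 + [0] * 7

default_f1 = f1_score(y_true, y_pred)                            # class 1 only
class1_f1 = f1_score(y_true, y_pred, pos_label=1, average="binary")
class0_f1 = f1_score(y_true, y_pred, pos_label=0)
macro_f1 = f1_score(y_true, y_pred, average="macro")             # mean of both

print(f"default={default_f1:.3f} class1={class1_f1:.3f}")
print(f"class0={class0_f1:.3f} macro={macro_f1:.3f}")

# An explicit scorer that could be passed as scoring=... to a search object.
explicit_scorer = make_scorer(f1_score, average="binary", pos_label=1)
```

If an average over both classes is what is wanted, `scoring="f1_macro"` is the built-in string for it.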

steady basalt
#

f1 score doesnt work

#

i think new sklearn is f1?

desert oar
#

from sklearn.metrics import f1_score

#

that isn't a string. it's the actual function

steady basalt
#

i was using grid search's scorer f1 function

desert oar
#

right, it lets you pass a string for convenience

#

somewhere in the docs (i forget where) it says which string corresponds to which function

steady basalt
#

'f1' is probably the same as yours

#

anyway, tried splitting down to 95% 5% which is 870 class1's

#

some more training data for it

#

i suppose it wud be bad luck if its like, given the 800 samples of the 18k that are the hardest to predict

#

: )

#

guess thats why i need to do cv, ill do it on the train set

amber thorn
#

Can anyone here help me with a code I've been working on...I am having difficulty calling a function into another function. lmk if anyone's interested, i'll give detailed explanation in PM.

#

help is appreciated!!

serene scaffold
amber thorn
#

oh okay...thanks!

worldly kiln
#

hey, if possible i would like some guidance regarding starting my journey in data science….If anyone can help me where i should start from, what steps to take on first and where do i learn it from? thank you.

serene scaffold
worldly kiln
serene scaffold
worldly kiln
serene scaffold
#

though it might be a CS degree, and you'd have to look into ways to make it data science-focused. there usually aren't data science degrees.

delicate wasp
#

Whats AI ?

lapis sequoia
#

could be an if statement

serene scaffold
iron basalt
serene scaffold
lapis sequoia
serene scaffold
iron basalt
#

I guess you could say AI is more about the goal, what you want the end result to be, and not so much about what exact methods are being used / what it is right now (to some limit, it has to do enough for most to consider it AI, so some lower bound of things).

#

And that goal is often mimicking some part of human intelligence.

#

Or not even human intelligence.

#

Like maybe how bees map out their territory.

#

Though the very general idea of AI is often dropped to something more specific to be on the same page and be productive.

iron basalt
vernal crescent
#

Hello, does anyone here have any knowledge about the Speech Recognition module? I have a question shipit

bold timber
#

Hi, I have a problem for imbalanced data. How to determine the equation for class_weight with classes 0 or 1?

lapis sequoia
#

When a ML model "learns," how do I know what exactly is it doing to learn? What mathematical functions does it use?

agile cobalt
#

look up a tutorial on gradient descent to have a basic idea or take a course like Andrew Ng's on Coursera

#

if you just want to know "what it does", it's not that complex except for the math behind the derivatives, but if you want to know "why does it work", especially for some of the most complex tasks... I'm not sure if anyone in the world can answer that adequately

probably yes and I'm just exaggerating, but still

lapis sequoia
#

I have another question, so based on my understanding a neural network basically creates a mathematical predictive function that maps x to y. How can I see the function it has created? If I can't see it, can I know the rough shape of it?

agile cobalt
#

I recommend taking a look at simpler algorithms before looking into neural networks
(e.g., linear regression and decision trees)

but the "function" in neural networks is essentially just multiplying all the previous layer's nodes by the connection weights, then adding it all together (for each node in the next layer)

#

oh, I almost forgot - then yeeting it into the activation function
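
The "multiply by the weights, add it all together, then apply the activation" description can be written out in a few lines of NumPy. This is only a sketch: the ReLU activation, the bias terms, and the tiny hand-picked weights are assumptions, and real networks often leave the last layer without an activation.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    """Activation function: negative values become 0."""
    return np.maximum(z, 0.0)


def forward(x: np.ndarray, layers: list) -> np.ndarray:
    """Run the input through each (weights, bias) pair in turn."""
    a = x
    for W, b in layers:
        # Weighted sum of the previous layer's nodes, then the activation.
        a = relu(a @ W + b)
    return a


# Hand-picked weights (assumption): 2 inputs -> 2 hidden nodes -> 1 output.
layers = [
    (np.array([[1.0, -1.0], [0.5, 1.0]]), np.array([0.0, -2.0])),
    (np.array([[3.0], [4.0]]), np.array([1.0])),
]

x = np.array([1.0, 2.0])
out = forward(x, layers)
print(out)
```

Here the hidden layer computes `[2, -1]`, ReLU turns that into `[2, 0]`, and the output node computes `2*3 + 0*4 + 1 = 7`. The "function" the network has learned is this whole composition with its trained weights plugged in.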

lapis sequoia
#

that it would run future predictions on

agile cobalt
#

I'm not sure if the libraries ever compile it into a single function or just stores all the weights then runs it layer by layer

#

my guess is that it has to store and run layer by layer because of the activation functions though
(otherwise, the final function would get monstrous really, really fast)

#

but yeah, like I suggested before: If you want to understand them properly, do take a course on it.
I don't understand them all that well either, as you may have already guessed

lapis sequoia
#

I've decided to read a book called Neural Networks from Scratch, where we build a NN from raw Python

iron basalt
lapis sequoia
agile cobalt
lapis sequoia
agile cobalt
iron basalt
lapis sequoia
lapis sequoia
agile cobalt
#

the model is not one single function.
It's a bunch of weights.

you can look up "pytorch visualise model weights" or more generally "pytorch model visualization"

iron basalt
#

You can edit its source code though.

#

But i'm not sure what kind of modifications you are trying to do. If it's just altering some weights, yeah, that is what it already does.

#

Neural networks can be seen as just big functions that have fine-tuned parameters via some algorithm (all can if you abstract hard enough).

#

(The details of that function are what matter though, and it's why it's considered a neural network and not just any big function)

frigid creek
#

hi, so there are IoU, confidence threshold, precision, recall, AP and im just still confused about the part where mAP is based on IoU plus theres also part where some say precision-recall curve is weighted by threshold, in this graph is IoU and confidence threshold the same thing? thanks

unique flame
#

Anyone know of a paper for image classification in which the researchers tried different training/validation ratio? So a split of 60/40,70/30,80/20 etc.

modest onyx
#

my rnn is actually converging pog

rapid cedar
#

coders of reddit discord, where did you learn ai?
also, what modules do you recommend to learn b4 starting ai learning

steady basalt
#

Uni

rapid cedar
modest onyx
#

now there's a new and improved version you can check out (it's free)

wooden sail
modest onyx
#

an NN is just a massive composite and piecewise function

#

There are many ways to visualize what each layer is doing though

unique flame
jaunty creek
#

hello, I'm trying to make a few google accounts and google asks me for a phone number
to get around this, i've tried MAC spoofing, changing IP's, multiple browsers, various timeframes, various passwords and username configurations and clearing the browser cache and the cache on my local cache on my machine.
how does google still know im the same user trying to make new accounts?

#

is this some sort of ai

#

detecting it

neat crescent
#

Unknown, but this sounds a little sketchy. It's not difficult to make more than one Google account, so whatever you're doing is out of the norm. As such we won't help as per our rule 5.

#

!rule 5

arctic wedgeBOT
#

5. Do not provide or request help on projects that may break laws, breach terms of services, or are malicious or inappropriate.

barren wedge
#

Is there any document ai outside google cloud?

jaunty creek
steady basalt
#

I’ve got like 4 gmails

modest onyx
#

I've got like 12

#

half of which are for each institute/organization I'm in

wicked vessel
#

hi 👋I was playing around with tensorflow and was wondering if there was a better way for the following:

I have a dataset with a feature (tcp port) that can range from 0-65500 where only the exact value matters. Eg if the dataset feature has port 123 then it only has a meaning if the input variable is exactly 123, and not 124, 122, etc. From what ive seen you normally want a categorical/integer encoding for something like this but i have a feeling there should be a better way when dealing with 65k different possible values

#

Right now my model seems to heavily depend on that one feature as a big indicator, giving outputs that make no sense when looking at the other features

tropic matrix
#

is there a way to implement a custom loss function using sklearn mathematical operations?

i use a MinMaxScaler on my y output (regression), and i'm wondering if there's a way to inverse_transform the rmse in order to find a "true" rmse and judge how accurate it is

#

nvm figured it out

#

i had to first enable run_eagerly=True in model.compile, then converted my y_true and y_pred to np arrays using .numpy(), then i just inputted them into the function
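
An alternative that may avoid eager mode entirely: compute the RMSE on the scaled values as usual, then convert it afterwards. With `MinMaxScaler` and its default `feature_range=(0, 1)`, errors shrink by the column's range, so multiplying the scaled RMSE by `data_range_` recovers the RMSE in original units. The toy targets and predictions are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy regression targets in original units (assumption).
y = np.array([[100.0], [150.0], [200.0], [300.0]])
scaler = MinMaxScaler().fit(y)
y_scaled = scaler.transform(y)

# Pretend predictions in scaled space, each off by 0.05.
pred_scaled = y_scaled + np.array([[0.05], [-0.05], [0.05], [-0.05]])
rmse_scaled = np.sqrt(np.mean((pred_scaled - y_scaled) ** 2))

# Route 1: inverse-transform the predictions, then compute RMSE.
pred = scaler.inverse_transform(pred_scaled)
rmse_true = np.sqrt(np.mean((pred - y) ** 2))

# Route 2: rescale the scaled RMSE by the range of the target column.
rmse_rescaled = rmse_scaled * scaler.data_range_[0]

print(rmse_scaled, rmse_true, rmse_rescaled)
```

Both routes agree because min-max scaling is a linear map; the shift cancels in the difference and only the range factor remains.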

steady basalt
#

does anyone know how to fillna conditionally so that where ycol = 1 it fills with medians with that condition rather than column median

#

as one of my features would be better imputed this way for better predictive power

#

seems like a tricky pandas query

wooden sail
#

do it in two lines, perhaps. df['somecol'].loc[ some_condition & some_other_condition ] = some_val, and a second line for where the condition doesn't hold

#

otherwise, pandas should have an equivalent to numpy's where, which does exactly this

steady basalt
#

so i can say impute median of x column using a median thats calculated only on a condition of another col

#

for example, my target class is binary, and i want to impute medians based on that rather than blanket impute a column

#

so the median xcol is say 130 where y=1 but 125 where y=0

#

so a missing value will check first, does ycol equal 1 or 0, then accordingly impute the median of all rows where that condition is met

#

the code you wrote works for finding median values, how do you put that into the fillna command

wooden sail
#

this would be without fillna, but rather making one of the conditions into isna

steady basalt
#

with fillna youd literally write the exact same code

#

before and after fillna

#

seemed to have worked

#

df['somecol'].loc[ some_condition & some_other_condition ]

#

.fillna(df['somecol'].loc[ some_condition & some_other_condition ] .median())

wooden sail
#

that's basically the same thing, and also requires two lines

#

you only need one condition in that case

steady basalt
#

seemed to have imputed the wrong number

wooden sail
#

or you could check out the pandas where that i shared above. any of these 3 work

steady basalt
#

not sure why it didnt work

#

it seemed to have imputed all

#

not just where y=1

wooden sail
#

show the code

steady basalt
#

combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(combined_train['bp'].loc[combined_train['stroke'] == 1].median(),inplace=True)

#

NaN 1

#

still have these

wooden sail
#

you're replacing the entries of column 'bp' corresponding to where the rows of the column 'stroke' are true, with the median of those same entries of bp

steady basalt
#

But it’s fillna so it shud do so where it’s missing based on where it isn’t

wooden sail
#

no

#

if you tell it to replace nans with the median of some nans, it will happily give you nans again

steady basalt
#

Ummmm

untold bloom
#

pandas' median and many other methods exclude NaN in the result

#

code is perhaps not working due to inplace=True

steady basalt
#

I’m thinking it’s easier to fillna with a float that I’ve calculated myself to be the median

untold bloom
#

at that point, you're perhaps modifying a copy

steady basalt
#

Nah it works otherwise on normal use

untold bloom
#

inplace=True is to be avoided 999 out of 1000 occasions :p

#

it's useful in very rare cases; so perhaps try re-assigning

wooden sail
#

what does combined_train['bp'].loc[combined_train['stroke'] == 1].median() yield?

steady basalt
#

The correct number

#

It’s higher than when == 0

wooden sail
#

then try what untold bloom says

steady basalt
#

What if I just fillna of that condition with the number

#

Shud work

untold bloom
#
```python
sub_values = combined_train.loc[combined_train["stroke"] == 1, "bp"]
combined_train.loc[combined_train["stroke"] == 1, "bp"] = sub_values.fillna(sub_values.median())
```
#

problem is inplace=True...

wooden sail
#

the other two methods i mentioned earlier also get you there without using fillna and are pretty similar to this, so try those out too, if you like

steady basalt
#

A value is trying to be set on a copy of a slice from a DataFrame

#

his strategy gave this

untold bloom
#

that means combined_train was defined from another dataframe, possibly as a subset or something

#

like combined_train = other_df[...] idk

steady basalt
#

it was

#

yes

untold bloom
#

you need to chain .copy() at the end

steady basalt
#

when creating it?

untold bloom
#

combined_train = other_df[...].copy()

steady basalt
#

well it was created from X_train and y_train which is also a subset of another df

#

and on and on and on before that

#

problem?

#

combined_train = pd.concat([X_train, y_train], axis=1).copy()

#

? works ?

wooden sail
#

will you work with the original df or only these new ones? you could use copies if you don't need to modify the original

steady basalt
#

this shudnt be an issue

#

qol sacrificed for what functionality?

untold bloom
#

pd.concat would give a copy anyway; that warning is perhaps due to some other code that you didn't share

steady basalt
#

.copy didnt work

untold bloom
#

yes, that was expected :p

#

because pd.concat gives you a new thing anyway

steady basalt
#

nah its from x_train and y_train from previous df, which in itself is likely from another df

#

its a long notebook im not going back and going thru everything

untold bloom
steady basalt
#

instead of sub values i just said the condition you made subvalues from =

#

same thing

#

combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(134.5,inplace=True)

#

this SHUD work wtf

untold bloom
#

that should never work really

steady basalt
#

the 1s didnt get filled

untold bloom
#

you have chained access...

steady basalt
#

combined_train['bp'].loc[combined_train['stroke'] == 1]=combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(134.5) this worked

untold bloom
#

combined_train['bp'].loc[combined_train['stroke'] == 1] is unpreferred

#

combined_train.loc[combined_train['stroke'] == 1, "bp"] is preferred
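
A self-contained version of the preferred single-`.loc` form, plus a `groupby(...).transform("median")` variant that fills every class in one step. The tiny frame `df` is made up; the column names `stroke` and `bp` mirror the ones used in the conversation.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) with one missing bp value per class.
df = pd.DataFrame(
    {
        "stroke": [1, 1, 1, 0, 0, 0],
        "bp": [130.0, np.nan, 140.0, 120.0, np.nan, 126.0],
    }
)

# Single .loc with (row mask, column): assigns into df itself, not a copy.
mask = df["stroke"] == 1
df.loc[mask, "bp"] = df.loc[mask, "bp"].fillna(df.loc[mask, "bp"].median())

# Variant: fill the remaining NaNs with each row's own class median.
class_median = df.groupby("stroke")["bp"].transform("median")
df["bp"] = df["bp"].fillna(class_median)

print(df)
```

The chained form `df['bp'].loc[mask].fillna(..., inplace=True)` fills a temporary object that is thrown away, which is why the NaNs appeared to survive.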

steady basalt
#

so the weird thing is why this worked and the earlier didnt work

untold bloom
steady basalt
#

well its done now and working on what i typed there

untold bloom
#

ok, sorry for the clutter

steady basalt
#

so now i have different medians filled in my train data to base imputing my tests on that are conditionally different

#

so the model easier predicts

#

maybe improve score

steady basalt
#

are you allowed to fillna of the test data conditionally or does it have to just be based off of a single column train value

#

it wud be cheating right? to impute test values where the test y is a certain value

#

is the best way to do this to impute based off of the entire column median from train?

#

but isnt that leaking data in a way

#

its that you have seen and imputed on knowledge of test data

#

if you say 'impute test value with x value where testy=1

#

x value coming from training data

#

if you know that test1 is 1 or 0, isnt that cheating

#

testy*

wooden sail
#

hmm yeah, you mean to modify the values of the input based on the output? you don't wanna do that

#

it'll make your test results better than they would otherwise be, not representative of real data

steady basalt
#

so the entire load of work i just did, is just to make better training set, but for the test set id need to impute missing values based on a single training value

wooden sail
#

you can incorporate this behavior into the model with some kind of recursion or adding a variable that keeps track of the previous state. that would allow you to accumulate your predictions

steady basalt
#

wdym?

#

looking at the first prediction, seeing its y value and adjusting accordingly ? cant do that if theres any na in the first place

wooden sail
#

let's take a step back. you're calling x the input and y what you're trying to predict, yeah?

steady basalt
#

yes

wooden sail
#

ok. and you want to modify values of x based on what y is

steady basalt
#

well i want to boost my f1 score its rly low

#

```
              precision    recall  f1-score   support

           0       0.97      0.75      0.84     48436
           1       0.04      0.30      0.08      1806

    accuracy                           0.73     50242
   macro avg       0.50      0.53      0.46     50242
weighted avg       0.93      0.73      0.82     50242
```
#

for class1 its still bad

wooden sail
#

sure, but what you're doing right now is replacing NaNs in x, yeah?

steady basalt
#

i just did that with your guys help to make it better than just blanket imputing based off of the entire column but instead based off of the value of train_y

wooden sail
#

just yes or no lol

steady basalt
#

yes

wooden sail
#

ok. well, in practice the values of y will not be available. but is there any reason to believe the current value of y depends on the previous values of y?

steady basalt
#

no

wooden sail
#

how about the current x on the previous values of x?

steady basalt
#

no, its meant to be random

wooden sail
#

then this approach cannot be done in practice

steady basalt
#

well that isnt what i was doing

#

i was giving the model the assumption that any missing data from test is going to be different depending on its x or y in the first place

#

by imputing different medians before training and evaluating

#

based on training sets x and y

wooden sail
#

you're still trying to modify x based on y

steady basalt
#

if the average bp of someone with a stroke is higher than someone without a stroke in the training set, it makes sense to impute conditionally so that those with a stroke have higher bp value in the training data, and this isnt cheating because.. its training data

#

then the model will 'see' someone with that higher blood pressure and may tend towards putting it in stroke=1 category, which may improve accuracy

#

make sense?

wooden sail
#

yes, that's fine

#

what will you replace the values with when you don't have y

steady basalt
#

yea that is what all the fuss was about

#

oh, I have y for everything

#

I only used data where y existed

wooden sail
#

yes but when you go out and actually use the network, y is not available

steady basalt
#

y is either 'the source said this person was on the record for stroke' or wasnt

wooden sail
#

or are you not trying to predict y from x?

steady basalt
#

I am

wooden sail
#

these are two inputs?

steady basalt
#

wdym y is not available?

#

in practise it isnt

wooden sail
#

if you want to predict y from x in real data, you are given x and y is unknown

steady basalt
#

yes

wooden sail
#

so what do you plan on doing with the nans then

steady basalt
#

its based off of the training set

#

so this new unseen data will use the training sets values

#

to impute nans

#

same as how my test set has also done

#

which isnt conditional that is just based on entire column value

wooden sail
#

so the training nans in x are computed from the training y, and then the test x nans are computed from the training x values?

steady basalt
#

weird, my grid search cv is now saying 0.95 scores for these tests.... that is odd

wooden sail
#

all right. i have to say it's a bit weird when you also said the values of x are unrelated to each other

steady basalt
#

idk what u meant by that. but i had to do it this way to prevent leakage while also improving the model

wooden sail
#

but you're using something like a population average, so that's okish, not great

steady basalt
#

what else can i do?

wooden sail
#

if you're computing just an average value, may as well look at statistics people have gathered in larger populations so that you get a better estimate of the median or mean or whatever you're using

steady basalt
#

this is odd, how my grid search scores are 0.97 now, what is happening? do you think its now seeing x values and being like 'well ill just guess y being this' and getting them all right due to the distribution of the x values
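
One possible explanation for a jump like this, not confirmed anywhere in the conversation: when most of a column is missing and the fill value depends on the label, the fill value itself becomes a marker for the label. The sketch below uses a pure-noise feature with about 70% missing values (an assumption echoing the figure mentioned earlier), a single decision tree instead of a random forest, and synthetic balanced labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_split(n: int):
    """Balanced labels plus one pure-noise feature with ~70% missing."""
    y = pd.Series(np.arange(n) % 2, name="stroke")
    bp = pd.Series(rng.normal(130, 15, size=n), name="bp")
    bp = bp.mask(rng.random(n) < 0.7)  # set ~70% of the values to NaN
    return bp, y


bp_train, y_train = make_split(2000)
bp_test, y_test = make_split(1000)

# Train: fill NaNs with the median of the row's own class (uses the label).
train_filled = bp_train.fillna(bp_train.groupby(y_train).transform("median"))
# Test: labels are off limits, so fill with the overall training median.
test_filled = bp_test.fillna(bp_train.median())

tree = DecisionTreeClassifier(random_state=0)
cv_acc = cross_val_score(tree, train_filled.to_frame(), y_train, cv=5).mean()
tree.fit(train_filled.to_frame(), y_train)
test_acc = tree.score(test_filled.to_frame(), y_test)

print(f"CV accuracy on the training set: {cv_acc:.2f}")
print(f"accuracy on the test set:        {test_acc:.2f}")
```

The feature carries no real signal, yet cross-validation on the training set tends to look far better than chance, because every imputed row holds exactly one of two class-specific values. On the test set, where the fill value cannot depend on the label, that advantage disappears.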

#

thats strange, shudnt score that well using this

#

its surpassed 0.99 train and test in grid search cv on my training set

#

!!!!

#

how the hell has this happened

#

oh i know how

#

hmmm. i oversampled so actually still shudnt have happened

#

somehow must still be guessing only one class to a massively high f1 and just disregarding the other class

#

optimising on f1

#

@wooden sail do u know why my f1 for class0 is 0.9 and class1 is only 0.1

wooden sail
#

what's your cost function

steady basalt
#

RF criterion?

#

entropy

#

im using random forest and xgb

#

[CV 4/10; 6/32] END bootstrap=True, criterion=entropy, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100, n_jobs=-1;, score=(train=0.997, test=0.993) total time= 1.8s this is on the training set btw

#

the training sets oversampled to balance

#

i did SMOTE after one-hot encoding also

wooden sail
#

if one of the categories is not very common, you could get it wrong always and still get a good performance by this metric

steady basalt
#

yep

wooden sail
#

didn't someone suggest using max f1 or something like that yesterday?

steady basalt
#

they said just f1

wooden sail
#

you could also make your own cost function where you average the results from the two categories, for instance

steady basalt
#

also, random forest isnt supposed to take onehot encoded columns right?

#

if thats the case, it wudnt be possible to use smote

#

as i believe it requires you to encode

wooden sail
#

it can, but it's probably not the most efficient at dealing with the sparsity involved

steady basalt
#

whats better model?

wooden sail
#

you could use smote and cast back to the original categories

wooden sail
#

how did you do the encoding? just do that backwards

wooden sail
steady basalt
#

ok

#

sklearn for now

wooden sail
#

should be able to do it with numpy, then, i think

steady basalt
#

what model would well handle one hot encoded data

#

neural network?

desert oar
desert oar
steady basalt
#

i forgot the syntax

#

and shud it be converted to string as i encoded on floats which were categories

wooden sail
#

should be inverse_transform

steady basalt
#

i shud also string(values) yes?

wooden sail
#

this is the first example on the sklearn website if you look up onehot encoding

```python
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
```

steady basalt
#

do u think this is whats causing my random forest to do this

#

when training cross val seems to be fine (balanced)

ornate shard
#

hello guys

#

how to read this csv file ?

wooden sail
#

in python, you can read csvs with the csv lib or pandas. there, you can specify the separator, which is a bar here instead of the more common comma
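a minimal sketch with the standard-library csv module, assuming a bar-separated file with a header row. the column names and values here are invented, and `io.StringIO` just stands in for a real file:

```python
import csv
import io

# stand-in for a bar-separated file; with a real file use open(path, newline="")
raw = "id|name|score\n1|ada|0.9\n2|bob|0.7\n"

# delimiter tells the reader to split on "|" instead of ","
reader = csv.DictReader(io.StringIO(raw), delimiter="|")
rows = list(reader)
print(rows[0]["name"])
```

note that the csv module leaves every value as a string, so numeric columns still need converting.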

steady basalt
#

would anyone know why my test and train scores are nan

steady basalt
#

after i went back and removed the onehot encoding part of the process and just used strings

ornate shard
#

because it shows like this

wooden sail
rapid cedar
#

can someone explain about this
i was looking thru some videos on youtube about ml, and i found one that said about deep learning. i find that very interesting. but it said that it has generation, how do i know that what im using rn is the generation i wanted? i cant just set seed like minecraft do. so how do they reconige the generation? like theres not token of the generation that has certain data

steady basalt
#

its saying random forest doesnt accept nans but i have no nans

ornate shard
ornate shard
wooden sail
#

glad it worked

steady basalt
#

sklearn doesnt take strings

#

so how to deal with categoricals?

#

if u arent meant to encode for random forest

ornate shard
#

how about this file how to read from a specifc line because i don't want the data definitions ?

wooden sail
#

there's another parameter called skiprows. given a number, it skips that many lines from the top of the file. you want to skip lines 0 to 8, which is 9 lines, so how about trying pandas.read_csv(file_name, sep='|', skiprows=9)

steady basalt
#

@desert oar i thinkit was you talking about this

wooden sail
lapis sequoia
# ornate shard

besides skiprows, there is a parameter called comment that skips some lines depending on how they start. in this case it would be comment="#"
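a small sketch comparing the two options, assuming the definition lines all start with `#`. the sample text is invented, and `io.StringIO` stands in for the real file:

```python
import io

import pandas as pd

raw = (
    "# data definitions\n"
    "# id: row identifier\n"
    "id|value\n"
    "1|10\n"
    "2|20\n"
)

# option 1: skip a fixed number of leading lines (here the 2 definition lines)
by_count = pd.read_csv(io.StringIO(raw), sep="|", skiprows=2)

# option 2: ignore every line that starts with the comment character
by_comment = pd.read_csv(io.StringIO(raw), sep="|", comment="#")

print(by_comment)
```

`comment` is handy when the number of definition lines can change between files, since nothing has to be counted.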

ornate shard
steady basalt
#

Cannot index with multidimensional key

#

anyone know why this suddenly started happening when it was working earlier

#

sns.boxplot(x=combined_train.loc[combined_train['stroke'] == 1, 'bp'])

naive turret
#

Need help

#

being able to combine tables

#

I've so far used pandas to try and clean them

#

I want to try and restructure the second table to look like the first

steady basalt
#

anyone know why keras validation auc is always 0

#

but binary cross entropy seems to be ok

#

oh you have to type a function and not the string

#

no, ok that didnt fix it

#
262/262 [==============================] - 2s 9ms/step - loss: 0.2052 - auc: 0.9507 - val_loss: 0.2516 - val_auc: 0.0000e+00
#

how is that possible

misty flint
#

hahaha this is so accurate

lapis sequoia
steady basalt
#

@misty flint any idea why my neural network always gets 0.0 auc

#

for validation

misty flint
#

hmm i have a note here that says you dont really listen to what i have to say

#

so...

steady basalt
#

but it makes 0 sense that training auc is 0.7 while val auc is 0

naive turret
#

Because the years arent headers wouldnt it merge and say "year" repeatedly

steady basalt
#

somehow the neural network only predicts one class

#

even tho i did undersample

midnight rain
#

Has anyone ever thought about implementing a feature store by using the entity component system (ECS) architecture?

#

I've been thinking about it a ton over the last few weeks and it seems like it'd be super nice

misty flint
#

there are also a lot of out-of-the-box feature stores out there too

midnight rain
#

I've always wanted to make one as a SAAS product though

misty flint
#

ml startup when

#

+1 for a very good UI

#

if you do end up doing something

iron basalt
serene scaffold
# naive turret

you need to split Month and Year into separate columns, and then do pivot_table

#

oh, you want to go the other direction. well, you can also do that with pivot_table

#

!docs pandas.pivot_table

arctic wedgeBOT
#

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
midnight rain
#

Maybe you could use an in memory ECS setup to cache the "hot" features and allow a regular relational/graph db to handle colder features if they aren't in the cache.

iron basalt
# midnight rain Well with a good feature store you want to be able to define whether a feature i...

Databases already have caching and especially relational databases have pretty much always been ECS-like. They are already optimized (although still on-going as many databases adapted to changes in hardware such as SSDs, bigger caches / memory speed being the bottleneck, etc). Relational databases are not always the preferred choice, but if configured correctly they are pretty much already following the data-oriented-design principle.

#

A more specific datastore / database for a specific task, will as with most things, probably be faster.

#

So if you have some niche to hit, it could be worth it.

midnight rain
iron basalt
midnight rain
#

I think it would be pretty cool to build a feature store/service on top of apache arrow + arrow flight + parquet

iron basalt
#

(As long as it does not hold back your design)

midnight rain
iron basalt
#

More importantly, if multiple programs support it you get all the interop, but not with some janky format.

midnight rain
#

well apache arrow flight is built on top of gRPC + IPC

iron basalt
#

(So if someone wrote some fast parallel query stuff for it, you can just use it)

iron basalt
steady basalt
#

anyone know why xgb over predicts class 1 while random forest over predicts class 0

modest onyx
#

does it?

agile cobalt
#

probably something related to the way their data is distributed or just a coincidence, I highly doubt that this holds true in general for these model types

rancid kelp
#

hey I have a dataset where some of the variables are numerical and are very skewed. Would you recommend Transformation(log/normal) OR Discretization? or mix of both?
Thanks :))

steady basalt
#

well it seems to have improved a lot simply by dropping all na values rather than finding the most efficient way to impute. this way it isnt guessing a single class for every prediction

#

now to somehow improve precision, i dont know how when theres only 500 of each class in a binary prediction, originally had 500k rows

#

perhaps a good way to impute would be the mid-way between class means in the training set, but ive never done that before.

dusty valve
#

can someone recommend a tensorflow tutorial (not anything specific, i just wanna learn everything about it)

tropic matrix
#

when tuning hyperparameters in keras_tuner, what is the difference between hyperband and bayesian optimization? and which one should i be using to find good hyperparameters the fastest?

dusty valve
#

here comes the fun part, so, what layers should i use for my text gen tensorflow Sequential model

tame grail
#

what does the code under #normalize data do

dusty valve
#

what am i doing wrong?

lapis sequoia
lapis sequoia
tropic matrix
#

I am trying to optimize 3 hyperparameters: number of layers, number of neurons per layer, and the learning rate. Would you like to know the ranges of these values?

rapid cedar
#

should i learn tensorflow b4 starting ml?

modest onyx
#

no it's the other way around

#

I don't recommend tensorflow though learn pytorch

lapis sequoia
#

hey, so im using pytorch. Would there be any problem down the road if weights.grad from my model's layers produce gradients tensors but weights.grad_fn returns None

rapid cedar
#

sooo, study math first?

silver ibex
#

please can someone help me, i really dont know where or what im doing wrong 😦

modest onyx
#

you'd be surprised how much of ml is math in comparison to coding

#

though if you're already pretty good at math then you can dive in and study the more complex math stuff in parallel with learning the coding aspect of it

#

at least that's what I did

silver ibex
#

is this what you mean?

steady basalt
#

anyone know why my accuracy doesnt change in my neural network

#

but loss does

modest onyx
#

could be because your loss function isn't strongly correlated with accuracy
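a toy illustration of how that can happen with binary cross entropy, assuming predictions are thresholded at 0.5 to get accuracy. the labels and probabilities are invented:

```python
import math


def binary_cross_entropy(y_true, probs):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    return -sum(
        math.log(p) if t == 1 else math.log(1.0 - p)
        for t, p in zip(y_true, probs)
    ) / len(y_true)


def accuracy(y_true, probs, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(t) for t, p in zip(y_true, probs)) / len(y_true)


y_true = [1, 1, 0, 0]
early = [0.10, 0.60, 0.40, 0.45]  # first positive is badly missed
later = [0.45, 0.90, 0.10, 0.20]  # more confident overall, but sample 0 is still below 0.5

loss_early = binary_cross_entropy(y_true, early)
loss_later = binary_cross_entropy(y_true, later)
acc_early = accuracy(y_true, early)
acc_later = accuracy(y_true, later)
print(loss_early, loss_later, acc_early, acc_later)
```

the loss rewards probabilities moving in the right direction, while accuracy only changes when a prediction crosses the threshold. so the loss can keep falling while accuracy sits still.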

serene scaffold
#

By the way, please don't ask people to read screenshots of text. actual text is easier for everyone.

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

silver ibex
steady basalt
#

binary cross entropy?

mild dirge
#

Any good book on linear algebra someone would recommend? If it is specific to machine learning that would be a bonus

#

Read through some books about general mathematics for ml, but it did not go deep enough into linear algebra

#

Also started on the book by bishop but it seemed kinda dense, and it is also quite old, so the chapter on NN is probably a bit deprecated

grim coral
#

hi

wooden sail
#

axler's linear algebra done right is great, but it's a lot more abstract. i don't think it even deals with matrices until rather late on, focusing on abstract vector spaces and linear transformations first

#

e.g. integration, differential operators, operations on polynomials and the like

mild dirge
wooden sail
#

was there anything in particular you wanted to reinforce? the main point in strang's linalg is his so-called fundamental theorem of linalg, which relates the "4 fundamental subspaces" associated with matrices as linear maps

mild dirge
#

I just feel in general that I don't have a good grip on the mathematics behind some of the machine learning methods. So I want to get a good understanding of that before I move on to getting a better understanding of more complicated deep learning concepts.

wooden sail
#

hmmm it's very likely that linalg alone won't be enough to make you feel comfortable with ML methods. it's a great place to start though

mild dirge
#

I am also reading some stuff on probability and statistics

wooden sail
#

so, linalg and stats help you in defining the problems you want to solve, but not in solving them 😛

#

that's still a separate topic that comes after

mild dirge
#

Yeah sure

#

Imo my uni just did not spend enough time on the basics, and just started to move onto topics like deep learning and cnns without really explaining why that stuff works

wooden sail
#

are you in bsc? engineering?

mild dirge
#

Master AI

wooden sail
#

huh, then that's kinda bad

mild dirge
#

But it's a bit of a jack of all trades, we can choose a lot of our courses

#

But I tend to choose more of the computer vision and mathematics type stuff

#

But there is also cognitive ergonomics (like we had to design a UI for a satnav f.e.) which I don't care about a lot

#

In hindsight I probably should have chosen ML master, but I feel like I can supplement most of the stuff with self-study

wooden sail
#

that's certainly the goal of masters programs, to prepare you to be able to teach yourself what you need

severe oriole
#

guys I just want to ask but is Kaggle a good place to study all about data science and practice competitions in it for experience?

wooden sail
#

kaggle is pretty good, sure

#

it won't teach you everything, but it's a nice complement to your studies

mild dirge
severe oriole
#

I'm still pretty much a beginner in this field right now tbh

mild dirge
#

And hopefully enough knowledge to just gain some practical exp

severe oriole
#

and what should I go for next after Kaggle then

mild dirge
#

You could maybe even join a team for competitions @severe oriole

wooden sail
#

kaggle is nice due to the challenges and availability of data, but you should learn some theory that you can throw at them

severe oriole
#

should it be practicing with Tableau for data visualization

severe oriole
#

And by theory you mean like the basics of statistical theory, right

mild dirge
#

Yeah or just different types of machine learning models

severe oriole
#

Thanks a lot man. Let me search more before I come back here and ask again

#

Right now I just know 3 models from the intro: decision tree, random forest, and the validation one

#

Lemme find more abt this

mild dirge
#

Those 2 could be used on tabular data (like just numbers)

#

But you would also want to learn about multi-layer perceptrons, and stuff like CNNs (convolutional neural networks) for image data

severe oriole
#

Interesting. That's something new there for me

#

Now I'm even hyped more to finish this kaggle thing before diving into that xd

mild dirge
#

Well I would think this would come before trying to classify a lot of kaggle datasets

#

You would use these models on the data

severe oriole
#

Oh then it's better if I know them first right

#

Since they're models and knowing to use them helps me when I go and try those kaggle datasets is what you mean?

mild dirge
#

Yes, exactly

#

You could use kaggle datasets to try them out while you are learning about them though

severe oriole
#

Noted. Let me try monkeying with data around to see about that

#

I'll keep the update here then

desert oar
desert oar
# steady basalt

i think it would help a lot if you showed your actual model-fitting and evaluation code

#

personally i've never seen output this pathological even on egregiously unbalanced datasets. to the point where i'd sooner suspect a bug in your code than a problem in your procedures

livid goblet
#

Where do I start learning AI ? any beginner friendly resources ?

desert oar
tropic matrix
#

Alright, thank you!

desert oar
#

@tropic matrix @lapis sequoia i've also had good results w/ "halving search" (which is available in scikit-learn) for such problems. for black-box optimization though, i suggest the Optuna library, and i suggest comparing both techniques.
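for intuition, a bare-bones sketch of the halving idea: try many candidates on a small budget, keep the better half, double the budget, repeat. `noisy_score` is a made-up stand-in for "train and return a validation score", and the learning-rate range is an assumption:

```python
import math
import random


def noisy_score(learning_rate, budget, rng):
    """Toy stand-in for training with a given budget and returning a validation score.

    By construction the best learning rate is 0.01, and the noise shrinks as budget grows.
    """
    quality = -abs(math.log10(learning_rate) + 2.0)
    return quality + rng.gauss(0.0, 1.0 / budget)


def successive_halving(candidates, min_budget, rng):
    """Repeatedly score the survivors, keep the top half, and double the budget."""
    budget = min_budget
    survivors = list(candidates)
    while len(survivors) > 1:
        scored = sorted(
            survivors,
            key=lambda lr: noisy_score(lr, budget, rng),
            reverse=True,
        )
        survivors = scored[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]


rng = random.Random(0)
candidates = [10 ** rng.uniform(-5, 0) for _ in range(16)]
best = successive_halving(candidates, min_budget=1, rng=rng)
print(best)
```

in scikit-learn the real thing is `HalvingGridSearchCV` / `HalvingRandomSearchCV`, which are still experimental and need `from sklearn.experimental import enable_halving_search_cv` before import.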

tropic matrix
# tropic matrix Alright, thank you!

One more question, is it effective to tune hyperparameters on just a portion of the dataset (if the dataset is large and can take hours per epoch)?

#

@desert oar

desert oar
tropic matrix
#

Alright, ig i'll see if it ends up becoming effective

desert oar
#

it's a very good technique for testing that your models actually work though

livid goblet
desert oar
#

there is a philosophical argument to be made that ml is ai, but it's important to distinguish between the appearance of intelligence and actually having intelligence.

serene scaffold
lapis sequoia
iron basalt
# desert oar kind of. "ai" in practice is more like "ml but with fancy window dressing". real...

What makes this very confusing is that they used to be considered the same thing (ML/AI). But then a rift formed between the two as there was a large push against statistical/probabilistic methods from the established symbolic AI community which was gate-keeping (they were considered the authority on AI at the time). This caused the "AI Winter" to happen because symbolic AI was going nowhere in terms of progress and all funding / effort to probabilistic methods and especially neural networks had been cut (at the universities in general, there still was much progress being made but it had no recognition). This went on until the performance of neural networks was too great to ignore, especially after CNNs completely blew all previous computer vision out of the water.

desert oar
iron basalt
#

Now I like to use ML as a term to avoid debate over what AI is and instead just work on something productive.

#

Because everyone has an opinion on what AI is. You can see the difference if you for example look what you can find on the AI subreddit vs the ML subreddit or anywhere else.

#

(One is productive in majority, the other would be productive in majority if it was not always spammed with debate that leads nowhere)

desert oar
#

i think the other problem is marketing and media hype

#

really complicates terminology and makes me want to avoid calling anything "ai" as much as possible

iron basalt
#

Yeah, it has caused everyone to have an opinion on it by muddying the definition to make it accessible to all for opinions. You will hear an opinion from everyone on things that are easy to have an opinion on. I don't see everyone having an opinion on the merits of batched gradient descent vs not.

desert oar
#

right, i will continue to let the "technologists" emptily debate on medium.com, while i go solve business problems

iron basalt
#

It's fine to do such debate, but at some point I just choose what matters to me (what do I find interesting / productive) and go do some work.

ornate shard
#

hello guys anyone available to help me with my project in a voice chanel for just 5 mins ?

wooden sail
iron basalt
#

👍

serene steeple
#

guys i downloaded a compressed data in a "zst" format, do i need to extract the file in it or is supposed to be just the compressed zst file ?

quaint leaf
modest onyx
# steady basalt hm?

Well we can't really help you with the information that you gave us. Could be a bug in your implementation

mint palm
#

what does it mean to say "anomaly detection methods are quite sparse"?
does it mean there less relevant features compared to useless ones?

unique flame
#

I would interpret it as the algorithms used. Like K-means clustering,

#

k nearest neighbour, hierarchical clustering

mint palm
#

^"sort of"

unique flame
#

no as in I can't think of any more methods

mint palm
#

sparse meant taking only highly pronounced features, thats what i wanted to know

vale solstice
#

Has anyone done Andrew Ngs Machine Learning Specialization that just came out?

last peak
#

im working on it

modest onyx
#

I did the classic one a while ago

dusty valve
#

it's beautiful

#

holy shit it just hit 100% accuracy

#

eeeeeee

#

it actually hit 100%, but it went down after

#

screw it, my 48% accuracy text model did better

#

it produced actual words, although they did not make sense

#

i want to string the python

#

maybe i just need a larger dataset

#

im just gonna train it on my all my text messages

modest onyx
#

is this an lstm?

lapis sequoia
#

is anyone good with tensorflow image classification btw
im having a hard time learning it

dusty valve
lapis sequoia
#

right?

dusty valve
#

Yes

modest onyx
#

damn

lapis sequoia
#

finally

#

can u help me :d

modest onyx
#

I tried building my own rnn and lstm and they won't budge

dusty valve
#

No

modest onyx
#

not learning at all

lapis sequoia
#

the official docs look like cancer to mee

dusty valve
#

Same

modest onyx
#
class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim) -> None:
        super(LSTM, self).__init__()

        self.input_linear = nn.Linear(input_dim + hidden_dim, 4*hidden_dim)
        self.output_linear = nn.Linear(hidden_dim, input_dim)

        self.hidden_state = torch.zeros(hidden_dim)
        self.cell_state = torch.zeros(hidden_dim)

    def forward(self, input):
        
        self.hidden_state, self.cell_state = self.hidden_state.detach(), self.cell_state.detach()
        # l is the length of each gate
        l = len(self.hidden_state)

        concatenated = torch.cat((self.hidden_state, input), axis=0)
        temp = self.input_linear(concatenated)
        i, f, o, g = (
            torch.sigmoid(temp[:l]), 
            torch.sigmoid(temp[l:2*l]), 
            torch.sigmoid(temp[2*l:3*l]), 
            torch.tanh(temp[3*l:])
        )
        self.cell_state = f * self.cell_state + i * g
        self.hidden_state = o * torch.tanh(self.cell_state)

        output = self.output_linear(self.hidden_state)
        return output
    
    def reset_states(self):
        hidden_dim = len(self.hidden_state)
        self.hidden_state = torch.zeros(hidden_dim)
        self.cell_state = torch.zeros(hidden_dim)
lapis sequoia
modest onyx
#

legit won't learn

dusty valve
#

I had to tweak with a bunch of stuff and rerun a lot

dusty valve
lapis sequoia
#

ohhh

dusty valve
#

I wanted to perfect my first

lapis sequoia
#

I want to make hcaptcha classifier but cant

dusty valve
#

When I get my discord messages I’m gonna train it on them

dusty valve
arctic wedgeBOT
#

5. Do not provide or request help on projects that may break laws, breach terms of services, or are malicious or inappropriate.

lapis sequoia
#

its not breaking law

#

I'm doing it as a fun project

#

it will be fun to learn tensorflow while trying to make it

dusty valve
#

That can be used to bypass captchas

lapis sequoia
dusty valve
#

That doesn’t matter

#

If it can be used maliciously we can’t help

lapis sequoia
#

well I find that a captcha identifier will motivate me to learn AI completely But alright I'm good with the rules
lets keep that apart

I just need help with tensorflow

#

nothing seems understandable to me

modest onyx
#

I'mma try making it more than one block deep and see how it goes

#

maybe my model's capacity is currently too low to generate text

lapis sequoia
#

then would you mind helping me :d

modest onyx
#

no I don't know tensorflow that well

lapis sequoia
#

oh

modest onyx
#

because I use pytorch

lapis sequoia
#

pytorch

#

pytorch can it classify images?

modest onyx
#

ofc

#

classifying images is now one of the simplest things to do using frameworks

lapis sequoia
#

oh cool

modest onyx
#

@dusty valve what encoding/decoding method did you use?

#

I did a bit of research and it seems my problem comes from a common problem known as text degeneration

#

which comes from using the naive encoding approach which I am using

modest spire
#

xD such a pretty graph

lapis sequoia
#

is anyone willing to help me with tensorflow

unique flame
#

For captcha classifying (and possibly bypassing)? no

lapis sequoia
#

I just want to learn image classification

#

to create stuffs like traffic signal detector etc

#

so can you help me with that

modest onyx
#

If you can't build a simple image classifier using tensorflow then you've definitely not done your homework

#

Go search a tutorial or better yet learn the underlying theory first

#

Thats if your goal is to learn

unique flame
#

So I'm trying to make a confusion matrix for the validation set of my data, but the accuracy shown using model.evaluate() seems to be different than the accuracy shown in classification_report(), so the confusion matrix would already look weird. Anyone know what I'm doing wrong? The batch size for this specific goal was set to the amount of images in the data-set as some comments on stack overflow recommended to do that.

validation_dataset = image_dataset_from_directory(Path_to_images,
    image_size=(400, 400),
    validation_split=0.3,
    subset="validation",
    seed=2,
    batch_size=1801)

model=keras.models.load_model(Path_to_model)

#first method to get accuracy
val_loss, val_acc = model.evaluate(validation_dataset)
print(f"Validation accuracy: {val_acc:.3f}") #prints acc 93%

#second method to get accuracy also to get confusion matrix
y_true = np.concatenate([y for x, y in validation_dataset], axis=0)
y_pred = model.predict(validation_dataset).argmax(axis=1)
print(classification_report(y_true, y_pred)) #prints acc 17%
iron tusk
#

Hello smart Python people! I have some problems with the MediaPipe realtime pose tracker. So I can get the landmarks and everything, but the data is very noisy and practically unusable. Does anybody know any way of smoothing realtime landmark data?

worthy phoenix
#

hi, i was trying to make an image synthesis which uses clip, but i wanna run it on my cpu, i get this error:

RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'

here is the code:

def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)

im using pytorch if thats not clear, any help would be appreciated

#

but i dont wanna comment it out tbh

unborn crow
#

hey, how can i show my df after i ran through a sklearn pipeline?

dim palm
unborn crow
dim palm
#

just print it ?

dim palm
sterile heath
#

@tired matrix

tired matrix
#

Thanks

real oyster
#

Anyone here have experience with ML in Python?

Question regarding CNN in Keras that I am building:

I have 10000 images and training a CNN but I added data augmentation but as soon as I did it is taking like 10 hours to train my model AND even if it does and I trained it on like 40 EPOCHS it reaches an efficiency of like 61%. Is there any way I can speed it up? I guess a easy fix would just be to increase epochs to like 100 and get higher efficiency cause longer training time and higher epochs but like that is going to take 2 full days. Am I being too impatient or what do you think? Thank you!

I am a beginner

mild dirge
#

Alright, so first of all, what is the model that you use, and is that "efficiency" the accuracy?

real oyster
#

Yes like the accuracy is taking so long

mild dirge
#

How many classes are there?

real oyster
#

Like in my dataset all I have are 5000 images of "Yes it is a blowdryer" and 5000 images of "No it is not a blow dryer", 0 or 1

#

So like 2 classes?

mild dirge
#

And your accuracy stays at 0.5?

real oyster
#

Kinda with little to no improvment

mild dirge
#

That is as good as random guessing then

real oyster
#

Yes

#

And this was after I added data augmentation

#

Before when I didn't have data augmentation which I heard was bad, my model was overfitting like crazy

mild dirge
#

Alright, so it might be that something horribly goes wrong then, because it is making basically nothing from your data right now

#

What did you get before data augmentation?

real oyster
#

But it was overfitting all the way to hell

mild dirge
#

training or validation?

real oyster
#

training

mild dirge
#

training accuracy is only really useful to check if you are over/under-fitting

#

validation is what we care about

real oyster
#

All I know is that the difference in val_loss and loss was like so large and the val_accuracy was staying the same

#

So it was overfitting

mild dirge
#

Yes, but how much

#

what was your validation accuracy

real oyster
#

82%

#

84%

mild dirge
#

That is already quite decent

#

Better than random at least

real oyster
#

Yes but I need it to be like 95%+

mild dirge
#

And what does your model look like

#

And what have you tried to prevent overfitting

real oyster
#

`model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(4,4))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))

model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`

#

This is what it is right now

#

I added data augmentation and it became the results I just showed you

#

Before this was performing at like 90%+ but was overfitting like crazy

mild dirge
#

Why do you have sigmoid at the end, with 2 neurons?

real oyster
mild dirge
#

Well, you chose it 😛

real oyster
#

I asked another person and they said that since all I have is 0 and 1 make it 2 so I did

#

Like Blow dryer or no blow dryer

#

Could you help me understand

#

What would u change it to?

mild dirge
#

Sigmoid in the final layer is mostly used when you want to check if a class is present or not

mild dirge
#

And each node then represents a class

real oyster
mild dirge
#

So if you use sigmoid with 2 classes, you basically want to have 1 neuron in the final layer

#

which is the blow dryer

#

OR you could use softmax with 2 neurons
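A quick numeric check of why those two options are interchangeable, using made-up logit values: a 2-neuron softmax gives the same positive-class probability as a single sigmoid applied to the difference of the two logits. Two independent sigmoids, by contrast, do not form a probability distribution.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logit_neg, logit_pos = 0.3, 1.7  # hypothetical outputs of the last Dense layer

p_softmax = softmax([logit_neg, logit_pos])[1]
p_sigmoid = sigmoid(logit_pos - logit_neg)

# two separate sigmoids treat the classes as independent yes/no questions
two_sigmoids = [sigmoid(logit_neg), sigmoid(logit_pos)]
print(p_softmax, p_sigmoid, sum(two_sigmoids))
```

So for a yes/no problem, `Dense(1, activation='sigmoid')` with binary cross entropy and `Dense(2, activation='softmax')` with categorical cross entropy are the two usual pairings.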

real oyster
#

model.add(Dense(1, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))

#

So like that?

mild dirge
#

Yes, but you might want to change your labels from [0 1] [1 0] to just [0] and [1]

real oyster
#

Right now my Labels are already that

#

I have it seperated in 2 folders

mild dirge
#

I doubt this would change a lot about the accuracy, but this is the common way to do it

#

When looking at your model, it also seems that you start with a lot of filters, and you use maxpooling that decreases the size by a lot

mild dirge
#

I would try only using maxpooling with 2x2, and start with less filters, and have more filters for later layers

mild dirge
#

As an image typically has fewer distinct small patterns and more complex patterns

#

And later layers can extract more complex features from smaller features

real oyster
#

So like should I remove Conv2D too?

#

Or just change all the MaxPoolings to (2,2)

mild dirge
#

you need convolutional layers

#

they are the ones extracting patterns

real oyster
#

yes

mild dirge
#

but maxpool with 3x3 removes a lot of information all at once

#

or 4x4 even
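To put rough numbers on that, here is a sketch that traces the feature-map size through the model posted above. It assumes Keras defaults (no padding on the first Conv2D and on pooling, pooling stride equal to the pool size); the helper names are made up:

```python
def conv_valid(size, kernel):
    """Output width of a convolution without padding and with stride 1."""
    return size - kernel + 1


def max_pool(size, pool):
    """Output width of max pooling without padding, stride equal to the pool size."""
    return (size - pool) // pool + 1


def final_feature_map(pools, input_size=256, last_filters=32):
    """Return (width, flattened length) after three conv + pool stages."""
    size = conv_valid(input_size, 4)  # first Conv2D: kernel 4, no padding
    size = max_pool(size, pools[0])
    size = max_pool(size, pools[1])   # the padding="same" convs keep the width
    size = max_pool(size, pools[2])
    return size, size * size * last_filters


original = final_feature_map((4, 3, 3))
gentler = final_feature_map((2, 2, 2))
print(original, gentler)
```

With the original pooling the 256x256 image is squeezed down to 7x7 before the dense layers, compared with 31x31 when every pool is 2x2.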

real oyster
#

Ok gotcha

#

what else would you recommend thank you btw

mild dirge
#

But this is mostly just gut feeling, from trying out stuff with datasets

real oyster
#

`model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`

#

changed it to 2,2 now

real oyster
mild dirge
#

So normally you want to increase the receptive field (each layer should be able to detect patterns from larger areas of the image)

mild dirge
#

And you want to increase the amount of filters further in the model

real oyster
#

gotcha

#

so how would I do that programmatically

mild dirge
#

What do you mean?

#

Do what?

real oyster
#

Like what would you change in my model

#

Can you show me programatically

#

code snippet

mild dirge
#

The amount of filters for the conv2d layers

real oyster
#

So increase that amount?

mild dirge
#

They are just parameters

#

incrementally over the layers yes

#

Later layers have more filters, earlier layers less

real oyster
#

O I see

#

So like this??

#

Is 16 a good starting point?

mild dirge
#

Yeah

#

could be

real oyster
#

Ok gotcha thank you

#

Any other recommendations?

mild dirge
#

There isn't a set rule, this is just common sense considering there are typically not that many small patterns, and more large patterns

real oyster
#

O ye I am also using the Adam optimizer
tf.keras.optimizers.Adam(learning_rate=0.01)

mild dirge
#

Yeah, could be okay

#

Just another hyper parameter you can tune

#

A lot of this is trial and error

real oyster
#

Do you really think changing the maxpooling to (2,2) and changing the filter to that is going to make a difference?

mild dirge
#

You should try a lot of cross validation with different parameters to see what works best

mild dirge
#

Do you know k-fold cross validation?

real oyster
mild dirge
#

You should look it up then, you need it to find out which model gives good results

#

You can also just have training/validation/test split

#

But that means you would use the same data to train and validate on for each set of hyper parameters
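
For reference, a minimal pure-Python sketch of what k-fold splitting does, assuming the samples can be addressed by index. The helper name `k_fold_indices` is hypothetical; scikit-learn's `KFold` offers the same idea ready-made.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Every sample is used for validation exactly once, so each set of hyperparameters gets scored on k different train/validation splits instead of a single fixed one. In practice the indices would usually be shuffled first.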

real oyster
#

Hmm ok

mild dirge
#

In any case, make sure to not use your test set while you are still working on your model

real oyster
#

Btw thank you

#

It is looking way better..btw I just ran it

mild dirge
#

I would first just try it out on unaugmented data, because it seemed that gave some issues the last time

#

and when you seem to hit a wall, you can try that to decrease overfitting

#

You could also add a dropout layer between two dense layers that you have

#

And maybe add another dense layer, because you only have 2 right now

real oyster
#

So ur saying don't use data augmentation cause its messing it up and do it without. And when it overfits add more dense layers or dropout

real oyster
mild dirge
#

No, I would just add another dense layer, because 2 dense layers often isn't enough to predict the class using the convolutional layer features

#

And you can add a dropout layer to decrease overfitting

mild dirge
#

Dropout layer means that some activations get deactivated (set to zero) randomly between two layers during training

#

This means the model can't just base its decision on a set of features, it will need to make use of all/more of them
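
A small sketch of the mechanism, written in plain Python rather than Keras. The function name `dropout`, the seed, and the feature values are made up; the scaling by `1 / (1 - rate)` mirrors what the Keras `Dropout` layer does during training.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
features = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
dropped = dropout(features, rate=0.5, rng=rng)
print(dropped)
```

Because a different random subset is zeroed on every training step, the next layer cannot rely on any single feature always being present. At inference time the layer passes values through unchanged.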

real oyster
#

Ok gotcha

#

Thank you so much for the help

mild dirge
#

yeah nw

real oyster
#

I will ask again. Right now I will take all ur advice and then test it all out. And then come back again. Thanks again!

#

Gonna first try without augmentation

real oyster
#

model.add(Dense(32))

#

But how many neurons would I have? Still 32?

wooden sail
#

try a number between the previous layer and the next one

#

somewhere between 32 and the size of the output

real oyster
#

Hmm ok

wooden sail
#

that's where you're putting the dense layer, between the 32 one and the output, right?

real oyster
#

model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))

#

So like this? I added the model.add(Dense(16))

#

inbetween

wooden sail
#

sure

real oyster
#

Gotcha thanks

#

will update on the progress. i have a bad feeling it's going to overfit... again

mild dirge
#

You haven't added a dropout inbetween, which may also help

#

And still using sigmoid with 2 neurons for a binary prediction problem

wooden sail
#

ah yeah you can boil that down to a single neuron output. but i'd say it's good to change one thing at a time and see what improvements you get
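
As a quick numeric check of why one output is enough for a binary problem: a softmax over two logits gives the same probability as a single sigmoid applied to their difference, whereas two independent sigmoids need not sum to 1. The logit values below are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax2(a, b):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


logit_a, logit_b = 1.2, -0.4  # hypothetical raw outputs of a 2-neuron layer

# Two separate sigmoid outputs: each lies in (0, 1), but they are not a distribution.
two_sigmoids = (sigmoid(logit_a), sigmoid(logit_b))
print(sum(two_sigmoids))

# Softmax over both logits vs. one sigmoid on their difference.
p_a, p_b = softmax2(logit_a, logit_b)
single = sigmoid(logit_a - logit_b)
print(p_a, single)
```

With a single sigmoid neuron the second class probability is simply `1 - p`, so the second output neuron carries no extra information.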

real oyster
#

I am planning on changing it one thing at a time and then updating my progress in this chat. Thank you so much @mild dirge

#

And thank you @wooden sail for pitching in

#

Ok wait are you seeing this:

#

Again, it is starting to overfit

#

This is without data augmentation and my model looks like this:
`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`

After that runs I will add another dense layer and dropout

wooden sail
#

but both the acc and val acc are increasing together, i wouldn't say that's overfitting, at least not all that bad

real oyster
wooden sail
#

can you show it?

real oyster
#

Oh no I will show it after it runs completely

#

but I have done it before and played around and the graph does not look nice. Like the val_loss and loss difference is too great

#

eventually

#

@wooden sail the difference is getting very large

#

loss and val_loss

wooden sail
#

well, i've certainly seen worse models lol

real oyster
#

Ok but I want this model to be good, not just better than the worse ones

wooden sail
#

the loss is not normalized, so the raw number doesn't mean much to me

real oyster
#

Are you sure?

#

@mild dirge can u look at the image too

#

Cause I was told it was overfitting

wooden sail
#

if the loss has a dynamic range of thousands, for example, this is a tiny percentage difference

mild dirge
#

Why do you have the validation accuracy rounded to 1 decimal

#

Is it rounded, or floored?

real oyster
#

It isnt

#

It is cut off and wraps onto the next line

#

Look at the next line, it's 0.8272 or whatnot

mild dirge
#

Well it is overfitting, but not by that much

real oyster
#

But it's overfitting

#

dammit

wooden sail
#

let it get to like 10 or 15 epochs and if you're not happy with the performance, add in the dropout that pccamel recommended

real oyster
#

Ok so next step looking at the current model:
`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`

I will add dropout, another dense layer inbetween, and decrease epochs to 10 or 15?

#

Is doing the following good:

#

I will add dropout, another dense layer inbetween, and decrease epochs to 10 or 15?

wooden sail
#

sounds ok

#

but really, validation accuracy is normalized, but loss isn't

#

looking at the loss alone is not always enough to say what the performance is like. the validation was looking ok (even though it could be better)
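
To illustrate the point about loss versus accuracy, here is a toy comparison with made-up labels and predicted probabilities. Both prediction sets make the same single mistake, so accuracy is identical, but binary cross-entropy punishes the confident mistake much harder.

```python
import math


def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy; unbounded above as a wrong prediction nears 0 or 1."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs))
    return -total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of correct predictions; always between 0 and 1."""
    return sum((p >= threshold) == bool(y) for y, p in zip(labels, probs)) / len(labels)


labels = [1, 0, 1, 0]
cautious = [0.6, 0.4, 0.6, 0.6]             # last one is wrong, with low confidence
overconfident = [0.99, 0.01, 0.99, 0.999]   # last one is wrong, with high confidence

for name, probs in [("cautious", cautious), ("overconfident", overconfident)]:
    print(name, accuracy(labels, probs), round(binary_cross_entropy(labels, probs), 4))
```

This is one way a validation loss can drift away from the training loss while validation accuracy still looks reasonable, so it helps to read both curves together.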

mild dirge
#

Not sure if an even kernel size is useful/common

#

Also, your first flattened layer has 65,000 ish neurons

#

So that will also not be too good for overfitting I'd think

wooden sail
#

yeah i was thinking of that as well, regarding the size of the flattened layer. maybe use like dense 2048, 512, 128, 32, 1 or something like that. with dropout before the first dense and maybe between some of the other dense layers

mild dirge
#

That's about 2 million weights

#

Between the flattened layer and the first dense layer

wooden sail
#

as for the even kernel, it's fine, but it's not symmetric, which might introduce some unexpected or difficult to interpret behaviors

mild dirge
#

It's probably better to add another conv+maxpool

#

Or maybe a convolutional layer with a stride higher than 1

real oyster
mild dirge
#

The biggest problem right now is having 2 million weights between two of your dense layers

#

That would be the first thing to fix

real oyster
#

How can I fix that

#

I know I am asking a lot but would it be ok to hop on a quick call

#

I will share my screen

#

If not that's completely ok

real oyster
mild dirge
#

I replied to a message with a tip to reduce the amount of neurons

#

I can't go on voice right now

real oyster
#

Oh ok

wooden sail
#

try adding another conv with max pool, but don't increase the number of filters anymore

#

that'll cut the size of the output roughly in half before going into the dense layers

#

and put a dropout there, too

wooden sail
#

i flattened in my head in the wrong order, oops

#

1/4 is right, yeah
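
The sizes discussed above can be checked by hand. This sketch assumes the model as posted (256x256 input, Keras default `"valid"` padding on the first convolution, `"same"` on the others, 2x2 pooling with floor division); the helper names are made up.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (odd sizes are floored)."""
    return size // pool


def flattened_size(extra_block=False):
    size = pool_out(conv_out(256, 4, "valid"))   # Conv2D(16, kernel_size=4): 253 -> 126
    size = pool_out(conv_out(size, 3, "same"))   # Conv2D(32): 126 -> 63
    size = pool_out(conv_out(size, 3, "same"))   # Conv2D(64): 63 -> 31
    if extra_block:
        size = pool_out(conv_out(size, 3, "same"))  # one more Conv2D(64): 31 -> 15
    return size * size * 64                      # 64 filters in the last conv layer


before = flattened_size()
after = flattened_size(extra_block=True)
print(before, before * 32)       # 61504 flattened values, 1968128 weights into Dense(32)
print(after, after / before)     # 14400 values, a bit under 1/4 of the original
```

So the flattened layer comes out slightly below the 65,000 estimate, at 61,504 values, which gives just under 2 million weights (biases not counted) going into `Dense(32)`. One extra conv + max-pool block with the same 64 filters shrinks it to 14,400.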

real oyster
#

`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(Dropout(0.2))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(Dropout(0.2))
model.add(MaxPooling2D(2,2))

model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))

model.add(Flatten())
model.add(Dense(64))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(Dropout(0.2))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`

#

Ok so like this?

wooden sail
#

i take back my 2048 recommendation, that'll make the problem monstrous lol

#

try 64 or keep it at 32 as you already had it, i had neglected the size of the conv output earlier

real oyster
#

I updated my message

#

does that look good

wooden sail
#

yeah

dusty valve
#

tbh that .join was not necessary

dusty valve
#

yes something is up with my encoding

dusky storm
#

Can someone help me? I don't know where I went wrong. I gave the correct import file, but it's showing me a no module error. Any fixes?

iron tusk
#

You should try adding a dot before Backend in line 1

hidden fern
#

i need a pseudo code for:

if column[a] has values: X, Y, Z then create a new column with values that say “Alphabet” else impute values “not alphabet”
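
One possible way to write this in pandas, assuming the column is literally named `a` and the new column can be called `label` (both names, and the sample data, are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["X", "Q", "Z", None, "Y"]})  # made-up example data

# True where the value is one of X, Y, Z; missing values count as "not in the list".
is_alphabet = df["a"].isin(["X", "Y", "Z"])

# Pick one of two strings per row depending on the boolean mask.
df["label"] = np.where(is_alphabet, "Alphabet", "not alphabet")
print(df)
```

`Series.isin` builds the boolean mask and `numpy.where` chooses between the two strings row by row, so no explicit loop is needed.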

desert oar
#
  1. please don't post screenshots of your IDE. explain your problem in words and post your error output & file structure as text using a code block.

  2. this isn't a data science question. please carefully read #❓|how-to-get-help.

dusky storm
#

Uhm sorry by the way !!! I will delete it

tidal bough
#

How can I, after something like df.groupby("username"), apply a custom function to each group that collapses each group into one row?

#

I thought .groupby().apply did that, but it seems to be called on each row, not on each group.

wooden sail
#

how about df.loc[some_condition_on_some_col, some_other_col].apply()? though i guess that requires repeating the process several times, hmm

tidal bough
#

Not sure how that'd help - yeah, I want to collapse every group (all rows with the same username) into one row.

wooden sail
#

you want to actually collapse it or just apply the same func to all of the ones with the same username?

tidal bough
#

Actually collapse. So I want groupby to call my function with an entire group at a time, and it'd return one row per group.

wooden sail
#

maybe agg? but i'm out of my depth in this one

untold bloom
#

.apply is capable of accepting a function returning 1 row as well as .agg

#

but the thing is: there might be a better way without these if the function is not so customized

untold bloom
#

.apply will work as well, except it will retain the grouper column as a column in the result, too.

tidal bough
untold bloom
#

need to prove it :p

tidal bough
#

as for aggregate, hmm

untold bloom
#

it can call your function for the first group twice to seek a fast path if possible, but no, it won't call it per row of groups; it will do for the entire group.

tidal bough
#

With aggregate, the issue seems to be that it works separately on each column

untold bloom
#

yes; you'd want .apply, then
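
A small sketch of the `.apply` approach, with made-up data and a hypothetical `summarize` function. The function receives each group as a whole DataFrame and returns a `pd.Series`, which pandas stacks into one row per group. The value columns are selected explicitly here so the function only sees those columns.

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["ann", "bob", "ann", "bob", "cat"],
    "score": [3, 5, 7, 1, 4],
    "attempts": [1, 2, 3, 4, 5],
})


def summarize(group):
    """Called once per group with all of that group's rows; returns one row."""
    return pd.Series({
        "best_score": group["score"].max(),
        # Uses two columns at once, which per-column .agg cannot do.
        "score_per_attempt": group["score"].sum() / group["attempts"].sum(),
        "n_rows": len(group),
    })


summary = df.groupby("username")[["score", "attempts"]].apply(summarize)
print(summary)
```

The result is indexed by `username` with one row per user; calling `summary.reset_index()` turns the usernames back into an ordinary column.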

#

afk&

tidal bough
#

this is some sort of magic, huh