#data-science-and-ml
this is exactly why we do stratified sampling. if the rare class is 5% of cases, you want to make sure that it's 5% in both the train and test sets if possible. then you can oversample in train later.
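a rough sketch of what a stratified split looks like, assuming scikit-learn. the toy data, the 5% rate and the 80/20 split are made up for illustration, not taken from the actual dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 50 of them in the rare class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the rare-class share the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"rare class share: train={train_rate:.3f}, test={test_rate:.3f}")
```

any oversampling would then be applied to `X_train, y_train` only, leaving the test split untouched.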
that isnt a issue
hmm?
i see that as cheating as test data shud be totally unseen
it's not cheating. it's a valid technique. the test data is meant to be a simulation of unseen data, but your data
how do i fix this?!
ok so about the same
yeah with that much data it should be ok. so alright, you've ruled out a pathological case like having literally 5 instances of the rare class in the test set
1800 is more than enough
next question: feature distributions. are there any rare feature values that show up in one split but not the other?
is there a quick and ez way to test that
not in general. but conceptually it's "groupby and compute the distribution"
or just compute the distribution in both sets and compare to the baseline
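the "compute the distribution in both sets and compare" idea could be sketched like this with pandas. the column name `smoker` and the tiny frames are hypothetical, and `compare_shares` is just an illustrative helper, not a library function:

```python
import pandas as pd

# Hypothetical train/test frames with one categorical feature.
train = pd.DataFrame({"smoker": ["no", "no", "yes", "no", "former", "yes"]})
test = pd.DataFrame({"smoker": ["no", "yes", "no", "unknown"]})


def compare_shares(train_col, test_col):
    """Share of each category in train vs test, largest gap first."""
    shares = pd.concat(
        [
            train_col.value_counts(normalize=True).rename("train"),
            test_col.value_counts(normalize=True).rename("test"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing from one split gets share 0
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


report = compare_shares(train["smoker"], test["smoker"])
only_in_test = sorted(report[report["train"] == 0].index)
only_in_train = sorted(report[report["test"] == 0].index)
print(report)
print("only in test:", only_in_test, "| only in train:", only_in_train)
```

for one-hot encoded columns the same helper works on each dummy column; for continuous columns, comparing `describe()` output or histograms is the usual equivalent.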
how many features do you have anyway? what kind of model is this?
ok there were about 10 features but after dummifying its 50
a handful of continuous
random forest for now, but will change that later once fixed
to xgb maybe
they're all categorical?
so, histograms for cont data and countplots for the onehot encoded data?
i have 5 numerical
altho about ~70% of one of them was imputed w median
generally it's not necessary to dummify categorical variables in tree-based models. if you do that, you end up skewing your model to over-fitting on high-cardinality features. sometimes it's actively harmful.
are you doing this inside the training loop? just checking. the median should only be computed on the train set, not the full data (train+test set).
well that's an obvious source of data leakage
https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769 here's a good illustration of the one-hot encoding problem
so its fillna of both test and train with train median
ah man, i imputed modes too way earlier when i first loaded the data
yeah that's an easy mistake to make. assuming you're using scikit-learn, you'll want to add it as a step in your sklearn pipeline and not at the beginning when loading data
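a minimal sketch of imputation as a pipeline step, assuming scikit-learn. the data is a toy example and the random forest settings are arbitrary; the point is only that the median is learned during `fit` on the training data and reused on the test data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): one numeric feature with a missing value.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0], [4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[np.nan], [1000.0]])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median learned in fit()
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(X_train, y_train)

# The stored median comes from X_train only; X_test never influences it.
learned_median = model.named_steps["impute"].statistics_[0]
preds = model.predict(X_test)  # the NaN in X_test is filled with learned_median
print(learned_median, preds)
```

passing the whole pipeline to a grid search means the median is also recomputed inside each cross-validation fold.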
actually, im not so sure all of it was a problem
alot of things had to be imputed manually
such as missing variable to a certain category
yeah i don't want to declare prematurely that fixing the data leakage will solve your problem. but it's one thing you'll want to eliminate anyway.
maybe your advice on the balance of the features will work, ill check it out
@desert oar
first comparison
its not too different
omfg i have some outlier
no nvm
outlier shudnt matter
looking better with some other params i chose, but still not good enough for class1 f1
now, if i grid searched for f1, and it thought the best params are when class0 f1 is 0.9 and class 1 f1 is 0.2, how do i instead try to go for both being somewhere in the middle, say 0.7
@desert oar
class-specific f1 for class1 would fix it?
you only have 2 classes, right? use overall f1, not class-specific
so just 'f1' metric no variant
right
but it keeps thinking best params is when class0 f1 is really high
and that means class1 will suck
isn't "class1 f1" what you want anyway?
yeah
assuming the rare class is class 1
so ignore class 0 f1, it's not the thing you care about
well, it kinda matters
but still, class1 f1 sucks
idk why
did u see the histograms above
in this case it doesn't matter because you only have 2 classes - every prediction of "class 1" is a prediction of "not class 0"
a false positive for one is a false negative for the other
i am suggesting that you focus on the f1 score for the class that you care about. and that you should convince yourself in the binary classification case that there's no point in looking at both
i'm not sure what the histograms are telling me
so if recall and precision for the negative class is rly low, im saying my models going to tell people theyre getting x disease when actually they aint?
thats bad
with predicting diseases its probably quite important to consider both classes
i dont want amodel that cannot properly predict people who dont have the disease
it is true that precision and recall don't take into account the true negative %. in this case you have two competing optimization goals
Yes
thats a further challenge, but for now, maybe i shud optimise to make class1 f1 highest rather than just say scorer=f1
you can also look at "balanced accuracy" which is the average of sensitivity and specificity
scorer=f1 is equivalent to "class1 f1", that's what i'm saying
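to make the definitions concrete, here's a small pure-python sketch that computes them from confusion-matrix counts. the counts are invented for illustration; class 1 is treated as the positive class:

```python
def binary_metrics(tp, fp, fn, tn):
    """Metrics for the positive class from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)   # recall of the negative class
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts with class 1 as positive.
m1 = binary_metrics(tp=30, fp=70, fn=20, tn=880)

# Same predictions viewed from class 0: every FP for class 1 is an FN for
# class 0 and vice versa, so the counts simply swap roles.
m0 = binary_metrics(tp=880, fp=20, fn=70, tn=30)

print(m1)
print(m0)
```

precision, recall and f1 differ between the two views, but balanced accuracy comes out the same from either side, which is part of why it was suggested as a more symmetric target.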
the issue with that is that the stupid grid search will think '0.9' precision and '0.1' recall is best bcs the f1s highest, when i cud be getting something better like 0.65 and 0.65
in this case recall>precision
fair enough. maybe you can penalize extreme values somehow
i cud optimise for recall, and have tried, but still got the same bad result
try balanced accuracy maybe
i think this has to be a data problem, i shud step away from grid searching and just use default models until something looks alright
that's also a good decision
start with a baseline model and then optimize after you know it works
so how should i go about fixing this weird issue
i showed you the distributions arent too different
a bit for age, sure but shudnt ruin the model
right, so you've ruled out another issue
now it seems that you need to turn to the imbalance problem itself
well i used smote so its 5050 for train
what are your performance metrics within the train set?
i turned 18k/450k into 450k/450k
it's possible also that the smote results aren't good
i dont think undersampling will help me here either as i need the data
i've heard "mixed reviews" about oversampling and i don't personally use it
smote shud work, i mean its just balancing the class while adding nothing new
for model purposes
i dont think thats the problem
my halving grid search seems to be showing test f1 of 0.7 so far maximum
but im telling u when i print the actual results of the test set its gona suck for either class
especially recall/precision being imbalanced
if i dont fix this soon im gona have to call it quits and submit bad results, which doesnt look good for a thesis
sure, its not about performance but it helps when the reader sees decent numbers that would be helpful in deployment
unfortunately this is just how machine learning goes
i do need to get back to my own work for now, but you'll have to keep trying things and coming up with reasons why the model might be overfitting to one class or another
is there a metric thats balanced f1? so its not maximising by saying 0.99 precision and 0.1 recall and instead tries to find a maximum where both are highest and balanced?
you could e.g. take the average of both f1 scores
i'm not sure if that's equivalent to plain f1 though... might need to write it out
yeah no i meant for class 1 mostly
oh like penalizing extreme values
im not sure, it would be useful though
i meant i prefer 0.7 recall and 0.6 precision over 0.9 and 0.3
actually it's a harmonic mean, it should discourage extreme values anyway
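a quick worked check of the harmonic-mean point, using the precision/recall pairs that came up in the chat (0.9/0.1, 0.65/0.65, 0.6/0.7 and 0.3/0.9). nothing here is model output, it's just arithmetic:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


lopsided = f1(0.9, 0.1)     # extreme pair from the chat
balanced = f1(0.65, 0.65)   # the "middle" pair
preferred = f1(0.6, 0.7)    # "0.7 recall and 0.6 precision"
other = f1(0.3, 0.9)        # "0.9 and 0.3"

for name, value in [("0.9/0.1", lopsided), ("0.65/0.65", balanced),
                    ("0.6/0.7", preferred), ("0.3/0.9", other)]:
    print(f"{name}: f1 = {value:.3f}")
```

so a search that really maximizes class-1 f1 would rank 0.65/0.65 far above 0.9/0.1; if the selected model is still lopsided, the metric itself is unlikely to be the cause.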
optimising f1 again hasnt done well for class 1
its maximising class 0
0.07... terrible
0.55 recall...
you don't know that it's maximizing class 0 precision. you only know that it's coming out high
thats def not trying to optimise class 1 f1
scorer=f1 must be doing average f1
f2_score = make_scorer(fbeta_score, beta=2, pos_label=1) shud i do this
stop and read the docs. f1 score without any further qualifications is precision and recall of the "1" class. and that is how it's almost always used. resist the temptation to guess
that just cant be whats going on here
it has to be, unless there's a bug in your code
```
random_state=42,verbose=10,n_jobs=-1,cv=3,scoring='f1').fit(X_train, y_train)
clf=RandomForestClassifier(**search.best_params_)
y_preds = clf.predict(X_test)
print(accuracy_score(y_test, y_preds))
print(classification_report(y_test, y_preds))
metrics.plot_roc_curve(clf, X_test, y_test)
```
maybe i shud try .95 .05 split ....
xd
900 rare class is still not too bad
i mean, it's possible that it's using the wrong f1 score i guess. try being explicit, scoring=make_scorer(f1_score, average='binary')
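a sketch of the "be explicit" suggestion, assuming scikit-learn. it uses plain `GridSearchCV` on synthetic data rather than the halving search and real data from the chat, and the parameter grid is arbitrary. it also takes the already-refit `best_estimator_` instead of building a fresh classifier from `best_params_`, since a freshly constructed classifier would need its own `fit` call before `predict`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Explicitly score F1 of class 1 only.
class1_f1 = make_scorer(f1_score, average="binary", pos_label=1)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [3, None]},
    scoring=class1_f1,
    cv=3,
)
search.fit(X_train, y_train)

# With the default refit=True this model is already trained on all of X_train.
clf = search.best_estimator_
y_preds = clf.predict(X_test)
test_f1 = f1_score(y_test, y_preds, pos_label=1)
print(search.best_params_, round(test_f1, 3))
```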
i was using grid search's scorer f1 function
right, it lets you pass a string for convenience
somewhere in the docs (i forget where) it says which string corresponds to which function
'f1' is probably the same as yours
anyway, tried splitting down to 95% 5% which is 870 class1's
some more training data for it
i suppose it wud be bad luck if its like, given the 800 samples of the 18k that are the hardest to predict
: )
guess thats why i need to do cv, ill do it on the train set
Can anyone here help me with a code I've been working on...I am having difficulty calling a function into another function. lmk if anyone's interested, i'll give detailed explanation in PM.
help is appreciated!!
Don't ask people to go to your DMs. No one wants to wait for you to type out what your real question is to find out if it's something they can/want to answer.
oh okay...thanks!
hey, if possible i would like some guidance regarding starting my journey in data science….If anyone can help me where i should start from, what steps to take on first and where do i learn it from? thank you.
keep in mind that it's mostly math, and data scientist positions are very difficult to get without a degree. that said, this is the book I recommend: https://www.oreilly.com/library/view/data-science-from/9781492041122/
thank you so much, i do have a degree in IT but it doesnt take me to a direction i want to go in
if you work on a book like this for a while and feel confident that this is the way to go, a boot camp might help you bridge the gap. but it's very difficult to get a first job in data science, even with a CS degree.
so, you are suggesting doing MS in data science first is a better approach?
yes, that would make it quite likely that you'd find a position.
though it might be a CS degree, and you'd have to look into ways to make it data science-focused. there usually aren't data science degrees.
Whats AI ?
anything that has logic pretty much
could be an if statement
That's too general.
Good question...
Programs that emulate the application of knowledge.
There's no formal definition, but the one you proposed captures so many things that aren't AI that it shouldn't be considered
I have yet to see one of these diagrams that works fully.
I guess you could say AI is more about the goal, what you want the end result to be, and not so much about what exact methods being used / what it is right now (to some limit, it has to do enough for most to consider it AI, so some lower bound of things).
And that goal is often mimicking some part of human intelligence.
Or not even human intelligence.
Like maybe how bees map out their territory.
Though the very general idea of AI is often dropped to something more specific to be on the same page and be productive.
The problem is that "an if statement" does probably not reach that lower bound for most people. It does not say enough.
Hello, does anyone here have any knowledge about the Speech Recognition module? I have a question 
Hi, I have a problem for imbalanced data. How to determine the equation for class_weight with classes 0 or 1?
When a ML model "learns," how do I know what exactly is it doing to learn? What mathematical functions does it use?
look up a tutorial on gradient descent to have a basic idea or take on a course like Andrew Ng's on Coursera
if you just want to know "what it does", it's not that complex apart from the math behind the derivatives, but if you want to know "why it works", especially for some of the most complex tasks... I'm not sure anyone in the world can answer that adequately
probably yes and I'm just exaggerating, but still
Ok, thank you. I looked up activation functions a bit and it's making more sense.
I have another question, so based on my understanding a neural network basically creates a mathematical predictive function that maps x to y. How can I see the function it has created? If I can't see it, can I know the rough shape of it?
I recommend taking a look at simpler algorithms before looking into neural networks
(e.g., linear regression and decision trees)
but the "function" in neural networks is essentially just multiplying all the previous layer's nodes by the connection weights, then adding it all together (for each node in the next layer)
oh, I almost forgot - then yeeting it into the activation function
but there must be a final function that it ends up with at the end after training, right?
that it would run future predictions on
I'm not sure if the libraries ever compile it into a single function or just stores all the weights then runs it layer by layer
my guess is that it has to store and run layer by layer because of the activation functions though
(otherwise, the final function would get monstrous really, really fast)
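the "multiply by weights, add up, then apply the activation" description can be sketched in a few lines of numpy. the weights below are hand-picked toy numbers, not a trained model, and the layer sizes are arbitrary:

```python
import numpy as np


def relu(z):
    """Activation: negative values become 0."""
    return np.maximum(z, 0.0)


def identity(z):
    """No activation, used here for the output layer."""
    return z


def forward(x, layers):
    """Run the input through each (weights, bias, activation) in turn."""
    a = x
    for W, b, activation in layers:
        a = activation(W @ a + b)  # weighted sum per node, then activation
    return a


# Toy network: 2 inputs -> 2 hidden nodes (ReLU) -> 1 output.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.5])
layers = [(W1, b1, relu), (W2, b2, identity)]

out = forward(np.array([3.0, 1.0]), layers)
print(out)
```

the "final function" after training is exactly this `forward` with the learned numbers in `layers`; it is stored as weights and evaluated layer by layer rather than expanded into one formula.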
but yeah, like I suggested before: If you want to understand them properly, do take a course on it.
I don't understand them all that well either, as you may have already guessed
I've decided to read a book called Neural Networks from Scratch, where we build a NN from raw Python
After training there are no more changes or it will still be considered to be training.
Right, but what if we want to store the function and want to make some tweaks manually in the future(even though they might be worse than retraining)?
ah, francis explained it a bit better in #python-discussion
You can.
How can I access the function in order to do that?
you would just about never tweak a neural network manually and I kinda doubt that the libraries provide the means to do so
fine tuning is a thing though
If you wrote the code, you have it.
If I use a pytorch library, I don't think I'll be able to see it
Yeah, probably not. But I'm asking this to improve my understanding of NN's.
the model is not one single function.
It's a bunch of weights.
you can look up "pytorch visualise model weights" or more generally "pytorch model visualization"
Depends on what kinds of tweaks, libraries like Pytorch are designed for a more specific kind of neural network, and only allow specific types of modifications.
You can edit its source code though.
But i'm not sure what kind of modifications you are trying to do. If it's just altering some weights, yeah, that is what it already does.
Neural networks can be seen as just big functions that have fine-tuned parameters via some algorithm (all can if you abstract hard enough).
(The details of that function are what matter though, and it's why it's considered a neural network and not just any big function)
hi, so there are IoU, confidence threshold, precision, recall, AP and im just still confused about the part where mAP is based on IoU plus theres also part where some say precision-recall curve is weighted by threshold, in this graph is IoU and confidence threshold the same thing? thanks
Anyone know of a paper for image classification in which the researchers tried different training/validation ratio? So a split of 60/40,70/30,80/20 etc.
my rnn is actually converging pog
coders of reddit discord, where did you learn ai?
also, what modules do you recommend to learn b4 starting ai learning
Uni
Differential equation
uni?
I started off in the classic andrew ng coursera ml course back in spring break
now there's a new and improved version you can check out (it's free)
hmm?
why would you need DE here
an NN is just a massive composite and piecewise function
There are many ways to visualize what each layer is doing though
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html I found this quite interesting
Started in statistic classes and then suddenly found ourselves using AI.
hello, I'm trying to make a few google accounts and google asks me for a phone number
to get around this, i've tried MAC spoofing, changing IP's, multiple browsers, various timeframes, various passwords and username configurations and clearing the browser cache and the cache on my local cache on my machine.
how does google still know im the same user trying to make new accounts?
is this some sort of ai
detecting it
Unknown, but this sounds a little sketchy. It's not difficult to make more than one Google account, so whatever you're doing is out of the norm. As such we won't help as per our rule 5.
!rule 5
5. Do not provide or request help on projects that may break laws, breach terms of services, or are malicious or inappropriate.
Is there any document ai outside google cloud?
out of the norm? i like to use 1 gmail account for each service
I’ve got like 4 gmails
hi 👋I was playing around with tensorflow and was wondering if there was a better way for the following:
I have a dataset with a feature (tcp port) that can range from 0-65535 where only the exact value matters. Eg if the dataset feature has port 123 then it only has a meaning if the input variable is exactly 123, and not 124, 122, etc. From what ive seen you normally want a categorical/integer encoding for something like this but i have a feeling there should be a better way when dealing with 65k different possible values
Right now my model seems to heavily depend on that one feature as a big indicator, giving outputs that make no sense when looking at the other features
is there a way to implement a custom loss function using sklearn mathematical operations?
i use a MinMaxScaler on my y output (regression), and i'm wondering if there's a way to inverse_transform the rmse in order to find a "true" rmse and judge how accurate it is
nvm figured it out
i had to first enable run_eagerly=True in model.compile, then converted my y_true and y_pred to np arrays using .numpy(), then i just inputted them into the function
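for reference, a sketch of getting the rmse back into original units after the fact, outside the keras loss. it assumes the target was scaled with scikit-learn's `MinMaxScaler`; the numbers are invented. because min-max scaling is linear, rmse in original units is the scaled rmse times the target's range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical regression targets and "predictions" in scaled space.
y_true = np.array([[10.0], [20.0], [30.0], [50.0]])
scaler = MinMaxScaler().fit(y_true)
y_true_scaled = scaler.transform(y_true)
y_pred_scaled = y_true_scaled + np.array([[0.05], [-0.05], [0.05], [-0.05]])


def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


rmse_scaled = rmse(y_true_scaled, y_pred_scaled)

# Option 1: undo the scaling on the predictions, then compute RMSE.
y_pred = scaler.inverse_transform(y_pred_scaled)
rmse_original = rmse(y_true, y_pred)

# Option 2: multiply the scaled RMSE by (max - min) of the target.
shortcut = rmse_scaled * float(scaler.data_range_[0])

print(rmse_scaled, rmse_original, shortcut)
```

note that calling `inverse_transform` on the rmse value itself would also add the minimum back in, which is why the range-multiplication form is used for the shortcut.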
does anyone know how to fillna conditionally so that where ycol = 1 it fills with medians with that condition rather than column median
as one of my features would be better imputed this way for better predictive power
seems like a tricky pandas query
do it in two lines, perhaps. df.loc[some_condition & some_other_condition, 'somecol'] = some_val, and a second line for where the condition doesn't hold
otherwise, pandas should have an equivalent to numpy's where, which does exactly this
https://stackoverflow.com/questions/38579532/pandas-equivalent-of-np-where here are some examples
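a small sketch of the `where` route with toy data. the column names and thresholds are hypothetical. note that conditions on pandas columns are combined with `&` (each side in parentheses), not python's `and`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stroke": [1, 1, 0, 0, 1],
    "age": [70, 45, 30, 80, 65],
})

# Element-wise "and" of two boolean columns.
cond = (df["stroke"] == 1) & (df["age"] > 60)

# numpy version: pick "high" where cond holds, "other" elsewhere.
df["flag"] = np.where(cond, "high", "other")

# pandas version: keep values where cond holds, replace the rest.
flag2 = pd.Series("high", index=df.index).where(cond, "other")

print(df)
```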
so i can say impute median of x column using a median thats calculated only on a condition of another col
for example, my target class is binary, and i want to impute medians based on that rather than blanket impute a column
so the median xcol is say 130 where y=1 but 125 where y=0
so a missing value will check first, does ycol equal 1 or 0, then accordingly impute the median of all columns where that condition meets
the code you wrote works for finding median values, how do you put that into the fillna command
this would be without fillna, but rather making one of the conditions into isna
with fillna youd literally write the exact same code
before and after fillna
seemd to have worked
df['somecol'].loc[ some_condition & some_other_condition ]
.fillna(df['somecol'].loc[ some_condition & some_other_condition ] .median())
that's basically the same thing, and also requires two lines
you only need one condition in that case
seemed to have imputed the wrong number
or you could check out the pandas where that i shared above. any of these 3 work
show the code
combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(combined_train['bp'].loc[combined_train['stroke'] == 1].median(),inplace=True)
NaN 1
still have these
you're replacing the entries of column 'bp' corresponding to where the rows of the column 'stroke' are true, with the median of those same entries of bp
But it’s fillna so it shud do so where it’s missing based on where it isn’t
no
if you tell it to replace nans with the median of some nans, it will happily give you nans again
Ummmm
pandas' median and many other methods exclude NaN in the result
code is perhaps not working due to inplace=True
I’m thinking it’s easier to fillna with a float that I’ve calculated myself to be the median
at that point, you're perhaps modifying a copy
Nah it works otherwise on normal use
inplace=True is to be avoided 999 out of 1000 occasions :p
it's useful in very rare cases; so perhaps try re-assigning
what does combined_train['bp'].loc[combined_train['stroke'] == 1].median() yield?
then try what nahita says
sub_values = combined_train.loc[combined_train["stroke"] == 1, "bp"]
combined_train.loc[combined_train["stroke"] == 1, "bp"] = sub_values.fillna(sub_values.median())
problem is inplace=True...
the other two methods i mentioned earlier also get you there without using fillna and are pretty similar to this, so try those out too, if you like
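putting the pieces together on toy data: the `.loc[mask, col] = ...` form from the chat for one class, plus a `groupby(...).transform("median")` variant that handles every class in one go. the frame below is invented; only the column names `stroke` and `bp` come from the chat:

```python
import numpy as np
import pandas as pd

combined_train = pd.DataFrame({
    "stroke": [1, 1, 1, 0, 0, 0],
    "bp": [140.0, np.nan, 130.0, 120.0, np.nan, 126.0],
})

# Option 1: one class at a time, assigning back through a single .loc call
# (no chained indexing, no inplace=True).
mask = combined_train["stroke"] == 1
sub_values = combined_train.loc[mask, "bp"]
combined_train.loc[mask, "bp"] = sub_values.fillna(sub_values.median())

# Option 2: all classes at once. transform("median") returns, for every row,
# the median bp of that row's stroke group (NaNs are skipped by default).
group_medians = combined_train.groupby("stroke")["bp"].transform("median")
combined_train["bp"] = combined_train["bp"].fillna(group_medians)

print(combined_train)
```

this only makes sense on training data, where `stroke` is known for every row.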
A value is trying to be set on a copy of a slice from a DataFrame
his strategy gave this
that means combined_train was defined from another dataframe, possibly as a subset or something
like combined_train = other_df[...] idk
you need to chain .copy() at the end
when creating it?
combined_train = other_df[...].copy()
well it was created from X_train and y_train which is also a subset of another df
and on and on and on before that
problem?
combined_train = pd.concat([X_train, y_train], axis=1).copy()
? works ?
will you work with the original df or only these new ones? you could use copies if you don't need to modify the original
pd.concat would give a copy anyway; that warning is perhaps due to some other code that you didn't share
.copy didnt work
nah its from x_train and y_train from previous df, which in itself is likely from another df
its a long notebook im not going back and going thru everything
do you have this part, or something different? if so, how? can you share that? i guess it gives you the warning after this or similar operation
instead of sub values i just said the condition you made subvalues from =
same thing
combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(134.5,inplace=True)
this SHUD work wtf
that should never work really
the 1s didnt get filled
you have chained access...
combined_train['bp'].loc[combined_train['stroke'] == 1]=combined_train['bp'].loc[combined_train['stroke'] == 1].fillna(134.5) this worked
combined_train['bp'].loc[combined_train['stroke'] == 1] is unpreferred
combined_train.loc[combined_train['stroke'] == 1, "bp"] is preferred
so the weird thing is why this worked and the earlier didnt work
if this didn't work, we shouldn't be surprised as well
because of this.
well its done now and working on what i typed there
ok, sorry for the clutter
so now i have different medians filled in my train data to base imputing my tests on that are conditionally different
so the model easier predicts
maybe improve score
are you allowed to fillna of the test data conditionally or does it have to just be based off of a single column train value
it wud be cheating right? to impute test values where the test y is a certain value
is the best way to do this to impute based off of the enitre column median from train?
but isnt that leaking data in a way
its that you have seen and imputed on knowledge of test data
if you say 'impute test value with x value where testy=1
x value coming from training data
if you know that test1 is 1 or 0, isnt that cheating
testy*
hmm yeah, you mean to modify the values of the input based on the output? you don't wanna do that
it'll make your test results better than they would otherwise be, not representative of real data
so the entire load of work i just did, is just to make better training set, but for the test set id need to impute missing values based on a single training value
you can incorporate this behavior into the model with some kind of recursion or adding a variable that keeps track of the previous state. that would allow you to accumulate your predictions
wdym?
looking at the first prediction, seeing its y value and adjusting accordingly ? cant do that if theres any na in the first place
let's take a step back. you're calling x the input and y what you're trying to predict, yeah?
yes
ok. and you want to modify values of x based on what y is
well i want to boost my f1 score its rly low
```
              precision    recall  f1-score   support

           0       0.97      0.75      0.84     48436
           1       0.04      0.30      0.08      1806

    accuracy                           0.73     50242
   macro avg       0.50      0.53      0.46     50242
weighted avg       0.93      0.73      0.82     50242
```
for class1 its still bad
sure, but what you're doing right now is replacing NaNs in x, yeah?
i just did that with your guys help to make it better than just blanket imputing based off of the entire column but instead based off of the value of train_y
just yes or no lol
yes
ok. well, in practice the values of y will not be available. but is there any reason to believe the current value of y depends on the previous values of y?
no
how about the current x on the previous values of x?
no, its meant to be random
then this approach cannot be done in practice
well that isnt what i was doing
i was giving the model the assumption that any missing data from test is going to be different depending on its x or y in the first place
by imputing different medians before training and evaluating
based on training sets x and y
you're still trying to modify x based on y
if the average bp of someone with a stroke is higher than someone without a stroke in the training set, it makes sense to impute conditionally so that those with a stroke have higher bp value in the training data, and this isnt cheating because.. its training data
then the model will 'see' someone with that higher blood pressure and may tend towards putting it in stroke=1 category, which may improve accuracy
make sense?
yea that is what all the fuss was about
oh, I have y for everything
I only used data where y existed
yes but when you go out and actually use the network, y is not available
y is either 'the source said this person was on the record for stroke' or wasnt
or are you not trying to predict y from x?
I am
these are two inputs?
if you want to predict y from x in real data, you are given x and y is unknown
yes
so what do you plan on doing with the nans then
its based off of the training set
so this new unseen data will use the training sets values
to impute nans
same as how my test set has also done
which isnt conditional that is just based on entire column value
so the training nans in x are computed from the training y, and then the test x nans are computed from the training x values?
weird, my grid search cv is now saying 0.95 scores for these tests.... that is odd
yes
all right. i have to say it's a bit weird when you also said the values of x are unrelated to each other
idk what u meant by that. but i had to do it this way to prevent leakage while also improving the model
but you're using something like a population average, so that's okish, not great
what else can i do?
if you're computing just an average value, may as well look at statistics people have gathered in larger populations so that you get a better estimate of the median or mean or whatever you're using
this is odd, how my grid search scores are 0.97 now, what is happening? do you think its now seeing x values and being like 'well ill just guess y being this' and getting them all right due to the distribution of the x values
thats strange, shudnt score that well using this
its surpassed 0.99 train and test in grid search cv on my training set
!!!!
how the hell has this happened
oh i know how
hmmm. i oversampled so actually still shudnt have happened
somehow must still be guessing only one class to a massively high f1 and just disregarding the other class
optimising on f1
@wooden sail do u know why my f1 for class0 is 0.9 and class1 is only 0.1
what's your cost function
RF criterion?
entropy
im using random forest and xgb
[CV 4/10; 6/32] END bootstrap=True, criterion=entropy, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100, n_jobs=-1;, score=(train=0.997, test=0.993) total time= 1.8s this is on the training set btw
the training sets oversampled to balance
i did SMOTE after one-hot encoding also
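one possible explanation for cv scores near 0.99 that don't carry over to the test set: if the oversampling happened before the cross-validation split, copies or synthetic neighbours of the same rare rows can land in both the training and validation folds. a hedged sketch of the alternative, oversampling only inside each training fold. it assumes scikit-learn, uses synthetic data, and swaps in plain random duplication instead of SMOTE to stay dependency-free; `oversample_minority` is a hypothetical helper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random class-1 rows until both classes have equal counts."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]


X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_f1, fold_val_rates = [], []
for train_idx, val_idx in cv.split(X, y):
    # Balance the training fold only; the validation fold keeps its
    # original class ratio, like real unseen data would.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], rng)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
    preds = clf.predict(X[val_idx])
    fold_f1.append(f1_score(y[val_idx], preds, pos_label=1, zero_division=0))
    fold_val_rates.append(float(y[val_idx].mean()))

print("class-1 f1 per fold:", [round(s, 3) for s in fold_f1])
print("rare-class share in validation folds:", fold_val_rates)
```

scores measured this way are usually much closer to what the untouched test set will show.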
if one of the categories is not very common, you could get it wrong always and still get a good performance by this metric
yep
didn't someone suggest using max f1 or something like that yesterday?
they said just f1
you could also make your own cost function where you average the results from the two categories, for instance
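a worked example of both points, using only the class supports from the classification report pasted earlier (48436 vs 1806). it scores a hypothetical model that always predicts class 0 and then averages the two per-class f1 values, which is what "macro" f1 does:

```python
support_0, support_1 = 48436, 1806  # class counts from the test report
total = support_0 + support_1


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical "always predict 0" model.
accuracy = support_0 / total

# Class 1: never predicted, so recall is 0 and precision is taken as 0.
f1_class1 = f1(0.0, 0.0)

# Class 0: every row is predicted 0, so recall is 1 and precision is its share.
f1_class0 = f1(support_0 / total, 1.0)

macro_f1 = (f1_class0 + f1_class1) / 2

print(f"accuracy={accuracy:.3f}  f1_class0={f1_class0:.3f}  "
      f"f1_class1={f1_class1:.3f}  macro_f1={macro_f1:.3f}")
```

accuracy and class-0 f1 look excellent while the model is useless for class 1; the macro average drops to about 0.49, so it does penalize ignoring the rare class.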
also, random forest isnt supposed to take onehot encoded columns right?
if thats the case, it wudnt be possible to use smote
as it i believe requires you to encode
it can, but it's probably not the most efficient at dealing with the sparsity involved
whats better model?
you could use smote and cast back to the original categories
how wud i code that
how?
how did you do the encoding? just do that backwards
this depends on which library you're using for the training
should be able to do it with numpy, then, i think
you use the median from the train data to impute missing values in the test data
you can undo the one-hot encoding afterwards
whats the best way to do this
i forgot the syntax
and shud it be converted to string as i encoded on floats which were categories
should be inverse_transform
i shud also string(values) yes?
this is the first example on the sklearn website if you look up onehot encoding
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
do u think this is whats causing my random forest to do this
when training cross val seems to be fine (balanced)
in python, you can read csvs with the csv lib or pandas. there, you can specify the separator, which is a bar here instead of the more common comma
wud anyone know why my test and train scores are nan
how can i do this bro
after i went back and removed the onehot encoding part of the process and just used strings
because it shows like this
pandas.read_csv(file_name, sep='|')
can someone explain about this
i was looking thru some videos on youtube about ml, and i found one about deep learning. i find that very interesting. but it said that it has generations, how do i know that what im using rn is the generation i wanted? i cant just set a seed like minecraft does. so how do they recognize the generation? like theres no token of the generation that has certain data
its saying randomforest doesnt accept nans but i have no nans
help please 🥹
thank you soooo much brother
❤️
glad it worked
sklearn doesnt take strings
so how to deal with categoricals?
if u arent meant to encode for random forest
how about this file how to read from a specifc line because i don't want the data definitions ?
there's another parameter called skiprows. line numbers count from 0, and given an integer n it skips the first n lines, so skipping lines 0 to 8 means skiprows=9 (skiprows=8 skips lines 0 to 7). how about trying pandas.read_csv(file_name, sep='|', skiprows=9)
@desert oar i thinkit was you talking about this
thank you sooooo much ❤️
for further reference, you can check here if you like reading the docs https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html i've never used pandas myself, so i'm just digesting the docs for you 😛
besides skiprows, there is a parameter called comment that skips some lines depending on how they start. in this case it would be comment="#"
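a self-contained sketch of the two options, reading from an in-memory string instead of a real file. the file contents and column names are made up:

```python
import io

import pandas as pd

# Hypothetical bar-separated file with two definition lines on top.
raw = (
    "# data definitions\n"
    "# id: row identifier\n"
    "id|name|score\n"
    "1|ana|3.5\n"
    "2|bo|4.0\n"
)

# Skip a fixed number of lines from the top (here the first 2).
by_count = pd.read_csv(io.StringIO(raw), sep="|", skiprows=2)

# Or skip every line that starts with the comment character.
by_marker = pd.read_csv(io.StringIO(raw), sep="|", comment="#")

print(by_count)
```

`comment="#"` is the more robust choice when the number of definition lines can change between files.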
thank youu soo much brother ❤️
Cannot index with multidimensional key
anyone know why this suddenly started happening when it was working earlier
sns.boxplot(x=combined_train.loc[combined_train['stroke'] == 1, 'bp'])
Need help
being able to combine tables
I've so far used pandas to try and clean them
I want to try and restructure the second table to look like the first
anyone know why keras validation auc is always 0
but binary cross entropy seems to be ok
oh you have to type a function and not the string
no, ok that didnt fix it
262/262 [==============================] - 2s 9ms/step - loss: 0.2052 - auc: 0.9507 - val_loss: 0.2516 - val_auc: 0.0000e+00
how is that possible
use pandas melt function
hmm i have a note here that says you dont really listen to what i have to say
so...

but it makes 0 sense that training auc is 0.7 while val auc is 0
I'm trying to wrap my head around it
Because the years arent headers wouldnt it merge and say "year" repeatedly
Has anyone ever thought about implementing a feature store by using the entity component system (ECS) architecture?
I've been thinking about it a ton over the last few weeks and it seems like it'd be super nice
thats a very interesting idea

there are also a lot of out-of-the-box feature stores out there too
Yeah for sure
I've always wanted to make one as a SAAS product though
A (relational) database / datastore? ECS works well for games that want something like a database, you can even implement an ECS by using one (pretty much a subset of what a database does).
you need to split Month and Year into separate columns, and then do pivot_table
oh, you want to go the other direction. well, you can also do that with pivot_table
!docs pandas.pivot_table
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
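A small sketch of both directions mentioned above (melt to go from year headers to a long table, pivot_table to go back). The table and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame(
    {"country": ["A", "B"], "2020": [1.0, 3.0], "2021": [2.0, 4.0]}
)

# Wide -> long: the year headers become values in a "year" column.
long = wide.melt(id_vars="country", var_name="year", value_name="value")

# Long -> wide: the values of "year" become column headers again.
back = long.pivot_table(
    index="country", columns="year", values="value", aggfunc="first"
).reset_index()

print(long)
print(back)
```

After melt, "year" is an ordinary column with one row per country and year, so it does not repeat as a header; that only happens if the year strings stay in the column names.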
Well with a good feature store you want to be able to define whether a feature is computed when stored or at query time right? And you want graph dependencies on those as well. I'm curious if you could take the ECS architecture and get better performance and easy maintenance on a feature store/service
Maybe you could use an in memory ECS setup to cache the "hot" features and allow a regular relational/graph db to handle colder features if they aren't in the cache.
Databases already have caching and especially relational databases have pretty much always been ECS-like. They are already optimized (although still on-going as many databases adapted to changes in hardware such as SSDs, bigger caches / memory speed being the bottleneck, etc). Relational databases are not always the preferred choice, but if configured correctly they are pretty much already following the data-oriented-design principle.
A more specific datastore / database for a specific task, will as with most things, probably be faster.
So if you have some niche to hit, it could be worth it.
I think the most beneficial setup for an ECS + feature store would be allowing for an easy to use query + dsl setup to maintain the dependency graphs and query for mixtures of systems and component features
Yeah maybe, if you want to make it go for it, and see how it works out. I am very much for reinventing the wheel. Even if it ends up not being a better wheel you will have learned a lot.
I think it would be pretty cool to build a feature store/service on top of apache arrow + arrow flight + parquet
Yeah apache arrow or some other standard.
(As long as it does not hold back your design)
think arrow would be a nice base because you get immediate support for a super efficient storage format, get high performance vectorized compute (for basic operations at least), and you get a data transport protocol
The format is a pretty straight forward generic binary format. The speed comes from it just not being a mess so that you can write reasonably fast code. It does not prevent you from being fast I guess would be the way to put it.
More importantly, if multiple programs support it you get all the interop, but not with some janky format.
well apache arrow flight is built on top of gRPC + IPC
(So if someone wrote some fast parallel query stuff for it, you can just use it)
IPC via sockets on the same machine is silly, but if you plan on actually having it communicate with processes on different machines then it's fine.
anyone know why xgb over predicts class 1 while random forest over predicts class 0
does it?
probably something related to the way their data is distributed or just a coincidence, I highly doubt that this holds true in general for these model types
hey I have a dataset where some of the variables are numerical and are very skewed. Would you recommend Transformation(log/normal) OR Discretization? or mix of both?
Thanks :))
well it seems to have improved a lot simply by just dropping all na values rather than finding the most efficient way to impute, this way it isnt guessing a single class for every guess
now to somehow improve precision, i dont know when theres only 500 of each class in a binary prediction, originally had 500k rows
perhaps a good way to impute would be the mid-way between class means in the training set, but ive never done that before.
can someone recommend a tensorflow tutorial (not anything specific, i just wanna learn everything about it)
when tuning hyperparameters in keras_tuner, what is the difference between hyperband and bayesian optimization? and which one should i be using to find good hyperparameters the fastest?
here comes the fun part, so, what layers should i use for my text gen tensorflow Sequential model
what does the code under #normalize data do
what am i doing wrong?
Share whole thing.
Well it is normalizing both xtrain and xtest. So you first fit xtrain and transform it. Then using same model transform xtest as well.
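Roughly what that fit-then-transform pattern looks like, assuming the normalizer is something like scikit-learn's StandardScaler (the screenshot's actual code is not shown here, so the arrays and names below are made up).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three training rows and one test row, one feature.
x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data only.
x_train_scaled = scaler.fit_transform(x_train)

# transform reuses those training statistics; the test set is never fitted.
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.ravel())
print(x_test_scaled.ravel())
```

Calling fit_transform on the test set as well would leak test statistics into preprocessing, which is the same kind of leakage as computing an imputation median on the full data.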
I am trying to optimize 3 metrics, number of layers, number of neurons per layer, and the learning rate. Would you like to know the ranges of these values?
should i learn tensorflow b4 starting ml?
hey, so im using pytorch. Would there be any problem down the road if weights.grad from my model's layers produce gradients tensors but weights.grad_fn returns None
sooo, study math first?
please can someone help me, i really dont know where or what im doing wrong 😦
yes basically
you'd be surprised how much of ml is math in comparison to coding
though if you're already pretty good at math then you can dive in and study the more complex math stuff in parallel with learning the coding aspect of it
at least that's what I did
is this what you mean?
could be because your loss function isn't strongly correlated with accuracy
it looks like row['shares'] is None, but in data science python, you want missing values to be represented as NaN. A number times NaN will just give you another NaN, instead of erroring.
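A quick illustration of the difference, with a made-up column (the real data in the screenshot is not available, so the values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry; in a float Series, None is stored as NaN.
shares = pd.Series([10, None, 5], dtype="float64")
price = 2.5

# NaN propagates through arithmetic instead of raising.
value = shares * price
print(value.tolist())

# Plain Python None does not support arithmetic at all.
try:
    None * price
except TypeError as exc:
    print("TypeError:", exc)
```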
By the way, please don't ask people to read screenshots of text. actual text is easier for everyone.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
sorry, thank you for your help
hm?
binary cross entropy?
Any good book on linear algebra someone would recommend? If it is specific too machine learning that would be a bonus
Read through some books about general mathematics for ml, but it did not go deep enough into linear algebra
Also started on the book by bishop but it seemed kinda dense, and it is also quite old, so the chapter on NN is probably a bit deprecated
hi
gilbert strang's linalg is good
axler's linear algebra done right is great, but it's a lot more abstract. i don't think it even deals with matrices until rather late on, focusing on abstract vector spaces and linear transformations first
e.g. integration, differential operators, operations on polynomials and the like
I'll try the gilbert one, think I already downloaded it but haven't started on it yet
was there anything in particular you wanted to reinforce? the main point in strang's linalg is his so-called fundamental theorem of linalg, which relates the "4 fundamental subspaces" associated with matrices as linear maps
I just feel in general that I don't have a good grip on the mathematics behind some of the machine learning methods. So I want to get a good understanding of that before I move on to getting a better understanding of more complicated deep learning concepts.
hmmm it's very likely that linalg alone won't be enough to make you feel comfortable with ML methods. it's a great place to start though
I am also reading some stuff on probability and statistics
so, linalg and stats help you in defining the problems you want to solve, but not in solving them 😛
that's still a separate topic that comes after
Yeah sure
Imo my uni just did not spend enough time on the basics, and just started to move onto topics like deep learning and cnns without really explaining why that stuff works
are you in bsc? engineering?
Master AI
huh, then that's kinda bad
But it's a bit of a jack of all trades, we can choose a lot of our courses
But I tend to choose more of the computer vision and mathematics type stuff
But there is also cognitive ergonomics (like we had to design a UI for a satnav f.e.) which I don't care about a lot
In hindsight I probably should have chosen ML master, but I feel like I can supplement most of the stuff with self-study
that's certainly the goal of masters programs, to prepare you to be able to teach yourself what you need
guys I just want to ask but is Kaggle a good place to study all about data science and practice competitions in it for experience?
kaggle is pretty good, sure
it won't teach you everything, but it's a nice complement to your studies
Well in the end I'll at least have some relevant masters degree
I'm still pretty much a beginner in this field right now tbh
And hopefully enough knowledge to just gain some practical exp
and what should I go for next after Kaggle then
You could maybe even join a team for competitions @severe oriole
kaggle is nice due to the challenges and availability of data, but you should learn some theory that you can throw at them
should it be practicing with Tableau for data visualization
Hmm that sounds interesting. I'd love to try after I master all basics first then
And the theory you mean like statistical theory basics right
Yeah or just different types of machine learning models
Thanks a lot man. Let me search more before I come back here and ask again
Right now I just know 3 models from the intro: decision tree, random forest, and the validation one
Lemme find more abt this
Those 2 could be used on tabular data (like just numbers)
But you would also want to learn about multi-layer perceptrons, and stuff like CNNs (convolutional neural networks) for image data
Interesting. That's something new there for me
Now I'm even hyped more to finish this kaggle thing before diving into that xd
Well I would think this would come before trying to classify a lot of kaggle datasets
You would use these models on the data
Oh then it's better if I know them first right
Since they're models and knowing to use them helps me when I go and try those kaggle datasets is what you mean?
Yes, exactly
You could use kaggle datasets to try them out while you are learning about them though
Noted. Let me try monkeying with data around to see about that
I'll keep the update here then
axler is definitely good for a "2nd course in linear algebra" after you've gone through a "1st course" that builds strong intuition in R^n like strang's
i think it would help a lot if you showed your actual model-fitting and evaluation code
personally i've never seen output this pathological even on egregiously unbalanced datasets. to the point where i'd sooner suspect a bug in your code than a problem in your procedures
Where do I start learning AI ? any beginner friendly resources ?
fast.ai is good (even if you're not a beginner, it's a great catch-up to industry-standard ai/ml), but i'm not sure what the pre-requisites are
Alright, thank you!
@tropic matrix @lapis sequoia i've also had good results w/ "halving search" (which is available in scikit-learn) for such problems. for black-box optimization though, i suggest the Optuna library, and i suggest comparing both techniques.
One more question, is it effective to tune hyperparameters on just a portion of the dataset (if the dataset is large and can take hours per epoch)?
@desert oar
it can be, if you are confident that the portion of the dataset is representative of the full dataset, and that you aren't introducing excessive sparsity in the features or labels
Alright, ig i'll see if it ends up becoming effective
it's a very good technique for testing that your models actually work though
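One way to get a portion that stays representative in terms of labels is a stratified subsample. A sketch with synthetic data (the sizes and the 5% class rate are assumptions for illustration); it only preserves the label distribution, not rare feature values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large dataset: 1000 rows, 5% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 50 + [0] * 950)

# Keep 10% of the rows for tuning, preserving the class proportions.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

print(len(y_small), y.mean(), y_small.mean())
```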
thank you for the link i'll check it out now . I wanna get into ml and ai . They are in general the same thing right ?
kind of. "ai" in practice is more like "ml but with fancy window dressing". real "ai" is still something that mostly happens in university labs.
there is a philosophical argument to be made that ml is ai, but it's important to distinguish between the appearance of intelligence and actually having intelligence.
we had a big meeting in my division yesterday, and the AI director for my company said "no one can agree on what AI is, except that it's what you can't currently do"
I would recommend also Andrew Ng's machine learning course on Coursera, he starts from the basics to more advanced concepts.
What makes this very confusing is that they used to be considered the same thing (ML/AI). But then a rift formed between the two as there was a large push against statistical/probabilistic methods from the established symbolic AI community which was gate-keeping (they were considered the authority on AI at the time). This caused the "AI Winter" to happen because symbolic AI was going nowhere in terms of progress and all funding / effort to probabilistic methods and especially neural networks had been cut (at the universities in general, there still was much progress being made but it had no recognition). This went on until the performance of neural networks was too great to ignore, especially after CNNs completely blew all previous computer vision out of the water.
very good points. i think what's happened nowadays is that a distinction is made between "AGI or something close to it" and "automating tasks via computer w/ the user-facing perception of human intelligence"
Now I like to use ML as a term to avoid debate over what AI is and instead just work on something productive.
Because everyone has an opinion on what AI is. You can see the difference if you for example look what you can find on the AI subreddit vs the ML subreddit or anywhere else.
(One is productive in majority, the other would be productive in majority if it was not always spammed with debate that leads nowhere)
i think the other problem is marketing and media hype
really complicates terminology and makes me want to avoid calling anything "ai" as much as possible
Yeah, it has caused everyone to have an opinion on it by muddying the definition to make it accessible to all for opinions. You will hear an opinion from everyone on things that are easy to have an opinion on. I don't see everyone having an opinion on the merits of batched gradient descent vs not.
right, i will continue to let the "technologists" emptily debate on medium.com, while i go solve business problems
It's fine to do such debate, but at some point I just choose what matters to me (what do I find interesting / productive) and go do some work.
hello guys anyone available to help me with my project in a voice chanel for just 5 mins ?
i'm gonna let you solve business problems while i do research no one will ever read or care about 😌
You're both being productive IMO (research or business).
👍
guys i downloaded a compressed data in a "zst" format, do i need to extract the file in it or is supposed to be just the compressed zst file ?
from what I read you cannot open it directly, so yeah you need to extract to reach the files
Well we can't really help you with the information that you gave us. Could be a bug in your implementation
what does it mean to say "anomaly detection methods are quite sparse"?
does it mean there less relevant features compared to useless ones?
I would interpret it as the algorithms used. Like K-means clustering,
k nearest neighbour, hierarchical clustering
you mean diverse in feature?
^"sort of"
no as in I can't think of any more methods
sparse meant taking only highly pronounced features, thats what i wanted to know
Has anyone done Andrew Ngs Machine Learning Specialization that just came out?
im working on it
I did the classic one a while ago
it's beautiful
holy shit it just hit 100% accuracy
eeeeeee
it actually hit 100%, but it went down after
screw it, my 48% accuracy text model did better
it produced actual words, although they did not make sense
i want to string the python
maybe i just need a larger dataset
im just gonna train it on my all my text messages
is this an lstm?
is anyone good with tensorflow image classification btw
im having a hard time learning it
Yes
Yes
damn
I tried building my own rnn and lstm and they won't budge
No
not learning at all
the official docs look like cancer to mee
Same
import torch
from torch import nn

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim) -> None:
        super(LSTM, self).__init__()
        # one linear layer computes all four gates from [hidden_state, input]
        self.input_linear = nn.Linear(input_dim + hidden_dim, 4*hidden_dim)
        self.output_linear = nn.Linear(hidden_dim, input_dim)
        self.hidden_state = torch.zeros(hidden_dim)
        self.cell_state = torch.zeros(hidden_dim)

    def forward(self, input):
        # detaching on every call cuts the graph between time steps,
        # so gradients only flow through the current step
        self.hidden_state, self.cell_state = self.hidden_state.detach(), self.cell_state.detach()
        # l is the length of each gate
        l = len(self.hidden_state)
        concatenated = torch.cat((self.hidden_state, input), dim=0)
        temp = self.input_linear(concatenated)
        # input, forget and output gates use sigmoid; the candidate uses tanh
        i, f, o, g = (
            torch.sigmoid(temp[:l]),
            torch.sigmoid(temp[l:2*l]),
            torch.sigmoid(temp[2*l:3*l]),
            torch.tanh(temp[3*l:])
        )
        self.cell_state = f * self.cell_state + i * g
        self.hidden_state = o * torch.tanh(self.cell_state)
        output = self.output_linear(self.hidden_state)
        return output

    def reset_states(self):
        hidden_dim = len(self.hidden_state)
        self.hidden_state = torch.zeros(hidden_dim)
        self.cell_state = torch.zeros(hidden_dim)
what are you making btw
legit won't learn
I had to tweak with a bunch of stuff and rerun a lot
Text gen
ohhh
I wanted to perfect my first
I want to make hcaptcha classifier but cant
When I get my discord messages I’m gonna train it on them
!rule 5
5. Do not provide or request help on projects that may break laws, breach terms of services, or are malicious or inappropriate.
its not breaking law
I'm doing it as a fun project
it will be fun to learn tensorflow while trying to make it
That can be used to bypass captchas
I agree but thats not my intention
well I find that a captcha identifier will motivate me to learn AI completely But alright I'm good with the rules
lets keep that apart
I just need help with tensorflow
nothing seems understandable to me
I'mma try making it more than one block deep and see how it goes
maybe my model's capacity is currently too low to generate text
do you know tensorflow completely?
then would you mind helping me :d
no I don't know tensorflow that well
oh
because I use pytorch
oh cool
@dusty valve what encoding/decoding method did you use?
I did a bit of research and it seems my problem comes from a common problem known as text degeneration
which comes from using the naive encoding approach which I am using
is anyone willing to help me with tensorflow
For captcha classifying (and possibly bypassing)? no
not for that
I just want to learn image classification
to create stuffs like traffic signal detector etc
so can you help me with that
If you can't build a simple image classifier using tensorflow then you've definitely not done your homework
Go search a tutorial or better yet learn the underlying theory first
Thats if your goal is to learn
So I'm trying to make a confusion matrix for the validation set of my data, but the accuracy shown using model.evaluate() seems to be different than the accuracy shown in classification_report(), so the confusion matrix would already look weird. Anyone know what I'm doing wrong? The batch size for this specific goal was set to the amount of images in the data-set as some comments on stack overflow recommended to do that.
validation_dataset = image_dataset_from_directory(
    Path_to_images,
    image_size=(400, 400),
    validation_split=0.3,
    subset="validation",
    seed=2,
    batch_size=1801,
)
model = keras.models.load_model(Path_to_model)

# first method to get accuracy
val_loss, val_acc = model.evaluate(validation_dataset)
print(f"Validation accuracy: {val_acc:.3f}")  # prints acc 93%

# second method to get accuracy also to get confusion matrix
y_true = np.concatenate([y for x, y in validation_dataset], axis=0)
y_pred = model.predict(validation_dataset).argmax(axis=1)
print(classification_report(y_true, y_pred))  # prints acc 17%
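One possible cause, stated as an assumption since nobody confirmed it in the chat: image_dataset_from_directory shuffles by default, and a shuffled dataset is generally reshuffled on every pass. The labels above come from one pass and the predictions from another, so they may not line up, which would explain near-chance accuracy. A sketch that collects both in the same pass; collect_labels_and_predictions is a hypothetical helper, and the stub model only exists so the example runs without TensorFlow. It assumes integer labels and one output column per class.

```python
import numpy as np


def collect_labels_and_predictions(model, dataset):
    """Return (y_true, y_pred) gathered in a single pass over `dataset`."""
    y_true, y_pred = [], []
    for images, labels in dataset:
        # Predict on exactly the batch whose labels we are recording.
        probs = np.asarray(model.predict_on_batch(images))
        y_true.append(np.asarray(labels))
        y_pred.append(probs.argmax(axis=1))
    return np.concatenate(y_true), np.concatenate(y_pred)


class _StubModel:
    """Stand-in for a Keras model, used only to make the sketch runnable."""

    def predict_on_batch(self, images):
        # Pretend the first feature of each sample is the class-1 score.
        score = images[:, 0]
        return np.stack([1.0 - score, score], axis=1)


# Two fake batches of (images, labels).
batches = [
    (np.array([[0.9], [0.1]]), np.array([1, 0])),
    (np.array([[0.2], [0.8]]), np.array([0, 1])),
]
y_true, y_pred = collect_labels_and_predictions(_StubModel(), batches)
print((y_true == y_pred).mean())
```

With the real objects this would be called as collect_labels_and_predictions(model, validation_dataset), and the result passed to classification_report(y_true, y_pred).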
Hello smart Python people! I have some problems with the MediaPipe realtime pose tracker. So I can get the landmarks and everything, but the data is very noisy and practically unusable. Does anybody know any way of smoothing realtime landmark data?
hi, i was trying to make an image synthesis which uses clip, but i wanna run it on my cpu, i get this error:
RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
here is the code:
def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)
im using pytorch if thats not clear, any help would be appreciated
here is a similar issue: https://github.com/nerdyrodent/VQGAN-CLIP/issues/70
Bug I get RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half' when running this against my CPU. To reproduce $ python generate.py -p "A...
but i dont wanna comment it out tbh
hey, how can i show my df after i ran through a sklearn pipeline?
during the pipeline ? you can't
i would like to take a peak after preprocessing
just print it ?
where is the pb 😅
@tired matrix
Thanks
Anyone here have experience with ML in Python?
Question regarding CNN in Keras that I am building:
I have 10000 images and training a CNN but I added data augmentation but as soon as I did it is taking like 10 hours to train my model AND even if it does and I trained it on like 40 EPOCHS it reaches an efficiency of like 61%. Is there any way I can speed it up? I guess a easy fix would just be to increase epochs to like 100 and get higher efficiency cause longer training time and higher epochs but like that is going to take 2 full days. Am I being too impatient or what do you think? Thank you!
I am a beginner
Alright, so first of all, what is the model that you use, and is that "efficiency" the accuracy?
How many classes are there?
Like in my dataset all I have are 5000 images of "Yes it is a blowdryer" and 5000 images of "No it is not a blow dryer", 0 or 1
So like 2 classes?
And your accuracy stays at 0.5?
I mean it looks like it
Kinda with little to no improvment
That is as good as random guessing then
Yes
And this was after I added data augmentation
Before when I didn't have data augmentation which I heard was bad, my model was overfitting like crazy
Alright, so it might be that something horribly goes wrong then, because it is making basically nothing from your data right now
What did you get before data augmentation?
99%
But it was overfitting all the way to hell
training or validation?
training
training accuracy is only really useful to check if you are over/under-fitting
validation is what we care about
All I know is that the difference in val_loss and loss was like so large and the val_accuracy was staying the same
So it was overfitting
Yes but I need it to be like 95%+
`model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(4,4))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))
model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`
This is what it is right now
I added data augmentation and it became the results I just showed you
Before this was performing at like 90%+ but was overfitting like crazy
Why do you have sigmoid at the end, with 2 neurons?
I don't knwo
Well, you chose it 😛
I asked another person and they said that since all I have is 0 and 1 make it 2 so I did
Like Blow dryer or no blow dryer
Could you help me understand
What would u change it to?
Sigmoid in the final layer is mostly used when you want to check if a class is present or not
Ok gotcha
And each node then represents a class
Ok gotcha...
So if you use sigmoid with 2 classes, you basically want to have 1 neuron in the final layer
which is the blow dryer
OR you could use softmax with 2 neurons
model.add(Dense(1, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))
So like that?
Yes, but you might want to change your labels from [0 1] [1 0] to just [0] and [1]
I doubt this would change a lot about the accuracy, but this is the common way to do it
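To see why one sigmoid neuron and two softmax neurons are the same thing for a binary problem, and how the label change works, here is a NumPy sketch. The logits and labels are made-up numbers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical raw outputs of a 2-neuron layer for two samples.
logits = np.array([[0.3, 1.2], [2.0, -1.0]])

# Probability of class 1 from a 2-neuron softmax ...
p_softmax = softmax(logits)[:, 1]
# ... equals a single sigmoid applied to the difference of the two logits.
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(np.allclose(p_softmax, p_sigmoid))

# Converting one-hot labels like [1 0] / [0 1] to a single 0/1 column.
one_hot = np.array([[1, 0], [0, 1], [0, 1]])
single = one_hot.argmax(axis=1)
print(single)
```

In Keras terms (an assumption about the setup, since the generator code is not shown), the single-neuron version usually goes with class_mode="binary" in the generator and a binary_crossentropy loss.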
When looking at your model, it also seems that you start with a lot of filters, and you use maxpooling that decreases the size by a lot
Ok gotcha
I would try only using maxpooling with 2x2, and start with less filters, and have more filters for later layers
Ok gotcha
As there are less small patterns, and more complex patterns typically in an image
And later layers can extract more complex features from smaller features
yes
But this is mostly just gut feeling, from trying out stuff with datasets
`model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`
changed it to 2,2 now
O ye like my dataset has lots of colors just saying
So normally you want to increase the receptive field (each layer should be able to detect patterns from larger areas of the image)
Ok
And you want to increase the amount of filters further in the model
from this
How would I do this
Like what would you change in my model
Can you show me programatically
code snippet
The amount of filters for the conv2d layers
So increase that amount?
They are just parameters
incrementally over the layers yes
Later layers have more filters, earlier layers less
There isn't a set rule, this is just common sense considering there are typically not that many small patterns, and more large patterns
O ye I am also using the Adam optimizer
tf.keras.optimizers.Adam(learning_rate=0.01)
Yeah, could be okay
Just another hyper parameter you can tune
A lot of this is trial and error
Do you really think changing the maxpooling to (2,2) and changing the filter to that is going to make a difference?
You should try a lot of cross validation with different parameters to see what works best
How would I do that
It will make some difference probably yeah
Do you know k-fold cross validation?
sadly no
You should look it up then, you need it to find out what model gives good results
You can also just have training/validation/test split
But that means you would use the same data to train and validate on for each set of hyper parameters
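A bare-bones sketch of k-fold cross validation for comparing hyperparameter settings. It uses scikit-learn with a small logistic regression on synthetic data as a stand-in, since a CNN would be too slow for an example; the data and the C values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data as a stand-in for real features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 folds: each sample is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for C in (0.01, 1.0):  # two hypothetical hyperparameter settings
    model = LogisticRegression(C=C, max_iter=1000)
    results[C] = cross_val_score(model, X, y, cv=cv)
    print(C, results[C].mean())
```

For a Keras model the same idea would mean looping over cv.split(X, y) yourself and building a fresh model for each fold, which multiplies training time by the number of folds.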
Hmm ok
In any case, make sure to not use your test set while you are still working on your model
I would first just try it out on unaugmented data, because it seemed that gave some issues the last time
and when you seem to hit a wall, you can try that to decrease overfitting
You could also add a dropout layer between two dense layers that you have
And maybe add another dense layer, because you only have 2 right now
So ur saying don't use data augmentation cause its messing it up and do it without. And when it overfits add more dense layers or dropout
Is that bad for what I am doing right now
No, I would just add another dense layer, because 2 dense layers isn't often enough to predict the class using the convolutional layer features
And you can add a dropout layer to decrease overfitting
Ok I will do that
Dropout layer means that some activations (neuron outputs) get zeroed out randomly between two layers during training
This means the model can't just base its decision on a set of features, it will need to make use of all/more of them
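What a dropout layer does, sketched in NumPy. This is the common "inverted dropout" form (surviving values are scaled up by 1 / (1 - rate)), which is also how the Keras Dropout layer is documented to behave; the function name and shapes are made up for the example.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of the values and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time dropout is a no-op.
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)


rng = np.random.default_rng(0)
x = np.ones((4, 8))

train_out = dropout(x, 0.5, rng)                 # values are 0.0 or 2.0
eval_out = dropout(x, 0.5, rng, training=False)  # unchanged

print(train_out)
print(np.array_equal(eval_out, x))
```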
yeah nw
I will ask again. Right now I will take all ur advice and then test it all out. And then come back again. Thanks again!
Gonna first try without augmentation
So I would add another one like this:
model.add(Dense(32))
But how many neurons would I have? Still 32?
try a number between the previous layer and the next one
somewhere between 32 and the size of the output
Hmm ok
that's where you're putting the dense layer, between the 32 one and the output, right?
model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))
So like this? I added the model.add(Dense(16))
inbetween
sure
Gotcha thanks
will update on the progress. i have a bad feeling its going to overfitt..again
You haven't added a dropout inbetween, which may also help
And still using sigmoid with 2 neurons for a binary prediction problem
ah yeah you can boil that down to a single neuron output. but i'd say it's good to change one thing at a time and see what improvements you get
I am planning on changing it one thing at a time and then updating my progress in this chat. Thank you so much @mild dirge
And thank you @wooden sail for pitching in
Ok wait are you seeing this:
Again, it is starting to overfit
This is without data augmentation and my model looks like this:
`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`
After that runs I will add another dense layer and dropout
but both the acc and val acc are increasing together, i wouldn't say that's overfitting, at least not all that bad
Ok but my graph when I graph it out looks really messed up
can you show it?
Oh no I will show it after it runs completely
but I have done it before and played around and the graph does not look nice. Like the val_loss and loss difference is too great
eventually
@wooden sail the difference is getting very large
loss and val_loss
well, i've certainly seen worse models lol
Ok but I want this model to be good, not just better than worse
the loss is not normalized, so the raw number doesn't mean much to me
Are you sure?
@mild dirge can u look at the image too
Cause I was told it was overfitting
if the loss has a dynamic range of thousands, for example, this is a tiny percentual difference
Not sure what you mean
Edd probably knows this stuff better than I do lol
Why do you have the validation accuracy rounded to 1 decimal
Is it rounded, or floored?
It isn't
It is cut off and continues on the next line
Look at the next line, it's 0.8272 or whatnot
Well it is overfitting, but not by that much
let it get to like 10 or 15 epochs and if you're not happy with the performance, add in the dropout that pccamel recommended
Ok so next step looking at the current model:
`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Flatten())
model.add(Dense(32))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`
Is doing the following good:
I will add dropout, another dense layer in between, and decrease epochs to 10 or 15?
sounds ok
but really, validation accuracy is normalized, but loss isn't
looking at the loss alone is not always enough to say what the performance is like. the validation was looking ok (even though it could be better)
Not sure if an even kernel size is useful/common
Also, your first flattened layer has 65,000 ish neurons
So that won't help with the overfitting either, I'd think
yeah i was thinking of that as well, regarding the size of the flattened layer. maybe use like dense 2048, 512, 128, 32, 1 or something like that. with dropout before the first dense and maybe between some of the other dense layers
as for the even kernel, it's fine, but it's not symmetric, which might introduce some unexpected or difficult to interpret behaviors
It's probably better to add another conv+maxpool
Or maybe a convolutional layer with a stride higher than 1
Ok so should I implement these changes?
The biggest problem right now is having 2 million weights between two of your dense layers
That would be the first thing to fix
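To see where the "2 million" comes from, here is a back-of-the-envelope sketch of the shape arithmetic for the model posted above. It assumes the Keras defaults the posted code relies on (stride 1, `padding="valid"` unless stated, 2x2 max pooling that floors odd sizes); the helper names are made up for this example.

```python
def conv_out(size: int, kernel: int, padding: str) -> int:
    """Spatial size after a stride-1 convolution."""
    # "same" keeps the size; "valid" trims kernel - 1 pixels.
    return size if padding == "same" else size - kernel + 1


def pool_out(size: int, pool: int = 2) -> int:
    """Spatial size after non-overlapping max pooling (odd sizes are floored)."""
    return size // pool


# (filters, kernel_size, padding) for each Conv2D + MaxPooling2D pair in the posted model.
conv_blocks = [(16, 4, "valid"), (32, 3, "same"), (64, 3, "same")]

size = 256  # input images are 256x256
filters = 3
for filters, kernel, padding in conv_blocks:
    size = pool_out(conv_out(size, kernel, padding))

flat = size * size * filters                # length of the Flatten() output
dense_units = 32
weights = flat * dense_units + dense_units  # weights plus biases of Dense(32)

print(size, flat, weights)  # 31 61504 1968160
```

So the flattened vector has 61,504 values (the "65,000 ish" mentioned above), and almost all of the model's parameters sit in that one connection into `Dense(32)`.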
How can I fix that
I know I am asking a lot but would it be ok to hop on a quick call
I will share my screen
If not that's completely ok
what
I replied to a message with a tip to reduce the number of neurons
I can't go on voice right now
Oh ok
try adding another conv with max pool, but don't increase the number of filters anymore
that'll cut the size of the output roughly in half before going into the dense layers
and put a dropout there, too
it will be cut into a quarter
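A quick check of the "quarter" figure, under the same assumptions as the posted model (one more `Conv2D(64, padding="same")` followed by a 2x2 max pool, starting from the 31x31x64 feature map the current model ends with):

```python
filters = 64
side_before = 31  # spatial size going into Flatten() in the current model
before = side_before * side_before * filters

# The extra conv with padding="same" keeps 31x31; the extra 2x2 pool floors it to 15x15.
side_after = side_before // 2
after = side_after * side_after * filters

ratio = after / before
dense_units = 32
weights_after = after * dense_units + dense_units  # Dense(32) weights plus biases

print(before, after, round(ratio, 3), weights_after)  # 61504 14400 0.234 460832
```

Pooling halves each of the two spatial dimensions, so the flattened size drops to roughly a quarter, and the first dense layer shrinks by the same factor.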
`model.add(Conv2D(16, kernel_size=4, activation="relu", input_shape = (256, 256, 3)))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(Dropout(0.2))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(Dropout(0.2))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(64, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(2,2))
model.add(Flatten())
model.add(Dense(64))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(Dropout(0.2))
model.add(Dense(train_generator.num_classes, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(0.0001)))`
Ok so like this?
i take back my 2048 recommendation, that'll make the problem monstrous lol
try 64 or keep it at 32 as you already had it, i had neglected the size of the conv output earlier
yeah
all i did was
`vocab = ''.join(sorted(set(text)))
char2int = {c: i for i, c in enumerate(vocab)}
int2char = {i: c for i, c in enumerate(vocab)}`
tbh that .join was not necessary
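One way to sanity-check a character mapping like this is a round trip: encode a string to integers, decode it back, and compare. A minimal sketch, with a stand-in `text` since the real one isn't shown in the chat:

```python
text = "hello world"  # stand-in sample; the real text is not shown here

vocab = sorted(set(text))  # a sorted list of unique characters is enough, no join needed
char2int = {c: i for i, c in enumerate(vocab)}
int2char = {i: c for c, i in char2int.items()}

encoded = [char2int[c] for c in text]
decoded = "".join(int2char[i] for i in encoded)

print(encoded)
print(decoded == text)
```

If the round trip fails on the real data, the mismatch is likely somewhere else, for example in how the file was read or in a vocabulary built from a different text than the one being encoded.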
yes something is up with my encoding
Can someone help me? I don't know where I went wrong. I imported the correct file but it's still showing a "no module" error! Any fixes?
You should try adding a dot before Backend in line 1
i need a pseudo code for:
if column[a] has values: X, Y, Z then create a new column with values that say “Alphabet” else impute values “not alphabet”
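A possible pandas version of that pseudocode, reading it as a row-by-row check (each row gets "Alphabet" if its value in column `a` is X, Y or Z, otherwise "not alphabet"). The DataFrame and the new column name `kind` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name "a" comes from the question.
df = pd.DataFrame({"a": ["X", "Q", "Z", "Y", "7"]})

# isin() gives a boolean mask; np.where picks one label or the other per row.
df["kind"] = np.where(df["a"].isin(["X", "Y", "Z"]), "Alphabet", "not alphabet")

print(df)
```

Missing values in `a` are not in the list, so they would end up as "not alphabet" with this approach.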
-
please don't post screenshots of your IDE. explain your problem in words and post your error output & file structure as text using a code block.
-
this isn't a data science question. please carefully read #❓|how-to-get-help.
Uhm sorry by the way !!! I will delete it
How can I, after something like df.groupby("username"), apply a custom function to each group that collapses each group into one row?
I thought .groupby().apply did that, but it seems to be called on each row, not on each group.
how about df.loc[some_condition_on_some_col, some_other_col].apply()? though i guess that requires repeating the process several times, hmm
Not sure how that'd help - yeah, I want to collapse every group (all rows with the same username) into one row.
you want to actually collapse it or just apply the same func to all of the ones with the same username?
Actually collapse. So I want groupby to call my function with an entire group at a time, and it'd return one row per group.
maybe agg? but i'm out of my depth in this one
.apply is capable of accepting a function returning 1 row as well as .agg
but the thing is: there might be a better way without these if the function is not so customized
e.g., df.groupby("item").agg(lambda gr: gr.iloc[0]) should reduce the dataframe to length being number of unique items, and the rows will be the first rows of each group
.apply will work as well, except it will retain the grouper column as a column in the result, too.
The issue is that it seems to get called on each row of each group rather than each group
need to prove it :p
as for aggregate, hmm
it can call your function for the first group twice to seek a fast path if possible, but no, it won't call it per row of groups; it will do for the entire group.
With aggregate, the issue seems to be that it works separately on each column
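Here is a small sketch of collapsing each group into one row with `.apply`, where the function sees the whole group (all selected columns at once) and returns a `Series`. The column names other than `username` and the `summarize` function are made up for this example. Selecting the value columns explicitly before `.apply` sidesteps version differences in whether the grouping column is passed to the function.

```python
import pandas as pd

# Hypothetical data; only "username" comes from the question.
df = pd.DataFrame({
    "username": ["ann", "bob", "ann", "bob", "cy"],
    "score": [3, 5, 4, 1, 2],
    "minutes": [10, 20, 30, 40, 50],
})


def summarize(group: pd.DataFrame) -> pd.Series:
    """Called once per username with all of that user's rows."""
    return pd.Series({
        "n_rows": len(group),
        "total_score": group["score"].sum(),
        # Needs two columns at once, which a per-column agg cannot do.
        "score_per_minute": group["score"].sum() / group["minutes"].sum(),
    })


result = df.groupby("username")[["score", "minutes"]].apply(summarize)
print(result)
```

The result has one row per username, indexed by username. Putting a `print(type(group), group.shape)` inside `summarize` is a quick way to check what the function is actually being called with.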
this is some sort of magic, huh

