#playground-series-s3e21


broken yew
small coyote
broken yew
#

I’m a little confused

#

Why are we setting some variables to all 0s

#

The hidden test dataset may not have all 0s in that same variable (it is not given)

steady nymph
#

After evaluating the feature importance, many variables have little or no impact on the model's predictions. We'd typically remove these variables from the model. Setting a column to zero has the same effect (i.e., the Random Forest will never choose that column for a split).
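
A minimal sketch of the point above, on synthetic data (the data, column index and model settings are assumptions for illustration, not the competition setup). A column that is constant in training offers no valid split, so its importance should come out as zero:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 4 features, target driven by the first one only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Zero out one column instead of dropping it.
X_zeroed = X.copy()
X_zeroed[:, 3] = 0.0

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_zeroed, y)

# A constant column can never separate samples, so no tree splits on it.
zeroed_importance = rf.feature_importances_[3]
print("importance of zeroed column:", zeroed_importance)
```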

broken yew
#

BUT setting the ID column to zero has no impact on the score at all

broken yew
#

If anything I think it makes the model worse due to data drift

#

When you train a model with 0s on those columns and test it on dataset with non zero values

#

In past playgrounds with LightGBM/CatBoost adding ID did worsen the score

twilit shore
#

I didn't even think to remove the id column ngfl

#

now you point it out it is probably best that it's removed

broken yew
#

hmm I think I understand why setting variables to 0s works now, I forgot you don’t construct the decision trees during the testing phase
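
A small sketch of that realisation, again on synthetic data (all names and sizes here are illustrative assumptions). The forest is trained with one column zeroed, then asked to predict on test rows where that column is non-zero; since no tree ever split on it, the predictions should not depend on it:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=200)
X_train[:, 2] = 0.0  # column zeroed in training only

X_test = rng.normal(size=(50, 3))  # column 2 still holds non-zero values
X_test_zeroed = X_test.copy()
X_test_zeroed[:, 2] = 0.0

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# The trees are fixed after training and none of them splits on column 2,
# so its test-time values cannot change the path a row takes.
same_predictions = np.array_equal(rf.predict(X_test), rf.predict(X_test_zeroed))
print("predictions identical:", same_predictions)
```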

#

but that doesn’t explain why setting ID to 0 or not has no impact on the public score at all

broken yew
#

Evaluation metric changed catshock

steady nymph
shut osprey
#

What about other feature selection techniques? Can I use them on top of feature importance?
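
One option that builds on feature importance is recursive feature elimination with cross-validation (scikit-learn's `RFECV`), which repeatedly drops the least important feature and keeps the subset with the best CV score. A rough sketch on synthetic data follows; the estimator, fold count and scoring choice are assumptions, not anything specific to this competition:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

# Synthetic data: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=150)

selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    step=1,  # remove one feature per elimination round
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
selector.fit(X, y)

# Indices of the columns that RFECV decided to keep.
selected = np.flatnonzero(selector.support_)
print("selected feature indices:", selected)
```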

broken yew
#

It's the internal working of random forest

broken yew
#

Why are people doing data clipping though 🤔

#

thresholding

tidal shale
#

Why do all of my submissions fail? They show "Evaluation metric raised an unexpected error". How do I fix it? Thanks

finite oar
indigo zinc
agile sphinx
agile sphinx
#

Ah, I see that makes sense

orchid wagon
#

I don't think clipping the target will work

broken yew
orchid wagon
#

😭

shut osprey
#

submission['target'].clip(5.01,11.69)
and I combined it with RFCV features. It gave me a score of 1.55, which is not a good score but decent.
Without target clipping it gave a score of 1.39 with the same RFCV features.
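
A note on the snippet above: `Series.clip` returns a new Series rather than modifying the column in place, so the result has to be assigned back for the clipping to reach the submission file. A minimal sketch with a made-up `submission` frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for the submission frame.
submission = pd.DataFrame({"id": [0, 1, 2, 3], "target": [3.2, 7.5, 12.4, 9.9]})

# clip() returns a copy, so assign it back to the column.
submission["target"] = submission["target"].clip(lower=5.01, upper=11.69)
print(submission)
```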

broken yew
#

Clipping the target makes my local CV higher actually

#

The data is too noisy to predict with random forest. 3500 is a very small dataset

#

Dropping rows makes it even smaller (3000)

steady nymph
broken yew
#

Wait, I think I shouldn't do that, you are right

#

I clipped outside of the CV loop
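
One reading of the problem: if the target is clipped for the whole dataset before splitting, the validation folds are clipped too, so the CV score is measured against altered targets. A hedged sketch of doing the clipping inside the loop, on synthetic data (the model, bounds and data are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 8.0 + X[:, 0] + rng.normal(scale=1.5, size=300)

fold_rmse = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X):
    # Clip only the training target; the validation target stays untouched.
    y_train = np.clip(y[train_idx], 5.01, 11.69)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y_train)
    preds = model.predict(X[val_idx])
    # Score against the original, unclipped validation values.
    fold_rmse.append(float(np.sqrt(mean_squared_error(y[val_idx], preds))))

print(f"CV RMSE: {np.mean(fold_rmse):.3f} ± {np.std(fold_rmse):.3f}")
```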

shut osprey
#

How do I test the dataset after feature selection and clipping the target variable?

broken yew
shut osprey
#

@broken yew 👍

jade adder
#

Who all are active in this competition, well I just got free from my college university exams and practicals...

Anybody wanna team up!!

Till now I only created a base model notebook...

Gonna speed up on this soon✨✨

Me on kaggle: https://www.kaggle.com/manavgupta92

tulip hill
#

What area of ml/ai are you interested in?

#

I'm interested in doing competitions but my school's about to start and I'm not sure I will have the time

shut osprey
#

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

data_cluster = sample.drop(['id'], axis=1)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cluster)

params_grid2 = {
    'eps': [np.round(float(x), 1) for x in np.linspace(0.1, 1.0, 10)],
    'min_samples': [int(x) for x in np.linspace(5, 50, 20)]
}

db = DBSCAN()

grid_search = GridSearchCV(db, params_grid2, cv=5, scoring=silhouette_score, verbose=1)
grid_search.fit(scaled_data)

best_params3 = grid_search.best_params_

db_best = DBSCAN(**best_params3)
array_of_cluster = db_best.fit_predict(scaled_data)

print("Cluster Labels:", array_of_cluster)

I am getting array_of_cluster with the whole array filled with -1.
I'm trying to remove the noise in the dataset.
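
A possible explanation, offered as a guess: `DBSCAN` has no `predict` method and `silhouette_score` does not have the `(estimator, X, y)` signature that a `scoring` callable needs, so `GridSearchCV` is unlikely to rank the parameters meaningfully here. Separately, on standardized data with many columns, `eps` values up to 1.0 may simply be too small, which would mark every row as noise (-1). A sketch of a manual search on synthetic blobs (the data and the parameter grid are assumptions) that skips settings with fewer than two clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: three well-separated 2D blobs.
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [0, 0], [5, 5]], cluster_std=0.5, random_state=0
)
scaled = StandardScaler().fit_transform(X)

best_score, best_params = -1.0, None
for eps in np.round(np.linspace(0.1, 1.0, 10), 1):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=float(eps), min_samples=min_samples).fit_predict(scaled)
        mask = labels != -1  # ignore noise points when scoring
        if len(set(labels[mask])) < 2:
            continue  # silhouette needs at least two clusters
        score = silhouette_score(scaled[mask], labels[mask])
        if score > best_score:
            best_score = score
            best_params = {"eps": float(eps), "min_samples": min_samples}

print("best params:", best_params, "silhouette:", round(best_score, 3))
```

If `best_params` stays `None` on real data, every setting produced fewer than two clusters, which suggests widening the `eps` range.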

rustic sierra
#

How can we test the optimal size of a submission? All my submissions with fewer than 3000 rows have poor results, but subs with 3000+ rows are definitely too noisy... A bit confused about how to proceed further. Tried a lot of stuff but stuck at 1.35

scarlet orchid
#

Subs with 3000+ rows def aren't as noisy as you'd think. I was able to get 1.31382 on public LB with 3497 rows

rustic sierra
shut osprey
#

I am tuning Isolation Forest and there is an attribute called "max_features".
What values should I consider?
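
For context, `max_features` in scikit-learn's `IsolationForest` sets how many features each tree draws (an int count, or a float fraction of the columns; the default is 1.0). There is no universally right value, so one hedged approach is to try a few fractions and compare which rows get flagged. The data, `contamination` and candidate values below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data with a handful of shifted rows acting as outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[:10] += 6.0

outlier_counts = {}
for max_features in (0.25, 0.5, 1.0):
    iso = IsolationForest(
        n_estimators=100,
        max_features=max_features,  # fraction of columns drawn per tree
        contamination=0.02,         # assumed share of outliers
        random_state=0,
    )
    labels = iso.fit_predict(X)     # -1 marks an outlier, 1 an inlier
    outlier_counts[max_features] = int((labels == -1).sum())

print(outlier_counts)
```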

scarlet orchid
# rustic sierra And your current 1.27719? (if it's not a secret)

3048 rows, although this leads to a high CV score of 1.32308 ± 0.58653, so I am not confident that this will fare well in the private LB. My lowest submitted CV score of 1.14431 ± 0.34609 got a score of 1.31754 on the public LB and uses 3462 rows, which is less likely to completely flop on the private LB

broken yew
#

1.29 +- 0.4 for me

#

But 1.47 on public LB lol

#

Idk, to me the std is too high to even trust the mean

#

CV scores are also quite random between 1.2 to 1.5

scarlet orchid
#

Yeah the distribution of CV scores for each individual fold for me is very much skewed right, so I don't find it that reliable either. I'm just hoping that between my best public LB and best CV one of them gets me a good private LB score

steady nymph
#

Best CV (10 x 5 folds) is 1.242 ± 0.305, but I have no confidence in how that will generalise to the private LB.
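
Assuming "10 x 5 folds" means 5-fold CV repeated 10 times with different shuffles, a mean ± std figure like this can be produced with `RepeatedKFold`. A sketch on synthetic data (the model and data are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=200)

# 5 folds repeated 10 times gives 50 RMSE values.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0)

# The scorer returns negative RMSE, so flip the sign.
scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(f"RMSE: {scores.mean():.3f} ± {scores.std():.3f} over {len(scores)} folds")
```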

Even though that CV is smaller than some of the top scoring public notebooks
see: https://www.kaggle.com/code/paddykb/ps3e21-evaluate-top-notebooks

The "model" is better (has the lowest RMSE) in only 62% of folds.

So, even if the LB data looks a lot like the training data, it fares only slightly better than a coin toss against the "models" I tested.
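
A rough sketch of how such a per-fold "win rate" could be computed: score two candidates on identical folds and count how often one has the lower RMSE. The two models and the data here are stand-ins, not the notebooks compared in the link:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=200)  # weak signal, noisy target

# Materialise the splits so both models see exactly the same folds.
splits = list(RepeatedKFold(n_splits=5, n_repeats=4, random_state=0).split(X))

def fold_rmse(model):
    """Per-fold RMSE on the shared splits."""
    return -cross_val_score(model, X, y, cv=splits, scoring="neg_root_mean_squared_error")

rmse_a = fold_rmse(RandomForestRegressor(n_estimators=20, random_state=0))
rmse_b = fold_rmse(DummyRegressor(strategy="mean"))  # baseline: predict the mean

# Share of folds in which model A has the lower RMSE.
win_rate = float(np.mean(rmse_a < rmse_b))
print(f"model A wins in {win_rate:.0%} of {len(splits)} folds")
```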

rustic sierra
broken yew
#

Epic. Just got my best CV score of 1.24 in this comp, and then submitted with a public LB 1.53 😂

broken yew
#

OK lol, changing the random seed moved the CV from 1.24 to 1.33

shut osprey
#

Should I drop 'target' along with 'id' before sending the data to the anomaly detection algorithm?

broken yew
#

I didn't

tulip hill
#

Isn’t target the answer?

dusty dune
#

Just did my final submission on the final day, and I am ready to wipe my hands of this competition

broken yew
#

Interested to see what the top submissions did

#

Seems like feature selection is the right way

#

But gotta choose the correct one

#

The "3 features" choice didn't work out

#

Neither did my output from RFE

#

I think there's a specific subset that won the competition

scarlet orchid
#

My best private LB score was the sample submission

broken yew
#

Fr that’s so weird

#

Like we spent all the time just to make the model worse

#

Data Centric AI isn’t about submitting without making any modifications

broken yew
#

Outlier detection worked, I just submitted a few of it, but to also improve you need to select the correct subset of features holyfuck

broken yew
#

I think it does, better CV score with the leakage does not imply better private LB score

#

The sample submission had CV 1.52 it was too risky to choose that

#

“Best hyperparameter tuning for this dataset” nice

#

My guess is some records in the sample_submission were deliberately altered after the generation process and we are supposed to remove those

icy steeple
#

Feels like it was about reverse engineering the generating process

#

Too bad I didn’t get an answer to this

tacit isle
#

May I ask some stupid questions like what is feature engineering? and where the most difficult part lies? How and how much can our models benefit from feature engineering?

dapper geode