#playground-series-s3e21 | Kaggle | Page 1

broken yew Aug 22, 2023, 11:45 PM

#

kerneler

small coyote Aug 23, 2023, 12:13 AM

#

thank_you

broken yew Aug 23, 2023, 2:10 PM

#

I’m a little confused

#

Why are we setting some variables to all 0s

#

The hidden test dataset may not have all 0s in that same variable (it is not given)

steady nymph Aug 23, 2023, 2:18 PM

#

After evaluating the feature importance, many variables have little or no impact on the model's predictions. We'd typically remove these variables from the model. Setting the column to zero has the same effect (i.e., the Random Forest will never choose the column for a split).
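
A minimal sketch of that point, assuming synthetic data and scikit-learn's `RandomForestRegressor` rather than the competition's actual setup. It zeroes one training column, checks that no tree ever splits on it, and checks that predictions are the same whether or not the test data has non-zero values in that column:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 4 features, target depends on columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# "Remove" column 1 by making it constant in the training data.
X_zeroed = X.copy()
X_zeroed[:, 1] = 0.0

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_zeroed, y)
importances = rf.feature_importances_

# Count tree nodes that split on the zeroed column across the whole forest.
splits_on_zeroed = sum(int(np.sum(est.tree_.feature == 1)) for est in rf.estimators_)

# Test rows with non-zero values in column 1 get the same predictions as
# test rows where that column is zeroed, because no split ever reads it.
X_test = rng.normal(size=(50, 4))
X_test_zeroed = X_test.copy()
X_test_zeroed[:, 1] = 0.0
same_predictions = bool(np.allclose(rf.predict(X_test), rf.predict(X_test_zeroed)))

print(importances, splits_on_zeroed, same_predictions)
```

A constant column offers no candidate split point, which is why zeroing behaves like dropping the feature here.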

broken yew Aug 23, 2023, 2:50 PM

#

BUT setting the ID column to zero has no impact on the score at all

#

https://www.kaggle.com/competitions/playground-series-s3e21/discussion/433967

Improve a Fixed Model the Data-Centric Way!

Playground Series - Season 3, Episode 21

broken yew Aug 23, 2023, 2:52 PM

#

steady nymph After evaluating the feature importance, many variables have little or no impact...

I don't get it, the test dataset may not have zeros on those columns

#

If anything I think it makes the model worse due to data drift

#

When you train a model with 0s in those columns and test it on a dataset with non-zero values

#

In past playgrounds with LightGBM/CatBoost adding ID did worsen the score

twilit shore Aug 23, 2023, 3:06 PM

#

I didn't even think to remove the id column ngfl

#

now you point it out it is probably best that it's removed

broken yew Aug 23, 2023, 3:09 PM

#

hmm I think I understand why setting variables to 0s works now, I forgot you don’t construct the decision trees in the testing phase

#

but that doesn’t explain why setting ID to 0 or not has no impact on the public score at all

broken yew Aug 23, 2023, 11:37 PM

#

Evaluation metric changed catshock

steady nymph Aug 24, 2023, 7:19 AM

#

broken yew Evaluation metric changed catshock

It was a more interesting problem before. I wouldn't have joined given the current setup...

shut osprey Aug 24, 2023, 12:18 PM

#

broken yew hmm I think I understand why setting variables to 0s work now, I forgot you don’...

why do you set the variable to zero?

#

what about other feature selection techniques, and can I use them on top of feature importance?

broken yew Aug 24, 2023, 12:33 PM

#

shut osprey why do you set the variable to zero?

discard the feature

#

It's the internal working of random forest
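
On the question of other feature selection techniques: one option is recursive feature elimination with cross-validation. A rough sketch, assuming synthetic data and scikit-learn's `RFECV` (the column layout is hypothetical, not the competition's). The unselected columns are then zeroed instead of dropped, matching the trick discussed above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic data (assumption): only columns 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# RFECV repeatedly drops the least important feature and uses
# cross-validated RMSE to pick how many features to keep.
selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    step=1,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)

# Zero the unselected columns rather than dropping them.
X_zeroed = X.copy()
X_zeroed[:, ~selector.support_] = 0.0

print("selected columns:", selected)
```

Whether the selected subset generalises is a separate question; it still needs to be checked with out-of-fold scores.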

broken yew Aug 24, 2023, 2:31 PM

#

Why are people doing data clipping though 🤔

#

thresholding

tidal shale Aug 24, 2023, 3:28 PM

#

Why do all of my submissions error out? It shows "Evaluation metric raised an unexpected error". How do I fix it? Thanks

finite oar Aug 24, 2023, 7:23 PM

#

tidal shale Why all of my submission is error, it shows "Evaluation metric raised an unexpec...

try index=False when saving the CSV: df.to_csv('submission.csv', index=False)
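
A small illustration of what `index=False` changes, using a made-up frame (the `id` and `target` column names are assumptions here). With the default settings pandas writes the row index as an extra unnamed first column, which is a plausible cause of the evaluation error:

```python
import io
import pandas as pd

# Hypothetical submission frame, for illustration only.
submission = pd.DataFrame({"id": [0, 1, 2], "target": [8.1, 9.4, 7.7]})

# Default behaviour: the row index is written as an extra, unnamed column.
with_index = io.StringIO()
submission.to_csv(with_index)

# index=False writes only the real columns.
without_index = io.StringIO()
submission.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)     # ,id,target
print(header_without_index)  # id,target
```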

indigo zinc Aug 25, 2023, 1:07 PM

#

kernelina

agile sphinx Aug 26, 2023, 9:41 AM

#

broken yew Why are people doing data clipping though 🤔

Yes, I would also like to understand why data clipping works. It seems a bit ad hoc to me

agile sphinx Aug 26, 2023, 10:07 AM

#

Ah, I see that makes sense

orchid wagon Aug 28, 2023, 6:53 AM

#

I don't think clipping the target will work

broken yew Aug 28, 2023, 9:09 AM

#

orchid wagon I don't think clipping the target will work

ICR 😂

orchid wagon Aug 28, 2023, 9:41 AM

#

😭

shut osprey Aug 28, 2023, 9:56 AM

#

submission['target'] = submission['target'].clip(5.01, 11.69)
I combined this with RFCV features and it gave me a score of 1.55, which is not a good score but decent.
Without target clipping it gave a score of 1.39 with the same RFCV features.

broken yew Aug 28, 2023, 2:15 PM

#

Clipping the target makes my local CV higher actually

#

The data is too noisy to predict with random forest. 3500 is a very small dataset

#

Dropping rows makes it even smaller (3000)

steady nymph Aug 28, 2023, 2:41 PM

#

broken yew Clipping the target makes my local CV higher actually

I don’t see that. Are you clipping the holdout fold? (Though, I should add, I used optuna to find the clip cutoffs for the target and features at the same time… so my mileage might vary)

broken yew Aug 28, 2023, 11:04 PM

#

steady nymph I don’t see that. Are you clipping the holdout fold? (Though, I should add, I us...

Yes lmao

#

Wait, I think I shouldn't do that, you are right

#

I clipped outside of the CV loop
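
A rough sketch of the difference, assuming synthetic heavy-tailed data and a plain `KFold` loop (the clip bounds are the ones quoted earlier in the chat). Clipping only the training target inside the loop keeps the holdout fold untouched; clipping the holdout as well hides the hardest rows from the metric:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data with heavy-tailed noise (assumption, not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 8.0 + X[:, 0] + rng.standard_t(df=2, size=400)

LOW, HIGH = 5.01, 11.69  # clip bounds quoted earlier in the chat


def cv_rmse(clip_holdout: bool) -> float:
    """Mean RMSE over 5 folds; the training target is always clipped."""
    scores = []
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, valid_idx in folds.split(X):
        y_train = np.clip(y[train_idx], LOW, HIGH)  # the data edit under test
        y_valid = y[valid_idx]
        if clip_holdout:
            # Leaky variant: the metric no longer sees the extreme rows.
            y_valid = np.clip(y_valid, LOW, HIGH)
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X[train_idx], y_train)
        pred = model.predict(X[valid_idx])
        scores.append(np.sqrt(mean_squared_error(y_valid, pred)))
    return float(np.mean(scores))


honest = cv_rmse(clip_holdout=False)
leaky = cv_rmse(clip_holdout=True)
print(f"holdout untouched: {honest:.3f}  holdout clipped: {leaky:.3f}")
```

The second number looks better only because the evaluation data was edited, so it says nothing about the leaderboard, where the test rows are not clipped.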

shut osprey Aug 29, 2023, 4:56 AM

#

how do I test the dataset after feature selection and clipping the target variable?

broken yew Aug 29, 2023, 9:16 AM

#

shut osprey how do I test the dataset after feature selection and clipping the target variable...

That's what out-of-fold cross-validation is for

shut osprey Aug 29, 2023, 9:18 AM

#

@broken yew 👍

jade adder Aug 29, 2023, 5:18 PM

#

Who all is active in this competition? I just got free from my university exams and practicals...

Anybody wanna team up?

So far I've only created a base model notebook...

Gonna speed up on this soon✨✨

Me on kaggle: https://www.kaggle.com/manavgupta92


tulip hill Aug 29, 2023, 5:30 PM

#

What area of ml/ai are you interested in?

#

I'm interested in doing competitions but my school's about to start and I'm not sure I will have the time

broken yew Aug 30, 2023, 1:13 AM

#

jade adder Who all are active in this competition, well I just got free from my college uni...

Sure !

shut osprey Aug 30, 2023, 10:08 AM

#

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

data_cluster = sample.drop(['id'], axis=1)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cluster)

params_grid2 = {
    'eps': [np.round(float(x), 1) for x in np.linspace(0.1, 1.0, 10)],
    'min_samples': [int(x) for x in np.linspace(5, 50, 20)]
}

db = DBSCAN()

grid_search = GridSearchCV(db, params_grid2, cv=5, scoring=silhouette_score, verbose=1)
grid_search.fit(scaled_data)

best_params3 = grid_search.best_params_

db_best = DBSCAN(**best_params3)
array_of_cluster = db_best.fit_predict(scaled_data)

print("Cluster Labels:", array_of_cluster)

I am getting array_of_cluster with the whole array filled with -1.
I'm trying to remove the noise in the dataset.
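
One plausible explanation, offered as an assumption: `GridSearchCV` expects a scorer with the signature `(estimator, X, y)`, and `DBSCAN` has no `predict` method, so passing `silhouette_score` directly as `scoring` cannot rank the candidates properly. A label array of all -1 means every point was treated as noise, which usually points to `eps` being too small for the scaled data. A hand-rolled search is one way around it. The sketch below uses synthetic blobs as a stand-in for the real data and a hypothetical parameter range:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
scaled = StandardScaler().fit_transform(X)

best = None  # (silhouette, eps, min_samples, labels)
for eps in np.linspace(0.3, 2.0, 8):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
        mask = labels != -1  # ignore points flagged as noise
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:
            continue  # silhouette needs at least two clusters
        score = silhouette_score(scaled[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, float(eps), min_samples, labels)

if best is None:
    print("No setting produced two or more clusters; try larger eps values.")
else:
    print(f"eps={best[1]:.2f}, min_samples={best[2]}, silhouette={best[0]:.3f}")
    print("rows flagged as noise:", int(np.sum(best[3] == -1)))
```

Scoring only the non-noise points is a design choice, and it can favour settings that throw away a lot of rows, so the noise count is worth checking alongside the silhouette.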

rustic sierra Aug 30, 2023, 4:06 PM

#

How can we test the optimal size of a submission? All my submissions with fewer than 3000 rows have poor results, but subs with 3000+ rows are definitely too noisy... A bit confused about how to proceed further. Tried a lot of stuff but stuck at 1.35

scarlet orchid Aug 30, 2023, 9:45 PM

#

Subs with 3000+ rows def aren't as noisy as you'd think. I was able to get 1.31382 on public LB with 3497 rows

rustic sierra Aug 31, 2023, 7:45 AM

#

scarlet orchid Subs with 3000+ rows def aren't as noisy as you'd think. I was able to get 1.313...

And your current 1.27719? (if it's not a secret)

shut osprey Aug 31, 2023, 7:57 AM

#

I am tuning an isolation forest and there is an attribute called "max_features".
What values should I consider?
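
For reference, in scikit-learn's `IsolationForest` the `max_features` parameter controls how many features each tree draws: an int is a count, a float is a fraction, and the default is 1.0 (all features). A small sketch on synthetic data with planted outliers (the candidate values and the `contamination` setting are assumptions, not recommendations) that compares which rows get flagged:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): 500 rows, the first 10 shifted far away.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:10] += 6.0

flagged = {}
for max_features in (0.3, 0.6, 1.0):  # fractions of the 10 columns
    iso = IsolationForest(
        n_estimators=100,
        max_features=max_features,
        contamination=0.02,
        random_state=0,
    )
    labels = iso.fit_predict(X)  # -1 marks an outlier, 1 an inlier
    flagged[max_features] = np.flatnonzero(labels == -1)

for max_features, rows in flagged.items():
    print(max_features, len(rows), rows[:10])
```

Since there are no outlier labels in the competition data, the practical check is still the out-of-fold score after dropping the flagged rows.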

scarlet orchid Aug 31, 2023, 12:59 PM

#

rustic sierra And your current 1.27719? (if it's not a secret)

3048 rows, although this leads to a high CV score of 1.32308 ± 0.58653, so I am not confident that this will fare well in the private LB. My lowest submitted CV score of 1.14431 ± 0.34609 got a score of 1.31754 on the public LB and uses 3462 rows, which is less likely to completely flop on the private LB

broken yew Aug 31, 2023, 1:07 PM

#

1.29 ± 0.4 for me

#

But 1.47 on public LB lol

#

Idk, to me the std is too high to even trust the mean

#

CV scores are also quite random between 1.2 to 1.5

scarlet orchid Aug 31, 2023, 1:20 PM

#

Yeah the distribution of CV scores for each individual fold for me is very much skewed right, so I don't find it that reliable either. I'm just hoping that between my best public LB and best CV one of them gets me a good private LB score

steady nymph Aug 31, 2023, 1:59 PM

#

Best CV (10 x 5 folds) is 1.242 ± 0.305, but I have no confidence in how that will generalise to the private LB.

Even though that CV is lower than that of some of the top-scoring public notebooks
(see: https://www.kaggle.com/code/paddykb/ps3e21-evaluate-top-notebooks)

The "model" is better (has the lowest RMSE) in only 62% of folds.

So, even if the LB data looks a lot like the training data, it fares only slightly better than a coin toss against the "models" I tested.
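
A minimal sketch of that kind of fold-by-fold comparison, assuming synthetic data and two hypothetical candidates (the same fixed estimator trained on two different edits of the training data). It runs 10 x 5 repeated K-fold and reports the mean ± std RMSE plus the share of folds in which one candidate beats the other:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold

# Synthetic, noisy data (assumption): only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + rng.normal(scale=2.0, size=300)


def keep_all(X_train):
    return X_train


def zero_last_three(X_train):
    X_out = X_train.copy()
    X_out[:, 3:] = 0.0  # zero the last three columns in the training data only
    return X_out


# Hypothetical candidates: same fixed model, different training-data edits.
candidates = {"all_features": keep_all, "zeroed_noise": zero_last_three}

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
fold_rmse = {name: [] for name in candidates}
for train_idx, valid_idx in rkf.split(X):
    for name, edit in candidates.items():
        model = RandomForestRegressor(n_estimators=30, random_state=0)
        model.fit(edit(X[train_idx]), y[train_idx])
        pred = model.predict(X[valid_idx])  # holdout rows are left untouched
        fold_rmse[name].append(np.sqrt(mean_squared_error(y[valid_idx], pred)))

a = np.array(fold_rmse["zeroed_noise"])
b = np.array(fold_rmse["all_features"])
win_rate = float(np.mean(a < b))

for name, scores in fold_rmse.items():
    print(f"{name}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
print(f"zeroed_noise wins in {win_rate:.0%} of folds")
```

Comparing on identical folds removes the split-to-split noise from the comparison, which the two separate mean ± std figures cannot do.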

rustic sierra Aug 31, 2023, 2:31 PM

#

steady nymph Best CV (10 x 5 folds) is 1.242 ± 0.305, but no confidence of how that will gene...

Hmm... Interesting point. Thanks for sharing it!

broken yew Sep 3, 2023, 9:39 AM

#

Epic. Just got my best CV score of 1.24 in this comp, and then submitted with a public LB 1.53 😂

broken yew Sep 3, 2023, 1:27 PM

#

OK lol, changing the random seed moved the CV from 1.24 to 1.33

shut osprey Sep 6, 2023, 12:42 PM

#

should I drop 'target' along with 'id' before sending it to the anomaly detection algorithm?

broken yew Sep 6, 2023, 1:08 PM

#

I didn't

tulip hill Sep 9, 2023, 7:59 PM

#

Isn’t target the answer?

dusty dune Sep 11, 2023, 10:10 PM

#

Just did my final submission on the final day, and I am ready to wash my hands of this competition

broken yew Sep 12, 2023, 12:24 AM

#

Interested to see what the top submissions did

#

Seems like feature selection is the right way

#

But gotta choose the correct one

#

The "3 features" choice didn't work out

#

Neither did my output from RFE

#

I think there's a specific subset that won the competition

#

https://www.kaggle.com/competitions/playground-series-s3e21/discussion/438597


#

holyfuck

scarlet orchid Sep 12, 2023, 1:01 AM

#

My best private LB score was the sample submission

broken yew Sep 12, 2023, 1:25 AM

#

scarlet orchid My best private LB score was the sample submission

Same holyfuck

#

Fr that’s so weird

#

Like we spent all the time just to make the model worse

#

Data Centric AI isn’t about submitting without making any modifications

broken yew Sep 12, 2023, 4:59 AM

#

Outlier detection worked, I only submitted a few of those, but to improve further you also need to select the correct subset of features holyfuck

broken yew Sep 12, 2023, 5:35 AM

#

I think it does; a better CV score with the leakage does not imply a better private LB score

#

The sample submission had a CV of 1.52, so it was too risky to choose that

#

“Best hyperparameter tuning for this dataset” nice

#

My guess is some records in the sample_submission were deliberately altered after the generation process and we are supposed to remove those

icy steeple Sep 12, 2023, 10:19 AM

#

Feels like it was about reverse engineering the generating process

#

https://www.kaggle.com/competitions/playground-series-s3e21/discussion/434161


#

Too bad I didn’t get an answer to this

tacit isle Sep 21, 2023, 11:51 PM

#

May I ask some stupid questions, like: what is feature engineering, and where does the most difficult part lie? How, and how much, can our models benefit from feature engineering?

dapper geode Oct 23, 2023, 3:34 PM

#

tacit isle May I ask some stupid questions like what is feature engineering? and where the ...

No question is stupid, but this is a really difficult question to answer. Simply put, feature engineering is one of the hardest parts of data science and it means creating, changing, or manipulating the data to get better performance in your model
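
A tiny illustration of "creating" and "changing" data, using a made-up table (all column names are hypothetical and unrelated to this competition):

```python
import numpy as np
import pandas as pd

# Made-up raw data, for illustration only.
df = pd.DataFrame({
    "length_cm": [10.0, 20.0, 15.0],
    "width_cm": [2.0, 5.0, 3.0],
    "price": [100.0, 1000.0, 10000.0],
})

engineered = df.assign(
    # Creating: combine two raw columns into one that may be more predictive.
    area_cm2=df["length_cm"] * df["width_cm"],
    # Changing: compress a heavily skewed scale so large values dominate less.
    log_price=np.log10(df["price"]),
)

print(engineered)
```

Whether a new column actually helps depends on the model and the data, so each idea still has to be validated with cross-validation.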