#playground-series-s3e21

I’m a little confused
Why are we setting some variables to all 0s
The hidden test dataset may not have all 0s in that same variable (it is not given)
After evaluating the feature importance, many variables have little or no impact on the model's predictions. We'd typically remove these variables from the model. Setting the column to zero has the same effect (i.e., the Random Forest will never choose the column for a split).
BUT setting the ID column to zero has no impact on the score at all
Playground Series - Season 3, Episode 21
I don't get it, the test dataset may not have zeros on those columns
If anything I think it makes the model worse due to data drift
When you train a model with 0s on those columns and test it on dataset with non zero values
In past playgrounds with LightGBM/CatBoost adding ID did worsen the score
I didn't even think to remove the id column ngl
now you point it out it is probably best that it's removed
hmm I think I understand why setting variables to 0s works now, I forgot you don’t construct the decision trees in the testing phase
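A minimal sketch of the point above, on synthetic data rather than the competition dataset (the feature layout and model settings are assumptions). It trains a random forest with one column zeroed, then checks that the column gets zero importance and that predictions do not change when the test rows carry non-zero values in that column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the competition data (assumption: 3 numeric features).
X_train = rng.normal(size=(300, 3))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.1, size=300)

# "Discard" feature 2 by zeroing it in the training data.
X_train_zeroed = X_train.copy()
X_train_zeroed[:, 2] = 0.0

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train_zeroed, y_train)

# A constant column offers no valid split, so no tree ever uses it.
zeroed_importance = model.feature_importances_[2]

# Test rows where the discarded column is NOT zero ...
X_test = rng.normal(size=(100, 3))
# ... and the same rows with that column zeroed.
X_test_zeroed = X_test.copy()
X_test_zeroed[:, 2] = 0.0

# The trees never look at column 2, so both versions give the same predictions.
same_predictions = np.allclose(model.predict(X_test), model.predict(X_test_zeroed))
print(zeroed_importance, same_predictions)
```

Because the trees are built only at training time, whatever values the hidden test set holds in a zeroed column cannot reach any split.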
but doesn’t explain why setting ID to 0 or not does not impact public score at all
Evaluation metric changed 
It was a more interesting problem before. I wouldn't have joined given the current setup...
why do we set the variable to zero?
what about other feature selection techniques, and can I use them on top of feature importance?
discard the feature
It's the internal working of random forest
Why do all of my submissions fail? It shows "Evaluation metric raised an unexpected error". How do I fix it? Thanks
try index=False when saving the csv: df.to_csv('submission.csv', index=False)
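A small illustration of what index=False changes, using a made-up frame (the 'id' and 'target' column names and the values are assumptions, not the competition file). With the default settings pandas writes the row index as an extra unnamed first column, which a grader expecting a fixed set of columns may reject.

```python
import io
import pandas as pd

# Hypothetical predictions; column names and values are assumptions.
submission = pd.DataFrame({"id": [0, 1, 2], "target": [7.5, 8.1, 9.3]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
submission.to_csv(with_index)

# index=False writes only the real columns.
without_index = io.StringIO()
submission.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(repr(header_with_index))
print(repr(header_without_index))
```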

Yes, I would also like to understand why data clipping works. It seems a bit ad-hoc to me
Ah, I see that makes sense
I don't think clipping the target will work
ICR 😂
😭
submission['target'] = submission['target'].clip(5.01, 11.69)
and combined with the RFCV features it gave me a score of 1.55, which is not a good score but decent.
without target clipping it gave a score of 1.39 with the same RFCV features
Clipping the target makes my local CV higher actually
The data is too noisy to predict with random forest. 3500 is a very small dataset
Dropping rows makes it even smaller (3000)
I don’t see that. Are you clipping the holdout fold? (Though, I should add, I used optuna to find the clip cutoffs for the target and features at the same time… so my mileage might vary)
Yes lmao
Wait, I think I shouldn't do that, you are right
I clipped outside of the CV loop
how do I test the dataset after feature selection and clipping the target variable?
That's what the out of fold cross validation is for
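A rough sketch of out-of-fold validation with the clipping kept inside the CV loop. The data is synthetic and the model settings are assumptions; only the cutoffs 5.01 and 11.69 come from the thread. The target is clipped on the training part of each fold only, and the holdout fold is scored against its untouched target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data (assumption).
X = rng.normal(size=(400, 4))
y = 8.0 + 2.0 * X[:, 0] + rng.normal(scale=1.5, size=400)

LOW, HIGH = 5.01, 11.69  # clip cutoffs quoted in the thread

fold_rmse = []
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X):
    # Clip the target on the training part of the fold only.
    y_train_clipped = np.clip(y[train_idx], LOW, HIGH)

    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y_train_clipped)

    # Score against the unclipped holdout target, as the hidden test set would be.
    preds = model.predict(X[valid_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))

print(f"OOF RMSE: {np.mean(fold_rmse):.3f} +/- {np.std(fold_rmse):.3f}")
```

Clipping before the split would also clip the holdout targets, which makes the local score look better than it is.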
@broken yew 👍
Who all are active in this competition, well I just got free from my college university exams and practicals...
Anybody wanna team up!!
Till now I only created a base model notebook...
Gonna speed up on this soon✨✨
Me on kaggle: https://www.kaggle.com/manavgupta92
What area of ml/ai are you interested in?
I'm interested in doing competitions but my schools about to start and I'm not sure I will have the time
Sure !
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
data_cluster = sample.drop(['id'], axis=1)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cluster)
params_grid2 = {
'eps': [np.round(float(x), 1) for x in np.linspace(0.1, 1.0, 10)],
'min_samples': [int(x) for x in np.linspace(5, 50, 20)]
}
db = DBSCAN()
grid_search = GridSearchCV(db, params_grid2, cv=5, scoring=silhouette_score, verbose=1)
grid_search.fit(scaled_data)
best_params3 = grid_search.best_params_
db_best = DBSCAN(**best_params3)
array_of_cluster = db_best.fit_predict(scaled_data)
print("Cluster Labels:", array_of_cluster)
i am getting array_of_cluster with the whole array filled with -1
try to remove the noise in the dataset
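One possible cause, offered as a guess rather than a diagnosis: GridSearchCV calls its scorer as scorer(estimator, X, y) and scores on held-out folds, while silhouette_score takes (X, labels) and DBSCAN has no predict method. The search above may therefore fail to score any candidate and end up with a poor parameter choice, which could explain the all -1 labels. A hand-written loop avoids this. The sketch below uses synthetic blobs and a shortened min_samples grid (both assumptions) and scores each setting with the silhouette of the non-noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption).
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=0)
scaled = StandardScaler().fit_transform(X)

best_score, best_params, best_labels = -1.0, None, None
for eps in np.round(np.linspace(0.1, 1.0, 10), 1):
    for min_samples in (5, 10, 20, 50):
        labels = DBSCAN(eps=float(eps), min_samples=min_samples).fit_predict(scaled)

        # Ignore noise points (-1); silhouette needs at least two clusters.
        clustered = labels != -1
        if len(set(labels[clustered])) < 2:
            continue

        score = silhouette_score(scaled[clustered], labels[clustered])
        if score > best_score:
            best_score = score
            best_params = (float(eps), min_samples)
            best_labels = labels

print("best (eps, min_samples):", best_params)
```

If every setting is skipped, eps is probably too small for the scaled data; with many features, distances between points grow, so the eps range may need to go well above 1.0.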
How can we test the optimal size of a submission? All my submissions with fewer than 3000 rows have poor results. But subs with 3000+ rows are definitely too noisy... A bit confused about how to proceed further. Tried a lot of stuff but stuck at 1.35
Subs with 3000+ rows def aren't as noisy as you'd think. I was able to get 1.31382 on public LB with 3497 rows
And your current 1.27719? (if it's not a secret)
i am tuning isolation forest and there is an attribute called "max_features"
what values should i consider?
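For reference, scikit-learn's IsolationForest accepts max_features either as a float (a fraction of the columns drawn for each tree, default 1.0) or as an int (an absolute number of columns). The sketch below tries a few values on synthetic data; the candidate values, the contamination setting, and the data are all assumptions, not a recommendation for this competition.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 6 features, a few shifted rows.
X = rng.normal(size=(500, 6))
X[:10] += 8.0

kept_rows = {}
for max_features in (1.0, 0.5, 2):  # all columns, half of them, exactly two
    iso = IsolationForest(max_features=max_features, contamination=0.05,
                          random_state=0)
    labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier
    kept_rows[max_features] = int((labels == 1).sum())

print(kept_rows)
```

Which value helps is an empirical question, so it makes sense to compare the resulting filtered datasets with the same out-of-fold CV used for everything else.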
3048 rows, although this leads to a high CV score of 1.32308 ± 0.58653, so I am not confident that this will fare well in the private LB. My lowest submitted CV score of 1.14431 ± 0.34609 got a score of 1.31754 on the public LB and uses 3462 rows, which is less likely to completely flop on the private LB
1.29 +- 0.4 for me
But 1.47 on public LB lol
Idk, to me the std is too high to even trust the mean
CV scores are also quite random between 1.2 to 1.5
Yeah the distribution of CV scores for each individual fold for me is very much skewed right, so I don't find it that reliable either. I'm just hoping that between my best public LB and best CV one of them gets me a good private LB score
Best CV (10 x 5 folds) is 1.242 ± 0.305, but no confidence of how that will generalise to the private LB.
Even though that CV is smaller than some of the top scoring public notebooks
see: https://www.kaggle.com/code/paddykb/ps3e21-evaluate-top-notebooks
The "model" is better (has the lowest RMSE) in only 62% of folds.
So, even if the LB data looks a lot like the training data, it fares just better than a coin toss against the "models" I tested.
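A sketch of how such a fold-by-fold comparison could be set up, assuming 10 repeats of 5-fold CV as mentioned above. The two estimators and the synthetic data are stand-ins (assumptions); in the thread the things being compared are candidate submissions, not model types.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)

# Noisy synthetic data (assumption), small like the competition set.
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=2.0, size=300)

def fold_rmse(model, train_idx, valid_idx):
    """Fit on the training part of a fold and return RMSE on the holdout part."""
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    return np.sqrt(mean_squared_error(y[valid_idx], preds))

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
rmse_a, rmse_b = [], []
for train_idx, valid_idx in cv.split(X):
    # Both candidates are scored on exactly the same splits.
    rmse_a.append(fold_rmse(Ridge(alpha=1.0), train_idx, valid_idx))
    rmse_b.append(fold_rmse(
        RandomForestRegressor(n_estimators=20, random_state=0),
        train_idx, valid_idx))

rmse_a, rmse_b = np.array(rmse_a), np.array(rmse_b)
win_rate_a = float(np.mean(rmse_a < rmse_b))  # share of folds where A has lower RMSE
print(f"A: {rmse_a.mean():.3f} +/- {rmse_a.std():.3f}")
print(f"B: {rmse_b.mean():.3f} +/- {rmse_b.std():.3f}")
print(f"A wins in {win_rate_a:.0%} of folds")
```

A win rate near 50% alongside a lower mean suggests the gap between the two is mostly noise.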
Hmm... Interesting point. Thanks for sharing it!
Epic. Just got my best CV score of 1.24 in this comp, and then submitted with a public LB 1.53 😂
OK lol, changing the random seed moved the CV from 1.24 to 1.33
should i drop 'target' along with 'id' before sending it to the anomaly detection algorithm?
I didn't
Isn’t target the answer?
Just did my final submission on the final day, and I am ready to wipe my hands of this competition
Interested to see what the top submissions did
Seems like feature selection is the right way
But gotta choose the correct one
The "3 features" choice didn't work out
Neither did my output from RFE
I think there's a specific subset that won the competition

My best private LB score was the sample submission
Same 
Fr that’s so weird
Like we spent all the time just to make the model worse
Data Centric AI isn’t about submitting without making any modifications
Outlier detection worked, I only submitted a few of those, but to improve further you also need to select the correct subset of features
I think it does, better CV score with the leakage does not imply better private LB score
The sample submission had CV 1.52 it was too risky to choose that
“Best hyperparameter tuning for this dataset” 
My guess is some records in the sample_submission were deliberately altered after the generation process and we are supposed to remove those
Feels like it was about reverse engineering the generating process
Too bad I didn’t get an answer to this
May I ask some stupid questions, like what is feature engineering, and where does the most difficult part lie? How, and how much, can our models benefit from feature engineering?
No question is stupid, but this is a really difficult question to answer. Simply put, feature engineering is one of the hardest parts of data science and it means creating, changing, or manipulating the data to get better performance in your model
