#playground-series-s4e11 | Kaggle | Page 1

latent cloud Nov 1, 2024, 2:01 PM

#

🛝 woohoo

craggy linden Nov 1, 2024, 2:59 PM

#

woohoo all_the_things

winter rivet Nov 3, 2024, 4:03 AM

#

harold

craggy linden Nov 3, 2024, 11:43 AM

#

being depressed just to see my freestyle model is performing better than all optuna-tuned models 💀

Custom tuning script doesn't seem to be a myth XD

worthy hemlock Nov 5, 2024, 12:25 AM

#

any hints on feature engineering? it seems everything I try lowers score

wary venture Nov 5, 2024, 2:30 AM

#

worthy hemlock any hints on feature engineering? it seems everything I try lowers score

Treat some of the numerical features as factors (categoricals). Add a couple of logical (boolean) features for the presence of missing values.
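A minimal sketch of those two ideas in pandas, in case it helps. The toy frame and the column names are hypothetical (only job satisfaction is mentioned in this thread), and `add_features` is an illustrative helper, not anyone's actual pipeline.

```python
import numpy as np
import pandas as pd

# Toy data; column names are assumptions, not the competition schema.
df = pd.DataFrame({
    "work_pressure": [1.0, np.nan, 3.0, 5.0],
    "job_satisfaction": [np.nan, 2.0, np.nan, 4.0],
    "age": [21, 35, 48, 29],
})

def add_features(frame, factor_cols, missing_cols):
    """Return a copy with missing-value flags and numeric-as-category columns."""
    out = frame.copy()
    # Boolean flags recording where a value was missing.
    for col in missing_cols:
        out[f"{col}_missing"] = out[col].isna()
    # Treat low-cardinality numeric columns as categories ("factors").
    for col in factor_cols:
        out[col] = out[col].astype("category")
    return out

features = add_features(
    df,
    factor_cols=["work_pressure"],
    missing_cols=["work_pressure", "job_satisfaction"],
)
print(features.dtypes)
```

Whether a given column works better as a category or as a number is something to check against your CV score rather than assume.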

#

CV is big here. I can get lgbm to 💯 on train alone. But it doesn't generalize on test very well. Tune for a healthy CV and do not get distracted by the public score on the leaderboard.
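To illustrate the gap between train score and CV score, here is a rough sketch on synthetic data. It assumes scikit-learn and uses an unpruned decision tree as a stand-in for lgbm; the dataset, fold count, and noise level are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (flip_y randomly reassigns ~20% of labels).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

# An unpruned tree can memorize the training set, much like an over-tuned GBM.
model = DecisionTreeClassifier(random_state=0)
train_acc = model.fit(X, y).score(X, y)

# Stratified 5-fold CV: every row is scored by a model that never saw it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
cv_acc = cv_scores.mean()

print(f"train accuracy: {train_acc:.3f}")
print(f"CV accuracy:    {cv_acc:.3f} (+/- {cv_scores.std():.3f})")
```

The train number says little about generalization; the CV mean (and its spread across folds) is the one to tune against.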

worthy hemlock Nov 5, 2024, 2:41 AM

#

is the leaderboard "gamed" - for example, people modifying submission file to guess correct? or people training on test?

hazy ruin Nov 5, 2024, 6:36 AM

#

worthy hemlock is the leaderboard "gamed" - for example, people modifying submission file to gu...

If I'm not wrong, people can use submission files to tweak their model (if one wants to, idk), maybe since they're provided for the competition, but no one can train on the test file.
It's not possible to train on the test file simply because the test CSV is missing the true values, so you can't really train a model since the y values are missing. The test set provided in the competition only has the x part; the y part is missing.
I have no idea how the leaderboard mechanism works exactly, though.

worthy hemlock Nov 5, 2024, 9:39 AM

#

I meant adding the submission as target column to the test dataset then retraining the model. Since we know the submission is mostly correct after first submit, I thought it might improve leaderboard at the risk of overfitting.

I think that's why some advice emphasizes CV and says to ignore the leaderboard; otherwise, by the final phase of the competition, your model is overfitting to the public leaderboard.

I guess I can kinda answer my own question and say that if I've thought of it as a total newbie, then I know others have thought of it and have definitely tried it. And that's probably why (among other reasons) there's a short final phase to the competitions.
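What is described here is usually called pseudo-labeling. A rough sketch of the idea, assuming scikit-learn and synthetic data; the 0.95/0.05 confidence cutoffs and the logistic regression model are arbitrary choices for illustration, not something from this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first 400 rows play "train", the rest play "test" (labels unknown).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test = X[400:]

# Step 1: fit on the real labels and predict the test set.
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = base.predict_proba(X_test)[:, 1]

# Step 2: keep only confident predictions as pseudo-labels (assumed thresholds).
confident = (proba > 0.95) | (proba < 0.05)
pseudo_y = (proba[confident] > 0.5).astype(int)

# Step 3: append the pseudo-labeled test rows to the training data and refit.
X_aug = np.vstack([X_train, X_test[confident]])
y_aug = np.concatenate([y_train, pseudo_y])
retrained = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

print(f"pseudo-labeled rows added: {int(confident.sum())} of {len(X_test)}")
```

The pseudo-labels inherit the first model's mistakes, so any gain should be checked with CV on the truly labeled rows only, not on the public leaderboard.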

hazy ruin Nov 5, 2024, 11:42 AM

#

worthy hemlock I meant adding the submission as target column to the test dataset then retraini...

CV is really, really helpful and important. I'm a newbie too, but it is advised to run CV with a validation dataset, which is a part of the training dataset, along with grid search and hyperparameter tuning.

For CV we don't use the testing dataset: if no y (true value) is provided, we cannot know how well the model is working.
So instead the training dataset is split into training and validation datasets (about an 80%-20% ratio, or 60-40, 70-30, ... where the larger chunk is training and the rest is validation). We score the model on the validation set (cross-validation repeats this over several different splits), and in this phase we find and fix the problems with the model.
Then we use the testing data to make predictions.
We submit these predictions and get our score depending on the metric (could be accuracy, probability, etc.).

I think the score is calculated against the true values of y, which are held in the scoring algorithm. As a cheat code, it could be possible that the submission dataframe does have the true values of the training set.

CV gives a really good idea of how well the model works in practice. That's why people put emphasis on CV.

It is mostly in Getting Started or Playground competitions like these that you even get a sample data file. Most featured, research, or advanced competitions usually don't even have a sample submission file.
That's all I know.
I just started learning ML 2 months ago, so.. yeah I gained some knowledge by participating in few competitions and all.

craggy linden Nov 8, 2024, 5:23 AM

#

So I have made a notebook in which I tried my best to follow the advice given here on Discord as well as in these discussions:

Notebook Link: "[LB: 0.94099] Depression Prediction 🧠🗣"

Feel free to leave suggestions regarding improvements!


wheat sentinel Nov 15, 2024, 5:16 PM

#

Hi, I'm new to ML and Kaggle and just submitted my results yesterday. Just wondering: what does blind-blending mean?

craggy linden Nov 19, 2024, 2:13 PM

#

wheat sentinel Hi, I'm new to ML and Kaggle and just submitted my results yesterday. Just wonde...

Hi!

As far as I understand, blind blending refers to creating an ensemble of models blindly (without proper cross-validation and weighting).

If you're wondering what CV is, you can gain a lot of information about cross validation just a few messages above this one :)

wheat sentinel Nov 19, 2024, 2:23 PM

#

craggy linden Hi! As per what I understand, blind blending refers to creating an ensemble of ...

I know what CV is, but I don't know what weighting means. Or maybe I do, but am not familiar with the lingo yet. What I am wondering is why one would want to skip CV?

craggy linden Nov 19, 2024, 4:39 PM

#

wheat sentinel I know what CV is, but I don't know what weighing means. Or maybe I do, but am n...

I don't have an answer to your question. But it's kind of odd to believe that Kaggle GMs are skipping CV!?

Stay away from kernels and discussions that encourage blends and ensemble solutions without a cv-backup. discussion source

Alright, but I think I got the real definition of blind blending.

The quoted paragraph above can be put in the context of this discussion:

Just a quick note for relative newcomers, who might wonder what's wrong with combining a few of the top scoring public notebooks and getting a high score along with a great rank on the public LB - after all, isn't the rank a vindication of tapping into the wisdom of combining good solutions? Why are some of the more experienced Kagglers warning of a shakeup at the end of the month?... discussion source

So they're talking about people blending the predictions made by other competitors, without having a CV backup in this particular case
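For contrast, a blend with a CV backup could look roughly like the sketch below: the blend weight is chosen on out-of-fold (OOF) predictions instead of on the public LB. This assumes scikit-learn and synthetic data; the two models and the weight grid are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model that did not train on it.
oof_a = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]
oof_b = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=cv, method="predict_proba"
)[:, 1]

def blend_score(w):
    """OOF accuracy of the blend w * model_a + (1 - w) * model_b."""
    blended = w * oof_a + (1 - w) * oof_b
    return accuracy_score(y, (blended > 0.5).astype(int))

# Try a small grid of weights; w=0 and w=1 are the single models.
weights = np.linspace(0.0, 1.0, 11)
scores = [blend_score(w) for w in weights]
best_w = weights[int(np.argmax(scores))]

print(f"best weight for model A: {best_w:.1f}, OOF accuracy: {max(scores):.3f}")
```

Blending other people's submission files gives you no OOF predictions at all, which is why the only feedback left is the public LB, and that is the "blind" part.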

Exploring Mental Health Data

Playground Series - Season 4, Episode 11


fresh stump Nov 24, 2024, 7:34 PM

#

can someone review my code and say what is wrong harold

haughty pewter Nov 24, 2024, 7:48 PM

#

fresh stump can someone review my code and say what is wrong harold

what's your leaderboard score? i could probably help you out somewhere

fresh stump Nov 24, 2024, 7:57 PM

#

haughty pewter what's your leaderboard score? i could probably help you out somewhere

93.8 is the maximum i got

#

Using tabnet

haughty pewter Nov 24, 2024, 7:58 PM

#

maybe you could describe what you did

fresh stump Nov 24, 2024, 7:59 PM

#

haughty pewter maybe you could describe what you did

I simply filled the missing values with the median, classified degrees based on difficulty, and label encoded them; I did the same for diet and sleep hours

#

I initially filled the NaN values of job satisfaction, which I shouldn't have, so I'm handling that with masking now

#

My attempt with lgb and catboost was even more miserable when I decided to encode cities based on whether they are metropolitan or not

fresh stump Nov 25, 2024, 6:58 PM

#

I made progress today, acc is 94.1