#playground-series-s4e11


being depressed just to see my freestyle model is performing better than all optuna-tuned models
Custom tuning script doesn't seem to be a myth XD
any hints on feature engineering? it seems everything I try lowers score
Treat some of the numerical features as factors (categoricals). Add a couple of logical (true/false) features for the presence of missing values.
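Here's a minimal, dependency-free sketch of those two ideas, in case it helps. The column names and toy rows are made up for illustration, not taken from the competition data, and with pandas the same thing is a couple of one-liners.

```python
# Toy rows; the column names are hypothetical.
rows = [
    {"age": 21, "work_pressure": 3.0, "job_satisfaction": None},
    {"age": 34, "work_pressure": None, "job_satisfaction": 4.0},
    {"age": 28, "work_pressure": 5.0, "job_satisfaction": 2.0},
]

def add_features(row,
                 numeric_as_factor=("work_pressure",),
                 flag_missing=("work_pressure", "job_satisfaction")):
    """Return a copy of `row` with missing-value flags and factor columns."""
    out = dict(row)
    # Logical features: was the value missing in the raw data?
    for col in flag_missing:
        out[f"{col}_missing"] = row[col] is None
    # Numeric-as-factor: turn the number into a category label,
    # with "missing" as its own level instead of an imputed number.
    for col in numeric_as_factor:
        value = row[col]
        out[f"{col}_cat"] = "missing" if value is None else str(int(value))
    return out

engineered = [add_features(r) for r in rows]
```

The factor columns would then go to the model as categoricals rather than as numbers.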
CV is big here. I can get lgbm to 💯 on train alone, but it doesn't generalize to test very well. Tune for a healthy CV and do not get distracted by the public score on the leaderboard.
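To show why a perfect train score says so little, here's a small sketch. It's a toy, not the lgbm setup above: a 1-nearest-neighbour "model" that simply memorises noisy synthetic data, scored once on its own training rows and once with 5-fold CV. All names and the data are assumptions for illustration.

```python
import random

def one_nn_predict(train_X, train_y, x):
    # 1-nearest-neighbour: return the label of the closest training point.
    best = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[best]

def accuracy(train_X, train_y, eval_X, eval_y):
    hits = sum(one_nn_predict(train_X, train_y, x) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_X)

def k_fold_accuracy(X, y, k=5):
    # Every k-th row forms one fold; each fold is held out exactly once.
    idx = list(range(len(X)))
    scores = []
    for fold in range(k):
        val_idx = idx[fold::k]
        held_out = set(val_idx)
        tr_idx = [i for i in idx if i not in held_out]
        scores.append(accuracy([X[i] for i in tr_idx], [y[i] for i in tr_idx],
                               [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / k

rng = random.Random(0)
X = [rng.random() for _ in range(200)]
# The label follows x > 0.5, but roughly 20% of the labels are flipped (noise).
y = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in X]

train_score = accuracy(X, y, X, y)  # scored on rows the model has memorised
cv_score = k_fold_accuracy(X, y)    # scored only on held-out rows
print(f"train accuracy: {train_score:.3f}, 5-fold CV accuracy: {cv_score:.3f}")
```

The train score is perfect because every row is its own nearest neighbour, while the CV score reflects how the model does on rows it has not seen.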
is the leaderboard "gamed" - for example, people modifying submission file to guess correct? or people training on test?
If I'm not wrong, people can use the submission file to tweak their model (if one wants to... idk), since it is provided for the competition, but no one can train on the test file.
It's not possible to train on the test file simply because the test CSV is missing the true values, so you can't really fit a model when the y value is missing. The test set provided in the competition only has the X part; the y part is missing.
I have no idea about how the leaderboard mechanism works exactly though.
I meant adding the submission as target column to the test dataset then retraining the model. Since we know the submission is mostly correct after first submit, I thought it might improve leaderboard at the risk of overfitting.
I think that's why some advice emphasizes CV and says to ignore the leaderboard; otherwise, by the final phase of the competition, your model ends up overfit to the public leaderboard.
I guess I can kinda answer my own question and say that if I've thought of it as a total newbie, then I know others have thought of it and have definitely tried it. And that's probably why (among other reasons) there's a short final phase to the competitions.
CV is really, really helpful and important. I'm a newbie too, but it is advised to do CV with a validation dataset, which is a part of the training dataset, along with grid search and hyperparameter tuning.
For CV we don't use the test dataset: since no y (true value) is provided, we cannot know how well the model is doing on it.
So instead the training dataset is split into training and validation datasets (about 80%-20%, or 60-40, 70-30... the larger chunk is for training and the rest is validation). We do cross-validation (CV) by scoring the model on the validation part, and in this phase we find and fix problems with the model.
Then we use the test data to make predictions.
We submit these predictions and get our score depending on the metric (could be accuracy, a probability-based score, etc.).
I think the score is calculated against the true values of y, which the scoring system holds. As a cheat code, it could be possible that the submission dataframe does have the true values of the training set.
CV gives a really good idea of how well the model will work on data it hasn't seen. That's why people put emphasis on CV.
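For anyone who wants to see the split itself, here's a rough sketch of an 80/20 hold-out using only row indices. The function name and defaults are made up for illustration. In k-fold CV the same idea is repeated so that every chunk takes a turn as the validation part.

```python
import random

def train_valid_split(n_rows, valid_fraction=0.2, seed=42):
    """Shuffle row indices and hold out `valid_fraction` of them for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed, so the split is reproducible
    n_valid = int(round(n_rows * valid_fraction))
    return indices[n_valid:], indices[:n_valid]

# 100 labelled training rows -> 80 to fit on, 20 to validate on.
train_idx, valid_idx = train_valid_split(100)
```

The test rows never enter this split; they are only used for the final predictions.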
It is mostly in Getting Started or Playground competitions like these that you even get a sample data file. Most Featured, Research, or advanced competitions usually don't even have a sample submission file.
That's all I know.
I just started learning ML 2 months ago, so.. yeah I gained some knowledge by participating in few competitions and all.
So I have made a notebook in which I tried my best to follow the advice given here on Discord as well as in these discussions:
- Notebook Link: "[LB: 0.94099] Depression Prediction 🧠"
Feel free to leave suggestions regarding improvements!
Hi, I'm new to ML and Kaggle and just submitted my results yesterday. Just wondering: what does blind-blending mean?
Hi!
As far as I understand, blind blending refers to creating an ensemble of models blindly (without proper cross-validation and weighting)
If you're wondering what CV is, you can gain a lot of information about cross validation just a few messages above this one :)
I know what CV is, but I don't know what weighting means. Or maybe I do, but am not familiar with the lingo yet. What I am wondering is why one would want to skip CV?
I don't have an answer to your question. But it's kind of odd to believe that Kaggle GMs are skipping CV!?
Stay away from kernels and discussions that encourage blends and ensemble solutions without a cv-backup. discussion source
Alright, but I think I got the real definition of blind blending.
The quoted paragraph above can be put in the context of this discussion:
Just a quick note for relative newcomers, who might wonder what's wrong with combining a few of the top scoring public notebooks and getting a high score along with a great rank on the public LB - after all, isn't the rank a vindication of tapping into the wisdom of combining good solutions? Why are some of the more experienced Kagglers warning of a shakeup at the end of the month?... discussion source
So they're talking about people blending the predictions made by other competitors, without having a CV backup in this particular case
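For contrast, here's a small sketch of a blend that does have a CV backup. It assumes you already have out-of-fold (OOF) predictions for two of your own models on the training rows, and it picks the blend weight by its OOF score instead of by the public LB. The numbers and names are toy values made up for illustration.

```python
def accuracy(probs, labels, threshold=0.5):
    # Share of rows where the thresholded probability matches the label.
    return sum((p >= threshold) == bool(t) for p, t in zip(probs, labels)) / len(labels)

def best_blend_weight(oof_a, oof_b, labels, steps=11):
    """Grid-search w in [0, 1] for the blend w * A + (1 - w) * B, scored on OOF rows."""
    best_w, best_score = 0.0, -1.0
    for i in range(steps):
        w = i / (steps - 1)
        blended = [w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)]
        score = accuracy(blended, labels)
        if score > best_score:  # keep the first weight that reaches the best score
            best_w, best_score = w, score
    return best_w, best_score

labels = [1, 0, 1, 1, 0, 0]
oof_a = [0.9, 0.6, 0.4, 0.8, 0.2, 0.3]  # model A misses rows 1 and 2
oof_b = [0.6, 0.1, 0.9, 0.3, 0.4, 0.6]  # model B misses rows 3 and 5

weight, blend_score = best_blend_weight(oof_a, oof_b, labels)
print(accuracy(oof_a, labels), accuracy(oof_b, labels), weight, blend_score)
```

The chosen weight is then applied to the two models' test predictions. A blind blend skips this check and just averages whatever scores well on the public LB.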
can someone review my code and say what is wrong 
what's your leaderboard score? i could probably help you out somewhere
93.8 is the maximum i got
Using tabnet
maybe you could describe what you did
I simply filled the missing values with the median, classified degrees by difficulty and label encoded them, and did the same for diet and sleep hours
I initially filled the NaN values of job satisfaction, which I shouldn't have, so I'm handling that with masking now
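A rough sketch of that kind of preprocessing, in case someone wants to compare notes. The degree tiers, column names, and rows here are hypothetical (the actual mapping wasn't shared). The median is learned on train only and reused on test, and job satisfaction is left unfilled with a flag next to it.

```python
from statistics import median

# Hypothetical ordering of degrees by difficulty; not the mapping used above.
DEGREE_TIER = {"Class 12": 0, "BSc": 1, "MSc": 2, "PhD": 3}

def fit_median(rows, column):
    """Median of the observed (non-missing) values in one column."""
    return median(r[column] for r in rows if r[column] is not None)

def transform(rows, sleep_median):
    out = []
    for r in rows:
        out.append({
            # Ordinal encoding; -1 marks a degree that is not in the mapping.
            "degree_tier": DEGREE_TIER.get(r["degree"], -1),
            # Median imputation with the value learned from train.
            "sleep_hours": sleep_median if r["sleep_hours"] is None else r["sleep_hours"],
            # Masked rather than imputed: keep None and add a missing flag.
            "job_satisfaction": r["job_satisfaction"],
            "job_satisfaction_missing": r["job_satisfaction"] is None,
        })
    return out

train = [
    {"degree": "BSc", "sleep_hours": 6.0, "job_satisfaction": None},
    {"degree": "PhD", "sleep_hours": None, "job_satisfaction": 3.0},
    {"degree": "MSc", "sleep_hours": 8.0, "job_satisfaction": 4.0},
]
test = [{"degree": "MBA", "sleep_hours": None, "job_satisfaction": None}]

sleep_median = fit_median(train, "sleep_hours")  # learned on train only
train_features = transform(train, sleep_median)
test_features = transform(test, sleep_median)
```

Tree models like lgb and catboost can usually take the missing job satisfaction values as they are, so the flag is optional there.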
My attempt with lgb and catboost got even worse when I decided to encode cities by whether they are metropolitan or not
I made progress today, acc is 94.1