#playground-series-s4e12
Hi Kerven, check out previous regression competition solutions https://www.kaggle.com/competitions/playground-series-s4e9/leaderboard
Episode 11 was much more active on Discord, wasn't it?
It seems like people like Classification challenges more than regression ones :D
Based on the features or any other source, can someone please explain what type of insurance the prediction column refers to? Is it health insurance, car insurance, life insurance, or something else?
Please check out this discussion:
Hey guys!
I've submitted a (very basic) entry for this competition based mainly on code I've copy/pasted from various Kaggle tutorials - specifically, Intro to Machine Learning, Pandas, Intermediate Machine Learning and Feature Engineering. But I've realised it's treating the Policy Start Date like a categorical and dropping it because its cardinality is too high. What do I need to do to treat it as a datetime instead?
I was wondering the same
Sir, there is a discussion about that topic. Please note that Policy Start Date is a datetime feature, not an ordinal category.
https://www.kaggle.com/competitions/playground-series-s4e12/discussion/549336
Hi! Is there a way to get information about the winner's model once the competition is finished? I think it would be really interesting. Thanks!
thank you
Hi again, I'm struggling with my score of ~1.0455. I've been testing different ways to process my data for a few days, but I haven't really managed to improve my performance.
Do any of you have some tips, please?
I handle missing values without imputing them, as Chris did, I'm using LGBM (after trying XGB and HistGB), and I took a look at the correlation between my features and did some data viz, but I'm stuck at this point.
I'm a beginner, so I suppose there are some tips I don't know about that would really improve my score (not just improve it by 0.001 ^^).
I unsuccessfully tried to run a TabularModel, got some issues with datetime dtype.
Thanks a lot !
pd.to_datetime(train['Policy Start Date'])
hello i am a beginner trying to do this https://www.kaggle.com/competitions/playground-series-s4e12
i approached this problem with filling in the missing values depending on their distribution
uniform - random values in the range
skewed- median
normal- mean
i would then encode the categorical columns depending on whether they are nominal or ordinal
ordinal_categories = ['Education Level','Customer Feedback','Exercise Frequency']
and did mappings such as {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
while using pd.get_dummies for ['Smoking Status', 'Gender', 'Marital Status', 'Occupation', 'Policy Type', 'Property Type', 'Location']
then for certain columns that had a multimodal distribution, like age and credit score, i used an imputer
then i scaled it, but when using multiple models, like linear regression and random forest, my r^2 is always near 0 and i don't know why
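The encoding steps described above can be sketched roughly like this. The tiny DataFrame is made up; the column names and the education mapping are the ones from the messages, and only two of the nominal columns are shown to keep it short.

```python
import pandas as pd

# Made-up rows using column names from the competition data.
df = pd.DataFrame({
    "Education Level": ["High School", "Bachelor's", "PhD", "Master's"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Smoking Status": ["Yes", "No", "No", "Yes"],
})

# Ordinal column: map each level to its rank. Double quotes avoid
# clashing with the apostrophes in "Bachelor's" and "Master's".
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["Education Level"] = df["Education Level"].map(education_order)

# Nominal columns: one indicator column per category.
encoded = pd.get_dummies(df, columns=["Gender", "Smoking Status"])
print(encoded.columns.tolist())
```

Note that `.map` leaves NaN for any value that is not in the dictionary, so missing or misspelled levels stay missing after this step.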
i still couldn't get a model to work with policy start date as a raw feature - can xgboost not take datetimes as features? but that link manusjara sent me inspired me to feature engineer features on specific parts of the datetime. i created 4 features, for year, month, day of week and hour, and that took me from #1520 with a score of 1.14069 to #1405 with a score of 1.13527.
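A minimal sketch of that kind of date-part feature engineering, assuming the column parses cleanly with `pd.to_datetime`. The two timestamps and the `start_*` column names are made up for illustration.

```python
import pandas as pd

# Made-up timestamps in a column named like the competition's.
train = pd.DataFrame({
    "Policy Start Date": ["2023-12-23 15:21:39", "2020-01-05 08:00:00"],
})

# Parse the strings, then pull out the numeric parts.
train["Policy Start Date"] = pd.to_datetime(train["Policy Start Date"])
dt = train["Policy Start Date"].dt
train["start_year"] = dt.year
train["start_month"] = dt.month
train["start_dayofweek"] = dt.dayofweek  # Monday=0 ... Sunday=6
train["start_hour"] = dt.hour

# Drop the raw datetime so only numeric columns go to the model.
features = train.drop(columns=["Policy Start Date"])
print(features)
```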
still not happy with that score and position.
i probably need to do hyperparameter tuning and maybe feature pruning.
that discussion thing also mentioned transforming the target variable.
it also mentioned looking at the original data? i don't think it was mentioned in the competition description where this data came from? feels dirty but if other people are doing it 🤷
It’s skewed, so you can log-transform the target variable.
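A small numeric sketch of why logging the target fits this competition: with `log1p`, plain RMSE measured in log space is the same number as RMSLE on the original scale. The premium values below are made up, and `rmse` and `rmsle` are hypothetical helper names written out by hand.

```python
import numpy as np

y_true = np.array([100.0, 250.0, 1200.0, 4000.0])  # made-up premiums
y_pred = np.array([120.0, 200.0, 1500.0, 3500.0])  # made-up predictions

def rmse(a, b):
    # Root mean squared error between two arrays.
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle(a, b):
    # Root mean squared log error, using log(1 + x).
    return rmse(np.log1p(a), np.log1p(b))

# Train on log1p(y): the model then predicts in log space.
y_log = np.log1p(y_true)
pred_log = np.log1p(y_pred)  # stand-in for model output in log space

# RMSE in log space equals RMSLE on the original scale.
print(rmse(y_log, pred_log), rmsle(y_true, y_pred))

# Undo the transform before writing a submission.
y_back = np.expm1(pred_log)
```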
i've been using mean absolute error, is it better to use mean squared error (after logging the target variable) since that's what the competition is scored on?
i see sklearn.metrics has both mean_squared_error and root_mean_squared_error but does it make any difference which one you use, is one just the root of the other?
Hi everyone I'm new here,
I'm also trying this competition, but I found that the features are not correlated at all, and some features have many outliers.
I'm currently in position 988 with a score of 1.13978. What can I do to move forward? I tried transforming the data using Box-Cox but got an even worse model.
I don't know, maybe it's my method of imputation that is faulty.
Hi,
The dataset in this competition is synthetic, which suggests that we should not clean it the way we clean data in the real world.
Because in this case "outliers" are genuinely helpful in predicting the target, as the actual Premium Amount values in the test set are also derived from the exact same feature values that are provided to us.
Therefore, removing outliers leads to building a bad model in this case.
By the way, you can try feature engineering as well, some say it's not of great use here but might help capture more patterns.
They're the same, except that the second one returns the root of the MSE. With the addition of root_mean_squared_error in scikit-learn, the squared=False option of mean_squared_error is deprecated and will be removed in future versions.
There's a similar discussion in the competition by a person with the same issue. The gist of the replies is that you don't need to worry about the R² score; our objective is to minimise RMSLE.
yea i saw
well
i figured it after searching for awhile
it was weird
i was like whys there no correlation at ALL
its my first real competition and ive really only taken hs stats and intro stats in college since im a freshman so it was new
I see, perhaps I can present you a hypothesis:
Context
Playground competitions fork a dataset from kaggle itself and use its synthetic version to avoid misuse of original dataset approaches.
What's happening in this competition
The "original" dataset itself is synthetic, so I would call the competition dataset a second layer of synthesis.
And that may be the cause of the weird stuff showing up during EDA.
I see
You're welcome
i got it to decently work not too bad but for awhile i was using r^2 as a measure of my progress when i shouldn't have
Ah, I would say I did even worse in a competition..
Objective: cohen kappa metric maximization
My objective: RMSE minimisation
Haha, but we learn as we progress. All the best for future competitions!
thanks
ImportError: cannot import name 'root_mean_squared_error' from 'sklearn.metrics' (/opt/conda/lib/python3.10/site-packages/sklearn/metrics/__init__.py)
guh? did i do something stupid. I just changed a line
from `from sklearn.metrics import mean_squared_error` to `from sklearn.metrics import root_mean_squared_error` and it threw an error. what's wrong with that?
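One likely cause of that ImportError is an older scikit-learn in the notebook image: `root_mean_squared_error` was only added in version 1.4. A hedged sketch of a fallback that works on both old and new versions, using made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4 and later.
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: RMSE is the square root of MSE.
    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mse, rmse)
```

Upgrading the package in the notebook (for example with `pip install -U scikit-learn`) is the other option, if the environment allows it.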
do you need a beefy machine to use optuna? i found someone's notebook with a tutorial on how to use it and they said they could run 100 trials in under a minute but i tried adapting their code for this competition and it's taken like 15 minutes to run the first trial.
and given the competition ends in less than 2 hours i'm not going to get this done in time lmao
```
import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBRegressor


def objective(trial, X, y, cv, scoring):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 5000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.5, 0.9, step=0.1),
        # max_features, n_iter_no_change and validation_fraction are
        # scikit-learn GradientBoostingRegressor parameters;
        # XGBRegressor does not use them.
        "max_features": trial.suggest_categorical(
            "max_features", ["auto", "sqrt", "log2"]
        ),
        "random_state": 0,
        "n_iter_no_change": 50,  # early stopping
        "validation_fraction": 0.05,
    }
    # Perform CV
    xgb__reg = XGBRegressor(**params)
    scores = cross_validate(xgb__reg, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    # Compute RMSE from the negated MSE scores
    rmse = np.sqrt(-scores["test_score"].mean())
    return rmse
```
```
%%time
import optuna
from sklearn.model_selection import KFold

# Create a study that minimizes the objective
study = optuna.create_study(direction="minimize")
# 5-fold CV splitter shared by all trials
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Wrap the objective in a lambda to pass the additional arguments
func = lambda trial: objective(trial, X_train, y, cv=kf, scoring="neg_mean_squared_error")
# Start optimizing with 100 trials
study.optimize(func, n_trials=100)
```
i'm probably doing the wrong hyperparameters too given i'm doing xgboost and they were doing... something different
also optuna can take RMSLE, should i be using that instead of RMSE
You needed to upgrade scikit-learn to version 1.5.2 to use RMSLE (and RMSE). In the default version, only MSE is available.
The speed of a trial depends on the complexity of the model and the size of the dataset, which in our case is huge. We could use GPUs instead of the CPU to speed up the process, though.
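Without a GPU, one cheap way to act on this is to time a single cross-validated fit on a row subsample before launching a full study, so you know roughly what one trial costs. The sketch below uses made-up data and scikit-learn's GradientBoostingRegressor as a stand-in model; the 10% sample size, 3 folds and `n_estimators=20` are arbitrary assumptions, not recommended settings.

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Made-up data standing in for the (much larger) competition set.
X = rng.normal(size=(20_000, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=20_000)

# Work on a 10% row subsample to keep each trial cheap.
idx = rng.choice(len(X), size=2_000, replace=False)
X_small, y_small = X[idx], y[idx]

cv = KFold(n_splits=3, shuffle=True, random_state=0)
model = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0)

# Time one cross-validated evaluation, which is what one trial amounts to.
start = time.perf_counter()
scores = cross_val_score(
    model, X_small, y_small, cv=cv, scoring="neg_root_mean_squared_error"
)
elapsed = time.perf_counter() - start

print(f"one trial on 10% of rows: {elapsed:.2f}s, RMSE={-scores.mean():.3f}")
```

Each trial refits the model once per fold, so cost grows with the number of folds, the number of rows and the number of trees; a search range that reaches 5000 estimators can make individual trials very slow.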
thank you!
Hey guys, what were the approaches that you took for this competition?