#playground-series-s4e12 | Kaggle | Page 1

small egret Dec 3, 2024, 12:32 PM

#

Hello

#

I'm completely new at kaggle, any initial suggestions?

carmine iris Dec 8, 2024, 10:20 PM

#

Hi Kerven, check out previous regression competition solutions https://www.kaggle.com/competitions/playground-series-s4e9/leaderboard

Regression of Used Car Prices

Playground Series - Season 4, Episode 9

lavish oriole Dec 10, 2024, 12:41 PM

#

fringe glen Dec 16, 2024, 8:45 PM

#

Episode 11 was way more active on Discord, wasn't it?

It seems like people like Classification challenges more than regression ones :D

minor dagger Dec 18, 2024, 3:42 AM

#

Based on the features or any other source, can someone please explain what type of insurance the insurance prediction column refers to? Is it health insurance, car insurance, life insurance, or something else?

fringe glen Dec 18, 2024, 9:39 AM

#

minor dagger Based on the features or any source, can someone please specify or explain what ...

Please check out this discussion:

"What is the insurance for??"

Regression with an Insurance Dataset

Playground Series - Season 4, Episode 12

glacial crow Dec 22, 2024, 3:47 AM

#

Hey guys!

open peak Dec 23, 2024, 10:19 PM

#

I've submitted a (very basic) entry for this competition based mainly on code I've copy/pasted from various Kaggle tutorials: specifically, Intro to Machine Learning, Pandas, Intermediate Machine Learning and Feature Engineering. But I've realised it's treating the Policy Start Date like a categorical feature and dropping it because its cardinality is too high. What do I need to do to treat it as a datetime instead?

glacial crow Dec 24, 2024, 3:11 AM

#

fringe glen episode 11 was way much active on discord, isn't it? It seems like people like ...

I was wondering the same

glad cape Dec 27, 2024, 1:47 AM

#

open peak I've submitted a (very basic) entry for this competition based mainly on code i'...

Sir, there is a discussion about that topic. Please note that Policy Start Date is a datetime feature, not an ordinal category:
https://www.kaggle.com/competitions/playground-series-s4e12/discussion/549336


dusk aspen Dec 28, 2024, 1:16 PM

#

Hi! Will we be able to get information about the winner's model once the competition is finished? I think it would be really interesting. Thanks!

open peak Dec 28, 2024, 6:14 PM

#

glad cape Sr.... there is a discussion about that topic.. please notice that Policy Start ...

thank you

dusk aspen Dec 28, 2024, 7:51 PM

#

Hi again, I'm stuck at a score of ~1.0455. I've spent a few days testing different ways to process my data, but I can't really manage to improve my performance.
Do any of you have tips, please?

I handle missing values without imputing them, as Chris did. I'm using LGBM (after trying XGB and HistGB), and I looked at the correlations between my features and did some data viz, but I'm stuck at this point.
I'm a beginner, so I suppose there are tricks I don't know about that would really improve my score (not just by 0.001 ^^).

I unsuccessfully tried to run a TabularModel, got some issues with datetime dtype.

Thanks a lot !

old hearth Dec 28, 2024, 9:54 PM

#

open peak I've submitted a (very basic) entry for this competition based mainly on code i'...

pd.to_datetime(train['Policy Start Date'])

hidden badge Dec 29, 2024, 4:34 AM

#

hello i am a beginner trying to do this https://www.kaggle.com/competitions/playground-series-s4e12

i approached this problem with filling in the missing values depending on their distribution
uniform - random values in the range
skewed- median
normal- mean

i would then encode the categoricals as nominal vs ordinal
ordinal_categories = ['Education Level', 'Customer Feedback', 'Exercise Frequency']
and did mappings such as {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
while using pd.get_dummies for ['Smoking Status', 'Gender', 'Marital Status', 'Occupation', 'Policy Type', 'Property Type', 'Location']

then for certain columns that had a multimodal distribution, like age and credit score, i used an imputer

then i scaled it, but when using multiple models, like linear regression and random forest, my r^2 is always near 0 and i don't know why
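A minimal sketch of that ordinal-vs-nominal encoding step. The column names come from the message above, but the rows, the variable names and the choice of `dtype=int` are made up for illustration:

```python
import pandas as pd

# Toy rows standing in for the competition data (values are invented).
df = pd.DataFrame({
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Smoking Status": ["Yes", "No", "No", "Yes"],
})

# Ordinal feature: map each level to its rank so the order is preserved.
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["Education Level"] = df["Education Level"].map(education_order)

# Nominal features: one-hot encode, since there is no natural order.
nominal_cols = ["Gender", "Smoking Status"]
encoded = pd.get_dummies(df, columns=nominal_cols, dtype=int)

print(encoded)
```

Double quotes around the dictionary keys avoid the clash with the apostrophes in "Bachelor's" and "Master's".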


open peak Dec 29, 2024, 9:35 AM

#

i still couldn't get a model to work with policy start date as a raw feature - can xgboost not take datetimes as features? but that link manusjara sent me inspired me to feature engineer features on specific parts of the datetime. i created 4 features, for year, month, day of week and hour, and that took me from #1520 with a score of 1.14069 to #1405 with a score of 1.13527.

still not happy with that score and position.

i probably need to do hyperparameter tuning and maybe feature pruning.

that discussion thing also mentioned transforming the target variable.

it also mentioned looking at the original data? i don't think it was mentioned in the competition description where this data came from? feels dirty but if other people are doing it 🤷
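A rough sketch of the year / month / day-of-week / hour features described above, assuming the column holds timestamp strings. The two example timestamps and the new column names are invented for illustration:

```python
import pandas as pd

# Two made-up rows; the real column is assumed to hold timestamp strings.
train = pd.DataFrame({
    "Policy Start Date": [
        "2023-12-23 15:21:39.134960",
        "2022-06-12 15:21:39.111551",
    ],
})

# Parse the strings into a proper datetime64 column.
start = pd.to_datetime(train["Policy Start Date"])

# Split the timestamp into plain integer features a tree model can use.
train["start_year"] = start.dt.year
train["start_month"] = start.dt.month
train["start_dayofweek"] = start.dt.dayofweek  # Monday=0 ... Sunday=6
train["start_hour"] = start.dt.hour

# Drop the raw column so only numeric features reach the model.
features = train.drop(columns=["Policy Start Date"])
print(features)
```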

hidden badge Dec 29, 2024, 9:40 AM

#

open peak i still couldn't get a model to work with policy start date as a raw feature - c...

It's skewed, so you can log-transform the target variable
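A small sketch of what log-transforming the target can look like, and why it lines up with an RMSLE metric. The data is synthetic, and `LinearRegression` is just a stand-in model chosen for illustration, not something anyone in the thread used:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, right-skewed target (invented data, not the competition set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, -0.3, 0.2]) + 6.0)

# Train on log1p(y) instead of y.
y_log = np.log1p(y)
model = LinearRegression().fit(X, y_log)

# Predictions come out on the log scale; expm1 maps them back.
pred_log = model.predict(X)
pred = np.expm1(pred_log)

# RMSE measured on the log scale equals RMSLE on the original scale.
rmse_on_log = np.sqrt(np.mean((pred_log - y_log) ** 2))
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y)) ** 2))
print(rmse_on_log, rmsle)
```

So a model that minimises squared error on `log1p(y)` is effectively being trained against RMSLE.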

open peak Dec 29, 2024, 11:30 AM

#

i've been using mean absolute error, is it better to use mean squared error (after logging the target variable) since that's what the competition is scored on?

#

i see sklearn.metrics has both mean_squared_error and root_mean_squared_error but does it make any difference which one you use, is one just the root of the other?

warm plank Dec 29, 2024, 12:25 PM

#

Hi everyone I'm new here,
Also trying this competition, but I found that the features are not correlated at all, and some features have many outliers.

I'm currently in position 988 with a score of 1.13978. What can I do to move forward? I tried transforming the data using Box-Cox but got an even worse model.

I don't know, maybe it's my imputation method that is faulty.

fringe glen Dec 30, 2024, 6:43 AM

#

warm plank Hi everyone I'm new here, Also trying this competition, but found out that the ...

Hi,

The dataset in this competition is synthetic, which suggests that we should not clean it the way we clean data in the real world.

In this case the "outliers" are genuinely helpful for predicting the target, because the actual Premium Amount values in the test set are derived from the very same feature values that are provided to us.

Therefore, removing outliers leads to a worse model in this case.

By the way, you can try feature engineering as well. Some say it's not of great use here, but it might help capture more patterns.

fringe glen Dec 30, 2024, 6:44 AM

#

open peak i see sklearn.metrics has both mean_squared_error and root_mean_squared_error bu...

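they're the same thing, except that root_mean_squared_error returns the square root of the MSE. root_mean_squared_error was added in scikit-learn 1.4; mean_squared_error itself is staying, it's only its squared=False option that is deprecated and will be removed in future versions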
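A tiny numeric check of that point, using made-up numbers and two hypothetical sets of predictions. Because the square root is monotonic, ranking models by MSE or by RMSE picks the same winner:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])

# Predictions from two hypothetical models (numbers invented).
preds = {
    "model_a": np.array([2.0, 5.0, 4.0, 8.0]),
    "model_b": np.array([3.5, 4.0, 2.5, 9.0]),
}

# MSE is the mean of squared errors; RMSE is just its square root.
mse = {name: float(np.mean((p - y_true) ** 2)) for name, p in preds.items()}
rmse = {name: float(np.sqrt(v)) for name, v in mse.items()}

best_by_mse = min(mse, key=mse.get)
best_by_rmse = min(rmse, key=rmse.get)
print(mse, rmse, best_by_mse, best_by_rmse)
```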

fringe glen Dec 30, 2024, 6:46 AM

#

hidden badge hello i am a beginner trying to do this https://www.kaggle.com/competitions/play...

There's a similar discussion in the competition from someone with the same issue. The gist of the replies is that you don't need to worry about the R² score; our objective is to minimise RMSLE.

hidden badge Dec 30, 2024, 6:46 AM

#

fringe glen There's a similar discussion in the competition by a person concerning the same ...

yea i saw

#

well

#

i figured it after searching for awhile

#

it was weird

#

i was like whys there no correlation at ALL

#

its my first real competition and ive really only taken hs stats and intro stats in college since im a freshman so it was new

fringe glen Dec 30, 2024, 6:49 AM

#

hidden badge i was like whys there no correlation at ALL

I see, perhaps I can present you a hypothesis:

Context

Playground competitions take a dataset from Kaggle itself and use a synthetic version of it, to avoid misuse of approaches built for the original dataset.

What's happening in this competition

The "original" dataset itself is synthetic, so I would call the competition dataset the second layer of synthesis.

And that may be the cause of the weird stuff showing up during EDA.

fringe glen Dec 30, 2024, 6:50 AM

#

hidden badge its my first real competition and ive really only taken hs stats and intro stats...

I see

hidden badge Dec 30, 2024, 6:50 AM

#

i see

#

thanks for the help

fringe glen Dec 30, 2024, 6:51 AM

#

You're welcome

hidden badge Dec 30, 2024, 6:51 AM

#

i got it to work decently, not too bad, but for a while i was using r^2 as a measure of my progress when i shouldn't have

fringe glen Dec 30, 2024, 6:52 AM

#

hidden badge i got it to decently work not too bad but for awhile i was using r^2 as a measur...

Ah, I would say I did even worse in a competition..

Objective: cohen kappa metric maximization

My objective: RMSE minimisation

Haha, but we learn as we progress. All the best for future competitions!

open peak Dec 30, 2024, 11:27 PM

#

fringe glen they're not different except the other provides root of MSE. with addition of RM...

thanks

open peak Dec 31, 2024, 2:59 PM

#

fringe glen they're not different except the other provides root of MSE. with addition of RM...

ImportError: cannot import name 'root_mean_squared_error' from 'sklearn.metrics' (/opt/conda/lib/python3.10/site-packages/sklearn/metrics/__init__.py)

guh? did i do something stupid. I just changed a line

`from sklearn.metrics import mean_squared_error` to `from sklearn.metrics import root_mean_squared_error` and it threw an error. what's wrong with that?

open peak Dec 31, 2024, 10:22 PM

#

do you need a beefy machine to use optuna? i found someone's notebook with a tutorial on how to use it and they said they could run 100 trials in under a minute but i tried adapting their code for this competition and it's taken like 15 minutes to run the first trial.

and given the competition ends in less than 2 hours i'm not going to get this done in time lmao

#

```
import numpy as np
import optuna
from sklearn.model_selection import KFold, cross_validate
from xgboost import XGBRegressor


def objective(trial, X, y, cv, scoring):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 5000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.5, 0.9, step=0.1),
        # The next three names come from scikit-learn's GradientBoostingRegressor;
        # they are not XGBoost parameters.
        "max_features": trial.suggest_categorical(
            "max_features", ["auto", "sqrt", "log2"]
        ),
        "n_iter_no_change": 50,  # early stopping
        "validation_fraction": 0.05,
        "random_state": 0,
    }
    # Perform CV
    xgb_reg = XGBRegressor(**params)
    scores = cross_validate(xgb_reg, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    # Compute RMSE (the scores are negated MSE, so flip the sign first)
    rmse = np.sqrt(-scores["test_score"].mean())

    return rmse
```

```
%%time

# Create study that minimizes
study = optuna.create_study(direction="minimize")

# Wrap the objective inside a lambda with the relevant arguments
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Pass additional arguments inside another function
func = lambda trial: objective(trial, X_train, y, cv=kf, scoring="neg_mean_squared_error")

# Start optimizing with 100 trials
study.optimize(func, n_trials=100)
```

#

i'm probably doing the wrong hyperparameters too given i'm doing xgboost and they were doing... something different
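For reference, a hedged sketch of what a search space using XGBoost's own parameter names could look like. The function name `xgb_search_space` and every range below are assumptions picked for illustration, not tuned values and not taken from the tutorial:

```python
def xgb_search_space(trial):
    """Sample XGBRegressor keyword arguments from an Optuna trial.

    `trial` is expected to behave like optuna.trial.Trial; all ranges
    here are illustrative guesses.
    """
    return {
        # Fewer, faster trees than a 100-5000 range with tiny learning rates.
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        # Row subsampling per tree.
        "subsample": trial.suggest_float("subsample", 0.5, 0.9, step=0.1),
        # Column subsampling per tree, XGBoost's rough analogue of max_features.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        # Histogram-based tree construction, usually the fast option on big data.
        "tree_method": "hist",
        "random_state": 0,
    }
```

Inside an objective function this would be used as `XGBRegressor(**xgb_search_space(trial))`.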

#

also optuna can take RMSLE, should i be using that instead of RMSE

fringe glen Jan 1, 2025, 1:37 AM

#

open peak ```ImportError: cannot import name 'root_mean_squared_error' from 'sklearn.metri...

you need scikit-learn 1.4 or later (for example 1.5.2) to use root_mean_squared_error and root_mean_squared_log_error. The default version in the notebook only has mean_squared_error
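If upgrading isn't an option, one possible workaround is a fallback import, sketched below. The fallback function is hand-written for illustration and only covers the simple single-output case:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in newer scikit-learn releases.
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Older releases: take the square root of the MSE ourselves.
    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Same call works on either version.
score = root_mean_squared_error([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
print(score)
```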

fringe glen Jan 1, 2025, 1:39 AM

#

open peak do you need a beefy machine to use optuna? i found someone's notebook with a tut...

The speed of a trial depends on the complexity of the model and the size of the dataset, which in our case is huge. We could use GPUs instead of CPUs to speed up the process, though
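Another option, not mentioned above and offered only as an assumption, is to run the hyperparameter search on a random subset of rows and refit the best configuration on the full data afterwards. A sketch with a hypothetical helper `tuning_subset` and made-up data:

```python
import pandas as pd


def tuning_subset(X, y, n_rows, seed=0):
    """Return a random subset of rows from X with the matching rows of y.

    X is a DataFrame and y a Series sharing the same index.
    """
    idx = X.sample(n=min(n_rows, len(X)), random_state=seed).index
    return X.loc[idx], y.loc[idx]


# Made-up data: 1000 rows, tune on 100 of them.
X_all = pd.DataFrame({"a": range(1000), "b": range(1000, 2000)})
y_all = pd.Series(range(1000), index=X_all.index)
X_small, y_small = tuning_subset(X_all, y_all, n_rows=100)
print(X_small.shape, y_small.shape)
```

The trade-off is that the best settings on a subset are not guaranteed to be the best on the full data.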

open peak Jan 1, 2025, 10:46 AM

#

thank you!

bleak creek Jan 1, 2025, 5:56 PM

#

Hey guys, what were the approaches that you took for this competition?