#playground-series-s4e12

small egret
#

Hello

#

I'm completely new to Kaggle, any initial suggestions?

carmine iris
lavish oriole
fringe glen
#

episode 11 was way more active on discord, wasn't it?

It seems like people like Classification challenges more than regression ones :D

minor dagger
#

Based on the features or any other source, can someone please explain what type of insurance the prediction target covers? Is it health insurance, car insurance, life insurance, or something else?

fringe glen
glacial crow
#

Hey guys!

open peak
#

I've submitted a (very basic) entry for this competition based mainly on code I've copy/pasted from various Kaggle tutorials - specifically, Intro to Machine Learning, Pandas, Intermediate Machine Learning and Feature Engineering. But I've realised it's treating the Policy Start Date as a categorical feature and dropping it because its cardinality is too high. What do I need to do to treat it as a datetime instead?

glad cape
dusk aspen
#

Hi! Will we be able to get information about the winner's model once the competition is finished? I think it would be really interesting. Thanks!

dusk aspen
#

Hi again, I'm stuck at a score of ~1.0455. I've spent a few days testing different ways to process my data, but I haven't really managed to improve my performance.
Does anyone have any tips, please?

I handle missing values without imputing them, as Chris did, I'm using LGBM (after trying XGB and HistGB), and I took a look at the correlation between my features and did some data viz, but I'm stuck at this point.
I'm a beginner, so I suppose there are tips I don't know about that would really improve my score (not just improve it by 0.001 ^^)

I tried to run a TabularModel but it failed with some issues around the datetime dtype.

Thanks a lot!

old hearth
hidden badge
#

hello i am a beginner trying to do this https://www.kaggle.com/competitions/playground-series-s4e12

i approached this problem by filling in the missing values depending on their distribution:
uniform - random values in the range
skewed - median
normal - mean
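
A minimal sketch of the imputation scheme described above, assuming pandas and NumPy. The column names and toy values are made up for illustration and are not taken from the competition data; which columns count as uniform, skewed or normal is a judgment call from the histograms.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame; column names and values are hypothetical.
df = pd.DataFrame({
    "uniform_col": [1.0, np.nan, 3.0, 4.0, np.nan],
    "skewed_col": [1.0, 2.0, np.nan, 2.0, 100.0],
    "normal_col": [9.0, 10.0, 11.0, np.nan, 10.0],
})

# Roughly uniform: draw random values within the observed range.
low, high = df["uniform_col"].min(), df["uniform_col"].max()
missing = df["uniform_col"].isna()
df.loc[missing, "uniform_col"] = rng.uniform(low, high, size=int(missing.sum()))

# Skewed: the median is not dragged around by the long tail.
df["skewed_col"] = df["skewed_col"].fillna(df["skewed_col"].median())

# Roughly normal: the mean is a reasonable centre.
df["normal_col"] = df["normal_col"].fillna(df["normal_col"].mean())
```

For a real train/test split, the fill statistics would normally be computed on the training rows only and then reused on the test rows.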

i then encoded the categorical columns depending on whether they're nominal or ordinal
ordinal_categories = ['Education Level', 'Customer Feedback', 'Exercise Frequency']
and did mappings such as {'High School': 0, "Bachelor's": 1, "Master's": 2, 'PhD': 3}
while using pd.get_dummies for ['Smoking Status', 'Gender', 'Marital Status', 'Occupation', 'Policy Type', 'Property Type', 'Location']
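
The ordinal vs nominal split above could look roughly like this in pandas. This is only a sketch: the two column names and the education mapping come from the message, while the toy rows (including the Gender values) are illustrative assumptions.

```python
import pandas as pd

# Toy frame; column names are from the message above, the rows are illustrative.
df = pd.DataFrame({
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Ordinal column: map each level to its rank so the order is preserved.
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["Education Level"] = df["Education Level"].map(education_order)

# Nominal column: one-hot encode, since the categories have no natural order.
df = pd.get_dummies(df, columns=["Gender"])
```

Note that `.map` turns any value missing from the dictionary into NaN, so it is worth checking for unexpected categories before mapping.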

then for certain columns that had a multimodal distribution, like age and credit score, i used an imputer

then i scaled everything, but when using multiple models, like linear regression and random forest, my r^2 is always near 0 and i don't know why

open peak
#

i still couldn't get a model to work with policy start date as a raw feature - can xgboost not take datetimes as features? but that link manusjara sent me inspired me to feature engineer features on specific parts of the datetime. i created 4 features, for year, month, day of week and hour, and that took me from #1520 with a score of 1.14069 to #1405 with a score of 1.13527.
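
For reference, the four datetime features described above can be derived with pandas' `.dt` accessor. This is a sketch under assumptions: the timestamp format of the toy rows and the new column names (`start_year` and so on) are made up here. Gradient-boosting libraries generally expect numeric or categorical columns, which is why the raw datetime column is dropped at the end.

```python
import pandas as pd

# Toy rows; the timestamp format is an assumption about how the column is stored.
df = pd.DataFrame({
    "Policy Start Date": [
        "2023-12-23 15:21:39.134960",
        "2022-03-05 08:00:00.000000",
    ],
})

# Parse the strings into real datetimes, then pull out numeric parts.
start = pd.to_datetime(df["Policy Start Date"])
df["start_year"] = start.dt.year
df["start_month"] = start.dt.month
df["start_dayofweek"] = start.dt.dayofweek  # Monday=0 ... Sunday=6
df["start_hour"] = start.dt.hour

# Drop the raw column so the model only sees numeric features.
df = df.drop(columns=["Policy Start Date"])
```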

still not happy with that score and position.

i probably need to do hyperparameter tuning and maybe feature pruning.

that discussion thing also mentioned transforming the target variable.

it also mentioned looking at the original data? i don't think it was mentioned in the competition description where this data came from? feels dirty but if other people are doing it 🤷

hidden badge
open peak
#

i've been using mean absolute error, is it better to use mean squared error (after logging the target variable) since that's what the competition is scored on?

#

i see sklearn.metrics has both mean_squared_error and root_mean_squared_error but does it make any difference which one you use, is one just the root of the other?
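
On the question above: RMSE is the square root of MSE, so the two always rank models in the same order, and RMSE is simply expressed in the same units as the target. A minimal sketch in plain Python (the helper names `mse` and `rmse` are made up here and are not the sklearn functions):

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 7.0]
pred_a = [2.0, 5.0, 9.0]  # squared errors 1, 0, 4
pred_b = [3.0, 4.0, 7.0]  # squared errors 0, 1, 0

# Whichever metric is used, pred_b comes out ahead of pred_a.
same_ranking = (mse(y_true, pred_a) > mse(y_true, pred_b)) == (
    rmse(y_true, pred_a) > rmse(y_true, pred_b)
)
```

Because the square root is monotonic, choosing one over the other for model selection does not change which model wins, only the scale of the reported number.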

warm plank
#

Hi everyone I'm new here,
I'm also trying this competition, but I found that the features don't correlate at all, and some features have many outliers.

I'm currently in position 988 with a score of 1.13978. What can I do to move forward? I tried transforming the data using Box-Cox but got an even worse model.

I don't know, maybe it's my method of imputation that is faulty.

fringe glen
# warm plank Hi everyone I'm new here, Also trying this competition, but found out that the ...

Hi,

The dataset in this competition is synthetic, which suggests that we should not clean it the way we would clean real-world data.

In this case the "outliers" are genuinely helpful in predicting the target, because the actual Premium Amount values in the test set are also derived from the exact same feature values that are provided to us.

Therefore, removing outliers leads to a worse model in this case.

By the way, you can try feature engineering as well. Some say it's not of great use here, but it might help capture more patterns.

fringe glen
fringe glen
hidden badge
#

well

#

i figured it out after searching for a while

#

it was weird

#

i was like whys there no correlation at ALL

#

its my first real competition and ive really only taken hs stats and intro stats in college since im a freshman so it was new

fringe glen
# hidden badge i was like whys there no correlation at ALL

I see, perhaps I can present you a hypothesis:

Context

Playground competitions take an existing dataset from Kaggle and use a synthetic version of it, so that solutions built for the original dataset can't simply be reused.

What's happening in this competition

The "original" dataset is itself synthetic... so I would call the competition dataset a second layer of synthesis.

And that may be the cause of the weird stuff showing up during EDA.

hidden badge
#

i see

#

thanks for the help

fringe glen
#

You're welcome

hidden badge
#

i got it working decently, but for a while i was using r^2 as a measure of my progress when i shouldn't have

fringe glen
open peak
open peak
#

do you need a beefy machine to use optuna? i found someone's notebook with a tutorial on how to use it and they said they could run 100 trials in under a minute but i tried adapting their code for this competition and it's taken like 15 minutes to run the first trial.

and given the competition ends in less than 2 hours i'm not going to get this done in time lmao

#
```python
import numpy as np
import optuna
from sklearn.model_selection import KFold, cross_validate
from xgboost import XGBRegressor


def objective(trial, X, y, cv, scoring):
    # NOTE: max_features, n_iter_no_change and validation_fraction are
    # scikit-learn GradientBoostingRegressor parameters, not XGBoost ones.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 5000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.5, 0.9, step=0.1),
        "max_features": trial.suggest_categorical(
            "max_features", ["auto", "sqrt", "log2"]
        ),
        "random_state": 0,
        "n_iter_no_change": 50,  # early stopping
        "validation_fraction": 0.05,
    }
    # Perform CV: every trial fits one model per fold
    xgb__reg = XGBRegressor(**params)
    scores = cross_validate(xgb__reg, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    # Compute RMSE from the negated MSE scores that sklearn returns
    rmse = np.sqrt(-scores["test_score"].mean())

    return rmse
```

```python
%%time

# Create study that minimizes
study = optuna.create_study(direction="minimize")

# Wrap the objective inside a lambda with the relevant arguments
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Pass additional arguments inside another function
func = lambda trial: objective(trial, X_train, y, cv=kf, scoring="neg_mean_squared_error")

# Start optimizing with 100 trials
study.optimize(func, n_trials=100)
```
#

i'm probably doing the wrong hyperparameters too given i'm doing xgboost and they were doing... something different

#

also optuna can take RMSLE, should i be using that instead of RMSE
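
On RMSLE vs RMSE: RMSLE is just RMSE computed on `log1p` of the values, so if the leaderboard metric is RMSLE (as the earlier message about logging the target suggests), training on `log1p(y)` and minimising plain RMSE targets the same quantity. A small sketch in plain Python with made-up numbers; the helper names `rmse` and `rmsle` are hypothetical, not library functions:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )


# Made-up premium amounts and predictions.
y_true = [100.0, 1000.0, 50.0]
y_pred = [120.0, 900.0, 60.0]

# RMSE in log space is the same number as RMSLE in the original space.
log_true = [math.log1p(v) for v in y_true]
log_pred = [math.log1p(v) for v in y_pred]
rmse_in_log_space = rmse(log_true, log_pred)
rmsle_in_raw_space = rmsle(y_true, y_pred)

# Predictions made in log space are mapped back with expm1 before submitting.
restored = [math.expm1(v) for v in log_pred]
```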

fringe glen
fringe glen
open peak
#

thank you!

bleak creek
#

Hey guys, what were the approaches that you took for this competition?