#🚢┊titanic

red ether
#

Hey everyone, so I'm working on the Titanic dataset and trying to impute the missing values in the Age column, but I can't wrap my head around how to do it. Could someone explain or give any advice on this?
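
One common way to approach the imputation asked about above is to fill each missing Age with the median of similar passengers. The sketch below assumes grouping by Sex and Pclass, which is a choice made here for illustration and not something from the thread, and it uses a tiny made-up frame in place of train.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for train.csv (values are illustrative only).
df = pd.DataFrame({
    "Sex":    ["male", "male", "female", "female", "male", "female"],
    "Pclass": [3, 3, 1, 1, 3, 1],
    "Age":    [22.0, np.nan, 38.0, np.nan, 30.0, 26.0],
})

# Keep a flag first: whether Age was missing can itself carry signal.
df["AgeMissing"] = df["Age"].isna().astype(int)

# Fill each gap with the median Age of passengers sharing Sex and Pclass.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# Fall back to the overall median for groups with no known Age at all.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

On the real data, the medians would be computed on the training set and reused for test.csv, so that nothing from the test set influences the fill values.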

plush vortex
#

Are we allowed to share a GitHub or Streamlit Cloud URL to answer a question?

compact mesa
#

Hey everyone!
I’ve just published a new Kaggle notebook:
"Titanic Survival Prediction using Machine Learning"
I used various ML models and feature engineering to predict passenger survival, and got a solid score on the leaderboard!

Would love it if you could check it out and drop an upvote if you find it helpful! 🙌

🔗 https://www.kaggle.com/code/mrmelvin/titanic-survival-prediction-using-machine-learning

Thanks a ton for the support! 💙

outer rover
#

Hello guys, I'm just getting started w/ the Titanic. I have the basics of ML and I especially know about the transformer architecture, however I read that it maybe isn't the best fit for this challenge. Should I stick w/ a random forest, or do you think a transformer could be a good fit?

outer rover
#

ok thks

somber dragon
#

Hey guys! I’m pretty new to Kaggle competitions and currently working on the Titanic dataset. I’ve got a few things I’m confused about and hoping someone can help:
1- Preprocessing Test Data
In my train data, I drop useless columns (like Name, Ticket, Cabin), fill missing values, and use get_dummies to encode Sex and Embarked. Now when working with the test data, do I need to apply exactly the same steps? Like same encoding and all that? Does the model expect train and test to have exactly the same columns after preprocessing?
2- Using Target Column During Training
Another thing — when training the model, should the Survived column be included in the features?
What I’m doing now is:
Dropping Survived from the input features
Using it as the target (y)
Is that the correct way, or should the model actually see the target during training somehow? I feel like this is obvious but I’m doubting myself.
3- How Does Kaggle Submission Work?
Once I finish training the model, should I:
Run predictions locally on test.csv and upload the results (as submission.csv)? OR
Just submit my code and Kaggle will automatically run it on their test set?
I’m confused whether I’m supposed to generate predictions locally or if Kaggle runs my notebook/code for me after submission.

minor heart
# somber dragon Hey guys! I’m pretty new to Kaggle competitions and currently working on the Tit...

Hey Destiny, I got you!

  1. So for your preprocessing, I recommend putting all the cleaning steps you went through for the training set into a function. And then yes, you'll apply all of that again on the testing set. Just note that the testing set does not have the target column ('Survived'); everything else will be the same (though check to make sure there aren't any new categories in the testing set that your cleaning does not account for).
  2. No! You should separate the target column into a "y" variable, and the rest of the columns go in an "X" variable. So yes, you are doing it the correct way. You do want to keep the target around so you have a way to check your predictions, though.
  3. For your predictions, you'll submit a CSV file with 2 columns (PassengerId and Survived); the competition page details the exact format. When you write it out, make sure you pass index=False. [e.g. results.to_csv('Titanic_Predictions_Random_Forest.csv', index=False)]
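
The three points above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the `preprocess` helper, the `age_fill` parameter, and the tiny made-up frames are assumptions for the example, and the Random Forest settings are arbitrary.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

NAN = float("nan")

# Made-up rows in the train.csv / test.csv layout (illustration only).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, NAN, 35.0],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "Q", "S"],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Name": ["E", "F"],
    "Sex": ["male", "female"],
    "Age": [34.5, NAN],
    "Ticket": ["t5", "t6"],
    "Cabin": [None, None],
    "Embarked": ["Q", "S"],
})

def preprocess(df, age_fill):
    """Hypothetical helper: the same cleaning steps for train and test."""
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    out["Age"] = out["Age"].fillna(age_fill)
    return pd.get_dummies(out, columns=["Sex", "Embarked"])

# Statistics used for filling come from the training set only.
age_fill = train["Age"].median()

# Point 2: the target goes into y, everything else into X.
y = train["Survived"]
X = preprocess(train.drop(columns=["Survived"]), age_fill)

# Point 1: same steps on test, then align columns with train.
# Categories absent from test (here Embarked_C) become all-zero columns.
X_test = preprocess(test, age_fill).reindex(columns=X.columns, fill_value=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Point 3: two columns, PassengerId and Survived, written without the index.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
print(submission.to_csv(index=False))
```

Calling `to_csv` without a path returns the CSV text, which is handy for checking the format; passing a filename such as 'submission.csv' writes the file instead.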
#

Anyways, I just did the titanic dataset myself. I used Decision Trees and Random Forest. Ended up getting 0.75 and 0.76 respectively. Though I only used n_estimators and depth as hyperparameters for RF, so I'll go through it again with more parameters. Curious to see which models perform the best for this classification assignment.

somber dragon
#

now i am facing this problem :

#

In my notebook, I save the submission file like this:

# Option 1
submission = test[['PassengerId', 'Survived']]
submission.to_csv('submission.csv', index=False)

# Option 2 (also tried)
submission = test[['PassengerId', 'Survived']]
submission.to_csv('/kaggle/working/submission.csv', index=False)

I double-checked, the file looks like this:

   PassengerId  Survived
0          892         0
1          893         0
...
(418, 2)
PassengerId    0
Survived       0
dtype: int64

It appears correctly in the output folder in Kaggle after running, but when I submit the notebook, I still get: "Submission CSV Not Found."

Any idea what could be wrong? Does Kaggle expect any specific step to detect it?

Thanks in advance!

#

And yeah, I found the reason now.
While submitting, Kaggle reruns all the cells from the start, so any error or typo will stop the submission, even though I could run the cell by myself without errors.

minor heart
opal carbon
#

Hello everyone! I'm here to share the code I used to make predictions and submit my solution for the Titanic competition.

I'm Brazilian, so I initially wrote the notebook in Portuguese. Depending on the interest or feedback, I’d be happy to create an English version to make it more accessible for everyone.

Using a model that combines Random Forest and SVM, I achieved an accuracy of 78.23%.
Here’s the link:

🔗 https://www.kaggle.com/code/marcomata/titanic-submission

quartz sonnet
opal carbon
quartz sonnet
#

In fact, a simple decision tree can achieve "good" results on the Titanic dataset. Simple rules with age + gender, etc.

oak spindle
#

I am trying to get started with projects, so I tried the Titanic dataset and I am getting 82% accuracy... I don't know if it's a good start or not?

Can anyone guide me here on what I did wrong?

bronze path
#

And can anybody guide me how to participate and what to do in titanic competitions as I am new to ai stuff and a beginner

hallow gust
lament badge
#

Kaggle is the best platform to practice ML. As a beginner you should go with a logistic regression model in the Titanic competition. I have 79 percent accuracy 🫡

quartz sonnet
oak spindle
#

@opaque lotus sir please delete this, it's scam

quaint copper
#

Hello! I'm new here as I started to learn ML quite recently, so be patient 😉

I've used 4 types of models: DT, KNN, MLP and XGBoost.

Using stratified k-fold with 5 splits and some hyperparameter optimization (just a basic grid search with not much exploration), I got validation accuracies for XGBoost and the MLP of around 0.82 - 0.84.

However, when I run the models on the test df, the Kaggle scores of XGBoost and the MLP drop to 0.76-0.77, and my best model becomes a DT with 0.782.

Is this normal behavior? I know the number of observations is not that large and therefore a few bad predictions will have an impact. Still, I wasn't expecting a gap like this.

rigid storm
#

I am new to ML and am trying to grasp the basics. If anyone could please tell me what kind of checkpoints should i set for myself so that i know that i am doing the right thing and actually understanding this stuff.

ornate valve
#

@quaint copper Training it on your PC gives quite different results from running it on Kaggle: Accuracy: 0.877778

median patio
#

Just finished my Titanic submission on Kaggle — scored 79% accuracy using Random Forest!
I’ve shared all the steps and reasoning behind each move in my notebook:
🔗 https://www.kaggle.com/code/nishchalpandey/titanic-survival-prediction-random-forest

Would love any feedback you guys have 🙌
And if you find it useful, a quick upvote would mean a lot! 🚀

prisma abyss
#

Hey guys, I am new to ML and want to build models. I wanted to start with Kaggle; are there any YT playlists that would help me understand the concepts of ML and cover the basics?

#

And I would love to collaborate , help and learn .

hushed fox
prisma abyss
solemn stream
#

Hello, I am new to ML and want to build models. I am looking to create a small team to start trying out some of the Kaggle challenges. I'd like to start fairly simple with things like classification or regression problems.
Thanks 😇

frank jasper
#

Hi everyone! 👋
I’m new here, and I wanted to share a lightweight, rule-based framework I’ve been working on — called AdaptoFlux.

It currently uses simple, human-readable arithmetic operations — like addition, subtraction, multiplication, and division — along with basic transformations (e.g., ±1, negation, and value copying).
No machine learning models. No gradients. Just pure math 🧮

Despite its simplicity, it typically achieves over 70% accuracy on the leaderboard — and with further tuning, sometimes even higher!

I’ve put together a fully runnable notebook with:
✅ One-click execution
✅ Step-by-step breakdown of how the rules are built
✅ Visualization of the logic flow and decision process

Check it out here:
👉 https://www.kaggle.com/code/gugu12138/adaptoflux-on-titanic

Would love to hear your thoughts or feedback!
And if you find it interesting, an upvote would mean a lot 🙌

P.S. This is just the starting point — the framework is designed to support more complex operations in the future, so it’s not limited to basic arithmetic. Excited to explore where this can go! 🚀

Looking forward to learning and collaborating with all of you!

hushed moss
#

Guys i used random forest classifier algo for titanic and got 0.791 score.
how can i do better ?

quartz sonnet
cosmic pumice
#

Hi everyone, I used two MLP models for prediction and assigned a weight of 0.5 to each model's prediction for the final output. But now my score is only 0.7799. What should I do next to improve it?

lucid frost
#

Hello, i sent my first titanic submission and after 18h it still has a score of 0.000; Do you guys know how long it usually takes to receive a score?
Many thanks!

cosmic pumice
lucid frost
cosmic pumice
#

what happen

civic harness
#

Hi everyone,

I am trying to build my data science skills using Kaggle competition datasets. I submitted my predictions for the Titanic dataset. My accuracy score is 77% even after multiple submissions. I checked the Leaderboard and some people have a score of 100%. So I was wondering, is there a way to compare their notebooks with mine?

sacred sleet
civic harness
#

@sacred sleet thank you for the information.

All other friends, please let me know if your accuracy was more than 77% . I would be happy to discuss with you all 😊

velvet harness
#

Looking for a teammate for this challenge. So far the best submissions achieved are the following:

XGBoost - 79.42%
Linear Regression - 77.03%

The final goal is to achieve >80% accuracy.
We can share some ideas for EDA and Feature Engineering

velvet harness
spare heart
cosmic pumice
knotty holly
hollow jasper
#

@quartz scarab is it possible to remove these advertisements from here. Happy to moderate if you're in need.

low stone
#

Looking for a partner who wants to collaborate for the titanic competition pls dm if interested

uneven monolith
#

Hello, could you please point me to a simple and easy-to-understand code for the Titanic competition so that I can learn from it? Thank you in advance.

hollow jasper
uneven monolith
paper ivy
#

Just curious, what scoring method should one use when hyperparameter tuning the models? I'm thinking accuracy, because that seems to be what the leaderboard uses and the imbalance is not that huge (so no balanced_accuracy needed?). But I'm kinda new to this, so... 🙂

hollow jasper
small anvil
#

I have a tricky question about grid search, ES, and hyperparameter tuning. With the same model (XGB) and features, I discovered by accident a set of hyperparameters that gives 79% accuracy, while the best hyperparameters I found with techniques like grid search and ES give me 77%. The problem, as far as I can tell, is that I'm using cross-validation to evaluate each set of hyperparameters, but the CV score doesn't correlate linearly with the accuracy on the submissions.

#

So I'm very confused. I mean, if grid search and early stopping don't find the best set of hyperparameters, do I have to assume I did something wrong, or is it down to the nature of the data, like different patterns in the train and test sets that don't allow me to say "an improvement in cross-validation represents an improvement in the submission"?

#

I know hyperparameter tuning is not the big thing on this dataset, but anyway I would like to understand what is happening here.

hollow jasper
# small anvil I have a tricki cuestion about grid search, ES, and hiperparameters tunning stuf...

Yeah, hyperparameter tuning won't do much. 79% is just a bit above the two-rule "model":

import numpy as np
import pandas as pd
from pathlib import Path

# Assumed locations; the original snippet did not show these constants.
TRAIN_CSV = Path("train.csv")
TEST_CSV = Path("test.csv")
OUT_SUB = Path(".")

def predict_gender_rule(df):
    """Return Survived (positive class) if female, return Not Survived (negative class) if male"""
    return np.where(df["Sex"] == "female", 1, 0)

def main() -> None:
    pd.read_csv(TRAIN_CSV)  # read but unused: the rule needs no training
    test = pd.read_csv(TEST_CSV)

    pred_test = predict_gender_rule(test)

    submission = pd.DataFrame(
        {
            "PassengerId": test["PassengerId"].astype(int),
            "Survived": pred_test,
        }
    )

    out_path = OUT_SUB / "submission_gender_rule.csv"
    submission.to_csv(out_path, index=False)
    print(f"[done] wrote {out_path.resolve()}")

if __name__ == "__main__":
    main()

small anvil
#

but my question is

#

Is it normal for these kinds of "blind spots" to exist when searching for hyperparameters? Which criteria should I use to search for them or to define a search range?

#

How should I proceed when an improvement in cross-validation comes with a worse accuracy on the submissions?

paper ivy
small anvil
paper ivy
#

Thanks a lot!

broken juniper
#

Hi guys, I need a piece of advice, I'm really mad trying to fix this but no luck

I had 0,83 and something accuracy with predictions on validation data when I split the training set
I uploaded it to the competition and got 0,77

I asked ChatGPT and it? he? advised to check the tree max depth, and also use cross-validation where I should look to decrease the difference between best and worst score. Which I did

In 5 runs of cross validation I got:
Mean accuracy: 0.8305
Difference between max and min: 0.0449

So basically it should not be worse than 0.785

I uploaded the new submission and got even worse result of 0.76555

What do I do to get more relevant results on public data? 😭

small anvil
slow tangle
#

anybody has an idea why simple logistic regression is doing better than NN

small anvil
#

I made some progress with the problem; it's clearly overfitting.

slow tangle
#

I've tried using regularization with the best lambda from the cross-validation set, but it's still worse on the test set than logistic regression.

#

one thing I haven’t done is parsing Name.. mostly just ignored it

#

It might be likely that married with kids might be less likely to survive, as well titles like Master might mean more likely to survive…

hollow jasper
# small anvil look https://www.kaggle.com/code/yoni2k/top-3-with-only-4-features-no-data-leaka...

6 years old, also there's a ton of leakage and cheating in this since the results are known. Any legit model above 83% seems to be a big stack of models with a mix of rules, sending the rest to a gradient boosting algo.

Like Chris Deotte's score of 84.6%: https://www.kaggle.com/code/cdeotte/titanic-wcg-xgboost-0-84688. This is a really complicated solution for a beginner, which is why I don't even recommend that people who are trying to learn try to optimize further beyond basic logistic regression and XGBoost.

do some feature engineering, do some model building, submit and move on to a new problem, imo!

sage spoke
#

Hi, how are you? I’m working on the Titanic project and did stacking (XGBoost, Random Forest, and Logistic Regression) and finalized with a Soft Voting ensemble (XGBoost, Random Forest, Logistic Regression plus the previous Stacking). I got evaluation results of VotingClassifier (Soft Voting)
Accuracy: 0.8492
ROC AUC: 0.8764.

However, my ranking is very low (Score: 0.76315), and I don’t understand why — I thought these were good results. Could someone please suggest how to improve? I’m still learning!

My Code is here: https://www.kaggle.com/code/lorrancintra/titanic-4-hybrid-ensemble-final

small anvil
hollow jasper
#

Chris is just crafty as heck and was able to make the best of it given years of submissions before him

hollow jasper
#

My best was light feature engineering and a stable xgboost

#

Embarked is def not a feature I found useful, why'd you train your model on it

#

Couldn't look at it too long, looks very AI generated. You need to drop a lot of useless / redundant features that aren't adding any signal though!

small anvil
hollow jasper
#

Try some other comps before going too crazy on titanic tho, it'll help you develop different techniques

small anvil
#

@hollow jasper I have one more question. I was reading about early stopping in grid search and I've been trying it, but as far as I can see, it has no native integration with cross-validation. Is it actually a good idea to combine early stopping with cross-validation? And if it's not worth it, how do you avoid overfitting with early stopping?

paper ivy
# sage spoke Hi, how are you? I’m working on the Titanic project and did stacking (XGBoost, R...

Your model is probably overfitted to the training data, that's the reason for the big score difference.

Some ideas:

  • Don't encode everything ordinally, try OHE. This does not matter for tree-based models, but for Logistic Regression it does. Similarly, try binning age and fare for Logistic Regression, and maybe also log-scale the fare
  • Speaking of Logistic Regression: Are you ever using that model (not the meta learner, the other one)?
  • You're using xgb both in the stacking classifier as well as the voting, of course this weights xgb heavily
  • Have you ever calculated something like Cohen's kappa for the models?
  • Try extracting more features, explore which actually have predictive power (correlation, feature importances in random forest) and drop the ones you don't actually use
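
The first suggestion in the list above could look something like this with scikit-learn. It is only a sketch under assumptions: the rows are made up, the choice of three age bins is arbitrary, and here the fare is log-scaled and standardized rather than binned.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

# Made-up rows in the Titanic column layout (illustration only).
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S", "C", "S", "Q"],
    "Pclass": [3, 1, 3, 1, 2, 2, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 21.07, 8.05, 30.07],
})
y = np.array([0, 1, 1, 0, 0, 1, 0, 1])

preprocess = ColumnTransformer([
    # One-hot instead of ordinal codes, so the linear model does not
    # read an artificial ordering into the categories.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked", "Pclass"]),
    # Binned age lets a linear model pick up non-linear age effects.
    ("age_bins", KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile"), ["Age"]),
    # Fare is heavily right-skewed, so take log(1 + fare) before scaling.
    ("log_fare", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), ["Fare"]),
])

clf = Pipeline([
    ("prep", preprocess),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)
pred = clf.predict(X)
print(pred)
```

Because the encoders live inside the pipeline, they are refit on each training fold during cross-validation, which also helps with the leakage concerns raised elsewhere in the channel.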
sage spoke
hollow jasper
#

As a beginner at least, if it's just for fun and you're very experienced, it can be fun to re-visit and try to go for high scores

#

I'd also make sure you're using a clean repo structure and doing commits. One thing I noticed about your notebook is that your file paths looked a bit messy. pathlib is OS agnostic and is very nice to collaborate with.

This is all null if u don't plan on ever being in a collaborative environment, but I figure most people are
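
For the pathlib point, a small sketch of what that can look like. The directory names here are hypothetical, not taken from the notebook being discussed.

```python
from pathlib import Path

# Hypothetical project layout; adjust to your own repo.
PROJECT_ROOT = Path("titanic")
DATA_DIR = PROJECT_ROOT / "data"
TRAIN_CSV = DATA_DIR / "train.csv"
OUT_SUB = PROJECT_ROOT / "submissions" / "submission.csv"

# The "/" operator joins path parts with the right separator for the OS,
# so the same code works on Windows, macOS, and Linux.
print(TRAIN_CSV.name, TRAIN_CSV.suffix, OUT_SUB.parent.name)
```

Path objects can be passed directly to pd.read_csv and DataFrame.to_csv.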

hollow jasper
#

We can only train our model with training data, and we treat every validation set the same as test aka non train

#

That's why those don't mesh

#

Good question though

ionic lotus
#

Hi everyone. Got my titanic score to a 0.775 any tips on how to increase my score

sacred sleet
worthy forum
#

what are common ways to get high > 0.9 score?

#

is there a specific algorithm? some key features that need to be engineered?

#

i did two runs. One on my first try (0.74) and the second after reading MEG RISDAL's post (0.77)

#

Is this good enough for a beginner? Should I move on to another competition?

sacred sleet
worthy forum
trail breach
#

🚢

hollow jasper
#

It's ok though most do

hard tiger
#

Link for this competition, anyone?

worthy forum
#

like data with 20 female 80 male

hollow jasper
# worthy forum but i guess my model would generalize better if we change the data?

Well, there is no way to know. Which is why trying to eke out percentages on Titanic is rather silly. We'll never see another Titanic, and you'll only ever have that one set of test data to evaluate on. It's a good place to start, spin up your first real Jupyter notebook, and do your first real EDA, but beyond that, don't waste your time optimizing this set.

steep gazelle
#

Hi! I just made my first submission and Scored 0.76555 Any tips to improve this? Or should I move to other competitions/projects, also this project made me realize how important math is for this field, please let me know which concepts and subjects I can focus on

peak cypress
#

why have you needed maths? what model have you used?

hollow jasper
#

The heart of this project is about building a clean machine learning workflow.

Use VS Code and make a remote repo on GitHub with regular commits (with proper commit messages) for practice.

Make a nice EDA notebook (either a Python interactive file or Jupyter) that covers the basic ideas of feature engineering and includes a full spectrum of analysis on what you have.

vast aspen
kind jungle
#

Can anyone provide me with the titanic test.csv and train.csv. Unfortunately I cannot download it from the kaggle.com website. Kaggle support cannot help... Thanks a lot in advance...

cedar depot
torpid cradle
#

It's my first time participating in these type of competition on kaggle. I do have experience with Machine Learning projects. Is there any tip for beginner like on what steps I need to pay attention or anything?

steep gazelle
steep gazelle
hollow jasper
pastel axle
#

Hey

pastel axle
#

Can anyone help me to enter the competition

#

Please

elder timber
stiff gull
cedar depot
#

continuation: #5dgai-introductions message

@ivory lotus ,
Firstly do you have prior knowledge/experience with Python, Numpy, Matplotlib, and Pandas ?

#

@ivory lotus talk here please

#

That channel is for introductions, this is the right channel 🐣

#

So as you mentioned you have no experience, I followed this roadmap that I would suggest you to follow too:

Kaggle has a bunch of courses and guides at https://kaggle.com/learn

  1. Intro to Programming Course
    Prerequisites: None.
    https://share.google/68HMIU2sPcz7jXpKn
  2. Python Course
    Prerequisites: Intro to Programming Course (if you have no knowledge of programming)
    https://share.google/tlKbeLBPrcKsfnSQZ
  3. Pandas Course
    Prerequisites: Python Course
    https://www.kaggle.com/learn/pandas
  4. Intro to Machine Learning
    Prerequisites: Python Course
    https://share.google/bTm7E168S656eQRzO

Or alternatively, you can find a video course on some other platform like YouTube and follow that..

elder narwhal
#

@tight robin

#

@tight robin sir please accept my request in linkedin and send me request to just please share your experience. Iam 17 years old

smoky wing
#

Hii

scarlet socket
#

everyone stay away from this guy, i think he is trying to hack

gleaming shard
#

hello guys

#

how do we do the submission?

static wyvern
#

Am thinking of keeping the name the next time. Experiment a little with the Random Forest first.

tame vessel
#

Hi, does anyone know if I can use what I learned in the House Prices advanced course for the predictions in this exercise? Thanks

spare temple
#

Hi, new here, looking for friends and other beginners to learn together, as accountability mates

nova furnace
#

Hello, I'm new to ML. I'd love to connect with ambitious friends to grow and learn together

errant mantle
#

Good day! I am a player stuck in the newbie village ! 😊

bitter oriole
#

Hello everyone, I'm Silver and I'm a beginner I just started today and I'd love friends that we can work together to make it easier for each other

jaunty shard
#

Hi everyone, I'm David and I'm a beginner in Data science.
Let's connect and learn from each other and probably end up working together.

midnight pulsar
versed sandal
#

Hi everyone, I'm tyros. I've studied machine learning algorithms and deep learning, and I'm currently focusing on computer vision (CV) learning. However, I've never participated in Kaggle competitions before, and my problem-solving thinking and coding skills are just average. I hope to meet like-minded friends here to learn and grow together.

versed sandal
glad ingot
#

Hey everyone! I am Akarshi and have some experience in Data Science. I'd love to connect with like-minded people to work together on projects and exchange ideas.

solid holly
#

Hi everyone! I’m a CS student starting my ML journey and using the Titanic competition as my “hello world” for data science (RIP Titanic, thank you for the dataset 🫡). I'm on preprocessing, basic models and hyperparameter tuning rn.
Feel free to say hi and connect.

hybrid spire
shut oxide
#

can we participate in the competition as an individual or as a team?

dry rivet
oblique rain
calm hill
#

Does anyone know how to increase the titanic ML score on kaggle? I'm still learning

frail terrace
#

would you guys appreciate a notebook baseline template which you can easily iterate on?

calm hill
#

Have you ever tried up to 80%?

#

Thanks

paper ivy
#

You probably need more feature engineering. In the end I ended up using:

  • Title
  • Embarked
  • Sex
  • Pclass
  • HasCabin
  • IsChild (Age < 12.0)
  • AgeMissing
  • SmallGroup (Group size by ticket between 2 and 4 (both included))
  • FamSize (Family Size)
  • Age
  • FarePerPerson (log-scaled, although this does not matter for tree-based models)
  • Survival Rate (of the group, either by surname or by ticket, mean if both could be calculated, make sure you have no data leakage)

This achieved 80%+ (82% with a three-model ensemble). There are probably better or simpler models, but it might be a good start. Titanic score is mostly feature engineering + avoiding overfitting
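
Several of the features in the list above can be derived in a few lines of pandas. The sketch below is a guess at one way to do it, not the poster's actual code: the title regex, the use of ticket counts for group size, and the sample rows are assumptions. The group survival rate is left out because it needs care to avoid leakage.

```python
import numpy as np
import pandas as pd

# A few rows in the train.csv layout; values are for illustration only.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Palsson, Miss. Torborg Danira",
    ],
    "Age": [22.0, 38.0, np.nan, 2.0, 8.0],
    "SibSp": [1, 1, 0, 3, 3],
    "Parch": [0, 0, 0, 1, 1],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "349909", "349909"],
    "Fare": [7.25, 71.2833, 7.925, 21.075, 21.075],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Family size: siblings/spouses + parents/children + the passenger.
df["FamSize"] = df["SibSp"] + df["Parch"] + 1

df["HasCabin"] = df["Cabin"].notna().astype(int)
df["AgeMissing"] = df["Age"].isna().astype(int)
df["IsChild"] = (df["Age"] < 12.0).astype(int)  # missing Age counts as not a child

# Group size by ticket: how many passengers share the same ticket.
ticket_size = df["Ticket"].map(df["Ticket"].value_counts())
df["SmallGroup"] = ticket_size.between(2, 4).astype(int)  # both ends included

# Fare is per ticket, so divide by group size, then log-scale.
df["FarePerPerson"] = np.log1p(df["Fare"] / ticket_size)

print(df[["Title", "FamSize", "IsChild", "SmallGroup", "FarePerPerson"]])
```

On the real competition data, ticket counts would usually be computed over train and test together, since groups are split across both files; that uses no target information.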

calm hill
#

I didn't expect that

#

Thank you

leaden portal
#

hi, i am maxing out at ~82% with both XGB and linear SVM, any tips to break that wall?

whole sedge
#

im doing really bad with 0.63157, using only numerical features chosen right out of the dataset combined with RandomForestClassifier. Any suggestions on how to nudge my score and skill bit by bit?

steep swift
#

Hi guys, I have covered supervised and unsupervised ML, and right now I am learning deep learning. I'd love to discuss concepts, and I also want to participate in a Kaggle competition. If any of you have some experience or knowledge, pls help me out.

dreamy ingot
#

Good morning everyone, I’m new to learning Data Science ready for the experience, Feel free to send some encouraging feedback
Thank you all ❤️

drifting pilot
#

I have created an Ensemble evaluation testing and tuning with several algorithms to choose a winner.

My notebook can be accessed here if it helps:
https://www.kaggle.com/code/rommelsharma/titanic-machine-learning-lr-rf-xgb-lgb-gbm

The algorithms include:

LR,
RF,
XGB,
LGB,
GBM

The steps I followed are:

1. Load Data: CSV → pandas DataFrames
2. EDA Visualisations: 8 charts covering class balance, sex, age, fare, family size
3. Feature Engineering: 27+ features, no data leakage (stats fitted on train only)
4. Encoding + Normalisation: LabelEncoder + StandardScaler (fit on train, transform both)
5. Algorithm Comparison: 5-fold stratified CV across LR, RF, XGB, LGB, GBM
6. Hyperparameter Tuning: Optuna TPE + MedianPruner; cache-first, ~30–50% faster
7. OOF Predictions: 5-fold out-of-fold predictions; zero leakage into meta-learner
8. Weighted Blend: Nelder-Mead optimised weights across XGB + LGB + GBM
9. Stacking: Logistic Regression meta-learner trained on OOF predictions
10. Final Models: Re-trained on full training data + saved via joblib
11. Feature Importance: Gain-based importance for XGBoost and LightGBM
12. ROC + AUC Charts: Per-fold AUC progression + ensemble comparison
13. Confusion Matrix: Accuracy + F1 on the best OOF ensemble
14. Submission CSV: PassengerId, Survived for Kaggle submission
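
Steps 7 and 9 (out-of-fold predictions feeding a Logistic Regression meta-learner) can be sketched as below. This is not the notebook's code: it swaps in scikit-learn's RandomForestClassifier and GradientBoostingClassifier for the XGB/LGB models and uses synthetic data from make_classification, purely to show the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Out-of-fold predictions: each row's probability comes from a model
# that never saw that row during fitting.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# The meta-learner is trained on the OOF columns only.
meta = LogisticRegression().fit(oof, y)
print(dict(zip(base_models, meta.coef_[0].round(3))))
```

For test-time predictions, each base model would be refit on the full training data and its test probabilities passed to `meta` in the same column order.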

For feature engineering I had 27+ features, no data leakage — stats fitted on train only

There are several visualizations too.

I used Optuna to tune the models. Since that takes a lot of time, I cache the tuned models. If you want to change the logic of just one model, delete that model's cached file and rerun the program; it will retrain that one model and take the others from the cache.

I have provided extensive documentation so that it's easy for anyone to read. Feel free to fork it and make your improvements.

I hope you find it useful.

hidden ivy
#

Heyy im curious. What's considered a good score for this dataset?

#

Beginner's level of course, but I mean what would be the highest attainable?

#

is 90%+ an achievable mark?

#

I am currently at 83-85% range on F1-scores using a RandomForestClassifier.

quartz sonnet
#

@hidden ivy it's a very good score, you can go to the next competition.

desert remnant
#

I need help to understand something.

#

After feature engineering, I ran a simple RF model to test

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=SEED
)

rf.fit(X_train, y_train)
pred = rf.predict(X_val)
accuracy_score(y_val, pred)

Result: 0.7988826815642458

Classification report shows: accuracy 0.80

Cross validation:

scores = cross_val_score(
    rf,
    X,
    y,
    cv=5
)

print("Mean CV Accuracy:", scores.mean())
print("Std:", scores.std())

Result:
Mean CV Accuracy: 0.8092461239093591
Std: 0.04010742324092075

After that I did some tuning and got my cross validation to:
Mean CV Accuracy: 0.8215805661917017
Std: 0.03339597851131273

Then I submitted the output CSV and Kaggle showed Score: 0.75598

I don't understand, none of my tests shows something close to 75%.
What am I doing wrong? Am I missing something?

#

Confusion Matrix also shows 80%

quartz sonnet
desert remnant
quartz sonnet
#

CV assumes that folds are representative of unseen data and have stable distribution. These assumptions break easily on small datasets like the titanic one.

Do you have high variance across splits? Check the scores of each fold. If you have 0.75 in some folds, 0.75 in public LB is expected.

Model   CV Score   LB Score
RF v1   0.80       0.75
RF v2   0.82       0.75

If CV improves but LB stays flat or declines, you may have overfitting.

For example, parameters like these tend to overfit:

RandomForestClassifier(
    n_estimators=3000,
    max_depth=20,
    min_samples_leaf=1
)

Are you using any feature engineering that could lead to data leakage?
What to do:
Are you doing mean encoding, binning, or target encoding before splitting? Try using a pipeline instead.
Using KFold? Try StratifiedKFold or RepeatedStratifiedKFold.
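
A rough sketch of the checks suggested above: put the preprocessing inside a pipeline so it is refit on each training fold, use RepeatedStratifiedKFold, and look at the individual fold scores and not only the mean. The data is synthetic (make_classification with 891 rows, the size of the Titanic training set), and the imputer and Random Forest settings are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in with the same number of rows as train.csv.
X, y = make_classification(n_samples=891, n_features=10, random_state=0)

# Anything fitted on data (imputers, encoders, scalers) goes inside the
# pipeline, so each fold fits it on its own training part only.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, min_samples_leaf=3, random_state=0),
)

# 5 folds repeated 3 times with different shuffles gives 15 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print("per-fold:", np.round(scores, 3))
print("min / mean / max:", scores.min(), scores.mean(), scores.max())
```

If the lowest fold scores on the real data sit near the public leaderboard score, the gap is more likely ordinary variance on a small dataset than a bug.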

deep cedar
#

Hii~ what's the theoretical accuracy ceiling on Titanic? I'm at 0.78 with a custom neural net but the leaderboard has people at 1.0. Is the test set just small enough that overfitting looks like perfect accuracy, or am I missing something about the dataset? 🥺

desert remnant
cursive yew
#

guys i got 0.78 with rf is that any good? first timer

craggy swan
quartz sonnet
green trench
#

dangg

#

i feel like thats cheating tho

blissful tundra
#

Guys can anyone tell me about this titanic contest

uneven stratus
polar sandal
#

hey i've been learning and working in ML for a while, but i'm curious as to how people develop the "intuition" in this field; I'd really appreciate it if anyone could kindly guide me on it using this competition problem as an example. Like how do you approach it, how to think?

vapid dust
jaunty cliff
#

ive been trying to work with the dataset and trying out various approaches to get a higher score. is there something im missing like should i move on after a certain score or keep trying for like over 90?

paper ivy
#

I think trying for 80%+ definitely has some benefits, much beyond that is probably not worth it

obsidian geyser
#

Hi guys. Just started with the Titanic challenge. Following the instructions and copy-pasting leads to errors. Is that intentional, or are the instructions just outdated?

sonic zodiac
#

Teach me I'm new here

thorn tusk
#

is there a vc where others are also looking at the information about the titanic predictions?

thorny arrow
#

I think it looks like a hackathon project

thorny arrow
#

Is it allowed to use Jupyter notebook or colab notebook?

upbeat iris
#

hello i am new to kaggle can anyone give me overview of the competition like from where to start

wicked prawn
#

Hello! My name is Pankaj and I am an AI and DS enthusiast. I am a student of Data Science at Bellevue University. I am here to start learning and discussing more about Predictive modeling.

deft star
#

I just see a bunch of people talking to themselves lol

potent spruce