#🚢┊titanic
Are we allowed to share a GitHub or Streamlit Cloud URL to answer a question?
Hey everyone!
I’ve just published a new Kaggle notebook:
"Titanic Survival Prediction using Machine Learning"
I used various ML models and feature engineering to predict passenger survival, and got a solid score on the leaderboard!
Would love it if you could check it out and drop an upvote if you find it helpful! 🙌
🔗 https://www.kaggle.com/code/mrmelvin/titanic-survival-prediction-using-machine-learning
Thanks a ton for the support! 💙
Hello guys, I'm just getting started w/ the Titanic. I have the basics of ML and I especially know the transformer architecture, but I read that it may not be the best fit for this challenge. Should I stick w/ a random forest, or do you think a transformer could be a good fit?
At least try it, it could be a fit.
ok thks
Hey guys! I’m pretty new to Kaggle competitions and currently working on the Titanic dataset. I’ve got a few things I’m confused about and hoping someone can help:
1- Preprocessing Test Data
In my train data, I drop useless columns (like Name, Ticket, Cabin), fill missing values, and use get_dummies to encode Sex and Embarked. Now when working with the test data, do I need to apply exactly the same steps? Like same encoding and all that? Does the model expect train and test to have exactly the same columns after preprocessing?
2- Using Target Column During Training
Another thing — when training the model, should the Survived column be included in the features?
What I’m doing now is:
Dropping Survived from the input features
Using it as the target (y)
Is that the correct way, or should the model actually see the target during training somehow? I feel like this is obvious but I’m doubting myself.
3- How Does Kaggle Submission Work?
Once I finish training the model, should I:
Run predictions locally on test.csv and upload the results (as submission.csv)? OR
Just submit my code and Kaggle will automatically run it on their test set?
I’m confused whether I’m supposed to generate predictions locally or if Kaggle runs my notebook/code for me after submission.
Hey Destiny, I got you!
- So for your preprocessing, I recommend putting all the cleaning steps you went through for the training set into a function. And then yes, you'll apply all of that again on the testing set. Just note that the testing set does not have the target column ('Survived'); everything else will be the same (though check that there aren't any new categories in the testing set that your cleaning does not account for).
- No! You should separate the target column into a "y" variable, and the rest of the columns go in an "X" variable. So yes, you are doing it the correct way. You do want to keep y around so you have a way to check your predictions, though.
- For your predictions, you'll submit a CSV file with 2 columns (PassengerId and Survived); the competition page details the exact format. When you output it, make sure you have index=False though. [eg. results.to_csv('Titanic_Predictions_Random_Forest.csv',index=False)]
Anyways, I just did the titanic dataset myself. I used Decision Trees and Random Forest. Ended up getting 0.75 and 0.76 respectively. Though I only used n_estimators and depth as hyperparameters for RF, so I'll go through it again with more parameters. Curious to see which models perform the best for this classification assignment.
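A minimal sketch of the "put the cleaning in a function" advice above. The tiny frames stand in for train.csv and test.csv, and the function name `preprocess`, the median fill for Age and the "S" fill for Embarked are assumptions for illustration, not the poster's actual code. The `reindex` call is one way to guarantee that the test frame ends up with exactly the training columns.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Apply the same cleaning steps to any Titanic-style frame."""
    out = df.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")
    out["Age"] = out["Age"].fillna(age_fill)
    out["Embarked"] = out["Embarked"].fillna("S")  # assumed fill value
    return pd.get_dummies(out, columns=["Sex", "Embarked"])


# Made-up rows standing in for train.csv and test.csv.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Name": ["a", "b", "c"],
    "Ticket": ["t1", "t2", "t3"],
    "Cabin": [None, "C85", None],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", "C", "Q"],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["d", "e"],
    "Ticket": ["t4", "t5"],
    "Cabin": [None, None],
    "Sex": ["male", "female"],
    "Age": [34.5, None],
    "Embarked": ["Q", "S"],
})

# Statistics used for filling come from the training data only.
age_median = train["Age"].median()

y = train["Survived"]                                   # target
X = preprocess(train.drop(columns=["Survived"]), age_median)

# Same steps on test, then force the same columns in the same order;
# dummy columns that never appear in test are filled with 0.
X_test = preprocess(test, age_median).reindex(columns=X.columns, fill_value=0)

print(list(X.columns))
print(list(X_test.columns))
```

Here the test rows contain no passenger who embarked at "C", so without the `reindex` step the test frame would be one column short.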
Thank you so much Michael, I do understand now.
Now I am facing this problem:
In my notebook, I save the submission file like this:
# Option 1
submission = test[['PassengerId', 'Survived']]
submission.to_csv('submission.csv', index=False)
# Option 2 (also tried)
submission = test[['PassengerId', 'Survived']]
submission.to_csv('/kaggle/working/submission.csv', index=False)
I double-checked, the file looks like this:
PassengerId Survived
0 892 0
1 893 0
...
(418, 2)
PassengerId 0
Survived 0
dtype: int64
It appears correctly in the output folder in Kaggle after running, but when I submit the notebook, I still get: "Submission CSV Not Found."
Any idea what could be wrong? Does Kaggle expect any specific step to detect it?
Thanks in advance!
And yeah, I found the reason now.
While submitting, Kaggle runs the cells from the start, so any error or typo will stop the submission even though I could run the cell by myself without errors.
Ah great. But yeah, when I first saw your response I understood that you had submitted the notebook instead of the CSV file. You have it working now?
Hello everyone! I'm here to share the code I used to make predictions and submit my solution for the Titanic competition.
I'm Brazilian, so I initially wrote the notebook in Portuguese. Depending on the interest or feedback, I’d be happy to create an English version to make it more accessible for everyone.
Using a model that combines Random Forest and SVM, I achieved an accuracy of 78.23%.
Here’s the link:
That's interesting. VotingClassifier is one option. Another option is to average the two models' predicted probabilities of Survived to determine the outcome. A third option is to train a new model with the probabilities of the two models as predictors.
One thing I noticed is that my model classified a lot of men as survivors (20).
Maybe if I adjust the logic so that a prediction of '1' is only made when the model's confidence is above 70% or something similar, my final accuracy could improve.
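A rough sketch of the probability-averaging idea and the 70% threshold mentioned above, on synthetic data rather than the Titanic files (the dataset, model settings and variable names are all assumptions for illustration). scikit-learn's VotingClassifier with voting="soft" does a similar averaging internally; doing it by hand makes the threshold easy to change.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
svm = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

# Average the two models' estimated probability of the positive class.
proba = (rf.predict_proba(X_val)[:, 1] + svm.predict_proba(X_val)[:, 1]) / 2

# Compare the default 0.5 cut-off with a stricter 0.7 cut-off.
for threshold in (0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    acc = (pred == y_val).mean()
    print(f"threshold={threshold}: accuracy={acc:.3f}, predicted positives={pred.sum()}")
```

A higher threshold can only reduce the number of predicted survivors; whether accuracy goes up is something to check on a validation split, not on the leaderboard.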
In fact. A simple decision tree can achieve "good" results in the titanic dataset. Simple rules with age + gender, etc
I am trying to get started with projects. I tried the Titanic dataset and I am getting 82% accuracy... I don't know if it's a good start or not??
Can anyone guide me here on what I did wrong?
And can anybody guide me on how to participate and what to do in the Titanic competition, as I am new to AI stuff and a beginner?
If you are in the top 30% of the leaderboard, that is good for a beginner.
Kaggle is the best platform to practice ML. As a beginner, you should go with a logistic regression model in the Titanic competition; I have 79 percent accuracy 🫡
You did well. In this competition, the test labels are known, and some people use them to get better results.
I have a query: I dropped some columns because I thought they might not be helpful... but I actually don't know whether I should drop them or not.
Is there any way to check which columns I should drop and which I should keep?
@opaque lotus sir please delete this, it's scam
Hello! I'm new here as I started to learn ML quite recently, so be patient 😉
I've used 4 types of models: DT, KNN, MLP and XGBoost.
Using stratified k-fold with 5 splits and some hyperparameter optimization (just a basic grid search with not much exploration), I got validation accuracies for XGBoost and MLP of around 0.82 - 0.84.
However, when I run the models on the test df, the Kaggle scores of XGBoost and MLP drop to 0.76-0.77, and my best model becomes a DT with 0.782.
Is this normal behavior? I know the number of observations is not that large and therefore a few bad predictions will have an impact. Still, I wasn't expecting a gap like this.
I am new to ML and am trying to grasp the basics. Could anyone please tell me what kind of checkpoints I should set for myself so that I know I am doing the right thing and actually understanding this stuff?
@quaint copper Training it on your PC gives very different results than doing it on Kaggle: Accuracy: 0.877778
Just finished my Titanic submission on Kaggle — scored 79% accuracy using Random Forest!
I’ve shared all the steps and reasoning behind each move in my notebook:
🔗 https://www.kaggle.com/code/nishchalpandey/titanic-survival-prediction-random-forest
Would love any feedback you guys have 🙌
And if you find it useful, a quick upvote would mean a lot! 🚀
Hey guys, I am new to ML and want to build models. I wanted to start with Kaggle; are there any YouTube playlists that would help me understand the concepts of ML and cover the basics?
And I would love to collaborate, help and learn.
Hello, you can start with the Iris flower classification, it is a classic
I am still starting out and this was really great. I advise you to watch the tutorial from the "projects data science" channel on YouTube; it is about 5 videos, they are short but beneficial and don't take long.
hope this helps 👍
Thanks for the suggestion 😊, I will definitely check it out
Hello, I am new to ML and want to build models. I am looking to create a small team to try out some of the Kaggle challenges. I'd like to start fairly simple with things like classification or regression problems.
Thanks 😇
Hi everyone! 👋
I’m new here, and I wanted to share a lightweight, rule-based framework I’ve been working on — called AdaptoFlux.
It currently uses simple, human-readable arithmetic operations — like addition, subtraction, multiplication, and division — along with basic transformations (e.g., ±1, negation, and value copying).
No machine learning models. No gradients. Just pure math 🧮
Despite its simplicity, it typically achieves over 70% accuracy on the leaderboard — and with further tuning, sometimes even higher!
I’ve put together a fully runnable notebook with:
✅ One-click execution
✅ Step-by-step breakdown of how the rules are built
✅ Visualization of the logic flow and decision process
Check it out here:
👉 https://www.kaggle.com/code/gugu12138/adaptoflux-on-titanic
Would love to hear your thoughts or feedback!
And if you find it interesting, an upvote would mean a lot 🙌
P.S. This is just the starting point — the framework is designed to support more complex operations in the future, so it’s not limited to basic arithmetic. Excited to explore where this can go! 🚀
Looking forward to learning and collaborating with all of you!
Guys i used random forest classifier algo for titanic and got 0.791 score.
how can i do better ?
The titanic dataset is small. You are getting an optimistic estimate from your cv score. What you can try is to see how your cv scores and lb scores correlate from simple to more complex models
Hi everyone, I used two MLP models for prediction and assigned a weight of 0.5 to each model's prediction for the final output. But now my score is only 0.7799. What should I do next to improve it?

Hello, i sent my first titanic submission and after 18h it still has a score of 0.000; Do you guys know how long it usually takes to receive a score?
Many thanks!
I think you might have submitted the wrong file. Normally, you wouldn’t get a score of zero unless you’re somehow predicting the exact opposite of the correct answers. Generally, the score appears right after you submit it.
thank you! i gotta check what went wrong
what happen
Hi everyone,
I am trying to build my data science skills using the Kaggle competition datasets. I submitted my predictions for the Titanic test data. My accuracy score is 77% even after multiple submissions. I checked the leaderboard and some people have a score of 100%. So I was wondering, is there a way to compare their notebooks with mine?
Hi! Unfortunately, you can only check the code of users who published their notebooks in the "Code" section of each challenge.
But 77% is a great score! (The 100% scores are only from users who cheated: the list of who survived is available on Wikipedia, so instead of building a machine learning model they just entered the values by hand and got 100%.)
@sacred sleet thank you for the information.
All other friends, please let me know if your accuracy was more than 77% . I would be happy to discuss with you all 😊
Looking for a teammate for this challenge. So far the best submissions achieved are the following:
XGBoost - 79.42%
Linear Regression - 77.03%
The final goal is to achieve >80% accuracy.
We can share some ideas for EDA and Feature Engineering
- Check that your CSV file matches the submission template.
- The values within the CSV file have to be ints, and you may have floats.
astype(int) should fix the problem.
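A small illustration of the float-versus-int point above, using three made-up rows in place of real predictions. The two sanity checks are an assumption about what "matches the template" means here: the columns PassengerId and Survived, with Survived holding only 0 or 1.

```python
import pandas as pd

# Made-up predictions; floats like these can come from rounding or a regressor.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0.0, 1.0, 0.0],
})

# Convert the float predictions to integers before writing the file.
submission["Survived"] = submission["Survived"].astype(int)

# Quick checks against the expected layout.
assert list(submission.columns) == ["PassengerId", "Survived"]
assert set(submission["Survived"].unique()) <= {0, 1}

# With no path, to_csv returns the CSV text; pass "submission.csv" to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Without the `astype(int)` line the rows would be written as `892,0.0` instead of `892,0`.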
Hi @velvet harness you can dm me I have similar score
Hey, my accuracy is 77% too. I used a linear regression model.
Hi, I'm new to this challenge and to the subject in general. Out of curiosity, why did you choose linear regression instead of logistic, given it is a classification problem?
@quartz scarab is it possible to remove these advertisements from here. Happy to moderate if you're in need.
Logistic regression
Looking for a partner who wants to collaborate for the titanic competition pls dm if interested
Hello, could you please point me to a simple and easy-to-understand code for the Titanic competition so that I can learn from it? Thank you in advance.
it depends what you mean, you can ask ChatGPT for a simple logistic regression model that'll give you 76-77% on the first run. from there, squeezing out percentage points requires different techniques, feature engineering, hyperparameter tuning or all of the above.
Okay, I will try this. Thank you very much.
Just curious, what scoring method should one use when hyperparameter tuning the models? I'm thinking accuracy, because that seems to be what the leaderboard uses and the imbalance is not that huge (so no balanced_accuracy needed?). But I'm kinda new to this, so... 🙂
hyper parameter tuning does almost nothing on titanic. it's feature engineering + hybrid models that do best
I have a tricky question about grid search, ES (early stopping), and hyperparameter tuning. With the same model (XGB) and features, I discovered by accident a set of hyperparameters that gives 79% accuracy, while with techniques like grid search and ES the best hyperparameters I found give me 77%. The problem I can see is that I'm using cross-validation to check each set of hyperparameters, but that has no linear correlation with the accuracy on the submissions.
So I'm very confused. I mean, if grid search and early stopping don't find the best set of hyperparameters, do I have to assume I did something wrong, or is it the nature of the data, like different patterns in the train and test sets that don't allow me to say "an improvement in cross-validation represents an improvement in the submission"?
I know hyperparameter tuning is not the big thing in this dataset, but anyway I would like to understand what is happening here.
yeah hyperparameter tuning won't do much. 79% is just bit above the two rule "model"
import numpy as np
import pandas as pd

# TRAIN_CSV, TEST_CSV and OUT_SUB (a pathlib.Path directory) are assumed to be defined elsewhere.


def predict_gender_rule(df):
    """Return Survived (positive class) if female, return Not Survived (negative class) if male"""
    return np.where(df["Sex"] == "female", 1, 0)


def main() -> None:
    pd.read_csv(TRAIN_CSV)  # loaded but unused: the rule needs no training
    test = pd.read_csv(TEST_CSV)
    pred_test = predict_gender_rule(test)
    submission = pd.DataFrame(
        {
            "PassengerId": test["PassengerId"].astype(int),
            "Survived": pred_test,
        }
    )
    out_path = OUT_SUB / "submission_gender_rule.csv"
    submission.to_csv(out_path, index=False)
    print(f"[done] wrote {out_path.resolve()}")


if __name__ == "__main__":
    main()
Yeah, I get it,
but my question is:
is it normal for that kind of "blind spot" to exist when searching for hyperparameters? Which criteria should I use to search, or to define the search range?
How should I proceed when an improvement in cross-validation means a worse accuracy in submissions?
What is a good score then? Looking at the leaderboard, 79% seems to be a fairly good score...
If your score can be matched by a heuristic, that doesn't necessarily mean your model has bad accuracy; it probably means machine learning is maybe not the right approach in cost/efficiency terms.
Look at https://www.kaggle.com/code/yoni2k/top-3-with-only-4-features-no-data-leakage#Assumptions: this guy says 81.8% puts you in the top 3% without data leakage.
Thanks a lot!
Hi guys, I need a piece of advice. I'm going mad trying to fix this, but no luck.
I had an accuracy of 0.83 and something with predictions on validation data when I split the training set.
I uploaded it to the competition and got 0.77.
I asked ChatGPT and it? he? advised me to check the tree max depth, and also to use cross-validation, where I should look to decrease the difference between the best and worst score. Which I did.
In 5 runs of cross validation I got:
Mean accuracy: 0.8305
Difference between max and min: 0.0449
So basically it should not be worse than 0.785
I uploaded the new submission and got even worse result of 0.76555
What do I do to get more relevant results on public data? 😭
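A back-of-the-envelope check on how much of that gap could be plain sampling noise. It assumes the leaderboard score is computed on all 418 test rows (the shape shown earlier in the channel) and treats each row as an independent right/wrong trial, which is a simplification. The variable names are made up for illustration.

```python
import math

n_test = 418          # rows in test.csv, per the shape printed earlier
cv_mean = 0.8305      # mean cross-validation accuracy reported above
lb_score = 0.76555    # leaderboard score reported above

# Binomial standard error of an accuracy measured on n_test rows.
se = math.sqrt(cv_mean * (1 - cv_mean) / n_test)

# Rough 95% range for the test accuracy if the true accuracy were cv_mean.
low, high = cv_mean - 1.96 * se, cv_mean + 1.96 * se

print(f"standard error = {se:.4f}")
print(f"approx. 95% range = [{low:.3f}, {high:.3f}]")
print(f"observed gap = {(cv_mean - lb_score) / se:.1f} standard errors")
```

Under these assumptions a model with a true accuracy of 0.83 could still land a few points lower on 418 rows by chance alone, so the max-min spread across five folds is not a hard floor for the leaderboard score. A gap well beyond two standard errors points to something more than noise, such as overfitting or a difference between the train and test sets.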
It's exactly what I was asking about lately. It seems like the score in cross-validation and the score in the submissions have no linear correlation, or maybe the model can overfit even with good accuracy in cross-validation. If you find out what is happening, please comment in this channel, because I'm facing the same problem.
Anybody have an idea why simple logistic regression is doing better than an NN?
I made some progress with the problem: it is clearly overfitting.
I've tried regularization with the best lambda on the cross-validation set, but it is still worse on the test set than logistic regression.
one thing I haven’t done is parsing Name.. mostly just ignored it
It might be that married people with kids are less likely to survive, while titles like Master might mean more likely to survive…
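One possible way to pull the title out of Name, assuming the "Surname, Title. Given names" layout used in the dataset. The regular expression and the `titles` variable are illustrative choices, not something taken from the messages above.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
    "Heikkinen, Miss. Laina",
])

# Capture everything between the comma and the first full stop.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Mrs', 'Master', 'Miss']
```

The resulting column can then be one-hot encoded like Sex or Embarked; rare titles are often grouped into a single "Other" category, though that grouping is a judgment call.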
6 years old, also there's a ton of leakage and cheating in this since results are known. Any legit model above 83% seems to be a big stack of models with a mix of rules, sending the rest to a gradient boosting algo.
Like Chris Deotte's score https://www.kaggle.com/code/cdeotte/titanic-wcg-xgboost-0-84688 84.6%. This is a really complicated solution for a beginner, which is why I don't recommend that people who are trying to learn try to optimize further beyond basic logistic regression and XGBoost.
do some feature engineering, do some model building, submit and move on to a new problem, imo!
Hi, how are you? I’m working on the Titanic project and did stacking (XGBoost, Random Forest, and Logistic Regression) and finalized with a Soft Voting ensemble (XGBoost, Random Forest, Logistic Regression plus the previous Stacking). I got evaluation results of VotingClassifier (Soft Voting)
Accuracy: 0.8492
ROC AUC: 0.8764.
However, my ranking is very low (Score: 0.76315), and I don’t understand why — I thought these were good results. Could someone please suggest how to improve? I’m still learning!
My Code is here: https://www.kaggle.com/code/lorrancintra/titanic-4-hybrid-ensemble-final
Sorry, I don't get it. Are you saying that approach has data leakage? Because I read his methodology and I was not able to recognize any type of data leakage.
Which approach? Chris? Not really leakage but it's implicit reverse engineering. The models that were stacked performed very well on the same test data. It's pretty easy to see how this wouldn't happen in a majority of settings. But it's kaggle, not real life so go for it
Chris is just crafty as heck and was able to make the best of it given years of submissions before him
I feel bad that you went to all that trouble with that result. Your feature engineering doesn't look to add any signal to the log reg base line model with no feature engineering, it may have added more noise
My best was light feature engineering and a stable xgboost
Embarked is def not a feature I found useful, why'd you train your model on it
Couldn't look at it too long, looks very AI generated. You need to drop a lot of useless / redundant features that aren't adding any signal though!
Which score you consider a good result to achieve without data leakage, crazy ensembles or reverse engineering?
Anything above 81%
Try some other comps before going too crazy on titanic tho, it'll help you develop different techniques
@hollow jasper I have one more question. I was reading about early stopping in grid search and I've been trying it, but as far as I can see it has no native integration with cross-validation. Is it actually a good idea to integrate early stopping + cross-validation? And if it's not worth it, how do you avoid overfitting with early stopping?
Your model is probably overfitted to the training data, that's the reason for the big score difference.
Some ideas:
- Don't encode everything ordinally, try OHE. This does not matter for tree-based models, but for Logistic Regression it does. Similarly, try binning age and fare for logistic regression, and maybe also log-scale the fare
- Speaking of Logistic Regression: Are you ever using that model (not the meta learner, the other one)?
- You're using xgb both in the stacking classifier as well as the voting, of course this weights xgb heavily
- Have you ever calculated something like cohen's kappa for the models?
- Try extracting more features, explore which actually have predictive power (correlation, feature importances in random forest) and drop the ones you don't actually use
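A sketch of the first bullet above (one-hot encoding, age binning and a log-scaled fare feeding a logistic regression), built with scikit-learn on eight made-up rows. The column choices, the three equal-width age bins and the step names are assumptions for illustration, not the reviewed notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder

# Made-up rows with the same column names as the Titanic training file.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S", "C", "S", "Q"],
    "Pclass": [3, 1, 3, 2, 3, 1, 2, 3],
    "Age": [22, 38, 26, np.nan, 35, 54, 2, 27],
    "Fare": [7.25, 71.28, 7.92, 13.0, 8.05, 51.86, 21.07, 11.13],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})

preprocess = ColumnTransformer([
    # One-hot encode categoricals instead of mapping them to 0, 1, 2, ...
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked", "Pclass"]),
    # Fill missing ages with the median, then bin into three equal-width groups.
    ("age", make_pipeline(
        SimpleImputer(strategy="median"),
        KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform"),
    ), ["Age"]),
    # log(1 + fare) tames the long right tail of the fare distribution.
    ("fare", FunctionTransformer(np.log1p), ["Fare"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X, y = df.drop(columns="Survived"), df["Survived"]
model.fit(X, y)
pred = model.predict(X)
print(pred)
```

Keeping the preprocessing inside the pipeline means that cross-validation refits the imputer, bins and encoder on each training fold, which avoids leaking validation rows into those statistics.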
You're right, Zach. I'll work on approaching the system in a more appropriate way. I saw your work — you went through the features one by one. Thanks for the tip!
Thanks, There's many people who have spent a long time on this set since it's been out for awhile. It's a cool toy example, but nothing to stay stuck on beyond putting a submission together that uses a 0 leakage and clean preprocess, fit, transform, predict , submit with any score above or even equal to the baseline of 76.5%
As a beginner at least, if it's just for fun and you're very experienced, it can be fun to re-visit and try to go for high scores
I'd also make sure you're using a clean repo structure and doing commits. One thing I noticed about your note book is your file paths looked a bit messy. Pathlib is OS agnostic and is very nice to collaborate with.
This is all null if u don't plan on ever being in a collaborative environment, but I figure most people are
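For the pathlib suggestion, a minimal sketch with a hypothetical project layout (the `data` and `submissions` folder names are made up; adjust them to your own repo). The `/` operator joins path segments the same way on Windows, macOS and Linux.

```python
from pathlib import Path

# Hypothetical layout: <project>/data/*.csv and <project>/submissions/.
PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "data"
OUT_DIR = PROJECT_ROOT / "submissions"

TRAIN_CSV = DATA_DIR / "train.csv"
TEST_CSV = DATA_DIR / "test.csv"

# Where a submission file would be written; nothing is created on disk here.
out_path = OUT_DIR / "submission.csv"

print(TRAIN_CSV.name, out_path.suffix)
```

pandas accepts Path objects directly, so `pd.read_csv(TRAIN_CSV)` works without converting to a string.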
Well, early stopping and CV serve different purposes. CV is for evaluating how the model will generalize to unseen data; early stopping in a grid search tells you when you've hit a point of no improvement, so it's best to go back to the hyperparameters that produced the best score.
If they could be "integrated", that's pure data leakage since validation set is to be treated as unseen
We can only train our model with training data, and we treat every validation set the same as test aka non train
That's why those don't mesh
Good question though
Hi everyone. Got my titanic score to a 0.775 any tips on how to increase my score
You can try multiple types of models like XGBoost and try different settings; you can also learn about data engineering.
But 0.775 is a very good score. I suggest you try other competitions with bigger datasets, like the #🏠┊house-prices-advanced-regression-techniques competition.
what are common ways to get high > 0.9 score?
is there a specific algorithm? some key features that need to be engineered?
i did two runs. One on my first try (0.74) and the second after reading MEG RISDAL's post (0.77)
Is this good enough for a beginner? Should I move on to another competition?
0.9 is impossible; the people who succeeded cheated by manually entering the data from Wikipedia.
Ok now i feel better about my model's performance 😅
You can x)
🚢
Your model performed on par with the no-model rule: if female, lives; if male, dies.
It's ok though, most do.
Link for this competition, anyone?
but i guess my model would generalize better if we change the data?
like data with 20 female 80 male
Well, there is no way to know. Which is why trying to eke out percentages on Titanic is rather silly. We'll never see another Titanic and you'll only ever have that one set of test data to evaluate. It's a good place to start, spin up your first real Jupyter notebook and do your first real EDA, but beyond that, don't waste your time optimizing this set.
Hi! I just made my first submission and Scored 0.76555 Any tips to improve this? Or should I move to other competitions/projects, also this project made me realize how important math is for this field, please let me know which concepts and subjects I can focus on
why have you needed maths? what model have you used?
That's equal to the heuristic model: if male, dead; if female, alive. This isn't a project to dwell on. You have 800 rows to train and validate on (tiny) and 400 rows to test (tiny).
Trying to optimize it isn't worth it
The heart of this project is about building a clean machine learning work flow.
Use VScode and make a remote repo on GitHub with regular commits (with proper commit messages) for practice.
Make a nice EDA note book (either a Python interactive file or Jupyter) that covers the basic ideas of feature engineering and includes a full spectrum of analysis on what you have.
I created a wonderful Streamlit app about predicting Titanic survivors 🚢
You can check it out at: https://app-app-titanic-data-bdwwycbgdejsmtuv4ntkss.streamlit.app/
I'm really interested in your opinions.
Thank you.
Can anyone provide me with the titanic test.csv and train.csv. Unfortunately I cannot download it from the kaggle.com website. Kaggle support cannot help... Thanks a lot in advance...
Source: Kaggle https://share.google/LiH1qRrt24FREscc0
Download from here, in "Data" tab you can find all files
It's my first time participating in these type of competition on kaggle. I do have experience with Machine Learning projects. Is there any tip for beginner like on what steps I need to pay attention or anything?
Just to get a deeper understanding of the models I am using. I used Logistic Regression, KNN classifier, Random forest classifier and svc
Understood I should move to other projects then since, I made it in a jupyter notebook I will add comments and descriptions and push on github
Yes, if you are aiming to be a professional, there's quite a few things that are not tested in kaggle that you absolutely should have mastered; GitHub, ETL ( leaving you flexible for smaller teams where data engineering is not yet stream lined) and managing cloud platforms.
I'm being paid extremely well (TC) bc I joined a start up who needed MLE and data engineer/analyst in one person
Hey
I feel your pain, I got to a .779 and can't seem to break past it. My param were n_estimators=200, depth=10 and I had [Pclass, SibSp, Parch, Fare, Sex,Embarked]. How did you get there?
Oh, I thought those people were great data scientists
continuation: #5dgai-introductions message
@ivory lotus ,
Firstly do you have prior knowledge/experience with Python, Numpy, Matplotlib, and Pandas ?
If yes then try to read this:
https://www.kaggle.com/competitions/titanic and watch the attached youtube video. And share here if you face any doubts/issues.
@ivory lotus talk here please
That channel is for introductions, this is the right channel 🐣
So as you mentioned you have no experience, I followed this roadmap that I would suggest you to follow too:
Kaggle has a bunch of courses and guides at https://kaggle.com/learn
- Intro to Programming Course
Prerequisites: None.
https://share.google/68HMIU2sPcz7jXpKn
- Python Course
Prerequisites: Intro to Programming Course, if you have no knowledge of programming
https://share.google/tlKbeLBPrcKsfnSQZ
- Pandas Course
Prerequisites: Python Course
https://www.kaggle.com/learn/pandas
- Intro to Machine Learning
Prerequisites: Python Course
https://share.google/bTm7E168S656eQRzO
Or alternatively, you can find a video course on some other platform like YouTube and follow that..
@tight robin
@tight robin sir please accept my request in linkedin and send me request to just please share your experience. Iam 17 years old
Hii
everyone stay away from this guy, i think he is trying to hack
Hey Guys!! Getting my hands on Titanic Competition. Check out my recent submission through Random Forest. Dropped the columns['PassengerId', 'Name', 'Embarked', 'Cabin', 'Ticket'] https://www.kaggle.com/code/abhangkolte/titanic-randomforest
Am thinking of keeping the name the next time. Experiment a little with the Random Forest first.
Hi, does anyone know if I can use what I learned in the House Prices Advanced course for the prediction in this exercise? Thanks.
Hi, new here, looking for friends and also beginners to learn together, an accountability mate.
Hello, I'm new to ML. I'd love to connect with ambitious friends to grow and learn together
Good day! I am a player stuck in the newbie village ! 😊
Hello everyone, I'm Silver and I'm a beginner I just started today and I'd love friends that we can work together to make it easier for each other
Hi everyone, I'm David and I'm a beginner in Data science.
Let's connect and learn from each other and probably end up working together.
Hi, I'm a beginner too! If you want to learn with me, let me know 😁 I just know the basics of data science, but I want to learn a lot about this topic
Nice brother.
Hi everyone, I'm tyros. I've studied machine learning algorithms and deep learning, and I'm currently focusing on computer vision (CV) learning. However, I've never participated in Kaggle competitions before, and my problem-solving thinking and coding skills are just average. I hope to meet like-minded friends here to learn and grow together.
I hope we can grow together. If possible, we can connect with each other as friends—so we can chat, discuss and solve problems together.
Same
Hey everyone! I am Akarshi and I have some experience in Data Science. I'd love to connect with like-minded people to work together on projects and exchange ideas.
Hi everyone! I’m a CS student starting my ML journey and using the Titanic competition as my “hello world” for data science (RIP Titanic, thank you for the dataset 🫡). I'm on preprocessing, basic models and hyperparameter tuning right now.
Feel free to say hi and connect.
What is newbie village
can we participate in the competition as an individual or as a team?
Hi, Padmesh. I've just started with Titanic as my introduction to datasets and data science as well. Would love to discuss more.
In my opinion, don't mention this dataset on your resume, because most companies don't value it. You can use it for training purposes.
Does anyone know how to increase the titanic ML score on kaggle? I'm still learning
would you guys appreciate a notebook baseline template which you can easily iterate on?
You probably need more feature engineering. In the end I ended up using:
- Title
- Embarked
- Sex
- Pclass
- HasCabin
- IsChild (Age < 12.0)
- AgeMissing
- SmallGroup (Group size by ticket between 2 and 4 (both included))
- FamSize (Family Size)
- Age
- FarePerPerson (log-scaled, although this does not matter for tree-based models)
- Survival Rate (of the group, either by surname or by ticket, mean if both could be calculated, make sure you have no data leakage)
This achieved 80%+ (82% with a three-model ensemble). There are probably better or simpler models, but it might be a good start. Titanic score is mostly feature engineering + avoiding overfitting
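A sketch of how a few of the listed features could be derived with pandas, on four made-up rows. The exact definitions are assumptions: group size is counted per ticket, FarePerPerson divides the fare by that group size, and the leakage-sensitive Survival Rate feature is left out.

```python
import numpy as np
import pandas as pd

# Made-up rows; the first two passengers share a ticket.
df = pd.DataFrame({
    "SibSp": [1, 1, 3, 0],
    "Parch": [0, 0, 1, 0],
    "Age": [22.0, 38.0, 2.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Ticket": ["A", "A", "B", "C"],
    "Fare": [14.5, 14.5, 21.075, 8.4583],
})

df["FamSize"] = df["SibSp"] + df["Parch"] + 1           # include the passenger
df["HasCabin"] = df["Cabin"].notna().astype(int)
df["AgeMissing"] = df["Age"].isna().astype(int)
df["IsChild"] = (df["Age"] < 12.0).astype(int)          # missing age counts as 0

# Number of passengers travelling on the same ticket.
group_size = df.groupby("Ticket")["Ticket"].transform("size")
df["SmallGroup"] = group_size.between(2, 4).astype(int)  # both ends included

# Fare split across the ticket group, then log-scaled.
df["FarePerPerson"] = np.log1p(df["Fare"] / group_size)

print(df[["FamSize", "HasCabin", "AgeMissing", "IsChild", "SmallGroup"]])
```

On the real data, group sizes are usually counted over train and test together, since a ticket can be split across the two files; that uses no labels, unlike the survival-rate feature.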
Wow
I didn't expect that
Thank you
hi, i am maxing out at ~82% with both XGB and linear SVM, any tips to break that wall?
I'm doing really badly with 0.63157, using only numerical features taken straight out of the dataset with RandomForestClassifier. Any suggestions on how to nudge my score and skill up bit by bit?
Hi guys, I have covered supervised and unsupervised ML, and right now I am learning deep learning. I'd love to discuss concepts, and I also want to participate in Kaggle competitions. If any of you have some experience or knowledge, please help me out.
Good morning everyone, I’m new to learning Data Science ready for the experience, Feel free to send some encouraging feedback
Thank you all ❤️
I have created an Ensemble evaluation testing and tuning with several algorithms to choose a winner.
My notebook can be accessed here if it helps:
https://www.kaggle.com/code/rommelsharma/titanic-machine-learning-lr-rf-xgb-lgb-gbm
The algorithms include:
LR,
RF,
XGB,
LGB,
GBM
The steps I followed are:
1. Load Data: CSV → pandas DataFrames
2. EDA Visualisations: 8 charts covering class balance, sex, age, fare, family size
3. Feature Engineering: 27+ features, no data leakage (stats fitted on train only)
4. Encoding + Normalisation: LabelEncoder + StandardScaler (fit on train, transform both)
5. Algorithm Comparison: 5-fold stratified CV across LR, RF, XGB, LGB, GBM
6. Hyperparameter Tuning: Optuna TPE + MedianPruner, cache-first, ~30–50% faster
7. OOF Predictions: 5-fold out-of-fold predictions, zero leakage into meta-learner
8. Weighted Blend: Nelder-Mead optimised weights across XGB + LGB + GBM
9. Stacking: Logistic Regression meta-learner trained on OOF predictions
10. Final Models: re-trained on full training data + saved via joblib
11. Feature Importance: gain-based importance for XGBoost and LightGBM
12. ROC + AUC Charts: per-fold AUC progression + ensemble comparison
13. Confusion Matrix: accuracy + F1 on the best OOF ensemble
14. Submission CSV: PassengerId, Survived for Kaggle submission
For feature engineering I had 27+ features, no data leakage — stats fitted on train only
There are several visualizations too.
I used Optuna to tune the models. Since tuning takes a lot of time, I cache the tuned models. If you want to change the logic of just one model, delete that model's cached file and rerun the program: only that model is retrained, and the others are loaded from the cache.
I have provided extensive documentation so that it's easy for anyone to read. Feel free to fork it and make your improvements.
I hope you find it useful.
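For anyone unsure what steps 7 and 9 above mean in practice, here is a minimal sketch of out-of-fold (OOF) predictions feeding a logistic regression meta-learner. It is not taken from the linked notebook: it swaps in two scikit-learn models for XGB/LGB, uses synthetic data from `make_classification`, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the engineered Titanic features (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Each row's prediction comes from a model that never saw that row during fitting,
# so the meta-learner is trained on honest (out-of-fold) base-model outputs.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Meta-learner: one input column per base model
meta = LogisticRegression()
meta.fit(oof, y)

print("OOF matrix shape:", oof.shape)
print("Meta-learner weights:", dict(zip(base_models, meta.coef_[0])))
```

To predict on the test set, the base models would be refit on the full training data and their test-set probabilities passed to `meta.predict`.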
Heyy im curious. What's considered a good score for this dataset?
beginner's level of course, but I mean what would be the highest attainable?
is 90%+ an achievable mark?
I am currently at 83-85% range on F1-scores using a RandomForestClassifier.
@hidden ivy it's a very good score, you can go to the next competition.
Bro check dm
I need help to understand something.
After feature engineering, I ran a simple RF model to test
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=SEED
)
rf.fit(X_train, y_train)
pred = rf.predict(X_val)
accuracy_score(y_val, pred)
Result: 0.7988826815642458
Classification report shows: accuracy 0.80
Cross validation:
scores = cross_val_score(
    rf,
    X,
    y,
    cv=5
)
print("Mean CV Accuracy:", scores.mean())
print("Std:", scores.std())
Result:
Mean CV Accuracy: 0.8092461239093591
Std: 0.04010742324092075
After that I did some tuning and got my cross validation to:
Mean CV Accuracy: 0.8215805661917017
Std: 0.03339597851131273
Then I submitted the output CSV and Kaggle showed Score: 0.75598
I don't understand, none of my tests shows something close to 75%.
What am I doing wrong? Am I missing something?
Confusion Matrix also shows 80%
@desert remnant the Titanic dataset is small, so your CV score is over-optimistic. You should track both your CV score and your LB score and make sure they correlate. You can also try using 10 folds instead of 5.
Thank you, I tried 10 folds and still returns 80%.
Is there another way to get a more realistic score for Titanic? Since CV is not reliable
Track your CV score and your LB score, make sure they correlate, and try not to overfit by reducing the complexity of your models. Is the spread (SD) of your fold scores consistent with the score you get on the public LB?
CV assumes that folds are representative of unseen data and have stable distribution. These assumptions break easily on small datasets like the titanic one.
Do you have high variance across splits? Check the scores of each fold. If you have 0.75 in some folds, 0.75 in public LB is expected.
Model   CV Score   LB Score
RF v1   0.80       0.75
RF v2   0.82       0.75
If CV improves but LB stays flat or declines, you may have overfitting.
For example, a configuration like this tends to overfit on such a small dataset:
RandomForestClassifier(
    n_estimators=3000,
    max_depth=20,
    min_samples_leaf=1
)
Are you using some feature engineering that could lead to data leakage?
What to do:
Are you doing mean encoding, binning, or target encoding before splitting? Try using a pipeline instead.
Using KFold? Try StratifiedKFold or RepeatedStratifiedKFold.
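A minimal sketch of the two suggestions above: preprocessing kept inside a pipeline so it is refit on each training fold, and RepeatedStratifiedKFold with the individual fold scores printed. The data is synthetic with missing values injected, and the hyperparameters are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the Titanic features, with ~5% of values set to NaN
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# The imputer lives inside the pipeline, so its medians are computed
# from the training part of each fold only (no leakage from validation rows).
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(
        n_estimators=100, max_depth=5, min_samples_leaf=3, random_state=0
    ),
)

# 5 folds repeated 3 times with different shuffles gives 15 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean:", scores.mean(), "Std:", scores.std(), "Min:", scores.min())
```

Looking at the minimum and the spread, not just the mean, gives a better idea of how low a public LB score could plausibly land.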
Hii~ what's the theoretical accuracy ceiling on Titanic? I'm at 0.78 with a custom neural net but the leaderboard has people at 1.0. Is the test set just small enough that overfitting looks like perfect accuracy, or am I missing something about the dataset? 🥺
Thank you for your help. Your questions pointed me in the right direction.
It turns out I was dealing with some data leakage and potential overfitting.
Now when I print accuracy_score() I get a result pretty close to LB score, differing by only 1%.
guys i got 0.78 with rf is that any good? first timer
its not the best, but a good score on RF ig....keep it up!
It feels like cheating, but if you are interested... Look here https://www.kaggle.com/code/cdeotte/titantic-mega-model-0-84210
Guys can anyone tell me about this titanic contest
Watch the video on kaggle's website, it explains everything you need to know
hey i've been learning and working in ML for a while, but i'm curious as to how people develop the "intuition" in this field; I'd really appreciate it if anyone could kindly guide me on it using this competition problem as an example. Like how do you approach it, how to think?
it means that, you can use basic ML models, Feature engineering...
ive been trying to work with the dataset and trying out various approaches to get a higher score. is there something im missing like should i move on after a certain score or keep trying for like over 90?
I think trying for 80%+ definitely has some benefits, much beyond that is probably not worth it
Hi guys. Just started with the Titanic challenge. Following the instructions and copy-pasting leads to errors. Is that intentional, or are the instructions just outdated?
Teach me I'm new here
is there a vc where others are also looking at the information about the titanic predictions?
I think it looks like a hackathon project
Is it allowed to use Jupyter notebook or colab notebook?
hello i am new to kaggle can anyone give me overview of the competition like from where to start
Hello! My name is Pankaj and I am an AI and DS enthusiast. I am a student of Data Science at Bellevue University. I am here to start learning and discussing more about Predictive modeling.
I just see a bunch of people talking to themselves lol
https://www.kaggle.com/code/udaken10/feature
My name is Ken, Hi