#titanic
It was really cool, and I really enjoyed participating in it, I even got a gold medal for getting so many upvotes for my notebook
Hah, awesome. Gold medal on titanic is a nice achievement
Thanks Jonathan, I appreciate that
My past experience with Titanic was great. It was what nudged me towards getting a Notebooks Expert title
hello dear kagglers, I'm a complete beginner 101 and happy to join the titanic competition. any Godfather to mentor me? any help will be appreciated
Hi Frank, your test post went through just fine. Going to delete it now to keep the channel on topic.
@desert epoch have you started the titanic exploration yet? I can help
Hello everyone, I am a baby Kaggler. It would be great if someone could guide me with the Titanic challenge
@golden schooner I can guide
@golden schooner if you are still looking for assistance maybe we can link up and go over some things. I use Rstudio
Great
So how do you want to link up? Should we have a Google meet?
I'm not familiar with Rstudio but I can catch up
@golden schooner yes yes.
@golden schooner what programming language are you familiar with?
Python
@golden schooner okay, no problem. Now, is this your first time using Kaggle?
Yes it is ...
Thanks for helping out, @strong pumice !
Has anyone tried k-means on the titanic dataset? I haven't seen it anywhere.
I have tried it, but score around 6.5
I wonder: if I use the Name feature to find people in the same family, would it be possible then?
yeah, for clustering, but they may already have a cluster within themselves.
I have, the MI score was really low though so I don't think it's that useful. I got similar predictions from my model when giving a random number for the cluster.
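A rough sketch of the check described above: add the k-means cluster id as a feature and look at its mutual information (MI) with the target. The data here is synthetic and the cluster count of 4 is an arbitrary assumption, so the numbers say nothing about the real Titanic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for preprocessed numeric features (not Titanic data).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

# Assign every row to one of 4 clusters (4 is an arbitrary choice).
cluster_id = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)

# MI between the cluster id (a discrete feature) and the target.
mi = mutual_info_classif(
    cluster_id.reshape(-1, 1), y, discrete_features=True, random_state=3
)
print("MI of cluster id with target:", mi[0])
```

If the MI is close to zero, the cluster id carries little information about survival, which would match the experience reported above.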
Hi, I'm just wondering if a no-cheating accuracy score of 83% is considered good (as in, I should be proud) for the Titanic competition, or if it's average, high average, low, bad etc. I just want to know if I should keep working on it
Hello everyone, I have recently written an article on HackerNoon about analyzing the Titanic dataset. I hope you'll like it. https://hackernoon.com/how-likely-was-one-to-survive-on-the-titanic
Great work, @rough mortar. Thanks for sharing!
Thanks
Hi everyone, I saw in one of the posted notebooks for this competition where the creator of that notebook changed values into categories before fitting models.
For example, the values of the age feature were changed to 1, 2, 3, 4 where 1 was the youngest and 4 was the oldest and the age feature was dropped entirely.
Is this a recommended practice for numeric columns? Or is it a different way of normalizing data? Can I leave the age feature as-is? Or is this just one of many ways that can be tried before fitting?
I think that feature can be turned into an ordinal feature, so it will be more meaningful than a non-ordinal one
Am I calculating a score right?
from sklearn.ensemble import RandomForestClassifier
X_train = train_data.drop(["Survived","Ticket","PassengerId","Name","Cabin"], axis=1)
Y_train = train_data["Survived"]
X_test = test_data.drop(["PassengerId","Name","Cabin","Ticket"], axis=1).copy()
random_forest = RandomForestClassifier(max_depth=4)
random_forest.fit(X_train, Y_train)
# Note: this is accuracy on the same rows the model was fitted on
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
I checked a score based on X_train and Y_train.
Should I use X_test and Y_test from test.csv?
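On the question above: test.csv in this competition has no Survived column, so there is no Y_test to score against locally. One common alternative, sketched here, is to hold back part of train.csv as a validation set. The small DataFrame is synthetic and its values are made up for illustration; only the hold-out pattern matters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv (hypothetical values, numeric columns only).
rng = np.random.default_rng(0)
n = 400
train_data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 250, n),
})
train_data["Survived"] = (
    train_data["Fare"] + rng.normal(0, 60, n) > 120
).astype(int)

X = train_data.drop(columns=["Survived"])
y = train_data["Survived"]

# Hold back 20% of the labelled rows; the model never sees them during fit.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

random_forest = RandomForestClassifier(max_depth=4, random_state=0)
random_forest.fit(X_fit, y_fit)

acc_train = round(random_forest.score(X_fit, y_fit) * 100, 2)
acc_val = round(random_forest.score(X_val, y_val) * 100, 2)
print(f"train accuracy: {acc_train}%  validation accuracy: {acc_val}%")
```

The validation number is the one that says something about unseen passengers; the training number is usually optimistic.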
Hey, I'm new to ML and I'm working on the Titanic competition. What is the best possible way to deal with the missing cabin values? I don't think I should drop the rows because there are a lot of missing values.
Survival rates by age did correlate to children first, but survival rates by fare group also show that the expensive titanic seats were prioritized on the lifeboats. More insights from @rough mortar's HackerNoon story about his Titanic entry: https://hackernoon.com/how-likely-was-one-to-survive-on-the-titanic
Hey HackerNoon, glad to have you here! Thanks for sharing!
Thank you
is it too late to start this competition? i only found out about it just now
@strong pumice please help me on the Titanic competition
Okay
Hey! I am up for that too, if you want
Can someone suggest the best model for this competition?
@strong pumice Give your time and online place.
@minor turtle thank you for showing interest to help me on the competition. I and @strong pumice have agreed a date for the meeting. Perhaps it will be supper when you are in copy.
Hey guys! Saransh this side, whats up?
If anyone wants to finish this competition from scratch.
I have released my notebook that uses "Polynomial Regression from scratch"
https://www.kaggle.com/code/jackksoncsie/polynomial-regression-from-scratch
What about Neural Networks?
I'm trying to learn ML from the ground up rn and I'm just avoiding NN/DL for now. Just classical ML & ensemble methods.
Imo I've heard ppl say NN is overpowered for structured tabular data like this one
Damn, even though I had some issues understanding stuff from that notebook and would still just import the models from the scikit-learn library, the fact that I was even able to follow it through somewhat tells me that I am in fact learning Python and ML. Not at a high level, but I am learning and that is what matters!
If you showed me this notebook like a month ago, I wouldn't have understood like ANYTHING at all
how to solve titanic data for accuracy 1.0
^^
Hello all, I'm Tamunotonye Samuel Solomon Inioribo,
I am new to Kaggle competition and would like to be part of the Titanic. Though I have done a couple of solo projects on ML, I will be glad to be part of a team for this... I can work on R-studio, Jupyter and other IDEs.
Thanks
hi+
hello
I couldn't find any way to create a team for the Titanic competition. I read the docs but I cannot find the Team Tab or Team section. Can someone help me by pointing out the link or the button to create a team?
Start here! Predict survival on the Titanic and get familiar with ML basics
Let me know if this helps @hearty veldt
Hi! Thanks! But how can I make a team? We are three developers who want to work as a team. We couldn't find how to define or make the team. I am sure that this is a basic question, but we can't figure out how to solve it
Create a notebook, and while saving, you get the option to keep it private. In that tab, enter your team members as the collaborators (with permission to view, edit). Hope this helps!
Thanks a lot ! I could do it
https://www.kaggle.com/code/vanessah26/titanic-79-accuracy-using-rfc
Hi everyone, I'm a CS student and started learning about data science this Summer. I joined Kaggle in the middle of August and I've been learning a lot from this community! I just finished my first competition. Please check out my first notebook and give some advice or feedback.
Much appreciate it : ), Happy data analyzing!
Hello Everyone . Great to be finally active on Kaggle
https://www.kaggle.com/code/aniketsiraswal/titanic-machine-learning-from-disaster
85 accuracy using Logistic-Regression Model
Hello everyone,
In this analysis, I explore the Titanic dataset through Exploratory Data Analysis (EDA), conduct statistical analysis, and build predictive models to understand and predict passenger survival on the Titanic. This project incorporates Kaggle's Titanic dataset for comprehensive insights and predictions.
Please check out my notebook and give some advice or feedback. If you like it, please don't forget to vote for it. Happy data analyzing!
https://www.kaggle.com/code/huseyincenik/titanic-eda-statistical-analysis-and-prediction
Hello everyone,
I am new to Kaggle. Just today I joined the Titanic challenge and I would like someone to guide me. I am really a novice in the field of machine learning.
My first notebook. It is just code, but I spent many hours on getting good accuracy, and I want to share it with all of you. I hope it will be useful for you in this competition! Please review it and give me feedback so I can improve it! Thank you so much
https://www.kaggle.com/code/hoanglongroai/79-accuracy-from-titanic-disaster#Classification-model
Hello @ionic arrow. Your notebook is very good for machine learning. If you want to improve the notebook, you can try to suppress the warnings.
Thank you for your advice, I will fix that
Hi all, I have absolutely no idea how to start this challenge. Any advice on the learning material that I should take?
Try my notebook https://www.kaggle.com/code/hoanglongroai/79-accuracy-from-titanic-disaster#Classification-model
import pandas as pd
# Define the bin edges and labels
bin_edges = [0, 10, 25, 45, 55, 100]  # Define your desired age bins
bin_labels = ['Children', 'Young', 'Adult', 'Late Adult', 'Old']
# Use pd.cut() to bin the Age column (assumes train_data is already loaded)
train_data['AgeGroup'] = pd.cut(train_data['Age'], bins=bin_edges, labels=bin_labels)
do you mean I need categories for the age to start?
I dropped the following columns because I think they are completely unnecessary and cannot be used: Cabin, Ticket, Name. Am I right here? Should I drop Embarked as well?
How do I determine which model and features to use? To me it seems like a regression task. Are there metrics that I can use?
My Titanic Solutions:
My Titanic solution notebook: with hyperparameter tuning, the accuracy is 94%.
Please evaluate it ....
https://www.kaggle.com/code/harshpatelind13/titanic-machine-learning-from-disaster-13
Thanks @glad stone , I have read and liked your Notebook.
Can you explain to me the usage of some of these Classes like:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,classification_report
What does StandardScaler do and how does KNeighborsClassifier work? I would like to know more about them.
I got surprised when you said you are getting 94% accuracy. I checked your notebook and found that what you are doing is incorrect. Your 'y_test' is not the ground truth, you have taken it from the gender submission file which is not correct. You need to submit your model to the competition to get the correct accuracy.
If you want to tune your hyperparameters, then split the initial data set into train and test parts, for example in an 8:2 ratio, and then evaluate the tuning on that held-out test part.
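A minimal sketch of that 8:2 idea, with one variation that is my assumption and not something stated in the message above: the tuning itself uses cross-validation inside the 80% part (via GridSearchCV), and the 20% part is only used for a final check. The data is synthetic and the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labelled training data (not Titanic data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=500) > 0).astype(int)

# 8:2 split of the labelled data.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Small, arbitrary grid; each combination is scored by 5-fold CV on X_train.
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Final check on rows that played no part in the tuning.
holdout_acc = accuracy_score(y_holdout, search.predict(X_holdout))
print("best params:", search.best_params_)
print("hold-out accuracy:", round(holdout_acc, 3))
```

Even this hold-out accuracy is only an estimate; the leaderboard score after submitting is still the reference.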
@chrome rain Thank you for evaluating it and identifying my error.
I have worked on it and now its accuracy comes out at 82.6%.
https://www.kaggle.com/code/harshpatelind13/titanic-machine-learning-from-disaster-13
StandardScaler is used to rescale numeric features so that they have mean=0 and variance=1. Helps avoid giving undue importance to features with large magnitude over those with small magnitude. More details in the scikit-learn docs https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling.
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream esti...
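A tiny illustration of what that rescaling does. The four rows are made-up Age/Fare style values, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two numeric columns on very different scales (made-up Age and Fare values).
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.93],
    [35.0, 53.10],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```

In a real workflow the scaler would be fitted on the training rows only and then applied to validation and test rows with `scaler.transform`.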
Oh, so it is similar to keras.Normalizer()
Also, I was wondering how this scaler function treats categorical values or mapped categorical values.
For example, I converted male, female to 0, 1 and keras.Normalizer() was turning them into a -0.6-ish value and a +1.4-ish value.
Is that ok?
I was also confused as to how you got the accuracy values without submitting the results. It seems like I was right. I didn't notice that you were using gender_submission.csv though.
82.6% is still like 4% better than me. Did you submit it this time?
Ok, I just checked your Notebook and it says your score is 0.76555, which means the accuracy is 76.5%.
I think you are not clear about how accuracy works. The accuracy you get during training is heavily biased and means little even if it is 100%. We cannot conclude much from it, because feeding a model data it has already seen and calculating accuracy on that is heavily biased.
For example, if you only do the example maths from your textbook and, to test your skills, I give you the same maths you always do, those scores are heavily biased even if you get 100%, because you could just memorise the answers and write them down as they are. Only when you are given new, unseen maths can we be sure that your skills apply generally to any maths of a similar nature that you have never seen.
That is why training and testing data are kept separate. You cannot judge a model by the data it has already seen. You have to let it predict results based on new unseen data to get a sense of its prediction ability.
So, you should always submit your model's predictions before telling others your model's accuracy.
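The memorising effect described above can be reproduced in a few lines. This is an illustrative sketch on synthetic data with 20% of the labels deliberately flipped; the unlimited-depth decision tree is my choice of a model that memorises easily, not something from the notebook being discussed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label follows the first feature, but 20% are flipped.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
signal = (X[:, 0] > 0).astype(int)
flip = rng.random(1000) < 0.2
y = np.where(flip, 1 - signal, signal)

X_seen, X_unseen, y_seen, y_unseen = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit, so the tree can memorise every training row, noise included.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_seen, y_seen)

seen_acc = tree.score(X_seen, y_seen)        # rows the model was fitted on
unseen_acc = tree.score(X_unseen, y_unseen)  # rows it has never seen
print("accuracy on seen rows:  ", seen_acc)
print("accuracy on unseen rows:", unseen_acc)
```

The score on seen rows comes out perfect because the noise was memorised too, while the score on unseen rows is noticeably lower.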
Yes, StandardScaler and a keras Normalization layer have a similar purpose.
I don't know whether it's okay to rescale mapped categorical values. I usually separate numeric & categorical preprocessing e.g. with ColumnTransformer.
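A minimal sketch of that separation with ColumnTransformer: the numeric columns are scaled and the categorical column is one-hot encoded instead of being rescaled. The column names follow the Titanic data, but the four rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two numeric columns and one categorical column.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "male"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "Fare"]),                # scale numerics
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),  # encode categories
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)  # 2 scaled columns + 2 one-hot columns
print(features)
```

The same `preprocess` object can be placed in front of a classifier in a `Pipeline`, so the identical steps are applied to the training and test rows.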
Hello! How do you interpret parch and sibsp in the dataset? I'm having trouble interpreting them because they are only numbers. For example, since the majority of the passengers have a parch of 0, does that mean that all of them had a nanny accompanying them, or were they alone? What about adults whose parch is 0? Is it a nanny?
Thanks! Anyways, Can you look into my Notebooks and give me some advice for how to increase their accuracy?
I am currently sad that my accuracy does not go beyond 80% for titanic.
Here is the link: #ask-a-question message
My titanic solution got 74%: https://github.com/valimikayilov/Titanic_ML
Do deep learning techniques work better for the titanic, or do simpler techniques like random forest work better?
I was able to achieve around 80% accuracy with a random forest and hyperparameter tuning
would I get better result using a simple ANN, or would I be better off just using the random forest from sklearn?
you got 80% on the submission?
Hi I am starting titanic competition so I need a team for it
No 80 percent on a testing set
Using stratified split
Hey guys, I got 0.78229 on my first submission. Could anyone look over my code and offer some suggestions? This is my first ML project and I want to also make a YouTube video on how I built it out and such, I know I still need to leave a lot more comments/documentation and clean up a few sections
Hello Everyone
I just finished working on this competition and actually enjoyed it very much!
in my notebook I focused on EDA, feature engineering, and diagnosing missing values.
Feedback is an essential step for learning, so I would love to hear your input, guys!
https://www.kaggle.com/code/leen98/eda-and-feature-engineering-the-titanic-sinking
Hello everyone,
I got a 0.559 on my first submission. I tried to use variations of the dataset but I wasn't able to increase the score. What are the key factors to look out for in data preparation that could increase the performance of the model? I use a RandomForestClassifier and the accuracy on the train data is around 0.9, which confuses me even more, because I don't understand how the difference between the accuracy on the training data (0.9) and the test data (0.559) can be so big. I hope someone can help me with these problems!
count me in!
i saw a zero score after submission. is that even possible?
Probably formatted it wrong
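For reference, a sketch of building a submission in the same two-column layout as gender_submission.csv: PassengerId and Survived, with Survived as integer 0/1. The ids and predictions here are placeholders; in a real notebook they would come from test.csv and from your model.

```python
import numpy as np
import pandas as pd

# Placeholder ids and predictions (hypothetical values for illustration).
passenger_ids = np.arange(892, 897)
predictions = np.array([0.0, 1.0, 0.0, 0.0, 1.0])  # e.g. floats from a model

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),  # integer 0/1, not 0.0/1.0 or True/False
})

# Render the CSV as text to inspect the header and the first rows.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Common formatting slips are a leftover index column, float or boolean values in Survived, or rows in a different order or count than test.csv. Calling `submission.to_csv("submission.csv", index=False)` writes the same content to a file.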
need partners for this project
Let me know what you think, I recorded a full 2 hour walkthrough of my code: https://www.youtube.com/watch?v=6IGx7ZZdS74&ab_channel=RyanNolanData
Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!
In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.
As Kaggle states: "The competition is simple: use machine learning to create a mode...
I want to make a part 2 with improvements, so if you see any mistakes or ways I can make it better please lmk
hello everyone,
I got 76.79% accuracy on Titanic Competition.
link :- https://www.kaggle.com/code/dinanksoni/titanic-76-79
Hey everyone, I scored a 77% on the Titanic Dataset using a random forest. I'm seeing that the top scorers on the leaderboard have a perfect 100%. Would this count as overfitting the data? Is it possible to actually score a 100%?
No it's impossible, they cheat
Let's just say if Olympic sank the next day it would not be 100% :)
Hello Kagglers!! I have just joined Kaggle and this is my very first competition. I was following the tutorial to get started but when I copied the code for women and men who survived to find out the percentage it is showing me an error. I exactly copied it from the tutorial. Can anyone help with this?
completed the challenge...... hope you guys find this helpful
Which tutorial were you using?
And which error is it showing to you?
Hello all, i am new to kaggle and also to data science
starting with this competition now, are there any teams or is there anyone who would like to team up?
I've found it's so hard to get to .80 on the submission haha, but alas that's my goal. Got to 0.78 last night. Anyone wanna chat to see if they have ideas to get my random forest to 0.80?
Here's my 0.78229 score that, for the life of me, I can't improve. Let me know if there's anything I can do to push it forward.
Did u try with any other model?
No, I was trying to get to 80 with only random forest, but if i can't crack the case, I might try another. Do you suggest any?
Knn or support vector might give higher score, u can try once with that
Have you used one of those models with this one?
Would like to..
what's the score for default submission (just copy/pasting the tutorial)?
Yeah, I got a better result with a decision tree, but later I changed the hyperparameters and got a better score for the decision tree as well as for the random forest (here I am talking about the individual score of the model and not the Kaggle submission score)
I am still trying to improve my overall submission score
chatgpt told me you can get high 80s, low 90s without cheating. is that true?
If anyone needs a notebook to look at, just got 0.79: https://www.kaggle.com/ryannolan1/titanic-voting-classifier-0-78947
Releasing the video + notes next week
have the original vid still here: https://www.youtube.com/watch?v=6IGx7ZZdS74&t=3724s&ab_channel=RyanNolanData
Hi, my score right now is 0.78 in the leaderboard, I am using deep learning, my question is if this dl approach is suitable for the challenge or is it better to use traditional Ml like random forest or similar ? thanks !
how do I confirm my account? I can't find my country in the phone number codes to send the confirmation SMS
@obsidian pasture try multiple models. A voting classifier gave me the best results although I didn't use deep learning
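For anyone who has not used one: a voting classifier combines several models and takes the majority vote. A minimal sketch on synthetic data follows; the three base models are arbitrary picks, not the ones from the notebook mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not Titanic data).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.7, size=300) > 0).astype(int)

# Hard voting: each model predicts a class and the majority wins.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=7)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)

# 5-fold cross-validated accuracy of the combined model.
scores = cross_val_score(voter, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```

Whether voting beats the best single model depends on the data, so it is worth comparing against each base model with the same cross-validation.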
Thanks !
Is it possible for two ppl to use the same model but get different results?
yes...
Can anyone get a score above 0.77511? And which method do they use?
Yes I have a 0.79
And way possible to have different scores
@restive kestrel https://youtu.be/KzK1pifa2Vk?si=6umfhORMZyolBXTd
Today we are taking a look at how I was able to improve my Titanic Kaggle score up to a 0.79, which was good enough for the top 9%.
I showcase all the code changes and what I would still improve on, if I had more time.
I'll be adding notes to the Kaggle Notebook if interested.
Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/titanic-v...
thx @safe orbit
Part 1 is also uploaded as well as all the model tutorials so check them all out
Housing predictions is being worked on now as well as writing scores
But writing vid will have to wait till Jan. I won't win but I think it's better that way
Hi everyone! Is this the right page for the titanic competition? First time in Kaggle for me and not an expert on discord
Yes
I wanted to get a sense of how good my result is - using NN with some hyperparameter tuning, scoring 78% tops on the leaderboard. I understand that random forest probably performs better on this dataset, and I'm not currently using ticket #/name to identify groups.
Is that a good score within those constraints, or does that indicate that there are issues with my architecture/feature engineering/etc?
In this video we build a model, which predicts titanic survivors with a decent accuracy.
Kaggle Challenge: https://www.kaggle.com/c/titanic
Was following this tut, but when applying the pipeline for all the preprocessing on the train set it's giving me a reshape error
try to check that all your column names and sizes are the same
Yes, everything is the same as in the video... I don't understand how this tut code always runs for them, but whenever we try it we face problems
I know I'm a bit new, but in order to submit my csv, I should just click "Submit Predictions" and upload my csv, right? I've tried 3 different browsers, different security settings, verified my account, and a different computer, but I can not click that button. Am I missing something?
Hey everyone, I'm new to Kaggle. I've created a deep NN model for this project and I've got 80% accuracy on my validation set. However, I'm yet to fine tune the model so I'm hoping to get slightly better results. I wanted to ask if there's a way to check the bayes error for this project? What is the highest accuracy that has been achieved for this model without cheating. I saw the leaderboards where people have achieved accuracy = 1 , which is certainly not possible.
How could someone cheat?
I'm not sure how could someone cheat, but given the nature of the data, I don't think 100% accuracy is achievable with any model.
Anyone currently working on this project would like to discuss? I've done some fine tuning to the model and so far I'm at 82-83% accuracy. Today I'll build out multiple models with randomly sampling hyperparameter values and I'm targeting to reach 85% accuracy.
Ye a lot of people cheat top of leaderboard it's stupid
On this project
Not all projects
Also @gaunt cedar I'm currently learning PyTorch, would def like to see what you're working on. I did standard ML models from scikit, xgboost and a voting classifier and it got a top 10% score
@shadow cave i might not be able to help you but can i know the issue ?
how to increase accuracy ?
i don't know where to start
@arctic rune thank you for your response. I need help with the feature process of how you could take any value as a feature.
ok...i don't know how to do that
Are you also a beginner?
yes, very beginner
Hello @shadow cave and @arctic rune I just started working with this dataset yesterday after joining kaggle. If either or both of you are interested in working together message me and we can work through it.
@heavy raptor ok
@heavy raptor bro let me know i'm working on it. We can perform it together.
@safe orbit sorry I wasn't available for the past few days. It's great to hear that you're working with pytorch. I myself am interested in learning pytorch for some reasons. Would love to see your working.
Bro over here I am getting exhausted just applying and learning the regular classifiers and you're making NN models?
Lol I'm exhausted too bro. But good thing I've got to 85%+ accuracy.
Same here, everyone is talking about DL and stuff while I am getting confused about the confusion matrix
What's your project status of titanic?
Confused about confusion matrix...
I haven't finished yet.
Can you tell me what is the advantage u get applying NN instead of normal Supervised learning?
What do you mean by normal Supervised learning?
Like just applying classifier algorithms after pre processing
You need to elaborate a little. I'm also using classification using a neural network.
What i want to say is that what are the advantages ur getting using NN instead we can just use regular classification without NN
Why make it more complex?
I'm assuming that you're asking why not just take the input features and pass them through a classification algorithm directly, let's say binary classification. And use the output that we get to make a prediction, right?
Yes
Okay, so here a problem with that...
Let's say you have this data and you have to classify whether an input is 1 or 0...
All you do is train your algorithm using binary classification and it will learn a decision boundary (which is the straight line here) that separates the 2 classes. Now your input gets the value 1 or 0 based on the side of the decision boundary on which it lies.
Makes sense?
Okay so now let's say I give you this dataset...
How do you draw a straight line (decision boundary) to separate the 2 classes?
We can simply use random forest with this
Why do we need a straight line... The distance of all points from the straight line determines the loss
Yes you can, but what if you want to use classification algorithm instead.
All of these algorithms are classification algo
Sorry I shouldn't have said classification algorithm here, I misunderstood your question.
By this you meant use something like random forest instead of NN right?
Yes
Nevermind, I misunderstood what you were saying. But the answer to that is, yes you can. As far as I know, using something like random forest can be way more efficient here, computationally as well as timely. And we may as well get the same results as with a deep NN for this titanic dataset.
I'm just using NN for the sake of practicing.
no prob. So what's the status of your project?
Hmm, I am applying the algorithms, all preprocessing is done
right.
Hey! I'm wrangling with the 'Cabin' data (or lack thereof) in the titanic set. I'm toying with the idea of playing detective, like using ticket numbers or fare details to guess the missing cabins. Or maybe taking a shortcut by plugging in the most common cabin for each class as a starting point, but there are a lot of missing values. I'm curious about any other possible approach. How would you handle this? Looking forward to your insights..
Anyone there???
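One other possible approach, offered only as a sketch and not as something tried in this chat: instead of guessing the missing cabins, record whether a cabin is known at all and keep just the deck letter, with a placeholder for unknown. The rows and the "U" placeholder are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; most Cabin values are missing, as in the real data.
df = pd.DataFrame({
    "Pclass": [1, 3, 1, 2, 3],
    "Cabin": ["C85", np.nan, "E46", np.nan, np.nan],
})

# 1 if a cabin was recorded, 0 otherwise.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# First letter of the cabin is the deck; "U" is a made-up label for unknown.
df["Deck"] = df["Cabin"].str[0].fillna("U")

print(df)
```

The original Cabin column can then be dropped, and Deck can be one-hot encoded like any other categorical column.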
i'm here

why you sad
Nobody responded to my message....
I didn't respond to it because i didn't understand what you were saying and I don't have the knowledge to help you
Ahh I see... :(
Feature engineering?
yeah
what are titanic and spaceship titanic?
so these are like simple datasets to start with as a beginner?
yes
.
If you need help I made 2 vids
THIS IS GREAT ! thank you!
No problem
should you normalize categorical data being fed into a neural network like df[column] = (df[column] - column_mean) / column_std? the numbers are like 0 to 10... normalizing makes sense to me for like Age but for like Cabin prefix I don't know. except that it seems weird to not normalize some columns if I normalize others
Normalization is usually applied to numerical features for efficient training, but it is not typically needed for categorical features. It may seem a bit unusual, but it's entirely acceptable to apply normalization selectively based on the type of data you're working with. For categorical features, especially one-hot encoded ones, the binary nature already provides a kind of normalization.
Right was just reading about one-hot encoding... I think I'll try that one out thank you
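A small pandas sketch of that selective treatment: z-score only the numeric column and one-hot encode the categorical one. The column name CabinPrefix and the values are made up to match the question.

```python
import pandas as pd

# Made-up rows: one numeric column and one categorical column.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "CabinPrefix": ["C", "U", "U", "E"],  # hypothetical prefix labels
})

# Normalize only the numeric column, using the formula from the question.
df["Age"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["CabinPrefix"], dtype=float)

print(encoded)
```

Each indicator column is already in the 0 to 1 range, so the network sees inputs on a comparable scale without treating the category codes as ordered numbers.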
Hi everyone, I'm a uni student trying to get started in data science. What sort of prerequisite knowledge would I need to get started, specifically for this (Titanic) competition?
I think you should learn some basics of Python and sklearn, not too much, just a little. After that, you need to take some machine learning courses, and Andrew Ng is a nice teacher.
Hii all I am Bhimana i am new to kaggle and ML competitions . Hope i will mingle with you soon
I want to learn all about this machine learning and data analysis from where should I start?
I don't have any prior knowledge about coding stuff where I should start from...
Hey try starting with learning pandas and Numpy and some basic knowledge on statistics
@hollow brook https://youtu.be/6IGx7ZZdS74?si=_7O8l1JJTPNHc8AE
This is my titanic project. Everything in this vid I've covered on my YouTube channel
Hi everyone, I have a question: how come there are a lot of 100 percent scores on the leaderboard? As a beginner, to me that sounds practically impossible unless the test data is exploited in some way. I have built many models with feature engineering and 82 percent has been my highest score.
They cheated
Yeah, that seems to be the case. Then why don't they get removed from the leaderboard?
It's a training competition
In how many hours can this be done
Hi everyone, I have a question not directly related to this competition but more about embedded models. I've heard that for most of the competitions, to get a decent result (0.9), embedded models are the right way. Is that true?
?
I have a score of around 0.779 in LB. Does anyone have any idea on how to improve it?
how is the name column given? is it with place and name?
No, name column is distinct and consists of Mr/Ms name
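To make that concrete, a sketch of pulling the surname and the title out of Name. It assumes the usual "Surname, Title. Given names" layout; the three example strings are written in that layout for illustration.

```python
import pandas as pd

# Example names in the "Surname, Title. Given names" layout (assumed format).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Everything before the first comma is the surname.
surname = names.str.split(",").str[0].str.strip()

# The title sits between the comma and the first full stop.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(pd.DataFrame({"Surname": surname, "Title": title}))
```

The title can be used as a categorical feature, and the surname is one possible starting point for grouping passengers who may belong to the same family.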
Hi guys! I need some advice.
I have some knowledge of sklearn, pandas, numpy and matplotlib, but I can't measure it. I want to understand where I am and how I can measure it.
Maybe you can recommend some courses which cover the basic/intermediate/advanced level of using these libs.
Thanks a lot!
Hello there !
I am new to data analysis
I have a question: let's say I download the Titanic data, okay, what should I do? Let's say I calculated the median for the data, what is the purpose? I don't have the logic of a data analyst. Could someone please help me or give me resources to learn these concepts?
I will watch that, I think I will find answers to my questions
Thank you
sweet, and I have 2 parts so should help
i've been getting 0.57177 for the kaggle submissions even if the .csv file had different results. anyone else experiencing this?
You should learn python
I've recently done the tutorial Titanic Competition, and wanted to redo it with an ML model. However, my model is now getting a 0 public score. Idk where I'm going wrong or how to test...
Here is the link to my notebook https://www.kaggle.com/code/abishekjayan/this-is-where-it-starts
I just saw some code for titanic. It's too overwhelming. Is it always like that?
The titanic code is one of the simpler datasets and codebases. Which code are you looking at? The top rated notebook is pretty simple to follow. It uses a RandomForestClassifier model to do the prediction.
Hi, I made a notebook for beginners on building a model using the OpenAI API.
https://www.kaggle.com/code/yutodennou/tips-open-interpreter-titanic
With this notebook, a good XGBoost model was generated automatically for the Titanic example.
Hey @cobalt grove, I would like to answer your questions:
-First, try to understand the data given to you: what each feature tells you about the passenger, what SibSp means, etc.
-Get the data type of each column and determine which columns are continuous and which are categorical.
-Check for null values and deal with them; SimpleImputer is the basic way to do this. Use the median for continuous data and the mode for categorical data.
-We use the median rather than the mean because it is less affected by outliers (a boxplot makes this visible).
-Then learn which graphs are used to compare which types of data.
For prediction, use ML algorithms.
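The inspection and imputation steps above can be sketched roughly as follows. This is a minimal example on a made-up five-row frame that only borrows Titanic column names, not the real data. The numeric columns go through SimpleImputer with the median; for the categorical column the mode is filled in with plain pandas here, which is an assumption of this sketch (SimpleImputer with strategy="most_frequent" is the scikit-learn equivalent).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up sample using Titanic-style column names (not the real data).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 80.0],
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

print(df.dtypes)        # which columns are numeric and which are text
print(df.isna().sum())  # number of missing values per column

numeric_cols = ["Age", "Fare"]

# Median for continuous columns: the 80-year-old outlier barely moves it.
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Mode (most frequent value) for the categorical column.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

In this sample the mean age would be pulled up to about 40.8 by the single 80-year-old, while the median stays at 30.5, which is the reason given above for preferring the median.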
Hi all, the Titanic ship which was a British passenger, sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during…
Refer to my Medium article for a detailed explanation
Yeah dude any doubts just ping me on my dms
Thank you
Hi friends
I have recently completed my Data Science, machine learning and AI course. In the final exam we had 25 questions and I got 23/25 correct.
I also played with the heart.csv dataset, the Titanic dataset, and other datasets from Kaggle through Google Colab and a mobile terminal application.
hey @everyone, even though I'm in the top 10% I still didn't get a bronze medal in the Titanic competition, what's the deal?
my notebook : https://www.kaggle.com/code/ayeshairshadcoder/titanic
Not sure, but it could be that medals aren't awarded on Titanic as it's a recurring/rolling competition.. just my guess though
Hello guys, I just started Kaggle competitions with the Titanic problem. I can't figure out which algorithm to apply here. How do I succeed in learning through competitions?
Hey @twilit cloak, you can refer to my Kaggle notebook, which has full, clear instructions.
https://www.kaggle.com/code/umbro10/titanic-data-analysis-0-78-accuracy
Any doubts just ping me
Hi, I made a notebook for beginners that builds a model using the Gemini API.
With this notebook, a good ensemble model was generated automatically for the Titanic competition.
https://www.kaggle.com/code/yutodennou/tips-auto-model-generate-by-gemini-api
Good Afternoon, I'm new to the whole competition thing, I wanted to know if I have to build the model from scratch.
No, you just use external libraries and call them
how can I improve the accuracy ? i used logistic regression function which is built in sklearn.linear_model. when i counted the number of correctly classified examples and divided it by the total number of examples, i got 0.6067
thanks a lot
By doing a lot of things: changing or improving your feature engineering, adjusting the parameters of your models, and tuning hyperparameters
Is there any way to see other's code? I really want to know what did the 1.0 accuracy one do differently?
It is supposedly impossible to reach an accuracy of 1.0, the guy is probably cheating to get that result.
@wet drift 100% is impossible, but I got a top 10% score and shared my code
Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!
In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.
As Kaggle states: "The competition is simple: use machine learning to create a mode...
Today we are taking a look at how I was able to improve my Titanic Kaggle score up to a 0.79, which was good enough for the top 9%.
I showcase all the code changes and what I would still improve on, if I had more time.
I'll be adding notes to the Kaggle Notebook if interested.
Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/titanic-v...
What about top scorers? Isn't there a way to see their code?
Thanks a lot bro. I will definitely check it out.
only if they shared it
Okay, Thank you once again
Saved your Playlist for future purpose. Thanks
no problem
Will try this competition
GL
get the answers
ez
why is it impossible?
@safe orbit @gilded tusk
And how can you cheat for a competition?
No model will predict 100% as some of the results are random
People hard code the results
Change excel file
Then why does that work for some competitions and not for others?
For instance, MLCV doesn't have 100% in #spaceship-titanic but rather almost 99%
Sorry if dumb question, new to kaggle, if it's known to be impossible to get 100% why aren't the 1.00000 scores simply removed from the leaderboard?
It gives the impression to newcomers that 100% is achievable.
It's not impossible: even if the results weren't datamined, a model could just get very lucky and guess through the noise. In practice, if a model is ever getting 100%, something is probably wrong with the testing methods as well; a perfect score is a bad sign in data science.
Also people (read: interviewers) care about this problem about as much as hello world.
I trust a 95% so much more than a 100% in almost every real-world problem. BUT technically it is theoretically possible, I would just have to do an ablation study to examine how the model's able to achieve that score as well as a comprehensive study of the test set.
Great info thanks!
Np, now that I scroll back there are a lot of questions on this.
does anyone know the theoretical highest score achievable with logistic regression?
i'm following andrew ng's course and implemented my own model, and have gotten to 76.79% accuracy but hope to get to mid - 80s at least
but I'm not sure if this is possible with just logistic regression and some more feature engineering, or if I need to use an entirely different model
is there a way to know that?
like it would be cool to know it for sure, the theoretical highest score achievable with a model in particular
including all the set of the possible hyperparameters and feature engineering you can do
short answer from what i've googled is that no
long answer is:
plus these sets are possibly infinite
No, there is no limit to the number of values even a single continuous hyperparameter can be, and the solution space is only approximately smooth. That coupled with the randomness injected in during many ML algorithms by design makes this impossible.
But, that is an interesting question.
I suppose you could limit a continuous hyperparameter to the 64-bit* float limit on most machines but that would still only be an approximation of reality.
oh guys i kinda meant practically instead of theoretically
has anyone managed to get good (mid 80s i mean) results with just logistic regression on this channel?
I suggest reviewing public notebooks related to this dataset
maybe someone did get a score close to that using logistic regression
@thin epoch looks like you got lucky this time
damn i should've just looked up haha, thank you so much
i shall inspect this man / woman's code
i also have been watching some videos and I realized how deep feature engineering really goes
i should probably try doing more of that
yeah feature engineering is in many cases the factor that makes your accuracy improve substantially
quick question -- do you guys combine ur testing and training data when figuring out columns (for stuff like one hot encoding, etc..) the reason I ask is because for example sometimes the training data doesn't have certain values within a column that are present within the testing data, and this causes issues with one hot encoding
like is that considered a good practice
Hello, this is Bing. I can help you with your question about handling categorical variables with different values in training and testing data.
Generally, it is not a good practice to combine your testing and training data when figuring out columns, because this can lead to data leakage, which means that your model may learn information from the test data that it should not have access to. This can result in overfitting, which means that your model performs well on the test data, but poorly on new data.
One way to handle categorical variables with different values in training and testing data is to use label encoding, which means that you assign a numerical value to each category, such as 0, 1, 2, etc. This way, you can avoid creating too many new features with one-hot encoding, and also handle the case where there are new categories in the test data that are not in the training data. However, label encoding may introduce some ordinality, which means that the model may assume that there is some order or ranking among the categories, which may not be true.
Another way to handle categorical variables with different values in training and testing data is to use feature engineering, which means that you transform or create new features from the existing ones, based on some domain knowledge or analysis. For example, if your categorical variable is related to time, such as "era", you may be able to convert it to a numerical variable by using the year or the period as a proxy. This way, you can reduce the number of categories and also capture some meaningful information from the variable.
Have a nice day!
Source: Conversation with Bing, 10/01/2024
tldr; no
If I can't train on it, it does not exist to me.
thanks
i may be incorrect but I don't think this has 85% accuracy... I ran your code and it gave me a submission with 76%
please let me know if I am mistaken
you probably aren't
Three days of constant effort
aids feature encoding
and I have reached 0.79425
I'm a bit disappointed because I expected better results tbh for the amount of work I put in, but at least this is somewhat closer to my goal of 82%
and now I increased the number of iterations, which should make the model better, and the score went down to 75%
did you use ann
no not yet
there are 12 columns in train.csv
so do I need to make an AI that takes 11 inputs and spits out 1 result?
```
Feature names unseen at fit time:
- Age_0.17
- Age_0.33
- Age_11.5
- Age_18.5
- Age_22.5
- ...
Feature names seen at fit time, yet now missing:
- Age_0.42
- Age_0.67
- Age_11.0
- Age_20.5
- Age_23.5
- ...
```
does anyone know what's wrong with age in the column for test.csv data?
oh... is it because tree models are basically key-value path like structure
it needs to know exactly THAT 0.17 age in the previous training data
ok I get it now
ok so tree model is stupid
or not really meant for this type of prediction
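One possible reading of the error above, assuming Age was one-hot encoded (the original code isn't shown, so this is a guess): each distinct age becomes its own column, and train.csv and test.csv contain different ages. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; the ages are picked from the error message above.
train = pd.DataFrame({"Age": [0.42, 11.0, 20.5], "Sex": ["male", "female", "male"]})
test = pd.DataFrame({"Age": [0.17, 11.5, 18.5], "Sex": ["female", "male", "male"]})

# One-hot encoding Age creates one column per distinct value, so the two
# frames end up with different column names.
bad_train = pd.get_dummies(train, columns=["Age", "Sex"])
bad_test = pd.get_dummies(test, columns=["Age", "Sex"])
print(sorted(set(bad_test.columns) - set(bad_train.columns)))

# Keeping Age numeric and encoding only Sex gives matching columns.
good_train = pd.get_dummies(train, columns=["Sex"])
good_test = pd.get_dummies(test, columns=["Sex"])
print(list(good_train.columns) == list(good_test.columns))
```

If that is what happened, the mismatch comes from the encoding step rather than from the tree model itself; a tree can split on a numeric Age directly.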
Hey all
Finally got discord auth to work, excited to discuss titanic with people who work on it
I got .791 accuracy, not sure if that's good, but it took me a while!
On my local machine I got an accuracy of 92.3 using the XGBoost model, but when I upload the CSV file it says I have 0.58 accuracy
Can anyone please tell me what the problem is?
@storm saddle you're overfitting the training set
your code is mastering the training data at the expense of generalizing to the test data
also, with a score that low, keep in mind you're doing worse than just saying female = alive, male = dead, which gives 75% acc
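The "female = alive, male = dead" baseline mentioned above can be written in a few lines. This is a sketch on three made-up rows; with the real data the frame would come from test.csv instead.

```python
import pandas as pd

# Made-up rows for illustration; in practice: pd.read_csv("test.csv").
test_data = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Predict 1 (survived) for every woman and 0 for every man.
baseline = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": (test_data["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model that scores below this rule on the leaderboard is worth debugging before tuning.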
how much did you actually get?
got that point
i got to .791
I could've gotten farther but moved on
fine bruh, got it, will take care from next time.
ok
I just started with this dataset, and I'm still a bit lost, can someone help me
@remote loom What's up
I'm just starting out and this is my first project. How do I figure out which metrics to use and predict survival? The problem I did before this was a car sales one
yo
well the metric you'll want to train on is the one they're going to judge you on, which is accuracy
what shud i do here?
@remote loom you're confusing your test dataset with y
Think of it this way. You're given X_train, y_train (the Survived labels provided on training set). For test you're only given X_test, not the y_test labels
I assume you're trying to split your training set into training and validation? That would look more like:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)
(Also I'd recommend specifying stratify=y_train in that, but we can get to that once you fix the basic misunderstanding)
Here's an example which should make it clearer:
import pandas as pd
from sklearn.model_selection import train_test_split
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X_train = train_data[features]
y_train = train_data['Survived']
X_test = test_data[features]
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2, random_state=1)
What is the baseline for titanic competition? What percentage of model prediction is considered good for this challenge?
I started the Titanic competition just about a week ago. Why does it say I crossed the deadline? How many days are allowed for submissions in this competition?
Start here! Predict survival on the Titanic and get familiar with ML basics
80% is an impressive score. 75% is an easy score, as simple as just looking at gender. 85% is about the best you can get without cheating or super-luck
Thanks Ben.
Hey everyone, this would be my first kaggle competition. So before diving into it, I just wanted to confirm if I understood the problem statement correctly.
So the idea is to calculate the survival rate of the passengers. Any of the features can be used; gender is one of them, which I guess is the default.
Can we use any model to implement it? Like I went through some courses which showed Linear, decision tree, etc?
Am I missing anything?
Nope, you're correct. You can use any model you see fit to predict which passengers survive and which ones perish
What is the purpose of the YOLO model under the "Models" section of the competition? https://www.kaggle.com/competitions/titanic/models
I know that we can use whatever model we would like, so just wondering (particularly because it seems that it is an image recognition model)
Also, what is the performance metric of this competition? Is it just submission based?
Performance metric is based on accuracy of submission yea
not sure what the yolo model is
how do people get 100% percent accuracy ? is that rly possible ?
It's trivial to just look up the real answers for who lived/died and submit them to get 100% without using machine learning. That's why this is just a tutorial competition and not a real one. You should just ignore people who have 100%, all results on titanic are wiped every few months anyway, it's just for learning.
ohhh ok , thanks !
Hey everyone! I've just started working on this competition properly and have a question.
Given that the dataset is already split into training and testing sets, does it still make sense to use cross validation?
In my head it feels like using cross validation on the training set won't be very beneficial because each iteration will have relatively small training sets
Anyone want to collaborate? I am very new at this
Yes, it still makes sense to use a validation split so that you know how your model performs on unseen data. In a real-life project, the test data is not touched until the very end, when it is used to evaluate the model's accuracy. On Kaggle, I think there is a limit on the number of submissions in 24 hours, so if you want to try more models than that limit you won't be able to evaluate them all by submitting. So it's advised to use cross-validation.
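A rough sketch of what that looks like with scikit-learn, using a synthetic dataset of about the same size as train.csv (891 rows) in place of the real features. The model choice and the five folds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared training features and labels.
X, y = make_classification(n_samples=891, n_features=7, random_state=0)

# Five stratified folds: each model trains on about 80% of the rows
# and is scored on the remaining 20%.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate, no leaderboard submission needed
```

With five folds each training set still holds roughly 710 of the 891 rows, so the folds are not as small as the question above feared.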
Got 72% accuracy using RandomForestClassifier, is this good?
I think you can improve the accuracy with some tuning
I'll try my best
Hey guys
So I was trying something cool wanted to share with you all.
I used LLM (GPT4) to predict the outcome of each row, I fed a good set of facts about our data and I am getting an accuracy of 78%.
If someone wants to know more feel free to dm, might share soon on twitter too
"input": {
"PassengerId": 944,
"Pclass": 2,
"Name": "Hocking, Miss. Ellen \"Nellie\"",
"Sex": "female",
"Age": 20.0,
"SibSp": 2,
"Parch": 1,
"Ticket": "29105",
"Fare": 23.0,
"Cabin": null,
"Embarked": "S"
},
"output": {
"Analysis": "The passenger whose details we are analyzing was a young adult (20 years old) female belonging to the 2nd passenger class. She boarded the Titanic from Southampton, she was not traveling alone but had two siblings or spouse on board, and one parent or child. Her fare belongs to the 3rd quartile. Unfortunately, we don't have data on the cabin assigned to her.",
"conclusion": "Based on the provided data and the survival rates, we know that females (74.2%) had a much higher survival rate than males. In addition, while the overall survival rate in the 2nd class was lower than the 1st class, young adult female passengers in the 2nd class had a respectable survival rate of 43.28%. The survival rate from Southampton was 33.7%, and for those traveling with family, it's slightly better at 50.6%. Being in the 3rd quartile fare might have slightly improved her chances as its survival rate is 45.5% compared to lower fares. Lack of cabin data gives her a lower rate of survival at 30%. Given the survival rates, the chances of this passenger's survival seems quite positive.",
"Answer": "Survive",
"Facts Used": "Gender, age, Pclass, Embarked, SibSp, Fare, Cabin",
" Additional Facts that might have helped": "Deck level would have given us more insights into the survival rate, as would knowing more about the relative ages and classes of her siblings/spouse and parent/child on board."
}
Haven't yet fine tuned the model, that would help give parameters the weightage
Hey everyone, I'm a beginner in machine learning and I've been working on Titanic competition lately. I used logistic regression and my score is 0.59.
I believe it is overfitting. I tried regularization and selecting fewer features but it didn't work. What can I do to improve my model?
I have basically the same problem, only that I'm using a different model.
hi guys, i just want to check for this dataset is it possible to get score 1?
just by using ML
I know this question has been asked multiple times, sorry for repeating it
@woven crater Getting a perfect score using ML is unrealistic / basically impossible. If you think about the nature of the problem you can also intuitively understand this - survival on the titanic was not something you could perfectly predict from the information you are using to infer these judgements.
Scores of 1 on the leaderboard are simply people looking up the answers (since it's a real historical event). This is why the titanic is simply a tutorial problem.
Thanks for answering my question!
Hello everyone, I am taking my first steps in data analysis with Python and I don't know how to solve this. I guess I am making a mistake with the location of the file, but the only thing I did was copy from the tutorial. Sorry for the inconvenience and thank you very much to all.
Hi Francisco. Can you confirm if the data set train.csv is present in the input section?
I have created a notebook from the competition and executed the commands same as yours. It works perfectly fine for me
Hi Yogita, I don't know how to do that, I'm lost
Could you try the desktop/website version once? There the "Input" section is in the top right corner. I have never used Kaggle on mobile
Thanks a lot Yogita
Hey, I am new to ML and learning it by watching YT videos and MOOCs. I am looking for a mentor/guide/buddy who can share their experience with me and help me become a better ML practitioner.
Hey guys, I wanted to ask which classifiers I should use to get higher accuracy. So far the highest accuracy I have got is 0.7799 with an optimized Decision Tree. I have also used Logistic regression, KNN, SVM, and Random forest, but they got lower accuracy.
That's about the highest possible, dude. Other techniques involve merging in other datasets to get more details and so predict with higher accuracy
How can we merge other datasets? Did you mean shuffling of datasets or just merging other public Titanic datasets? If so, doesn't that violate the terms of the competition?
I also heard there is a way to merge different ML models to achieve greater accuracy? Is this true?
There is an extension of this dataset which people use to get the higher accuracy
I see
i scored 0.78229 in titanic predictions. how are people scoring 1.000???
maybe it's an overfit result
This was my first competition. I tried XGBoost and obtained an accuracy of 0.83. Since I am new to Data Science and ML, can you give me some feedback on my work?
my notebook: https://www.kaggle.com/code/davidg960/xgboost-classifier
Really enjoyed doing the competition tutorial.. are there little hints on techniques to try to boost my score? I don't want the answers, just a nudge in the right direction
Right now I am using Google Gemini to help me out
Here is a notebook on feature engineering that I found useful: https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook?scriptVersionId=27280410
Hey guys, I'm new here and managed to get 84% at best. Is that a reasonable score and should I move on to something else, or would you advise me to keep looking for improvements?
actually, that 84% is on my validation set. When submitting, I'm at 75%, which is 2% lower than my previous attempt at 77%
I'm a bit confused: I've spent time trying to improve the validation score, making sure there was no data leakage, and it ended up being worse on the test set.
What's the recommended process to avoid that in the future?
Hey guys, I uploaded a notebook on missing data in Titanic.
Explore the notebook now and be part of the quest to reveal the untold tales hidden within the depths of the Titanic dataset!
https://www.kaggle.com/code/sakshisatre/titanic-s-missing-data-navigating-null-values
Can we download the datasets and run it on our own computers? I have a windows computer and linux computer.
My windows computer is a Dell PowerEdge R720 server running Windows Server 2022. 2 x 10 core processors threaded which makes it 40 threads. 3.5 TB HardDrive and 128 GB RAM.
Yep!
I was trying to follow the tutorial on this page: https://www.kaggle.com/code/alexisbcook/titanic-tutorial and saw that the package sklearn is deprecated on PyPI.
When I ran pip install sklearn it says it is deprecated and to use scikit-learn.
Nevermind. I had to install other packages.
If anyone wants a vid to follow, I made this a few months ago: https://www.youtube.com/watch?v=6IGx7ZZdS74 Also have a p2 improving my model
I have a question - why doesn't a pytorch model work for this competition
I tried making a neural network and getting predictions out of it, but I repeatedly get a score of 0
i do get an f1 score of 0.65
but my accuracy is still 0
i would greatly appreciate the help!
Sorry it's late, but give this a read:
https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great
I was super confused at first too and I thought that my scores were pretty lacklustre.
Damn even though this was not addressed to me - Thanks for this! I was one of those folks who submitted float values and got a 0.00000
Ayo guys I had a question what is a good score for a beginner in this competition?
Nvm I saw the message above
Anything above 0.78 I'd say
Hahaha yeah same
Damn I got 0.68899 -need to refine it
I have a pretty high standard for good haha don't sweat it too much
no because i guess the modal submission has 0.75 so you are right
I toned my model down that may be a reason why I got a lower score (maybe)
I'll try to get at least a 0.75 - refining
I got near 0.61 and 0.62 for 1st two tries and 0.71 for the third since I changed my algorithm a bit
I got 0.67 yesterday)
It was my first attempt and I only know about LinearRegression. Got float results and converted them with 1 if abs(x) >= 0.5 else 0
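Turning float outputs into 0/1 labels can be done in one vectorised step. This sketch uses made-up values and a plain `>= 0.5` cut-off; note that this differs from the abs(x) rule in the message above, which would label a strongly negative output such as -0.7 as a survivor.

```python
import numpy as np

# Made-up regression outputs; LinearRegression can go below 0 or above 1.
raw = np.array([0.91, 0.12, 0.50, -0.03, 0.49])

# Everything at or above the threshold becomes 1, the rest 0, stored as ints.
labels = (raw >= 0.5).astype(int)
print(labels)  # [1 0 1 0 0]
```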
took a couple tries but got it to 0.75
What are the next steps to improving? I got a 0.77, and I basically just did everything that I learned and took notes on from the pandas and intro to machine learning kaggle courses, so I just did a random forest with optimized max_leaf, and filled NaN values with the mean, and that was it
Hello, I am new here, and trying with this "Titanic dataset". Upon examination, I've noticed a significant number of missing values in the 'Cabin' column. I believe that the cabin data could potentially offer valuable insights into survival probabilities. So, I am stuck in a dilemma regarding whether to discard this column or not. what are your thoughts on this matter?
I have got 0.784 but I can't go beyond that. I did feature selection and hyperparameter tuning too
Try feature selection on it I think
Edit: the new one I submitted is 0.78947
Hey all, new to ML and this is my first ever contest. Just learned a bit of KNN and Naive bayes and I could achieve an accuracy of 0.665. Long way to go it seems!
What's the minimum accuracy for it to be considered "good" informally? 0.75?
What algorithm are you using?
Changed LinearRegression to LogisticRegression with the same feature selection and got 0.77
Random forest classifier with feature selection and hyper parameter tuning
damn
thanks
lemme guess scikit learn
Yes
Hi everyone , I used Random forest classifier too , and got a score of 0.97 , with 7 features and pd.get_dummies () for categorical features. How can I check my model is not overfitting?
How much score did you get with Linear regression?
0.67
LinearRegression.score(X_test, Y_test) is this the metric we have to measure??
Your 0.97 is the actual leaderboard score given by the competition, right? If I understand overfitting correctly, it is not possible to overfit on test data when you don't have the true test results. Overfitting happens when your training score is very high because the model memorized the training data along with its noise, and then it fails to generalize. But if you are running it on the test data, where you aren't even provided with the true answers, and they check it and give you a score of 0.97, then I feel like the 0.97 cannot be an indication of overfitting, right?
I am new too, so take what I am saying with a grain of salt.
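A common local check for overfitting is to compare accuracy on the rows the model was trained on with accuracy on rows it has never seen. The sketch below uses noisy synthetic data rather than the Titanic features, and the split size and model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 20% randomly flipped labels, so some noise cannot be learned.
X, y = make_classification(n_samples=891, n_features=7, flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows seen during fitting
valid_acc = model.score(X_valid, y_valid)  # accuracy on held-out rows
print(train_acc, valid_acc)
```

A training accuracy far above the held-out accuracy suggests the model has memorized noise; a leaderboard score is a third, independent check on top of these two numbers.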
a 0.97 score is great on Kaggle
I got a 0.75 but I used a neural network
If your leaderboard score is 0.97, then you're all good.
Same, i got 0.76 to 0.79 everytime i used a NN
Damn I didn't know we could play games
Damn nice- how big was your network and what activation functions did you use?
(32, relu, validation split 20%) × 3 layers and last layer sigmoid iirc
Damn nice. I guess I have a very big NN cause mine goes to 128 and I don't use sigmoid (I use it only when I need to get the prediction labels)
You use BCE loss or BCE with Logits loss?
hi guys, im doing a project for class on this dataset and need to understand how the data was collected and the original purpose of it. I couldn't find anything besides competition rules on kaggle. Does anyone have an idea?
Bce loss
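For context on the question above: in PyTorch, BCELoss expects probabilities (the output of a sigmoid), while BCEWithLogitsLoss expects raw scores and applies the sigmoid internally. The NumPy sketch below is a hand-written illustration of that relationship with made-up values, not the PyTorch implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Expects probabilities in (0, 1), i.e. the output of a sigmoid layer.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_with_logits(z, y):
    # Expects raw scores; a numerically stable form of bce(sigmoid(z), y).
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

logits = np.array([2.0, -1.0, 0.5])   # made-up raw network outputs
targets = np.array([1.0, 0.0, 1.0])   # made-up labels

print(bce(sigmoid(logits), targets))
print(bce_with_logits(logits, targets))
```

The two printed values agree, so a network without a final sigmoid paired with the logits version trains on the same loss as a network with a sigmoid paired with plain BCE.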
It's literally the titanic. the data must have been collected by the rescuers for obvious purposes
Make a for loop and try for 3 layers with 16, 32, 64 and 128 nodes each layer. Check the loss at each and you'll see that the 32 node one probably has the least. Also in this for loop you can control different learning rates, num of epochs, validation split, etc etc
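The loop described above might look something like this. To keep the sketch short it swaps in scikit-learn's MLPClassifier instead of a hand-built network, uses synthetic data, and compares validation accuracy rather than loss; all of those are assumptions of this example.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small networks may stop before full convergence; silence that warning here.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, y = make_classification(n_samples=600, n_features=7, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

results = {}
for width in (16, 32, 64, 128):
    # Three hidden layers of the same width, ReLU activations.
    model = MLPClassifier(
        hidden_layer_sizes=(width,) * 3,
        activation="relu",
        max_iter=200,
        random_state=0,
    )
    model.fit(X_train, y_train)
    results[width] = model.score(X_valid, y_valid)

best_width = max(results, key=results.get)
print(results, best_width)
```

Learning rate, number of epochs, and the validation split can be added as further loop variables in the same way.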
wdym by look at the loss for each layer? Also how many input features do you take? I take 6. Also do you use batches?
I don't remember how many features i took in but yea around 7 8 ignoring names and some others. And I didn't mean loss for each layer, i meant for each model.
ah ok
hi everyone, i wanna do the titanic competition with a friend but we cant seem to find a way to team up
according to some forums there should be a 'team' tab next to the rules tab, but there isnt
and when I tried to share my Kaggle notebook with him, I wasn't able to enter my phone number. Clicking into the field did not allow me to enter digits
Hey everyone, I'm currently stuck at 0.82 acc using Logistic Regression. I'm using 7 features, mapped the categorical ones and used the mean to fill the nans in the numerical columns. Should i try to improve the model even more or try another one? Any tips for improving this one?
Hello everyone! I'm new to Kaggle. I recently worked with a few other students on this competition. But I just noticed, this one doesn't have a "Teams" tab. How do we submit as a team for this competition?
invite them
as i said, there's no team tab
also this link https://www.kaggle.com/competitions/titanic/team seems to lead to a 404 page
that's weird! it's working for me.
I think you should contact support for this one
huh. okay thanks ^_^
sometimes it happens, but it goes back to normal after some time
it happens a lot to me with datasets
Hey, need help
I can't find the option to join anywhere
I can't see the teams option either
Hey guys,
I am a beginner and I need help fixing my TensorFlow model that I created to participate in this competition. I have described everything in detail here: https://www.kaggle.com/competitions/titanic/discussion/502611. Could someone please take a look?
yeah. same problem here... maybe you could send the support team a message but i doubt they would answer
Hi! I'm new to Kaggle. I'm looking for teammates for the Titanic competition, and for anyone kind enough to help me with machine learning
hello, i am also starting with this titanic dataset
maybe we can work together?
Me too! Can we work together?
Same here, let's work together
me too
let's start
I'm starting today
I am new in this field and I am starting with
IBM AI Engineering Professional Certificate course on Coursera(https://www.coursera.org/professional-certificates/ai-engineer)
any suggestion?
hello i am also starting with titanic dataset
Yeah, me too, I started just now
Hi, I also started with the Titanic dataset
I'm just confused about how to start
Hello guys, I am brand new on Kaggle and just finished this challenge as my first challenge ever. I got a decent score of round about 83% accuracy. Now I am looking for improvements on my methods. Is there some common ground on how this problem should be approached? I am basically looking for a state of the art or best practice version where I can learn some new tricks.
Hi guys
I also did my first submission just now on titanic.
I scored a 0 though. Idk what I did wrong.
For some reason, the actual accuracy (not the train.csv one but the test.csv one) of my bernoulli NB model is 100%
Can anyone help me with this?
Did your answer CSV have the two columns?
Or that
Did you convert to ints? @mortal marten
Yes. It doesn't accept if it has less or more than passengerid and survived
I converted to floats. Does it make a difference?
The survived needs to be int
I see. I converted everything to floats
What about this?
I actually used many types of models just to compare them. My xgboost random forest model also got only one prediction wrong.
Hello Jinay Vora, I also worked on it 2 days ago
What was your method
I did it by decision tree classifier
Nice. I found my problem. As stated by @pale field I submitted 'Survived' as float datatype instead of int. I corrected it and I got 0.77 score. I still don't know why my model predicts all of them correct.
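The fix described above, casting 'Survived' to int before writing the file, can be sketched like this. The IDs and predictions are made up, and the CSV is built as a string here only so the result can be printed; passing a file name to to_csv writes the actual file.

```python
import numpy as np
import pandas as pd

passenger_ids = [892, 893, 894]          # made-up IDs for illustration
predictions = np.array([0.0, 1.0, 0.0])  # float output from some model

# Exactly two columns, with Survived stored as integers (0 or 1).
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

csv_text = submission.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```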
I used 19 different models so that I can compare all of them with each other.
1 model got 100%. 1 model only predicted one wrong.
Not the training accuracies but the actual accuracies
Training accuracy was like 86
Did you use logistic regression
@sweet shore I used
Gaussian NB
Bernoulli NB
Complement NB
Multinomial NB
Decision Tree Classifier
Random forest classifier
Xgboost classifier
Xgboost random forest classifier
Adaboost classifier
Logistic regression
K nearest neighbours
Bagging classifier
Hard voting classifier
Soft voting classifier
Stacking classifier
SVC classifier
And 3 more I don't remember.
Hi anyone looking to work on a dataset together.
I'm very new to kaggle and wanted to kick start my journey.
hey guys, I was working on this dataset and I have
used Random forest without tuning
features = all - (name, ticket, cabin)
but accuracy at submission was 77%
Any suggestion??
Try logistic regression!
It is a very standard approach towards this data, and maybe for this Random Forest algorithm, try dropping the name column as it adds no value to the data!
Try looking into the correlation of the different features amongst each other...... try plotting the data, do more of EDA and see what you can derive from the data...
Ok I will try
I saw correlation using heatmap
after that I chose 3 features for prediction ['Pclass', 'Sex', 'Fare'], but accuracy didn't improve; when I submitted on Kaggle it was 77%
I don't know how people are getting 100% accuracy
I got 100% accuracy. Even I dont know how I got it.
Which features you give in titanic model?
I got only 82% accuracy
Which algorithm give you 100 per cent accuracy
I used gender, age, fare, class and embarked
And the algorithm was Bernoulli NB
I converted Embarked from S, Q, C to 1, 2, 3
With LabelEncoder or OrdinalEncoder
Default settings lmao
Using dictionary
Dataframe
Random state 42 and test size 0.3 or 0.2 idk
Yes i did this
Nice
Maybe due to LabelEncoder the S, Q, C values get mapped to different numbers. Let me check it out
I manually converted it to 123
Then dropped the original sqc column
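A rough sketch of the dictionary approach described above, on a tiny made-up frame. The `EmbarkedCode` column name and the choice to fill the missing port with the most common one are my assumptions, not something stated in the thread.

```python
import pandas as pd

# Tiny made-up sample; the real column comes from train.csv / test.csv.
df = pd.DataFrame({"Embarked": ["S", "Q", "C", None, "S"]})

# Explicit mapping, so the same letter always gets the same number
# in both the train and the test data.
embarked_codes = {"S": 1, "Q": 2, "C": 3}

# Assumption: fill missing ports with the most frequent value before mapping.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["EmbarkedCode"] = df["Embarked"].map(embarked_codes)

# Drop the original text column, as described above.
df = df.drop(columns=["Embarked"])
print(df)
```

Keep in mind that 1, 2, 3 implies an order between ports that probably isn't real; `pd.get_dummies` is a common alternative.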
ok I will try these features but I have used those features after seeing Heatmap correlation
I see. I just checked simple correlation and then used them.
When I predict on test.csv the accuracy goes down to 0.76, why??
Did you remove outliers?
I think my model has some problem I'm too beginner to understand
Okay, so I have my predictions
the bottom is the survival rate laid out in the train data, and the top is the prediction from the test data. Its a smaller size dataset, so would it be different, or?
Hi i'm new. Perhaps a silly question; i'm confused on how i can measure accuracy of my model if there is no y_test data
split train_data into train_data and test_data
Hello there, before testing you could split your train into train and validation. When you measure based on validation you might have an idea of the outcome of the test.
Something else you can do is k-fold cross-validation of the entire train dataset using cross_val_score.
I hope this helps.
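The two suggestions above could look roughly like this. It is a sketch on synthetic data from `make_classification` (a stand-in for the prepared Titanic features), and logistic regression is just a placeholder model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the preprocessed train.csv features and Survived labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Option 1: hold out 20% of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)

# Option 2: 5-fold cross-validation over the whole training set.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)

print("validation accuracy:", val_accuracy)
print("mean CV accuracy:", cv_scores.mean())
```

Either number is only an estimate of what the leaderboard will show, since the real y for test.csv is hidden.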
Thanks
Hi Manu , Even i'm new here
Hey guys , how long does it take for the competition submission score to occur?
Glad to build the model today , it was a pure headache honestly but i did learned many things along the way
My model does not make predictions very accurately, is there a recommendation for an educational video (with an easy level of English) that will make predictions that are close to 100%?
provide more information, what's accuracy you've got? What the model you used and how do you handled your data?
I don't understand intuitively why the embarked location would impact survival rate? I know this is predicted by the model, but it's just not that intuitive
use grid search on the regularization term u may get it up a bit
Hi, anyone know if there's another link for Alexis Cook's Titanic Tutorial? The one given in the platform is not working.
I've found it, in case anyone else need it: https://www.kaggle.com/code/alexisbcook/titanic-tutorial
Thanks
Embarked location may help classify rich and poor people. Like there is one place where the 1st class passengers were in majority.
Idk or I'm shooting a shot in the dark
But why is it important? When I try to add this feature, it worsens my score on the test set but improves it on the train set. I think we already have Pclass that determines who's poor and who's rich. So, maybe this feature is redundant, or I'm doing something wrong.
As i said, shot in the dark
Hey fellas ,
I got an accuracy of 0.78708 (V17). I have done feature engineering, data cleaning and trained different models and found out the best model according to me fn. How to improve my rating? I am a beginner. On the CV set I am getting about 0.8715 but idk why test is so much lower? Any reasons?
Also when do you stop ,till you achieve 100% or 80%+?
you don't have to achieve 100% that would be a case of overfitting
you can predefine a benchmark for yourself.
He's talking about the test dataset probably
I get the same issue too, no matter what I do, it looks like it gets even worse.
https://www.kaggle.com/code/richarddev/titanic-survival-prediction
i get like 77% using random forests
i'm Completely new to kaggle can anyone help me to start? How to start the challange?
Just go to code, follow any workbook, make a copy, and submit.
is there a step by step so that we can learn as well
hi there folks looking for some help. The titanic test data is missing values. what are some recommended ways of handling that? I am using decision tree model. Deletion of rows with missing data won't work since the output must be 418 rows.
Hey, I suggest you use either XGBoost or Gradient Boost, they will give you a higher accuracy than decision tree. Plus, you should fill in the missing values in the test data with either the mean or mode. And can anyone check out my code, so I can improve my accuracy? https://www.kaggle.com/code/vishalyginny/titanic
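A small sketch of the mean/mode fill suggested above, on made-up rows. Taking the fill values from the training data and reusing them on the test data is my assumption about good practice, not something from the thread. No rows are dropped, so the test set keeps its full length.

```python
import pandas as pd

# Made-up rows standing in for train.csv and test.csv.
train = pd.DataFrame({
    "Age": [22.0, 38.0, None, 30.0],
    "Fare": [7.0, 71.0, 8.0, 10.0],
    "Embarked": ["S", "C", "S", None],
})
test = pd.DataFrame({
    "Age": [None, 40.0],
    "Fare": [7.25, None],
    "Embarked": ["Q", None],
})

# Mean for numeric columns, mode (most frequent value) for the categorical one,
# all computed on the training data only.
fill_values = {
    "Age": train["Age"].mean(),
    "Fare": train["Fare"].mean(),
    "Embarked": train["Embarked"].mode()[0],
}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```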
maybe try to follow the tutorials in the beginner competitions?
Hi guys I just submit the first prediction titanic with the tutorial, but I am wondering how it goes from now, I do not think it is over. Right?!? tks
After submitting the first prediction, we have to learn how to implement a machine learning model, then update with a second version.
Hello All, I am trying to do my first kaggle project, I am having basic issue, I am not able to see "Join competition button",, any suggestions?
Hello All, I followed through the cook book and used a RandomForest model and it generated a public score of .77751. Is there any suggestion on how I can make it better?
Check if your account is verified. A Phone number is required to verify your account.
me. Im beginner in this field.
here
100% accuracy is not good. At that point, you should take overfitting into consideration.
Yes, I have learned a lot since my last post about the problem and now I understand that training should be interrupted when the error function of the validation sample starts to increase, but I still haven't dared to solve the problem again....
can you tell me what accuracy I, as a beginner with two months of experience in machine learning (I can start solving it later), should strive for, I am already finishing a fairly large course on machine learning and am beginning to suspect that rough training in this problem is unnecessary, therefore I will have to take it up again in the near future
Does anyone know what we are suppose to do in feature engineering?
sry, i am a beginner too. The max accuracy i have got is 80%. in fact , there is a long way to increase the accuracy. However, i have no idea now. i'm so sorry that i can't help you.
hello, can you share your codes. hearing you got a high accuracy , i have a great interest to learn how you implement. thanks!
https://www.kaggle.com/code/jinayvora25/titanic-ml-models
This is my first ml model. So please don't mind the messy code. Also let me know if you find any errors.
thanks, it's so kind of you! what is your test accuray ?
what does this graph mean?
Good question. I don't remember
Oh wait it's the sibling one
I plotted the sibling column with respect to the survived
Hey guys, I am starting in kaggle and Idk how to start in the titanic competition
Refer to other submitted notebooks. Also use your EDA fundamentals.
Try and visualize the data as much as possible. Understand the data thoroughly.
Can you tell me why when I train a model (XGBoostClassifier) with hyperparameter tuning it gives me an accuracy score lower than the one obtained without tuning?
Is it overfitting?
Hey guys, in this tutorial im following: https://www.kaggle.com/code/amitkumarjaiswal/beginner-s-tutorial-to-titanic-using-scikit-learn/notebook
there is a reference to #1, #2, and #3, any idea what theyre talking about with these numbers,
Heres an example blurb,:
Is there anyone get a full prediction?
Hello everyone, I'm (new and) learning ML and this is my first time joining a Kaggle competition, I am stuck in an error for hours, and any help is welcome! Challenge: Titanic - Machine Learning from Disaster, needs help with feature engineering pipeline error. (I'm new to implementing a pipeline like this) What I'm trying to achieve: Age is a predictor variable, from 0.0 to 80.0, and also contains NaN. I want to bin this feature, first assigning NaN to number 999, then binning like: 0-1 is "Infants", "1-4" is "Toddlers", ..., "100-inf" is "Unknown", then, One-hot encoding the features. However, when I try to run a random forest model (image 4), I got the following error because my AgeBinningTransformer (image 1) is incorrect: `Cell In[53], line 24, in AgeBinningTransformer.transform(self, X)
22 X_copy = X.copy()
23 # Ensure 'Unknown' label is correctly handled
---> 24 X_copy['AgeGroup'] = pd.cut(X_copy['Age'], bins=self.bins, labels=self.labels, include_lowest=True)
25 return X_copy
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices` Any suggestion is welcome, thank you
Hi All, new member from Sheffield UK.
My question re Titanic comp is - are people genuinely achieving 100% on Titanic dataset competition? Seems like a stretch to reach. Is it a loophole? What score should we be aiming for before moving on?
You should convert the age column to numeric.
Do you use LLMs to help you with errors like this or do you class that as cheating? You can always ask the LLM to help you get to the answer without being explicit
I might have to look at the full code to suggest a correct diagnosis. However, I managed to make it work by using two separate approaches: The first one is by using KBinsDiscretizer (image 1) to bin into 5 groups (for example) via ordinal encoding, and the next one is a modified version of your code (image 2) which then feeds the pipeline for processing (image 3). Hope it helps.
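Since the images aren't visible here, below is a hedged sketch of one way an `AgeBinningTransformer` could be written. NumPy raises that IndexError when an array is indexed with a string like `'Age'`, so one possible cause (a guess, since the full pipeline isn't shown) is that an earlier pipeline step turned the DataFrame into a NumPy array. The bin edges and labels beyond "Infants" and "Toddlers" are my own placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AgeBinningTransformer(BaseEstimator, TransformerMixin):
    """Add an AgeGroup column; missing ages get the label 'Unknown'."""

    def __init__(self, bins=(0, 1, 4, 12, 19, 64, 120),
                 labels=("Infants", "Toddlers", "Children",
                         "Teens", "Adults", "Seniors")):
        # Bin edges and most labels are illustrative assumptions.
        self.bins = bins
        self.labels = labels

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Fail early with a readable message instead of NumPy's IndexError.
        if not isinstance(X, pd.DataFrame):
            raise TypeError("expected a pandas DataFrame with an 'Age' column")
        X_copy = X.copy()
        groups = pd.cut(X_copy["Age"], bins=list(self.bins),
                        labels=list(self.labels), include_lowest=True)
        # NaN ages fall outside every bin; give them their own category.
        X_copy["AgeGroup"] = groups.cat.add_categories("Unknown").fillna("Unknown")
        return X_copy


ages = pd.DataFrame({"Age": [0.5, 3.0, 10.0, 35.0, np.nan, 80.0]})
binned = AgeBinningTransformer().fit_transform(ages)
age_dummies = pd.get_dummies(binned["AgeGroup"], prefix="AgeGroup")
print(binned)
```

This avoids the 999 sentinel entirely: missing ages are labelled after binning rather than before.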
Hi! Thank you for helping! For that particular error I managed to resolve it... I still need to work on the preprocessing step tho. Also, I made my workbook public: https://www.kaggle.com/code/bigsmallmediumpotato/titanic-ml-challenge
you mean ChatGPT? it's great when I know what code I'm looking for, so I just enter the prompt and let it return what I'm expecting. It's great automation. It can't replace debugging process though, and independent thinking. However, it is helpful to enter the prompt: in simple terms, explain [a line of code]. followed by the prompt: an example
Cool, I'll give it a look.
Ty! you're welcome to collaborate on this if you want
Sure, I'm still relatively new in Kaggle so you might have to guide me with some of the rules and regulations lol.
This is my first challenge as well, I think I can add you as a collaborator with your Kaggle username. rules wise, team submissions are allowed (per leaderboard).
Kaggle username is shahriarrahman10, I'm also connecting with you on Discord.
Hello guys, is anyone doing the titanic challenge? I'm kind of new to ml coding, and after two days of coding I have successfully got an accuracy rate of 73% to 75% (using naive bayes) on the validation set that I split off from the train set,
I have tried to work with the test set , though I have completed the data preprocessing but.. since the y_test (Survival feature) is missing I can't exactly use the model to test set..
you should split your training set into two parts, one for test and one for training. The test set provided is just the tests you need to run for submission
oh i guess you knew that actually lol,
you'll know your accuracy when you submit
yeah , I don't know how to apply the model to the test set since I don't know which true y value to compare the y predicted value for the test set
once you submit your csv it will give you the accuracy, they don't give you the true y for the test set because then you could just submit that
check out this section on the competition page
ah thank you , I submitted my submission file , I got 67.7% accuracy.
Hello @everyone, just made my first submission. Had an accuracy of 76% using the Random Forest classifier. I replaced all the NaN values in the Age feature using 0. This is probably not a good idea. I selected the min_samples_split, max_depth and n_estimators by plotting a few graphs, but my feature engineering skill is not developed yet. Any advice on how to increase the accuracy?
Hi, you can look through my notebook section 2 feature engineering for a summary of what I did. My highest score is 0.789 using categorial boosting. Hope this helps! https://www.kaggle.com/code/bigsmallmediumpotato/titanic-ml-challenge-top-10
Man, I'm already stuck at the first line of code. Can someone please guide me here? I'm not sure what I did wrong..
here is the first block of code that got cut off at the top:
You need to run import pandas as pd for the data to be read in. That line is present in the first block of code that got cut off so maybe you forgot to run it
I got 0.76555 accuracy in the titanic competition using XGBoost model, how can I improve the accuracy ?
hey guys, i have been suggested to use knn or mice to impute the missing value for age, is this optimal?
Since age is one of the primary contributors, it is generally a good approach to impute using a regression model like RFR, XGboost, Knn Reg, etc.
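As one concrete version of the KNN idea, here is a sketch using scikit-learn's `KNNImputer` on a few made-up rows. The choice of helper columns (Pclass, SibSp, Fare) and `n_neighbors=2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric features; Age has two missing values.
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 2, 3],
    "SibSp": [0, 1, 0, 1, 0, 0],
    "Fare": [80.0, 75.0, 8.0, 7.5, 20.0, 8.5],
    "Age": [40.0, np.nan, 22.0, 24.0, 30.0, np.nan],
})

# Each missing Age is replaced by the mean Age of the 2 most similar rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Distances here are dominated by Fare because it has the largest scale, so scaling the columns first is probably worth trying. Whether this beats a plain median fill is something to check with cross-validation rather than assume.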
Hey guys, if anyone can help me out on how to improve I'd be forever grateful! So, this is the the train and test data before one-hot-coding on embarked and social status features. Before creating X and y, I dropped Name, ID, and that was it! Also, using RandomForest for the clf!
My goal is to get at least 80%+ before using multiple classifiers, and often the outcome for this one is 77-79
I think you need to import pandas as pd in first line
Hey everyone! I am new here. I recently tried a decision tree and a random forest algo on the dataset and had a score of 0.75 and 0.77 respectively. Can someone please share how they got a score of 1 and what algorithm/ process they followed?
I did manage to see one example of someone reaching for that 1.0, but he used multiple classifiers, tensor flow and a voting system!
hey guys ,im new here,i want to know how do we select particular algorithm ?is it based on data set?
and what is Exploratory data analysis ?what is its use?
How do i updated my regression parameter value as i adding or deleting data?
Anyone have the floor plans of the Titanic? I'm splitting up the Cabin row so I can make a floor column and a roomNumber column, but I can't find floor plans to see if the room number matters for nearness to stairs\exit.
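Can't help with floor plans, but for the column split itself, a possible sketch with pandas string methods. The `Deck` and `RoomNumber` names are hypothetical, and taking the first letter and first number is a simplification for entries that list several cabins.

```python
import pandas as pd

# Example Cabin values in the formats seen in the dataset (single cabin,
# several cabins, letter only, missing).
cabins = pd.Series(["C85", "E46", None, "B57 B59 B63", "F G73", "T"], name="Cabin")

# First letter in the string is treated as the deck; missing cabins become "Unknown".
deck = cabins.str.extract(r"([A-Za-z])", expand=False).fillna("Unknown")

# First run of digits is treated as the room number; stays NaN when absent.
room_number = pd.to_numeric(cabins.str.extract(r"(\d+)", expand=False))

cabin_features = pd.DataFrame({"Deck": deck, "RoomNumber": room_number})
print(cabin_features)
```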
regarding this problem, i am supposed to try whatever model i want and choose the one with best result? incidentaly for this problem ill be choosing decision tree but for other problem should i try more than 1 model?
For me at least, Cabin was dropped since 77% of the column is missed.
Hey, guys. Could you share an insight on how to select a proper model for the Titanic task and similar tasks?
I don't know how correct I'm but from what I understand , the selection of your model depends on the task you want to do , in case of titanic dataset, since we are asked to predict whether the passenger will survive (True or 1) or will perish (False or 0) , since we are asked to classify the passenger's survival , this is a Binary(since only two possible outcomes) Classification problem.
If there were more than 1 outcomes like 0, 1 or 2 etc it would have been a multiclass Classification problem.
Since we are to do classification, some standard algorithms like KNN, Naive Bayes, Logistic Regression and some others are used specifically to classify. We select the model which has the least (minimum) cost function [think of this as error], a lower learning curve (takes less time to compute) and higher accuracy score,
on the other hand we could also classify using neural networks for more robustness and more dexterity.
Hey guys. How can I submit my Colab notebook on Kaggle?
IMO - it's not as black and white as that. They can increase independent thinking productivity tenfold if used creatively.
Hi, I see the score is 1.0. Does that mean the accuracy is 100%? If so, could this indicate the model is overfitting?
hello, I'm just getting started on my titanic submission and I can't figure out how to use the notebook. When I try to type 'import', the entire line disappears and so does the cursor. then I try to click on the code cell to edit the code again, and the entire line I click on disappears. It's pretty much impossible to write any code. Does anyone know what I'm doing wrong?
I wasn't able to go over 78% even using multiple classifiers with LR, DT, KNN, RF, XG. I have defined Fare/Pclass, FamilySize and divided the ages into bins. Have you found other features or have you tried other algorithms?
Sorry about the delay, I'll take a good look at that dataset later in the day, but as far as memory goes, I did try many. I used this flowchart https://scikit-learn.org/stable/machine_learning_map.html, but at the end of the day, RF seems to perform considerably better than any other. As far as feature engineering goes, I kept it pretty simple. One thing that I remember making a very small difference, but some improvement nonetheless, is using a Grid Search to find the best hyperparameters for the estimator. So, take a look at that as well!
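A minimal sketch of the grid search idea with `GridSearchCV` and a random forest. The data is synthetic and the parameter grid is an arbitrary small example, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Small illustrative grid; every combination is scored with 3-fold CV.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

`search.best_estimator_` is refit on all the data by default and can be used directly for predictions.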
I got 81 using a neural network, no stacking. Definitely possible to get higher.
Ah ok, I am trying to do without NN, but thank you for the answer
Hi guys, What do you consider a good score (satisfactory)?
Hi folks, i'd like to understand if someone of you experienced to have a very high accuracy on BOTH train and x-validation sets, and nonetheless having 10% less on the test submitted on kaggle. Since i see no particular issues on how i engineered features of the test set to have it aligned with the features of the model trained (one hot encoding, features dropped etc...) it remains only the assumption of having an overfitting model. But then why is that model performing well on the cross validation set ?
Just to give you an order of magnitude:
- acc on train set: 86%
- acc on cv set: 85 and counting %
-acc on kaggle submission: 76 and counting % ... -.-
Hey, I am a beginner as well but I encountered something like this while working on the playground series for this month, you might want to look at increasing the amount of folds (if youre using a stratified k-fold) just throwing this out there
I was encountering a large difference between cv and leaderboard with 5 folds (around 140k entries in dataset) and then after doing some parameter optimization trials with 8 splits instead my leaderboard score almost exactly related to my cv score
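To see how the number of folds changes the estimate, something like this could be tried. It runs on synthetic data with a placeholder model, so the numbers themselves mean nothing for Titanic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

results = {}
for n_splits in (5, 8):
    # Stratified folds keep the class balance roughly equal in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    # Store mean and spread; a large spread means the estimate is noisy.
    results[n_splits] = (scores.mean(), scores.std())

for n_splits, (mean, std) in results.items():
    print(f"{n_splits} folds: {mean:.3f} +/- {std:.3f}")
```

With only 891 training rows on Titanic, the fold-to-fold spread can easily be a few percent, which is worth comparing against any CV vs leaderboard gap.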
I see
Just starting off so wanted to know when can I move on from a competition to another
Can anyone suggest me how can I fill Null values in the cabin column
I see a pattern that cabins starting with C,D,E,F have higher chance of survival compared to others but dont find any relation between fare, class and cabin
also can anyone tell me if they are able to use fare and age? I dont see them being useful
To get a better understanding of how the submission is evaluated, I suggest you use Matthew's Correlation Coefficient for evaluating your train, as this is a better reflection of the submission scores.
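For reference, the competition leaderboard itself is scored on plain accuracy, so MCC is an extra check rather than the official metric. A tiny sketch on made-up labels showing how the two can disagree:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Made-up labels: the model predicts "did not survive" for almost everyone.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# Accuracy looks decent, while MCC is much lower because 3 of the 4
# survivors were missed.
print("accuracy:", accuracy)
print("MCC:", mcc)
```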
Hello, I'm a beginner, just started learning data analytics last month. I used an XGBoost model on my train dataset with some engineered features. I've had four submissions and the highest I got so far is only at 77.99%. I've tried different kinds of engineered features to improve the model but retained only those that seem to work. For a beginner, is it possible to push past 80% or that requires a bit more advanced knowledge?
my friend used some tree diagrams and made through 77%
you definitely can push past 80%
Can you let me know if you succeed pushing past 80%. I'm in the exact same situation!
everyone is now a farm merge valley player
Z
What are the prerequisites for this exercise?
Hello everyone, I am new to this competition and kaggle, and I made a LGBM model that had 80% accuracy, however, my public score is of 0.0000. why is this happening?
Hi! I'm a newbie with datascience. I'm looking to partner up with people to discuss and do datascience projects. I'm looking for people who are interested in understanding why something works the way it does, not just bumbling through to increase accuracy scores. I've finished the IBM Data Science course and now doing the Titanic project. Anyone interested to work with me?
i'm down!
I'll DM you.
Hey I'm a software engineer student and I want to improve my score
( 0.77751 ), it's for a class and I am only allowed to use logistic regression, anyone has suggestions on how to improve my feature engineering? Maybe share what you did in your code? Thanks
i would like to know how can i do hyper parameter tuning correctly ?
and what should i use? Grid search / Bayesian / Hyperopt / Optuna?
Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!
In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.
As Kaggle states: "The competition is simple: use machine learning to create a mode...
https://youtu.be/t-INgABWULw
https://youtu.be/LrCylIe0RJM
@fresh spear @tacit skiff
In this comprehensive tutorial, we delve deep into the world of hyperparameter tuning using Optuna, a powerful Python library for optimizing machine learning models. Whether you're a data scientist, machine learning enthusiast, or just looking to improve your model's performance, this video is packed with valuable insights and practical tips to ...
Welcome to our comprehensive guide on hyperparameter tuning with Scikit-Learn!
In this tutorial, we'll dive deep into the world of machine learning model optimization. If you're looking to take your data science skills to the next level and boost your model's performance, you're in the right place.
Interested in discussing a Data or AI proje...
@whole garden Start with this playlist, https://www.youtube.com/playlist?list=PLcQVY5V2UY4LNmObS0gqNVyNdVfXnHwu8
Without ad blocker
thanks man
I'll have spaceship vid out in feb i think. Working on some new ml vids this month though
I made my first submission following the guide and I got a 0.7751. Is anyone else a beginner in the process of raising their score? I'm looking to collab :)
hey i am also in a somewhat similar position can you add me too
hello, i am trying to save and it keeps saying failed, when i do a quick save it works and i can not submit to competition. please
you might have an error in the code
if you dont write into the csv you cant submit it
Hello Amit, I have managed to get it to submit, thanks for the response
no problem
i'm also still looking for people to discuss with!
Hello, Do I just need to press the submit button or do I need to ask for some kind of permission to join this titanic thing?
me too!
Hey, I'm a beginner in this Titanic project. I followed Alexis Cook's Titanic Tutorial and submitted, but I am unsure how to improve/progress forward. Anyone have any advice on where to look next?
try paid competitions
in the video they explained that looking at the data set and looking online can help you understand better what to look for
also looking at the submissions or forums on kaggle could help you out with different and unique ideas
This should also be a big help
https://www.kaggle.com/learn/intro-to-machine-learning
Learn the core ideas in machine learning, and build your first models.
Hi, I am new to Machine Learning anyone can explain why we need to Normalize and standardize the data?
source: https://www.geeksforgeeks.org/what-is-data-normalization/ Why do we need Data Normalization in Machine Learning?
There are several reasons for the need for data normalization as follows:
Normalisation is essential to machine learning for a number of reasons. Throughout the learning process, it guarantees that every feature contributes equally, preventing larger-magnitude features from overshadowing others.
It enables faster convergence of algorithms for optimisation, especially those that depend on gradient descent. Normalisation improves the performance of distance-based algorithms like k-Nearest Neighbours.
Normalisation improves overall performance by addressing model sensitivity problems in algorithms such as Support Vector Machines and Neural Networks.
Because it assumes uniform feature scales, it also supports the use of regularisation techniques like L1 and L2 regularisation.
In general, normalisation is necessary when working with attributes that have different scales; otherwise, the effectiveness of a significant attribute that is equally important (on a lower scale) could be diluted due to other attributes having values on a larger scale.
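To make the scale issue concrete, a short sketch with Age-like and Fare-like columns (made-up numbers). Fitting the scaler on the training rows only and reusing it on the test rows is the usual pattern.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up [Age, Fare] rows; Fare spans a much wider range than Age.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05]])

# Standardization: each column gets mean 0 and standard deviation 1.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse the training statistics

# Normalization (min-max): each column is rescaled to the range [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_01 = minmax.transform(X_train)

print(X_train_std.round(2))
print(X_train_01.round(2))
```

Tree-based models such as random forests are largely unaffected by this kind of rescaling; it matters most for KNN, SVMs, neural networks and regularised linear models.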
Starting with the Titanic survival prediction after signing up with Kaggle about 2 years ago, better late than never.
Hi, just a quick question. I submitted some data sets, just to experiment. After looking at the leaderboard, I noticed that a lot of people have 100% accuracy. Is this even possible, or did they just use historical data to give the correct answer for each passenger?
That's what I would think happened
I just updated the basic gender/sex logistic regression, to familiarize myself with the submission process, and it's .76555
OK, thank you very much, I was really confused about that.
Titanic's fate highlighted the flaw in this plan.
People say that the Titanic wasn't equipped with enough lifeboats to accommodate everyone in case it sank.
But the plan was, if we ran into trouble, other ships in the area would come to our aid, and we needed enough lifeboats to ferry people in shifts from our ship to the others.
This data not only reflects the social dynamics of the early 20th century but also serves as a reminder of the ongoing need to address inequalities in crisis situations. It's crucial for modern safety protocols to ensure fairness and prioritize human life regardless of socioeconomic status, gender or employment role.
What lessons do you think we can draw from the Titanic tragedy that are applicable to today's society?
I finished my homework!
https://www.kaggle.com/code/alexandroskanakis/titanic-survived-classifier
After about 25 submissions, I've finally landed on a notebook I'm happy with. It scored pretty well, and feels like a good stopping point as I move on to the next project. I'm still getting started with Kaggle competitions, so I'd love any feedback you might have, or if you find my approach useful or worthy of praise, an upvote would mean a lot!
Here's the link to my notebook: https://www.kaggle.com/code/josephnehrenz/classification-titanic-random-forest-model-in-r
Thanks in advance, and good luck to everyone still working on the challenge!
As a complete n00b, I'm curious how your previous submissions scored, either in the competition or on your validation set. Most of the feature engineering things I've tried so far seemed to make things worse (using XGBoost as the model), so I'm curious how big of an impact your features made as you added them in later submissions
Hey @pine sonnet,
I totally understand your experience, I had something similar happen. I initially scored 79% with a fairly basic model, and when I started adding more features, my results actually got worse. At first, I thought I was on the right track, but it turned out I was running into issues with overfitting.
As I added more complex features and ramped up cross-validation and parameter tuning, my results really tanked. What I learned is that for this competition, the sweet spot seems to be finding a balance:
- Adding some meaningful features to improve the model, but not going overboard.
- Avoiding over-tuning to the training data so the model still generalizes well to unseen data.
It's tempting to throw in every feature you can think of, but for this challenge, simplicity with a little refinement seems to work better than full complexity. An overtuned/overdone model can easily translate to 5%-10% prediction accuracy loss for this data.
Hello, I'm a noob in machine learning and I'm trying hard to understand all the mechanics. Your notebook is very helpful, I just have one thing that I don't understand: you explain a lot of statistics in order to see the correlation of columns, to underline the link between the survival rate and gender, title etc. However, I don't understand where you use your graphs and calculations in your model to predict the test set.
Hello, I am newbie to kaggle. what are the next steps after titanic tutorial?
Hi @distant linden ,
Thank you so much for taking the time to check out my notebook! I'm really glad to hear that you found it helpful, your feedback means a lot. Regarding your question, you're absolutely right that the graphs and calculations I included are primarily used to explore the relationships between features and the target variable, as well as to confirm the value of those features for prediction. For example, when we see a strong relationship between Sex and survival rate in the graphs, this insight suggests that Sex is an important feature. I take these insights and incorporate them by creating or refining features for the model. For instance:
Transforming features:
If we notice patterns in the Age distribution, we might group it into bins or take a log transformation.
Creating interaction terms:
If two variables (like Sex and Pclass) show combined effects, I might include an interaction term like Sex * Pclass.
In essence, the visualizations aren't directly used in the model but guide which features I create or prioritize. They act as a bridge between data understanding and model performance. As you mentioned you're new to machine learning (welcome!), I hope this helps clarify things. If there's a specific part you'd like me to expand on, let me know, and I'd be happy to help further.
P.S. If you found the notebook helpful, I'd really appreciate an upvote, working toward my first bronze medal has been quite the journey!
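The linked notebook is in R, so this is only a rough Python sketch of the two ideas mentioned (a log transformation and a Sex * Pclass interaction term). Applying the log to Fare and the `LogFare`, `IsFemale` and `Sex_x_Pclass` column names are my own illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows with the relevant columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Transforming a feature: log1p compresses the long right tail of Fare
# and is safe for a fare of 0.
df["LogFare"] = np.log1p(df["Fare"])

# Interaction term: encode Sex as 0/1, then multiply by Pclass so the
# model can treat class differently for women and men.
df["IsFemale"] = (df["Sex"] == "female").astype(int)
df["Sex_x_Pclass"] = df["IsFemale"] * df["Pclass"]

print(df)
```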
Asking for feedback on my kernel
Hello, I have recently started on titanic challenge and I finished tutorial. Then, I tried doing some preprocessing to clean up and improve the accuracy. However, accuracy decreased. What is the problem with my code? I attached the link to my notebook.
https://www.kaggle.com/code/eidenspark/notebook6b8d8cd056
Thanks!
Hi @simple egret, nice work and welcome to kaggle! I'm fairly new myself but think I can help you out quite a bit. There's a lot I can share, it may be better for you to check out my notebook linked about 5 comments up:
https://www.kaggle.com/code/josephnehrenz/classification-titanic-random-forest-model-in-r
It's written in R but super easy to follow, with plenty of documentation explaining the process, I think if you spend a few minutes reviewing it, you'll be able to make a number of connections and know what to do going forward.
High level, you can definitely do more than cut missing values and run the model! I'd suggest starting by imputing the missing values; there are plenty of methods for that, so it's worth exploring. After that, dive into the data and try creating some interesting features that might boost the model's predictive power.
Hope this helps get you started! Feel free to ask if you have any questions, and if you find it helpful, I'd really appreciate an upvote as I'm questing for my first bronze medal! Good luck!
I think part of what I'm wondering is whether seeing no improvement on an individual feature is a sign that that feature shouldn't be included, or whether it's the kind of thing where you only see the benefit when you've added everything in together. Looking at what I was doing, it looks like I had added a feature equivalent to your FarePP, but I didn't see any benefit to adding it, I got exactly the same accuracy on a validation set with vs. without. I'm not sure whether that means I should leave it out, or whether it might matter in interaction with something else. If you drop FarePP from your model, does it get worse?
Hey @pine sonnet, you've definitely hit on the heart of the modeling process here! Deciding when to add, drop, or create interactions between features is often the trickiest part. There aren't clear-cut rules, it's more about experimenting, asking the right questions (which you are doing right now), and iterating based on what the model is telling you. You're right, sometimes a feature doesn't show an immediate benefit on its own, but it could have an impact in combination with other features.
As for FarePP or any feature, I wouldn't drop it just yet based on a single validation result. A lot of times, the true value of a feature emerges only after a few changes have been made elsewhere in the model. This is why you see such a variety of approaches across the 15k+ entrants. It truly is all about finding your personal model "touch." Sometimes it goes great and sometimes not so much.
Personally, I try to limit my changes to one or two things at a time, so I can trace back any performance dips and better understand what went wrong. I also like to document those updates in the comment section when submitting results, so if a change hurts the score, I can easily roll it back and try something else. It's normal to submit many, many files for any given competition. For example, I'm already somewhere in the teens or 20s for the new Playground competition for February that opened a few days ago.
So keep experimenting and trust the process! There's no one-size-fits-all solution here, and it's all about finding that balance.
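One way to check the "with vs. without" question above is to score both feature sets on the same cross-validation folds rather than on a single validation split. This is only a sketch: the data is synthetic (not the real train.csv), `FarePP` is assumed here to mean fare divided by family size, and the helper `cv_accuracy` is a made-up name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (NOT the real Titanic file).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),        # already encoded: 1 = female
    "Fare": rng.gamma(2.0, 15.0, n),
    "FamilySize": rng.integers(1, 6, n),
})
# Assumed definition of the feature: fare per person.
df["FarePP"] = df["Fare"] / df["FamilySize"]
# Toy target loosely driven by Sex and Pclass, plus noise.
p = 0.15 + 0.55 * df["Sex"] + 0.1 * (3 - df["Pclass"])
y = (rng.random(n) < p).astype(int)


def cv_accuracy(columns):
    """Mean and std of accuracy over the same stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(model, df[columns], y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()


base_cols = ["Pclass", "Sex", "Fare", "FamilySize"]
base = cv_accuracy(base_cols)
with_farepp = cv_accuracy(base_cols + ["FarePP"])
print(f"without FarePP: {base[0]:.3f} +/- {base[1]:.3f}")
print(f"with FarePP:    {with_farepp[0]:.3f} +/- {with_farepp[1]:.3f}")
```

If the difference between the two means is smaller than the fold-to-fold spread, a single validation result probably can't settle whether the feature helps.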
Hey all!
While going through the Titanic train data I noticed that a couple of people are listed with ages far too high to be true. Mr. Patrick Connors, for example, was supposedly 705 years old. How do I deal with such obviously false data? Do I exclude these extreme ages? In this case I could also correct it by hand, because the true age can be found online. But should I expect the data to be wrong in more cases that I haven't found yet? Finding wrong information in the more "normal"-looking entries seems difficult.
thanks for any answer.
cheers Laurin
Treat them like outliers.
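A rough sketch of what "treat them like outliers" could mean in pandas: flag ages outside a plausible range and turn them into missing values, so they go through the same imputation as the ages that were missing to begin with. The rows below are made up, and the cutoff `MAX_PLAUSIBLE_AGE` is an assumption, not something from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the 705 mimics the kind of entry described above.
train = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 705.0, 38.0, np.nan],
})

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, pick what fits the data

# Flag values outside a plausible range instead of deleting the rows.
implausible = (train["Age"] < 0) | (train["Age"] > MAX_PLAUSIBLE_AGE)
print(train.loc[implausible])

# Turn flagged ages into missing values; imputation can handle them later.
train.loc[implausible, "Age"] = np.nan
```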
Hi all, I am starting my Kaggle journey with this Getting Started Competition.
Tonight will be all about setting up my dev environment and creating a first benchmark submission without any data preprocessing/cleaning to start from.
What are some of your methods of operation when starting a competition? Do you submit a first benchmark, or do most of you try to squeeze out as much accuracy as possible on the first try?
Have a good evening everyone!
Because I am coming back to ML from a hiatus, I used my first attempt to try my best from memory alone as a recall exercise. Then I started to take ideas from forums like this and make improvements from there.
Ready to open my first notebook
Hi all, I was wondering if someone would be open to helping me with the titanic tutorial? I entered the two lines of code into the second code cell (copy+paste). The data is coming out but not in a table format, is that ok? I also don't see a third code cell.
Hi everyone! I just completed my first Kaggle submission for the Titanic competition. I'd love your feedback on my notebook: https://www.kaggle.com/code/amrkabbary/titanic-survival-prediction-a-beginner-s-guide. Any tips or suggestions for improvement are welcome!
i've submitted my first notebook as well
On the tutorial, my random forest model code is continuously running
I am starting this project and almost through the tutorial, does anyone want to partner up with me on this? Looking to collaborate
Hello I would love to do that! You have contacts?
Hello everybody! I am a complete newbie to Kaggle and starting to get my hands on the Titanic competition. Before starting, I realized there are several modeling approaches and models to use before submitting. I would like to use the best models to get at least 0.8 score. Is there any notebooks and links I can look at to learn?
I just sent you a direct message
Send me too the resources. I am new to Kaggle
Hello, I'm a newbie. Anyone want to learn together?
Yo what's the highest possible ceiling for titanic dataset without cheating?
Hello, I'm a beginner, and when I tried to solve this problem, I faced some issues with my evaluation process.
I used XGBoost Classifier to predict, and applied 5-fold cross-validation to evaluate my results. In my cross-validation, 8 out of 10 folds achieved an accuracy higher than 80%, while the remaining folds had around 75%. However, when I submitted my predictions, my score dropped significantly to around 70%, which was much worse than expected.
Could anyone give me some advice on how to improve my test set accuracy? Thank you in advance!
send something like this
it's hard to give advice based on your words alone since we haven't seen any kind of data
I've fixed it. I think my mistake was using cut and qcut separately for the train and test sets.
great!
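For anyone hitting the same cut/qcut issue described above: binning train and test separately gives each set its own bin edges, so the same bin number can mean different value ranges. A minimal sketch of one way around it, with made-up fares, is to learn the edges on the training set and reuse them on the test set.

```python
import numpy as np
import pandas as pd

# Made-up fares, not the real files.
train = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28, 512.33]})
test = pd.DataFrame({"Fare": [7.9, 14.5, 30.0, 263.0, 600.0]})

# Learn quantile edges from the training set only.
train["FareBin"], edges = pd.qcut(train["Fare"], q=4, labels=False, retbins=True)

# Open up the outer edges so test values outside the training range still get a bin.
bins = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Reuse the same edges on the test set.
test["FareBin"] = pd.cut(test["Fare"], bins=bins, labels=False)
print(test)
```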
Hi everyone! I have seen many people with a score of 1.00. Is it really possible to predict 100% of passengers?
Nope, they probably downloaded the ground truth; there are guides on how to do that on Kaggle. Either way, Titanic is not about getting 1s or a high score, it's just something you do to warm up
Hello everyone,
I attempted the titanic survival challenge in kaggle. I was hoping to get some feedback regarding my approach. I'll summarize my workflow:
- Performed exploratory data analysis and heatmaps, and analyzed the distribution of numeric features (addressed skewed data using a log transform and handled multimodal distributions using combined rbf_kernels)
- Created pipelines for data preprocessing like imputing and scaling, for both categorical and numerical features
- Created SVM classifier and random forest classifier pipelines
- Test metrics used were accuracy, precision, recall, and ROC AUC score
- Performed random search hyperparameter tuning
- Cross-validation score of SVM was slightly higher than random forest
- Testing score of random forest was 0.78229
- Testing score of SVM was 0.53588
I think some flaws in my notebook are not performing feature extraction, feature selection and missing outlier analysis. I would appreciate any feedback provided. I really want to improve and perform better in the coming competitions.
Link to my Kaggle notebook: https://www.kaggle.com/code/jayasuryanmutyala/titanic-survival/notebook
Thanks in advance!
Hi there, is it possible to form a team in the Titanic competition?
Impossible
If I get good in this compatitoan wiill I get nobal prize?
Nope though u will get typo master prize
is that prastigous?
You imputed using the median in the numerical pipeline. This applies to Age, since it's the only feature there that has missing values. Have you checked the graph? See what happens if you dump all those 177 missing values on the median.
You applied log transform, did you check whether it actually fixed skewness?
You applied StandardScaler, which z-scores each feature. It doesn't strictly require normally distributed inputs, but heavy skew and outliers affect the result, so did you check whether your distributions are normal or even close to normal?
From my understanding, pipelines are made for automating data preprocessing when data arrives at regular intervals with the same features, and the transformations are decided after carefully analyzing how to handle each feature properly. What I'm trying to say is that for projects like these, pipelines aren't necessary; they are used in industry to automate the cleaning process so that people don't have to redo it every time a new dataset is produced. I often see people use them the wrong way and just plug in a simple imputer and so on to make the cleaning process instant, but I think this is very wrong, as every feature should be handled differently based on the data analysis. But I may be wrong in my understanding
My feedback is to use formulas and techniques with intent, spend more time analyzing your data, and know when and when not to use statistical treatments
I upvote tho for support
Yes, thank you. I understood your points. Also, when I applied the log transform it did not make much difference for some features. If I remember correctly, out of all the numerical features only one looked skewed, and the transform fixed that one, but the others looked very much multimodal. I still don't really have a solid idea of how to properly address multimodal distributions. I'll read some articles online and try again. Also, I don't know how to upvote. Thanks again for the feedback.
you are using a tree-based model and an SVM with an RBF kernel, so multimodality won't matter, but skewness does a bit. It's not a problem for trees, but when you have something extreme like in Fare, where most of your values are below about 100 and then about 2 rows have a fare of 512, it MIGHT become a problem, so you might wanna check whether those extreme values affect the model you are using or not
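A quick way to check whether a log transform actually reduced skew in something like Fare, and to see which rows count as extreme, might look like this. The fares are made up to mimic the pattern described (mostly small values plus a couple near 512), and the 1.5 x IQR rule is just one common rule of thumb, not the only option.

```python
import numpy as np
import pandas as pd

# Made-up fares: mostly small values plus two extreme ones.
fare = pd.Series([7.25, 7.9, 8.05, 10.5, 13.0, 26.0, 30.0, 71.28, 512.33, 512.33])

skew_raw = fare.skew()
skew_log = np.log1p(fare).skew()  # log1p also copes with fares of 0
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# Rows more than 1.5 * IQR above the upper quartile.
q1, q3 = fare.quantile([0.25, 0.75])
extreme = fare[fare > q3 + 1.5 * (q3 - q1)]
print(extreme.tolist())
```

Whether those rows matter still depends on the model, so comparing a cross-validated score with and without the transform is the more direct check.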
so one solid piece of advice that I apply to myself too is to always ask myself 'why', 'why do I need this', 'why do I do that', 'why is this necessary', 'why do I choose this instead of that model' etc
I will definitely do it the next time
NO, not "next time" bro, always haha
I build a very simple solution based on minimal knowledge about the models when I worked on this
Yup definitely
it's a nice notebook, definitely has stuff to improve upon but it has almost everything from start to finish
I try implementing everything that I learn from the Hands-On Machine Learning book by Aurélien Géron in Kaggle challenges. It's still a work in progress haha. I recently finished learning SVM so I will definitely do better.
Also do you have any advice for choosing a deep learning framework ? @eager cedar
I have some experience with pytorch before working on some simple projects
Nah, treat all of these frameworks as tools. Everything is just applied mathematics, and some frameworks might have stuff you need that others don't have; you can switch back and forth between these frameworks
I agree 100 percent
that is one important thing too. Basically AI is just applied math: dedicated libraries and frameworks solve most of the coding for you, but you still need to firmly grasp the math behind them
I always had it confusing which one to just follow since I see pytorch has been gaining a lot of popularity over the last few years but the book implements everything in tensorflow. I'm thinking of just following the book and picking up pytorch again
Pick pytorch asap
Yeah very true I feel the mathematical intuition behind the models is really important and gives a lot of knowledge behind the scenes for the model
well tensorflow might be a bit complex to look at while pytorch is easier on the eyes, either way both have uses
Do you recommend any books or courses I can follow for pytorch ?
what can I say, choose what works best for you
hmm I dont know about books but I know some videos in youtube
I have some experience prior with working on pytorch but a very foundational level
Will do that.
I grinded this 25 hrs video like a year ago
https://www.youtube.com/watch?v=V_xro1bcAuA
but most of my skills came from actually trying to use it
Learn PyTorch for deep learning in this comprehensive course for beginners. PyTorch is a machine learning framework written in Python.
Daniel Bourke developed this course. Check out his channel: https://www.youtube.com/channel/UCr8O8l5cCX85Oem1d18EezQ
Code: https://github.com/mrdbourke/pytorch-deep-learning
Ask a question: htt...
I followed the same video haha when I learned it too
I don't think he covered nlp though
I have 0 patience so I just take what I know and build something, and just come back to some videos when I'm totally lost
one of the things I wish I had done back when I was still learning was to specialize in 1 thing and focus on it instead of trying everything haha
I did Computer Vision and NLP, among a bunch of other stuff
Thats good approach tbh
but everything worked out in the end, I just had to deepen each of these specific areas
I'll try revisiting my basics in pytorch and work on some simple projects to deepen my understanding as well
Reason: Bad word usage
whaaat
Thanks again bro I really appreciate it for the advice and guidance. I'll definitely improve in next attempts in kaggle
No problem
https://www.kaggle.com/competitions/titanic/discussion/571172
Could anyone else check my question?
wassup everyone
I don't know why, but I can't seem to boost my score. Any tips from you all?
Hello everybody, this is my first time being on Kaggle, what skills do I need to know before tackling Titanic?
can you give more details
Data Analysis, Statistics, Machine Learning
I tried to tweak my model for the Titanic competition a bunch of times, but I kept hitting the same score.
what is your usual score? can you tell me the steps you did before 'tweaking' your model?
My common score is 0.69, and I followed the common steps before modifying my model, such as splitting the data, handling NaNs, and some other tasks. I obtained the same score after using XGBClassifier instead of RandomForestClassifier, although I didn't use early_stopping_rounds.
You see my friend, 90% of your score will come from data analysis and the remaining 10% from hyperparameter tuning. While I can't say for sure how effectively you handled the preliminary steps, here is a checklist you can use for a self-check:
Data Cleaning ->
Did you properly handle missing data and/or duplicates? What I mean by "properly handle" is that you didn't just fill the missing values blindly. The strategy should be guided by analysis and statistics, for example:
Filling missing values in Age ->
Did you just fill it with the median/mode? Did you check its distribution before and after filling?
Did you make a smart imputation by analyzing the data, e.g. extracting titles from names and seeing how these titles correspond to specific age brackets, or did you try to impute the missing values with machine learning instead? Did the distribution make sense after imputation?
EDA ->
Did you analyze what factors contribute to your target? Or did you just fill in the blanks and feed it to the model, hoping it magically outputs a high score?
What are the major driving features for survival? What contributes to it? What features do not contribute to it? Did you test the hypotheses using statistics? What were the results?
Feature Engineering and Selection ->
How do the features correlate with the target and with each other? Can we make a new feature to better capture a pattern? Did we select the best set of features for our goal? Are the features formatted and processed in a way that satisfies the assumptions of the models we are going to use?
Model Training ->
Run a baseline, check feature importance, and see which features contribute less, more, or not at all. Do the metrics reflect our expected outcome? What can we improve? Does our model overfit or underfit, etc.?
Hyperparameter tuning ->
Based on the baseline model analysis, do we need regularization? Maybe we need more trees or a limit on max depth... Let's use Bayesian optimization and see what range of parameters works best, etc.
There's so much going on beyond filling in the blanks and running the data through a machine learning model. This is just a summary of the things you can do; believe me, there is a ton of stuff that can guide you in building the proper model and the data needed for it.
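The title-based Age imputation mentioned in the checklist could be sketched roughly like this. The names and ages are made up (only the "Surname, Title. Given names" layout follows the Titanic Name column), and the regex is one assumed way to pull the title out.

```python
import numpy as np
import pandas as pd

# Made-up passengers in the Titanic name layout.
train = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Jones, Miss. Anna",
        "Brown, Master. Tom",
        "Clark, Mr. Henry",
        "Evans, Mr. James",
        "Hill, Miss. Rose",
        "Wood, Master. Paul",
    ],
    "Age": [22.0, 26.0, 2.0, 35.0, np.nan, 4.0, np.nan],
})

# The title sits between the comma and the first period.
train["Title"] = (
    train["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
)

# Median age per title, computed from the rows where Age is known.
title_median = train.groupby("Title")["Age"].median()

# Fill each missing age with its title's median, then fall back to the overall median.
train["AgeFilled"] = train["Age"].fillna(train["Title"].map(title_median))
train["AgeFilled"] = train["AgeFilled"].fillna(train["Age"].median())
print(train[["Title", "Age", "AgeFilled"]])
```

Plotting the Age distribution before and after filling is the check the checklist asks for.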
I sincerely appreciate your help. I attempted some modifications and obtained an accuracy of 79% in the spaceship-titanic competition.
heyyy https://www.kaggle.com/code/lakshay5312/titanic-eda/notebook can anybody check out my notebook and tell me what part I have missed? This is my first ML project and I tried to learn all the steps of preprocessing with it.
my accuracy resulted in 72% approx but why
All I did was copy/paste what was instructed but got this. Any ideas on why?
The test_data variable needs to be defined first, it seems that train_data was defined so see where it is and do the same for test_data
does anyone know what is gender submission dataset ???
That is a baseline model prediction based only on the 'Sex' attribute of the dataset: it predicts that all males die and all females survive. This achieves ~0.786 CV and 0.76555 LB... Gender is the most important attribute along with a few others.
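The gender-only baseline described above is small enough to write out. This sketch uses a few made-up rows with the same column names as test.csv and builds a submission-shaped table from them.

```python
import pandas as pd

# Made-up rows with the same column names as test.csv.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Predict survival for every woman and death for every man.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})

# Show the CSV text that would be uploaded.
print(submission.to_csv(index=False))
```

Any model worth keeping should beat this on cross-validation before it is worth a submission.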
Anybody else tuning a RandomForestClassifier atm? Just looking for someone to DM / bounce ideas/questions off of as I try to increase my score. Currently at 83 using StratifiedKFold / cross_val_score but I've been observing my submission test scores being consistently 3% less -- Is this normal or a sign I'm overfitting?
@everyone Question for the Group - I have been working through a Data Science bootcamp and wanted to keep sharpening my skills. I found this contest and figured I would give it a try...however, I am a little intimidated by the fact I am still fairly new to this Data Science arena. Does anyone have any thoughts on this?
I'm new as well George -- Dive in!
Yeah. I think it's normal to have that difference in scores when you get around 80% accuracy. It's pretty difficult to get beyond that score with a plain model.
do you know which data is compared with our predicted data?
Test labels... which should not be revealed
the same test labels which are given in the test.csv right?
No, test.csv has all the features used for prediction but it doesn't have the column 'Survived' which are the labels to compute your score
I mean that when we submit our predictions on kaggle how do they measure accuracy there has to be some reference
Yes. test.csv, if you notice, doesn't have the column 'Survived'... You have to predict the value of this column for the test set. Kaggle has the actual values, but they won't be revealed
Thanks actually I got confused with that gender submission dataset
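To make the scoring concrete: the competition metric is accuracy, so the score is the share of test passengers whose predicted label matches the hidden one. Both series below are hypothetical, since the real labels stay on Kaggle's side.

```python
import pandas as pd

# Hypothetical values: the real labels are held privately by Kaggle.
hidden_truth = pd.Series([0, 1, 0, 0, 1], name="Survived")
my_predictions = pd.Series([0, 1, 1, 0, 1], name="Survived")

# Accuracy = share of passengers whose predicted label matches the true one.
accuracy = (hidden_truth == my_predictions).mean()
print(accuracy)
```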
I got accuracy of 76 any suggestion how can I improve?
Start with basic baselines. Try to find which are the important features, how some features interact and try to improve over your baseline
Hello everyone. In short, the leaderboard scored my submission 0.78468, but cross_val_score(cv=342) reports 0.82115. Is this even legit? How is it counted? My work:
https://www.kaggle.com/code/leleleonid/titanic-data-type-optimization-randomforest
For a small dataset like this, this difference might be normal. Although possibly you can reduce that difference, but to increase the score you need to engineer better features.
@jolly matrix: To add to @charred tinsel 's advice, one thing that might also help is accounting for the distribution shift between training and test sets. Have you looked into this? FYI I'm working through this now so DM me if interested in talking out ideas and sharing observations.
I am new as well. Let's dive in and have some fun!
Hello guys I'm new
Notice me
By the way guys I have a question
I just got started with the titanic prediction.
Wanted to find out how y'all dealt with missing data especially in the AGE column.
Should I use the age mean to fill the missing part?
The missing values is quite a lot tho
Very helpful thanks a lot
Just visualize the data, it will become clear
Okay I should visualise before filling missing spaces?
I'll do that thanks
After filling the 177 missing values in the age column with median the skewness actually increased " (0.51)" compared to the way it was "(0.36)" when there were still missing values
I had hoped that filling the missing values with median would have given a distribution closer to normal based on logic/common sense
Log transformation and square root transformation didn't help matters.
Planning to try other methods I find on the net but what would you suggest?
I hope I'm not the only one getting such issues tho
Imputing with a constant value will generally disturb the distribution of that feature if there are many missing values, as frequency will peak at that imputed value.. this may or may not affect the model you use (depends). To preserve closely the original distribution, you may want to use a better strategy, like you can try to use other features to predict Age.
But if the feature is not so important, you may be just wasting some of your time..
I see, thanks for the quick response
Just for clarity,
you're suggesting I focus rather on selecting an effective predicting model to use that won't have issues with skewed distribution rather than focusing on normalising the age distribution right?
Maybe you can do that.. but on this particular dataset, i would say exploring and understanding the data in general is much more important than anything.
I see
Thanks for the insights
I may need to cancel my advice but want to hear others' opinions
Hey btw, what's the progress? You were working on that.. I was assuming both have very similar distributions.
I had discovered back on the 7th that distribution for some variables was different between training and test sets and had been accounting for it since then, until I realized today that accounting for distribution shift between training and test sets is a type of data leakage (IMO) because per competition rules the test set is supposed to be "unseen" data. I've begun adjusting my analysis to assess distribution shift between training/test folds of training set data only.
Ohk
But as long as you are not really peeking too much into the test data, and are using standard techniques for correcting distribution shift with information from the training set only, I'd say there won't be any data leakage.
what did this end up scoring after submission - curious seeing these nice numbers I'm at 0.79665 on my 3rd submission (my F1 is always close to my score so far going from 0.77 on my first try).
If you actually peek at the submission distribution (i.e. use the unseen test-set statistics) and bake that back into your training logic or final thresholds, you've leaked information from the test set.
Thanks for confirming @thick pagoda - I came to the same conclusion! Better late than never
Ofc np I'm new to this stuff but I instantly presumed most of the 1.0 (100%) models exploit this and its not the spirit of the exercise!
How did your models do? @waxen sage
I'm new as well! I love this comp - perfect way to dev skills. Revisions so far have clocked in between 75-78%; I got lucky with one 79% submission early on but haven't been able to reproduce and was really just throwing darts blind at the time. I'm currently working through a revision and hoping for some gains this week!
That's really good I think. I'm at 79.6%, and the problem now is that I saw 'the answer' to getting "more" isn't anything code- or model-specific, so I'll stop there for now. If the model's generalizing at that range it's a good model imo
its a good lesson in feature engineering above everything, I went in with lots of fancy code and although I did a pretty decent EDA phase, played with UMAP I missed some key connections in the data, to get those 80+ scores... good luck!
"I had hoped that filling the missing values with the median would result in a distribution closer to normal, based on logic and common sense."
This is generally true for datasets or features with less than 5% missing values.
However, in the case of the age column, missing values account for about 20% (train dataset only).
Anyway, they say a picture is worth a thousand words, so I did some visualizations to help you understand better:
What happens when you use the median to impute the 177 missing values is that all those values are dumped at a single point, which greatly distorts the distribution. The main goal of imputation is to fill missing values in a way that resembles the original distribution.
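The spike described above can also be seen in numbers. This sketch uses synthetic ages (a gamma sample, not the real Age column) with the same counts as the training set, 714 known and 177 missing, and compares the share of rows sitting exactly on the median and the spread before and after filling.

```python
import numpy as np
import pandas as pd

# Synthetic ages, NOT the real column: 714 known values and 177 missing.
rng = np.random.default_rng(0)
age = pd.Series(np.concatenate([rng.gamma(4.0, 7.5, 714), np.full(177, np.nan)]))

median = age.median()
filled = age.fillna(median)

# Share of all rows sitting exactly on the median before and after filling.
share_before = (age == median).mean()
share_after = (filled == median).mean()

print(f"share at median: {share_before:.3f} -> {share_after:.3f}")
print(f"std:  {age.std():.2f} -> {filled.std():.2f}")
print(f"skew: {age.skew():.2f} -> {filled.skew():.2f}")
```

Roughly a fifth of the rows end up on one value, so the spread shrinks; a histogram of `filled` shows the same thing as a single tall bar.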
this is my second discord btw, my first account got hacked. Anyway, using only gender as a feature scores 76 percent on this problem; you can use that as a comparison for your current model. The highest legit score I've seen so far without cheating is probably from cdeotte - 84%
Thanks a lot for the response Nixon
Do you feel using the median to fill 177 missing values is the wrong approach in this case, since the result is far from resembling the original distribution?
So in cases like this I should probably use smart imputation like you showed in the images you sent,
where I fill the missing values with values around the median range instead of filling solely with the median
So sorry about that
Hmmm okay I see
Still working on mine, will compare when done
Thanks
Yes, always decide how to impute your missing values based on the insight you gain from analyzing the data, and don't resort automatically to simple imputation unless the missing data is small enough that it won't distort the distribution
I see
Now I know my mistake. Thanks for the tip Nixon I'm super grateful
No. By "accounting", I'm assuming you are referring to covariate shift; checking whether there is a difference in "feature" distributions is okay.
You're welcome
After the leaderboard switched to scoring on 100% of the test data some years ago, the scores dropped. Now the highest achievable score would be around 83%, I believe.
I am currently at 81.8%
Okay so you mean rather than blindly imputing median to missing values, we should analyze and visualize the data and then impute missing values
So if I have outliers in my data should I blindly go for median or like what?
Try to understand the data in general... how that feature with missing values correlates with other features, or how the missingness of the feature may be correlated to another feature (this would help you to streamline your strategy).
Sometimes imputing with the median may be perfectly sufficient, sometimes it may be the best option available, and other times it may even make things worse. You can only find that out once you understand the data.
Impressiveee
For this competition, is accounting for distribution shift between training and test (submission) data sets an example of data leakage?
Hi chat, I'm a new learner on Kaggle and I'm trying to make a notebook submission for the Titanic survivor prediction competition. But even though my output file is created and visible, the competition won't accept the notebook when I click "Create Submission"
any idea why? I can send screenshots if necessary
Hey everyone
Here is how I achieved 82.5% accuracy on test set. I tried to apply the most of what I've learned through my own experimentations, along with insights I gained from public notebooks/discussions.
https://www.kaggle.com/code/a00000100/titanic-machine-learning-from-disaster
Do check it out.
I encountered this issue when my notebook had an error and wasn't able to execute completely. Have you confirmed no errors are visible via the Kaggle UI when scrolling through your notebook?
Anyone get score 0.8+?
what is the submission accuracy u got?
Submission accuracy is 0.8253.
CV was ~0.85
with logistic regressor the best I could get was: 0.77272
what did yall do with cabin column? did you guys just drop it ?
i dont even see any relevance of cabin feature for our target. should i drop it or what should i do ?
No, the cabin is actually useful. Most of the cabin data is missing, but if you look through the data, the cabins that have values have prefixes like C85, E46, etc. The C and E represent the decks. The lifeboats were kept above the higher decks, i.e. closest to A, B, C, so the people living on decks A, B, C had a better chance of survival than those on E, F, G, and of course a better chance than people who didn't have cabins (who lived in dormitories rather than cabins, which is probably what the large number of missing values represents, since the rich people who used cabins were fewer).
When I added this deck feature along with some more features my score went from 0.7751 to 0.78708
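A minimal sketch of the deck feature described above, on made-up cabin values: take the first character of the cabin code as the deck letter and give missing cabins their own category. The placeholder letter "U" (for unknown) is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up cabin values in the same format as the Cabin column.
train = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", "B57 B59", np.nan, "G6"]})

# First character of the cabin code is the deck letter; "U" marks missing cabins.
train["Deck"] = train["Cabin"].str[0].fillna("U")
print(train)
```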
I got 82% accuracy using RandomForest.
That sounds cool. I'm already getting 82 using RF without cabin. But this insight sounds useful. I'll try using it in the next iteration. Thanks for it bro.
Impressive
0.787 with a binary classification NN. I might try some more feature engineering later. For now I did HasCabin?, one-hot encoded titles (Mr, Mrs, Miss, Other) and Embarked, and combined Parch + SibSp = FamilySize.
I want to try ranking the decks, and changing the 'Other' title to their respective genders.
Fun challenge, would recommend.
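The features listed above could be built along these lines in pandas. The rows are made up, the title regex is an assumed way to extract the title, and FamilySize follows the Parch + SibSp definition given in the message (some notebooks add 1 for the passenger).

```python
import numpy as np
import pandas as pd

# Made-up rows using the Titanic column names.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann", "Poe, Dr. Sam"],
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", "S", "Q"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
})

# 1 if a cabin is recorded, 0 if it is missing.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# Family size as described: parents/children plus siblings/spouses.
df["FamilySize"] = df["Parch"] + df["SibSp"]

# Keep the three common titles and lump everything else into "Other".
title = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
df["Title"] = title.where(title.isin(["Mr", "Mrs", "Miss"]), "Other")

# One-hot encode the categorical columns.
features = pd.get_dummies(
    df[["HasCabin", "FamilySize", "Title", "Embarked"]],
    columns=["Title", "Embarked"],
)
print(features.columns.tolist())
```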
i got 70% Accuracy using Gradient Boosting Classifier
i got 78% using logistic
Hi,
I've got 75% accuracy. My code is on GitHub; do you have any advice to improve it?
https://github.com/Jeremy-Duval-PhD/Kaggle_Titanic
Is your code public ?
Not yet. But I'm planning to make it soon.
I'll let you know once it's out.
Thank you !
Hey guys, quick question: I haven't previously worked on anything very substantive in AI/ML, so I was wondering if this Titanic project is doable for me? Any tips/suggestions as to where I can start are also appreciated!
i used algorithm evaluation
Everyone needs to start somewhere, and this is a pretty good problem to get started on. Try getting familiar with general visualisation tools and Pandas, figure out how to manipulate the data & select some features that seem useful, then try running some Sklearn classifiers on it & assessing accuracy.
thank you
It's a good way to start. I can give you the link to my commented code if you want.
Thank you, if I end up needing it I'll definitely shoot u a dm, appreciate you