#🚢┊titanic

final cliff
#

What was everyone's experience like with the titanic competition? It's a very popular competition for those who are just getting started. I'd love to hear about your past experience, or current if you're getting started!

acoustic storm
final cliff
acoustic storm
daring sundial
#

My past experience with Titanic was great. It was what nudged me towards getting the Notebooks Expert title.

desert epoch
#

hello dear kagglers, I'm a complete beginner 101 and happy to join the titanic competition. any Godfather to mentor me? any help will be appreciated

final cliff
#

Hi Frank, your test post went through just fine. Going to delete it now to keep the channel on topic.

strong pumice
#

@desert epoch have you started the titanic exploration yet? I can help

golden schooner
#

Hello everyone, I am a baby Kaggler it would be great if someone could guide me with the Titanic challenge

strong pumice
#

@golden schooner I can guide

#

@golden schooner if you are still looking for assistance maybe we can link up and go over some things. I use Rstudio

golden schooner
#

So how do you want to link up? Should we have a Google meet?

#

I'm not familiar with Rstudio but I can catch up

strong pumice
#

@golden schooner yes yes.

#

@golden schooner what programming language are you familiar with?

golden schooner
#

Python

strong pumice
#

@golden schooner okay, no problem. Now, is this your first time using Kaggle?

golden schooner
#

Yes it is ...

final cliff
#

Thanks for helping out, @strong pumice !

strong pumice
#

@golden schooner I am available now

#

@final cliff no problem at all

robust rivet
#

Has anyone tried k-means on the titanic dataset, I haven't seen it anywhere.

ionic arrow
#

I'm not sure, but maybe it is possible if I use the Name feature to find people in the same family?

robust rivet
safe quarry
#

Hi, I'm just wondering if an accuracy score of 83% obtained without cheating is considered good (as in, I should be proud) for the Titanic competition, or if it's average, high average, low, bad, etc. I just want to know if I should keep working on it.

rough mortar
final cliff
#

Great work, @rough mortar. Thanks for sharing!

pastel cosmos
#

Hi everyone, I saw in one of the posted notebooks for this competition where the creator of that notebook changed values into categories before fitting models.

For example, the values of the age feature were changed to 1, 2, 3, 4 where 1 was the youngest and 4 was the oldest and the age feature was dropped entirely.
Is this a recommended practice for numeric columns? Or is it a different way of normalizing data? Can I leave the age feature as-is? Or is this just one of many approaches that can be tried before fitting?

ionic arrow
fiery marsh
#

Am I calculating a score right?

from sklearn.ensemble import RandomForestClassifier

X_train = train_data.drop(["Survived","Ticket","PassengerId","Name","Cabin"], axis=1)
Y_train = train_data["Survived"]
X_test = test_data.drop(["PassengerId","Name","Cabin","Ticket"], axis=1).copy()
random_forest = RandomForestClassifier(max_depth=4)
random_forest.fit(X_train, Y_train)
# Accuracy measured on the same rows the model was fitted on (training accuracy)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

I checked a score based on X_train and Y_train.
Should I use X_test and Y_test from test.csv?
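
A note on the question above: `test.csv` has no `Survived` column, so there is no `Y_test` to score against locally. A common workaround is to hold back part of `train.csv` as a validation set. Here is a minimal sketch of that idea, assuming synthetic data from `make_classification` as a stand-in for the preprocessed features and an 80/20 split (both are assumptions for illustration, not part of the original code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic features (assumption).
X, y = make_classification(n_samples=891, n_features=6, random_state=0)

# Hold back 20% of the labelled rows; the model never sees them during fit.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

random_forest = RandomForestClassifier(max_depth=4, random_state=0)
random_forest.fit(X_tr, y_tr)

train_acc = round(random_forest.score(X_tr, y_tr) * 100, 2)  # rows seen in fit
val_acc = round(random_forest.score(X_val, y_val) * 100, 2)  # unseen rows
print(f"train: {train_acc}%  validation: {val_acc}%")
```

The validation number is usually the more honest estimate of what a submission might score, though it is still only an estimate.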

finite ermine
#

Hey, I'm new to ML and I'm working on the Titanic competition. What is the best way to deal with the missing Cabin values? I don't think I should drop the rows because there are a lot of missing values.

mint path
#

Survival rates by age did correlate to children first, but survival rates by fare group also show that the expensive titanic seats were prioritized on the lifeboats. More insights from @rough mortar's HackerNoon story about his Titanic entry: https://hackernoon.com/how-likely-was-one-to-survive-on-the-titanic

Only 38% of the passengers survived this devastating event, prompting me to wonder about the individuals who were aboard the Titanic that fateful night.

final cliff
vagrant pebble
#

Thank you

eternal mantle
#

is it too late to start this competition? i just found out about it

sinful notch
#

It’s always available

#

And it has a two-month rolling leaderboard

eternal night
#

@strong pumice please help me on the Titanic competition

strong pumice
#

Okay

minor turtle
#

Hey! I am up for that too, if you want

ionic arrow
#

Can someone suggest the best model for this competition?

sinful notch
#

^^

#

Keep getting 60-70% with sklearn’s LogisticRegression and SVC

eternal night
#

@strong pumice Give your time and online place.

eternal night
#

@minor turtle thank you for showing interest to help me on the competition. I and @strong pumice have agreed a date for the meeting. Perhaps it will be supper when you are in copy.

tepid lark
#

Hey guys! Saransh this side, whats up?

north pagoda
meager flare
sinful notch
#

Imo, I’ve heard ppl say NNs are overpowered for structured tabular data like this one

desert epoch
#

If you showed me this notebook like a month ago, I wouldn't have understood like ANYTHING at all

steel thorn
#

how to solve titanic data for accuracy 1.0

sinful notch
#

^^

mild echo
#

Hello all, I'm Tamunotonye Samuel Solomon Inioribo,

I am new to Kaggle competition and would like to be part of the Titanic. Though I have done a couple of solo projects on ML, I will be glad to be part of a team for this... I can work on R-studio, Jupyter and other IDEs.

Thanks

fringe mesa
#

hi+

desert epoch
#

hello

hearty veldt
#

I couldn't find any way to create a team for the Titanic competition. I read the docs but I cannot find the Team Tab or Team section. Can someone help me by pointing out the link or the button to create a team?

swift igloo
#

Let me know if this helps @hearty veldt

hearty veldt
plush musk
hearty veldt
small crescent
#

https://www.kaggle.com/code/vanessah26/titanic-79-accuracy-using-rfc
Hi everyone, I'm a CS student and started learning about data science this Summer. I joined Kaggle in the middle of August and I've been learning a lot from this community! I just finished my first competition. Please check out my first notebook and give some advice or feedback.
Much appreciate it : ), Happy data analyzing!

pulsar sail
dim ruin
#

🎀 Hello everyone;

In this analysis, I explore the Titanic dataset through Exploratory Data Analysis (EDA), conduct statistical analysis, and build predictive models to understand and predict passenger survival on the Titanic. This project uses Kaggle's Titanic dataset for comprehensive insights and predictions.

Please check out my notebook and give some advice or feedback. If you like it, please don't forget to upvote it. Happy data analyzing!

🔗 https://www.kaggle.com/code/huseyincenik/titanic-eda-statistical-analysis-and-prediction

grim hornet
#

Hello everyone,
I am new to kaggle, today itself I participated in titanic challenge and want someone to guide me. I am really a novice to the field of machine learning.

ionic arrow
dim ruin
ionic arrow
fluid ocean
#

Hi all, I have absolutely no idea how to start this challenge. Any advice on the learning material I should look at?

ionic arrow
fervent orchid
#

import pandas as pd

# Define the bin edges and labels
bin_edges = [0, 10, 25, 45, 55, 100]  # Define your desired age bins
bin_labels = ['Children', 'Young', 'Adult', 'Late Adult', 'Old']

# Use pd.cut() to bin the Age column (assumes train_data is already loaded)
train_data['AgeGroup'] = pd.cut(train_data['Age'], bins=bin_edges, labels=bin_labels)

fluid ocean
#

I dropped the following columns because I think they are completely unnecessary and cannot be used: Cabin, Ticket, Name. Am I right here? Should I drop Embarked as well?

#

How do I determine which model and features to use? To me it seems like a regression task. Are there metrics that I can use?

grim granite
#

My Titanic Solutions:

glad stone
grim granite
#

Thanks @glad stone , I have read and liked your Notebook.
Can you explain to me the usage of some of these Classes like:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,classification_report

What does StandardScaler do, and how does KNeighborsClassifier work? I would like to know more about them.

chrome rain
#

If you want to tune your hyperparameters, split the initial dataset into train and test parts, for example in an 8:2 ratio, and then do the tuning against that held-out test part.

glad stone
glad stone
keen kayak
# grim granite Thanks @glad stone , I have read and liked your Notebook. Can you exp...

StandardScaler is used to rescale numeric features so that they have mean=0 and variance=1. Helps avoid giving undue importance to features with large magnitude over those with small magnitude. More details in the scikit-learn docs https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling.
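
To make that concrete, here is a small sketch of what the rescaling does. The two columns are made-up numbers on different scales (an Age-like and a Fare-like column), chosen only for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (Age-like and Fare-like).
X = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])

# fit_transform learns each column's mean and standard deviation,
# then returns (x - mean) / std for every value.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns live on a comparable scale, which matters for distance-based models such as KNeighborsClassifier.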

grim granite
#

Also, I was wondering how this scaler function treats categorical values or mapped categorical values.

#

For example, I converted male, female to 0, 1 and keras.Normalizer() was treating it as -0.6-ish value and +1.4-ish value.

#

Is that ok?

grim granite
#

Ok, I just checked your Notebook and it says your score is 0.76555, which means the accuracy is 76.5%.

I think you are not clear about how accuracy works. The accuracy you get during training is heavily biased and means little even if it is 100%. We cannot conclude much from it, because the model is being scored on data it has already seen.

For example, suppose you only do the example problems from your textbook, and to test your skills I give you those same problems. Your score is heavily biased even if you get 100%, because you could just memorise the answers and write them down as-is. Only when you are given new, unseen problems can we be sure that your skills apply generally to problems of a similar nature that you have never seen.
That is why training and testing data are kept separate. You cannot judge a model by the data it has already seen. You have to let it predict results on new, unseen data to get a sense of its predictive ability.

So, you should always submit your model's predictions before telling others your model's accuracy.
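
The memorisation point can be shown with a tiny experiment. This is only a sketch on random data (an assumption, not the Titanic set): the labels have nothing to do with the features, so there is nothing real to learn, yet an unpruned decision tree still scores perfectly on the rows it was fitted on.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))      # random features
y = rng.integers(0, 2, size=400)   # labels unrelated to the features

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fitted on: it memorises them.
train_acc = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: there is no signal, so expect roughly a coin flip.
cv_acc = cross_val_score(tree, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The gap between the two numbers is the bias described above, which is why only scores on unseen data are worth reporting.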

keen kayak
tender shale
#

Hello! How do you interpret Parch and SibSp in the dataset? I'm having trouble interpreting them because they are only numbers. For example, since the majority of the passengers have a Parch of 0, does that mean that all of them have a nanny accompanying them, or are they alone? What about adults whose Parch is 0? Is it a nanny?

grim granite
#

I am currently sad that my accuracy does not go beyond 80% for Titanic.

fluid ocean
soft patio
#

Do deep learning techniques work better for the Titanic, or do simpler techniques like random forest work better?

#

I was able to achieve around 80% accuracy with a random forest and hyperparameter tuning

#

would I get better result using a simple ANN, or would I be better off just using the random forest from sklearn?

coarse plume
#

you got 80% on the submission?

strange ridge
#

Hi I am starting titanic competition so I need a team for it

soft patio
#

Using stratified split

safe orbit
#

Hey guys, I got 0.78229 on my first submission. Could anyone look over my code and offer some suggestions? This is my first ML project and I want to also make a YouTube video on how I built it out and such, I know I still need to leave a lot more comments/documentation and clean up a few sections

https://www.kaggle.com/ryannolan1/titanic-wip-78-accuracy

keen jasper
#

Hello Everyone😁
I just finished working on this competition and actually enjoyed it very much!
in my notebook I focused on EDA, feature engineering, and diagnosing missing values.
Feedback is an essential step for learning, so I would love to hear your input, guys!

https://www.kaggle.com/code/leen98/eda-and-feature-engineering-the-titanic-sinking

strange canyon
#

Hello everyone,
I got a 0.559 on my first submission. I tried to use variations of the dataset but I wasn't able to increase the score. What are the key factors to look out for in data preparation that could increase the performance of the model? I use a RandomForestClassifier and the accuracy on the train data is around 0.9, which confuses me even more, because I don't understand how the difference between the accuracy on the training data (0.9) and the test data (0.559) can be so big. I hope someone can help me with these problems!

sturdy bridge
#

i saw a zero score after submission, is that even possible?

safe orbit
#

Probably formatted it wrong

strange ridge
#

need partners for this project

safe orbit
#

Let me know what you think, I recorded a full 2 hour walkthrough of my code: https://www.youtube.com/watch?v=6IGx7ZZdS74&ab_channel=RyanNolanData

Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!

In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.

As Kaggle states: "The competition is simple: use machine learning to create a mode...

#

I want to make a part 2 with improvements, so if you see any mistakes or ways I can make it better please lmk

manic sentinel
vapid nymph
#

Hey everyone, I scored a 77% on the Titanic Dataset using a random forest. I'm seeing that the top scorers on the leaderboard have a perfect 100%. Would this count as overfitting the data? Is it possible to actually score a 100%?

sleek minnow
signal solar
tough portal
#

Hello Kagglers!! I have just joined Kaggle and this is my very first competition. I was following the tutorial to get started but when I copied the code for women and men who survived to find out the percentage it is showing me an error. I exactly copied it from the tutorial. Can anyone help with this?

vast summit
#

completed the challenge...... hope you guys find this helpful

ivory magnet
#

And which error is it showing to you?

tough portal
#

It’s working now

#

Thank you

near peak
#

Hello all, i am new to kaggle and also to data science 🤪
starting with this competition now, are there any teams or is there anyone who would like to team up?

ebon sun
#

I’ve found it’s so hard to get to .80 on the submission haha, but alas that’s my goal. Got to 0.78 last night. Anyone wanna chat to see if they have ideas to get my random forest to 0.80?

ebon sun
#

Here's my 0.78229 score that, for the life of me, I can't improve. Let me know if there's anything I can do to push it forward.

ebon sun
near peak
ebon sun
#

Have you used one of those models with this one?

cunning abyss
#

what's the score for default submission (just copy/pasting the tutorial)?

near peak
# ebon sun Have you used one of those models with this one?

Yea i got better result with decision tree, but later i tried to change the hyper parameter and got score better for decision tree as well as for random forest (here i am talking about the individual score of the model and not the kaggle submission score 🤪)
I am still trying to improve my overall submission score

jovial bronze
#

chatgpt told me you can get high 80s, low 90s without cheating. is that true?

safe orbit
#

Releasing the video + notes next week

obsidian pasture
#

Hi, my score right now is 0.78 on the leaderboard. I am using deep learning. My question is whether this DL approach is suitable for the challenge, or is it better to use traditional ML like random forest or similar? Thanks!

merry marlin
#

how do I confirm my account? I can't find my country in the phone number codes to send the confirmation SMS

safe orbit
#

@obsidian pasture try multiple models. A voting classifier gave me the best results although I didn’t use deep learning
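
For anyone curious what a voting classifier looks like in scikit-learn, here is a rough sketch. The three base models, their settings, and the synthetic data are assumptions picked for illustration; this is not the setup from the notebook mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for preprocessed features (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)

# Score the whole ensemble on held-out folds.
scores = cross_val_score(voter, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Combining different model families tends to help most when their mistakes are not the same, so it is worth comparing the ensemble against each base model on its own.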

restive kestrel
#

Is it possible for two ppl to use the same model but get different results?

whole kindle
#

yes...

restive kestrel
#

Has anyone gotten a score above 0.77511? If so, which method did you use?

safe orbit
#

Yes I have a 0.79

#

And it's very possible to have different scores

#

Today we are taking a look at how I was able to improve my Titanic Kaggle score up to a 0.79, which was good enough for the top 9%.

I showcase all the code changes and what I would still improve on, if I had more time.

I'll be adding notes to the Kaggle Notebook if interested.

Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/titanic-v...

restive kestrel
#

thx @safe orbit

safe orbit
#

Part 1 is also uploaded as well as all the model tutorials so check them all out

#

Housing predictions is being worked on now as well as writing scores

#

But writing vid will have to wait till Jan. I won’t win but I think it’s better that way

desert epoch
#

Hi everyone! Is this the right page for the titanic competition? First time in Kaggle for me and not an expert on discord 🙃

safe orbit
#

Yes

west bolt
#

I wanted to get a sense of how good my result is - using NN with some hyperparameter tuning, scoring 78% tops on the leaderboard. I understand that random forest probably performs better on this dataset, and I'm not currently using ticket #/name to identify groups.

Is that a good score within those constraints, or does that indicate that there are issues with my architecture/feature engineering/etc?

steep edge
#

In this video we build a model, which predicts titanic survivors with a decent accuracy.

Kaggle Challenge: https://www.kaggle.com/c/titanic

#

Was following this tutorial, but when applying the pipeline for all the preprocessing on the train set, it gives me a reshape error

lost ermine
steep edge
wraith ridge
#

I know I'm a bit new, but in order to submit my csv, I should just click "Submit Predictions" and upload my csv, right? I've tried 3 different browsers, different security settings, verified my account, and a different computer, but I can not click that button. Am I missing something?

gaunt cedar
#

Hey everyone, I'm new to Kaggle. I've created a deep NN model for this project and I've got 80% accuracy on my validation set. However, I'm yet to fine tune the model so I'm hoping to get slightly better results. I wanted to ask if there's a way to check the bayes error for this project? What is the highest accuracy that has been achieved for this model without cheating. I saw the leaderboards where people have achieved accuracy = 1 , which is certainly not possible.

gaunt cedar
#

Anyone currently working on this project would like to discuss? I've done some fine tuning to the model and so far I'm at 82-83% accuracy. Today I'll build out multiple models with randomly sampling hyperparameter values and I'm targeting to reach 85% accuracy.

safe orbit
#

Yeah, a lot of people at the top of the leaderboard cheat, it’s stupid

#

On this project

#

Not all projects

safe orbit
#

Also @gaunt cedar I’m currently learning PyTorch, would def like to see what you’re working on. I did standard ML models from scikit-learn, XGBoost and a voting classifier, and it got a top 10% score

shadow cave
#

Hello

#

I need help with the prediction model of Titanic survival data

arctic rune
#

@shadow cave i might not be able to help you but can i know the issue ?

#

how to increase accuracy ?

#

i don't know where to start

shadow cave
#

@arctic rune thank you for your response. I need help with the feature process of how you could take any value as a feature.

arctic rune
shadow cave
#

Are you also a beginner?

arctic rune
#

yes, very beginner

heavy raptor
#

Hello @shadow cave and @arctic rune I just started working with this dataset yesterday after joining kaggle. If either or both of you are interested in working together message me and we can work through it.

shadow cave
#

@heavy raptor ok

shadow cave
#

@heavy raptor bro let me know i'm working on it. We can perform it together.

gaunt cedar
steep edge
gaunt cedar
quiet needle
gaunt cedar
steep edge
quiet needle
steep edge
gaunt cedar
steep edge
gaunt cedar
#

You need to elaborate a little. I'm also using classification using a neural network.

steep edge
#

What I want to ask is: what advantages are you getting from using a NN, when we could just use regular classification without a NN?

#

Why make it more complex?

gaunt cedar
#

I'm assuming that you're asking why not just take the input features and pass them through a classification algorithm directly, let's say binary classification. And use the output that we get to make a prediction, right?

gaunt cedar
#

Okay, so here a problem with that...

#

Let's say you have this data and you have to classify whether an input is 1 or 0...

#

All you do is train your algorithm using binary classification and it will learn a decision boundary (which is the straight line here) that separates the 2 classes. Now your input gets the value 1 or 0 based on which side of the decision boundary it lies.

#

Makes sense?

steep edge
#

Yes

#

This is a very basic problem got it

gaunt cedar
#

Okay so now let's say I give you this dataset...

#

How do you draw a straight line (decision boundary) to separate the 2 classes?

steep edge
#

We can simply use random forest with this

#

Why do we need a straight line... The distance of all points from the straight line determines the loss

gaunt cedar
#

Yes you can, but what if you want to use classification algorithm instead.

steep edge
#

All of these algorithms are classification algo

gaunt cedar
gaunt cedar
gaunt cedar
#

Nevermind, I misunderstood what you were saying. But the answer to that is, yes you can. As far as I know, using something like random forest can be way more efficient here, both in compute and in time. And we may well get the same results as with a deep NN for this titanic dataset.
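
The pictures from this exchange are not in the log, so as a stand-in here is a small sketch of the "no straight line separates the classes" situation. It assumes an XOR-like toy dataset (opposite quadrants share a label), which may not be the exact data that was posted. A linear model (logistic regression) is compared with a random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
# XOR-like labels: points in opposite quadrants share a class.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Logistic regression can only draw one straight decision boundary.
lin_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# A random forest combines many axis-aligned splits, so it can carve out quadrants.
rf_acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"logistic regression: {lin_acc:.2f}")
print(f"random forest:       {rf_acc:.2f}")
```

So a non-linear boundary does not require a neural network; tree ensembles get there too, which fits the conclusion reached above.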

#

I'm just using NN for the sake of practicing.

steep edge
#

Yes i knew it just wanted to know

#

Thnx

gaunt cedar
#

no prob. So what's the status of your project?

steep edge
#

Hmm, I am applying the algorithms, all preprocessing is done

gaunt cedar
#

right.

winter mica
#

Hey! I'm wrangling with the 'Cabin' data (or lack thereof) in the Titanic set. I'm toying with the idea of playing detective, like using ticket numbers or fare details to guess the missing cabins. Or maybe taking a shortcut by plugging in the most common cabin for each class as a starting point, but there are a lot of missing values. I'm curious about any other possible approach. How would you handle this? Looking forward to your insights.
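
One possible way to sketch the "most common cabin per class" shortcut, plus a plain missing-value flag as an alternative. The tiny DataFrame is made up, and the names `HasCabin`, `Deck` and `DeckFilled` are hypothetical columns introduced here for illustration. Using only the first letter of the cabin (the deck) is also an assumption, since full cabin numbers rarely repeat.

```python
import pandas as pd

# Made-up rows in the shape of the Titanic columns (assumption).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Cabin": ["C85", "C123", None, None, "G6", None],
})

# Option 1: keep the fact that the cabin is missing as its own feature.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# Option 2: take the deck letter and fill gaps with the most common deck per class.
df["Deck"] = df["Cabin"].str[0]
deck_mode = df.groupby("Pclass")["Deck"].agg(lambda s: s.mode().iat[0])
df["DeckFilled"] = df["Deck"].fillna(df["Pclass"].map(deck_mode))

print(df)
```

Note that the lambda would fail for a class with no known cabins at all, so real data may need a fallback value such as "U" for unknown.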

winter mica
#

🤷‍♀️ Anyone there???

arctic rune
winter mica
arctic rune
winter mica
arctic rune
winter mica
oblique perch
#

what are titanic and spaceship titanic?

oblique perch
#

so these are like simple datasets to start with as a beginner?

safe orbit
#

yes

desert epoch
#

.

safe orbit
#

If you need help I made 2 vids

desert epoch
#

THIS IS GREAT ! thank you!

safe orbit
#

No problem

jovial bronze
#

should you normalize categorical data being fed into a neural network like df[column] = (df[column] - column_mean) / column_std? the numbers are like 0 to 10... normalizing makes sense to me for like Age but for like Cabin prefix I don't know. except that it seems weird to not normalize some columns if I normalize others

winter mica
jovial bronze
#

Right was just reading about one-hot encoding... I think I'll try that one out thank you
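
A quick sketch of that split: z-score the numeric column and one-hot encode the categorical one. The column name `CabinPrefix` and the four rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "CabinPrefix": ["C", "E", "C", "B"],
})

# Numeric column: z-score, as in (df[column] - column_mean) / column_std.
df["Age"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()

# Categorical column: one 0/1 indicator per category instead of a z-score,
# so the network is not told that "E" is somehow larger than "B".
df = pd.get_dummies(df, columns=["CabinPrefix"], dtype=int)

print(df.columns.tolist())
```

The 0/1 indicator columns are already on a small fixed scale, so leaving them unnormalized next to a standardized Age column is generally fine.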

void birch
#

Hi everyone, im a uni student trying to get started in data science. What sort of pre-requisite knowledge would I need to get started, specifically this (Titanic) competition?

gaunt mulch
twilit cloak
#

Hii all I am Bhimana i am new to kaggle and ML competitions . Hope i will mingle with you soon

hollow brook
#

I want to learn all about this machine learning and data analysis from where should I start?
I don't have any prior knowledge about coding stuff where I should start from...

sage grove
#

Hey try starting with learning pandas and Numpy and some basic knowledge on statistics

safe orbit
#

Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!

In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.

As Kaggle states: "The competition is simple: use machine learning to create a mode...

#

This is my titanic project. Everything in this vid I’ve covered on my YouTube channel

urban girder
#

Hi everyone, I have a question: how come there are so many 100 percent scores on the leaderboard? As a beginner, that sounds practically impossible to me unless the test data is exploited in some way. I have built many models with feature engineering, and 82 percent has been my highest score.

safe orbit
#

They cheated

urban girder
safe orbit
#

It’s a training competition

desert epoch
#

In how many hours can this be done

obsidian pasture
#

Hi everyone, I have a question not directly related to this competition but more about embedded models, I’ve heard that for most of the competitions to get a decent result (0.9) embedded models are the right way. Is that true ?

desert epoch
desert epoch
#

I have a score of around 0.779 in LB. Does anyone have any idea on how to improve it?

ruby pagoda
#

gridsearch

#

worked for me
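
For reference, a grid search in scikit-learn can look roughly like this. The model, the parameter grid, and the synthetic data are assumptions for illustration, not the settings that were actually used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for preprocessed features (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Every combination in this grid is tried with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is a cross-validated estimate, so the leaderboard score can still come out somewhat different.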

flat rapids
#

how is the name column given? is it with place and name?

scarlet mauve
#

Hi guys! I need some advice.

I have some knowledge of sklearn, pandas, numpy, and matplotlib, but I can't measure it. I want to understand where I am and how I can measure it.

Maybe you can recommend some courses that cover basic/intermediate/advanced use of these libs.

Thanks a lot!

cobalt grove
#

Hello there !
I am new to data analysis
I have a question: let's say I download the Titanic data, okay, what should I do? Let's say I calculated the median for the data, what is the purpose? I don't have the logic of a data analyst. Could someone please help me or give me resources to learn these concepts?

cobalt grove
safe orbit
#

sweet, and I have 2 parts so should help

deep holly
#

i've been getting 0.57177 for the kaggle submissions even if the .csv file had different results. anyone else experiencing this?

lament sand
#

hi

#

I want to learn Python with a person

toxic venture
coarse cedar
#

I just saw some code for titanic. It's too overwhelming. Is it always like that?

reef obsidian
#

The Titanic dataset and code are among the simpler ones. Which code are you looking at? The top rated notebook is pretty simple to follow. It uses a RandomForestClassifier model to do the prediction.

marsh lava
sage grove
# cobalt grove Hello there ! I am new to data analysis I have questions that let's say I downlo...

Hey @cobalt grove i would like to answer your questions:

-First try to understand the data given to you, like what each feature tells you about the passenger, e.g. what SibSp is.
-Get the data types of each column and determine which are continuous and which are categorical.
-Check for null values and deal with them; using SimpleImputer is the basic way to do this. Use the median for continuous data and the mode for categorical data.
-We use the median rather than the mean so that outliers don't pull the filled-in value (a boxplot shows this).
-Then learn which graphs are used to compare which type of data.
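
The null-handling steps above can be sketched like this, on a small made-up frame with Titanic-style column names (the values are assumptions, not real passengers):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": pd.Series(["S", "C", "S", np.nan, "Q", "S"], dtype=object),
})

print(df.dtypes)        # which columns are numeric and which are categorical
print(df.isna().sum())  # missing values per column

# Continuous column: fill gaps with the median.
df["Age"] = SimpleImputer(strategy="median").fit_transform(df[["Age"]]).ravel()

# Categorical column: fill gaps with the mode (most frequent value).
df["Embarked"] = (
    SimpleImputer(strategy="most_frequent").fit_transform(df[["Embarked"]]).ravel()
)

print(df)
```

When there is a separate test set, fitting the imputer on the training rows and reusing it on the test rows keeps the fill values consistent between the two.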

#

For prediction use ML algorithms

#

Refer to my Medium article for a detailed explanation

sage grove
#

Yeah dude any doubts just ping me on my dms

cobalt grove
tidal goblet
#

Hi friends 👋
I have recently completed my Data Science, Machine Learning and AI course, and in the final exam we had 25 questions and I got 23/25 correct.
I also play with the heart.csv dataset, the Titanic dataset and other datasets from Kaggle through Google Colab and a mobile terminal application.

tranquil shoal
#

hey @everyone even though i am in the top 10%, i still didn't get a bronze medal in the titanic competition. what's the deal?

tranquil shoal
runic finch
twilit cloak
#

Hello guys, I just started competitions on Kaggle with the Titanic problem. I am unable to understand which algorithm to apply here. How do I succeed in learning with competitions?

sage grove
marsh lava
desert epoch
#

Good Afternoon, I'm new to the whole competition thing, I wanted to know if I have to build the model from scratch.

buoyant frost
#

No, you just use external libraries and call them

desert epoch
#

how can I improve the accuracy ? i used logistic regression function which is built in sklearn.linear_model. when i counted the number of correctly classified examples and divided it by the total number of examples, i got 0.6067

desert epoch
buoyant frost
wet drift
#

Is there any way to see other's code? I really want to know what did the 1.0 accuracy one do differently?

gilded tusk
#

It is supposedly impossible to reach an accuracy of 1.0, the guy is probably cheating to get that result.

safe orbit
#

@wet drift 100% is impossible, but I got a top 10% score and shared my code

wet drift
wet drift
safe orbit
#

only if they shared it

wet drift
#

Okay, Thank you once again

desert epoch
safe orbit
#

no problem

safe orbit
#

GL

buoyant frost
#

why is it impossible?

#

@safe orbit @gilded tusk

#

And how can you cheat for a competition?

safe orbit
#

No model will predict 100% as some of the results are random

#

People hard code the results

#

Change excel file

buoyant frost
#

Then why does that work for some competitions and not for others?

hexed sail
#

Sorry if dumb question, new to kaggle, if it's known to be impossible to get 100% why aren't the 1.00000 scores simply removed from the leaderboard?

#

It gives the impression to newcomers that 100% is achievable.

low linden
# hexed sail It gives the impression to newcomers that 100% is achievable.

It's not impossible - even if the results weren't datamined a model could just get very lucky and guess through the noise. In practice, if a model is ever getting 100%, something is probably wrong with the testing methods as well; a perfect score is a bad sign in data science.

Also people (read: interviewers) care about this problem about as much as hello world.

#

I trust a 95% so much more than a 100% in almost every real-world problem. BUT technically it is theoretically possible, I would just have to do an ablation study to examine how the model's able to achieve that score as well as a comprehensive study of the test set.

hexed sail
#

Great info thanks!

low linden
#

Np, now that I backscroll there are a lot of questions on this.

thin epoch
#

does anyone know the theoretical highest score achievable with logistic regression?

#

i'm following andrew ng's course and implemented my own model, and have gotten to 76.79% accuracy but hope to get to mid - 80s at least

#

but I'm not sure if this is possible with just logistic regression and some more feature engineering, or if I need to use an entirely different model

buoyant frost
#

is there a way to know that?

#

like it would be cool to know it for sure, the theoretical highest score achievable with a model in particular

#

including all the set of the possible hyperparameters and feature engineering you can do

#

short answer from what i've googled is that no

#

long answer is:

buoyant frost
low linden
# buoyant frost is there a way to know that?

No, there is no limit to the number of values even a single continuous hyperparameter can be, and the solution space is only approximately smooth. That coupled with the randomness injected in during many ML algorithms by design makes this impossible.

#

But, that is an interesting question.

#

I suppose you could limit a continuous hyperparameter to the 64-bit* float limit on most machines but that would still only be an approximation of reality.

thin epoch
#

oh guys i kinda meant practically instead of theoretically

#

has anyone managed to get good (mid 80s i mean) results with just logistic regression on this channel?

buoyant frost
#

i suggest reviewing public notebooks related to this dataset

#

maybe someone got an approximated score using logistic regression indeed

buoyant frost
thin epoch
#

i shall inspect this man / woman's code

#

i also have been watching some videos and I realized how deep feature engineering really goes

#

i should probably try doing more of that

buoyant frost
#

yeah feature engineering is in many cases the factor that makes your accuracy improve substantially
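As a rough illustration of what feature engineering can look like on this dataset, here is a minimal sketch that derives three commonly used columns from `Name`, `SibSp` and `Parch`. The column names `Title`, `FamilySize` and `IsAlone` are illustrative choices, not something prescribed by the competition, and the rows are toy data in the shape of train.csv.

```python
import pandas as pd

# Toy rows in the shape of the Titanic training data (values are illustrative).
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

# Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# FamilySize: siblings/spouses + parents/children + the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone: 1 when the passenger has no family aboard, else 0.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
```

Whether any of these actually helps a given model is something to check on a validation split rather than assume.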

thin epoch
#

quick question -- do you guys combine ur testing and training data when figuring out columns (for stuff like one hot encoding, etc..) the reason I ask is because for example sometimes the training data doesn't have certain values within a column that are present within the testing data, and this causes issues with one hot encoding

#

like is that considered a good practice

buoyant frost
# thin epoch quick question -- do you guys combine ur testing and training data when figuring...

Hello, this is Bing. I can help you with your question about handling categorical variables with different values in training and testing data. 😊

Generally, it is not a good practice to combine your testing and training data when figuring out columns, because this can lead to data leakage¹, which means that your model may learn information from the test data that it should not have access to. This can result in overfitting², which means that your model performs well on the test data, but poorly on new data.

One way to handle categorical variables with different values in training and testing data is to use label encoding³, which means that you assign a numerical value to each category, such as 0, 1, 2, etc. This way, you can avoid creating too many new features with one-hot encoding, and also handle the case where there are new categories in the test data that are not in the training data. However, label encoding may introduce some ordinality⁴, which means that the model may assume that there is some order or ranking among the categories, which may not be true.

Another way to handle categorical variables with different values in training and testing data is to use feature engineering, which means that you transform or create new features from the existing ones, based on some domain knowledge or analysis. For example, if your categorical variable is related to time, such as "era", you may be able to convert it to a numerical variable by using the year or the period as a proxy. This way, you can reduce the number of categories and also capture some meaningful information from the variable.

Have a nice day! 😊

Source: Conversation with Bing, 10/01/2024

#

tldr; no
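One way to avoid combining the two files while still coping with categories that only show up in the test data is to fit the encoder on the training data alone and tell it to ignore unknown values. A minimal sketch with scikit-learn's `OneHotEncoder`, using a made-up `Embarked` column where "Q" is missing from the training rows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Q" appears in test but never in train (illustrative values).
train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["S", "Q"]})

# Fit on the training data only, so nothing is learned from the test set.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Embarked"]])

# Unknown categories are encoded as an all-zero row instead of raising an error.
train_encoded = encoder.transform(train[["Embarked"]]).toarray()
test_encoded = encoder.transform(test[["Embarked"]]).toarray()
```

Both encoded arrays end up with the same two columns (for "C" and "S"), and the unseen "Q" row is all zeros.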

low linden
thin epoch
#

thanks

thin epoch
#

please let me know if I am mistaken

buoyant frost
#

you probably aren't

thin epoch
#

rip

#

85% with logistic regression

#

too good to be true

thin epoch
#

Three days of constant effort

#

aids feature encoding

#

and I have reached 0.79425

#

im a bit disappointed because I expected better results tbh for the amount of work I put in, but at least this is somewhat closer to my goal of 82% sad_panda

#

and now I increased the number of iterations which should make the model better, and the score went down to 75% 😂 😂 😂 😭 😭 😭

thin epoch
#

no not yet

drifting tapir
#

there are 12 columns in train.csv

#

so do I need to make an AI that takes 11 inputs and spit out 1 result?

drifting tapir
#
```
Feature names unseen at fit time:
- Age_0.17
- Age_0.33
- Age_11.5
- Age_18.5
- Age_22.5
- ...
Feature names seen at fit time, yet now missing:
- Age_0.42
- Age_0.67
- Age_11.0
- Age_20.5
- Age_23.5
- ...
```
#

does anyone know what's wrong with age in the column for test.csv data?

#

oh... is it because tree models are basically key-value path like structure

#

it needs to know exactly THAT 0.17 age in the previous training data

#

ok I get it now

#

ok so tree model is stupid

#

or not really meant for this type of prediction
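For what it's worth, column names like `Age_0.17` usually mean that `Age` itself was one-hot encoded (for example by dummy-encoding every column), so each distinct age became its own feature. That is an encoding issue rather than a limitation of tree models. A minimal sketch, assuming toy data, that keeps `Age` numeric, encodes only the categorical columns, and aligns the test columns to the training columns:

```python
import pandas as pd

# Toy frames in the shape of train.csv / test.csv (values are illustrative).
train = pd.DataFrame({
    "Age": [0.42, 22.0, 38.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})
test = pd.DataFrame({
    "Age": [0.17, 22.5],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

# Only the genuinely categorical columns are one-hot encoded; Age stays numeric.
categorical = ["Sex", "Embarked"]
X_train = pd.get_dummies(train, columns=categorical)
X_test = pd.get_dummies(test, columns=categorical)

# Align test to the training columns: extra columns are dropped, missing ones become 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

With `Age` left as a number, a tree can split on thresholds such as "Age <= 16" and handles ages it never saw during training.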

remote nimbus
#

Hey all

#

Finally got discord auth to work, excited to discuss titanic with people who work on it

#

I got .791 accuracy, not sure if that's good, but it took me a while!

storm saddle
#

In my local machine I got the accuracy of 92.3 using the xg boost model. But when I upload the CSV file it's saying I have 0.58 accuracy

Why please can anyone tell me what's the problem

remote nimbus
#

@storm saddle you're overfitting the training set

#

your code is mastering the training data at the expense of generalizing to the test data

#

also, with a score that low, keep in mind you're doing worse than just saying female = alive, male = dead, which gives 75% acc
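The gender rule mentioned here is easy to check by hand. A minimal sketch on a made-up sample of eight passengers (the rows are invented for illustration, not taken from train.csv):

```python
# (sex, survived) pairs; invented toy sample, not real Titanic rows.
passengers = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0), ("male", 0),
]

# Baseline rule: predict survival for women, death for men.
predictions = [1 if sex == "female" else 0 for sex, _ in passengers]

# Accuracy = share of passengers where the rule matches the true label.
correct = sum(pred == survived for pred, (_, survived) in zip(predictions, passengers))
baseline_accuracy = correct / len(passengers)
```

Running the same rule on the real training file gives a sanity-check number that any trained model should beat.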

storm saddle
storm saddle
remote nimbus
#

I could've gotten in farther but moved on

storm saddle
#

fine bruh got it, will take care from next time.

remote nimbus
#

ok

remote loom
#

i just started with this dataset, and im still a bit lost, can someone help me

remote nimbus
#

@remote loom What's up

remote loom
remote nimbus
#

yo

#

well the metric you'll want to train on is the one they're going to judge you on, which is accuracy

remote loom
#

what shud i do here?

remote nimbus
#

@remote loom you're confusing your test dataset with y

#

Think of it this way. You're given X_train, y_train (the Survived labels provided on training set). For test you're only given X_test, not the y_test labels

#

I assume you're trying to split your training set into training and validation? That would look more like:

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)

#

(Also I'd recommend specifying stratify=y_train in that, but we can get to that once you fix the basic misunderstanding)

#

Here's an example which should make it clearer:
import pandas as pd
from sklearn.model_selection import train_test_split
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

X_train = train_data[features]
y_train = train_data['Survived']

X_test = test_data[features]

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2, random_state=1)

iron nest
#

What is the baseline for titanic competition? What percentage of model prediction is considered good for this challenge?

iron nest
#

I started the titanic competition just around a week ago. Why it says I crossed the deadline? How many days are allowed for this competition submission?

remote nimbus
remote nimbus
hearty stream
#

Hey everyone, this would be my first kaggle competition. So before diving into it, I just wanted to confirm if I understood the problem statement correctly.
So the idea is to calculate the survival rate of the passengers. The features could be any, gender is one of them, which I guess is default.
Can we use any model to implement it? Like I went through some courses which showed Linear, decision tree, etc?
Am I missing anything?

runic finch
mellow roost
#

What is the purpose of the YOLO model under the "Models" section of the competition? https://www.kaggle.com/competitions/titanic/models
I know that we can use whatever model we would like, so just wondering (particularly because it seems that it is an image recognition model)

Also, what is the performance metric of this competition? Is it just submission based?

remote nimbus
#

not sure what the yolo model is

tacit flax
#

how do people get 100% percent accuracy ? is that rly possible ?

quartz scarab
# tacit flax how do people get 100% percent accuracy ? is that rly possible ?

It's trivial to just look up the real answers for who lived/died and submit them to get 100% without using machine learning. That's why this is just a tutorial competition and not a real one. You should just ignore people who have 100%, all results on titanic are wiped every few months anyway, it's just for learning.

mellow pulsar
#

Hey everyone! I've just started working on this competition properly and have a question.

Given that the dataset is already split into training and testing sets, does it still make sense to use cross validation?

In my head it feels like using cross validation on the training set won't be very beneficial because each iteration will have relatively small training sets

half ledge
#

Anyone want to collaborate? I am very new at this

earnest steppe
# mellow pulsar Hey everyone! I've just started working on this competition properly and have a ...

Yes, it still makes sense to use a validation split so that you know how your model performs on unseen data. In a real-life project, the test data is not touched until the very end, when it is used to evaluate the model's accuracy. On Kaggle, I think there is a limit on the number of submissions per 24 hours, so if you want to try more models than that limit you won't be able to evaluate them all by submitting. So it's advised to use cross validation.
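A minimal sketch of what that cross validation could look like with scikit-learn. The data here is synthetic (random features with a noisy label) purely so the snippet runs on its own; with the real competition you would pass your prepared training features and the `Survived` column instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared training data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold stratified CV: every row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

mean_accuracy = scores.mean()
spread = scores.std()
```

The mean gives an estimate of how the model might do on unseen data, and the spread across folds hints at how much a single train/validation split could mislead you on a dataset this small.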

plush musk
#

Got 72% accuracy using RandomForestClassifier, is this good?

earnest steppe
plush musk
#

I'll try my best

restive patrol
#

Hey guys
So I was trying something cool wanted to share with you all.
I used LLM (GPT4) to predict the outcome of each row, I fed a good set of facts about our data and I am getting an accuracy of 78%.

If someone wants to know more feel free to dm, might share soon on twitter too

#

"input": {
"PassengerId": 944,
"Pclass": 2,
"Name": "Hocking, Miss. Ellen \"Nellie\"",
"Sex": "female",
"Age": 20.0,
"SibSp": 2,
"Parch": 1,
"Ticket": "29105",
"Fare": 23.0,
"Cabin": null,
"Embarked": "S"
},

#

"output": {
"Analysis": "The passenger whose details we are analyzing was a young adult (20 years old) female belonging to the 2nd passenger class. She boarded the Titanic from Southampton, she was not traveling alone but had two siblings or spouse on board, and one parent or child. Her fare belongs to the 3rd quartile. Unfortunately, we don't have data on the cabin assigned to her.",
"conclusion": "Based on the provided data and the survival rates, we know that females (74.2%) had a much higher survival rate than males. In addition, while the overall survival rate in the 2nd class was lower than the 1st class, young adult female passengers in the 2nd class had a respectable survival rate of 43.28%. The survival rate from Southampton was 33.7%, and for those traveling with family, it's slightly better at 50.6%. Being in the 3rd quartile fare might have slightly improved her chances as its survival rate is 45.5% compared to lower fares. Lack of cabin data gives her a lower rate of survival at 30%. Given the survival rates, the chances of this passenger's survival seems quite positive.",
"Answer": "Survive",
"Facts Used": "Gender, age, Pclass, Embarked, SibSp, Fare, Cabin",
" Additional Facts that might have helped": "Deck level would have given us more insights into the survival rate, as would knowing more about the relative ages and classes of her siblings/spouse and parent/child on board."
}

#

Haven't yet fine tuned the model, that would help give parameters the weightage

finite ermine
#

Hey everyone, I'm a beginner in machine learning and I've been working on Titanic competition lately. I used logistic regression and my score is 0.59.

#

I believe there is an overfit, I tried regularization and selecting less features but it didn't work. What can I do to improve my model?

crisp laurel
woven crater
#

hi guys, i just want to check for this dataset is it possible to get score 1?

#

just by using ML

#

i know this question have been asked multiple times sorry for asking repeated question

quartz scarab
#

@woven crater Getting a perfect score using ML is unrealistic / basically impossible. If you think about the nature of the problem you can also intuitively understand this - survival on the titanic was not something you could perfectly predict from the information you are using to infer these judgements.

Scores of 1 on the leaderboard are simply people looking up the answers (since it's a real historical event). This is why the titanic is simply a tutorial problem.

woven crater
#

Thanks for answering my question!

unique hawk
#

Hello everyone, I am taking my first steps in data analysis with Python and I am not knowing how to solve this, I guess I am making a mistake in the location of the file, but the only thing I did was copy from the tutorial, sorry for the inconvenience and thank you very much to all.

vivid shuttle
#

I have created a notebook from the competition and executed the commands same as yours. It works perfectly fine for me

unique hawk
unique hawk
vivid shuttle
cyan delta
#

Hey, I am new to ML and learning ML by watching YT videos and MOOC. I am looking for a Mentor/Guide/Buddy with whom can share his/her experience with me and help me learn become a better ML practitioner.

brisk lintel
#

Hey guys, I wanted to ask what classifiers should I use to get higher accuracy? So far the highest accuracy I have got with an optimized Decision Tree is 0.7799. I have also used Logistic regression,KNN,SVM,Random forest but they got lesser accuracy.

sage grove
brisk lintel
#

How can we merge other datasets? Did you mean shuffling of datasets or just merging other public Titanic datasets? If so doesn't that violate the term of the competition?

#

I also heard there is a way to merge different ML models to achieve greater accuracy? Is this true?
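Combining models is usually called ensembling. One simple form in scikit-learn is a voting classifier, which averages the predicted probabilities of several models. A minimal sketch on synthetic data (the three base models and their settings are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for prepared training features and labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Soft voting averages each model's predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
ensemble_predictions = ensemble.predict(X)
```

An ensemble is not guaranteed to beat its best member, so it is worth comparing both with cross validation before submitting.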

sage grove
brisk lintel
#

I see

neat citrus
#

i scored 0.78229 in titanic predictions. how are people scoring 1.000???

tawdry raptor
#

may be its overfit result

daring plover
hushed steppe
#

Really enjoyed doing the competition tutorial.. are there little hints on techniques to try to boost my score? I don't want the answers, just a nudge in the right direction

#

Right now I am using Google Gemini to help me out

hushed steppe
wide yoke
#

Hey guys, I'm new here and managed to get 84% at best, is that a reasonable score & should I move to something else, or would you advise me to keep looking to improve?

wide yoke
#

actually, that 84% is on my validation set. When submitting, I'm at 75%, which is 2% lower than my previous attempt at 77%

I'm a bit confused: I've spent time trying to improve the validation set, making sure of no data leakage, and it ended up being worse at test set.

What's the recommended process to avoid that in the future?

wraith spruce
final cove
#

Can we download the datasets and run it on our own computers? I have a windows computer and linux computer.

#

My windows computer is a Dell PowerEdge R720 server running Windows Server 2022. 2 x 10 core processors threaded which makes it 40 threads. 3.5 TB HardDrive and 128 GB RAM.

final cove
#

When I ran pip install sklearn it says it is deprecated and to use scikit-learn.

#

Nevermind. I had to install other packages.

safe orbit
#

If anyone wants a vid to follow, I made this a few months ago: https://www.youtube.com/watch?v=6IGx7ZZdS74 Also have a p2 improving my model

Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!

In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.

As Kaggle states: "The competition is simple: use machine learning to create a mode...

acoustic oracle
#

I have a question - why doesn't a pytorch model work for this competition

#

I tried making a neural network and getting predictions out of it but I repetitively get 0 score

acoustic oracle
#

i do get an f1 score of 0.65

#

but my accuracy is still 0

#

i would greatly appreciate the help!

pseudo stratus
acoustic oracle
hazy owl
#

Ayo guys I had a question what is a good score for a beginner in this competition?

#

Nvm I saw the message above

pseudo stratus
acoustic oracle
pseudo stratus
acoustic oracle
#

I toned my model down that may be a reason why I got a lower score (maybe)

#

I'll try to get at least a 0.75 - refining

hazy owl
plush raven
#

I got 0.67 yesterday)
It was my first attempt and I know only about LinearRegression. Got float results and converted them to 1 if abs(x) >= 0.5 else 0
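For reference, turning float outputs into the 0/1 labels the competition expects can be sketched as below. The raw values are invented, and a plain 0.5 cutoff is used here as an assumption; a classifier such as logistic regression is the more usual tool for this, since a linear regression can output values below 0 or above 1.

```python
# Invented regression outputs standing in for model predictions.
raw_outputs = [0.12, 0.73, 0.5, -0.2, 1.1]

# Threshold at 0.5: values at or above the cutoff count as "survived".
predictions = [1 if value >= 0.5 else 0 for value in raw_outputs]
```

The result is a list of plain integers, which is also the type the submission file needs.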

acoustic oracle
green sparrow
#

What are the next steps to improving? I got a 0.77, and I basically just did everything that I learned and took notes on from the pandas and intro to machine learning kaggle courses, so I just did a random forest with optimized max_leaf, and filled NaN values with the mean, and that was it

silk oracle
#

Hello, I am new here, and trying with this "Titanic dataset". Upon examination, I've noticed a significant number of missing values in the 'Cabin' column. I believe that the cabin data could potentially offer valuable insights into survival probabilities. So, I am stuck in a dilemma regarding whether to discard this column or not. what are your thoughts on this matter?
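One middle ground between keeping `Cabin` as-is and dropping it is to keep only the coarse information it carries. A minimal sketch, assuming toy values: `HasCabin` and `Deck` are illustrative feature names, and "U" is an arbitrary placeholder for unknown decks.

```python
import pandas as pd

# Toy Cabin column with missing entries (illustrative values).
df = pd.DataFrame({"Cabin": [None, "C85", "E46", None]})

# HasCabin: 1 when a cabin was recorded at all, else 0.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# Deck: first letter of the cabin code, with "U" standing for unknown.
df["Deck"] = df["Cabin"].str[0].fillna("U")
```

Whether the missingness itself is informative is worth checking against `Survived` in the training data before deciding.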

hazy owl
hazy owl
#

Edit the new one I submitted is 0.78947

lost gazelle
#

Hey all, new to ML and this is my first ever contest. Just learned a bit of KNN and Naive bayes and I could achieve an accuracy of 0.665. Long way to go it seems!

#

What's the minimum accuracy for it to be considered "good" informally? 0.75?

acoustic oracle
plush raven
hazy owl
acoustic oracle
#

thanks

#

lemme guess scikit learn

hazy owl
signal dawn
#

Hi everyone , I used Random forest classifier too , and got a score of 0.97 , with 7 features and pd.get_dummies () for categorical features. How can I check my model is not overfitting?

silk oracle
silk oracle
# plush raven 0.67

LinearRegression.score(X_test, Y_test) is this the metric we have to measure??

green sparrow
# signal dawn Hi everyone , I used Random forest classifier too , and got a score of 0.97 , ...

Your 0.97 is the actual scoreboard score given by the competition right? If I understand overfitting correctly, it is not possible to overfit on test data when you don't have the true test results. Overfitting happens when your training test score is too high because it memorized the training data along with its noise etc. and generalizes it. But if you are running it on the test data, where you aren't even provided with the true answers, and they check it and give you a score of 0.97, then I feel like the 0.97 cannot be an indication of overfitting right?

I am new too, so take what I am saying with a grain of salt.

acoustic oracle
#

I got a 0.75 but I used a neural network

lost gazelle
lost gazelle
vast hingeBOT
lost gazelle
#

Damn I didn't know we could play games

acoustic oracle
lost gazelle
acoustic oracle
acoustic oracle
kind fox
#

hi guys, im doing a project for class on this dataset and need to understand how the data was collected and the original purpose of it. I couldn't find anything besides competition rules on kaggle. Does anyone have an idea?

lost gazelle
lost gazelle
lost gazelle
acoustic oracle
lost gazelle
spring coyote
#

hi everyone, i wanna do the titanic competition with a friend but we cant seem to find a way to team up

#

according to some forums there should be a 'team' tab next to the rules tab, but there isnt

#

and when i tried to share my kaggle notebook with him, i wasnt able to enter my phone number. clicking into the field did not allow me to enter digits

hoary pawn
#

Hey everyone, I'm currently stuck at 0.82 acc using Logistic Regression. I'm using 7 features, mapped the categorical ones and used the mean to fill the nans in the numerical columns. Should i try to improve the model even more or try another one? Any tips for improving this one?

vast hingeBOT
vast hingeBOT
vast hingeBOT
sharp island
#

Hello everyone! I'm new to Kaggle. I recently worked with a few other students on this competition. But I just noticed, this one doesn't have a "Teams" tab. How do we submit as a team for this competition?

sharp island
#

as i said, there's no team tab

velvet mist
#

I think you should contact support for this one

sharp island
#

huh. okay thanks ^_^

simple bone
#

it happens a lot with me with datasets

pliant flume
#

Hey, need help

#

I dont find the option to join anywhere

#

Neither I can see the teams option

feral ruin
sharp island
sinful plover
#

Hi! im new to kaggle. im looking for teammates for the titanic competition and also anyone with a good heart to help me with machine learning

vernal creek
#

hello, i am also starting with this titanic dataset

sinful plover
nimble badge
vernal creek
#

let's start

uncut stirrup
#

I'm starting today

nocturne vapor
storm trail
sweet shore
#

Yeah me too started just now

left palm
#

Hi, me too started with titanic dataset

sweet shore
#

im just confused to start

orchid anvil
#

Hello guys, I am brand new on Kaggle and just finished this challenge as my first challenge ever. I got a decent score of round about 83% accuracy. Now I am looking for improvements on my methods. Is there some common ground on how this problem should be approached? I am basically looking for a state of the art or best practice version where I can learn some new tricks.

mortal marten
#

Hi guys

#

I also did my first submission just now on titanic.

#

I scored a 0 though. Idk what I did wrong.

#

For some reason, the actual accuracy (not the train.csv one but the test.csv one) of my bernoulli NB model is 100%

#

Can anyone help me with this?

pale field
#

Did your answer CSV have the two columns?

#

Or that

#

Did you convert to ints? @mortal marten

mortal marten
mortal marten
pale field
#

The survived needs to be int

mortal marten
#

I see. I converted everything to floats

mortal marten
#

I actually used many types of models just to compare them. My xgboost random forest model also got only one prediction wrong.

sweet shore
#

Hello jinay Vora iam also worked on it 2 days before

mortal marten
sweet shore
#

I did it by decision tree classifier

mortal marten
# sweet shore I did it by decision tree classifier

Nice. I found my problem. As stated by @pale field I submitted 'Survived' as float datatype instead of int. I corrected it and I got 0.77 score. I still don't know why my model predicts all of them correct.
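For anyone hitting the same 0 score: a minimal sketch of building the submission with an integer `Survived` column. The passenger IDs and predictions are made-up values, and the CSV is rendered to a string here only so the snippet has no side effects; in practice you would pass a file name to `to_csv`.

```python
import pandas as pd

# Made-up IDs and float predictions, e.g. as returned by some models.
passenger_ids = [892, 893, 894]
float_predictions = [0.0, 1.0, 0.0]

# Cast to int so the file contains 0/1 rather than 0.0/1.0.
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": pd.Series(float_predictions).astype(int),
})

# index=False keeps the file to exactly the two expected columns.
csv_text = submission.to_csv(index=False)
```

The resulting file has a `PassengerId,Survived` header followed by one row per test passenger.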

#

I used 19 different models so that I can compare all of them with each other.

#

1 model got 100%. 1 model only predicted one wrong.

#

Not the training accuracies but the actual accuracies

#

Training accuracy was like 86

sweet shore
#

Did you use logistic regression

mortal marten
#

@sweet shore I used
Gaussian NB
Bernoulli NB
Complement NB
Multinomial NB
Decision Tree Classifier
Random forest classifier
Xgboost classifier
Xgboost random forest classifier
Adaboost classifier
Logistic regression
K nearest neighbour's
Bagging classifier
Hard voting classifier
Soft voting classifier
Stacking classifier
SVC classifier
And 3 more I don't remember.

coarse hill
#

Hi anyone looking to work on a dataset together.
I'm very new to kaggle and wanted to kick start my journey.

nocturne vapor
#

hey guys, I was working on this dataset and I have
used Random forest without tuning
features = all - (name, ticket, cabin)
but accuracy at submission was 77%
Any suggestion??

indigo radish
#

Try logistic regression!

#

It is a very standard approach towards this data, and maybe for this Random Forest algorithm, try dropping the name column as it adds no value to the data!

#

Try looking into the correlation of the different features amongst each other...... try plotting the data, do more of EDA and see what you can derive from the data...

nocturne vapor
#

Ok I will try

nocturne vapor
#

I saw correlation using heatmap
after that I chose 3 features for prediction ['Pclass', 'Sex','Fare']. accuracy didn't improve, when I submitted on kaggle it was 77%

#

I don't know how people are getting 100% accuracy

mortal marten
wraith flare
#

I got only 82% accuracy

wraith flare
mortal marten
#

And the algorithm was Bernoulli NB

mortal marten
#

I converted embarked of SQC to 1 2 3

wraith flare
mortal marten
#

Default settings lmao

wraith flare
mortal marten
#

Random state 42 and test size 0.3 or 0.2 idk

wraith flare
mortal marten
#

Nice

wraith flare
#

Maybe due to LabelEncoder the value SQC will mismatch with the value.let me check it out

mortal marten
#

Then dropped the original sqc column

nocturne vapor
mortal marten
wraith flare
wraith flare
#

When i predict the test.csv it accuracy go to 0.76 why??

mortal marten
#

I think my model has some problem I'm too beginner to understand

coarse brook
#

Okay, so I have my predictions
the bottom is the survival rate laid out in the train data, and the top is the prediction from the test data. Its a smaller size dataset, so would it be different, or?

plain belfry
#

Hi i'm new. Perhaps a silly question; i'm confused on how i can measure accuracy of my model if there is no y_test data
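The usual answer is to hold back part of the labelled training data and measure accuracy on that. A minimal sketch, using synthetic features and labels so it runs on its own (with the competition data, `X` and `y` would come from train.csv):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for prepared training features and labels (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hold out 20% of the labelled rows as a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X_train, y_train)

# Accuracy on rows the model never saw during fitting.
valid_accuracy = accuracy_score(y_valid, model.predict(X_valid))
```

The score for test.csv itself only becomes visible after submitting, since Kaggle keeps those labels private.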

half grail
desert epoch
young zenith
#

Hey guys , how long does it take for the competition submission score to occur?

restive granite
#

Glad to build the model today, it was a pure headache honestly but i did learn many things along the way

livid lantern
#

My model does not make predictions very accurately, is there a recommendation for an educational video (with an easy level of English) that will make predictions that are close to 100%?

icy wadi
rotund sail
versed bone
# wraith flare

use grid search on the regularization term u may get it up a bit
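A minimal sketch of that suggestion with scikit-learn's `GridSearchCV`, tuning the inverse regularization strength `C` of a logistic regression. The data is synthetic and the candidate values for `C` are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for prepared training features and labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Smaller C means stronger regularization; each value is scored with 5-fold CV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_c = search.best_params_["C"]
best_cv_accuracy = search.best_score_
```

After fitting, `search.best_estimator_` is already refit on all the data with the winning value and can be used for predictions.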

median shale
#

Hi, anyone know if there's another link for Alexis Cook's Titanic Tutorial? The one given in the platform is not working.

median shale
mortal marten
#

Idk or I'm shooting a shot in the dark

humble ferry
junior needle
#

Hey fellas ,
I got an accuracy of 0.78708 (V17). I have done feature engineering and data cleaning, trained different models and picked the best one according to me for now. How do I improve my rating? I am a beginner. On the CV set I am getting about 0.8715 but idk why the test score is so low? Any reasons?
Also when do you stop ,till you achieve 100% or 80%+?

neon raft
#

you don't have to achieve 100% that would be a case of overfitting

#

you can predefine a benchmark for yourself.

desert epoch
desert epoch
crystal reef
#

i get like 77% using random forests

gray charm
#

i'm completely new to kaggle, can anyone help me to start? How to start the challenge?

serene plover
wooden lintel
sweet steppe
#

hi there folks looking for some help. The titanic test data is missing values. what are some recommended ways of handling that? I am using decision tree model. Deletion of rows with missing data won't work since the output must be 418 rows.

red socket
# sweet steppe hi there folks looking for some help. The titanic test data is missing values. w...

Hey, I suggest you use either XGBoost or Gradient Boost, they will give you a higher accuracy than decision tree. Plus, you should fill in the missing values in the test data with either the mean or mode. And can anyone check out my code, so I can improve my accuracy? https://www.kaggle.com/code/vishalyginny/titanic
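A minimal sketch of filling missing values without dropping any test rows. It uses the median rather than the mean (a judgment call, since `Fare` is skewed) and computes it on the training data only, then applies the same values to both frames. The numbers are toy values.

```python
import pandas as pd

# Toy frames with gaps (illustrative values; None becomes NaN).
train = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0],
    "Fare": [7.0, 71.0, 8.0, 53.0],
})
test = pd.DataFrame({
    "Age": [None, 47.0],
    "Fare": [7.5, None],
})

# Statistics come from the training data only, then get reused for test.
medians = train[["Age", "Fare"]].median()

train_filled = train.fillna(medians)
test_filled = test.fillna(medians)
```

Every test row is kept, so a submission built from `test_filled` still has one prediction per passenger.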

red socket
manic dew
#

Hi guys I just submit the first prediction titanic with the tutorial, but I am wondering how it goes from now, I do not think it is over. Right?!? tks

stone basin
opal nebula
#

Hello All, I am trying to do my first kaggle project. I am having a basic issue: I am not able to see the "Join competition" button. Any suggestions?

opal nebula
#

Hello All, I followed through the cook book and used a random forest model and it generated a public score of .77751. Is there any suggestion on how I can make it better?

fickle geyser
gaunt cradle
#

me. Im beginner in this field.

stone knoll
#

here

stone knoll
livid lantern
#

can you tell me what accuracy I, as a beginner with two months of experience in machine learning (I can start solving it later), should strive for, I am already finishing a fairly large course on machine learning and am beginning to suspect that rough training in this problem is unnecessary, therefore I will have to take it up again in the near future

near bolt
#

Does anyone know what we are suppose to do in feature engineering?

stone knoll
stone knoll
mortal marten
stone knoll
mortal marten
#

Oh wait it's the sibling one

#

I plotted the sibling column with respect to the survived

supple galleon
#

Hey guys, I am starting in kaggle and Idk how to start in the titanic competition

mortal marten
#

Try and visualize the data as much as possible. Understand the data thoroughly.

daring plover
#

Can you tell me why when I train a model (XGBoostClassifier) with hyperparameter tuning it gives me an accuracy score lower than the one obtained without tuning?

small wolf
stone knoll
#

Is there anyone get a full prediction?

umbral garden
#

Hello everyone, I'm (new and) learning ML and this is my first time joining a Kaggle competition, I am stuck in an error for hours, and any help is welcome! Challenge: Titanic - Machine Learning from Disaster, needs help with feature engineering pipeline error. (I'm new to implementing a pipeline like this) What I'm trying to achieve: Age is a predictor variable, from 0.0 to 80.0, and also contains NaN. I want to bin this feature, first assigning NaN to number 999, then binning like: 0-1 is "Infants", "1-4" is "Toddlers", ..., "100-inf" is "Unknown", then, One-hot encoding the features. However, when I try to run a random forest model (image 4), I got the following error because my AgeBinningTransformer (image 1) is incorrect: `Cell In[53], line 24, in AgeBinningTransformer.transform(self, X)
22 X_copy = X.copy()
23 # Ensure 'Unknown' label is correctly handled
---> 24 X_copy['AgeGroup'] = pd.cut(X_copy['Age'], bins=self.bins, labels=self.labels, include_lowest=True)
25 return X_copy

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices` Any suggestion is welcome, thank you 🥲
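That particular IndexError is what NumPy raises when an array is indexed with a string, so one possible cause (a guess, since the screenshots are not available here) is that an earlier pipeline step handed the transformer a NumPy array instead of a DataFrame, which makes `X_copy['Age']` fail. A minimal sketch of a binning transformer that accepts either input; the class name `AgeBinner` and the bin edges and labels beyond "Infant", "Toddler" and "Unknown" are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AgeBinner(BaseEstimator, TransformerMixin):
    """Bin a single Age column into labelled groups (hypothetical sketch)."""

    # Hypothetical edges and labels; 999 (the NaN placeholder) lands in "Unknown".
    bins = [0, 1, 4, 12, 19, 59, 100, np.inf]
    labels = ["Infant", "Toddler", "Child", "Teen", "Adult", "Senior", "Unknown"]

    def fit(self, X, y=None):
        # Nothing is learned from the data; the bins are fixed.
        return self

    def transform(self, X):
        # Works for a one-column DataFrame or a NumPy array of shape (n, 1).
        ages = pd.Series(np.asarray(X, dtype=float).ravel()).fillna(999)
        groups = pd.cut(ages, bins=self.bins, labels=self.labels, include_lowest=True)
        return pd.DataFrame({"AgeGroup": groups.astype(str)})


# Usage on a NumPy array, as a pipeline step might pass it along.
binned = AgeBinner().fit_transform(np.array([[0.42], [3.0], [30.0], [np.nan]]))
```

The string column that comes out can then be passed to a one-hot encoder as the next step.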

fossil skiff
#

Hi All, new member from Sheffield UK.

My question re Titanic comp is - are people genuinely achieving 100% on Titanic dataset competition? Seems like a stretch to reach. Is it a loophole? What score should we be aiming for before moving on?

fossil skiff
fossil skiff
mental trout
umbral garden
umbral garden
umbral garden
mental trout
umbral garden
#

This is my first challenge as well, I think I can add you as a collaborator with your Kaggle username. rules wise, team submissions are allowed (per leaderboard).

mental trout
digital latch
#

Hello guys, is anyone doing the titanic challenge? I'm kind of new to ml coding, and after two days of coding I have successfully got an accuracy rate of 73% to 75% (using naive bayes) for the validation set that I broke off from the train set,
I have tried to work with the test set , though I have completed the data preprocessing but.. since the y_test (Survival feature) is missing I can't exactly use the model to test set..

small wolf
#

oh i guess you knew that actually lol,

#

you'll know your accuracy when you submit

digital latch
small wolf
#

once you submit your csv it will give you the accuracy. they don't give you the true y for the test set because then you could just submit that

#

check out this section on the competition page

digital latch
small wolf
fiery palm
#

Hello @everyone, just made my first submission woohoo. Had an accuracy of 76% using the Random Forest classifier 🌲. I replaced all the NaN values in the Age feature using 0. This is probably not a good idea 🥲. I selected the min_samples_split, max_depth and n_estimators by plotting a few graphs, but my feature engineering skill is not developed yet. Any advice on how to increase the accuracy? 🙂

umbral garden
simple meteor
#

Man, I'm already stuck at the first line of code. Can someone please guide me here? I'm not sure what I did wrong..

#

here is the first block of code that got cut off at the top:

rigid parcel
arctic pilot
#

I got 0.76555 accuracy in the titanic competition using XGBoost model, how can I improve the accuracy ?

strong sandal
#

hey guys, i have been suggested to use knn or mice to impute the missing value for age, is this optimal?
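
There is probably no single optimal choice here, but KNN imputation is easy to try. A minimal sketch with scikit-learn's `KNNImputer` on a toy frame (the rows are made up; the column names follow the Titanic data). Note that the distances are computed on unscaled features in this sketch, so Fare dominates; scaling first is usually worth considering.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows in the shape of the Titanic numeric columns (values are illustrative).
train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 2, 3],
    "SibSp": [0, 1, 0, 4, 0, 0],
    "Fare": [80.0, 75.0, 8.0, 30.0, 13.0, 7.5],
    "Age": [38.0, np.nan, 22.0, 4.0, np.nan, 25.0],
})

# Each missing Age is replaced by the mean Age of the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
print(imputed)
```

For a MICE-style approach, scikit-learn offers `IterativeImputer`, which still requires `from sklearn.experimental import enable_iterative_imputer` before importing it. Whichever you pick, fit the imputer on the training data and reuse it on the test data.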

mental trout
hexed birch
#

Hey guys, if anyone can help me out on how to improve I'd be forever grateful! So, this is the train and test data before one-hot encoding on embarked and social status features. Before creating X and y, I dropped Name, ID, and that was it! Also, using RandomForest for the clf!

#

My goal is to get at least 80%+ before using multiple classifiers, and often the outcome for this one is 77-79

glad sapphire
steep crag
#

Hey everyone! I am new here. I recently tried a decision tree and a random forest algo on the dataset and had a score of 0.75 and 0.77 respectively. Can someone please share how they got a score of 1 and what algorithm/ process they followed?

hexed birch
vocal island
#

hey guys, I'm new here. I want to know how we select a particular algorithm. Is it based on the dataset?

#

and what is exploratory data analysis? What is its use?

mystic egret
#

How do I update my regression parameter values as I add or delete data?

maiden cove
#

Anyone have the floor plans of the Titanic? I'm splitting up the Cabin column so I can make a floor column and a roomNumber column, but I can't find floor plans to see if the room number matters for nearness to stairs/exits.
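
A minimal sketch of that split with pandas, assuming the deck is the leading letter of the Cabin string and the room number is the first run of digits. Entries listing several cabins, or odd ones like "F G73", are ambiguous, so treat the parsing rule as an assumption.

```python
import pandas as pd

# Sample values in the format of the Cabin column.
cabins = pd.Series(["C85", "E46", None, "B57 B59 B63 B66", "F G73", "T"], name="Cabin")

# Deck: first letter. Room: first number found (NaN if there is none).
deck = cabins.str.extract(r"^([A-Z])", expand=False)
room = pd.to_numeric(cabins.str.extract(r"(\d+)", expand=False))

parsed = pd.DataFrame({
    "Cabin": cabins,
    "Deck": deck.fillna("Unknown"),  # most passengers have no cabin recorded
    "RoomNumber": room,
})
print(parsed)
```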

azure wharf
#

Regarding this problem, am I supposed to try whatever models I want and choose the one with the best result? Incidentally, for this problem I'll be choosing a decision tree, but for other problems should I try more than one model?

hexed birch
sharp roost
#

Hey, guys. Could you share an insight on how to select a proper model for the Titanic task and similar tasks?

digital latch
# sharp roost Hey, guys. Could you share an insight on how to select a proper model for the Ti...

I don't know how correct I am, but from what I understand, the choice of model depends on the task you want to do. In the case of the Titanic dataset, we are asked to predict whether a passenger survives (True or 1) or perishes (False or 0). Since there are only two possible outcomes, this is a binary classification problem.
If there were more than two outcomes, like 0, 1 or 2, it would be a multiclass classification problem.
Since this is classification, some standard algorithms like KNN, Naive Bayes, logistic regression and some others are commonly used. We select the model with the lowest cost function [think of this as error] and the highest accuracy score on held-out data, ideally one that also takes less time to train.
On the other hand, we could also classify using neural networks for more flexibility.
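
A minimal sketch of that comparison, using synthetic data from `make_classification` as a stand-in for the prepared Titanic features (an assumption made to keep the example self-contained). Each candidate is scored with 5-fold cross-validation and the best mean accuracy wins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```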

dusky furnace
#

Hey guys. How can I submit my Colab notebook on Kaggle?

fossil skiff
proven flower
#

Hi, I see the score is 1.0. Does that mean the accuracy is 100%? If so, could this indicate the model is overfitting?

tiny zenith
#

hello, I'm just getting started on my titanic submission and I can't figure out how to use the notebook. When I try to type 'import', the entire line disappears and so does the cursor. then I try to click on the code cell to edit the code again, and the entire line I click on disappears. It's pretty much impossible to write any code. Does anyone know what I'm doing wrong?

keen halo
hexed birch
#

Sorry about the delay, I'll take a good look at that dataset later in the day, but as far as memory goes, I did try many. I used this flowchart https://scikit-learn.org/stable/machine_learning_map.html, but at the end of the day, RF seems to perform considerably better than any other. As far as feature engineering goes, I kept it pretty simple. One thing that I remember making a very small difference, but some improvement nonetheless, is using a grid search to find the best hyperparameters for the estimator. So, take a look at that as well!
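
A minimal sketch of such a grid search with scikit-learn's `GridSearchCV`. The data is synthetic and the grid values are illustrative assumptions; the three hyperparameters are the ones mentioned earlier in the channel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training features and labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Illustrative grid; real ranges depend on the data.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all of the training data and can be used directly for predictions.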

pseudo stratus
keen halo
sly trellis
#

Hi guys, What do you consider a good score (satisfactory)?

tough linden
#

depends

#

95 is really hard to push past

#

80 isnt a bad number though

knotty knot
#

Hi folks, I'd like to understand if any of you have experienced very high accuracy on BOTH the train and cross-validation sets, and nonetheless around 10% less on the test submission on Kaggle. Since I see no particular issues in how I engineered the features of the test set to align them with the features the model was trained on (one-hot encoding, dropped features, etc.), the only remaining assumption is an overfitting model. But then why is that model performing well on the cross-validation set?
Just to give you an order of magnitude:

  • acc on train set: 86%
  • acc on cv set: 85 and counting %
  • acc on kaggle submission: 76 and counting % ... -.-
gentle marten
#

I was encountering a large difference between cv and leaderboard with 5 folds (around 140k entries in dataset), and then after doing some parameter optimization trials with 8 splits instead, my leaderboard score almost exactly matched my cv score

sly trellis
#

Can anyone suggest how I can fill null values in the Cabin column?
I see a pattern that cabins starting with C, D, E, F have a higher chance of survival compared to others, but I don't find any relation between fare, class and cabin.

Also, can anyone tell me if they are able to use Fare and Age? I don't see them being useful.

desert epoch
slender oyster
#

Hello, I'm a beginner, just started learning data analytics last month. I used an XGBoost model on my train dataset with some engineered features. I've had four submissions and the highest I got so far is only at 77.99%. I've tried different kinds of engineered features to improve the model but retained only those that seem to work. For a beginner, is it possible to push past 80% or that requires a bit more advanced knowledge?

hazy falcon
#

you definitely can push past 80%

ripe ridge
hazy falcon
#

everyone is now a farm merge valley player

sinful rain
#

Z

whole garden
#

What are the prerequisites for this exercise?

twin ember
#

Hello everyone, I am new to this competition and Kaggle, and I made an LGBM model that had 80% accuracy; however, my public score is 0.0000. Why is this happening?
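
One cause that people commonly report for a 0.0000 score, given here as something to rule out rather than a diagnosis: the submission file does not match the expected format, for example `Survived` written as floats or booleans instead of integer 0/1, or `PassengerId` values that do not match the test set. A minimal sketch of building the file; the three rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and model outputs; real ones come from test.csv and your model.
passenger_ids = np.array([892, 893, 894])
raw_predictions = np.array([0.0, 1.0, 0.0])  # e.g. floats out of a model

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": raw_predictions.round().astype(int),  # force integer 0/1
})

# With a file name, e.g. to_csv("submission.csv", index=False), this writes to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```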

solid girder
#

Hi! I'm a newbie with datascience. I'm looking to partner up with people to discuss and do datascience projects. I'm looking for people who are interested in understanding why something works the way it does, not just bumbling through to increase accuracy scores. I've finished the IBM Data Science course and now doing the Titanic project. Anyone interested to work with me?

solid girder
tacit skiff
#

Best tutorial for mastering this comp?

#

I know that 100% score is cheating

fresh spear
#

Hey I’m a software engineering student and I want to improve my score (0.77751). It’s for a class and I am only allowed to use logistic regression. Does anyone have suggestions for how to improve my feature engineering? Maybe share what you did in your code? Thanks

fresh spear
#

I would like to know how I can do hyperparameter tuning correctly.
And what should I use? Grid search / Bayesian / Hyperopt / Optuna?
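
For logistic regression on a dataset this small, a simple search over the regularization strength `C` is often enough to start with. A minimal sketch using scikit-learn's `RandomizedSearchCV`; the data is synthetic and the search range is an assumption. Keeping the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into preprocessing.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Sample C on a log scale between 0.001 and 100 (illustrative range).
search = RandomizedSearchCV(
    pipe,
    {"clf__C": loguniform(1e-3, 1e2)},
    n_iter=15,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```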

safe orbit
#

Welcome to my data science journey through the Kaggle Titanic - Machine Learning from Disaster Project!

In this video, we'll dive deep into the world of data analysis, feature engineering, and machine learning to predict passenger survival rates on the Titanic.

As Kaggle states: "The competition is simple: use machine learning to create a mode...

#

In this comprehensive tutorial, we delve deep into the world of hyperparameter tuning using Optuna, a powerful Python library for optimizing machine learning models. Whether you're a data scientist, machine learning enthusiast, or just looking to improve your model's performance, this video is packed with valuable insights and practical tips to ...


Welcome to our comprehensive guide on hyperparameter tuning with Scikit-Learn! 🚀

In this tutorial, we'll dive deep into the world of machine learning model optimization. If you're looking to take your data science skills to the next level and boost your model's performance, you're in the right place.

Interested in discussing a Data or AI proje...

tacit skiff
#

Lucky boy

#

I will watch them all

safe orbit
tacit skiff
#

Without ad blocker

safe orbit
#

thanks man

#

I'll have spaceship vid out in feb i think. Working on some new ml vids this month though

fresh spear
#

is there a way to pass 0.8 with log reg?

#

0.787 is the best I achieved

dark stone
#

I made my first submission following the guide and I got a 0.7751. Is anyone else a beginner in the process of raising their score? I'm looking to collab :)

thorny flare
remote quail
#

Hello, I am trying to save and it keeps saying failed. When I do a quick save it works, but I cannot submit to the competition. Please help.

fresh spear
#

if you don't write into the csv you can't submit it

remote quail
#

Hello Amit, I have managed to get it to submit, thanks for the response

fresh spear
#

no problem

dark stone
eager cedar
#

Hello, Do I just need to press the submit button or do I need to ask for some kind of permission to join this titanic thing?

neat bough
#

Hey, I'm a beginner in this Titanic project. I followed Alexis Cook's Titanic Tutorial and submitted, but I am unsure how to improve/progress forward. Anyone have any advice on where to look next?

eager cedar
#

try paid competitions

tribal pivot
tribal pivot
#

also looking at the submissions or forums on kaggle could help you out with different and unique ideas

tribal pivot
green harbor
#

Hi, I am new to machine learning. Can anyone explain why we need to normalize and standardize the data?

copper knot
# green harbor Hi, I am new to Machine Learning anyone can explain why we need to Normalize and...

source: https://www.geeksforgeeks.org/what-is-data-normalization/
Why do we need Data Normalization in Machine Learning?
There are several reasons for the need for data normalization as follows:

Normalisation is essential to machine learning for a number of reasons. Throughout the learning process, it guarantees that every feature contributes equally, preventing larger-magnitude features from overshadowing others.
It enables faster convergence of algorithms for optimisation, especially those that depend on gradient descent. Normalisation improves the performance of distance-based algorithms like k-Nearest Neighbours.
Normalisation improves overall performance by addressing model sensitivity problems in algorithms such as Support Vector Machines and Neural Networks.
Because it assumes uniform feature scales, it also supports the use of regularisation techniques like L1 and L2 regularisation.
In general, normalisation is necessary when working with attributes that have different scales; otherwise, the effectiveness of a significant attribute that is equally important (on a lower scale) could be diluted due to other attributes having values on a larger scale.
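
A small sketch of the two operations with scikit-learn, using a made-up Age/Fare array. `StandardScaler` rescales each column to zero mean and unit variance, while `MinMaxScaler` maps each column to the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: Age, Fare (illustrative values on very different scales).
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.92],
    [35.0, 53.10],
])

standardized = StandardScaler().fit_transform(X)  # (x - mean) / std per column
normalized = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min) per column

print(standardized.round(2))
print(normalized.round(2))
```

In practice the scaler is fit on the training data only and then applied to the test data with `transform`.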

#

Starting with the Titanic survival prediction after signing up with Kaggle about 2 years ago, better late than never.

full jewel
#

Hi, just a quick question. I submitted some data sets, just to experiment. After looking at the leaderboard, I noticed that a lot of people have 100% accuracy. Is this even possible, or did they just use historical data to give the correct answer for each passenger?

copper knot
copper knot
full jewel
#

OK, thank you very much, I was really confused about that.

hollow wren
#

Titanic’s fate highlighted the flaw in this plan. 🚢

People say that the Titanic wasn't equipped with enough lifeboats to accommodate everyone in case it sank.

But the plan was, if we ran into trouble, other ships in the area would come to our aid, and we needed enough lifeboats to ferry people in shifts from our ship to the others.

This data not only reflects the social dynamics of the early 20th century but also serves as a reminder of the ongoing need to address inequalities in crisis situations. It's crucial for modern safety protocols to ensure fairness and prioritize human life regardless of socioeconomic status, gender or employment role.

What lessons do you think we can draw from the Titanic tragedy that are applicable to today's society? 👇

Original post: https://www.linkedin.com/posts/arshmankhalid_titanic-kaggle-survival-activity-7284676098452254721-GBWG?utm_source=share&utm_medium=member_desktop

regal lily
fresh spear
wicked abyss
#

After about 25 submissions, I’ve finally landed on a notebook I’m happy with. It scored pretty well, and feels like a good stopping point as I move on to the next project. I’m still getting started with Kaggle competitions, so I’d love any feedback you might have—or if you find my approach useful or worthy of praise, an upvote would mean a lot!

Here’s the link to my notebook: https://www.kaggle.com/code/josephnehrenz/classification-titanic-random-forest-model-in-r

Thanks in advance, and good luck to everyone still working on the challenge!

pine sonnet
wicked abyss
# pine sonnet As a complete n00b, I'm curious how your previous submissions scored, either in ...

Hey @pine sonnet,

I totally understand your experience — I had something similar happen. I initially scored 79% with a fairly basic model, and when I started adding more features, my results actually got worse. At first, I thought I was on the right track, but it turned out I was running into issues with overfitting.

As I added more complex features and ramped up cross-validation and parameter tuning, my results really tanked. What I learned is that for this competition, the sweet spot seems to be finding a balance:

  • Adding some meaningful features to improve the model, but not going overboard.
  • Avoiding over-tuning to the training data so the model still generalizes well to unseen data.

It’s tempting to throw in every feature you can think of, but for this challenge, simplicity with a little refinement seems to work better than full complexity. An overtuned/overdone model can easily translate to 5%-10% prediction accuracy loss for this data.

distant linden
# wicked abyss After about 25 submissions, I’ve finally landed on a notebook I’m happy with. It...

Hello, I'm a noob in machine learning and I'm trying hard to understand all the mechanics. Your notebook is very helpful. There's just one thing I don't understand: you explain a lot of statistics in order to show the correlation between columns and to underline the link between the survival rate and gender, title, etc. However, I don't understand where you use your graphs and calculations in your model to predict the test set.

simple egret
#

Hello, I am newbie to kaggle. what are the next steps after titanic tutorial?

wicked abyss
# distant linden Hello, I'm noob in machine learning and I try hard to understand all the mecanic...

Hi @distant linden ,

Thank you so much for taking the time to check out my notebook! 😊 I’m really glad to hear that you found it helpful—your feedback means a lot. Regarding your question, you’re absolutely right that the graphs and calculations I included are primarily used to explore the relationships between features and the target variable, as well as to confirm the value of those features for prediction. For example, when we see a strong relationship between Sex and survival rate in the graphs, this insight suggests that Sex is an important feature. I take these insights and incorporate them by creating or refining features for the model. For instance:

Transforming features:
If we notice patterns in the Age distribution, we might group it into bins or take a log transformation.

Creating interaction terms:
If two variables (like Sex and Pclass) show combined effects, I might include an interaction term like Sex * Pclass.

In essence, the visualizations aren’t directly used in the model but guide which features I create or prioritize. They act as a bridge between data understanding and model performance. As you mentioned you're new to machine learning (welcome! 🎉), I hope this helps clarify things. If there’s a specific part you’d like me to expand on, let me know, and I’d be happy to help further.
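
A minimal Python sketch of the feature steps described above (the notebook itself is in R, so this is a translation of the idea, not its code). The bin edges, labels and the new column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows with Titanic-style columns (values are illustrative).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 4.0, 54.0],
    "Fare": [7.25, 71.28, 16.70, 51.86],
})

# Transforming features: bin Age and log-transform Fare.
df["AgeBand"] = pd.cut(
    df["Age"], bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)
df["LogFare"] = np.log1p(df["Fare"])

# Interaction term: one category per Sex/Pclass combination.
df["Sex_Pclass"] = df["Sex"] + "_" + df["Pclass"].astype(str)

# One-hot encode the categorical columns for the model.
features = pd.get_dummies(df[["AgeBand", "LogFare", "Sex_Pclass"]])
print(features.columns.tolist())
```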

P.S. If you found the notebook helpful, I’d really appreciate an upvote—working toward my first bronze medal has been quite the journey! 😊

simple egret
#

Asking for feedback on my kernel
Hello, I have recently started on titanic challenge and I finished tutorial. Then, I tried doing some preprocessing to clean up and improve the accuracy. However, accuracy decreased. What is the problem with my code? I attached the link to my notebook.
https://www.kaggle.com/code/eidenspark/notebook6b8d8cd056
Thanks!

wicked abyss
#

Hi @simple egret, nice work and welcome to kaggle! I'm fairly new myself but think I can help you out quite a bit. There's a lot I can share, it may be better for you to check out my notebook linked about 5 comments up:

https://www.kaggle.com/code/josephnehrenz/classification-titanic-random-forest-model-in-r

It's written in R but super easy to follow, with plenty of documentation explaining the process, I think if you spend a few minutes reviewing it, you'll be able to make a number of connections and know what to do going forward.

High level, you can definitely do more than cut missing values and run the model! I'd suggest starting by imputing the missing values—there are plenty of methods for that, so it's worth exploring. After that, dive into the data and try creating some interesting features that might boost the model's predictive power.

Hope this helps get you started! Feel free to ask if you have any questions, and if you find it helpful, I'd really appreciate an upvote as I'm questing for my first bronze medal! Good luck! 🚀

pine sonnet
# wicked abyss Hey @pine sonnet, I totally understand your experience — I had someth...

I think part of what I'm wondering is whether seeing no improvement on an individual feature is a sign that that feature shouldn't be included, or whether it's the kind of thing where you only see the benefit when you've added everything in together. Looking at what I was doing, it looks like I had added a feature equivalent to your FarePP, but I didn't see any benefit to adding it, I got exactly the same accuracy on a validation set with vs. without. I'm not sure whether that means I should leave it out, or whether it might matter in interaction with something else. If you drop FarePP from your model, does it get worse?

wicked abyss
#

Hey @pine sonnet, you've definitely hit on the heart of the modeling process here! Deciding when to add, drop, or create interactions between features is often the trickiest part. There aren't clear-cut rules, it's more about experimenting, asking the right questions (which you are doing right now), and iterating based on what the model is telling you. You're right, sometimes a feature doesn't show an immediate benefit on its own, but it could have an impact in combination with other features.

As for FarePP or any feature, I wouldn't drop it just yet based on a single validation result. A lot of times, the true value of a feature emerges only after a few changes have been made elsewhere in the model. This is why you see such a variety of approaches across the 15k+ entrants. It truly is all about finding your personal model "touch." Sometimes it goes great and sometimes not so much.

Personally, I try to limit my changes to one or two things at a time, so I can trace back any performance dips and better understand what went wrong. I also like to document those updates in the comment section when submitting results, so if a change hurts the score, I can easily roll it back and try something else. It's normal to submit many, many files for any given competition. For example, I'm already into the teens or 20s for the new Playground competition for February that opened a few days ago.

So keep experimenting and trust the process! There’s no one-size-fits-all solution here, and it’s all about finding that balance.

potent smelt
#

Hey all!
While going through the Titanic train data I noticed that there are a couple of people whose recorded age is far too high to be true. Mr. Patrick Connors, for example, was supposedly 705 years old. How do I deal with such obviously false data? Do I exclude these extreme ages? In this case I can also change it by hand because you can find the true age online. But should I expect that the data is false in more cases which I haven't found yet? Finding wrong information in the more "normal" looking entries seems to be difficult.
thanks for any answer.
cheers Laurin
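
A minimal sketch of one way to handle this: flag ages outside a plausible range and set them to NaN so they go through the same imputation as the other missing ages. The 0 to 100 range and the toy rows are assumptions. It may also be worth checking how the file was imported, since a value like 70.5 read with the decimal point lost would show up as 705.

```python
import numpy as np
import pandas as pd

# Toy rows; names are placeholders, ages are illustrative.
df = pd.DataFrame({
    "Name": ["passenger A", "passenger B", "passenger C", "passenger D"],
    "Age": [22.0, 705.0, 70.5, np.nan],
})

# Flag recorded ages outside an assumed plausible range of 0 to 100 years.
implausible = df["Age"].notna() & ~df["Age"].between(0, 100)
print(df[implausible])

# Treat flagged ages as missing so they are imputed like any other gap.
df.loc[implausible, "Age"] = np.nan
```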

mortal abyss
#

Hi all, I am starting my Kaggle journey with this Getting Started Competition.

Tonight will be all about setting up my dev environment and creating a first benchmark submission without any data preprocessing/cleaning to start from.

What are some of your methods of operation when starting a competition? Do you submit a first benchmark or do most of you try to squeeze out as much accuracy as possible from the first try?

Have a good evening everyone!

dim ledge
#

Because I am coming back to ML from a hiatus, I used my first attempt to try my best from memory alone as a recall exercise. Then I started to take ideas from forums like this and make improvements from there.

viral oyster
#

Ready to open my first notebook

steady dawn
#

Hi all, I was wondering if someone would be open to helping me with the titanic tutorial? I entered the two lines of code into the second code cell (copy+paste). The data is coming out but not in a table format, is that ok? I also don't see a third code cell.

hybrid eagle
cold ravine
#

i've submitted my first notebook as well

median venture
#

On the tutorial, my random forest model code is continuously running

wary willow
#

I am starting this project and almost through the tutorial, does anyone want to partner up with me on this? Looking to collaborate

visual plank
#

Hello everybody! I am a complete newbie to Kaggle and starting to get my hands on the Titanic competition. Before starting, I realized there are several modeling approaches and models to use before submitting. I would like to use the best models to get at least a 0.8 score. Are there any notebooks or links I can look at to learn?

wary willow
scenic kayak
uncut fox
#

Hello, I'm a newbie. Anyone want to learn together?

eager cedar
#

Yo what's the highest possible ceiling for titanic dataset without cheating?

digital nacelle
#

Hello, I'm a beginner, and when I tried to solve this problem, I faced some issues with my evaluation process.

I used XGBoost Classifier to predict, and applied 5-fold cross-validation to evaluate my results. In my cross-validation, 8 out of 10 folds achieved an accuracy higher than 80%, while the remaining folds had around 75%. However, when I submitted my predictions, my score dropped significantly to around 70%, which was much worse than expected.

Could anyone give me some advice on how to improve my test set accuracy? Thank you in advance!

eager cedar
#

send something like this

#

it's hard to give advice based on your words alone since we haven't seen any kind of data

digital nacelle
# eager cedar

I've fixed it. I think my mistake was using cut and qcut separately for the train and test sets.
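
A minimal sketch of the fix described here: learn the bin edges once on the training data and reuse them for the test data, instead of calling `qcut` separately on each set. The Fare values are made up, and widening the outer edges to infinity is an added assumption so that test values outside the training range still get a bin.

```python
import numpy as np
import pandas as pd

# Illustrative Fare values for a train and a test split.
train_fare = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])
test_fare = pd.Series([7.83, 7.0, 9.69, 262.38])

# Learn quartile edges on the training data only.
_, edges = pd.qcut(train_fare, q=4, retbins=True)

# Widen the outer edges so unseen extremes still fall into a bin.
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to both sets.
train_bins = pd.cut(train_fare, bins=edges, labels=False)
test_bins = pd.cut(test_fare, bins=edges, labels=False)
print(train_bins.tolist(), test_bins.tolist())
```

Running `qcut` on the test set alone would compute different quartiles, so the same bin label would mean different fare ranges in train and test.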

eager cedar
#

great!

blazing yew
#

Hi to everyone! I have seen many people with a score of 1.00. Is it really possible to predict 100% of passengers?

eager cedar
#

Nope, they probably downloaded the ground truth. There are guides on how to do that on Kaggle. Either way, Titanic is not about getting 1s or a high score, it's just something you do to warm up

signal geyser
#

Hello everyone,

I attempted the titanic survival challenge in kaggle. I was hoping to get some feedback regarding my approach. I'll summarize my workflow:

  • Performed exploratory data analysis, heatmaps, analyzed the distribution of numeric features (addressed skewed data using log transform and handled multimodal distributions using combined rbf_kernels)

  • Created pipelines for data preprocessing like imputing, scaling for both categorical and numerical features.

  • Creating svm classifier and random forest classifier pipelines

  • Test metrics used were accuracy, precision, recall, ROC AUC score

  • Performed random search hyperparameter tuning

  • Cross validation score of svm was slightly higher than random forest

  • Testing score of random forest was 0.78229

  • Testing score of svm was 0.53588

I think some flaws in my notebook are not performing feature extraction, feature selection and missing outlier analysis. I would appreciate any feedback provided. I really want to improve and perform better in the coming competitions.

link to my kaggle notebook: https://www.kaggle.com/code/jayasuryanmutyala/titanic-survival/notebook

Thanks in advance!

strong hornet
#

Hi there, is that possible to make team in titanic competition?

wary heron
#

Impossible

surreal gyro
#

If I do well in this competition, will I get a Nobel Prize?

wary heron
surreal gyro
eager cedar
# signal geyser Hello everyone, I attempted the titanic survival challenge in kaggle. I was hop...

You imputed using the median in the numerical pipeline. This applies to Age, since it's the only numerical feature with missing values in the training set. Have you checked the graph? See what happens to the distribution if you dump all 177 missing values on the median.

You applied a log transform. Did you check whether it actually fixed the skewness?

You applied StandardScaler, which uses z-scores and works best when a feature is roughly normally distributed. Did you check whether your distributions are normal or even close to normal?

Pipelines, from my understanding, are made for automating data preprocessing when you expect new data to arrive regularly with the same features, and the transformations are decided after carefully analyzing and considering how to handle each feature properly. What I'm trying to say is that for projects like these, pipelines aren't necessary; they are used in industry to automate the cleaning process so that people won't have to redo it every time a new dataset is produced. I always see people use them the wrong way and just plug in a simple imputer and so on to make the cleaning process instant, but I think this is very wrong, as every feature should be handled differently based on the data analysis. But I may be wrong in my understanding 🙂

My feedback is to use formulas and techniques with intent, spend more time analyzing your data, and know when and when not to use statistical treatments.
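
A small sketch of the kind of check suggested above: measure skewness before and after the log transform instead of assuming it helped. The Fare values are made up, with one extreme value to mimic the long right tail.

```python
import numpy as np
import pandas as pd

# Illustrative fares with one extreme value.
fare = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 21.08, 26.55, 51.86, 71.28, 512.33])

skew_before = fare.skew()
skew_after = np.log1p(fare).skew()  # log1p also handles a fare of 0

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The same before/after comparison works for imputation: plotting or describing Age before and after filling shows whether the fill created an artificial spike.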

#

I upvote tho for support 🙂

signal geyser
# eager cedar You imputed using median on numerical pipeline, this applies to age since it's t...

Yes, thank you. I understood your points. Also, when I applied the log transform it did not make much difference for some features, if I remember correctly, because out of all the numerical features only one looked skewed. It fixed that feature, but the others looked very much multimodal. I still don't really have a solid idea of how to properly address multimodal distributions. I'll read some articles online and try it again. Also, I don't know how to upvote. Thanks again for the feedback.

eager cedar
#

you are using a tree-based model and an SVM with an RBF kernel, so multimodality won't matter, but skewness does a bit. it's not a problem for trees, but when you have something extreme like the values in Fare, where most of your values are below 100 and then you've got like 2 data points with fares of 512, it MIGHT become a problem, so you might wanna check whether those extreme values affect the model you are using or not

#

so one solid piece of advice that I apply to myself too is to always ask myself 'why', 'why do I need this', 'why do I do that', 'why is this necessary', 'why do I choose this instead of that model' etc

signal geyser
eager cedar
#

NO, not "next time" bro, always haha

signal geyser
#

I build a very simple solution based on minimal knowledge about the models when I worked on this

signal geyser
eager cedar
#

it's a nice notebook, definitely has stuff to improve upon but it has almost everything from start to finish

signal geyser
#

Also do you have any advice for choosing a deep learning framework ? @eager cedar

#

I have some experience with pytorch before working on some simple projects

eager cedar
#

Nah, treat all of these frameworks as tools. everything is just applied mathematics, and some frameworks might have stuff you need that others don't have. you can switch back and forth between these frameworks

eager cedar
#

that is one important thing too, basically AI is just applied math. most of the coding has dedicated libraries and frameworks that solve it for you, but you still need to grasp the math behind them

signal geyser
#

I always had it confusing which one to just follow since I see pytorch has been gaining a lot of popularity over the last few years but the book implements everything in tensorflow. I'm thinking of just following the book and picking up pytorch again

signal geyser
eager cedar
#

well tensorflow might be a bit complex to look at while pytorch is easier on the eyes. either way both have uses

signal geyser
#

Do you recommend any books or courses I can follow for pytorch ?

eager cedar
#

what can I say, choose what works best for you

#

hmm I dont know about books but I know some videos in youtube

signal geyser
signal geyser
eager cedar
#

I grinded this 25 hrs video like a year ago
https://www.youtube.com/watch?v=V_xro1bcAuA

but most of my skills came from actually trying to use it

Learn PyTorch for deep learning in this comprehensive course for beginners. PyTorch is a machine learning framework written in Python.

✍️ Daniel Bourke developed this course. Check out his channel: https://www.youtube.com/channel/UCr8O8l5cCX85Oem1d18EezQ

🔗 Code: https://github.com/mrdbourke/pytorch-deep-learning
🔗 Ask a question: htt...

signal geyser
#

I don't think he covered nlp though

eager cedar
#

I have 0 patience so I just take up what I know and build something and just comeback to some videos when Im totally lost

#

one of the things I wish I did back when I was still learning is that I should have specialized in 1 thing and focused on it instead of trying everything haha

I did Computer Vision, NLPs among other bunch of stuffs

eager cedar
#

but everything worked out in the end, I just had to deepen each of these specific areas

signal geyser
#

I'll try revisiting my basics in pytorch and work on some simple projects to deepen my understanding as well

tiny jungleBOT
#
rapid.roll.off has been warned

Reason: Bad word usage

eager cedar
#

whaaat

signal geyser
spice ledge
limpid rune
#

👍

sage bridge
#

wassup everyone

hollow widget
#

I don't know why, but I can't seem to boost my score. Any tips from you all?

empty patio
#

Hello everybody, this is my first time on Kaggle. What skills do I need to know before tackling Titanic?

eager cedar
hollow widget
eager cedar
#

What is your usual score? Can you tell me the steps you took before 'tweaking' your model?

hollow widget
#

My common score is 0.69, and I followed the common steps before modifying my model, such as splitting the data, handling NaNs, and some other tasks. I obtained the same score after using XGBClassifier instead of RandomForestClassifier, although I didn't use early_stopping_rounds.

eager cedar
#

You see, my friend, 90% of your score will come from data analysis, and the remaining 10% is handled by hyperparameter tuning. While I can't say for sure how effectively you handled the preliminary steps, here is a checklist you can use to check yourself:

Data Cleaning ->
Did you properly handle missing data and/or duplicates? By "properly handle" I mean you didn't just fill the missing values blindly; the strategy should be guided by analysis and statistics. For example:

Filling missing values in Age ->
Did you just fill it with the median/mode? Did you check its distribution before and after filling?

Did you make a smart imputation by analyzing the data, e.g. extracting titles from names and seeing how these titles correspond to specific age brackets? Or did you try to impute the missing values with machine learning instead? Did the distribution make sense after imputation?
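
A minimal sketch of the title-based imputation mentioned above, assuming a frame with the `Name` and `Age` columns of train.csv. The rows and the `AgeFilled` column name are illustrative, not taken from anyone's notebook:

```python
import pandas as pd

# A few rows in the train.csv layout; the ages are illustrative.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Palsson, Master. Gosta Leonard",
        "Rice, Master. Eugene",
        "Sandstrom, Miss. Marguerite Rut",
        "Moran, Mr. James",
    ],
    "Age": [22.0, 26.0, 35.0, 2.0, None, 4.0, None],
})

# The title sits between the comma and the first period: "Braund, Mr. Owen" -> "Mr".
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill each missing Age with the median Age of passengers sharing that title.
title_median = df.groupby("Title")["Age"].transform("median")
df["AgeFilled"] = df["Age"].fillna(title_median)

# Fall back to the overall median for titles with no observed ages at all.
df["AgeFilled"] = df["AgeFilled"].fillna(df["Age"].median())

print(df[["Title", "Age", "AgeFilled"]])
```

Comparing `df["Age"].describe()` with `df["AgeFilled"].describe()` afterwards is one quick way to check whether the distribution still makes sense.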

#

EDA ->
Did you analyze which factors contribute to your target? Or did you just fill in the blanks and feed the data to the model, hoping it would magically output a high score?

What are the major driving features for survival? What contributes to it? Which features do not contribute to it? Did you test your hypotheses using statistics? What were the results?

Feature Engineering and Selection ->
How do the features correlate with the target and with each other? Can we make a new feature to better capture a pattern? Did we select the best set of features for our goal? Are the features formatted and processed in a way that satisfies the assumptions of the models we are going to use?

Model Training ->
Run a baseline, check feature importance, and see which features contribute less, more, or not at all. Do the metrics reflect our expected outcome? What can we improve? Does our model overfit or underfit, etc.?

Hyperparameter tuning ->
Based on the analysis of the baseline model, do we need regularization? Maybe we need more trees, or a limit on max depth... Let's use Bayesian optimization and see which ranges of parameters work best, etc.

There's so much going on beyond filling in the blanks and running the data through a machine learning model. This is just a summary of the things you can do. Believe me, there are tons of things that can guide you in building the proper model and the data it needs.
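
As a rough illustration of the "Model Training" step above, here is a baseline sketch with scikit-learn. The data is synthetic (the target depends on `Sex` and `Pclass`, and `Noise` is deliberately unrelated), so the numbers say nothing about the real competition:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for a cleaned Titanic frame.
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),      # 1 = female
    "Pclass": rng.integers(1, 4, n),   # 1, 2 or 3
    "Noise": rng.normal(size=n),       # unrelated to the target by construction
})
p_survive = 0.15 + 0.55 * X["Sex"] + 0.10 * (3 - X["Pclass"])
y = (rng.random(n) < p_survive).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
valid_acc = model.score(X_va, y_va)
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# A large train/validation gap hints at overfitting.
print(f"train={train_acc:.3f} valid={valid_acc:.3f}")
print(importances)
```

Features near the bottom of `importances` are candidates for removal, and the train/validation gap gives a first hint about whether regularization is worth tuning.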

hollow widget
#

I sincerely appreciate your help. I attempted some modifications and obtained an accuracy of 79% in the spaceship-titanic competition.

green shale
#

My accuracy came out at roughly 72%, but why?

vestal island
#

All I did was copy/paste what was instructed but got this. Any ideas on why?

deft mural
ocean pivot
#

Does anyone know what the gender submission dataset is?

charred tinsel
waxen sage
#

Anybody else tuning a RandomForestClassifier atm? Just looking for someone to DM / bounce ideas/questions off of as I try to increase my score. Currently at 83 using StratifiedKFold / cross_val_score but I've been observing my submission test scores being consistently 3% less -- Is this normal or a sign I'm overfitting?
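
One way to put a CV-versus-submission gap like this in context is to compare it with the spread across folds. The sketch below assumes synthetic data from `make_classification` and a local holdout that plays the role of the hidden test labels; it is not the setup used in the message above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data with as many rows as the Titanic training set (891).
X, y = make_classification(
    n_samples=891, n_features=8, n_informative=4, random_state=0
)

# Hold out a slice that stands in for the unseen leaderboard data.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="accuracy")

# Refit on all development data, then score once on the holdout.
holdout_acc = model.fit(X_dev, y_dev).score(X_hold, y_hold)
gap = scores.mean() - holdout_acc

print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Holdout accuracy: {holdout_acc:.3f} (gap {gap:+.3f})")
```

If the gap is comparable to the fold-to-fold standard deviation, sampling noise is a plausible explanation. A gap that stays well outside that range across seeds is more suggestive of overfitting or of choices tuned too closely to the CV folds.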

sonic nacelle
#

@everyone Question for the Group - I have been working through a Data Science bootcamp and wanted to keep sharpening my skills. I found this contest and figured I would give it a try...however, I am a little intimidated by the fact I am still fairly new to this Data Science arena. Does anyone have any thoughts on this?

waxen sage
charred tinsel
ocean pivot
#

do you know which data is compared with our predicted data?

charred tinsel
ocean pivot
#

the same test labels which are given in the test.csv right?

charred tinsel
ocean pivot
#

I mean, when we submit our predictions on Kaggle, how do they measure accuracy? There has to be some reference.

charred tinsel
ocean pivot
#

Thanks, I actually got confused by that gender submission dataset

#

I got an accuracy of 76. Any suggestions on how I can improve?

charred tinsel
jolly matrix
charred tinsel
waxen sage
sturdy sinew
frosty hamlet
#

Hello guys I'm new

Notice me 😭

#

By the way guys I have a question

I just got started with the titanic prediction.

Wanted to find out how y'all dealt with missing data especially in the AGE column.

Should I use the age mean to fill the missing part?

There are quite a lot of missing values tho

timid magnet
frosty hamlet
frosty hamlet
# eager cedar You imputed using median on numerical pipeline, this applies to age since it's t...

After filling the 177 missing values in the Age column with the median, the skewness actually increased (0.51) compared to what it was (0.36) when the values were still missing

I had hoped that filling the missing values with median would have given a distribution closer to normal based on logic/common sense

Log transformation and square root transformation didn't help matters.

Planning to try other methods I find on the net but what would you suggest?
I hope I'm not the only one getting such issues tho

charred tinsel
# frosty hamlet After filling the 177 missing values in the age column with median the skewness ...

Imputing with a constant value will generally distort the distribution of that feature if there are many missing values, because the frequency will peak at the imputed value. This may or may not affect the model you use (it depends). To stay close to the original distribution, you may want a better strategy, such as using other features to predict Age.
But if the feature is not that important, you may just be wasting some of your time.
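
A small sketch of the effect described above, on a synthetic right-skewed column with 177 of 891 values missing (the counts mirror the thread; the ages are made up). It compares a constant median fill with random draws from the observed values, which is one simple alternative among several:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "Age" column with 177 of 891 values set to NaN.
age = pd.Series(rng.gamma(shape=4.0, scale=7.5, size=891))
age.iloc[rng.choice(891, size=177, replace=False)] = np.nan

observed = age.dropna()

# Strategy 1: constant fill with the median of the observed values.
median_filled = age.fillna(observed.median())

# Strategy 2: draw each missing value at random from the observed values.
draws = rng.choice(observed.to_numpy(), size=int(age.isna().sum()), replace=True)
sampled_filled = age.copy()
sampled_filled[age.isna()] = draws

summary = pd.DataFrame(
    {
        "observed": [observed.std(), observed.skew()],
        "median_fill": [median_filled.std(), median_filled.skew()],
        "sampled_fill": [sampled_filled.std(), sampled_filled.skew()],
    },
    index=["std", "skew"],
)
print(summary.round(3))
```

The median fill piles 177 identical values onto one point, so the spread shrinks and shape statistics such as skewness move with it. The random draws keep the shape closer to the observed one, at the cost of ignoring any relationship between Age and the other features.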

frosty hamlet
charred tinsel
frosty hamlet
waxen sage
waxen sage
charred tinsel
waxen sage
# charred tinsel Hey btw, what's the progress? You were working on that.. I was assuming both hav...

I had discovered back on the 7th that distribution for some variables was different between training and test sets and had been accounting for it since then, until I realized today that accounting for distribution shift between training and test sets is a type of data leakage (IMO) because per competition rules the test set is supposed to be "unseen" data. I've begun adjusting my analysis to assess distribution shift between training/test folds of training set data only.

charred tinsel
#

But as long as you are not really peeking into the test data, and are using standard techniques for correcting distribution shift with information from the training set only, I'd consider that there won't be any data leakage.
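
A hedged sketch of checking for distribution shift between folds of the training set only, as discussed above. It uses a two-sample Kolmogorov-Smirnov test from SciPy on one numeric feature. The `fare` array is synthetic, and the choice of test is an assumption rather than what anyone in the thread actually ran:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

# Hypothetical numeric feature taken from the training set only (synthetic values).
fare = rng.lognormal(mean=3.0, sigma=1.0, size=891)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
p_values = []
for fit_idx, val_idx in kf.split(fare):
    # Compare the feature's distribution in the fitting part vs. the held-out part.
    result = ks_2samp(fare[fit_idx], fare[val_idx])
    p_values.append(result.pvalue)

# Small p-values would flag folds whose feature distribution differs noticeably.
print([round(p, 3) for p in p_values])
```

Because the folds are random splits of the same data, large p-values are the expected outcome here. The same loop on real features shows which ones are unstable across splits without touching the test set.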

thick pagoda
# eager cedar

What did this end up scoring after submission? Curious, seeing these nice numbers. I'm at 0.79665 on my 3rd submission (my F1 has always been close to my score so far, up from 0.77 on my first try).

thick pagoda
# waxen sage

If you actually peek at the submission distribution (i.e. use the unseen test-set statistics) and bake that back into your training logic or final thresholds, you've leaked information from the test set.

waxen sage
thick pagoda
#

Ofc np, I'm new to this stuff, but I instantly presumed most of the 1.0 (100%) models exploit this, and it's not in the spirit of the exercise!
How did your models do? @waxen sage

waxen sage
thick pagoda
#

That's really good I think. I'm at 79.6%, and now the problem is I saw that 'the answer' to getting "more" isn't anything code- or model-specific, so I'll stop there for now. If the model's generalizing in that range, it's a good model imo 🤷‍♂️

#

It's a good lesson in feature engineering above everything. I went in with lots of fancy code, and although I did a pretty decent EDA phase and played with UMAP, I missed some key connections in the data that you need to get those 80+ scores... good luck!

stiff marsh
# frosty hamlet After filling the 177 missing values in the age column with median the skewness ...

"I had hoped that filling the missing values with the median would result in a distribution closer to normal, based on logic and common sense."

This is generally true for datasets or features with less than 5% missing values.

However, in the case of the age column, missing values account for about 20% (train dataset only).

Anyway, they say a picture is worth a thousand words, so I did some visualizations to help you understand better:

What happens when you use the median to impute the 177 missing values is that all those values are dumped at a single point, which greatly distorts the distribution. The main goal of imputation is to fill missing values in a way that resembles the original distribution.

#

This is my second Discord account btw; my first one got hacked. Anyway, using only gender as a feature in this problem scores 76 percent, and you can use that as a comparison with your current model. The highest legit score I've seen so far without cheating is probably from cdeotte: 84%
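
For reference, the gender-only rule mentioned above takes just a few lines with pandas. The toy `test` frame below stands in for test.csv (same column names, illustrative values), and `csv_text` is just a name chosen for this sketch:

```python
import pandas as pd

# Toy stand-in for test.csv (same column names; values are illustrative).
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Gender-only rule: predict survival for every female passenger and no male passenger.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})

# The submission format is these two columns; index=False keeps the row index out.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

This is also the rule behind the gender_submission.csv sample file, so any model worth keeping should beat it.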

frosty hamlet
# stiff marsh "I had hoped that filling the missing values with the median would result in a d...

Thanks a lot for the response Nixon

Do you feel that using the median to fill 177 missing values is the wrong approach in this case, since the result is far from resembling the original distribution?

So in cases like this, maybe I should use smart imputation like you showed in the images you sent

Where I fill the missing values with values within the median range instead of filling solely with the median

frosty hamlet
stiff marsh
frosty hamlet
stiff marsh
# waxen sage

No. By "accounting", I'm assuming you are referring to covariate shift. Checking whether there is a difference in feature distributions is okay.

charred tinsel
#

I am currently at 81.8%

timid magnet
#

So if I have outliers in my data should I blindly go for median or like what?

charred tinsel
# timid magnet Okay so you mean rather than blindly imputing median to missing values, we shoul...

Try to understand the data in general: how the feature with missing values correlates with other features, or how the missingness of the feature may be correlated with another feature (this will help you streamline your strategy).
Sometimes imputing with the median is perfectly sufficient, sometimes it is the only reasonable option, and other times it may even make things worse. You can only find out once you understand the data.
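
A minimal sketch of the missingness check suggested above, assuming the `Pclass`, `Age` and `Survived` columns of train.csv. The rows and the `AgeMissing` helper column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame in the train.csv layout (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, 54.0, 27.0, np.nan, np.nan, 22.0, np.nan, np.nan],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Indicator column: 1 where Age is missing, 0 where it is recorded.
df["AgeMissing"] = df["Age"].isna().astype(int)

# Share of missing ages per class: is missingness concentrated anywhere?
missing_by_class = df.groupby("Pclass")["AgeMissing"].mean()

# Survival rate with vs. without a recorded age: is missingness itself informative?
survival_by_missing = df.groupby("AgeMissing")["Survived"].mean()

print(missing_by_class)
print(survival_by_missing)
```

If missingness clusters in one group, or tracks the target, that argues for group-wise imputation or for keeping `AgeMissing` as a feature rather than filling blindly.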

frosty hamlet
waxen sage
# waxen sage
Poll: For this competition, is accounting for distribution shift between training and test (submission) data sets an example of data leakage? (4 votes in total; the leading answer received 2.)

sullen torrent
#

Hi chat, I'm a new learner on Kaggle and I'm trying to make a notebook submission for the Titanic survivor prediction competition. But even though my output file is created and visible, the competition won't accept the notebook when I click "Create Submission".
Any idea why? I can send screenshots if necessary

charred tinsel
waxen sage
wanton quest
#

Anyone get score 0.8+?

raw sierra
#

using rf

radiant isle
charred tinsel
#

Submission accuracy is 0.8253.
CV was ~0.85

pure lark
abstract anchor
#

what did yall do with cabin column? did you guys just drop it ?

#

I don't even see any relevance of the Cabin feature to our target. Should I drop it, or what should I do?

pure lark
# abstract anchor what did yall do with cabin column? did you guys just drop it ?

No, the cabin is actually useful. Most of the cabin data is missing, but if you look through the data, the cabins that do have values carry prefixes, like C85, E46, etc. The C and E represent the decks. The lifeboats were kept above the upper decks, i.e. closest to A, B and C, so the people living on decks A, B and C had a better chance of survival than those on E, F and G, and of course a better chance than people who didn't have cabins (who lived in dormitories rather than cabins; these are probably the missing values, which are huge in number because the rich people who used cabins were fewer).

#

When I added this deck feature along with some more features my score went from 0.7751 to 0.78708
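
A minimal sketch of such a deck feature, assuming the `Cabin` column of train.csv. The `"U"` label for missing cabins is an arbitrary choice made for this example:

```python
import pandas as pd

# Cabin values in the train.csv style; None marks a missing cabin.
df = pd.DataFrame({"Cabin": ["C85", None, "E46", "B57 B59 B63", None, "G6"]})

# Deck = first letter of the cabin string; "U" (unknown) marks missing cabins.
df["Deck"] = df["Cabin"].str[0].fillna("U")

print(df["Deck"].value_counts())
```

Keeping "unknown" as its own category lets the model use the missingness itself instead of guessing a deck.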

molten grove
#

I got 82% accuracy using RandomForest.

molten grove
still vector
#

0.787 with a binary classification NN. I might try some more feature engineering later. For now I did HasCabin, one-hot encoded the titles (Mr, Mrs, Miss, Other) and Embarked, and combined Parch + SibSp = FamilySize.
I want to try ranking the decks, and changing the 'Other' title to the respective genders.
Fun challenge, would recommend.
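
The features listed above could be built roughly like this with pandas. The rows are illustrative, and the regex and column names are assumptions rather than the poster's actual code:

```python
import pandas as pd

# A few rows in the train.csv layout (values are illustrative).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Uruchurtu, Don. Manuel E",
    ],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "C"],
})

# 1 if any cabin was recorded, 0 otherwise.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# As described above; some notebooks add 1 to count the passenger too.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Keep the three common titles and lump everything else into "Other".
title = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
df["Title"] = title.where(title.isin(["Mr", "Mrs", "Miss"]), "Other")

# One-hot encode the categorical columns; numeric columns pass through unchanged.
features = pd.get_dummies(
    df[["HasCabin", "FamilySize", "Title", "Embarked"]],
    columns=["Title", "Embarked"],
    dtype=int,
)
print(features)
```

The resulting all-numeric frame can be fed to a neural network or a tree model. Scaling `FamilySize` is worth considering for the former.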

lime fable
#

I got 70% accuracy using a Gradient Boosting Classifier

dusk terrace
blissful aurora
blissful aurora
molten grove
#

I'll let you know once it's out.

blissful aurora
#

Thank you !

cedar gust
#

Hey guys, quick question: I haven't previously worked on anything very substantial in AI/ML, so I was wondering whether this Titanic project is doable for me. Any tips/suggestions as to where I can start are also appreciated!

lime fable
fallen carbon
blissful aurora
cedar gust