#🚀┊spaceship-titanic
Classification accuracy: https://www.kaggle.com/competitions/spaceship-titanic/overview/evaluation
Predict which passengers are transported to an alternate dimension
Well, it can't be that my predictions score 0 across multiple models. I must be doing something wrong
I have read all the details but still not sure why any score would be zero (statistically it is impossible)
I found the error. The True and False values are case-sensitive and I was using all caps
I have a question: why don't we drop VRDeck, RoomService, FoodCourt, and ShoppingMall along with PassengerId and Name?
Things like these should not matter in training the model, or am I missing something?
The utilization of specialized services may serve as an indicator that certain people have access to unique resources, thereby increasing their likelihood of being transported. We must consider the potential existence of biases or specific skills that these individuals may leverage—either their own or those of others—when employing these services
thnx man
Anyone interested in teaming up?
yes,
Glad to see you all teaming up! Please post in our https://discord.com/channels/1101210829807956100/1130572338182762657 channel so others have the opportunity to team up as well!
Yo guys! Lol, finally a place to connect to other people, geez. It's been a lonely 7 months, and 3 of them doing Data Analytics and ML. I'm from South Africa by the way, it's great to meet you all! And I just joined this competition
Welcome to our community 🙂
Thanks brother!! Lol I got a long way to go in this and I'm wondering where the summaries are to get up to scratch. I'm totally new to tech
I'm so happy with my solution for null values in the test data in the expenditure columns
I just had to say lol
I haven't started my feature engineering yet, but I'm not on a team either, but anyways, it seems that the utility expenditure is quite important after all...
Hmm anybody got any ideas for further feature creation?
hey completely new to machine learning anyone want to team up
What is the thinking behind filling the NaN values for Cabin?
After data processing and converting to integers, I just filled in random values from min() to max(). There is probably a better solution, but I think this should suffice.
What would be the approach for converting to integers?
For converting, a simple mapping did the trick!
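# pattern is {'category': number}; add one entry per remaining deck letter
deck_map = {'A': 0}
# values missing from deck_map become NaN after .map()
df['CabinDeck'] = df['CabinDeck'].map(deck_map)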
Hi everyone, I am struggling to really get the ball rolling on this project. Is there anyone willing to help me? I am able to do most of the basics and a mixture of other things, which isn't good enough. I am struggling with EDA and then categorical encoding. Please send a DM if anyone is willing to help. Much appreciated. I'm based in South Africa and my timezone is GMT+2. I will put in extra hours to accommodate anyone willing to help me get through the Spaceship Titanic. I am still a noob, but I have 3 months' worth of skill and am self-taught.
thank you so much
My test score is way too low. How can I improve it? I am using Random Forest Classifier
Hey everyone,
I'm thrilled to share that my work for the Spaceship Titanic competition has achieved an impressive Top 5% rank on the leaderboard! 🥳🏆
In this Notebook: https://www.kaggle.com/code/ishanpurohit/top-5-solution-with-detailed-explanation, I've poured countless hours into conducting in-depth analysis, crafting insightful visualizations, and implementing advanced modeling techniques. From uncovering hidden patterns in missing data to optimizing feature engineering, my notebook showcases a comprehensive approach that has propelled me into the top echelon of participants.
If you're looking for a comprehensive guide to conquering the Spaceship Titanic challenge, complete with strategies to improve model performance and achieve remarkable insights, this is the notebook you've been waiting for.
I invite you all to check out my notebook and explore the journey that led to this achievement. Your support means the world to me, so if you find it valuable, please consider giving it an upvote! 🙌👍
damn that notebook is too good! Completely changed my perspective of this competition. I might start from scratch again!
Thank you for your feedback and comment.
Hi! - What worked for me in the past is to look at other's notebook ( @zinc grove shared the notebook in this channel), learn from it, come back to my own notebook and make modifications (vs. spending an enormous amount of time trying to perfect my code). This method enabled me to stay positive, learn from others, make changes, keep the momentum, and repeat. 🤗
I can agree. I reviewed many notebooks and discussions before implementing and finding what works best. I am glad my notebook was helpful.
Another suggestion is to start small; make small progress based on one learning module at a time. When I started with Kaggle, I took one Kaggle Learn course, picked a dataset that interested me, and applied what I learned by analyzing it. The post below talks about the approach in more detail. Hope you find it helpful. 🤗
Getting Started with Data Science | How to leverage Kaggle resources to Maximize Learning .
Hey guys thanks for the help! Lol 2 weeks later and finally got somewhere in the spaceship Titanic challenge. Now my model is trained and is 78% accurate. But now im stuck, I don't know how to deal with the missing column that should be predicted in the test.csv 🤣😅😅 please any suggestions would help. It took me 2 weeks to get to this point and now I'm totally clueless
Could neural networks potentially be used to solve this problem
Or is there not enough data
Update: It was...
I recommend looking through discussions.
Especially...
https://www.kaggle.com/competitions/spaceship-titanic/discussion/315987
Hello! After I converted some columns into numerical values (or normalized some of them), do I still need to do the same on test.csv? Or should we not edit test.csv? Thank you in advance for your answer!
Yeah bro
It's a question with two choices. It's a bit difficult to understand with what you meant by "yeah bro"
You should edit test.csv too
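A rough sketch of what "edit test.csv too" can look like for normalization: compute the statistics on the training data only, then reuse them on the test data. The tiny DataFrames and the `Age_scaled` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are invented).
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"Age": [30.0, 50.0]})

# Learn the normalization parameters from the training set only.
mean = train["Age"].mean()
std = train["Age"].std()

# Apply the same transformation, with the same parameters, to both sets.
train["Age_scaled"] = (train["Age"] - mean) / std
test["Age_scaled"] = (test["Age"] - mean) / std

print(test)
```

The point is that the model sees test features on the same scale it was trained on; recomputing the mean and std from test.csv would shift them.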
All right, thanks a bunch!
No problem
This is my notebook for this competition. I used some basic techniques to extract features and fill NA values, but I get pretty good accuracy. I didn't explore the data or do much EDA, but I am sharing my notebook with all of you so you can find something interesting in the EDA and improve on it. If you have any advice for me, please comment and I will read all of your comments. And if you are interested in my notebook, please support me by voting; I would really appreciate that. Thank you so much!!
https://www.kaggle.com/code/hoanglongroai/80-accuracy-spaceship-titanic#Make-prediction
Anyone interested in teaming up with me?
Nice and clean notebook, well done!
You should consider replacing missing values before datatype conversion. {Series,DataFrame,..}.astype(bool) converts NaN values to True - which is not a good idea especially for the VIP column, where most of the values are False.
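A small sketch of the pitfall described above, using a made-up VIP column. Filling with False is only an assumption here (it is the majority value mentioned above), not necessarily the best imputation.

```python
import numpy as np
import pandas as pd

# Invented VIP values with one missing entry (object dtype because of the NaN).
vip = pd.Series([False, True, np.nan, False], name="VIP")

# Converting first: the NaN silently becomes True.
naive = vip.astype(bool)

# Filling first, then converting: the NaN becomes the chosen fill value.
filled = vip.fillna(False).astype(bool)

print(naive.tolist())
print(filled.tolist())
```

Depending on the pandas version, the `fillna` call may emit a FutureWarning about downcasting object columns; the resulting values are the same.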
Thank you so much for giving me advice!!! 🥰🙏🏻 I will cover that later
Going to look into this competition soon. Pretty similar to normal Titanic problem I assume?
Need partner for this project , a team
My notebook with accuracy of 78.6% on "Spaceship Titanic"
🔗https://www.kaggle.com/code/harshpatelind13/spaceship-titanic-13
hello everyone,
I achieved an accuracy of 78.04% on the spaceship Titanic.
https://www.kaggle.com/dinanksoni/spaceship-titanic-78-04
hey can you help me with something
Like I face problem everytime how to decide which algorithm to use
Hey y'all,
I'm trying my best at feature engineering the spaceship-titanic.
I want to find the best way at tackling the NaN values and impute rather than just delete them. And I ran across this discussion post https://www.kaggle.com/competitions/spaceship-titanic/discussion/315987#2461774
How the heck did they identify so many great and powerful relationships/rules?
Woohoo! I just scored a 0.78536 using Random Forest 🌳 on the space-titanic. I gotta say this one's feature engineering was so rigorous... https://www.kaggle.com/code/m000sey/space-random-forest/edit/run/144798140 I am going to try a few more hyperparameters to see if I can inch up the score
i did the normal EDA on the dataset, now i cant think of what to do next???
What are your insights from the EDA?
Anything stick out to you? Did you try any transformations of the data?
Hey! I'm looking for some feedback on my submission notebook for Spaceship Titanic. I've been studying Python for only 2 weeks and I'm currently studying Google Analytics on Coursera to become a data analyst.
https://www.kaggle.com/code/sebastienmotionstats/spaceship-titanic-sub-motionstats
Yes. Now I am thinking that maybe RandomForests would work here, since there is no direct relation between the data and the desired output.
Am I thinking right?
Need to check that out
Woah, I got 0.75 w/o it
anyone team-up?
I cant get over 0.75
I can team up , but I am not that experienced
dm me
I tried to ensemble 3 models (LR, random forest, and gradient boosting) but only managed to boost my result from about 0.77 to 0.78
Anyone interested on teaming up on this spaceship titanic, can you pls DM me?
Having exact same problem
My model keeps improving in validation score but keeps decreasing in Kaggle score
If someone is willing to look at my code I would be really grateful. I am doing something very stupid and I don't know what
Does CatBoost perform better than XGBoost on this dataset?
With XGBoost I managed to get 0.803
I got 0.71
Share the link
I got 78%
Has anyone tried neural network to solve this problem?
I got 97.96 with decision tree 🙂
This is my notebook: https://www.kaggle.com/code/tomedison/spaceship-titanic
but my score is 0
anyone know why?
thanks
Hello, I'm following this notebook: https://www.kaggle.com/code/oscardata963/spaceship-titanic-notebook
And I'm trying to do it in R.
So far I have something like this: [Photo 2]
Which works but I want the logarithmic scale and the tight_layout where the values are pretty much tight rather than what I have like this: [Photo 3]
Can someone help me? I can't find anything on the internet and I've tried everything. One of the problems I have in my professional journey is the fact that I can't solve things when I'm stuck and I have nobody to solve my problems. Please ping me.
Here the code for you to modify it:
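# Lay out plots in a 1x3 grid with compact margins
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))
# Draw one histogram per numeric column
for (col in colnames(train)) {
  if (class(train[[col]]) == "numeric") {
    hist(train[[col]], main = col, xlab = "")
  }
}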
If I try to add the hist() parameter log = 'y', it gives me these two errors:
Which means I have nulls, plus the 'is not a graphical parameter' one, which I don't really understand. Anyway, it's impressive to me that Python is able to detect and skip the nulls when plotting with matplotlib (because the dtype is object, not numeric or double) and tightens the layout with so little code, while I have to do all this. I just don't know what to do. Please, someone help me.
is it possible to get a perfect score in this competition?
could somebody help me with my question?
that's the problem with my journey in coding and data science I can't solve my problems when I'm stuck
and nobody helps me
I've already also asked in the kaggle forum
if you can recommend me a place where somebody can help me? 
The top two scores on the leaderboard are 0.98 and 0.96, far ahead of the pack at ~0.82, 🤔 could these scores be achieved "gaming the system" by systematically changing submissions to infer where the errors are, or is there a key insight that almost everyone missed?
Probably the latter
The numerical values are correlations, e.g. correlation between Age and Transported = -0.075, correlation ranges from 1 to -1 https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has ...
that doesn't answer my question
see numeric_only parameter of corr() bool, default False
Include only float, int or boolean data.
I guess I don't understand your question. You show a call to corr() which returns a table of numerical values and ask why Transported is converted to a number. The input value of Transported True/False is converted to a number because the numeric_only parameter of corr() defaults to False, the output associated with Transported is numbers because correlations are numbers.
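A toy sketch of how dtypes decide which columns show up in a correlation table. The values are invented, and `numeric_only=True` is passed explicitly here so the result does not depend on the pandas version's default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [20.0, 35.0, 50.0, 41.0],
    "Transported": [True, False, True, False],  # no missing values -> bool dtype
    "CryoSleep": [True, np.nan, False, True],   # a NaN forces object dtype
})

print(df.dtypes)

# Only float, int, and boolean columns are kept when numeric_only=True.
corr = df.corr(numeric_only=True)
print(corr)
```

In this sketch Transported appears in the table because it is a clean bool column, while CryoSleep is dropped because the missing value makes it an object column.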
ok then why are CryoSleep and VIP not converted to numeric? because they have NaNs while Transported doesn't?
i guess that answers my first question tho
now
I want to do the same thing in R
I have so far something like this but I don't know how to add the transported one
chatgpt is telling me this but I don't know why transported is saying it has size 0 when i have done str(train) and see obviously that the variable is still in the dataset
here the code if you want to help me:
# Select numeric columns
numeric_columns <- sapply(train, is.numeric)
# Calculate the correlation matrix for numeric columns
cor_matrix <- cor(train[, numeric_columns], use = "complete.obs")
cor_matrix
Anyways i solve it but 😓
I can't believe the fact that doing that transformation changes the data that much
Is that a problem?
I guess R does the calculations differently than Python, hence those differences, but I don't know whether that might be a problem when training
imma continue with the project while I wait for your answer
now I see it's not only Transported, it's the whole df; seems like cor() in R treats it differently than corr() in Python :(
i hope that is not a problem, but the values are pretty different
https://www.kaggle.com/discussions/questions-and-answers/463429
Hey guys, I posted a question about how to create my own custom transformer in R.
Custom Transfomer Question in R .
Any help could be appreciated
Can anyone help me? https://www.kaggle.com/discussions/questions-and-answers/463648
Feature engineering question in R
Transforming pipelines in R | Spaceship Titanic question.
is very basic
Can somebody help me with this code?
library(mlr3pipelines)
library(mlr3)
# Define the numerical pipeline
num_pipeline <- po("scale", id = "num_scale") %>>%
po("impute", id = "num_impute", param = list(strategy = "median"))
# Define the categorical pipeline
cat_pipeline <- po("encode", id = "cat_encode", param = list(method = "1hot"))
# Define the column transformer
full_pipeline <- po("branch", id = "branch") %>>%
po("pipe", id = "num_pipe", num_pipeline, col_roles = list(num = num_attribs)) %>>%
po("pipe", id = "cat_pipe", cat_pipeline, col_roles = list(cat = cat_attribs)) %>>%
po("ccombine", id = "ccombine")
As you can see, I'm trying to create pipelines in R
Hello, Looking for a partner or team to join for this project. I'm experienced software engineer (10 years) with couple of months experience in ML. Let me know please if anyone is interested.
Transformation pipelines in R question | Spaceship Titanic competition.
If someone helps me i would appreciate it
is a feature engineering question
caret and its functions: the fact that I cannot abstract the pipelines from the data is what bugs me
Hi Sanjeev -- I am also experienced s/w engineer + ML beginner. I have colab notebook which does complete processing from input data to prediction with pytorch. It gets 0.79 score which is around the middle of the leaderboard, where the top credible score is around 0.82. For me, 0.79 is good enough. I'm happy to share the notebook with some ideas how to improve the score by 1-2% if you're interested.
Hello and happy holidays!
I'm an Hobbyist here, so excuse my beginners questions as I am from the business sector. 🙂
In Spaceship Titanic, I see that PassengerId is composed of a group and a number within the group. I would like to create a category labeled IsGroup indicating whether that person is in a group or not (True or False). I was thinking about separating gggg and pp, then counting the gggg occurrences: if higher than 1, IsGroup == True, else IsGroup == False. Is this feasible? Is there a more elegant solution?
A better option might be a group_size feature: it carries more information for a classifier and includes the special case group_size == 1, which would be False in your scheme. Another feature could be family_size, using last name to identify families (how to handle missing names needs a bit of thought). In pandas you can do this with Series.str.split() with underscore as the delimiter, then value_counts() https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html. My philosophy was to look for a shortcut like value_counts(), but if I couldn't find one quickly, to write a dumb loop over the rows or columns in Python; there is no compelling need to optimize speed or code size when engineering a small feature table like this.
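A minimal sketch of the split-then-count idea, on invented IDs in the gggg_pp format. The column names `Group`, `GroupSize`, and `IsGroup` are just illustrative choices.

```python
import pandas as pd

# Invented passenger IDs in the gggg_pp format.
df = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"]})

# Everything before the underscore is the group.
df["Group"] = df["PassengerId"].str.split("_").str[0]

# value_counts() gives passengers per group; map() attaches it to each row.
df["GroupSize"] = df["Group"].map(df["Group"].value_counts())

# The True/False flag is then just a special case of the size.
df["IsGroup"] = df["GroupSize"] > 1

print(df)
```

If train and test share groups, counting over the concatenated frames would probably give more accurate sizes, but that is a judgment call.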
Hello. I'm working on the spaceship titanic competition and I have a technical question. I'm using the suggested notebook to guide me through this exercise. I'm at the section which validates the Random Forest model using the validation set (not the OOB validation. I've already completed that part). The 'rf.evaluate' function returns two values: Loss = 0 and Accuracy = 0.792. Can someone explain the difference between these two?
Certainly! In the context of machine learning models, "Loss" and "Accuracy" are two commonly used metrics to evaluate the performance of a model. Let's break down what each of these metrics represents:
- Loss:
- The loss is a measure of how well the model is performing on a specific task. It quantifies the difference between the predicted values and the actual values (ground truth).
- The goal during training is typically to minimize the loss. Different types of models and tasks use different loss functions. For example, in classification problems, cross-entropy loss is commonly used.
- In the context of the Titanic competition, the loss is likely calculated based on the predictions of whether a passenger survived or not compared to the actual survival status.
- Accuracy:
- Accuracy is a measure of the overall correctness of the model's predictions. It is the ratio of correctly predicted instances to the total instances.
- It is one of the most straightforward metrics and is expressed as a percentage. An accuracy of 0.792 means that approximately 79.2% of the predictions made by the model on the validation set are correct.
- While accuracy is informative, it may not be the only metric to consider, especially in imbalanced datasets. For example, if only a small percentage of passengers survived, a model that predicts all passengers as not surviving might still achieve a high accuracy but may not be useful.
In summary, during the evaluation of a model:
- Loss provides a more detailed and task-specific measure of how well the model is performing.
- Accuracy provides a general measure of overall correctness but may not be sufficient in all cases, especially in imbalanced datasets.
For a more comprehensive evaluation, you may also consider exploring other metrics such as precision, recall, F1 score, or the area under the ROC curve (AUC-ROC), depending on the specific goals and characteristics of your problem.
ChatGPT apparently is only trained with the og titanic
but you get the idea
Remember another thing
In the context of the Titanic competition on platforms like Kaggle, you are correct. The actual survival status of passengers in the test set is typically not provided to participants. During the competition, participants train their models on the training set, where the ground truth (actual survival status) is known, and they validate the performance of their models using a validation set.
The loss reported during training and evaluation is calculated based on the predictions made by the model compared to the known outcomes in the validation set. This is done to get an estimate of how well the model is likely to perform on unseen data. The exact evaluation metric used for loss depends on the competition, but it's often related to the classification task at hand.
Once participants are satisfied with their models and want to make predictions on the test set, they use their trained models to predict the outcomes for the test set. **The actual outcomes for the test set are not provided during the competition. Participants then submit their predictions to the competition platform, which evaluates those predictions based on the true outcomes held by the platform. The platform uses these evaluations to calculate the final performance metrics, and participants are ranked on the public leaderboard accordingly.
In summary, during the competition, participants do not have access to the true outcomes for the test set. They use the training set and validation set to train and evaluate their models, and the final evaluation is based on the unseen test set when submitting predictions to the competition platform.**
Usually, loss is a way to calculate the error that is used when training neural networks. Random forests are quite a different type of classifier. A random forest does not use this kind of loss function in training, so I would guess that loss is reported as zero here as a placeholder, to preserve the same function return as other classifiers in Keras. Instead, random forests use "purity", which measures how well a given node in a tree divides samples into different classes. In scikit-learn, the purity metric can be Gini impurity (default) or entropy; I don't know Keras, maybe it uses the term "loss" for purity in a random forest, in which case the reported loss would be the average purity at the lowest node in the trees. It would be a bit surprising for the training loss to be exactly zero for a challenging classification problem like this, which again suggests the zero is just a placeholder.
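For reference, a plain-Python sketch of Gini impurity for the labels that land in one tree node. This is the textbook formula (one minus the sum of squared class proportions), not code taken from any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# A perfectly mixed node versus a pure node.
mixed = gini_impurity([True, True, False, False])
pure = gini_impurity([True, True, True, True])
print(mixed, pure)
```

A split is considered good when the child nodes have lower impurity than the parent, so a pure node (impurity 0) is the best case.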
Hello, i am getting a rather unusual problem with my submissions, somehow I am getting a score of 0 while sample submission is getting a score of 0.49. I know my model predictions are not that bad😅
It'll be really helpful if someone could guide me through this! Thanks
There could be many reasons. The most common one is that the submission only reads the bool as True or False
Check if you are predicting the values as numeric or even bool but with all uppercase and change it.
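A hedged sketch of that check: normalize whatever the model produced (0/1, bools, or upper-case strings) into real booleans before writing the file. The helper name `to_bool_predictions`, the passenger IDs, and the raw outputs are all made up for illustration.

```python
import pandas as pd

def to_bool_predictions(values):
    """Map 0/1, bools, or 'TRUE'/'false'-style strings to Python bools."""
    truthy = {"true", "1", "1.0"}
    return [str(v).strip().lower() in truthy for v in values]

# Hypothetical IDs and raw model outputs in mixed formats.
submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": to_bool_predictions([1, "FALSE", True]),
})

# pandas writes booleans as True / False, the casing reported to work above.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In a real notebook the last step would be `submission.to_csv("submission.csv", index=False)`; comparing the result against the sample submission file is a cheap sanity check.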
Hello, I'm working on the Spaceship Titanic and found something that is unclear to me.
After constructing the age distribution, I noticed a peak at zero value. I'm trying to understand whether these are missing values or if there are just a lot of children in the dataset. Upon further inspection of the rows with zero age, I observed that they don't spend anything, but they may end up on different planets and either survive or not survive in the end.
There are two main hypotheses: either these are indeed children, or they could be spaceship personnel.
I'm not quite sure how to test this hypothesis, and I would be grateful if more experienced professionals could guide me in the right direction.
Thank you in advance!
Hey! I just made a question and I hope anyone from here could help me! I tried to complete this competition with Tensorflow and I want to improve my score https://www.kaggle.com/competitions/spaceship-titanic/discussion/470564
Hi everyone, I just started out and want to ask about this competition. We have to predict the column 'transported' but there is no column as such in train.csv or test.csv. How will I train the model then?
Hello everyone.
I think someone has already noticed and even taken advantage of it. But I'm at a dead end.
These are scatterplots of Cabin_number and Group_number depending on some categorical features (and target feature). They form lines, and some even belong entirely to one class - for example in Cabin_deck feature. The situation is similar with Cabin_deck and Home_Planet. I want to use this to fill in the missing data, but have not had success with this yet.
If you have any ideas, please write.
Hi everyone, this is my first time entering a competition. My name is Hitesh an aspiring Data Scientist and I'm about to start learning ML. I want to learn through a more practical approach and then learn the theoretical aspect of ML simultaneously hence I'm here.
If anyone can help me getting started, it would be a huge help. Cheers!
Can anyone please help? About 2% of the test dataset for Spaceship Titanic is missing, so how do I predict for those rows?
can someone explain why this is happening
Would love it if someone here could give their insight on my work!
https://www.kaggle.com/code/satvshr/spaceship-titanic-using-gridsearchcv-xgboost
there's a string in one of the columns that the model isn't able to understand; use some encoding to change it to a number
does sample submission contain correct answers?
I want to evaluate my answer without submitting code
New notebook exploring Spaceship-Titanic transported status! Check it out and upvote if you find it helpful: https://www.kaggle.com/code/seifwael123/titanic-data-analysis-a-comprehensive-exploration
Thanks!
is my model tripping ?
I get around 80% accuracy but still , I have rly high doubts that Cabin_num is the most important feature
HI please check this notebooks and give me your feedback. If possible please upvote.
https://www.kaggle.com/code/bvvkarthik/spaceship-titanic-0-80-beginner-friendly
I am getting the score as zero even if I converted the predictions to integers.
hello, everyone. i just joined this competition and i get 0.78 public scores. i want to get more high accuracy and can anyone help me?
Can anyone explain why I'm getting 0000 public score?
Hi, Everyone. This was my first competition, and I got a public score of 0.72. I'm looking forward to working to get this higher. I have been going through the lessons, and for the first time, I am getting my head around machine learning. The other times I have tried it, it did not make sense or I did not know why I was doing something.
I just realise that there is still so much to learn
can anyone please tell https://www.kaggle.com/code/aadishsharma/spaceship-prediction/edit
where i am making error in this
got it thanks
anyone interested in doing this project together
Hi all, I am looking at the solutions, but I cannot understand why only the following features have been selected and the features "Age" and "RoomService" have been left out. I mean, shouldn't numerical columns be preprocessed too?
https://colab.research.google.com/drive/1t1sdCvx9Gl3WPjCqWQMXDzbn0dX1x_Ej?usp=drive_link
Could anyone give me some tips to improve my code?
Hi, do you know why when I submit it gives me 0 score? Thank you! https://www.kaggle.com/code/davidg960/space-adventure
Hello, I think you need to convert your predictions to True/False instead of 0/1
sample submission
i ended up doing several transformations to passengerid however they seem legit in the end, but not legit enough. any suggestions?
ok so i did find a mistake, the id's were different but I replaced them w the original values directly from the test dataset and the error persists
Hello, I have done this spaceship prediction and got 78%. Is there any way to get a higher percentage?
I mean accuracy score
Try with different algorithms
I have tried a random tree classifier, SVMs, and AdaBoost. But all got 78% or below
Which algorithm you have used?
Good morning, everybody. I'm new to Kaggle and I'm happy to be here
hey i did titanic model using multiple linear regression, got accuracy score of 78%
Did you do feature engineering?
I think it's a bad decision to use linear regression for this data. You should use classifiers for categorical predictions (True/False, Red/White/Green, etc.) instead of regression! Regression is used for continuous predictions like house prices, lengths, or anything else you can measure or calculate.
there is no "transported" feature in the training data
so how did you train the model?
I need more code or an explanation of what you are doing in your analysis. Judging by your message, there are 3 possible mistakes:
- If you are asking for data['transported'], you have a naming mistake: you should write 'Transported' instead of the lowercase version. Everything is case-sensitive in Python.
- If you are asking for (for example) X_train['Transported'], you will get an error, because you should split your data into features and target, e.g. X_train = df.drop(columns=['Transported']) and y_train = df['Transported'], and after that use train_test_split. Your Transported column will then be in the y_train variable.
- You are trying to access a variable that doesn't contain your 'Transported' column.
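The feature/target split from the second point, as a small runnable sketch. The data is invented, and the 80/20 split and `random_state=0` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for train.csv with a 'Transported' target column.
df = pd.DataFrame({
    "Age": [22, 35, 41, 18, 60, 27, 33, 49, 55, 30],
    "Spa": [0, 120, 0, 15, 300, 0, 0, 80, 10, 0],
    "Transported": [True, False, True, True, False, True, False, False, True, True],
})

# Features are everything except the target; the target lives in y.
X = df.drop(columns=["Transported"])
y = df["Transported"]

# Hold out 20% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_valid.shape)
```

After this, `X_train['Transported']` raises a KeyError by design, because the target is only in `y_train` and `y_valid`.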
Got u
U did wit logistic regression?
You didn't read my first post, the next conv doesn't make sense
I already extracted the column “transported “ from the data to y
Chill oleh
I was able to achieve 71% accuracy with minimal feature engineering, using pretrained LLM embedding models to obtain feature vectors from the textualized data and fitting classifiers to those vectors, if anyone is interested. https://www.kaggle.com/code/liamdavies1/space-titanic-using-llm-enhanced-feature-embedding
Also, don't get me wrong, that is not a good accuracy for this problem. It is just demonstrating a different approach
What was the reason for such low accuracy?
Hey guys, did any of you feature engineer the cabin by separating it into three columns? Like a/b/c, each in its own column and one-hot encoded. Did this show better results?
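A sketch of that three-way split, assuming Cabin values follow a deck/num/side pattern such as "B/0/P". The sample values and the new column names are illustrative, and only the two categorical parts are one-hot encoded here.

```python
import numpy as np
import pandas as pd

# Invented Cabin values, including one missing entry.
df = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# Split on "/" into three new columns; a missing Cabin stays missing in all three.
parts = df["Cabin"].str.split("/", expand=True)
parts.columns = ["CabinDeck", "CabinNum", "CabinSide"]
df = df.join(parts)

# The middle part is a number, so keep it numeric rather than one-hot encoding it.
df["CabinNum"] = pd.to_numeric(df["CabinNum"])

# One-hot encode the two categorical parts.
encoded = pd.get_dummies(df, columns=["CabinDeck", "CabinSide"])
print(encoded.columns.tolist())
```

Whether this improves the score is an empirical question; the sketch only shows the mechanics.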
Hey everyone,
I hope you are all doing well. I recently completed the "Titanic Spaceship" model.
I have shared the code in the Kaggle Notebook along with a brief explanation.
If you have any questions regarding the code or any suggestions, feel free to message me.
Kaggle Notebook Link: https://www.kaggle.com/code/mushei/spaceship-titanic-model-code
Hey guys, I tried to follow Samuel Cortinhas's notebook to complete this project, but I got stuck when dealing with missing values of surname. When I follow the code as shown above, I got the following error, could anyone help me with part of the code?
Thank you!
The error message means that you're trying to tell the sns.countplot() function to look for data at a specific position (or index) in your dataset, but that position doesn't exist. It's like trying to find a page in a book that isn't there.
Hey guys, I just started on this competition. I was going to impute the data since a lot of it is missing, but do any of you have any idea how to impute categorical data like Destination? Usually with numerical data I would just impute the mean values. Thanks in advance!
Personally, I use the OrdinalEncoder from sklearn.preprocessing, which first converts all the categorical data into numeric data,
then I fill up missing values of the whole dataset , which is now fully numeric.
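As a simpler alternative to the encode-then-fill approach above (not the same method), a categorical column can be filled with its most frequent value directly in pandas. The Destination values below are only sample data.

```python
import numpy as np
import pandas as pd

# Sample Destination column with one missing entry.
df = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", np.nan, "55 Cancri e", "TRAPPIST-1e"],
})

# mode() ignores NaN by default; [0] picks the first most frequent value.
most_common = df["Destination"].mode()[0]
df["Destination"] = df["Destination"].fillna(most_common)

print(df["Destination"].tolist())
```

The same fill value should come from the training data when applied to test.csv, in line with the earlier advice about preprocessing both files consistently.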
hello , can anyone help me improve my score , I don't know to feature engineer properly so didn't use that , but developed a few ways in which I could get an accuracy of 78% in Space titanic with data casting or feature dropping using correlation.
I get a lower accuracy if I drop the features with lower correlation and found data casting to be more useful , in which I convert all data types in a single data type i.e. float in this case.
I also get a lower score if I partition my train dataset in something other than 80-20.
I played with these parameters , even using Random Forest , but I'm getting different accuracy in Random forest Classifier if I run different times.
Is there a way to optimize more and can anyone help me , with how to vectorize my dataset for parallel computing (just asking).
https://www.kaggle.com/code/rishita00/space-titanic-indepth-classification/notebook
You should expect slightly different accuracies if you run without seeding the random state. How different are they?
I see, so I need to set a seed for consistent results?
Around 2% without changing anything. It varies between 78% and 80% with no change at all.
With feature selection I get an accuracy between 75% and 78%.
Changing the size of the train and validation sets gives:
70:30, 77% to 79%
60:40, 74% to 77%
90:10, 76% to 77%
80:20, 78% to 80% with Random Forest Classifier.
Yeah, that means you can just seed it. You get a different initial state each time you run the code.
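A small sketch of what seeding looks like in scikit-learn, using synthetic data rather than the competition files. The value 42 and the helper name run_once are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # any fixed integer works

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)


def run_once(seed):
    """Split, train, and score with every source of randomness pinned to `seed`."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)


first = run_once(SEED)
second = run_once(SEED)
print(first, second)  # the two accuracies match because both runs share the seed
```

Both the split and the forest draw random numbers, so both need a fixed random_state; seeding only one of them still leaves run-to-run variation.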
hi, new to kaggle and new to competition, anyone who joined recently ?
Can anyone tell me whether RoomService, FoodCourt, ShoppingMall and Spa can be ignored, or do they still need to be processed for null values?
Null values should be imputed, right?
How can you skip data preprocessing? You shouldn't in any case.
Hi everyone, most people seem to be using blending in this competition. Is it really valuable?
Sometimes. Blending is good for combining model strengths, e.g. random forest and linear regression. Blending rarely leads to your model performing worse than before, so it's always worth giving it a go.
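One simple form of blending is a weighted average of predicted probabilities. The sketch below runs on synthetic data and swaps in logistic regression as the linear model, since this is a classification task; the equal 0.5 weights are an assumption and would normally be tuned on a validation set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class from each model.
p_forest = forest.predict_proba(X_val)[:, 1]
p_linear = linear.predict_proba(X_val)[:, 1]

# Blend: equal-weight average, then threshold at 0.5.
blend = 0.5 * p_forest + 0.5 * p_linear
blend_pred = (blend >= 0.5).astype(int)
blend_acc = (blend_pred == y_val).mean()

print("forest:", forest.score(X_val, y_val))
print("linear:", linear.score(X_val, y_val))
print("blend: ", blend_acc)
```

Comparing the three printed accuracies on a held-out set is the quickest way to see whether the blend helps for a given pair of models.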
Understand, thank you so much.
thank you
Does anybody here use an imputer model like IterativeImputer? How do you go about processing or encoding features before feeding the data to an imputer model?
Guys, is it normal that we only get the False ones?
Can I predict the True ones if I only get False ones as a training sample?
Or maybe I misread the data.
Hey did you figure this out
It’s just the sample submission with false ones
So they’re just placeholders to show you the format of submission
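For reference, a sketch of building a submission in the same shape as the sample file. The passenger IDs and predictions here are made-up placeholders; the point is that a boolean column is written as True/False with exactly that capitalization:

```python
import io

import pandas as pd

# Made-up IDs and raw 0/1 model outputs, for illustration only.
test_ids = ["0013_01", "0018_01", "0019_01"]
raw_pred = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": test_ids,
    # Casting to bool makes pandas write True/False rather than 1/0.
    "Transported": pd.Series(raw_pred).astype(bool),
})

# Written to an in-memory buffer here; use a file path for a real submission.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```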
Hi! Have you figured this out? I have the same question
Yeah, I don't know why I didn't see the right doc.
Somehow I got 78% accuracy, which is meh.
Hello everyone, I am new to Kaggle competitions and am getting started with the Spaceship Titanic competition. Can anyone help with how to handle NaN values in different columns? Should they be removed, or can the model be trained with them?
I asked around a bit since then. Basically, you don't want to encode too much, or the features won't be as good in your ML models. Prefer imputation techniques that let you encode and change your features as little as possible: imputers like KNN or SimpleImputer need the least preprocessing. IterativeImputer is a bit trickier, since you want to maintain the features' integrity; there you can normalize or scale (StandardScaler or MinMaxScaler) and choose the right encoding for each feature.
TL;DR: Don't over-preprocess features for an imputation model. It can affect the integrity of the features when they are later used in an ML model.
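A rough sketch of the two model-based imputers mentioned above, applied to a few numeric columns with made-up values and no extra preprocessing. It assumes scikit-learn, where IterativeImputer is still experimental and needs the explicit enabling import:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Made-up numeric values; column names follow the competition data.
num = pd.DataFrame({
    "Age": [24.0, 58.0, np.nan, 33.0, 16.0, 44.0],
    "RoomService": [109.0, 43.0, 0.0, np.nan, 303.0, 0.0],
    "Spa": [549.0, 6715.0, 3329.0, 565.0, np.nan, 291.0],
})

# KNN: fill each gap from the 2 most similar rows.
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(num), columns=num.columns
)

# Iterative: model each column from the others, repeated over several rounds.
iter_filled = pd.DataFrame(
    IterativeImputer(max_iter=10, random_state=0).fit_transform(num),
    columns=num.columns,
)

print(knn_filled)
print(iter_filled)
```

Both imputers leave the observed values untouched and only fill the gaps, so the original features keep their scale unless a scaler is added on purpose.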
Hey all, this might have already been discussed, but I completed this competition a while ago and was wondering whether it would be possible to take the model I trained here and host it on a private Hugging Face repo, for the experience of hosting a model.
I'm a bit new to Hugging Face and was hoping to gain some experience in hosting models there, and I figured a completely fictional dataset would be a good place to start. I'm sure Hugging Face has its own introductory models to host, but I figure having an agent hosted from start to finish using independent sources may be very beneficial.
Hello all ,
I am new here in Kaggle, and this is my first competition (and I am totally lost!)
Please, could someone tell me how you deal with missing values?
These are the counts. Should we just drop them?
HomePlanet 201
CryoSleep 217
Cabin 199
Destination 182
Age 179
VIP 203
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 200
Thanks a lot!
PS: If someone has a place on a team, I would be grateful!
Try to think about whether there are any relationships between the features. Here's an idea: check how all the expenditure features behave when CryoSleep == True, and vice versa.
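A sketch of that check on a toy frame with made-up values. The zero-fill at the end only makes sense if the real data confirms that passengers in cryosleep never spend anything, which is treated as an assumption here:

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up rows; column names follow the competition data.
df = pd.DataFrame({
    "CryoSleep": [True, True, False, False, True],
    "RoomService": [0.0, np.nan, 120.0, 0.0, 0.0],
    "FoodCourt": [0.0, 0.0, 35.0, np.nan, 0.0],
    "ShoppingMall": [0.0, 0.0, 0.0, 15.0, np.nan],
    "Spa": [np.nan, 0.0, 410.0, 0.0, 0.0],
    "VRDeck": [0.0, 0.0, np.nan, 60.0, 0.0],
})

# Total spend per passenger; sum() skips NaN, so a missing amount counts as 0.
total = df[spend_cols].sum(axis=1)
asleep = df["CryoSleep"].eq(True)  # NaN in CryoSleep would compare as False

print("max spend while asleep:", total[asleep].max())
print("max spend while awake: ", total[~asleep].max())

# If asleep passengers never spend, their missing amounts can be set to 0.
filled = df.copy()
filled.loc[asleep, spend_cols] = filled.loc[asleep, spend_cols].fillna(0.0)
```

Rows where CryoSleep is False keep their NaN values, so a different strategy is still needed for those.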
Agreed. In fact, I passed the completely unprocessed data into a HistGBC (no param tuning whatsoever) and got 0.79833, so any score lower than this means the preprocessing is actually hurting the score.
Please look up and understand terms like MCAR, NMAR, and MAR; NaNs could have meaning.
I fed the completely unprocessed data into a HistGBC (no param tuning whatsoever) and got 0.79833, with like 8 lines of code...
maybe this feature leaked?
thank you so much , i'll try it
let's study hard together!
Hello everyone! I am glad that I could join you. I am doing an academic project on this competition. However, only after we had been working on the project for a whole semester did my teacher realise that I am working on a fictional dataset, which might not be of great value to the company because the linear or non-linear relationships between the variables wouldn't hold up, so there was no need to study it. (I am sorry, but I didn't know that I needed to choose a real dataset.) My teacher asked me to justify the choice of this dataset, but I can't think of a justification. 😦 Could you please tell me why you chose to take this challenge, and what the advantages of working on it are? And if you know the origin of this dataset, I would be truly grateful for an answer!
Thank you in advance!
Hello @sleek bone, I participated in this competition some time ago. I think it is an excellent case for applying imputation techniques.
Thank you
I think so too! I mainly explored the MICE package in R. However, I thought that some other methods rely on relations between the variables, and no such relations exist in a fictional dataset... Could I ask which techniques you used for this work? Thank you in advance!
Hello everyone. I used a neural network model to join the competition. However, I get slightly different results when I run the code multiple times. May I ask how I can make a neural network built with TensorFlow reproducible? Thank you!
I also posted a discussion in Kaggle where you can find the notebook that I used. Here is the link: https://www.kaggle.com/competitions/spaceship-titanic/discussion/567619
Hey everyone. I recently completed the Spaceship Titanic competition on Kaggle, and this is my first project where I worked solo. Since I’m still studying this area, I would really appreciate any feedback or suggestions to improve my model. If you see anything I might have done wrong or areas that could use some adjustments, I’d love to hear your thoughts! Looking forward to learning more from the community. Thanks in advance!
My notebook: https://www.kaggle.com/code/laurasaraiva/spaceship-titanic
Your teacher is short sighted
Could you please share why you say that? (Fortunately I have finished this project with him now, but I only achieved a score of 0.803.)
It might not be a real-world dataset, but there's a lot of value in it: as a student, playing around with a dataset as a newbie will help you build skills. There is no point in doing a real-world project if the student's skills are still, let's just say, beginner level. Only if this academic project of yours is at Master's degree level would it make sense.
Thanks! Actually, my project is done on master's degree level...
🔥
Then he is right after all LMFAO
Started this challenge today. I have only worked on famous datasets like MNIST to learn (without any real practice), and jumping into this was a disaster lol
Though I have achieved a 0.799 score
How do I improve it? I tried ensemble learning, included names (thinking there just might be some relation), tried GridSearch, etc.
What else can one explore here to improve the model?
Hello, I was wondering if anyone could tell me why everyone uses the random forest approach to resolving this competition?
Isn't XGBoost better than Random Forest?
You should try feature engineering; I didn't use ensemble learning or grid search and still scored almost the same.
Hey everyone! New year. This is my first project on kaggle. Any tips?
Hi everyone. My team and I are trying this competition, and we used CatBoost along with basic feature engineering to get a pretty good score, but we noticed that our score gets worse when we impute data instead of leaving NaNs. Initially we filled NaNs with the mode of the column, then we added more sophisticated imputation using patterns in the data (||Everyone from deck A/B/C is from Europa, everyone from cabin G is from Earth, everyone under 13 or in cryosleep spends 0 dollars||), but somehow this imputation consistently makes our results worse, and even imputation by mode makes our results worse. This is very hard for me to understand: since the NaNs in this competition really look like the creators just blanked out random cells, it's hard to imagine that any pattern in where the NaNs are has any bearing on anything, and when reading others discussing this competition, they say their results improve with imputation. Does anyone have any idea why this would happen? This is our notebook:
https://www.kaggle.com/code/samrohrer/spaceship-titanic-bad-imputation
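The pattern-based rules described in the message above could be written roughly like this. It is a sketch on made-up rows, not the code from the linked notebook, and it assumes the Cabin value has the form deck/num/side and that "cabin G" means deck G:

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the competition data.
df = pd.DataFrame({
    "HomePlanet": [np.nan, np.nan, "Mars", np.nan],
    "Cabin": ["B/0/P", "G/3/S", "F/1/S", np.nan],
    "Age": [39.0, 24.0, 8.0, 30.0],
    "CryoSleep": [False, True, False, False],
    "RoomService": [np.nan, np.nan, np.nan, np.nan],
})

# Rule 1: decks A/B/C imply Europa, deck G implies Earth.
deck = df["Cabin"].str.split("/").str[0]
deck_to_planet = {"A": "Europa", "B": "Europa", "C": "Europa", "G": "Earth"}
df["HomePlanet"] = df["HomePlanet"].fillna(deck.map(deck_to_planet))

# Rule 2: passengers under 13 or in cryosleep spend nothing.
no_spend = (df["Age"] < 13) | df["CryoSleep"].eq(True)
df.loc[no_spend & df["RoomService"].isna(), "RoomService"] = 0.0

print(df)
```

Rows that match no rule stay NaN, which leaves the option of letting a model such as CatBoost handle them natively.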
Sorry I'm new, so grain of salt.
But the NaN values are likely being handled in SOME way, even if not explicitly by your own instructions: either rows are being dropped, or they are being filled with the mean or 0.
Whatever is happening is doing better than the mode calculation it seems.
It's been a while since we had this issue, and we never figured it out. CatBoost uses the NaNs to try to derive information, but since this is synthetic data and I'm pretty sure they blindly blanked out the same random percentage of each feature, it seems weird for the existence of the NaNs to contain any information
but maybe they just leaned towards NaN'ing out people who vanished or something
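One way to test that hunch is to compare the target rate between rows where a feature is missing and rows where it is present. The sketch below uses synthetic data in which values are blanked completely at random, so the gap should be small; on the real training file the same comparison would be run per column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in: one spend column and a random boolean target.
df = pd.DataFrame({
    "Spa": rng.exponential(300.0, size=n),
    "Transported": rng.random(n) < 0.5,
})

# Blank out about 10% of Spa independently of the target.
df.loc[rng.random(n) < 0.1, "Spa"] = np.nan

is_missing = df["Spa"].isna()
rate_missing = df.loc[is_missing, "Transported"].mean()
rate_present = df.loc[~is_missing, "Transported"].mean()
gap = abs(rate_missing - rate_present)

print(f"missing: {rate_missing:.3f}  present: {rate_present:.3f}  gap: {gap:.3f}")

# Optional: keep the flag as a feature so a model can use it after imputation.
df["Spa_missing"] = is_missing.astype(int)
```

A gap well beyond what random noise would produce suggests the NaNs carry information, in which case imputing without a missing-value flag throws that information away.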
Hey Kagglers and ML Enthusiasts! 👋
I’ve just published a new notebook where I built a model to predict Titanic survival using machine learning techniques like Logistic Regression, Random Forest.
It includes data cleaning, EDA, model comparison, and feature importance — beginner-friendly and easy to follow!
📘 Check it out here:
👉 Titanic Survival Prediction using Machine Learning
If you find it helpful, learned something new, or just want to support the work — a quick upvote would mean a lot! 💙
Let’s grow and learn together
Hi, created a notebook on space titanic : Basic DS Framework 80% accuracy using Random Forest and Boosting algorithms, check it out:
https://www.kaggle.com/code/salahuddinbayassi/beginner-ds-framework-for-80-accuracy
Hi @everyone
I created a wonderful Streamlit app about predicting the Titanic survivors 🚢
You can check it out at: https://app-app-titanic-data-bdwwycbgdejsmtuv4ntkss.streamlit.app/
I'm really interested in your opinions.
Thank you.
Hii
Hi
hi
Hi
Hi
Hi, I have a question about Spaceship Titanic. I first submitted a logistic model that simply filled NAs with zeros and had no new features, which scored around 0.79. However, when I used Gradient Boosting, added some features like total spent > 0 and Cabin -> Deck, and filled NAs using some ideas of my own, the score dropped to about 0.74. I thought my ideas were valuable and would increase accuracy. Has anyone had this experience?
Here is the code that scored 0.74:
https://www.kaggle.com/code/sunghakheo/notebookce8bdbe95e
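For readers who have not built these features yet, here is a sketch of the two mentioned above (total spent > 0 and Cabin -> Deck) on made-up rows. It is not the code from the linked notebook; the new column names TotalSpent, HasSpent, Deck and Side are illustrative, and the Cabin format deck/num/side is assumed:

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up rows; column names follow the competition data.
df = pd.DataFrame({
    "Cabin": ["B/0/P", "F/12/S", np.nan],
    "RoomService": [0.0, 109.0, np.nan],
    "FoodCourt": [0.0, 9.0, 0.0],
    "ShoppingMall": [0.0, 25.0, 0.0],
    "Spa": [0.0, 549.0, 0.0],
    "VRDeck": [0.0, 44.0, 0.0],
})

# Total spend; sum() skips NaN, so a missing amount counts as 0 here.
df["TotalSpent"] = df[spend_cols].sum(axis=1)
df["HasSpent"] = df["TotalSpent"] > 0

# Split Cabin into its parts; a missing Cabin yields missing Deck and Side.
cabin_parts = df["Cabin"].str.split("/", expand=True)
df["Deck"] = cabin_parts[0]
df["Side"] = cabin_parts[2]

print(df[["TotalSpent", "HasSpent", "Deck", "Side"]])
```

Evaluating each new feature with cross-validation, one at a time, helps show which change caused a drop like the one described.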
You may have been lucky the first time. I obtained my best score last year while doing only some basic feature engineering. Now I've implemented many more but I cannot get near that score 🙂
hello all kagglers hope you all are great
would you guys appreciate a notebook baseline template which you can easily iterate on?
Go ahead mate
I am new to Kaggle. How can I access your notebook?
Hello, new here