#🚀┊spaceship-titanic

keen ibex
#

Hi, how are the leaderboard scores calculated?

desert crescent
keen ibex
#

Well, it can’t be that my predictions score 0 even when using multiple models. I must be doing something wrong.

#

I have read all the details but I'm still not sure why any score would be zero (statistically, an exact zero is all but impossible)

keen ibex
#

I found the error. The True and False values are case-sensitive, and I was using all caps.

young basalt
#

I have a question: why don't we drop VRDeck, RoomService, FoodCourt, and ShoppingMall along with PassengerId and Name?

#

Things like these should not matter in training the model, or am I missing something?

keen ibex
zinc grove
#

Anyone interested in teaming up?

coarse wave
white halo
untold storm
#

Yo guys! Lol, finally a place to connect to other people, geez. It's been a lonely 7 months, and 3 of them doing Data Analytics and ML. I'm from South Africa by the way, it's great to meet you all! And I just joined this competition

untold storm
tame cargo
#

I'm so happy with my solution for null values in the test data in the expenditure columns

#

I just had to say lol

#

I haven't started my feature engineering yet, but I'm not on a team either, but anyways, it seems that the utility expenditure is quite important after all...

tame cargo
#

Hmm anybody got any ideas for further feature creation?

hoary monolith
#

Hey, completely new to machine learning. Anyone want to team up?

#

What is the thinking behind filling NaN values for Cabin?

tame cargo
hoary monolith
#

What would be the approach for converting to integers?

tame cargo
#

For converting, a simple mapping did the trick!

deck_map = {'A': 0}  # {category: number}; add one entry per remaining deck
df['CabinDeck'] = df['CabinDeck'].map(deck_map)  # categories missing from deck_map become NaN
untold storm
#

Hi everyone, I am struggling to really get the ball rolling for this project. Is there anyone willing to help me? I can do most of the basics and a mixture of other things, which isn't good enough. I am struggling with EDA and then categorical encoding. Please send a DM if anyone is willing to help. Much appreciated. I'm based in South Africa and my timezone is GMT+2. I will put in extra hours to accommodate anyone willing to help me get through Spaceship Titanic. I am still a noob, but I have 3 months' worth of self-taught skill.

hoary monolith
#

My test score is way too low. How can I improve it? I am using Random Forest Classifier.

zinc grove
#

Hey everyone,

I'm thrilled to share that my work for the Spaceship Titanic competition has achieved an impressive Top 5% rank on the leaderboard! 🥳🏆

In this Notebook: https://www.kaggle.com/code/ishanpurohit/top-5-solution-with-detailed-explanation, I've poured countless hours into conducting in-depth analysis, crafting insightful visualizations, and implementing advanced modeling techniques. From uncovering hidden patterns in missing data to optimizing feature engineering, my notebook showcases a comprehensive approach that has propelled me into the top echelon of participants.

If you're looking for a comprehensive guide to conquering the Spaceship Titanic challenge, complete with strategies to improve model performance and achieve remarkable insights, this is the notebook you've been waiting for.

I invite you all to check out my notebook and explore the journey that led to this achievement. Your support means the world to me, so if you find it valuable, please consider giving it an upvote! 🙌👍

tame cargo
zinc grove
cold bone
zinc grove
cold bone
# untold storm Hi everyone, I am struggling to really get the ball rolling for this project. Is...

Another suggestion is to start small; make small progress based on one learning module at a time. When I started with Kaggle, I took one Kaggle Learn course, picked a dataset that interested me, and applied what I learned by analyzing it. The post below talks about the approach in more detail. Hope you find it helpful. 🤗

https://www.kaggle.com/discussions/getting-started/393853

untold storm
#

Hey guys, thanks for the help! Lol, 2 weeks later and I finally got somewhere in the Spaceship Titanic challenge. Now my model is trained and is 78% accurate. But now I'm stuck: I don't know how to deal with the missing column that should be predicted in test.csv 🤣😅😅 Please, any suggestions would help. It took me 2 weeks to get to this point and now I'm totally clueless

trail tangle
#

Could neural networks potentially be used to solve this problem

#

Or is there not enough data

trail tangle
#

Update: It was...

mighty lake
karmic pivot
#

Hello! After I converted some columns into numerical values (or normalized some of them), do I still need to do the same on test.csv? Or should we not edit test.csv? Thank you in advance for your answer!

karmic pivot
# limpid furnace Yeah bro

It's a question with two choices. It's a bit difficult to understand what you meant by "yeah bro"

limpid furnace
#

You should edit test.csv too
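
A minimal sketch of what that can look like, assuming a numeric column such as Age and made-up values. The two small DataFrames are stand-ins for train.csv and test.csv; the point is that the statistics are learned from the training data only and then reused unchanged on the test data.

```python
import pandas as pd

# Tiny stand-ins for train.csv and test.csv (values are made up).
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"Age": [30.0, 50.0]})

# Learn the normalization statistics from the training data only.
age_mean = train["Age"].mean()
age_std = train["Age"].std()

# Apply the *same* transformation to both files.
train["Age_scaled"] = (train["Age"] - age_mean) / age_std
test["Age_scaled"] = (test["Age"] - age_mean) / age_std

print(test)
```

The same pattern applies to encodings: whatever mapping is built from the training data should be reused as-is on the test data.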

karmic pivot
limpid furnace
#

No problem

limpid furnace
#

This is my notebook for this competition. I used some basic techniques to extract features and fill NAs, but I get pretty good accuracy. I didn't do much data exploration or EDA, but I'm sharing my notebook with all of you so you can find something interesting in EDA and improve on it. If you have any advice for me, please comment and I will read all of your comments. And if you're interested in my notebook, please support me by voting; I would really appreciate that. Thank you so much!!
https://www.kaggle.com/code/hoanglongroai/80-accuracy-spaceship-titanic#Make-prediction

lavish pulsar
#

Anyone interested in teaming up with me?

scenic furnace
limpid furnace
trail crypt
#

Going to look into this competition soon. Pretty similar to normal Titanic problem I assume?

junior cobalt
#

Need a partner or a team for this project

mystic jungle
foggy dirge
lyric basin
#

hey can you help me with something

#

Like, every time I face the problem of how to decide which algorithm to use

sudden rivet
sudden rivet
carmine crescent
#

I did the normal EDA on the dataset, now I can't think of what to do next???

sudden rivet
#

Anything stick out to you? Did you try any transformations of the data?

vale heron
carmine crescent
#

Am I thinking right?

carmine crescent
#

Need to check that out

carmine crescent
#

Woah, I got 0.75 w/o it

thin hearth
#

anyone team-up?

teal shadow
#

I can't get over 0.75

teal shadow
thin hearth
#

No problem let's join

#

Can you share your notebook link with me?

teal shadow
teal shadow
#

I tried to ensemble 3 models (LR, random forest, and gradient boosting) but only managed to boost my result from about 0.77 to 0.78

opaque sedge
#

Anyone interested on teaming up on this spaceship titanic, can you pls DM me?

quasi pier
#

Got 79% at the first shot, then never got higher...

tiny mantle
#

My model keeps improving in validation score but keeps decreasing in Kaggle score

#

If someone is willing to look at my code I would be really grateful. I am doing something very stupid and I don’t know what

quiet widget
#

Does CatBoost perform better than XGBoost on this dataset?

#

With XGBoost I managed to get 0.803

onyx bridge
#

I got 0.71

autumn token
#

I got 78%

placid meadow
#

Has anyone tried neural network to solve this problem?

plain nimbus
#

I got 97.96 with decision tree 🙂

#

but my score is 0

#

anyone know why?

#

thanks

terse verge
#

Hello, I'm following this notebook: https://www.kaggle.com/code/oscardata963/spaceship-titanic-notebook

And I'm trying to do it in R.
So far I have something like this: [Photo 2]

Which works but I want the logarithmic scale and the tight_layout where the values are pretty much tight rather than what I have like this: [Photo 3]

Can someone help me? I can't find anything on the internet and I've tried everything. One of the problems I have in my professional journey is the fact that I can't solve things when I'm stuck and I have nobody to solve my problems. Please ping me.

#

Here the code for you to modify it:

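par(mfrow=c(1,3), mar=c(4, 4, 2, 1))  # 1 row x 3 columns of plots, tight margins
for (col in colnames(train)) {
  # is.numeric() also catches integer columns, unlike class(x) == "numeric"
  if (is.numeric(train[[col]])) {
    hist(train[[col]], main=col, xlab="")
  }
}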
#

If I try to add the parameter log = 'y' to hist(), it gives me these two errors:

#

Which means I have nulls, plus the 'is not a graphical parameter' error, which I don't really understand. Anyway, it's impressive to me that Python is able to detect and avoid the nulls when plotting with matplotlib (because the dtype is object, not numeric or double) and tightens the layout with so little code, while I have to do all this. I just don't know what to do. Please, someone help me.

tacit meadow
#

is it possible to get a perfect score in this competition?

terse verge
#

could somebody help me with my question?

#

that's the problem with my journey in coding and data science I can't solve my problems when I'm stuck

#

and nobody helps me

#

I've already also asked in the kaggle forum

#

if you can recommend me a place where somebody can help me? sad_panda

buoyant igloo
#

The top two scores on the leaderboard are 0.98 and 0.96, far ahead of the pack at ~0.82, 🤔 could these scores be achieved "gaming the system" by systematically changing submissions to infer where the errors are, or is there a key insight that almost everyone missed?

terse verge
#

Probably the latter

terse verge
#

Well the first one definitely used an iterative method 😂

terse verge
#

Why is python making transported a numerical value?

#

like automatically

buoyant igloo
# terse verge Why is python making transported a numerical value?

The numerical values are correlations, e.g. correlation between Age and Transported = -0.075, correlation ranges from 1 to -1 https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has ...

terse verge
#

that doesn't answer my question

buoyant igloo
terse verge
#

what

#

i don't understand you

buoyant igloo
#

I guess I don't understand your question. You show a call to corr() which returns a table of numerical values and ask why Transported is converted to a number. The True/False values of Transported are converted to numbers (True = 1, False = 0) because pandas treats a bool column as numeric in corr(); the output associated with Transported is numbers because correlations are numbers.

terse verge
#

OK, then why are CryoSleep and VIP not converted to numeric? Because they have NaNs while Transported doesn't?

#

i guess that answers my first question tho
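
A small sketch of the dtype effect discussed above, using made-up rows rather than the real train.csv. It assumes a pandas version where corr() accepts the numeric_only argument. A True/False column without missing values is stored as bool, while the same kind of column with a NaN in it falls back to object and is left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0],
    "Transported": [False, True, False, True],   # no missing values -> bool dtype
    "CryoSleep": [False, True, np.nan, False],   # one missing value -> object dtype
})

print(df.dtypes)

# bool columns count as numeric; object columns are left out.
corr_before = df.corr(numeric_only=True)

# Casting to float keeps the NaN and makes the column numeric.
df["CryoSleep"] = df["CryoSleep"].astype(float)
corr_after = df.corr(numeric_only=True)

print(list(corr_before.columns))
print(list(corr_after.columns))
```

Casting to float is only one option; filling the missing values first and then casting to bool would have the same effect on corr().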

#

now

#

I want to do the same thing in R

#

I have so far something like this but I don't know how to add the transported one

#

ChatGPT is telling me this, but I don't know why Transported is reported as having size 0 when I have run str(train) and can clearly see that the variable is still in the dataset

#

here the code if you want to help me:

# Select numeric columns
numeric_columns <- sapply(train, is.numeric)

# Calculate the correlation matrix for numeric columns
cor_matrix <- cor(train[, numeric_columns], use = "complete.obs")

cor_matrix
#

Anyway, I solved it, but 😓

#

I can't believe the fact that doing that transformation changes the data that much

#

Is that a problem?

#

I guess R does the calculations differently than Python and that is where the differences come from, but I don't know whether that might be a problem when training

#

I'm going to continue with the project while I wait for your answer

#

Now I see it's not only Transported, it's the whole df. Seems like cor() in R treats it differently than corr() in Python :(

#

i hope that is not a problem, but the values are pretty different
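
One possible explanation, not confirmed in this thread: the R call above uses use = "complete.obs", which drops every row that has a missing value in any column, while pandas corr() handles missing values pair by pair. A sketch with made-up columns A, B and C shows how the two conventions can give different numbers on the same data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "C": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
})

# pandas default: each pair of columns uses every row where both are present.
pairwise = df.corr()

# Complete-case analysis: drop any row with a missing value first,
# which is what use = "complete.obs" does in R's cor().
complete = df.dropna().corr()

print(pairwise.loc["A", "B"], complete.loc["A", "B"])
```

If that is the cause, use = "pairwise.complete.obs" in R should bring the values closer to the pandas output.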

terse verge
#

Any help could be appreciated

terse verge
#

is very basic

terse verge
#

Can somebody help me with this code?

library(mlr3pipelines)
library(mlr3)

# Define the numerical pipeline
num_pipeline <- po("scale", id = "num_scale") %>>%
  po("impute", id = "num_impute", param = list(strategy = "median"))

# Define the categorical pipeline
cat_pipeline <- po("encode", id = "cat_encode", param = list(method = "1hot"))

# Define the column transformer
full_pipeline <- po("branch", id = "branch") %>>%
  po("pipe", id = "num_pipe", num_pipeline, col_roles = list(num = num_attribs)) %>>%
  po("pipe", id = "cat_pipe", cat_pipeline, col_roles = list(cat = cat_attribs)) %>>%
  po("ccombine", id = "ccombine")

As you can see, I'm trying to create pipelines in R

flint stag
#

Hello, looking for a partner or team to join for this project. I'm an experienced software engineer (10 years) with a couple of months of experience in ML. Please let me know if anyone is interested.

terse verge
#

If someone helps me i would appreciate it

#

is a feature engineering question

#

caret and its functions: the fact that I cannot abstract the pipelines from the data is what bugs me

buoyant igloo
# flint stag Hello, Looking for a partner or team to join for this project. I'm experienced s...

Hi Sanjeev -- I am also an experienced s/w engineer + ML beginner. I have a Colab notebook which does the complete processing from input data to prediction with PyTorch. It gets a 0.79 score, which is around the middle of the leaderboard, where the top credible score is around 0.82. For me, 0.79 is good enough. I'm happy to share the notebook, with some ideas on how to improve the score by 1-2%, if you're interested.

mellow crag
#

Hello and happy holidays!
I'm a hobbyist here, so excuse my beginner's questions, as I am from the business sector. 🙂
In Spaceship Titanic, I see that PassengerId is composed of a group and a number within the group. I would like to create a feature labeled IsGroup indicating whether that person is in a group or not (True or False). I was thinking about separating gggg and pp, then counting the gggg occurrences: if the count is higher than 1, IsGroup == True, else IsGroup == False. Is this feasible? Is there a more elegant solution?

buoyant igloo
# mellow crag Hello and happy holidays! I'm an Hobbyist here, so excuse my beginners questions...

I'd suggest a group_size category might be better: it carries more information for a classifier, and it includes the special case group_size == 1, which would be False in your scheme. Another category could be family_size, using last name to identify families (how to handle missing names needs a bit of thought). In pandas you can do this with Series.str.split() with underscore as the delimiter, then value_counts() https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html. My philosophy was to look for a shortcut solution like value_counts(), but if I couldn't find one quickly, to write a dumb loop over the rows or columns in Python. There is no compelling need to optimize speed or code size when engineering a small feature table like this.
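
A minimal sketch of the group-size idea, assuming PassengerId values of the form gggg_pp as described above. The IDs are made up, and the column names Group, GroupSize and IsGroup are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"]}
)

# The part before the underscore identifies the group (gggg).
df["Group"] = df["PassengerId"].str.split("_").str[0]

# value_counts() gives the size of each group; map() attaches it to every row.
df["GroupSize"] = df["Group"].map(df["Group"].value_counts())

# The original True/False idea is then a one-liner on top of GroupSize.
df["IsGroup"] = df["GroupSize"] > 1

print(df)
```

When the feature is built for real, it is probably worth counting groups consistently for train.csv and test.csv, whichever convention is chosen.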

inland summit
#

Hello. I'm working on the spaceship titanic competition and I have a technical question. I'm using the suggested notebook to guide me through this exercise. I'm at the section which validates the Random Forest model using the validation set (not the OOB validation. I've already completed that part). The 'rf.evaluate' function returns two values: Loss = 0 and Accuracy = 0.792. Can someone explain the difference between these two?

terse verge
# inland summit Hello. I'm working on the spaceship titanic competition and I have a technical q...

Certainly! In the context of machine learning models, "Loss" and "Accuracy" are two commonly used metrics to evaluate the performance of a model. Let's break down what each of these metrics represents:

  1. Loss:

    • The loss is a measure of how well the model is performing on a specific task. It quantifies the difference between the predicted values and the actual values (ground truth).
    • The goal during training is typically to minimize the loss. Different types of models and tasks use different loss functions. For example, in classification problems, cross-entropy loss is commonly used.
    • In the context of the Titanic competition, the loss is likely calculated based on the predictions of whether a passenger survived or not compared to the actual survival status.
  2. Accuracy:

    • Accuracy is a measure of the overall correctness of the model's predictions. It is the ratio of correctly predicted instances to the total instances.
    • It is one of the most straightforward metrics and is expressed as a percentage. An accuracy of 0.792 means that approximately 79.2% of the predictions made by the model on the validation set are correct.
    • While accuracy is informative, it may not be the only metric to consider, especially in imbalanced datasets. For example, if only a small percentage of passengers survived, a model that predicts all passengers as not surviving might still achieve a high accuracy but may not be useful.

In summary, during the evaluation of a model:

  • Loss provides a more detailed and task-specific measure of how well the model is performing.
  • Accuracy provides a general measure of overall correctness but may not be sufficient in all cases, especially in imbalanced datasets.

For a more comprehensive evaluation, you may also consider exploring other metrics such as precision, recall, F1 score, or the area under the ROC curve (AUC-ROC), depending on the specific goals and characteristics of your problem.
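
To make the difference concrete, here is a small worked example with made-up labels and predicted probabilities. It is not what rf.evaluate computes internally, just the two generic definitions side by side.

```python
import math

y_true = [1, 0, 1, 1]            # actual labels
y_prob = [0.9, 0.2, 0.6, 0.4]    # predicted probability of class 1

# Accuracy: threshold at 0.5, then count exact matches.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Cross-entropy (log loss): looks at the probabilities themselves and
# penalizes confident wrong answers more than hesitant ones.
log_loss = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for t, p in zip(y_true, y_prob)
) / len(y_true)

print(f"accuracy={accuracy:.3f}, log_loss={log_loss:.3f}")
```

Accuracy only changes when a prediction crosses the threshold, while the loss moves with every change in the predicted probabilities.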

#

ChatGPT apparently is only trained with the og titanic

#

but you get the idea

#

Remember another thing

#

In the context of the Titanic competition on platforms like Kaggle, you are correct. The actual survival status of passengers in the test set is typically not provided to participants. During the competition, participants train their models on the training set, where the ground truth (actual survival status) is known, and they validate the performance of their models using a validation set.

The loss reported during training and evaluation is calculated based on the predictions made by the model compared to the known outcomes in the validation set. This is done to get an estimate of how well the model is likely to perform on unseen data. The exact evaluation metric used for loss depends on the competition, but it's often related to the classification task at hand.

Once participants are satisfied with their models and want to make predictions on the test set, they use their trained models to predict the outcomes for the test set. **The actual outcomes for the test set are not provided during the competition. Participants then submit their predictions to the competition platform, which evaluates those predictions based on the true outcomes held by the platform. The platform uses these evaluations to calculate the final performance metrics, and participants are ranked on the public leaderboard accordingly.**

**In summary, during the competition, participants do not have access to the true outcomes for the test set. They use the training set and validation set to train and evaluate their models, and the final evaluation is based on the unseen test set when submitting predictions to the competition platform.**

buoyant igloo
# inland summit Hello. I'm working on the spaceship titanic competition and I have a technical q...

Usually, loss is a way to calculate error that is used in training neural networks. Random forests are quite a different type of classifier. A random forest does not use this kind of loss function in training, so I would guess that loss is reported as zero here as a placeholder, to preserve the same function return as other classifiers in Keras. Instead, random forests use "purity", which measures how well a given node in a tree divides samples into different classes. In scikit-learn, the purity metric can be Gini impurity (default) or entropy; I don't know Keras, maybe it uses the term "loss" for purity in a random forest, in which case the reported loss would be the average purity at the lowest nodes in the trees. It would be a bit surprising for the training loss to be exactly zero for a challenging classification problem like this, which again suggests the zero is just a placeholder.

broken cape
#

Hello, I am getting a rather unusual problem with my submissions: somehow I am getting a score of 0 while the sample submission gets a score of 0.49. I know my model's predictions are not that bad😅
It'll be really helpful if someone could guide me through this! Thanks

terse verge
#

There could be many reasons. The most common one is that the submission only reads the bool as True or False.

Check whether you are writing the predictions as numeric values, or as bools in all uppercase, and change that.
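
A sketch of one way to avoid that problem, assuming the submission has the two columns PassengerId and Transported. The IDs and predictions below are made up for illustration. Casting the prediction column to bool before writing makes pandas emit True and False in the CSV.

```python
import pandas as pd

# Hypothetical IDs and 0/1 predictions, for illustration only.
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    # Cast to bool so the CSV contains True/False, not 1/0 or TRUE/FALSE.
    "Transported": pd.Series(predictions).astype(bool),
})

# With a file name, e.g. to_csv("submission.csv", index=False), this writes the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Opening the written file in a text editor is a quick way to confirm the exact spelling of the values before submitting.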

rustic kernel
#

Hello, I'm working on the Spaceship Titanic and found something that is unclear to me.

#

After constructing the age distribution, I noticed a peak at zero value. I'm trying to understand whether these are missing values or if there are just a lot of children in the dataset. Upon further inspection of the rows with zero age, I observed that they don't spend anything, but they may end up on different planets and either survive or not survive in the end.

There are two main hypotheses: either these are indeed children, or they could be spaceship personnel.

I'm not quite sure how to test this hypothesis, and I would be grateful if more experienced professionals could guide me in the right direction.

Thank you in advance!

vague sparrow
faint parcel
#

Hi everyone, I just started out and want to ask about this competition. We have to predict the column 'transported' but there is no column as such in train.csv or test.csv. How will I train the model then?

random idol
#

Hello everyone.
I think someone has already noticed and even taken advantage of it. But I'm at a dead end.
These are scatterplots of Cabin_number and Group_number depending on some categorical features (and target feature). They form lines, and some even belong entirely to one class - for example in Cabin_deck feature. The situation is similar with Cabin_deck and Home_Planet. I want to use this to fill in the missing data, but have not had success with this yet.
If you have any ideas, please write.

cursive jetty
#

Hi everyone, this is my first time entering a competition. My name is Hitesh an aspiring Data Scientist and I'm about to start learning ML. I want to learn through a more practical approach and then learn the theoretical aspect of ML simultaneously hence I'm here.

If anyone can help me getting started, it would be a huge help. Cheers!

vivid flicker
#

Can anyone please help? About 2% of the test dataset for Spaceship Titanic is missing, so how do I predict for those rows?

turbid jasper
#

can someone explain why this is happening

dire cradle
inner pulsar
vague bay
#

does sample submission contain correct answers?

#

I want to evaluate my answer without submitting code

worthy moat
fair linden
#

is my model tripping ?

#

I get around 80% accuracy, but still, I have really high doubts that Cabin_num is the most important feature

gusty cipher
echo falcon
#

I am getting the score as zero even if I converted the predictions to integers.

worthy moat
coarse wave
#

Hello, everyone. I just joined this competition and I got a 0.78 public score. I want to get higher accuracy. Can anyone help me?

fickle stirrup
#

Can anyone explain why I'm getting 0000 public score?

knotty crescent
#

Hi, Everyone. This was my first competition, and I got a public score of 0.72. I'm looking forward to working to get this higher. I have been going through the lessons, and for the first time, I am getting my head around machine learning. The other times I have tried it, it did not make sense or I did not know why I was doing something.
I just realise that there is still so much to learn

solar relic
sullen scarab
#

anyone interested in doing this project together

iron ivy
#

Hi all, I am looking at the solutions, but I cannot understand why only the following features have been selected and the features "Age" and "RoomService" have been left out. I mean, shouldn't numerical columns be preprocessed?

compact spruce
vale viper
arctic estuary
#

sample submission

vital yew
#

i ended up doing several transformations to passengerid however they seem legit in the end, but not legit enough. any suggestions?

#

ok so i did find a mistake, the id's were different but I replaced them w the original values directly from the test dataset and the error persists

raven skiff
#

Hello, I have done this spaceship prediction and got 78%. Is there any way to get a higher percentage?

#

I mean accuracy score

raven skiff
#

I have tried a random tree classifier, SVMs, and AdaBoost, but all got 78% or below

#

Which algorithm have you used?

woven shell
#

Good morning, everybody. I'm new to Kaggle and I'm happy to be here

olive ginkgo
#

hey i did titanic model using multiple linear regression, got accuracy score of 78%

hoary plover
ancient smelt
hoary plover
#

so how did you train the model?

ancient smelt
# hoary plover there is no "transported" feature in the training data

I need more code or an explanation of what you are doing in your analysis. Judging by your message, there are 3 possible mistakes:

  1. If you are accessing data['transported'], you have a naming mistake: you should write 'Transported' instead of the lowercase version. Everything is case-sensitive in Python.
  2. If you are accessing (for example) X_train['Transported'], you will get an error, because you should first split your data into features and target, e.g. X_train = df.drop(columns=['Transported']) and y_train = df['Transported'], and after that use train_test_split. Your Transported column will then be in the y_train variable.
  3. You are accessing a variable that does not contain your 'Transported' column.
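
The split described in point 2 can be sketched as follows, assuming scikit-learn is available. The DataFrame is a tiny made-up stand-in for train.csv, and the 25% validation share is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for train.csv (values are made up).
df = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0, 26.0, 28.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0, 291.0, 0.0, 0.0],
    "Transported": [False, True, False, False, True, True, True, True],
})

X = df.drop(columns=["Transported"])   # features only
y = df["Transported"]                  # target only

# Hold out part of the labelled data for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.columns.tolist(), len(X_train), len(X_valid))
```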
hoary plover
#

You did it with logistic regression?

ancient smelt
hoary plover
#

Chill oleh

winter lava
#

Also, don't get me wrong, that is not a good accuracy for this problem. It is just demonstrating a different approach

hoary plover
pale plank
#

Hey guys, did any of you feature engineer the cabin to separate it into three columns? Like a/b/c, each in a separate column and one-hot encoded. Did this show better results?
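
A sketch of that split, assuming Cabin values look like a/b/c with a deck letter, a number and a side letter. The example values and the column name CabinSide are assumptions for illustration; CabinDeck and Cabin_num follow names used elsewhere in this channel.

```python
import pandas as pd

df = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", None, "G/12/S"]})

# Split "deck/num/side" into three columns; missing cabins stay missing.
parts = df["Cabin"].str.split("/", expand=True)
df["CabinDeck"] = parts[0]
df["Cabin_num"] = pd.to_numeric(parts[1])
df["CabinSide"] = parts[2]

# One-hot encode the two categorical parts; the number stays numeric.
encoded = pd.get_dummies(
    df.drop(columns=["Cabin"]), columns=["CabinDeck", "CabinSide"]
)

print(encoded.columns.tolist())
```

Whether this improves the score is something to check on a validation set; the sketch only shows the mechanics.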

hoary plover
#

Hey everyone,
I hope you are all doing well. I recently completed the "Titanic Spaceship" model.
I have shared the code in the Kaggle Notebook along with a brief explanation.

If you have any questions regarding the code or any suggestions, feel free to message me.

Kaggle Notebook Link: https://www.kaggle.com/code/mushei/spaceship-titanic-model-code

raw egret
#

Hey guys, I tried to follow Samuel Cortinhas's notebook to complete this project, but I got stuck when dealing with missing values of surname. When I follow the code as shown above, I got the following error, could anyone help me with part of the code?

#

Thank you!

hoary plover
prisma ridge
#

Hey guys, I just started on this competition. I was going to impute the data since a lot of it is missing, but do any of you have any idea how to impute categorical data like Destination? Usually with numerical data I would just impute it with the mean value. Thanks in advance!
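
One common baseline, sketched here with made-up rows: fill a categorical column with its most frequent value and a numeric column with its median. This is a simple starting point rather than the best possible imputation for this dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", np.nan, "55 Cancri e"],
    "Age": [20.0, np.nan, 30.0, 70.0],
})

# Categorical column: fill with the most frequent value (the mode).
most_common = df["Destination"].mode()[0]
df["Destination"] = df["Destination"].fillna(most_common)

# Numeric column: the median is less sensitive to outliers than the mean.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

An alternative for categorical columns is to fill with an explicit placeholder such as "Unknown", so that the model can treat missingness as its own category.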

upper sail
upper sail
#

Hello, can anyone help me improve my score? I don't know how to feature engineer properly, so I didn't do that, but I developed a few ways to get an accuracy of 78% in Spaceship Titanic with data casting or feature dropping using correlation.
I get a lower accuracy if I drop the features with lower correlation, and I found data casting to be more useful: I convert all data types into a single data type, i.e. float in this case.
I also get a lower score if I partition my train dataset in something other than 80-20.
I played with these parameters, even using Random Forest, but I'm getting a different accuracy from the Random Forest Classifier each time I run it.
Is there a way to optimize more, and can anyone help me with how to vectorize my dataset for parallel computing (just asking)?
https://www.kaggle.com/code/rishita00/space-titanic-indepth-classification/notebook

austere bay
upper sail
austere bay
#

Yeah, that means you can just seed it. You get a different initial state each time you run the code.
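
A sketch of seeding with scikit-learn's random_state parameter, using synthetic data from make_classification instead of the competition files. The helper fit_and_predict and the seed value 42 are just illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, only to demonstrate the effect of the seed.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Train a small forest with a fixed seed and return its probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X, y)
    return model.predict_proba(X)

first = fit_and_predict(42)
second = fit_and_predict(42)

# Same seed, same data: the two runs agree exactly.
print((first == second).all())
```

If the train/validation split is also random, it needs its own fixed random_state for the accuracy to be repeatable.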

native zinc
#

hi, new to kaggle and new to competition, anyone who joined recently ?

#

Can anyone tell me whether RoomService, FoodCourt, ShoppingMall and Spa can be ignored, or do their null values still need to be processed?

tawdry ridge
#

How can you skip data (pre)processing? You shan't in any case.

compact grove
#

Hi everyone, most people here are using blending in competitions. Is it really valuable?

dusk belfry
compact grove
prisma ridge
#

Does anybody here use an imputer model like IterativeImputer? How do you go about processing or encoding features before feeding the data to an imputer model?

fallen pebble
#

guys, is it normal that we only get the false ones ?

#

can i predict the true ones if i only get false one as a train sample

#

or maybe i misread the data

lapis portal
#

It’s just the sample submission with false ones

#

So they’re just placeholders to show you the format of submission

final creek
fallen pebble
#

yea, i dont know why i didnt see the right doc

#

somehow i got 78% of acc which is meh

ocean arch
#

Hello everyone, I am new to Kaggle competitions. I am getting started with the Spaceship competition. Can anyone help with how to handle "nan" values in different columns? Should they be removed, or can the model be trained with them as well?

prisma ridge
# final creek Hi! Have you figured this out? I have the same question

I asked around a bit since then. Basically, you don't want to encode too much, or else the features won't be as good in your ML models. You can use imputation techniques that require you to encode and change your features the least: imputers like KNN or the simple imputer need the least preprocessing. With the iterative imputer it's a bit trickier, since you want to maintain the features' integrity, so you can do things like normalizing or scaling (StandardScaler or MinMaxScaler) and choose the correct encoding for each feature.

TL;DR: Don't over-preprocess features for an imputation model; it can affect the integrity of the features when they are later used in an ML model

neon crystal
#

Hey all, this might have already been discussed, but I completed this competition a bit ago and I was wondering if it would be possible to take the model that I have trained in this competition and host it on a private hugging face repo for the experience of hosting the model.

I'm a bit new to Hugging Face and was hoping I could gain some experience in hosting models there, and figured a completely fictional dataset would be a good place to start. I'm sure that Hugging Face has their own introductory models to host, but I figure having an agent hosted from start to finish using independent sources may be very beneficial.

eager sand
#

Hello all ,
I am new here in Kaggle, and this is my first competition (and I am totally lost!)
Please, could someone tell me how you deal with missing values?
These are the numbers; should we just drop them?
HomePlanet 201
CryoSleep 217
Cabin 199
Destination 182
Age 179
VIP 203
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 200
Thanks a lot!
PS: If someone has a place on a team, I would be grateful!

astral violet
astral violet
astral violet
astral violet
astral violet
astral violet
sleek bone
#

Hello everyone! I am glad that I could join you. I am doing an academic project about this competition. However, it was only after we had been working on our project for a whole semester that my teacher realised that I am working on a fictional dataset, which might not be of great value to the company because the linear or non-linear relationships between the variables didn't hold up, and there was no need to study it. (I am sorry, but I didn't know that I needed to choose a real dataset.) My teacher asked me to justify the choice of studying this dataset, but I can't think of a justification. 😦 Could you please tell me why you chose to take this challenge, and the advantage of working on it? Or if you know the origin of this dataset, I would be truly grateful if you could tell me! virtual_hug Thank you in advance!

inner socket
sleek bone
#

Thank you, I think so too! I mainly explored the MICE package in R. However, I thought that some other methods rely on the relationships between the variables, and no such relationships exist in a fictional dataset... Could I ask which techniques you used for this work? Thank you in advance!

jovial herald
#

Hello everyone. I used a neural network model for the competition. However, I get slightly different results when I run the code multiple times. May I ask how I can make a neural network built with TensorFlow reproducible? Thank you!

I also posted a discussion in Kaggle where you can find the notebook that I used. Here is the link: https://www.kaggle.com/competitions/spaceship-titanic/discussion/567619
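
For context, the usual first step is to fix every random seed before building the model. The sketch below is an assumption about a typical setup, not the code from the linked notebook: `seed_everything` is a hypothetical helper, and the TensorFlow calls are guarded because `tf.keras.utils.set_random_seed` and `tf.config.experimental.enable_op_determinism` only exist in recent TensorFlow 2.x releases.

```python
import os
import random


def seed_everything(seed=42):
    """Seed Python, NumPy and TensorFlow (when installed) with one value."""
    # Only affects hash randomisation in subprocesses started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import tensorflow as tf
    except ImportError:
        return  # TensorFlow not installed; nothing more to do.

    if hasattr(tf.keras.utils, "set_random_seed"):
        # Seeds Python, NumPy and TensorFlow in a single call.
        tf.keras.utils.set_random_seed(seed)
    else:
        tf.random.set_seed(seed)

    if hasattr(tf.config.experimental, "enable_op_determinism"):
        # Asks TensorFlow to use deterministic op implementations,
        # which can make training slower.
        tf.config.experimental.enable_op_determinism()
```

Call `seed_everything(42)` once at the top of the notebook, before creating any layers or datasets. Small run-to-run differences can still remain on some hardware.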

humble pulsar
#

Hey everyone. I recently completed the Spaceship Titanic competition on Kaggle, and this is my first project where I worked solo. Since I’m still studying this area, I would really appreciate any feedback or suggestions to improve my model. If you see anything I might have done wrong or areas that could use some adjustments, I’d love to hear your thoughts! Looking forward to learning more from the community. Thanks in advance!
My notebook: https://www.kaggle.com/code/laurasaraiva/spaceship-titanic

twilit rapids
sleek bone
twilit rapids
#

It might not be a real-world dataset, but there's a lot of value in it for a student: playing around with a dataset as a newbie helps you build skills. There is no point in doing a real-world project if the student's skills are still, let's just say, at beginner level. A real-world dataset would only make sense if this academic project of yours is at master's degree level.

sleek bone
#

Thanks! Actually, my project is done on master's degree level...

inner tulip
#

🔥

twilit rapids
delicate robin
#

Started this challenge today. I have only worked on famous datasets like MNIST to learn (without any real practice), and jumping into this was a disaster lol
Though I have achieved a score of 0.799

How do I improve it? I tried ensemble learning, included names (thinking there might be some relationship in there), tried GridSearch, etc.

#

What else can one explore here to improve the model?

dire merlin
#

Hello, I was wondering if anyone could tell me why everyone uses the random forest approach for this competition?

carmine wedge
carmine wedge
coarse wave
#

Hey everyone! New year. This is my first project on kaggle. Any tips?

glacial fiber
#

Hi everyone. My team and I are trying this competition, and we used CatBoost along with basic feature engineering to get a pretty good score, but we noticed that our score gets worse when we impute missing data instead of leaving the NaNs in. Initially we filled NaNs with the mode of the column, then we added more sophisticated imputation using patterns in the data (||Everyone from deck A/B/C is from Europa, everyone from deck G is from Earth, everyone under 13 or in cryosleep spends 0 dollars||), but this imputation consistently makes our results worse, and so does the simple imputation by mode. This is very hard for me to understand: the NaNs in this competition really look like the creators just blanked out random cells, so it's hard to imagine that the pattern of where the NaNs fall has any bearing on anything, and others discussing this competition say their results improve with imputation. Does anyone have any idea why this would happen? This is our notebook:

https://www.kaggle.com/code/samrohrer/spaceship-titanic-bad-imputation
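
For readers who want to try the rules from the spoiler, here is a rough, dependency-free sketch of how they could be applied row by row. It is not the code from the notebook above. `rule_impute` is a hypothetical helper, rows are assumed to be plain dicts with `None` for missing cells, and the cabin string is assumed to follow the `deck/num/side` layout.

```python
SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Deck -> home planet rules quoted in the message above.
DECK_TO_PLANET = {"A": "Europa", "B": "Europa", "C": "Europa", "G": "Earth"}


def rule_impute(row):
    """Return a copy of `row` with the pattern-based rules applied."""
    row = dict(row)  # copy so the caller's row is not modified

    # Assumption: Cabin looks like "B/12/P", i.e. deck/num/side.
    cabin = row.get("Cabin")
    deck = cabin.split("/")[0] if cabin else None
    if row.get("HomePlanet") is None and deck in DECK_TO_PLANET:
        row["HomePlanet"] = DECK_TO_PLANET[deck]

    # Passengers in cryosleep or under 13 are assumed to spend nothing.
    age = row.get("Age")
    no_spend = row.get("CryoSleep") is True or (age is not None and age < 13)
    if no_spend:
        for col in SPEND_COLS:
            if row.get(col) is None:
                row[col] = 0.0
    return row
```

Cells that no rule covers are left as `None`, so the model's native missing-value handling still applies to them, and the rule-based version can be compared against the untouched data under the same validation split.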

grave helm
glacial fiber
#

but maybe they just leaned towards NaN'ing out people who vanished or something

pearl basin
#

Hey Kagglers and ML Enthusiasts! 👋
I’ve just published a new notebook where I built a model to predict Titanic survival using machine learning techniques like Logistic Regression and Random Forest.
It includes data cleaning, EDA, model comparison, and feature importance — beginner-friendly and easy to follow!

📘 Check it out here:
👉 Titanic Survival Prediction using Machine Learning

If you find it helpful, learned something new, or just want to support the work — a quick upvote would mean a lot! 💙

Let’s grow and learn together

timber harbor
stone forge
tawny roost
#

Hii

weak cave
#

Hi

supple surge
#

hi

tranquil sluice
#

Hi

pastel depot
#

Hi

rocky hazel
#

Hi, I have a question about Spaceship Titanic. I first submitted a logistic regression model that simply filled NAs with zeros and used no new features, which scored around 0.79. However, when I switched to Gradient Boosting, added some derived features such as total spent > 0 and Cabin -> Deck, and filled NAs using a few ideas of my own, the score dropped to about 0.74. I thought my ideas were valuable and would increase accuracy. Has anyone else had this experience?
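
To make the two features mentioned above concrete, here is one way they might be derived. This is a sketch under assumptions, not the poster's code. `add_features` is a hypothetical helper, rows are plain dicts, `Cabin` is assumed to have the form `deck/num/side`, and missing spend values are counted as 0 when totalling, which is itself a modelling choice.

```python
SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def add_features(row):
    """Return a copy of `row` with Deck, CabinNum, Side, TotalSpend, HasSpent."""
    out = dict(row)

    # Assumption: Cabin looks like "G/734/S", i.e. deck/num/side.
    cabin = row.get("Cabin")
    if cabin:
        deck, num, side = cabin.split("/")
        out["Deck"], out["CabinNum"], out["Side"] = deck, int(num), side
    else:
        out["Deck"] = out["CabinNum"] = out["Side"] = None

    # Missing spend values are treated as 0 here (an assumption).
    total = sum(row.get(col) or 0.0 for col in SPEND_COLS)
    out["TotalSpend"] = total
    out["HasSpent"] = total > 0
    return out
```

Checking each new feature with local cross-validation before submitting helps show which change caused a drop in score.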

rocky hazel
pine mauve
keen tree
#

hello all kagglers hope you all are great

untold yarrow
#

would you guys appreciate a notebook baseline template which you can easily iterate on?

floral kite
frozen cobalt
#

Hello, new here