#🏠┊house-prices-advanced-regression-techniques | Kaggle | Page 1

real cloak Aug 9, 2023, 5:20 AM

#

So how to get started? Sorry I m lost in the overwhelming messages

calm saffron Aug 9, 2023, 6:25 AM

#

real cloak So how to get started? Sorry I m lost in the overwhelming messages

For the house price competition, we recommend beginning with the starter notebook (https://www.kaggle.com/code/gusthema/house-prices-prediction-using-tfdf/notebook). If you copy your own version and spend some time reading and modifying it, you can get some familiarity with the problem.

rough zephyr Aug 12, 2023, 7:50 AM

#

anyone wants to team up for this project?

thin dust Aug 13, 2023, 9:46 AM

#

rough zephyr anyone wants to team up for this project?

yes

real cloak Aug 13, 2023, 3:15 PM

#

Anyone can lead?

tough cosmos Aug 15, 2023, 5:11 PM

#

Glad to see you all teaming up! Please post in our https://discord.com/channels/1101210829807956100/1130572338182762657 channel so others have the opportunity to team up as well!

fringe scroll Aug 17, 2023, 1:22 PM

#

@rough zephyr @thin dust can I be part of your team?

tough cosmos Aug 23, 2023, 10:40 PM

#

Hey there. I'm going to delete this message as we prefer to keep these channels competition-specific. Any questions for general help should be asked in https://discord.com/channels/1101210829807956100/1129507880379367525.

ripe thunder Aug 24, 2023, 7:45 AM

#

Hi everyone, I have a rookie question. As my cross-validation score decreases, my RMSE score on Kaggle increases. I can't explain why they move in opposite directions. Anybody?

tidal hawk Aug 31, 2023, 11:55 AM

#

Hey, I am no expert, but I would assume you are overfitting. A lower cross-validation score suggests your model is performing well on your local dataset. However, a high (or increasing) RMSE on Kaggle shows that it's not generalizing well to new, unseen data. @ripe thunder

#

Make sure you're properly preprocessing and validating your data to avoid the issue.

ripe thunder Aug 31, 2023, 12:35 PM

#

Thanks, maybe I can use a validation set approach rather than cv. Will keep in mind the tips.

ivory tapir Sep 5, 2023, 5:21 AM

#

Large dataset hurts my computer lol

#

Thoughts on doing PCA --> training on components, instead of directly on features?

viral path Sep 5, 2023, 2:20 PM

#

ivory tapir Thoughts on doing PCA --> training on components, instead of directly on feature...

I did that. Got good scores. Give it a shot!
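
A minimal sketch of the PCA-then-regress idea, run on synthetic data rather than the competition files. The feature matrix, the 95% variance threshold and the choice of LinearRegression are assumptions for illustration only. Scaling and PCA sit inside a pipeline so they are fitted on the training rows alone.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for numeric house features: 20 columns driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(500, 20))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Scale, keep enough components to explain 95% of the variance, then regress.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
model.fit(X_train, y_train)

n_components = model.named_steps["pca"].n_components_
r2_valid = model.score(X_valid, y_valid)
print(f"components kept: {n_components}, validation R^2: {r2_valid:.3f}")
```

Whether components beat the raw features depends on the model: tree ensembles often gain little from PCA, so it is worth comparing both on the same validation split.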

crystal plover Sep 6, 2023, 2:52 AM

#

Hello - When you analyze data to identify preprocessing needs, do you look at both the Training and Test datasets and apply a technique (e.g. dropping columns with missing data) that will work for both, to save time and effort later? Or do you look at the Training dataset only and apply the same steps to the Test dataset, and if there is an issue, deal with it at that time? The first approach is proactive and the latter approach is reactive.

I am asking because this has happened:
Step 1: Analyzed the Training dataset and identified 3 columns that have missing values
Step 2: Split the Training dataset to Training and Validation sub-datasets
Step 3: Dropped those 3 columns from both sub-sets
Step 4: Defined the model, fit the model with the Training sub-dataset
Step 5: Made predictions using the Validation sub-dataset
Step 6: Determined that the MAE for the model using the column-dropping approach was better than for the model using imputation
Step 7: I was ready to make predictions using the Test dataset; then found out that the Test dataset had additional columns with missing data. SURPRISE! What am I supposed to do in this situation?

Below is the link to my notebook.
https://www.kaggle.com/code/juliasuzuki/house-price-prediction-advanced-regression/notebook

Thank you so much for your response in advance!

crystal plover Sep 6, 2023, 2:02 PM

#

crystal plover Hello - When you analyze data to identify the needs for preprocessing, do you lo...

UPDATE: I received enough responses in Kaggle Discussions (Yes, I posted the same question) and my question is resolved. It is generally a good practice to work only with Training dataset during the model development (preprocessing included). After all, the purpose of Test data is to make an inference from new data (i.e. Test data) using the trained model. But to avoid a situation like the one I mentioned above (find a surprise when working with Test data), I could implement a better preprocessing strategy (e.g. imputation instead of dropping columns). Hope you find this helpful!
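
A small sketch of why imputation avoids the surprise described above. The two toy frames and their values are made up; only the column names come from the competition data. The imputer is fitted on the training rows alone, yet it stores a fill value for every column, so a gap that first shows up in the test data is still handled.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames: the training data has gaps only in LotFrontage,
# while the test data also has a gap in GarageArea.
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0],
                      "GarageArea": [400.0, 500.0, 450.0, 650.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 75.0],
                     "GarageArea": [np.nan, 480.0]})

# Fit on the training rows only. The imputer learns a median for every column,
# including columns that happened to be complete in training.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)

test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```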

tough cosmos Sep 6, 2023, 5:54 PM

#

crystal plover UPDATE: I received enough responses in Kaggle Discussions (Yes, I posted the sam...

Great question and glad it was resolved. For posterity, could you post the link to your question in Discussions?

crystal plover Sep 7, 2023, 12:03 AM

#

tough cosmos Great question and glad it was resolved. For posterity, could you post the link ...

Can you please elaborate "post the link to your question in Discussions" so I can better understand your ask? Thank you. 😀

tough cosmos Sep 7, 2023, 10:32 PM

#

crystal plover Can you please elaborate "post the link to your question in Discussions" so I ca...

Yeah no problem. Could you direct us to where the other post is with the responses?

crystal plover Sep 7, 2023, 11:37 PM

#

Got it! Below is the link to the Discussions forum in Kaggle where I posted my question.

https://www.kaggle.com/discussions/questions-and-answers/437246

tough cosmos Sep 8, 2023, 6:52 PM

#

@crystal plover Thanks! I agree with most of the advice. You want to split your data into train and test datasets before applying any preprocessing or filling in missing values. You fit the preprocessing on the training data only, then apply that fitted preprocessing to the test data. You want to treat your test data as completely separate data during the entire model building process and only use it for scoring the model (and iterating through the tuning process to improve it).

I like this explanation posted on StackExchange as well that summarizes these concepts: https://stats.stackexchange.com/a/95088

Incorporating the preprocessing techniques into a pipeline is a convenient way to modularize them and ensure those same techniques are applied to future, unseen data. This will ensure that new data receives exactly the same treatment that your model was trained on, which also helps avoid errors if your new data has more or fewer features than your old data (e.g. shape errors: ValueError: Error when checking input: expected input_main_input to have shape (7,) but got array with shape (1,)).

#

Here's a good example of how to incorporate a pipeline, which preprocesses and splits the data into train/test: https://www.kaggle.com/code/alexisbcook/pipelines

crystal plover Sep 8, 2023, 7:59 PM

#

tough cosmos Here's a good example of how to incorporate a pipeline, which preprocesses and s...

Thank you! Kaggle Learn is great; I like that each module is small and practical so I can apply techniques to my notebook right away. I looked at this module on pipelines and applied it about 6 months ago. I managed to make my code work, but I don't think I did it correctly. I will try again in this particular notebook using the house price dataset. Hopefully, I will understand it better. 🤓

crystal plover Sep 8, 2023, 8:02 PM

#

tough cosmos @crystal plover Thanks! I agree with most of the advice. You want to spli...

Thank you for sharing the resource. Got it; split data before preprocessing. 🙏

dawn dagger Sep 13, 2023, 11:08 AM

#

Hello. In the case of predicting the price of real estate, is there a need to do EDA? If so, I have a small problem: I cannot extract information from the dataset and I get a little lost in the analysis of the latter. I would just like to know if I can find a more suitable method to do EDA on datasets for a regression task. Thank you in advance for your answers

#

In fact, for my basic EDA, I started by familiarizing myself with the dataset (which has more than 80 columns): the percentage of missing values, the proportion of each type of variable, and the number of qualitative and quantitative variables. Then I moved on to visualizing target/variable and variable/variable relationships, and this is where things got complicated for me, because I couldn't extract any relevant information from this dataset and was only going in circles. Hence my question as to whether there is a need to do EDA on this dataset. Thank you in advance for your responses, once again.

thick knot Sep 13, 2023, 1:56 PM

#

You should always do EDA on datasets

#

They can help reveal patterns that you can use, or give inspiration to features you can engineer for modelling

#

Wdym by “couldn’t extract any relevant information from the dataset” @dawn dagger

dawn dagger Sep 14, 2023, 10:01 AM

#

thick knot You should always do EDA on datasets

Thank you very much dear @thick knot . I then think that I will have the plan of my EDA redone. For this dataset, can you give me some ideas on how to get started? Sincerely

thick knot Sep 14, 2023, 12:54 PM

#

dawn dagger Thank you very much dear @thick knot . I then think that I will have t...

I typically like to do count plots for categorical variates and histograms for numerical ones

#

As well as making a correlation matrix

#

Seeing the distribution of the explanatory variates after taking into account the response is also useful

#

it’s also important you confirm the distributions of data between the training and test sets are similar, as otherwise training may not work
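
A rough sketch of the first EDA steps suggested above, using pandas only. The frame is synthetic (the column names are borrowed from the competition data, the values are invented), and the three summaries shown are just one possible starting point: share of missing values, category counts (the numbers behind a count plot), and correlation of numeric columns with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the training data.
train = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "Street": rng.choice(["Pave", "Grvl"], n, p=[0.95, 0.05]),
})
train["SalePrice"] = (50 * train["GrLivArea"] + 15000 * train["OverallQual"]
                      + rng.normal(0, 10000, n))
train.loc[rng.choice(n, 20, replace=False), "GrLivArea"] = np.nan  # inject gaps

# 1. Share of missing values per column, largest first.
missing_share = train.isna().mean().sort_values(ascending=False)

# 2. Counts per category for a categorical column.
street_counts = train["Street"].value_counts()

# 3. Correlation of each numeric column with the target.
numeric = train.select_dtypes("number")
target_corr = numeric.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)

print(missing_share, street_counts, target_corr, sep="\n\n")
```

Running the same summaries on the test frame and comparing them side by side is one simple way to check that the two sets look alike.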

dawn dagger Sep 14, 2023, 2:31 PM

#

thick knot I typically like to do count plots for categorical variates and histograms for n...

Thank you so much @thick knot . Personally, I would do the opposite with graphs, i.e. variable count graphs for numerical variables and histograms for categorical variables. But thinking twice, I find that it makes more sense to use histograms, or more precisely the barplot function of seaborn, to examine the distribution of each variable. Regarding categorical variables, we could try count plots or even catplots to see the types of categories and the distribution of each category present in these variables. But I owe you a big thank you for clarifying my ideas and also for giving me other ideas. I hope to get back to you soon with further questions. Sincerely.

steel flower Sep 19, 2023, 7:20 PM

#

hello everyone,
I got a 0.14712 score on my first submission of house price prediction.

link :- https://www.kaggle.com/dinanksoni/house-prices-score-0-14712

lament monolith Oct 4, 2023, 4:35 AM

#

Hi everyone! I have a question about the commercial usage of this dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

I work as an editor for a publishing company and we are currently working on a new book around predictive modeling and time series analysis. The author of the book wants to use this dataset to discuss techniques for data exploration and feature engineering. He thinks this dataset would be a good fit to explain these concepts.

Our policy is to use datasets that are open-source and available for commercial use. Or else, we seek permission from the source before we can use it in our books. We didn't find a license for this dataset and the dataset link provided in the Acknowledgements section is broken. So I'm just wondering if we can use this dataset in our book? I would love to get a response from one of the Kaggle staff.

I apologize if this is the wrong place to post this. It would be great if someone could point me in the right direction. I'd be happy to get in touch with someone from Kaggle (over email) to keep it official. I couldn't find a contact on the website hence the post here.

Cheers!

proven field Oct 9, 2023, 4:38 AM

#

I have been having a problem with implementing a pipeline for 2 functions namely, an imputer and a dataset splitter. I get that I could bypass this step , but since I am learning I really want to know how to create and use pipelines.

any help would be much appreciated
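
One possible way to arrange this, sketched on synthetic data: the split is a one-off operation on rows and stays outside the pipeline, while the imputer and the model are chained inside it. The RandomForestRegressor, the step names and the data are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Step 1: split. This acts on rows, so it is not a pipeline step.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: chain imputation and the model so both are fitted on training rows only.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X_train, y_train)

# predict() re-applies the fitted imputer before calling the model.
preds = pipe.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print(f"validation MAE: {mae:.3f}")
```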

fresh sequoia Oct 9, 2023, 3:40 PM

#

So close to top 500

fresh sequoia Oct 11, 2023, 1:43 PM

#

And finally hit top 10%

#

Time to record a video

fading willow Oct 11, 2023, 5:02 PM

#

gg

lament monolith Oct 13, 2023, 4:49 AM

#

lament monolith Hi everyone! I have a question about the commercial usage of this dataset: https...

Just following up on this one. Any pointers would be really appreciated!

gloomy rain Oct 24, 2023, 9:03 AM

#

It's so tough

#

My models can't even hit 10% accuracy

#

If I put too much load on it, the Colab notebook just crashes

past tide Oct 25, 2023, 3:07 AM

#

I have a doubt: suppose we compute the Pearson correlation between every feature and the target and find that some features have a correlation with the target of around 0 (say between -0.05 and 0.05). Should we include these features in the model?

#

Also, if we fit a quick-and-dirty random forest and look at feature importance, should the features with the least importance be included in the final model or not?

past tide Oct 25, 2023, 3:09 AM

#

gloomy rain It's so tough

U should follow the recent Kaggle kernels from Kaggle

#

The ones which are highly voted

past tide Oct 25, 2023, 2:34 PM

#

I have a doubt: the image shows the Pearson correlation between the target and all other predictors. There are some predictors with which the target is very weakly correlated (correlation between -0.05 and 0.05). Should we include these features in the model? In my opinion these features should not be included, since a very weak correlation means a change in the predictor will not be reflected in the target, and hence the two are independent of each other. Am I correct, and what should be done?

low steppe Oct 28, 2023, 3:06 AM

#

past tide I have a doubt , the image shows the correlation(pearson) between the target fea...

The Pearson correlation coefficient only accounts for linear relationships; try method="spearman". And yes, RF for feature selection can be a good choice.
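
A short illustration of the difference, on made-up columns. It uses pandas' `DataFrame.corr` with `method="pearson"` and `method="spearman"`; the column names `monotonic` and `u_shaped` are invented for the example.

```python
import numpy as np
import pandas as pd

x = np.linspace(-3, 3, 201)
df = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(x),   # always increasing, but not a straight line
    "u_shaped": x ** 2,       # fully determined by x, but not monotonic
})

# Correlation of every column with x under both methods.
pearson = df.corr(method="pearson")["x"]
spearman = df.corr(method="spearman")["x"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(3))
```

For the monotonic column, Spearman reaches 1 while Pearson stays noticeably lower. For the U-shaped column both are close to zero even though the column is computed directly from `x`, so a near-zero correlation on its own does not show that a feature is independent of the target.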

fresh sequoia Oct 28, 2023, 8:32 PM

#

My YouTube vid on this will be up next week. Been recording all day long

cyan turtle Oct 31, 2023, 4:00 AM

#

I have a question regarding cross-validation using GridSearch: is there any point in a train/validation split, since the grid search doesn't use the validation dataset to get the best score of the algorithm?

fresh sequoia Nov 6, 2023, 3:56 PM

#

For anyone who wants to watch a youtube video on the series: https://www.youtube.com/watch?v=UqmulHG4IvY&t=1s&ab_channel=RyanNolanData

YouTube

Ryan Nolan Data

Data Science Beginner Project: Kaggle House Prices Regression Analy...

Welcome to our latest data science project! In this exciting YouTube tutorial, we'll dive into the world of advanced regression analysis using Kaggle's House Prices dataset. When working on the project, the code was able to achieve a top 10% score!

Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/kaggle-housing-youtube-video

Email: ryan...

still void Dec 7, 2023, 12:53 AM

#

cyan turtle I have a question regarding the cross validation using GridSearch, are there any...

Are you using the GridSearch of sklearn? The one of sklearn is called GridSearchCV, the CV means it is using cross validation

cyan turtle Dec 7, 2023, 1:32 AM

#

still void Are you using the GridSearch of sklearn? The one of sklearn is called GridSearch...

Yes, it is. So basically, it splits the data passed in the brackets further for cross-validation and makes the best of it. Then I will use the validation dataset that I split off earlier to test the score. Once I am satisfied with the result, I can apply it to the competition's test dataset for submission, right?
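
The workflow described here could look roughly like the sketch below. The data is synthetic, and Ridge with an `alpha` grid is only a stand-in for whatever estimator and parameters are actually being tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=400)

# Hold out a validation set that the grid search never sees.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# GridSearchCV runs 5-fold cross-validation inside X_train for every alpha,
# then refits the best setting on all of X_train (refit=True is the default).
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

cv_rmse = -search.best_score_                 # mean RMSE across the inner folds
valid_rmse = -search.score(X_valid, y_valid)  # score() uses the same scorer
print(search.best_params_, round(cv_rmse, 3), round(valid_rmse, 3))
```

If the hold-out RMSE is much worse than the cross-validated one, that is a hint the tuning has overfitted to the training folds.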

rare mauve Dec 14, 2023, 9:37 PM

#

Hey everyone, I got an RMSE score of 0.4 using this notebook. Quick question: how can I improve it?
https://www.kaggle.com/code/ayeshairshadcoder/house-price-prediction-competition/

rare mauve Dec 14, 2023, 9:41 PM

#

rare mauve hey everyone i got score of rmse0.4 by using my this notebook quick question is...

@fresh sequoia

rare mauve Dec 14, 2023, 9:42 PM

#

rare mauve hey everyone i got score of rmse0.4 by using my this notebook quick question is...

@steel flower

past tide Dec 26, 2023, 12:22 PM

#

Hello everyone, Just completed notebook link here https://www.kaggle.com/code/nishchay331/n6-house-prices-advanced-regression-techniques. I'll be glad if you take some time to go through this and point my mistakes as I am a beginner. Suggestions are highly appreciated. Thank you.

worthy lava Jan 5, 2024, 7:46 AM

#

Hi

#

Is there any way I can use deep learning/ neural nets on this data set

#

Am new to deep learning

fresh sequoia Jan 6, 2024, 1:08 AM

#

yes but probably better to use a prebuilt regression

worthy lava Jan 8, 2024, 5:11 AM

#

fresh sequoia yes but probably better to use a prebuilt regression

I mostly just want to learn deep learning/neural nets

#

Do u have any idea how I could apply it?

#

I have done a decent bit of regression before in another project

toxic tide Feb 5, 2024, 2:03 PM

#

why is this happening?

daring dragon Feb 5, 2024, 7:34 PM

#

@past tide Great Work!…

vapid jay Feb 10, 2024, 5:51 PM

#

Looking for a team

#

please dm me

fierce epoch Feb 11, 2024, 3:21 PM

#

i got a score of 0.14

#

@rare mauve

rare mauve Feb 11, 2024, 3:26 PM

#

fierce epoch i got a score of 0.14

wow ..

fierce epoch Feb 11, 2024, 3:27 PM

#

rare mauve wow ..

https://tenor.com/view/mean-girls-ikr-gif-4219221

mild crystal Feb 13, 2024, 4:53 PM

#

looking for a team anyone interested?

modest sapphire Feb 13, 2024, 5:35 PM

#

mild crystal looking for a team anyone interested?

ok

#

im noob though

mild crystal Feb 13, 2024, 5:37 PM

#

modest sapphire ok

me too

modest sapphire Feb 13, 2024, 5:38 PM

#

oki

final cairn Feb 20, 2024, 3:50 AM

#

@toxic tide Did you figure out why?

hoary prism Mar 11, 2024, 5:01 PM

#

Hey, I am new to ML and am learning it by watching YT videos and MOOCs. I am looking for a mentor/guide/buddy who can share his/her experience with me and help me become a better ML practitioner.

jovial jewel Mar 14, 2024, 3:59 AM

#

Hey is 0.14 a good score to start with? Like it's my first submission but I would change a lot of things actually. Is it that good!?

brazen hare Mar 22, 2024, 5:21 PM

#

maybe im the epoch?

fresh sequoia Apr 3, 2024, 7:18 PM

#

if anyone wants a vid to follow, I made a 3 hour one on this project: https://www.youtube.com/watch?v=UqmulHG4IvY

onyx spoke Apr 7, 2024, 6:47 PM

#

Hello, I am a student just starting in ML and I am struggling with this competition.
Looking for teammates, a buddy, a senior or anyone else to share experiences.

desert stirrup Apr 13, 2024, 10:28 AM

#

Notebook links attached to sub headings of the competition description are not working.

verbal wren Apr 23, 2024, 1:48 PM

#

in the house price dataset, how to deal with the columns which are having very high number of counts of a single value, say more than 90%
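
One way to at least find such columns, sketched on a toy frame. The 0.90 cut-off comes from the question; the values are invented, and dropping is only one option (a rare value can still carry signal, so it is worth checking against the target first).

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],   # 95% one value
    "Utilities": ["AllPub"] * 20,         # constant
    "OverallQual": [5, 6, 7, 8] * 5,      # well spread
})

# Share of rows taken by the most common value in each column (NaN counts as a value).
top_share = df.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

threshold = 0.90  # assumed cut-off, taken from the question
near_constant = top_share[top_share > threshold].index.tolist()
reduced = df.drop(columns=near_constant)

print(top_share)
print(near_constant)
```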

sinful token Apr 28, 2024, 2:43 AM

#

past tide I have a doubt , the image shows the correlation(pearson) between the target fea...

I used random forest to look at feature importance and this is what i have got

cols    imp

47 OverallQual 0.765166
59 GrLivArea 0.146761
56 1stFlrSF 0.023267
55 TotalBsmtSF 0.023234
34 GarageFinish 0.015694
70 GarageArea 0.011260
67 Fireplaces 0.008944
52 BsmtFinSF1 0.005675
6 LotConfig 0.000000
8 Neighborhood 0.000000
1 Street 0.000000
2 Alley 0.000000
3 LotShape 0.000000
4 LandContour 0.000000
5 Utilities 0.000000
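
For reference, a table like this can be produced along the following lines. This is a sketch on synthetic data, not the code behind the numbers above; `noise_a` and `noise_b` are made-up columns that have no relation to the target by construction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.normal(1500, 400, n),
    "noise_a": rng.normal(size=n),   # unrelated to the target by construction
    "noise_b": rng.normal(size=n),
})
y = 20000 * X["OverallQual"] + 40 * X["GrLivArea"] + rng.normal(0, 5000, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importance = (
    pd.Series(forest.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance.round(3))
```

Impurity-based importances are computed on the training data and tend to be split between correlated features, so a low value is a prompt to investigate rather than proof that a column is useless.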

regal heart May 1, 2024, 7:30 PM

#

Hello, I'm trying to start the home price prediction project. But noticed I keep getting issues or errors with my tensorflow. I'm trying to print or describe the csv that I currently loaded. I'm running the code on vs code on windows 11. Any help will be great

high jasper May 3, 2024, 5:04 PM

#

anyone interested in doing this project together

dense edge May 7, 2024, 2:36 PM

#

high jasper anyone interested in doing this project together

yeah why not

frosty halo May 25, 2024, 2:57 PM

#

high jasper anyone interested in doing this project together

I am down! Hope it is not too late

#

I am just starting this competition. Looking for teammates. Please let me know if you're interested in working together

steady laurel Jul 6, 2024, 1:13 PM

#

Hi guys, the result after doing the submission is the MSE? Meaning the one that we are ranked upon

earnest moon Jul 12, 2024, 8:48 PM

#

steady laurel Hi guys, the result after doing the submission is the MSE? Meaning the one that ...

Almost: it's the RMSE (not the MSE) between log(predicted SalePrice) and log(actual SalePrice)

distant zodiac Aug 28, 2024, 12:07 PM

#

sinful token I used random forest to look at feature importance and this is what i have got ...

you can see correlation between feature and the target feature (by corr() )

past patrol Sep 8, 2024, 12:41 AM

#

Is anyone interested in collaborating as a team? Feel free to DM.

pearl oak Sep 8, 2024, 3:57 AM

#

past patrol Is anyone interested in collaborating as a team? Feel free to DM.

Hi --- I'm a graduate student studying Data Science, and I'd love to join this challenge. I won't mess up the feature engineering step by trying to write the pipeline the OOP way, I promise lol. Also, if you could help with my question under the Titanic channel, that would be great 🙂

past patrol Sep 8, 2024, 10:22 AM

#

pearl oak Hi --- I'm a graduate student studying Data Science, and I'd love to join this c...

No worries, I'm no expert in feature engineering either, so we'll learn through our mistakes lol. Sure, I'll take a look at your question and see if I can help. 🙂

hushed moon Sep 14, 2024, 7:10 AM

#

past patrol Is anyone interested in collaborating as a team? Feel free to DM.

I am interested too unless you are already too many!

past patrol Sep 14, 2024, 8:14 AM

#

hushed moon I am interested too unless you are already too many!

We’re nearly finished with the project and are currently performing hyperparameter tuning on the models. If you’re interested in working on a different dataset/project related to ML, feel free to DM me.

mint prism Sep 21, 2024, 1:52 PM

#

past patrol We’re nearly finished with the project and are currently performing hyperparamet...

I'm planning on improving my code for this project. My previous code was very messy and I feel like I overdid a lot of stuffs on my pipelining. We can collaborate on improving our code (especially mine lol) if you want to.

past patrol Sep 21, 2024, 5:02 PM

#

mint prism I'm planning on improving my code for this project. My previous code was very me...

Sure, we can discuss this in detail on my server. Also, feel free to check out my updated Notebook on House Prices.
https://www.kaggle.com/code/shahriarrahman10/house-price-prediction-using-gbr

uncut pond Sep 24, 2024, 2:10 PM

#

I got this score in the competition. Is it OK or really bad?

junior forge Sep 24, 2024, 3:15 PM

#

anyone interested in doing this project together?

fair trail Sep 27, 2024, 8:36 AM

#

How are you guys encoding the categorical variables? One-hot encoding would balloon the dataframe to 100s of columns. Is there a more efficient method?

junior forge Sep 27, 2024, 8:38 AM

#

fair trail how are you guys encoding the categorical variable, one hot encoding would ballo...

I've done like you and I also have 100 columns

#

but I delete the columns that are below 0.05 correlation with SalePrice

pallid thistle Sep 27, 2024, 2:45 PM

#

fair trail how are you guys encoding the categorical variable, one hot encoding would ballo...

Yeah man, this is a common issue with categorical variables and one-hot encoding. In addition to @junior forge's answer, and to the best of my knowledge, you can just ignore categorical variables with high cardinality. However, these days I am looking into some NLP solutions. I am not sure if techniques like word embeddings may be a good alternative (if someone already has the answer, do not hesitate to tell me).

junior forge Sep 27, 2024, 6:44 PM

#

pallid thistle Yeah man, this is a common issue with cat variables and one hot encoding. In add...

How does word embedding work?

rustic widget Sep 28, 2024, 1:30 AM

#

Hello, I'm a bit confused. Can anyone explain the submission format? Am I supposed to find np.log(y_predict) or rmse(np.log(y_true), np.log(y_predict))?
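
A sketch of how the two pieces fit together, as far as the competition's evaluation page describes it: the submission file holds an `Id` column and predicted `SalePrice` values in plain dollars, and the log-RMSE is computed on Kaggle's side. The helper name `rmse_log` and all the numbers below are made up; the function is only useful for scoring a local validation split.

```python
import numpy as np
import pandas as pd

def rmse_log(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Local check on a validation split (made-up numbers).
y_valid = np.array([200000.0, 150000.0, 300000.0])
pred_valid = np.array([210000.0, 140000.0, 290000.0])
local_score = rmse_log(y_valid, pred_valid)

# If the model was trained on log1p(SalePrice), undo the transform before writing the file.
pred_log = np.log1p(np.array([180000.0, 250000.0]))  # stand-in for model output
submission = pd.DataFrame({"Id": [1461, 1462], "SalePrice": np.expm1(pred_log)})

print(round(local_score, 4))
print(submission)
# submission.to_csv("submission.csv", index=False) would write the file to upload
```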

pallid thistle Sep 28, 2024, 9:43 AM

#

junior forge How does word embedding work?

Basically, they transform a word (say, the word "house") into a vector of length n.

In addition to representing words as vectors, word embeddings try to place synonyms and similar words close to each other in the vector space.

Therefore, I am wondering if using word embeddings with categorical variables may help. I don't know; I am still searching...

hardy wedge Oct 8, 2024, 9:19 PM

#

hello i was working on the housing regression competition and i am fairly new. I was wondering for a column such as Street, then it has a range of string values such as gravel, pavement... how should i encode this
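
Two simple encodings that could apply here, shown on a toy frame. A two-valued column such as `Street` only needs one 0/1 indicator, while a column with a natural order, such as the quality ratings, can be mapped to integers. The new column names and the 1 to 5 scale are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
})

# Two-valued column: a single 0/1 indicator is enough.
df["Street_Pave"] = (df["Street"] == "Pave").astype(int)

# Ordered quality scale: map to integers that respect the order (Po < Fa < TA < Gd < Ex).
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual_num"] = df["ExterQual"].map(quality_order)

encoded = df[["Street_Pave", "ExterQual_num"]]
print(encoded)
```

Columns with several unordered values are usually better served by one-hot encoding, since integer codes would imply an order that is not there.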

ornate spindle Oct 20, 2024, 10:48 AM

#

hello, i am an absolute beginner in machine learning. i just learned about linear regression and error metrics and wanted to get my hands on a small project using the techniques i learnt. So i started with the famous boston-housing-prices dataset on kaggle and would appreciate if you could take a look at my code: https://www.kaggle.com/code/khalidhelmy55/boston-housing-prices and guide me on what is missing or what could be better done..
according to the metrics i calculated the model is not performing good.

tranquil quail Oct 25, 2024, 6:34 AM

#

Hello everyone, beginner here. I was doing the Intermediate ML course, and in the categorical variables exercise the one-hot encoding went smoothly, but now the DataFrame has missing values. What is the most appropriate course of action? Should I just fill in the missing values, or something else?
As you can see in the screenshot, the previous step was correct.

latent horizon Oct 28, 2024, 9:05 PM

#

tranquil quail hello everyone beginner here, so I was doing the intermediate ML course and in t...

Did you get it figured out? I would suggest re-running all the cells, as you might have run them out of order and introduced missing values.

tranquil quail Oct 29, 2024, 2:30 PM

#

Thank you Tom for your response. Actually later on in the tutorial the values were imputed, so the problem was solved.

tranquil quail Nov 25, 2024, 6:28 AM

#

So, I was thinking of applying what I learned to the competition to see if the accuracy can be improved. Now I have a problem which I am not able to figure out no matter what I do. This is what I want to do:

1. Select categorical columns and apply SimpleImputer to them with a constant value.
2. Select numerical columns and apply SimpleImputer with strategy="mean".
3. Apply ordinal encoding to certain columns and one-hot encoding to others.
4. Finally, train an XGBRegressor model.

But every time, a new error is thrown.

#

this is the pipeline I am using

#

i am imputing the values but still this error is showing up

#

I am thinking that the imputation is not being applied to the testing data.

rustic widget Nov 25, 2024, 7:38 AM

#

tranquil quail i am imputing the values but still this error is showing up

Well this is a common error.
It occurs because you need to first convert the categorical data into numerical data first before imputing it , because SimpleImputer works only with numerical data.
Since you are first trying to impute the categorical data , simple imputer doesn't understand 'nan' (which is a string).

tranquil quail Nov 25, 2024, 8:12 AM

#

rustic widget Well this is a common error. It occurs because you need to first convert the ca...

But in another approach I tried, I imputed the values with strategy="most_frequent" and it worked.
Screenshot attached below:

#

the only difference is the strategy.
the dataset is the same

rustic widget Nov 25, 2024, 8:26 AM

#

tranquil quail but in another approach I tried, and I Imputed the values with strategy = "most_...

Idk, I never tried imputing with the "most_frequent" strategy; I mostly use "median" and at times "mean".
I think it's because you used "most_frequent": for string or categorical data, the most frequent category will fill in the gaps.
But you can't really take a "mean" or "median" of such data, since those strategies are for numbers.

tranquil quail Nov 25, 2024, 8:28 AM

#

rustic widget Idk, never tried imputing with "most frequent" strategy , I mostly use "median" ...

but I used the strategy ="constant" which fills the missing values with the fill_value parameter's value

#

and the error doesn't occur while fitting the transformer. The error happens when I use the transform method on other data

#

what I suspect is that when I am calling the transform method the imputation is not happening for some reason

rustic widget Nov 25, 2024, 8:40 AM

#

tranquil quail and the error doesn't occur while fitting the transformer. The error happens whe...

Ahh.. I understand now. I faced a similar problem when I made a column transformer using sklearn. I built it from the columns in the train set, which also happen to be in the test set, but the transformer only worked well with the training set, so I created another column transformer for the test set separately. Alas, I'm not familiar with the solution.

tranquil quail Nov 25, 2024, 8:41 AM

#

but wont there be data leakage if we do that?

rustic widget Nov 25, 2024, 8:41 AM

#

tranquil quail but wont there be data leakage if we do that?

do what exactly transformer ?

tranquil quail Nov 25, 2024, 8:42 AM

#

fitting another transformer on the testing data?

#

and the features may be inconsistent also?

rustic widget Nov 25, 2024, 8:46 AM

#

tranquil quail fitting another transformer on the testing data?

I'm not quite sure , Well I used the same code as the first transformer only difference was the columns.
I extracted the column names from test data and it works.
I always use skimpy.skim(data) or data.info() or data.describe() after any step in preprocessing.

When I used the same transformer I found from skim(data) that for testing data the conversion was incomplete, that's why I made the 2nd transformer only changing it to test.columns instead of train.columns and it worked since I got a similar description using skim(test).

tranquil quail Nov 25, 2024, 8:52 AM

#

rustic widget I'm not quite sure , Well I used the same code as the first transformer only di...

turns out the error was being thrown by OrdinalEncoder, I made a few changes and the error was resolved.

rustic widget Nov 25, 2024, 8:53 AM

#

tranquil quail turns out the error was being thrown by OrdinalEncoder, I made a few changes and...

Nice! What was the error though ?

tranquil quail Nov 25, 2024, 8:58 AM

#

rustic widget Nice! What was the error though ?

there were two errors:

1. It turns out that the ColumnTransformer applies each of its transformers to the original input columns independently (effectively in parallel), so the imputations were being skipped for the columns that were listed for both imputation and encoding.
2. The OrdinalEncoder was not able to handle unknown values because (my mistake) I did not specify how to handle them.
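
A minimal sketch of how both fixes might look. This is not tranquil quail's actual code: the column names, the toy data, and the "Missing" fill value are assumptions for illustration. Chaining the imputer and the encoder inside one Pipeline branch means the encoder receives imputed values, and `handle_unknown="use_encoded_value"` maps categories unseen during fit to `unknown_value`. The transformer is fit on the training data only and then reused on the test data, which also sidesteps the leakage concern raised earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): the test frame contains a category ("Ex")
# that never appears in the training frame.
train = pd.DataFrame({
    "GarageQual": ["TA", "Gd", np.nan, "TA"],
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
})
test = pd.DataFrame({
    "GarageQual": ["Ex", np.nan],
    "LotArea": [7000.0, np.nan],
})

# Impute first, then encode, inside ONE branch so the encoder sees imputed values.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
numeric = SimpleImputer(strategy="median")

# Each column is listed in exactly one branch, so no branch works on un-imputed data.
preprocess = ColumnTransformer([
    ("cat", categorical, ["GarageQual"]),
    ("num", numeric, ["LotArea"]),
])

X_train = preprocess.fit_transform(train)  # fit on the training data only
X_test = preprocess.transform(test)        # reuse the fitted transformer
```

With this layout the unseen "Ex" category is encoded as -1 instead of raising an error at transform time.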

#

Thank you for the help @rustic widget !

rustic widget Nov 25, 2024, 9:04 AM

#

tranquil quail there were two errors:- 1. turns out that the ColumnTransformer performs the tra...

I doubt I was of any help. Great Job fixing the bug! Good luck with the competition.

valid plume Dec 17, 2024, 10:15 AM

#

Hi guys, I really can't understand how it's possible that, despite cross-validating my models with 10-fold CV, I still have a huge gap between my local score and the leaderboard score. That happened with the Titanic competition too. Is that normal, or are you seeing better alignment between local and Kaggle scores?

idle jay Dec 29, 2024, 7:05 PM

#

hi, I'm running various models on the london house prices dataset, but I've run into an error with matplotlib I can't seem to fix.

#

im trying to run residual analysis

#

and the first time it works

#

but running the same code the second time doesn't

idle jay Dec 29, 2024, 7:23 PM

#

nevermind figured it out

merry mirage Jan 12, 2025, 9:33 PM

#

I made a (fairly basic) entry to the december playground and placed 1577/2390

then i literally just copied and ran the same code for the house prices competition and placed 355 out of 4910.

are the competitors on the rolling competitions worse than the competitors on the playgrounds or?

#

like i'm not doing feature engineering, hyperparameter tuning, anything, i just one-hot encode the categorics and do one run of xgboost, that's it, it's not a good model

merry mirage Jan 13, 2025, 12:16 PM

#

wait am i doing the wrong one, why are there two

#

i submitted an entry to https://www.kaggle.com/competitions/home-data-for-ml-course/ but are other people submitting to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/

Housing Prices Competition for Kaggle Learn Users

Apply what you learned in the Machine Learning course on Kaggle Learn alongside others in the course.

House Prices - Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting

#

i might submit to that one too

#

what's the difference between them? the scores on the leaderboards are all different

merry mirage Jan 13, 2025, 12:36 PM

#

I submitted the same code to the other one and placed 2837. so I'm guessing that one is more competitive?

lavish hearth Jan 19, 2025, 12:39 AM

#

https://www.kaggle.com/code/amitbarkama/random-forest-regressor-xgbregressor-xgboost

Random Forest Regressor, XGBRegressor, XGBoost

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression Techniques

lavish hearth Jan 24, 2025, 11:38 AM

#

https://www.kaggle.com/code/amitbarkama/ml-part4

Improved score

ML_part4

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression Techniques

merry mirage Jan 26, 2025, 8:27 AM

#

so I've been trying to adapt Ryan Holbrook's notebook (https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) but ran into an odd problem. I don't know if anyone can help with this.

in Ryan Holbrook's notebook, if you print out the MI scores, it shows that OverallQual and Neighborhood are the two most important features.

I changed the imputer to impute the median for the numerical features instead of 0 and the MI score for several features, including OverallQual and Neighborhood, dropped to 0.

Does anyone know why this is? Have I done something wrong?

Feature Engineering for House Prices

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression Techniques

late tusk Feb 20, 2025, 7:27 PM

#

Can someone help me find out what's wrong with my code?
I implemented EDA, feature engineering, and other preprocessing techniques. However, the score got worse compared to my other notebooks. What is wrong with my code? Here is the link: https://www.kaggle.com/code/eidenspark/notebook5a2c152607

Thanks!

desert hollow Mar 5, 2025, 9:32 AM

#

Hi in the housing prices beginners competition, what is the score/predictions that participants should be aiming for? I've just done my first submission and got 21488.82469. Thanks

remote torrent Mar 19, 2025, 4:51 PM

#

Hey, I'm having an issue in this competition. Can someone help me?

#

📎 main.py

#

The first features list got 15509.73 in MAE metric, while features2 list got 21857.15.

#

I should receive more points right!? But I got less....

#

Idk what is happening.

remote torrent Mar 19, 2025, 5:09 PM

#

desert hollow Hi in the housing prices beginners competition, what is the score/predictions th...

As much higher, better. I think they should limit the score until 100. This makes easy to measure your position on leaderboard.

spring heath Apr 6, 2025, 10:11 AM

#

Hello all, subhasish this side

dusky dust Apr 15, 2025, 5:33 PM

#

Hi everyone
What is the best project that you found open source?
I need one to use on my data in my website

jovial coral May 6, 2025, 10:37 PM

#

So, you can just get the house price index data between 2006 and 2010 for Ames, Iowa via FRED, that totally feels like cheating even though it's not 😅

arctic anchor May 7, 2025, 1:56 AM

#

jovial coral So, you can just get the house price index data between 2006 and 2010 for Ames, ...

How is that not cheating?

#

Isn't that target leakage?

jovial coral May 7, 2025, 2:26 AM

#

arctic anchor How is that not cheating?

Rule 10 might be broad enough to make it cheating ig.

I'm just using state-wide data instead for that reason

cloud acorn May 8, 2025, 1:14 PM

#

Hi, I just started my competition journey with the beginner house price prediction competition. I've never practiced on a dataset like this, and there are so many attributes that I'm confused about which columns I should create visualizations for. Need help.

arctic anchor May 8, 2025, 1:50 PM

#

cloud acorn hi i just started my competition journey with the beginner house price predictio...

Group the columns by the nature of the data: categorical (nominal/ordinal), numerical (continuous/discrete), time attributes, and other features. Then maybe start the analysis group by group.
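
A rough sketch of that grouping with pandas, assuming a frame shaped like the competition's train.csv. The toy rows and the name-prefix rule for spotting time attributes are assumptions, and telling nominal from ordinal columns still needs the data description file.

```python
import pandas as pd

# Toy frame with a few columns named like those in the House Prices data.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],  # nominal categorical
    "ExterQual": ["TA", "Gd", "Ex"],                # ordinal categorical
    "GrLivArea": [1710, 1262, 1786],                # continuous numeric
    "FullBath": [2, 2, 1],                          # discrete numeric
    "YrSold": [2008, 2007, 2008],                   # time attribute
})

# Assumption: time attributes can be recognised by their name prefix.
time_cols = [c for c in df.columns if c.startswith(("Yr", "Year", "Mo"))]

# Numeric columns that are not time attributes.
numeric_cols = [
    c for c in df.select_dtypes(include="number").columns if c not in time_cols
]

# Everything non-numeric is treated as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

for name, cols in [("time", time_cols), ("numeric", numeric_cols),
                   ("categorical", categorical_cols)]:
    print(name, cols)
```

Each list can then be explored separately, for example histograms for the numeric group and count plots for the categorical group.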

cloud acorn May 8, 2025, 1:51 PM

#

arctic anchor Group columns on the nature of data. Like categorical(nominal/ordinal), numerica...

thanks bro

fossil coyote May 8, 2025, 6:41 PM

#

how the hell did the leaders get a score of 0.00044

cloud acorn May 9, 2025, 2:06 AM

#

where

cloud acorn May 9, 2025, 2:07 AM

#

fossil coyote how the hell did the leaders get a score of 0.00044

is the score mentioned

arctic anchor May 9, 2025, 3:19 AM

#

fossil coyote how the hell did the leaders get a score of 0.00044

They cheat

jovial coral May 9, 2025, 8:15 AM

#

Tip for anyone else doing this dataset btw: There's a 45% correlation between LotArea and LotFrontage, but a 60% correlation between sqrt(LotArea) and LotFrontage.

I ended up grouping LotFrontage/sqrt(LotArea) ratios based on moderately correlated groups like LotShape, with a minimum number of samples per group, got the median ratios, and used these to impute the 250 or so (however many there were) missing LotFrontage values.

Couldn't find any reason as to why there were so many LotFrontage values missing either even after trying correlations after encoding categories
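
The ratio-based imputation described above could be sketched roughly as follows. This is not jovial coral's actual code: the toy rows, the single grouping column, the MIN_SAMPLES threshold, and the fallback to the global median ratio are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two LotFrontage values are missing.
df = pd.DataFrame({
    "LotShape": ["Reg", "Reg", "Reg", "IR1", "IR1", "IR1"],
    "LotArea": [10000.0, 6400.0, 8100.0, 14400.0, 10000.0, 12100.0],
    "LotFrontage": [80.0, 64.0, np.nan, 60.0, 50.0, np.nan],
})

MIN_SAMPLES = 2  # assumed minimum number of observed ratios per group

# Ratio of frontage to the square root of the lot area (NaN where frontage is missing).
ratio = df["LotFrontage"] / np.sqrt(df["LotArea"])

# Median ratio and number of observed ratios per LotShape group.
group_stats = ratio.groupby(df["LotShape"]).agg(["median", "count"])

# Fall back to the global median ratio for groups with too few samples.
global_median = ratio.median()
group_median = group_stats["median"].where(
    group_stats["count"] >= MIN_SAMPLES, global_median
)

# Estimated frontage = group median ratio * sqrt(LotArea); fill only the gaps.
estimate = df["LotShape"].map(group_median) * np.sqrt(df["LotArea"])
df["LotFrontage"] = df["LotFrontage"].fillna(estimate)
```

To avoid leakage, the group medians would be computed on the training rows only and then applied to the test rows.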

jovial coral May 9, 2025, 8:18 AM

#

remote torrent

data leakage I believe, seems like you're fitting your model based on both your training data, and your testing data?

(Oh that was ages ago I need another coffee ._.)

cloud acorn May 9, 2025, 9:50 AM

#

How am I going to find out which columns to drop and which to keep? After dropping some columns I still have 75 of them. I have tried correlation, but I don't think I can judge what to drop on that basis alone.

jovial coral May 9, 2025, 9:53 AM

#

cloud acorn how i m gonna find out which columns to drop or which to keep cause after droppi...

Best to apply ordinal and one-hot encoding (OHE) first, then drop

#

I've still gotta look for outliers myself but I'll probably drop most via PCA

#

But I wanna get good at sensitivity analysis so non linear considerations are accounted for

jovial coral May 9, 2025, 10:14 AM

#

jovial coral Tip for anyone else doing this dataset btw: There's a 45% correlation between Lo...

Also a 35% correlation between LotFrontage and SalePrice, iirc. Don't be lazy and impute medians for this 😛

jovial coral May 10, 2025, 3:22 AM

#

Well, feel the need to revisit multivariable calc then perform sensitivity analysis before PCA so at least I have an idea of the non-linear correlations going on.

Kind of thought how I wanna go about this with partial derivatives and the Jacobian Matrix

WML 😆 this is gonna be fun

cloud acorn May 10, 2025, 4:51 PM

#

what could be a good avg value for this

#

rmse

fossil wasp May 15, 2025, 4:35 AM

#

Hi, community! This is my first submission on Kaggle. I hope to keep moving forward and will try to improve my score. 😄
This is my current code—I'll document my steps more thoroughly soon, but if you notice anything I can improve, I’d really appreciate your feedback!
https://www.kaggle.com/code/luiz2002/house-pricing-0-13481

house_pricing_0.13481

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression Techniques

steady flume May 18, 2025, 3:27 PM

#

Hello everyone! I am learning on Kaggle by doing the House Prices competition. My current rank is 583. But right now, I am kinda stuck on how to further improve my model. I'd really appreciate any tips, suggestions, or feedback on what I could try next. Thanks so much in advance!

Screenshot_2025-05-18_at_11.25.25_AM.png

naive slate May 19, 2025, 4:12 PM

#

Is there any difference between the Housing Prices Competition for Kaggle Learn Users & the Housing Prices Advanced Regression Techniques or are they just different places to submit predictions with different scoring methods to try and split up people who have only just started and people who have recently started but are digging into more complex things?

steady flume May 20, 2025, 4:15 AM

#

@everyone Hello, is anyone interested in chatting about strategies for this competition? I think it could be helpful for us to share ideas and learn from each other to improve our models. If you're interested, feel free to DM me. I would love to connect!

jovial coral May 20, 2025, 2:38 PM

#

fossil wasp Hi, community! This is my first submission on Kaggle. I hope to keep moving forw...

Nice, my testing has me at about 0.12 presently, I could get it a bit lower maybe.

I'm just using a PyTorch linear regression model, I've transformed the data a lot though

jovial coral May 20, 2025, 5:35 PM

#

fossil wasp Hi, community! This is my first submission on Kaggle. I hope to keep moving forw...

some of the missing values are deliberately missing. Make sure you carefully read the data description text file

jovial coral May 20, 2025, 5:36 PM

#

jovial coral Also 35% correlation to LotFrontage and SalePrice iirc. Don't be lazy and impute...

also sensitivity analysis showed I think a 0.0019 rating, I'd need to double check. I only drop if they're below 0.001

jovial coral May 21, 2025, 8:12 PM

#

https://github.com/HotProtato/Ames-House-Prices-Regression-Model/tree/master

Due to university assignments I wont be able to continue that for a while, there's a lot of mess I need to clean up, including for some reason some variables showing up more than once. There's multiple commits to see my EDA methodology.

However, if I were to use a deep learning model for tabular data, I would now be more systematic. For instance, I would look at the medians and IQRs via boxplots of ordinal categories against the target value to look for relationships. I would also automatically look for concave relationships and, if one is identified, introduce the ordinal values after encoding as x^2, letting the model find the coefficient.

I would then integrate that approach as an imputation strategy. I believe it was MICE? That does something similar, it already imputes via groups with high correlations, but for an imputation strategy I'd first use the first method, treating the variable to be imputed as the temporary target variable, attempting to find relationships automatically, then performing transformations (cloning to keep the originals), to try and find groups that have higher correlations.

An example being, I used sqrt(LotArea) which had a 60% correlation to LotFrontage, compared to 45% normally, allowing me to make a LotFrontageRatio value for imputation, grouped by 2 other ~30 - 35% correlating groups.

I also performed sensitivity analysis to remove many columns. The best score I can seem to get so far, with 80% 20% split is around 0.12 RMSE (the target variable is np.log1p'd). However, I would look to use a 95% 5% split, using stratified subsampling once I'm 100% done.

While I know there's a lot of cleaning to do, let me know if you see any areas of improvement, whether it's convention-wise, or functional.

My next project, I will look for a larger than memory dataset, to deliberately require my navigating the challenges that come with it, using a deliberate local database structure

GitHub

GitHub - HotProtato/Ames-House-Prices-Regression-Model at master

Contribute to HotProtato/Ames-House-Prices-Regression-Model development by creating an account on GitHub.

#

I mathematically represented the model but need to update the details due to hyper parameter optimization (used it to determine number of layers and neurons per layer, as well as activation functions) 🙂

Also, I've had each value set between 0 and 1 (except sine and cosine of MoSold), and deliberately made it so the higher the value, the greater the expected SalePrice, hence weird values like "BuildingNewnessScore" lol. Building age also makes sense only with respect to MoSold and YrSold after all 😛 the aligned scoring in this way allows the model to train more efficiently

Prior to working on my next model, I plan on making a general utility class. Might be interested in working on some projects with others, so hmu. My time is limited for the next 3 weeks, however
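
The sine and cosine of MoSold mentioned above is a cyclical encoding: the month is mapped onto a circle so that December and January end up next to each other. A minimal sketch, assuming months numbered 1 to 12; the output column names are hypothetical and not taken from the linked repo.

```python
import numpy as np
import pandas as pd

# Toy MoSold values (month sold, 1 = January ... 12 = December).
df = pd.DataFrame({"MoSold": [1, 4, 7, 12]})

# Map each month to an angle on the unit circle; January sits at angle 0.
angle = 2 * np.pi * (df["MoSold"] - 1) / 12

df["MoSold_sin"] = np.sin(angle)
df["MoSold_cos"] = np.cos(angle)

print(df.round(3))
```

Unlike the other features described above, these two columns range from -1 to 1 rather than 0 to 1.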

cinder cliff May 28, 2025, 9:48 PM

#

I got a score of 0.12561 using Lasso linear regression. I posted my Jupyter notebook on github at the following link:
https://github.com/gjpelletier/stepAIC/blob/main/Example_kaggle_house_prices.ipynb

GitHub

stepAIC/Example_kaggle_house_prices.ipynb at main · gjpelletier/st...

Stepwise, Lasso, and Ridge linear regression to optimize AIC, BIC, or VIF in Python and Jupyter notebook - gjpelletier/stepAIC

cinder cliff May 31, 2025, 5:13 PM

#

I updated my Jupyter notebook using StackingRegressor with a machine learning ensemble of models. This improved the score to 0.12489 (rank 693). Here is a link to the updated notebook using stacking regression: https://github.com/gjpelletier/stepAIC/blob/main/Example_kaggle_house_prices.ipynb

GitHub

stepAIC/Example_kaggle_house_prices.ipynb at main · gjpelletier/st...

Stepwise, Lasso, Ridge, Elastic Net, and Stacking linear regression to minimize MSE, AIC, BIC, or VIF in Python and Jupyter notebook - gjpelletier/stepAIC

nimble cedar Jun 2, 2025, 11:45 AM

#

cinder cliff I updated my Jupyter notebook using StackingRegressor with a machine learning en...

Nice

jovial coral Jun 4, 2025, 7:56 AM

#

Well, thankfully I've been able to work on this some more, I've gotten it down to 0.12036 for my training/test split, I plan on training on 95% of the data, with a 5% stratified subsampling split to ensure ideal weights when submitting.

Trimming the last of the low contributing features via sensitivity analysis, then moving onto PCA 🤩

nimble cedar Jun 4, 2025, 5:36 PM

#

jovial coral Well, thankfully I've been able to work on this some more, I've gotten it down t...

How are you going to use PCA?

jovial coral Jun 4, 2025, 8:39 PM

#

nimble cedar How are you going to use PCA?

Just in my eda notebook to see if it's worth adding to my pipeline and for what parameters.

Needing to learn how SVD works first, as I know the math behind PCA, just not through this method

jovial coral Jun 5, 2025, 1:29 AM

#

Okay, so I've learnt a lot.

Due to my 80/20 split showing my scores being in the top 1% already, not going to bother with PCA, but when I use it, I will need to learn how it's calculated via SVD so I have an understanding of error margins to expect based on my data.

Beyond this, I've learnt that in research and actual model deployment, the averages or deviations from training matter more than the lowest validation loss value.

Going to make my submission within the next few hours then update my repo, then I'm just gonna clone my project, and focus more on learning how to best extract insights, as well as practicing charting the HPO values by way of performance vs cost, so if I were to hypothetically scale the model, how to make it more efficient.

Thanks to my strong feature transformations that I've done, my neural network isn't large at all 🙂

Learnt a lot in terms of Optuna trials in HPO and when to best cull / early stop trials.

If anyone is curious or knowledgeable when it comes to extracting insights from data (effectively getting as close to causal inference as possible, since it's not feasible with this data for example), reach out 😁

My lowest score is in the 0.119's now, hopefully it'll get lower with my 95 / 5 (stratified) split

cinder cliff Jun 5, 2025, 10:33 PM

#

Notebook with score=0.11988 https://github.com/gjpelletier/EasyMLR/blob/main/Example_kaggle_house_prices.ipynb

GitHub

EasyMLR/Example_kaggle_house_prices.ipynb at main · gjpelletier/Ea...

Tools for Machine Learning Regression in Python. Contribute to gjpelletier/EasyMLR development by creating an account on GitHub.

jovial coral Jun 8, 2025, 3:20 PM

#

Well, I learnt for my model PCA was kinda pointless, and I also learnt that for HPO and my permutation sensitivity analysis (despite n = 30), for such little data, I should have had 3 weights from validation to determine what features should be kept and removed via iterative HPO and sensitivity analysis.

Decided to start working on my own framework instead for now that uses feature engine and scikit-learn within an auditing kind of system, where everything is traceable, and gradually collecting data over the years, maybe to make my own data scientist advisor model in the future, who knows? 🤔

nimble cedar Jun 10, 2025, 2:35 PM

#

jovial coral Well, I learnt for my model PCA was kinda pointless, and I also learnt that for ...

Have you considered a Factor Analysis?

jovial coral Jun 10, 2025, 2:43 PM

#

nimble cedar Have you considered a Factor Analysis?

Not yet!

I've just finished learning the logic and math behind most tree-based models, just working on two frameworks at present, one aimed at explainability, the other aimed at performance.

I want to generalize my imputation strategy by learning C++, writing it as a module in C++ to then use in my framework, and get a score via this strategy, as well as a score for tree-based models to see which, if any, are suitable for imputing missing values.

If there's 30% or more missing data in one column, I'll remove it. Otherwise, I'll perform a test/train split on missing data columns, and use my method, and tree-based models, see which one gets the least amount of loss, then use whichever model/method works best for imputing those missing values.

I'm also making a wrapper function for just about everything that integrates into an audit system, so I can track absolutely everything, maybe visually represent it as well. Who knows? If I capture enough of my interactions, I could even make my own Data Scientist Advisor model, based on my own audits collected over the years xD

I still have other methods of sensitivity analysis to learn like SHAP values/graphs. For this dataset, I realized a little too late that when I was performing HPO, as well as permutation sensitivity analysis, I should have used 3 different training weights, and measuring the averages instead

#

Hopefully by using this approach with these two separate frameworks, they can serve as useful inputs for actuary functions, to make a suitable risk framework in the future

balmy sonnet Jun 10, 2025, 5:28 PM

#

I tried everything I could to debug it. Is there anything else I can do? Any help would be greatly appreciated

nimble cedar Jun 10, 2025, 5:51 PM

#

jovial coral Not yet! I've just finished learning the logic and math behind most tree-based ...

Idk man it feels like over-overkill

jovial coral Jun 10, 2025, 5:53 PM

#

nimble cedar Idk man it feels like over-overkill

I'm not making the frameworks just for this one project lol, they're for my continued use. Why repeat myself every project?

nimble cedar Jun 10, 2025, 7:42 PM

#

jovial coral I'm not making the frameworks just for this one project lol, they're for my cont...

The libraries are already so high level, but you might be right, I'm just a newbie

jovial coral Jun 10, 2025, 8:25 PM

#

nimble cedar The libraries are already so high level, but you might be right, I'm just a newbie

I'll just share a msg file (since it's too large) of what I'm doing in case you're curious

#

📎 message.txt

#

Yeah there's more powerful imputation strategies, like XGBoost and its variants, framework #1 is focused on making sure everything is traceable and explainable

#

I did draw inspiration from how I managed the LotFrontage in this project, but I didn't test the adequacy in the way I should have noted above. My goal is to streamline these kind of processes so I'm not repeating myself

(I should say my variation of the random forest model in the text file, it's literally the same, just the bagging happens per-node instead of per-tree)

jovial coral Jun 11, 2025, 7:24 AM

#

jovial coral I did draw inspiration from how I managed the LotFrontage in this project, but I...

Quite annoying that this model doesn't exist already =.= it's so simple, a decision tree with the typical histogram bins, and using bootstrapping per node, getting the mode per-node, to replicate the benefits of random forests but have this ridiculously more explainable.

Before I decide to make a C++ module for this for efficiency, I suppose I should trial this and benchmark some results to see if it performs as well as expected

Edit: So this wasn't feasible, but there's a potential MoE approach, not for the typical performance optimization, but to literally increase accuracy while remaining interpretable. Performance was basically the same as a regular tree-based model, gonna finish my frameworks then gradually improve adding stuff like this idea. Still going to use a tree to assist in forming MoE candidates

swift moth Jul 5, 2025, 9:56 AM

#

oak peak Jul 22, 2025, 3:24 PM

#

Hi,

I don't understand the difference between my RMSE score and the score on Kaggle. Do you have any idea why?
https://github.com/Jeremy-Duval-PhD/Kaggle_housing_price

GitHub

GitHub - Jeremy-Duval-PhD/Kaggle_housing_price: https://www.kaggle....

https://www.kaggle.com/competitions/home-data-for-ml-course/overview - Jeremy-Duval-PhD/Kaggle_housing_price

jovial coral Jul 24, 2025, 8:18 PM

#

oak peak Hi, I didn't uderstand the difference between my RMSE score and the score in Ka...

Since you're managing outliers by removing them even with IsolationForest, your scores may appear overly optimistic, particularly given MSE is used (as part of RMSE), in that outliers will be particularly valuable unless they're some kind of error. That's the only thing that comes to mind here

oak peak Aug 20, 2025, 4:46 PM

#

jovial coral Since you're managing outliers by removing them even with IsolationForest, your ...

Ok, thanks for your answer. And what do you do with outliers on your side ?

jovial coral Aug 20, 2025, 4:51 PM

#

oak peak Ok, thanks for your answer. And what do you do with outliers on your side ?

I'd keep them unless there's any indication as to the values being a mistake, or too extreme. They're valuable here

oak peak Aug 21, 2025, 5:26 PM

#

jovial coral I'd keep them unless there's any indication as to the values being a mistake, or...

Ok, thank you !

willow crow Sep 18, 2025, 3:55 PM

#

Hello, I recently submitted my model's predictions to this competition and I got: Score: 0.13854

Is this score an RMSLE, RMSE, MSE, MAE, or R2 score?

And is this a good score or a bad score?

Or do you need more information about my notebook?

timid hill Oct 3, 2025, 6:27 AM

#

hey, I'm new to ML, so this is probably a newbie question: I tried to use random forest on the challenge's data, but it did not work so well. My guess is that you can't compare the prices of residential housing and commercial premises. So I split the data by the MSZoning feature and applied the random forest algorithm to each split. For example, the data flagged Commercial by MSZoning has only 10 samples, so I think 81 features for 10 samples is far too many, and, actually, something like 20 of those 81 features have the same value for all 10 samples. So I tried removing these 20 features. But, and this is my question, the result of my random forest is worse when I remove these 20 useless features than with all 81. Is this overfitting? Do you have any idea what causes this odd result?

dark bronze Oct 4, 2025, 8:37 AM

#

willow crow Hello, I'm recently submitted my model's predictions to this competition and i g...

your model is predicted only 13% on the test dataset i suggest you improve your models accuracy by trying different algorithms

dark bronze Oct 4, 2025, 8:39 AM

#

willow crow Hello, I'm recently submitted my model's predictions to this competition and i g...

I think it's the RMSLE score
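
For reference, the House Prices - Advanced Regression Techniques leaderboard scores RMSE between the logarithm of the predicted price and the logarithm of the observed price, so a score such as 0.13854 is an error on the log scale rather than an accuracy percentage, and lower is better. A minimal sketch with made-up prices; the function name `rmse_of_logs` is hypothetical.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) sale prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up example: one prediction 10% too high, one 10% too low.
actual = [200000.0, 150000.0]
predicted = [220000.0, 135000.0]

score = rmse_of_logs(actual, predicted)
print(f"{score:.5f}")
```

Because the error is taken on logs, being off by 10% costs about the same for a cheap house as for an expensive one.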

dark bronze Oct 4, 2025, 8:43 AM

#

timid hill hey, I'm new on ML, so my question is probably a newbie question : I tried to us...

I think your model is not overfitted; it's just the randomness of the random forest. Try to increase the sample size, because with more features the random forest has more scope to split.

jovial coral Oct 9, 2025, 12:39 PM

#

timid hill hey, I'm new on ML, so my question is probably a newbie question : I tried to us...

If you use a sufficiently complex model, it would implicitly learn how reliable residential and commercial locations are with respect to each other.

In a more professional setting, you would also test such assumptions.

ember portal Oct 20, 2025, 4:10 PM

#

Hello, I have a question — is it necessary for me to understand every column in this project? Because there are so many columns.

fallen sandal Oct 21, 2025, 5:19 AM

#

dark bronze your model is predicted only 13% on the test dataset i suggest you improve your ...

Getting 0.13 RMSLE is not bad. I think it's moderate, but you can improve to 0.10-0.11

hollow jolt Oct 24, 2025, 2:59 PM

#

steady flume Hello everyone! I am learning on Kaggle by doing the House Prices competition. M...

hello can you help me with a project

north epoch Nov 9, 2025, 7:25 AM

#

Hii

covert stag Nov 9, 2025, 7:34 AM

#

Hi
I am new I just want to learn and do this

brittle trellis Jan 13, 2026, 5:00 PM

#

Hi

naive cedar Jan 17, 2026, 3:30 PM

#

Hi everyone.

sturdy cobalt Jan 24, 2026, 9:18 AM

#

hi guys

azure robin Jan 28, 2026, 6:55 AM

#

Heyy, tbh im kinda lost trying to do this basic competition rn.. my best entry is 0.14774 as of rn. How do you guys go about feature engineering?

#

I feel like that's what's been messing me up, and I'm not sure how I can add relevant features without just trial and error

random mural Feb 14, 2026, 4:44 PM

#

would you guys appreciate a notebook baseline template which you can easily iterate on?

hexed pivot Mar 5, 2026, 12:58 PM

#

Here is my pipeline overview and notebook:
https://www.kaggle.com/code/rommelsharma/adv-house-price-predictions-with-optuna-tuning

Pipeline Overview
1. Load data & remove outliers
2. Domain-aware ordinal encoding (quality scales)
3. Missing value imputation (train stats only - no leakage)
4. Feature engineering: area, age, quality*size crosses, pool features
5. Drop near-constant columns (with protected set for rare signals)
6. OrdinalEncoder on vocabulary union of train+test
7. Neighbourhood target encoding (5-fold out-of-fold)
8. Pairwise EDA plots vs SalePrice
9. Multi-model 5-fold CV comparison
10. Optuna hyperparameter tuning (9 params, 50 trials)
11. Final XGBoost training with early stopping
12-13. RMSE progression + feature importance charts
14. Save model + preprocessor for reuse
15. Generate submission.csv

I have provided extensive documentation so that it is easy to understand. I hope you can fork the code and get better results. All the best.
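
The out-of-fold target encoding in step 7 could look roughly like the sketch below. It is not taken from the linked notebook: the toy data, the output column name, and the fallback to the global mean are assumptions. Each row is encoded with neighbourhood means computed from the other folds only, so a row's own SalePrice never leaks into its encoded value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "A", "B", "A", "B", "A", "B"],
    "SalePrice": [100.0, 110.0, 200.0, 210.0, 120.0,
                  190.0, 105.0, 205.0, 115.0, 195.0],
})

global_mean = train["SalePrice"].mean()
encoded = pd.Series(np.nan, index=train.index)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    # Neighbourhood means from the other four folds only.
    fold_means = train.iloc[fit_idx].groupby("Neighborhood")["SalePrice"].mean()
    # Encode the held-out fold; unseen neighbourhoods fall back to the global mean.
    encoded.iloc[enc_idx] = (
        train["Neighborhood"].iloc[enc_idx].map(fold_means).fillna(global_mean).to_numpy()
    )

train["Neighborhood_te"] = encoded
```

For the test set, the encoding would typically use means computed on the full training set.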

humble rune Apr 6, 2026, 8:42 PM

#

random mural would you guys appreciate a notebook baseline template which you can easily iter...

sure, it's competition's kernel

humble rune Apr 6, 2026, 8:43 PM

#

azure robin I feel like that's what's been messing up and not sure how I can add relevant fe...

on kernel, you can see inspector part, it could help you with selecting features.