#🏠┊house-prices-advanced-regression-techniques
For the house price competition we recommend beginning with the starter notebook (https://www.kaggle.com/code/gusthema/house-prices-prediction-using-tfdf/notebook). If you copy your own version and spend some time reading and modifying it, you can get some familiarity with the problem.
Does anyone want to team up for this project?
yes
Can anyone lead?
Glad to see you all teaming up! Please post in our https://discord.com/channels/1101210829807956100/1130572338182762657 channel so others have the opportunity to team up as well!
@rough zephyr @thin dust can I be part of your team?
Hey there. I'm going to delete this message as we prefer to keep these channels competition-specific. Any questions for general help should be asked in https://discord.com/channels/1101210829807956100/1129507880379367525.
Hi everyone, I have a rookie question. As my cross-validation score decreases, my RMSE score on Kaggle increases. I can't explain why they move in opposite directions. Anybody?
Hey, I am no expert, but I would assume you are overfitting. A lower cross-validation score suggests your model is performing well on your local dataset. However, a high (or increasing) RMSE on Kaggle shows that it's not generalizing well to new, unseen data. @ripe thunder
Make sure you're properly preprocessing and validating your data to avoid the issue.
Thanks, maybe I can use a validation set approach rather than cv. Will keep in mind the tips.
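A minimal sketch of the overfitting symptom discussed above, on synthetic data rather than the competition files. It assumes scikit-learn is available; the unpruned decision tree is chosen only because it memorises its training rows, so the gap between training error and cross-validated error is easy to see.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a noisy regression problem (not the Kaggle data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# An unpruned tree fits every training row exactly.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
train_rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

# Cross-validation scores the same model on rows it has not seen.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
cv_rmse = float(-scores.mean())

print(f"train RMSE: {train_rmse:.3f}  cross-validated RMSE: {cv_rmse:.3f}")
```

If the error measured on held-out folds is much worse than the error on the training rows, the model is fitting noise, and a leaderboard score computed on unseen houses is likely to look more like the held-out number.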
Large dataset hurts my computer lol
Thoughts on doing PCA --> training on components, instead of directly on features?
I did that. Got good scores. Give it a shot!
Hello - When you analyze data to identify the needs for preprocessing, do you look at both Training and Test datasets and apply the technique (e.g. dropping columns with missing data) that will work for both Training and Test datasets to save time and effort later? Or do you look at the Training dataset only and apply the same steps to the Test dataset? And if there is an issue, then deal with it at that time? The first approach is proactive and the latter approach is reactive.
I am asking because this has happened:
Step 1: Analyzed the Training dataset and identified 3 columns that have missing values
Step 2: Split the Training dataset to Training and Validation sub-datasets
Step 3: Dropped those 3 columns from both sub-sets
Step 4: Defined the model, fit the model with the Training sub-dataset
Step 5: Made predictions using the Validation sub-dataset
Step 6: Determined that the model using the column-dropping approach had a better MAE than the model using imputation
Step 7: I was ready to make predictions using the Test dataset; then found out that the Test dataset had additional columns with missing data. SURPRISE! What am I supposed to do in this situation?
Below is the link to my notebook.
https://www.kaggle.com/code/juliasuzuki/house-price-prediction-advanced-regression/notebook
Thank you so much for your response in advance!
UPDATE: I received enough responses in Kaggle Discussions (Yes, I posted the same question) and my question is resolved. It is generally a good practice to work only with Training dataset during the model development (preprocessing included). After all, the purpose of Test data is to make an inference from new data (i.e. Test data) using the trained model. But to avoid a situation like the one I mentioned above (find a surprise when working with Test data), I could implement a better preprocessing strategy (e.g. imputation instead of dropping columns). Hope you find this helpful!
Great question and glad it was resolved. For posterity, could you post the link to your question in Discussions?
Can you please elaborate "post the link to your question in Discussions" so I can better understand your ask? Thank you. 😀
Yeah no problem. Could you direct us to where the other post is with the responses?
Got it! Below is the link to the Discussions forum in Kaggle where I posted my question.
https://www.kaggle.com/discussions/questions-and-answers/437246
Best Practice on Preprocessing Data | Dropping Columns.
@crystal plover Thanks! I agree with most of the advice. You want to split your data into train and test datasets before applying any preprocessing, such as filling in missing values. You fit the preprocessing only on the training data, then apply the same fitted steps to the test data. You want to treat your test data as completely separate data during the entire model building process and only use it for scoring the model (and iterating through the tuning process to improve it).
I like this explanation posted on StackExchange as well that summarizes these concepts: https://stats.stackexchange.com/a/95088
Incorporating the preprocessing techniques into a pipeline is a convenient way to modularize them and ensure the same techniques are applied to future, unseen data. This ensures that new data receives exactly the same treatment that your model was trained on, which also helps avoid errors if your new data has more or fewer features than your old data (e.g. shape errors: ValueError: Error when checking input: expected input_main_input to have shape (7,) but got array with shape (1,)).
Here's a good example of how to incorporate a pipeline, which preprocesses and splits the data into train/test: https://www.kaggle.com/code/alexisbcook/pipelines
Thank you! Kaggle Learn is great; I like that each module is small and practical so I can apply techniques to my notebook right away. I looked at this module on Pipeline and applied it about 6 months ago. I managed to make my code work, but I don't think I did it correctly. I will try again in this particular notebook using the house price dataset. Hopefully, I will understand it better. 🤓
Thank you for sharing the resource. Got it; split data before preprocessing. 🙏
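A minimal sketch of the "fit on train, reuse on test" idea from this exchange, assuming scikit-learn and pandas. The tiny frames and prices are made up for illustration; only the column names are borrowed from the competition. The point it shows is that an imputer inside a pipeline learns a fill value for every column during `fit`, so a column that is complete in train but has gaps in test is still handled.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data: GarageArea has no missing values here.
train = pd.DataFrame({
    "GrLivArea": [1500.0, 1800.0, np.nan, 2100.0, 1200.0, 1650.0],
    "GarageArea": [400.0, 480.0, 520.0, 600.0, 300.0, 450.0],
})
y_train = np.log1p([180000, 220000, 240000, 300000, 140000, 200000])

# Toy test data: GarageArea is missing only here (the "surprise").
test = pd.DataFrame({
    "GrLivArea": [1700.0, 1400.0],
    "GarageArea": [np.nan, 350.0],
})

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned from train only
    ("model", LinearRegression()),
])
pipe.fit(train, y_train)

# The fitted medians are reused on the test rows; nothing is refit.
preds = np.expm1(pipe.predict(test))
garage_median = pipe.named_steps["impute"].statistics_[1]
print(preds, garage_median)
```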
Hello. In the case of predicting the price of real estate, is there a need to do EDA? If so, I have a small problem: I cannot extract information from the dataset and I get a little lost in the analysis of the latter. I would just like to know if I can find a more suitable method to do EDA on datasets for a regression task. Thank you in advance for your answers
In fact, for my basic EDA, I started by familiarizing myself with the dataset (which has more than 80 columns) by looking at the percentage of missing values, the proportion of each type of variable, and the number of qualitative and quantitative variables. Then I moved on to visualizing target/variable and variable/variable relationships, and this is where things got complicated for me, because I couldn't extract any relevant information from this dataset and I was just going in circles. Hence my question as to whether there is a need to do EDA on this dataset. Thank you in advance for your responses, once again.
You should always do EDA on datasets
They can help reveal patterns that you can use, or give inspiration to features you can engineer for modelling
Wdym by “couldn’t extract any relevant information from the dataset” @dawn dagger
Thank you very much, dear @thick knot . I think I will redo my EDA plan then. For this dataset, can you give me some ideas on how to get started? Sincerely
I typically like to do count plots for categorical variates and histograms for numerical ones
As well as making a correlation matrix
Seeing the distribution of the explanatory variates after taking into account the response is also useful
it’s also important you confirm the distributions of data between the training and test sets are similar, as otherwise training may not work
Thank you so much @thick knot . Personally, I would have done the opposite with the graphs, i.e. count plots for numerical variables and histograms for categorical variables. But on second thought, I find that it makes more sense to use histograms, or more precisely the barplot function of seaborn, to examine the distribution of each numerical variable. For categorical variables, we could try count plots or even catplots to see the categories and the distribution of each category present in these variables. I owe you a big thank you for clarifying my ideas and also for giving me other ideas. I hope to get back to you soon with further questions. Sincerely.
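A small pandas-only sketch of the first EDA passes mentioned in this exchange: missing-value percentages, splitting columns by type, category counts, and correlation with the target. The five-column frame is a made-up stand-in for the real 80-column file, and the plotting calls (histograms, count plots) are left out so the snippet runs without a display.

```python
import pandas as pd

# Toy stand-in for train.csv; column names follow the competition data.
df = pd.DataFrame({
    "OverallQual": [5, 7, 6, 8],
    "GrLivArea": [1200, 1800, 1500, 2400],
    "Alley": [None, "Grvl", None, None],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [140000, 230000, 180000, 320000],
})

# 1. Percentage of missing values per column, worst first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# 2. Split columns by type so each group gets the right kind of plot.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# 3. Category counts (what a count plot would draw).
cat_counts = {col: df[col].value_counts(dropna=False) for col in cat_cols}

# 4. Correlation of each numeric feature with the target.
corr_with_target = (
    df[num_cols].corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
)

print(missing_pct, corr_with_target, sep="\n\n")
```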
hello everyone,
I got a 0.14712 score on my first submission of house price prediction.
link :- https://www.kaggle.com/dinanksoni/house-prices-score-0-14712
Hi everyone! I have a question about the commercial usage of this dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
I work as an editor for a publishing company and we are currently working on a new book around predictive modeling and time series analysis. The author of the book wants to use this dataset to discuss techniques for data exploration and feature engineering. He thinks this dataset would be a good fit to explain these concepts.
Our policy is to use datasets that are open-source and available for commercial use. Or else, we seek permission from the source before we can use it in our books. We didn't find a license for this dataset and the dataset link provided in the Acknowledgements section is broken. So I'm just wondering if we can use this dataset in our book? I would love to get a response from one of the Kaggle staff.
I apologize if this is the wrong place to post this. It would be great if someone could point me in the right direction. I'd be happy to get in touch with someone from Kaggle (over email) to keep it official. I couldn't find a contact on the website hence the post here.
Cheers!
Predict sales prices and practice feature engineering, RFs, and gradient boosting
I have been having a problem with implementing a pipeline for 2 functions namely, an imputer and a dataset splitter. I get that I could bypass this step , but since I am learning I really want to know how to create and use pipelines.
any help would be much appreciated
So close to top 500
gg
Just following up on this one. Any pointers would be really appreciated!
It's so tough
My models can't even hit 10% accuracy
If I put too much load on it, it just crashes the Colab notebook
I have a question. Suppose we compute the Pearson correlation between every feature and the target and find that some features have a correlation with the target of around 0 (between -0.05 and 0.05). Should we include these features in the model?
Using a quick and dirty random forest, we can find the feature importances. Should the features with the least importance be included in the final model or not?
U should follow the recent Kaggle kernels from Kaggle
The ones which are highly voted
I have a question. The image shows the Pearson correlation between the target and all other predictors. There are some predictors with which the target is very weakly correlated (correlation between -0.05 and 0.05). Should we include these features in the model? In my opinion these features should not be included, since a very weak correlation means a change in the predictor will not be reflected in the target, and hence the two are independent of each other. Am I correct, and what should be done?
The Pearson correlation coefficient only accounts for linear relationships; try method="spearman". And yes, RF for feature selection can be a good choice.
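A short sketch on synthetic numbers (not the competition data) of why a near-zero correlation does not by itself mean a feature is independent of the target. Spearman is computed here as Pearson on ranks, which is its definition, so only pandas and NumPy are assumed.

```python
import numpy as np
import pandas as pd

# Case 1: monotonic but strongly non-linear relationship.
x = pd.Series(np.arange(1, 21, dtype=float))
y_mono = np.exp(x / 2)
pearson_mono = x.corr(y_mono)                  # understates the relationship
spearman_mono = x.rank().corr(y_mono.rank())   # rank-based: sees perfect order

# Case 2: the target depends on x completely, but symmetrically.
x_sym = pd.Series(np.arange(-10, 11, dtype=float))
y_sym = x_sym ** 2
pearson_sym = x_sym.corr(y_sym)                # close to 0
spearman_sym = x_sym.rank().corr(y_sym.rank()) # also close to 0

print(f"monotonic: pearson={pearson_mono:.2f}, spearman={spearman_mono:.2f}")
print(f"symmetric: pearson={pearson_sym:.2f}, spearman={spearman_sym:.2f}")
```

In the second case both coefficients are near zero even though the target is a deterministic function of the feature, which is why tree-based importances or mutual information are often checked before dropping a column.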
My YouTube vid on this will be up next week. Been recording all day long
I have a question regarding cross-validation using GridSearch: is there any point in a train/validation split, since the grid search doesn't use a validation dataset to get the best score of the algorithm?
For anyone who wants to watch a youtube video on the series: https://www.youtube.com/watch?v=UqmulHG4IvY&t=1s&ab_channel=RyanNolanData
Welcome to our latest data science project! In this exciting YouTube tutorial, we'll dive into the world of advanced regression analysis using Kaggle's House Prices dataset. When working on the project, the code was able to achieve a top 10% score!
Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/kaggle-housing-youtube-video
Email: ryan...
Are you using the GridSearch of sklearn? The one in sklearn is called GridSearchCV; the CV means it is using cross-validation.
Yes, it is. So basically, it splits the data I pass in further for cross-validation and picks the best parameters from that. Then I will use the validation dataset that I split off earlier to test the score. Once I am satisfied with the result, I can apply it to the competition's test dataset for submission, right?
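A minimal sketch of the workflow described in this exchange, assuming scikit-learn and using synthetic data: a hold-out set is split off first, `GridSearchCV` cross-validates only on the remaining rows, and the hold-out score acts as a final check. The `Ridge` model and the `alpha` grid are placeholders, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the prepared training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=300)

# Split off a hold-out set before any tuning happens.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# GridSearchCV makes its own 5-fold splits inside X_dev.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_dev, y_dev)

cv_rmse = -search.best_score_
# The best model is refit on all of X_dev and scored on rows it never saw.
hold_rmse = float(np.sqrt(np.mean((search.predict(X_hold) - y_hold) ** 2)))

print(search.best_params_, round(cv_rmse, 3), round(hold_rmse, 3))
```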
Hey everyone, I got an RMSE score of 0.4 using this notebook. Quick question: how can I improve it?
https://www.kaggle.com/code/ayeshairshadcoder/house-price-prediction-competition/
@fresh sequoia
@steel flower
Hello everyone, I just completed my notebook; link here: https://www.kaggle.com/code/nishchay331/n6-house-prices-advanced-regression-techniques. I'd be glad if you took some time to go through it and point out my mistakes, as I am a beginner. Suggestions are highly appreciated. Thank you.
Hi
Is there any way I can use deep learning/ neural nets on this data set
Am new to deep learning
yes but probably better to use a prebuilt regression
I mostly just want to learn deep learning/ neural nets
Do u have any idea how I could apply it?
I have done a decent bit of regression before in another project
why is this happening?
@past tide Great Work!…
wow ..
looking for a team anyone interested?
me too
oki
@toxic tide Did you figure out why?
Hey, I am new to ML and learning it by watching YT videos and MOOCs. I am looking for a mentor/guide/buddy who can share their experience with me and help me become a better ML practitioner.
Hey is 0.14 a good score to start with? Like it's my first submission but I would change a lot of things actually. Is it that good!?
maybe im the epoch?
if anyone wants a vid to follow, I made a 3 hour one on this project: https://www.youtube.com/watch?v=UqmulHG4IvY
Interested ...
Hello, I am a student just starting in ML and I am struggling with this competition
Looking for teammates, a buddy, a senior, or anyone else to share experiences with
Notebook links attached to sub headings of the competition description are not working.
In the house price dataset, how do you deal with columns where a single value accounts for a very high share of the rows, say more than 90%?
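One possible way to at least find such columns, sketched with pandas on a made-up frame. The 0.90 threshold comes from the question; whether to drop the flagged columns, merge rare values, or keep them is a separate modelling decision this sketch does not make.

```python
import pandas as pd

# Toy frame: Street is 95% one value, Utilities is constant, LotShape is mixed.
df = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],
    "Utilities": ["AllPub"] * 20,
    "LotShape": ["Reg"] * 12 + ["IR1"] * 8,
})

# Share of rows taken by the most frequent value in each column
# (NaN is counted as a value so mostly-missing columns are flagged too).
top_share = df.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

dominated = top_share[top_share > 0.90].index.tolist()
print(top_share)
print("dominated by one value:", dominated)
```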
I used random forest to look at feature importance and this is what i have got
cols imp
47 OverallQual 0.765166
59 GrLivArea 0.146761
56 1stFlrSF 0.023267
55 TotalBsmtSF 0.023234
34 GarageFinish 0.015694
70 GarageArea 0.011260
67 Fireplaces 0.008944
52 BsmtFinSF1 0.005675
6 LotConfig 0.000000
8 Neighborhood 0.000000
1 Street 0.000000
2 Alley 0.000000
3 LotShape 0.000000
4 LandContour 0.000000
5 Utilities 0.000000
Hello, I'm trying to start the home price prediction project. But noticed I keep getting issues or errors with my tensorflow. I'm trying to print or describe the csv that I currently loaded. I'm running the code on vs code on windows 11. Any help will be great
anyone interested in doing this project together
yeah why not
I am down! Hope it is not too late
I am just starting this competition. Looking for teammates. Please let me know if you're interested in working together
Hi guys, the result after doing the submission is the MSE? Meaning the one that we are ranked upon
Close: it's the RMSE of log(SalePrice)
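For reference, a plain-Python sketch of that metric: the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price. The function name and the three example prices are made up for illustration. Predictions go in as ordinary prices; the logarithm is taken inside the metric.

```python
import math

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(actual price)."""
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]  # plain prices, not logs

score = rmse_of_logs(actual, predicted)
print(round(score, 5))
```

Because the error is measured on logs, being 10% off on a cheap house costs the same as being 10% off on an expensive one. If a model is trained on a log-transformed target, its outputs need to be converted back to prices (for example with `np.expm1` after `np.log1p`) before they are written to the submission file.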
you can see correlation between feature and the target feature (by corr() )
Is anyone interested in collaborating as a team? Feel free to DM.
Hi --- I'm a graduate student studying Data Science, and I'd love to join this challenge. I won't mess up the feature engineering step by trying to write the pipeline the OOP way, I promise lol. Also, if you could help with my question under the Titanic channel, that would be great 🙂
No worries, I'm no expert in feature engineering either, so we'll learn through our mistakes lol. Sure, I'll take a look at your question and see if I can help. 🙂
I am interested too unless you are already too many!
We’re nearly finished with the project and are currently performing hyperparameter tuning on the models. If you’re interested in working on a different dataset/project related to ML, feel free to DM me.
I'm planning on improving my code for this project. My previous code was very messy and I feel like I overdid a lot of stuffs on my pipelining. We can collaborate on improving our code (especially mine lol) if you want to.
Sure, we can discuss this in detail on my server. Also, feel free to check out my updated Notebook on House Prices.
https://www.kaggle.com/code/shahriarrahman10/house-price-prediction-using-gbr
I got score in this competition, is ok or is really bad?
anyone interested in doing this project together?
How are you guys encoding the categorical variables? One-hot encoding would balloon the dataframe to 100s of columns. Is there a more efficient method?
I've done the same as you and I also have 100 columns
but I delete the columns that are below 0.05 correlation with SalePrice
Yeah man, this is a common issue with categorical variables and one-hot encoding. In addition to @junior forge's answer, and to the best of my knowledge, you can just ignore categorical variables with high cardinality. However, these days I am looking into some NLP solutions. I am not sure if techniques like word embeddings may be a good alternative (if someone already has the answer, do not hesitate to tell me).
How does word embedding work?
Hello, I'm a bit confused. Can anyone explain the submission format? Am I supposed to find np.log(y_predict) or rmse(np.log(y_true), np.log(y_predict))?
Basically they transform a word (say the word "house") to a vector of length n
In addition to representing words as vectors, word embeddings try to place synonyms and similar words close to each other in the vector space.
Therefore, I am wondering if using word embeddings with categorical variables may help. I don't know, I am still searching...
Hello, I was working on the housing regression competition and I am fairly new. I was wondering: for a column such as Street, which has a range of string values such as gravel, pavement..., how should I encode it?
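A minimal sketch of two common encodings for the questions in this thread, assuming scikit-learn. Unordered columns such as `Street` get one-hot columns, while quality-style columns with a natural order get a single integer column, which keeps the frame from widening. The four-row frame is made up, and the grade order `Po < Fa < TA < Gd < Ex` is an assumption to check against the data description file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],   # no natural order
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],        # ordered grades
})

# Assumed grade order, worst to best.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

encoder = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
        ("ordinal", OrdinalEncoder(categories=[quality_order]), ["ExterQual"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

encoded = encoder.fit_transform(df)
# Columns: Street_Grvl, Street_Pave, ExterQual (0 = Po ... 4 = Ex)
print(encoded)
```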
hello, i am an absolute beginner in machine learning. i just learned about linear regression and error metrics and wanted to get my hands on a small project using the techniques i learnt. So i started with the famous boston-housing-prices dataset on kaggle and would appreciate if you could take a look at my code: https://www.kaggle.com/code/khalidhelmy55/boston-housing-prices and guide me on what is missing or what could be better done..
according to the metrics i calculated the model is not performing good.
hello everyone beginner here, so I was doing the intermediate ML course and in the categorical variables exercise when I was performing one hot encoding it went smoothly but now the DataFrame had missing values. what is the most appropriate course of action? should I just fill in the missing values or something else?
as you can see in the screenshot, the previous step was correct
Did you get it figured out? I would suggest running all the cells over as you might have ran them out of order and introduced missing values.
Thank you Tom for your response. Actually later on in the tutorial the values were imputed, so the problem was solved.
so, I was thinking of applying what I learned to the competition to see if the accuracy can be improved. Now I have a problem which I am not able to figure out no matter what I do. This is what I want to do:
- select categorical columns and apply SimpleImputer to them with a constant value.
- select numerical columns and apply SimpleImputer with strategy="mean"
- apply ordinal encoding to certain columns and one-hot to others
- finally I want to train an XGBRegressor model.
but every time a new error is being thrown.
this is the pipeline I am using
i am imputing the values but still this error is showing up
I am thinking that the imputation is not being applied to the testing data.
Well this is a common error.
It occurs because you need to first convert the categorical data into numerical data first before imputing it , because SimpleImputer works only with numerical data.
Since you are first trying to impute the categorical data , simple imputer doesn't understand 'nan' (which is a string).
but in another approach I tried, and I Imputed the values with strategy = "most_frequent" and it worked.
screenshot attached below:-
the only difference is the strategy.
the dataset is the same
Idk, never tried imputing with "most frequent" strategy , I mostly use "median" and at times "mean" ,
I think it's because you used "most frequent" , like in string or categorical data the most frequent category will fill up the data.
But for "mean", you can't really take the "mean", "median" or any such strategy on strings, since those are statistics for numbers.
but I used the strategy ="constant" which fills the missing values with the fill_value parameter's value
and the error doesn't occur while fitting the transformer. The error happens when I use the transform method on other data
what I suspect is that when I am calling the transform method the imputation is not happening for some reason
Ahh.. I understand now. I faced a similar problem when I made a column transformer using sklearn. I made one using the columns in the train set which also happen to be in the test set, but the transformer didn't work well with both, only with the training set, so I created another column transformer for the test set separately. Alas, I'm not familiar with the solution.
but won't there be data leakage if we do that?
do what exactly transformer ?
fitting another transformer on the testing data?
and the features may be inconsistent also?
I'm not quite sure , Well I used the same code as the first transformer only difference was the columns.
I extracted the column names from test data and it works.
I always use skimpy.skim(data) or data.info() or data.describe() after any step in preprocessing.
When I used the same transformer I found from skim(data) that for testing data the conversion was incomplete, that's why I made the 2nd transformer only changing it to test.columns instead of train.columns and it worked since I got a similar description using skim(test).
turns out the error was being thrown by OrdinalEncoder, I made a few changes and the error was resolved.
Nice! What was the error though ?
there were two errors:-
- turns out that the ColumnTransformer applies each of its transformers to the original columns independently (in parallel, not one after another), so imputations were being skipped because the same columns were listed for both imputation and encoding.
- The OrdinalEncoder was not able to handle unknown values because (my mistake) I did not specify how to handle them.
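A sketch of one way to address both points, assuming scikit-learn; it is not the poster's actual code, and the toy frames only borrow column names from the competition. Each column group gets its own `Pipeline` (impute, then encode) so no column is listed in two parallel branches, and the `OrdinalEncoder` is told what to do with categories it did not see during fitting. A regressor such as the XGBRegressor mentioned earlier could be appended as a final pipeline step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "KitchenQual": ["Gd", "TA", np.nan, "Ex"],
    "Street": ["Pave", "Pave", np.nan, "Grvl"],
})
test = pd.DataFrame({
    "LotArea": [np.nan, 7000.0],
    "KitchenQual": ["Po", "Gd"],   # "Po" never appears in train
    "Street": [np.nan, "Pave"],
})

# Within a branch the steps run in sequence: impute first, then encode.
numeric = Pipeline([("impute", SimpleImputer(strategy="mean"))])
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Each column appears in exactly one branch.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["LotArea"]),
        ("ord", ordinal, ["KitchenQual"]),
        ("nom", nominal, ["Street"]),
    ],
    sparse_threshold=0,
)

X_train = preprocess.fit_transform(train)  # statistics learned here only
X_test = preprocess.transform(test)        # same fitted steps reused
print(X_test)
```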
Thank you for the help @rustic widget !
I doubt I was of any help. Great Job fixing the bug! Good luck with the competition.
Hi guys, I really cannot understand how it is possible that, despite cross-validating my models with 10 folds, I still have a huge gap between my local score and the leaderboard score. That happened with the Titanic competition too. Is that normal, or are you seeing better alignment between local and Kaggle scores?
Hi, I'm running various models on the London house prices dataset, but I've run into an error with matplotlib I can't seem to fix.
I'm trying to run residual analysis
and the first time it works
but running the same code the second time doesn't
nevermind figured it out
I made a (fairly basic) entry to the december playground and placed 1577/2390
then i literally just copied and ran the same code for the house prices competition and placed 355 out of 4910.
are the competitors on the rolling competitions worse than the competitors on the playgrounds or?
like i'm not doing feature engineering, hyperparameter tuning, anything, i just one-hot encode the categorics and do one run of xgboost, that's it, it's not a good model
wait am i doing the wrong one, why are there two
i submitted an entry to https://www.kaggle.com/competitions/home-data-for-ml-course/ but are other people submitting to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/
i might submit to that one too
what's the difference between them? the scores on the leaderboards are all different
I submitted the same code to the other one and placed 2837. so I'm guessing that one is more competitive?
https://www.kaggle.com/code/amitbarkama/ml-part4
Improved score
so I've been trying to adapt Ryan Holbrook's notebook (https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) but ran into an odd problem. I don't know if anyone can help with this.
in Ryan Holbrook's notebook, if you print out the MI scores, it shows that OverallQual and Neighborhood are the two most important features.
I changed the imputer to impute the median for the numerical features instead of 0 and the MI score for several features, including OverallQual and Neighborhood, dropped to 0.
Does anyone know why this is? Have I done something wrong?
Can someone help me find out what wrong with my code?
I implemented EDA, Feature Engineering, and other preprocessing techniques. However, the score lowered compared to my other codes. What is wrong with my code? Here is the link: https://www.kaggle.com/code/eidenspark/notebook5a2c152607
Thanks!
Hi in the housing prices beginners competition, what is the score/predictions that participants should be aiming for? I've just done my first submission and got 21488.82469. Thanks
Hey, I'm having a issue in this competition. Can someone help me?
The first features list got 15509.73 in MAE metric, while features2 list got 21857.15.
I should receive more points right!? But I got less....
Idk what is happening.
As much higher, better. I think they should limit the score until 100. This makes easy to measure your position on leaderboard.
Hello all, subhasish this side
Hi everyone
What is the best project that you found open source?
I need one to use on my data in my website
So, you can just get the house price index data between 2006 and 2010 for Ames, Iowa via FRED, that totally feels like cheating even though it's not 😅
How is that not cheating?
Isn't that target leakage?
Rule 10 might be broad enough to make it cheating ig.
I'm just using state-wide data instead for that reason
Hi, I just started my competition journey with the beginner house price prediction. I have never practiced on such a dataset; there are so many attributes that I'm confused about which columns I should create data visualizations for. Need help.
Group columns on the nature of data. Like categorical(nominal/ordinal), numerical(continuous/discrete), time attributes and other features. Then group wise maybe start the analysis
thanks bro
how the hell did the leaders get a score of 0.00044
where
is the score mentioned
They cheat
Tip for anyone else doing this dataset btw: There's a 45% correlation between LotArea and LotFrontage, but a 60% correlation between sqrt(LotArea) and LotFrontage.
I ended up grouping LotFrontage/sqrt(LotArea) ratios based on moderately correlated groups like LotShape, with a minimum number of samples per group, got the median ratios, and used these to impute the 250 or so (however many there were) missing LotFrontage values.
Couldn't find any reason as to why there were so many LotFrontage values missing either even after trying correlations after encoding categories
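A rough pandas sketch of the ratio-based imputation described in this tip, on made-up rows; it is a reconstruction under assumptions, not the poster's code. The grouping column (`LotShape`), the `min_samples` cutoff, and the fallback to the overall median are illustrative choices. On real data the medians would be computed from training rows and reused on the test set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8100.0, 10000.0, 6400.0, 12100.0, 9025.0, 14400.0],
    "LotShape": ["Reg", "Reg", "Reg", "IR1", "IR1", "IR1"],
    "LotFrontage": [72.0, 80.0, np.nan, 66.0, 57.0, np.nan],
})

# Frontage is a length and area is length squared, so compare like with like.
root_area = np.sqrt(df["LotArea"])
ratio = df["LotFrontage"] / root_area

# Median ratio per group, using only rows where frontage is known.
group_median = ratio.groupby(df["LotShape"]).transform("median")
group_count = ratio.groupby(df["LotShape"]).transform("count")

# Hypothetical safeguard: small groups fall back to the overall median ratio.
min_samples = 2
fill_ratio = group_median.where(group_count >= min_samples, ratio.median())

# Only the missing frontages are filled; observed values are untouched.
df["LotFrontage"] = df["LotFrontage"].fillna(fill_ratio * root_area)
print(df)
```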
data leakage I believe, seems like you're fitting your model based on both your training data, and your testing data?
(Oh that was ages ago I need another coffee ._.)
How am I going to find out which columns to drop and which to keep? After dropping some columns I still have 75 of them. I have tried correlation, but I don't think I can judge what to drop on that basis alone.
Best off to use ordinal and OHE first then drop
I've still gotta look for outliers myself but I'll probably drop most via PCA
But I wanna get good at sensitivity analysis so non linear considerations are accounted for
Also 35% correlation to LotFrontage and SalePrice iirc. Don't be lazy and impute medians for this 😛
Well, feel the need to revisit multivariable calc then perform sensitivity analysis before PCA so at least I have an idea of the non-linear correlations going on.
Kind of thought how I wanna go about this with partial derivatives and the Jacobian Matrix
WML 😆 this is gonna be fun
Hi, community! This is my first submission on Kaggle. I hope to keep moving forward and will try to improve my score. 😄
This is my current code—I'll document my steps more thoroughly soon, but if you notice anything I can improve, I’d really appreciate your feedback!
https://www.kaggle.com/code/luiz2002/house-pricing-0-13481
Hello everyone! I am learning on Kaggle by doing the House Prices competition. My current rank is 583. But right now, I am kinda stuck on how to further improve my model. I'd really appreciate any tips, suggestions, or feedback on what I could try next. Thanks so much in advance!
Is there any difference between the Housing Prices Competition for Kaggle Learn Users & the Housing Prices Advanced Regression Techniques or are they just different places to submit predictions with different scoring methods to try and split up people who have only just started and people who have recently started but are digging into more complex things?
@everyone Hello, is anyone interested in chatting about strategies for this competition? I think it could be helpful for us to share ideas and learn from each other to improve our models. If you're interested, feel free to DM me. I would love to connect!
Nice, my testing is at about 0.12 at present; I could maybe get it a bit lower.
I'm just using a PyTorch linear regression model, I've transformed the data a lot though
some of the missing values are deliberately missing. Make sure you carefully read the data description text file
also sensitivity analysis showed I think a 0.0019 rating, I'd need to double check. I only drop if they're below 0.001
https://github.com/HotProtato/Ames-House-Prices-Regression-Model/tree/master
Due to university assignments I won't be able to continue that for a while. There's a lot of mess I need to clean up, including some variables showing up more than once for some reason. There are multiple commits showing my EDA methodology.
However, if I were to use a deep learning model for tabular data, I would now be more systematic. For instance, I would observe the medians and IQR by way of boxplots for ordinal categories against the target value to look for relationships. I would also automatically look for concave relationships and, if any are identified, introduce the ordinal values after encoding as x^2, letting the model find the coefficient.
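A small NumPy illustration of the squared-term idea, on synthetic numbers: when the relationship between an encoded ordinal grade and the target flattens out, adding the grade squared as an extra column lets a linear fit recover the curvature. The coefficients 10, 4 and -0.5 are invented for the example, and plain least squares stands in for whatever model is actually used.

```python
import numpy as np

# Encoded ordinal grade (1 = worst, 5 = best), two houses per grade.
quality = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)

# Synthetic concave target: gains shrink and reverse at the top grades.
target = 10.0 + 4.0 * quality - 0.5 * quality**2

ones = np.ones_like(quality)
X_linear = np.column_stack([ones, quality])
X_quad = np.column_stack([ones, quality, quality**2])  # adds the x^2 column

coef_linear, *_ = np.linalg.lstsq(X_linear, target, rcond=None)
coef_quad, *_ = np.linalg.lstsq(X_quad, target, rcond=None)

def sse(X, coef):
    """Sum of squared errors of the fitted line or curve."""
    return float(np.sum((X @ coef - target) ** 2))

sse_linear = sse(X_linear, coef_linear)
sse_quad = sse(X_quad, coef_quad)
print(coef_quad, sse_linear, sse_quad)
```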
I would then integrate that approach as an imputation strategy. I believe it was MICE? That does something similar, it already imputes via groups with high correlations, but for an imputation strategy I'd first use the first method, treating the variable to be imputed as the temporary target variable, attempting to find relationships automatically, then performing transformations (cloning to keep the originals), to try and find groups that have higher correlations.
An example being, I used sqrt(LotArea) which had a 60% correlation to LotFrontage, compared to 45% normally, allowing me to make a LotFrontageRatio value for imputation, grouped by 2 other ~30 - 35% correlating groups.
I also performed sensitivity analysis to remove many columns. The best score I can seem to get so far, with 80% 20% split is around 0.12 RMSE (the target variable is np.log1p'd). However, I would look to use a 95% 5% split, using stratified subsampling once I'm 100% done.
While I know there's a lot of cleaning to do, let me know if you see any areas of improvement, whether it's convention-wise, or functional.
My next project, I will look for a larger than memory dataset, to deliberately require my navigating the challenges that come with it, using a deliberate local database structure
I mathematically represented the model but need to update the details due to hyperparameter optimization (used it to determine number of layers and neurons per layer, as well as activation functions) 🙂
Also, I've had each value set between 0 and 1 (except sine and cosine of MoSold), and deliberately made it so the higher the value, the greater the expected SalePrice, hence weird values like "BuildingNewnessScore" lol. Building age also makes sense only with respect to MoSold and YrSold after all 😛 the aligned scoring in this way allows the model to train more efficiently
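A short sketch of the two ideas in that message: a sine/cosine encoding for MoSold and a 0-to-1 score where higher means a higher expected SalePrice. The exact definition of "BuildingNewnessScore" isn't given above, so `newness_score` and its `max_age` parameter are hypothetical stand-ins.

```python
import math


def encode_month(month):
    """Map month 1..12 onto the unit circle so December sits next to January."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def newness_score(year_built, year_sold, max_age):
    """Hypothetical 'higher is better' score in [0, 1]; 1.0 means brand new at sale."""
    age = min(max(year_sold - year_built, 0), max_age)
    return 1.0 - age / max_age


jan = encode_month(1)
dec = encode_month(12)
score = newness_score(year_built=1958, year_sold=2008, max_age=100)
```

With the raw month number, December (12) and January (1) look far apart; on the circle they are as close as any other pair of neighbouring months.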
Prior to working on my next model, I plan on making a general utility class. Might be interested in working on some projects with others, so hmu. My time is limited for the next 3 weeks, however
I got a score of 0.12561 using Lasso linear regression. I posted my Jupyter notebook on github at the following link:
https://github.com/gjpelletier/stepAIC/blob/main/Example_kaggle_house_prices.ipynb
I updated my Jupyter notebook using StackingRegressor with a machine learning ensemble of models. This improved the score to 0.12489 (rank 693). Here is a link to the updated notebook using stacking regression: https://github.com/gjpelletier/stepAIC/blob/main/Example_kaggle_house_prices.ipynb
Nice
Well, thankfully I've been able to work on this some more, I've gotten it down to 0.12036 for my training/test split, I plan on training on 95% of the data, with a 5% stratified subsampling split to ensure ideal weights when submitting.
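One way a stratified 95/5 split on a continuous target could be sketched: sort by the target, cut into equal-sized bins, and hold out a share of each bin. The bin count, seed, and function name are assumptions, not what the author actually used.

```python
import math
import random


def stratified_holdout(targets, holdout_frac=0.05, n_bins=5, seed=0):
    """Return (train_idx, holdout_idx), holding out roughly holdout_frac of each target bin."""
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    bin_size = math.ceil(len(order) / n_bins)

    holdout = []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        k = max(1, round(len(bin_idx) * holdout_frac))  # at least one row per bin
        holdout.extend(rng.sample(bin_idx, k))

    holdout_set = set(holdout)
    train = [i for i in range(len(targets)) if i not in holdout_set]
    return train, sorted(holdout)


# Toy target: 100 evenly spaced values standing in for log prices.
targets = [float(i) for i in range(100)]
train_idx, holdout_idx = stratified_holdout(targets)
```

With only 5% held out, a plain random split can easily miss the cheapest or most expensive houses; binning first keeps every price range represented.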
Trimming the last of the low contributing features via sensitivity analysis, then moving onto PCA 🤩
How are you going to use PCA?
Just in my eda notebook to see if it's worth adding to my pipeline and for what parameters.
Needing to learn how SVD works first, as I know the math behind PCA, just not through this method
Okay, so I've learnt a lot.
Due to my 80/20 split showing my scores being in the top 1% already, not going to bother with PCA, but when I use it, I will need to learn how it's calculated via SVD so I have an understanding of error margins to expect based on my data.
Beyond this, I've learnt that in research and actual model deployment, the averages or deviations from training matter more than the lowest validation loss value.
Going to make my submission within the next few hours then update my repo, then I'm just gonna clone my project, and focus more on learning how to best extract insights, as well as practicing charting the HPO values by way of performance vs cost, so if I were to hypothetically scale the model, how to make it more efficient.
Thanks to my strong feature transformations that I've done, my neural network isn't large at all 🙂
Learnt a lot in terms of Optuna trials in HPO and when to best cull / early stop trials.
If anyone is curious or knowledgeable when it comes to extracting insights from data (effectively getting as close to causal inference as possible, since it's not feasible with this data for example), reach out 😁
My lowest score is in the 0.119's now, hopefully it'll get lower with my 95 / 5 (stratified) split
Notebook with score=0.11988 https://github.com/gjpelletier/EasyMLR/blob/main/Example_kaggle_house_prices.ipynb
Well, I learnt that for my model PCA was kinda pointless. I also learnt that for HPO and my permutation sensitivity analysis (despite n = 30), with so little data, I should have used 3 sets of trained weights from validation to determine which features to keep and remove via iterative HPO and sensitivity analysis.
Decided to start working on my own framework instead for now that uses feature engine and scikit-learn within an auditing kind of system, where everything is traceable, and gradually collecting data over the years, maybe to make my own data scientist advisor model in the future, who knows? 🤔
Have you considered factor analysis?
Not yet!
I've just finished learning the logic and math behind most tree-based models, just working on two frameworks at present, one aimed at explainability, the other aimed at performance.
I want to generalize my imputation strategy by learning C++, writing it as a module in C++ to then use in my framework, and get a score via this strategy, as well as a score for tree-based models to see which, if any, are suitable for imputing missing values.
If there's 30% or more missing data in one column, I'll remove it. Otherwise, I'll perform a test/train split on missing data columns, and use my method, and tree-based models, see which one gets the least amount of loss, then use whichever model/method works best for imputing those missing values.
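A minimal sketch of that plan: drop a column at 30% or more missing, otherwise hide some known values and see which imputer recovers them best. Mean and median are simple stand-ins here for "my method" and the tree-based models, and the function names are made up.

```python
import random
from statistics import mean, median


def missing_fraction(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def score_imputer(column, imputer, test_frac=0.2, seed=0):
    """Hide a share of the known values, impute them, and return the mean absolute error."""
    known = [v for v in column if v is not None]
    rng = random.Random(seed)
    rng.shuffle(known)
    n_test = max(1, int(len(known) * test_frac))
    held_out, visible = known[:n_test], known[n_test:]
    guess = imputer(visible)  # a univariate imputer returns a single fill value
    return mean([abs(v - guess) for v in held_out])


def choose_strategy(column, imputers, drop_threshold=0.30):
    """Return 'drop' or the name of the imputer with the lowest held-out error."""
    if missing_fraction(column) >= drop_threshold:
        return "drop"
    scores = {name: score_imputer(column, fn) for name, fn in imputers.items()}
    return min(scores, key=scores.get)


imputers = {"mean": mean, "median": median}
mostly_missing = [1.0, None, None, 4.0]
decision = choose_strategy(mostly_missing, imputers)
```

A model-based imputer would take the other columns as input as well, but the evaluation loop (mask, impute, compare) stays the same.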
I'm also making a wrapper function for just about everything that integrates into an audit system, so I can track absolutely everything, maybe visually represent it as well. Who knows? If I capture enough of my interactions, I could even make my own Data Scientist Advisor model, based on my own audits collected over the years xD
I still have other methods of sensitivity analysis to learn, like SHAP values/graphs. For this dataset, I realized a little too late that when I was performing HPO, as well as permutation sensitivity analysis, I should have used 3 different sets of trained weights and measured the averages instead.
Hopefully by using this approach with these two separate frameworks, they can serve as useful inputs for actuarial functions, to make a suitable risk framework in the future
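A sketch of permutation sensitivity analysis averaged over several trained models, as suggested above. The three lambdas stand in for three sets of trained weights, and all names and data are made up for illustration.

```python
import random
from statistics import mean


def mse(predict, X, y):
    """Mean squared error of a predict(row) callable over the dataset."""
    return mean([(predict(row) - t) ** 2 for row, t in zip(X, y)])


def permutation_importance(predict, X, y, col, n_repeats=30, seed=0):
    """Average increase in MSE when column `col` is shuffled across rows."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        increases.append(mse(predict, X_perm, y) - baseline)
    return mean(increases)


def averaged_importance(models, X, y, col):
    """Average the permutation importance of one column over several models."""
    return mean([permutation_importance(m, X, y, col) for m in models])


# Toy setup: the target depends only on column 0; column 1 is noise.
X = [[float(i), float(i % 3)] for i in range(8)]
y = [2.0 * row[0] for row in X]
models = [lambda r: 1.9 * r[0], lambda r: 2.0 * r[0], lambda r: 2.1 * r[0]]

useful = averaged_importance(models, X, y, col=0)
useless = averaged_importance(models, X, y, col=1)
```

Averaging over models makes it less likely that a feature is dropped just because one particular training run happened not to rely on it.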
I tried everything I could to debug it. Is there anything else I can do? Any help would be greatly appreciated
Idk man it feels like over-overkill
I'm not making the frameworks just for this one project lol, they're for my continued use. Why repeat myself every project?
The libraries are already so high-level, but you might be right, I'm just a newbie
I'll just share a msg file (since it's too large) of what I'm doing in case you're curious
Yeah, there are more powerful imputation strategies, like XGBoost and its variants; framework #1 is focused on making sure everything is traceable and explainable
I did draw inspiration from how I managed LotFrontage in this project, but I didn't test its adequacy in the way I noted above that I should have. My goal is to streamline these kinds of processes so I'm not repeating myself
(I should say my variation of the random forest model in the text file, it's literally the same, just the bagging happens per-node instead of per-tree)
Quite annoying that this model doesn't exist already =.= It's so simple: a decision tree with the typical histogram bins, using bootstrapping per node and taking the mode per node, to replicate the benefits of random forests while being ridiculously more explainable.
Before I decide to make a C++ module for this for efficiency, I suppose I should trial this and benchmark some results to see if it performs as well as expected
Edit: So this wasn't feasible, but there's a potential MoE approach, not for the typical performance optimization, but to literally increase accuracy while remaining interpretable. Performance was basically the same as a regular tree-based model, gonna finish my frameworks then gradually improve adding stuff like this idea. Still going to use a tree to assist in forming MoE candidates
Hi,
I don't understand the difference between my RMSE score and the score on Kaggle. Do you have any idea why they differ?
https://github.com/Jeremy-Duval-PhD/Kaggle_housing_price
Since you're removing outliers, even with IsolationForest, your local scores may appear overly optimistic. Because the metric is built on MSE (as part of RMSE), errors on outliers are weighted heavily, so they're valuable to keep unless they're some kind of error. That's the only thing that comes to mind here
Ok, thanks for your answer. And what do you do with outliers on your side?
I'd keep them unless there's any indication as to the values being a mistake, or too extreme. They're valuable here
Ok, thank you !
Hello, I recently submitted my model's predictions to this competition and I got a score of 0.13854.
Is this score an RMSLE, RMSE, MSE, MAE, or R2 score?
And is it a good score or a bad score?
Or do you need more information about my notebook?
Hey, I'm new to ML, so this is probably a newbie question: I tried to use random forest on the challenge data, but it did not work so well. My guess is that you can't compare the prices of residential housing and commercial premises. So I split the data by the MSZoning feature and applied the random forest algorithm to each subset. For example, the subset flagged Commercial by MSZoning has only 10 samples, so I think 81 features for 10 samples is far too many, and in fact about 20 of those 81 features have the same value for all 10 samples. So I tried removing these 20 features. But, and this is my question, the result of my random forest algorithm is worse when I remove these 20 useless features than with all 81. Is this overfitting? Do you have any idea why this odd result occurs?
A score of 0.13854 is an error measure where lower is better, not a 13% accuracy. I suggest you improve your model by trying different algorithms
I think it's an RMSLE score
I think your model is not overfitted; it's more likely the randomness of the random forest. Try to increase the sample size, because with more features the random forest has more scope to split
If you use a sufficiently complex model, it would implicitly learn how residential and commercial locations relate to each other.
In a more professional setting, you would also test such assumptions.
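For the MSZoning question above, a small sketch of how the constant columns inside each subset could be found before deciding what to do with them. The function names and toy rows are made up; a column with a single value in a subset cannot be used for a split there.

```python
def split_by(rows, key):
    """Group row dicts by the value of one column."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    return groups


def constant_columns(rows):
    """Names of columns that take a single value across all given rows."""
    return [c for c in rows[0] if len({r[c] for r in rows}) == 1]


# Toy rows with three columns; the real data has far more.
rows = [
    {"MSZoning": "RL", "Street": "Pave", "LotArea": 8450},
    {"MSZoning": "RL", "Street": "Grvl", "LotArea": 9600},
    {"MSZoning": "C (all)", "Street": "Pave", "LotArea": 7000},
    {"MSZoning": "C (all)", "Street": "Pave", "LotArea": 5000},
]

by_zone = split_by(rows, "MSZoning")
constant_per_zone = {zone: constant_columns(subset) for zone, subset in by_zone.items()}
```

Checking this per subset shows which features carry no information in a 10-row group, which can then be tested rather than assumed.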
Hello, I have a question — is it necessary for me to understand every column in this project? Because there are so many columns.
Getting 0.13 in RMSLE is not bad. I think it's moderate, but you can improve to 0.11-0.10
hello can you help me with a project
Hii
Hi
I am new I just want to learn and do this
Hi
Hi everyone.
hi guys
Heyy, tbh im kinda lost trying to do this basic competition rn.. my best entry is 0.14774 as of rn. How do you guys go about feature engineering?
I feel like that's what's been messing me up, and I'm not sure how I can add relevant features without just trial and error
would you guys appreciate a notebook baseline template which you can easily iterate on?
Here is my pipeline overview and notebook:
https://www.kaggle.com/code/rommelsharma/adv-house-price-predictions-with-optuna-tuning
Pipeline Overview
Step | Description
1 | Load data & remove outliers
2 | Domain-aware ordinal encoding (quality scales)
3 | Missing value imputation (train stats only - no leakage)
4 | Feature engineering: area, age, quality*size crosses, pool features
5 | Drop near-constant columns (with protected set for rare signals)
6 | OrdinalEncoder on vocabulary union of train+test
7 | Neighbourhood target encoding (5-fold out-of-fold)
8 | Pairwise EDA plots vs SalePrice
9 | Multi-model 5-fold CV comparison
10 | Optuna hyperparameter tuning (9 params, 50 trials)
11 | Final XGBoost training with early stopping
12-13 | RMSE progression + feature importance charts
14 | Save model + preprocessor for reuse
15 | Generate submission.csv
I have provided extensive documentation so that it is easy to understand. I hope you can fork the code and get better results. All the best.
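For readers unfamiliar with step 7 in the pipeline above, here is a bare-bones sketch of 5-fold out-of-fold target encoding. It is not the notebook's code: the function name is made up, folds are assigned by row index (which assumes the rows are already shuffled), and there is no smoothing.

```python
from statistics import mean


def oof_target_encode(categories, targets, n_folds=5):
    """Encode each row with the mean target of its category, computed on the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        val_idx = [i for i in range(n) if i % n_folds == fold]

        by_category = {}
        for i in train_idx:
            by_category.setdefault(categories[i], []).append(targets[i])
        # Categories unseen in the training folds fall back to the overall training mean.
        fallback = mean([targets[i] for i in train_idx])

        for i in val_idx:
            seen = by_category.get(categories[i])
            encoded[i] = mean(seen) if seen else fallback
    return encoded


# Toy data: neighbourhood "A" is cheap, "B" is expensive.
neighbourhoods = ["A", "B"] * 5
log_prices = [1.0, 3.0] * 5
encoded = oof_target_encode(neighbourhoods, log_prices)
```

Since a row's own SalePrice never feeds its encoded value, the model cannot simply read the target back from the feature. Test rows would be encoded with category means from the full training set.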
sure, it's competition's kernel
on kernel, you can see inspector part, it could help you with selecting features.