#🏠┊house-prices-advanced-regression-techniques

real cloak
#

So how do I get started? Sorry, I'm lost in the overwhelming number of messages.

calm saffron
rough zephyr
#

Does anyone want to team up for this project?

real cloak
#

Can anyone lead?

tough cosmos
fringe scroll
#

@rough zephyr @thin dust can I be part of your team?

tough cosmos
ripe thunder
#

Hi everyone, I have a rookie question. As my cross-validation score decreases, my RMSE score on Kaggle increases. I can't explain why they move in opposite directions. Anybody?

tidal hawk
#

Hey, I am no expert but I would assume you have overfitting. A lower cross-validation score suggests your model is performing as needed on your local dataset. However, a high (or increasing) RMSE on Kaggle shows that it's not generalizing well to new, unseen data. @ripe thunder

#

Make sure you're properly preprocessing and validating your data to avoid the issue.

ripe thunder
#

Thanks, maybe I can use a validation set approach rather than cv. Will keep in mind the tips.

ivory tapir
#

Large dataset hurts my computer lol

#

Thoughts on doing PCA --> training on components, instead of directly on features?

viral path
crystal plover
#

Hello - When you analyze data to identify the needs for preprocessing, do you look at both the Training and Test datasets and apply the technique (e.g. dropping columns with missing data) that will work for both, to save time and effort later? Or do you look at the Training dataset only and apply the same steps to the Test dataset? And if there is an issue, then deal with it at that time? The first approach is proactive and the latter approach is reactive.

I am asking because this has happened:
Step 1: Analyzed the Training dataset and identified 3 columns that have missing values
Step 2: Split the Training dataset to Training and Validation sub-datasets
Step 3: Dropped those 3 columns from both sub-sets
Step 4: Defined the model, fit the model with the Training sub-dataset
Step 5: Made predictions using the Validation sub-dataset
Step 6: Determined from the MAE that the model using the column-dropping approach performed better than the model using imputation
Step 7: I was ready to make predictions using the Test dataset; then found out that the Test dataset had additional columns with missing data. SURPRISE! What am I supposed to do in this situation?

Below is the link to my notebook.
https://www.kaggle.com/code/juliasuzuki/house-price-prediction-advanced-regression/notebook

Thank you so much for your response in advance!

crystal plover
# crystal plover Hello - When you analyze data to identify the needs for preprocessing, do you lo...

UPDATE: I received enough responses in Kaggle Discussions (Yes, I posted the same question) and my question is resolved. It is generally a good practice to work only with Training dataset during the model development (preprocessing included). After all, the purpose of Test data is to make an inference from new data (i.e. Test data) using the trained model. But to avoid a situation like the one I mentioned above (find a surprise when working with Test data), I could implement a better preprocessing strategy (e.g. imputation instead of dropping columns). Hope you find this helpful!

tough cosmos
crystal plover
tough cosmos
crystal plover
tough cosmos
#

@crystal plover Thanks! I agree with most of the advice. You want to split your data into train and test datasets before applying any preprocessing or filling in the missing values. You fit the preprocessing only on the training data, then apply those fitted steps unchanged to the test data. You want to treat your test data as completely separate data during the entire model building process and only use it for scoring the model (and iterating through the tuning process to improve it).

I like this explanation posted on StackExchange as well that summarizes these concepts: https://stats.stackexchange.com/a/95088

Incorporating the preprocessing techniques into a pipeline is a convenient way to modularize them and to ensure those same techniques are applied to future, unseen data. This will ensure that new data receives exactly the same treatment that your model was trained on, which also helps avoid errors if your new data has more/fewer features than your old data (e.g. shape errors: ValueError: Error when checking input: expected input_main_input to have shape (7,) but got array with shape (1,)).
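
A minimal sketch of the pipeline idea described above, assuming scikit-learn and a made-up two-column dataset (the column names and values are hypothetical, not taken from the competition data). The imputer learns its medians from the training rows only, and the same fitted medians are reused when predicting on test rows that have gaps in a column that was complete during training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: the test set has a gap in "rooms", a column that
# had no missing values in training (the "surprise" scenario).
X_train = pd.DataFrame({"area": [50.0, 80.0, np.nan, 120.0],
                        "rooms": [2.0, 3.0, 3.0, 5.0]})
y_train = np.array([100.0, 160.0, 170.0, 250.0])
X_test = pd.DataFrame({"area": [90.0, 60.0], "rooms": [np.nan, 2.0]})

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians are learned in fit()
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)    # imputation statistics come from training rows only
preds = model.predict(X_test)  # the same fitted medians fill the test gaps

# One median per column, computed on the training data.
learned_medians = model.named_steps["impute"].statistics_
```

Because every column gets an imputation rule at fit time, a column that only turns out to have missing values in the test set is still handled, which is one reason imputation inside a pipeline tends to be more robust than dropping columns by hand.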

crystal plover
crystal plover
dawn dagger
#

Hello. In the case of predicting the price of real estate, is there a need to do EDA? If so, I have a small problem: I cannot extract information from the dataset and I get a little lost in the analysis of the latter. I would just like to know if I can find a more suitable method to do EDA on datasets for a regression task. Thank you in advance for your answers

#

In fact, for my basic EDA, I started by familiarizing myself with the dataset (which has more than 80 columns) by looking at the percentage of missing values, the proportion of each type of variable, and the number of qualitative and quantitative variables. Then I started visualizing target/variable relationships and variable/variable relationships, and this is where things got complicated for me, because I couldn't extract any relevant information from this dataset and I was only going in circles. Hence my question as to whether there is a need to do EDA on this dataset. Thank you in advance for your responses, once again.

thick knot
#

You should always do EDA on datasets

#

They can help reveal patterns that you can use, or give inspiration to features you can engineer for modelling

#

Wdym by “couldn’t extract any relevant information from the dataset” @dawn dagger

dawn dagger
thick knot
#

As well as making a correlation matrix

#

Seeing the distribution of the explanatory variates after taking into account the response is also useful

#

it’s also important you confirm the distributions of data between the training and test sets are similar, as otherwise training may not work
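
A rough sketch of two of the checks suggested above, assuming pandas and a synthetic stand-in for the data (the numbers are randomly generated; only the column names GrLivArea, OverallQual and SalePrice mirror the competition). It ranks numeric features by their correlation with the target, then puts train and test summary statistics side by side so that large distribution differences stand out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; in practice these would be the competition's
# train.csv and test.csv loaded with pd.read_csv.
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
})
train["SalePrice"] = (100 * train["GrLivArea"]
                      + 20_000 * train["OverallQual"]
                      + rng.normal(0, 10_000, n))
test = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
})

# 1) Correlation of each numeric feature with the target, strongest first.
corr_with_target = (train.corr()["SalePrice"]
                    .drop("SalePrice")
                    .sort_values(ascending=False))

# 2) Train vs. test mean and standard deviation per feature, side by side.
summary = pd.concat(
    {"train": train.drop(columns="SalePrice").describe().T[["mean", "std"]],
     "test": test.describe().T[["mean", "std"]]},
    axis=1,
)
```

Printing `corr_with_target` and `summary` gives a quick first pass before moving on to plots.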

dawn dagger
# thick knot I typically like to do count plots for categorical variates and histograms for n...

Thank you so much @thick knot. Personally, I would do the opposite with graphs, i.e. count plots for numerical variables and histograms for categorical variables. But thinking twice, I find that it makes more sense to use histograms, or more precisely the barplot function of seaborn, to examine the distribution of each variable a little. Regarding categorical variables, we could try count plots or even catplots to see the types of categories and the distribution of each category present in these variables. But I owe you a big thank you for clarifying my ideas and also for giving me other ideas. I hope to get back to you soon with further questions. Sincerely.

steel flower
lament monolith
#

Hi everyone! I have a question about the commercial usage of this dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

I work as an editor for a publishing company and we are currently working on a new book around predictive modeling and time series analysis. The author of the book wants to use this dataset to discuss techniques for data exploration and feature engineering. He thinks this dataset would be a good fit to explain these concepts.

Our policy is to use datasets that are open-source and available for commercial use. Or else, we seek permission from the source before we can use it in our books. We didn't find a license for this dataset and the dataset link provided in the Acknowledgements section is broken. So I'm just wondering if we can use this dataset in our book? I would love to get a response from one of the Kaggle staff.

I apologize if this is the wrong place to post this. It would be great if someone could point me in the right direction. I'd be happy to get in touch with someone from Kaggle (over email) to keep it official. I couldn't find a contact on the website hence the post here.

Cheers!

proven field
#

I have been having a problem with implementing a pipeline for 2 functions, namely an imputer and a dataset splitter. I get that I could bypass this step, but since I am learning I really want to know how to create and use pipelines.

any help would be much appreciated

fresh sequoia
#

So close to top 500

fresh sequoia
#

And finally hit top 10%

#

Time to record a video

fading willow
#

gg

lament monolith
gloomy rain
#

It's so tough

#

My models can't even hit 10% accuracy

#

If I put too much load on it, it just crashes the Colab notebook

past tide
#

I have a doubt. Assume that we compute the (Pearson) correlation between all the features and the target and find that some features have a correlation with the target of around 0 (between -0.05 and 0.05). Should we include these features in the model?

#

Similarly, if we use a quick and dirty random forest to get feature importances, should the features with the least importance be included in the final model or not?

past tide
#

The ones which are highly voted

past tide
#

I have a doubt. The image shows the (Pearson) correlation between the target feature and all other predictors. There are some predictors with which the target is very, very weakly correlated (correlation between -0.05 and 0.05). Should we include these features in the model? In my opinion these features should not be included, since a very weak correlation means a change in the predictor will not be reflected in the target, and hence the two are independent of each other. Am I correct, and what should be done?
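
A small worked example relevant to the question above, using NumPy and made-up data. A Pearson correlation near zero only rules out a linear relationship; it does not by itself show that the feature and the target are independent. Here `y` is completely determined by `x`, yet their Pearson correlation is essentially zero, while a simple transform of `x` correlates perfectly.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
y = x ** 2                       # y is fully determined by x

# Linear (Pearson) correlation between x and y: effectively zero.
pearson_r = np.corrcoef(x, y)[0, 1]

# The same feature after a transform: perfectly correlated with y.
pearson_r_squared_feature = np.corrcoef(x ** 2, y)[0, 1]
```

So a weak Pearson value is a reason to look closer (plots, transforms, a tree-based model) rather than an automatic reason to drop the column.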

low steppe
fresh sequoia
#

My YouTube vid on this will be up next week. Been recording all day long

cyan turtle
#

I have a question regarding cross-validation using GridSearch: is there any point in a train/validation split, since the grid search doesn't use a validation dataset to get the best score of the algorithm?
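
One possible way to look at this, sketched with scikit-learn on synthetic data (the model, the parameter grid and the data are all assumptions for illustration). `GridSearchCV` builds its own validation folds inside the data it is given, so a separate validation split is not needed for the search itself; a hold-out set kept away from the search can still give an estimate that was not used to pick the hyperparameters.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Keep a hold-out set that the grid search never sees.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The search runs 5-fold cross-validation internally on X_dev only.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",  # sklearn negates errors: higher is better
)
search.fit(X_dev, y_dev)

# search.predict uses the best model refitted on all of X_dev.
holdout_rmse = float(np.sqrt(np.mean((search.predict(X_hold) - y_hold) ** 2)))
```

Note that `search.best_score_` is a negated RMSE here, which is easy to misread when comparing it with a leaderboard RMSE.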

fresh sequoia
#

For anyone who wants to watch a youtube video on the series: https://www.youtube.com/watch?v=UqmulHG4IvY&t=1s&ab_channel=RyanNolanData

Welcome to our latest data science project! In this exciting YouTube tutorial, we'll dive into the world of advanced regression analysis using Kaggle's House Prices dataset. When working on the project, the code was able to achieve a top 10% score!

Kaggle Notebook: https://www.kaggle.com/code/ryannolan1/kaggle-housing-youtube-video

Email: ryan...

still void
cyan turtle
rare mauve
past tide
worthy lava
#

Hi

#

Is there any way I can use deep learning / neural nets on this dataset?

#

I'm new to deep learning

fresh sequoia
#

yes but probably better to use a prebuilt regression

worthy lava
#

Do you have any idea how I could apply it?

#

I have done a decent bit of regression before in another project

toxic tide
#

why is this happening?

daring dragon
#

@past tide Great Work!…

vapid jay
#

Looking for a team

#

please dm me

fierce epoch
#

i got a score of 0.14

#

@rare mauve

rare mauve
mild crystal
#

looking for a team anyone interested?

modest sapphire
#

im noob though

mild crystal
modest sapphire
#

oki

final cairn
#

@toxic tide Did you figure out why?

hoary prism
#

Hey, I am new to ML and learning it by watching YT videos and MOOCs. I am looking for a mentor/guide/buddy who can share his/her experience with me and help me become a better ML practitioner.

jovial jewel
#

Hey is 0.14 a good score to start with? Like it's my first submission but I would change a lot of things actually. Is it that good!?

brazen hare
#

maybe im the epoch?

fresh sequoia
#

if anyone wants a vid to follow, I made a 3 hour one on this project: https://www.youtube.com/watch?v=UqmulHG4IvY

onyx spoke
#

Hello, I am a student just starting in ML and I am struggling with this competition.
Looking for teammates, a buddy, a senior, or anyone else to share experiences.

desert stirrup
#

Notebook links attached to sub headings of the competition description are not working.

verbal wren
#

In the house price dataset, how do I deal with columns where a single value accounts for a very high share of the rows, say more than 90%?
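
A short sketch for finding such columns, assuming pandas. The helper name `dominant_share` and the toy rows are made up for illustration; Street and LotShape are real column names from this dataset. It reports, per column, the share of rows taken by the most frequent value, so columns above a chosen threshold can be reviewed and then dropped or collapsed into a binary flag.

```python
import pandas as pd

def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows taken by each column's most frequent value.

    Missing values are counted as a value of their own (dropna=False).
    """
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Hypothetical toy frame: "Street" is 95% one value, "LotShape" is more mixed.
df = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],
    "LotShape": ["Reg"] * 12 + ["IR1"] * 8,
})

shares = dominant_share(df)
near_constant = shares[shares > 0.90].index.tolist()  # candidates to review
```

Whether to drop a flagged column is a judgment call: a rare value can still carry signal, so comparing validation scores with and without the column is a reasonable check.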

sinful token
# past tide I have a doubt , the image shows the correlation(pearson) between the target fea...

I used random forest to look at feature importance and this is what i have got

    cols          imp
47  OverallQual   0.765166
59  GrLivArea     0.146761
56  1stFlrSF      0.023267
55  TotalBsmtSF   0.023234
34  GarageFinish  0.015694
70  GarageArea    0.011260
67  Fireplaces    0.008944
52  BsmtFinSF1    0.005675
6   LotConfig     0.000000
8   Neighborhood  0.000000
1   Street        0.000000
2   Alley         0.000000
3   LotShape      0.000000
4   LandContour   0.000000
5   Utilities     0.000000

regal heart
#

Hello, I'm trying to start the home price prediction project. But noticed I keep getting issues or errors with my tensorflow. I'm trying to print or describe the csv that I currently loaded. I'm running the code on vs code on windows 11. Any help will be great

high jasper
#

anyone interested in doing this project together

dense edge
frosty halo
#

I am just starting this competition. Looking for teammates. Please let me know if you're interested in working together

steady laurel
#

Hi guys, is the result shown after a submission the MSE? Meaning, is that the one we are ranked on?

distant zodiac
past patrol
#

Is anyone interested in collaborating as a team? Feel free to DM.

pearl oak
past patrol
hushed moon
past patrol
mint prism
past patrol
uncut pond
#

I got a score in this competition. Is it OK or really bad?

junior forge
#

anyone interested in doing this project together?

fair trail
#

How are you guys encoding the categorical variables? One-hot encoding would balloon the dataframe to 100s of columns. Is there a more efficient method?

junior forge
#

but I delete the columns that are below 0.05 correlation with SalePrice

pallid thistle
# fair trail how are you guys encoding the categorical variable, one hot encoding would ballo...

Yeah man, this is a common issue with cat variables and one-hot encoding. In addition to @junior forge's answer, and to the best of my knowledge, you can just ignore cat variables with high cardinality. However, these days I am searching for some NLP solutions. I am not sure if techniques like word embedding may be a good alternative (if someone already has the answer, do not hesitate to tell me).

rustic widget
#

Hello, I'm a bit confused. Can anyone explain the submission format? Am I supposed to find np.log(y_predict) or rmse(np.log(y_true), np.log(y_predict))?
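
A small sketch of the metric, using NumPy and made-up prices. As far as the competition's evaluation description goes, the leaderboard score is the RMSE between the logarithm of the predicted price and the logarithm of the observed price, while the submission file itself holds SalePrice in normal price units. The function name `rmse_of_logs` is just an illustrative choice.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(observed price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up observed prices.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])

# Pretend a model was trained on log(price) and is off by 0.1 everywhere.
log_preds = np.log(y_true) + 0.1

# Convert back to price units before writing the submission file.
y_submit = np.exp(log_preds)

score = rmse_of_logs(y_true, y_submit)
```

So if the model is trained on a log-transformed target, the predictions need to be transformed back (np.exp for np.log, np.expm1 for np.log1p) before submitting.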

pallid thistle
# junior forge How works Word embedding?

Basically they transform a word (say the word "house") to a vector of length n

In addition to representing words as vectors, word embeddings try to place synonyms and similar words close to each other in the vector space.

Therefore, I am wondering if using word embeddings with cat variables may help. I don't know, I am still searching....

hardy wedge
#

Hello, I was working on the housing regression competition and I am fairly new. I was wondering: for a column such as Street, which has a range of string values such as gravel, pavement..., how should I encode it?
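
One possible approach, sketched with pandas on a three-row toy frame. A two-level column such as Street needs only a single 0/1 indicator, and columns that follow the dataset's quality scale (Po, Fa, TA, Gd, Ex) can be mapped to ordered integers instead of being one-hot encoded, which also keeps the column count down. The derived column names `Street_Pave` and `ExterQual_ord` are made up for illustration.

```python
import pandas as pd

# Toy rows using category codes from the dataset's data description.
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "ExterQual": ["TA", "Ex", "Gd"],
})

# Two-level nominal column: a single 0/1 indicator is enough.
df["Street_Pave"] = (df["Street"] == "Pave").astype(int)

# Ordered quality scale: map levels to integers that respect the order.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual_ord"] = df["ExterQual"].map(quality_order)
```

Values missing from the mapping come back as NaN from `.map`, so it is worth checking for them after encoding.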

ornate spindle
#

hello, i am an absolute beginner in machine learning. i just learned about linear regression and error metrics and wanted to get my hands on a small project using the techniques i learnt. So i started with the famous boston-housing-prices dataset on kaggle and would appreciate if you could take a look at my code: https://www.kaggle.com/code/khalidhelmy55/boston-housing-prices and guide me on what is missing or what could be better done..
According to the metrics I calculated, the model is not performing well.

tranquil quail
#

Hello everyone, beginner here. I was doing the Intermediate ML course, and in the categorical variables exercise the one-hot encoding went smoothly, but now the DataFrame has missing values. What is the most appropriate course of action? Should I just fill in the missing values, or something else?
As you can see in the screenshot, the previous step was correct.

latent horizon
tranquil quail
#

Thank you Tom for your response. Actually later on in the tutorial the values were imputed, so the problem was solved.

tranquil quail
#

So, I was thinking of applying what I learned to the competition to see if the accuracy can be improved. Now I have a problem which I am not able to figure out no matter what I do. This is what I want to do:

  1. select categorical columns and apply SimpleImputer to them with a constant value.
  2. select numerical columns and apply SimpleImputer with strategy = mean
  3. apply ordinal encoding to certain columns and one-hot to others
  4. finally I want to train an XGBRegressor model.
    but every time, a new error is thrown.
#

this is the pipeline I am using

#

i am imputing the values but still this error is showing up

#

I am thinking that the imputation is not being applied to the testing data.

rustic widget
# tranquil quail i am imputing the values but still this error is showing up

Well, this is a common error.
It occurs because you need to convert the categorical data into numerical data before imputing it, because SimpleImputer works only with numerical data.
Since you are trying to impute the categorical data first, SimpleImputer doesn't understand 'nan' (which is a string).

tranquil quail
#

the only difference is the strategy.
the dataset is the same

rustic widget
tranquil quail
#

and the error doesn't occur while fitting the transformer. The error happens when I use the transform method on other data

#

what I suspect is that when I am calling the transform method the imputation is not happening for some reason

rustic widget
tranquil quail
#

but wont there be data leakage if we do that?

rustic widget
tranquil quail
#

fitting another transformer on the testing data?

#

and the features may be inconsistent also?

rustic widget
# tranquil quail fitting another transformer on the testing data?

I'm not quite sure. Well, I used the same code as the first transformer; the only difference was the columns.
I extracted the column names from the test data and it works.
I always use skimpy.skim(data) or data.info() or data.describe() after any step in preprocessing.

When I used the same transformer, I found from skim(data) that the conversion was incomplete for the testing data. That's why I made the 2nd transformer, only changing it to test.columns instead of train.columns, and it worked, since I got a similar description using skim(test).

tranquil quail
rustic widget
tranquil quail
# rustic widget Nice! What was the error though ?

there were two errors:-

  1. turns out that the ColumnTransformer performs the transformations in parallel and due to this behavior imputations were being skipped because there were intersecting columns for imputations and encoding.
  2. The OrdinalEncoder was not able to handle unknown values because (my mistake) I did not specify how to handle them.
#

Thank you for the help @rustic widget !
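
A sketch of one way to avoid both errors described above, assuming scikit-learn and a tiny made-up frame (the rows are invented; "Dirt" is a deliberately fake category). `ColumnTransformer` hands each transformer the original columns independently, so imputation and encoding for the same columns have to be chained inside one `Pipeline` per column group. `SimpleImputer` with `strategy="constant"` (or `"most_frequent"`) accepts string columns, and the encoders are told explicitly what to do with categories they did not see during fit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "ExterQual": ["Gd", "TA", np.nan, "TA"],
    "Street": ["Pave", "Pave", "Grvl", np.nan],
})
test = pd.DataFrame({
    "LotArea": [np.nan, 14260.0],
    "ExterQual": ["Ex", np.nan],  # "Ex" never appears in the training rows
    "Street": ["Pave", "Dirt"],   # "Dirt" is a made-up unseen category
})

# Each column group gets impute -> encode chained in a single Pipeline.
numeric = Pipeline([("impute", SimpleImputer(strategy="mean"))])
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    # Default category order is alphabetical; pass categories=[...] to
    # enforce a real quality order.
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer(
    [
        ("num", numeric, ["LotArea"]),
        ("ord", ordinal, ["ExterQual"]),
        ("nom", nominal, ["Street"]),  # no column appears in two groups
    ],
    sparse_threshold=0,  # always return a dense array
)

train_t = pre.fit_transform(train)  # fit on training rows only
test_t = pre.transform(test)        # reuse the fitted steps on test rows
```

The same `pre` object can be placed as the first step of a larger pipeline in front of a regressor such as XGBRegressor, so that fitting a second transformer on the test data is not needed.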

rustic widget
valid plume
#

Hi guys, I really cannot understand how it is possible that, despite having cross-validated my models with 10 k-fold splits, I still have a huge gap between my local score and the leaderboard score. That happened with the Titanic competition too. Is that normal, or are you seeing better alignment between local and Kaggle scores?

idle jay
#

hi, I'm running various models on the London house prices dataset, but I've run into an error with matplotlib I can't seem to fix.

#

I'm trying to run residual analysis

#

and the first time it works

#

but running the same code the second time doesn't

idle jay
#

nevermind figured it out

merry mirage
#

I made a (fairly basic) entry to the december playground and placed 1577/2390

then i literally just copied and ran the same code for the house prices competition and placed 355 out of 4910.

are the competitors on the rolling competitions worse than the competitors on the playgrounds or?

#

like i'm not doing feature engineering, hyperparameter tuning, anything, i just one-hot encode the categorics and do one run of xgboost, that's it, it's not a good model

merry mirage
#

wait am i doing the wrong one, why are there two

#

i might submit to that one too

#

what's the difference between them? the scores on the leaderboards are all different

merry mirage
#

I submitted the same code to the other one and placed 2837. so I'm guessing that one is more competitive?

lavish hearth
lavish hearth
merry mirage
#

so I've been trying to adapt Ryan Holbrook's notebook (https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) but ran into an odd problem. I don't know if anyone can help with this.

in Ryan Holbrook's notebook, if you print out the MI scores, it shows that OverallQual and Neighborhood are the two most important features.

I changed the imputer to impute the median for the numerical features instead of 0 and the MI score for several features, including OverallQual and Neighborhood, dropped to 0.

Does anyone know why this is? Have I done something wrong?

late tusk
#

Can someone help me find out what wrong with my code?
I implemented EDA, Feature Engineering, and other preprocessing techniques. However, the score lowered compared to my other codes. What is wrong with my code? Here is the link: https://www.kaggle.com/code/eidenspark/notebook5a2c152607

Thanks!

desert hollow
#

Hi in the housing prices beginners competition, what is the score/predictions that participants should be aiming for? I've just done my first submission and got 21488.82469. Thanks

remote torrent
#

Hey, I'm having an issue in this competition. Can someone help me?

#

The first features list got 15509.73 in MAE metric, while features2 list got 21857.15.

#

I should receive more points right!? But I got less....

#

Idk what is happening.

remote torrent
spring heath
#

Hello all, subhasish this side

dusky dust
#

Hi everyone
What is the best project that you found open source?
I need one to use on my data in my website

jovial coral
#

So, you can just get the house price index data between 2006 and 2010 for Ames, Iowa via FRED, that totally feels like cheating even though it's not 😅

arctic anchor
#

Isn't that target leakage?

jovial coral
cloud acorn
#

Hi, I just started my competition journey with the beginner house price prediction. I've never practiced on such a dataset; there are so many attributes that I'm confused about which columns I should create data visualizations for. Need help.

arctic anchor
fossil coyote
#

how the hell did the leaders get a score of 0.00044

cloud acorn
#

where

cloud acorn
arctic anchor
jovial coral
#

Tip for anyone else doing this dataset btw: There's a 45% correlation between LotArea and LotFrontage, but a 60% correlation between sqrt(LotArea) and LotFrontage.

I ended up grouping LotFrontage/sqrt(LotArea) ratios based on moderately correlated groups like LotShape, with a minimum number of samples per group, got the median ratios, and used these to impute the 250 or so (however many there were) missing LotFrontage values.

Couldn't find any reason as to why there were so many LotFrontage values missing either even after trying correlations after encoding categories
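
A simplified sketch of the ratio-based imputation described above, assuming pandas and six made-up rows. It groups by LotShape only and leaves out the minimum-samples-per-group rule from the message, so it is an approximation of the idea rather than the original implementation.

```python
import numpy as np
import pandas as pd

# Made-up rows; LotFrontage is missing in two of them.
df = pd.DataFrame({
    "LotArea": [10000.0, 6400.0, 8100.0, 14400.0, 9025.0, 12100.0],
    "LotShape": ["Reg", "Reg", "Reg", "IR1", "IR1", "IR1"],
    "LotFrontage": [70.0, 56.0, np.nan, 60.0, np.nan, 66.0],
})

root_area = np.sqrt(df["LotArea"])
ratio = df["LotFrontage"] / root_area  # NaN where frontage is missing

# Median frontage/sqrt(area) ratio per LotShape group (NaNs are skipped).
group_ratio = ratio.groupby(df["LotShape"]).transform("median")
# Fallback to the overall median for groups with no observed ratio.
group_ratio = group_ratio.fillna(ratio.median())

# Fill only the missing frontages with ratio * sqrt(area).
df["LotFrontageImputed"] = df["LotFrontage"].fillna(group_ratio * root_area)
```

In a real workflow the group medians would be computed from the training rows only and then reused for the test rows.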

jovial coral
# remote torrent

data leakage I believe, seems like you're fitting your model based on both your training data, and your testing data?

(Oh that was ages ago I need another coffee ._.)

cloud acorn
#

How am I going to find out which columns to drop and which to keep? After dropping some columns I still have 75 of them. I have tried correlation, but I don't think I can judge what to drop on that basis alone.

jovial coral
#

I've still gotta look for outliers myself but I'll probably drop most via PCA

#

But I wanna get good at sensitivity analysis so non linear considerations are accounted for

jovial coral
jovial coral
#

Well, feel the need to revisit multivariable calc then perform sensitivity analysis before PCA so at least I have an idea of the non-linear correlations going on.

Kind of thought how I wanna go about this with partial derivatives and the Jacobian Matrix

WML 😆 this is gonna be fun

cloud acorn
#

what could be a good avg value for this

#

rmse

fossil wasp
#

Hi, community! This is my first submission on Kaggle. I hope to keep moving forward and will try to improve my score. 😄
This is my current code—I'll document my steps more thoroughly soon, but if you notice anything I can improve, I’d really appreciate your feedback!
https://www.kaggle.com/code/luiz2002/house-pricing-0-13481

steady flume
#

Hello everyone! I am learning on Kaggle by doing the House Prices competition. My current rank is 583. But right now, I am kinda stuck on how to further improve my model. I'd really appreciate any tips, suggestions, or feedback on what I could try next. Thanks so much in advance!

naive slate
#

Is there any difference between the Housing Prices Competition for Kaggle Learn Users & the Housing Prices Advanced Regression Techniques or are they just different places to submit predictions with different scoring methods to try and split up people who have only just started and people who have recently started but are digging into more complex things?

steady flume
#

@everyone Hello, is anyone interested in chatting about strategies for this competition? I think it could be helpful for us to share ideas and learn from each other to improve our models. If you're interested, feel free to DM me. I would love to connect!

jovial coral
jovial coral
jovial coral
jovial coral
#

https://github.com/HotProtato/Ames-House-Prices-Regression-Model/tree/master

Due to university assignments I won't be able to continue that for a while. There's a lot of mess I need to clean up, including, for some reason, some variables showing up more than once. There are multiple commits showing my EDA methodology.

However, if I were to use a deep learning model for tabular data, I would now be more systematic. For instance, I would observe the medians and IQR by way of boxplots for ordinal categories against the target value to look for relationships. For instance, I would automatically attempt to look for concave relationships and, if identified, introduce the ordinal values after encoding as x^2, letting the model find the coefficient.

I would then integrate that approach as an imputation strategy. I believe it was MICE? That does something similar, it already imputes via groups with high correlations, but for an imputation strategy I'd first use the first method, treating the variable to be imputed as the temporary target variable, attempting to find relationships automatically, then performing transformations (cloning to keep the originals), to try and find groups that have higher correlations.

An example being, I used sqrt(LotArea) which had a 60% correlation to LotFrontage, compared to 45% normally, allowing me to make a LotFrontageRatio value for imputation, grouped by 2 other ~30 - 35% correlating groups.

I also performed sensitivity analysis to remove many columns. The best score I can seem to get so far, with 80% 20% split is around 0.12 RMSE (the target variable is np.log1p'd). However, I would look to use a 95% 5% split, using stratified subsampling once I'm 100% done.

While I know there's a lot of cleaning to do, let me know if you see any areas of improvement, whether it's convention-wise, or functional.

My next project, I will look for a larger than memory dataset, to deliberately require my navigating the challenges that come with it, using a deliberate local database structure

#

I mathematically represented the model but need to update the details due to hyper parameter optimization (used it to determine number of layers and neurons per layer, as well as activation functions) 🙂

Also, I've had each value set between 0 and 1 (except sine and cosine of MoSold), and deliberately made it so the higher the value, the greater the expected SalePrice, hence weird values like "BuildingNewnessScore" lol. Building age also makes sense only with respect to MoSold and YrSold after all 😛 the aligned scoring in this way allows the model to train more efficiently

Prior to working on my next model, I plan on making a general utility class. Might be interested in working on some projects with others, so hmu. My time is limited for the next 3 weeks, however
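
A minimal sketch of the sine/cosine encoding of MoSold mentioned above, using NumPy. The exact phase convention (month 1 at angle 0) and the helper `gap` are assumptions for illustration. The point of the encoding is that December and January end up as close to each other as any other pair of neighbouring months.

```python
import numpy as np

mo_sold = np.arange(1, 13)                  # months 1..12
angle = 2.0 * np.pi * (mo_sold - 1) / 12.0  # assumed convention: month 1 -> angle 0
mo_sin, mo_cos = np.sin(angle), np.cos(angle)

def gap(a: int, b: int) -> float:
    """Euclidean distance between months a and b in (sin, cos) space."""
    return float(np.hypot(mo_sin[a - 1] - mo_sin[b - 1],
                          mo_cos[a - 1] - mo_cos[b - 1]))

dec_to_jan = gap(12, 1)  # neighbours across the year boundary
jan_to_feb = gap(1, 2)   # ordinary neighbours
jan_to_jul = gap(1, 7)   # opposite sides of the circle
```

With a plain integer month, December (12) and January (1) would look eleven units apart, which is the distortion this encoding removes.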

cinder cliff
cinder cliff
#

I updated my Jupyter notebook using StackingRegressor with a machine learning ensemble of models. This improved the score to 0.12489 (rank 693). Here is a link to the updated notebook using stacking regression: https://github.com/gjpelletier/stepAIC/blob/main/Example_kaggle_house_prices.ipynb

GitHub

Stepwise, Lasso, Ridge, Elastic Net, and Stacking linear regression to minimize MSE, AIC, BIC, or VIF in Python and Jupyter notebook - gjpelletier/stepAIC
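
For readers unfamiliar with stacking, a generic sketch using scikit-learn's `StackingRegressor` on synthetic data. The base models and the final estimator here are assumptions for illustration and are not taken from the linked notebook. Each base model produces out-of-fold predictions, and the final estimator learns how to combine them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data with a linear part and a mild nonlinear part.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (2.0 * X[:, 0] - X[:, 1] + np.sin(X[:, 2])
     + rng.normal(scale=0.1, size=300))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.01)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # combines the base predictions
    cv=5,                              # out-of-fold predictions for the combiner
)
stack.fit(X_tr, y_tr)
r2 = stack.score(X_te, y_te)  # R^2 on held-out rows
```

Mixing a linear model with a tree-based one is a common choice because their errors tend to differ, which gives the combiner something to work with.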

jovial coral
#

Well, thankfully I've been able to work on this some more, I've gotten it down to 0.12036 for my training/test split, I plan on training on 95% of the data, with a 5% stratified subsampling split to ensure ideal weights when submitting.

Trimming the last of the low contributing features via sensitivity analysis, then moving onto PCA 🤩

jovial coral
# nimble cedar How are you going to use PcA?

Just in my eda notebook to see if it's worth adding to my pipeline and for what parameters.

Needing to learn how SVD works first, as I know the math behind PCA, just not through this method

jovial coral
#

Okay, so I've learnt a lot.

Due to my 80/20 split showing my scores being in the top 1% already, not going to bother with PCA, but when I use it, I will need to learn how it's calculated via SVD so I have an understanding of error margins to expect based on my data.

Beyond this, I've learnt that in research and actual model deployment, the averages or deviations from training matter more than the lowest validation loss value.

Going to make my submission within the next few hours then update my repo, then I'm just gonna clone my project, and focus more on learning how to best extract insights, as well as practicing charting the HPO values by way of performance vs cost, so if I were to hypothetically scale the model, how to make it more efficient.

Thanks to my strong feature transformations that I've done, my neural network isn't large at all 🙂

Learnt a lot in terms of Optuna trials in HPO and when to best cull / early stop trials.

If anyone is curious or knowledgeable when it comes to extracting insights from data (effectively getting as close to causal inference as possible, since it's not feasible with this data for example), reach out 😁

My lowest score is in the 0.119's now, hopefully it'll get lower with my 95 / 5 (stratified) split

cinder cliff
jovial coral
#

Well, I learnt for my model PCA was kinda pointless, and I also learnt that for HPO and my permutation sensitivity analysis (despite n = 30), for such little data, I should have had 3 weights from validation to determine what features should be kept and removed via iterative HPO and sensitivity analysis.

Decided to start working on my own framework instead for now that uses feature engine and scikit-learn within an auditing kind of system, where everything is traceable, and gradually collecting data over the years, maybe to make my own data scientist advisor model in the future, who knows? 🤔

nimble cedar
jovial coral
# nimble cedar Have you considered a Factor Analysis?

Not yet!

I've just finished learning the logic and math behind most tree-based models, just working on two frameworks at present, one aimed at explainability, the other aimed at performance.

I want to generalize my imputation strategy by learning C++, writing it as a module in C++ to then use in my framework, and get a score via this strategy, as well as a score for tree-based models to see which, if any, are suitable for imputing missing values.

If there's 30% or more missing data in one column, I'll remove it. Otherwise, I'll perform a test/train split on missing data columns, and use my method, and tree-based models, see which one gets the least amount of loss, then use whichever model/method works best for imputing those missing values.
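A rough sketch of that selection rule, assuming a single numeric column and only constant-value imputers. The function `choose_imputer`, the `candidates` dictionary, and the synthetic data are hypothetical names introduced here, not part of the framework described above. A tree-based imputer would need the other columns as inputs, so this simplified interface would have to be widened for it.

```python
import numpy as np

MISSING_THRESHOLD = 0.30  # from the message above: drop columns with >= 30% missing

def choose_imputer(column, candidates, holdout_frac=0.2, seed=0):
    """Return (best_name, scores) by held-out MAE, or (None, {}) if the column
    has too many missing values and should be dropped instead."""
    column = np.asarray(column, dtype=float)
    missing = np.isnan(column)
    if missing.mean() >= MISSING_THRESHOLD:
        return None, {}
    known = column[~missing]
    # Hide a random share of the known values and score each candidate on them
    idx = np.random.default_rng(seed).permutation(known.size)
    n_test = max(1, int(holdout_frac * known.size))
    test, train = known[idx[:n_test]], known[idx[n_test:]]
    scores = {name: float(np.mean(np.abs(test - fit(train))))
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores

# Each candidate maps the visible values to one fill value (assumed interface)
candidates = {"mean": np.mean, "median": np.median}

rng = np.random.default_rng(1)
lot_frontage = rng.lognormal(mean=4.2, sigma=0.4, size=500)  # synthetic stand-in
lot_frontage[rng.random(500) < 0.15] = np.nan                # roughly 15% missing
best, scores = choose_imputer(lot_frontage, candidates)

mostly_missing = np.where(rng.random(500) < 0.6, np.nan, 1.0)  # roughly 60% missing
dropped, _ = choose_imputer(mostly_missing, candidates)
print(best, scores, dropped)
```

Scoring on values that were known but deliberately hidden is what makes the comparison between methods fair, since the true answers are available for exactly those rows.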

I'm also making a wrapper function for just about everything that integrates into an audit system, so I can track absolutely everything, maybe visually represent it as well. Who knows? If I capture enough of my interactions, I could even make my own Data Scientist Advisor model, based on my own audits collected over the years xD

I still have other methods of sensitivity analysis to learn, like SHAP values/graphs. For this dataset, I realized a little too late that when I was performing HPO, as well as permutation sensitivity analysis, I should have used 3 different training weights and measured the averages instead.
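One possible reading of "3 different training weights" is three independently trained models whose permutation importances are averaged. The sketch below takes that reading as an assumption and uses a plain least-squares model on synthetic data rather than the neural network from this project. All function names are illustrative, and `n_repeats=30` is only a guess at what n = 30 referred to.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance(predict, X_val, y_val, rng, n_repeats=30):
    """Mean increase in validation RMSE when each column is shuffled."""
    base = rmse(y_val, predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j] += rmse(y_val, predict(X_perm)) - base
    return importances / n_repeats

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Column 0 matters most, column 1 a little, column 2 is pure noise
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, y_train, X_val, y_val = X[:240], y[:240], X[240:], y[240:]

per_fit = []
for seed in range(3):  # three independently trained models (bootstrap resamples)
    boot = np.random.default_rng(seed).integers(0, len(X_train), len(X_train))
    predict = fit_linear(X_train[boot], y_train[boot])
    per_fit.append(permutation_importance(predict, X_val, y_val, rng))

mean_importance = np.mean(per_fit, axis=0)  # averaged across the three fits
spread = np.std(per_fit, axis=0)            # how much the fits disagree
print(mean_importance, spread)
```

A feature whose averaged importance is small relative to its spread across fits is a weaker candidate for keep/remove decisions than a single run would suggest.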

#

Hopefully these two separate frameworks can serve as useful inputs for actuarial functions, to make a suitable risk framework in the future

balmy sonnet
#

I tried everything I could to debug it. Is there anything else I can do? Any help would be greatly appreciated

nimble cedar
jovial coral
nimble cedar
jovial coral
#

Yeah, there are more powerful imputation strategies, like XGBoost and its variants. Framework #1 is focused on making sure everything is traceable and explainable

#

I did draw inspiration from how I managed the LotFrontage in this project, but I didn't test the adequacy in the way I noted above that I should have. My goal is to streamline these kinds of processes so I'm not repeating myself

(I should say my variation of the random forest model in the text file, it's literally the same, just the bagging happens per-node instead of per-tree)

jovial coral
# jovial coral I did draw inspiration from how I managed the LotFrontage in this project, but I...

Quite annoying that this model doesn't exist already =.= It's so simple: a decision tree with the typical histogram bins, using bootstrapping per node and taking the mode per node, to replicate the benefits of random forests but be ridiculously more explainable.

Before I decide to make a C++ module for this for efficiency, I suppose I should trial this and benchmark some results to see if it performs as well as expected

Edit: So this wasn't feasible, but there's a potential MoE approach, not for the typical performance optimization, but to literally increase accuracy while remaining interpretable. Performance was basically the same as a regular tree-based model. Gonna finish my frameworks, then gradually improve them by adding stuff like this idea. Still going to use a tree to assist in forming MoE candidates

swift moth
oak peak
jovial coral
oak peak
jovial coral
willow crow
#

Hello, I recently submitted my model's predictions to this competition and I got: Score: 0.13854

Is this score an RMSLE, RMSE, MSE, MAE or R2 score?

And is this a good score or a bad score?

Or do you need more information about my notebook?
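For reference, this competition's evaluation page describes the score as the RMSE between the logarithm of the predicted price and the logarithm of the observed sale price, so lower is better. A minimal sketch of that calculation follows; the function name and the three example prices are made up for illustration.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE computed on log prices rather than on raw prices."""
    log_diff = np.log(np.asarray(y_pred, dtype=float)) - np.log(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up prices, only to show the call
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
score = rmse_of_logs(actual, predicted)
print(score)
```

Because the error is measured on logs, it behaves like a relative error: being 10% off on a cheap house costs the same as being 10% off on an expensive one. By that reading, a score around 0.14 corresponds very roughly to a typical error of 14 to 15 percent of the sale price.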

timid hill
#

Hey, I'm new to ML, so this is probably a newbie question: I tried to use a random forest on the challenge's data, but it did not work so well. My guess is that you can't compare the prices of residential housing and commercial premises. So I split the data by the MSZoning feature and applied the random forest algorithm to each subset. For example, the data flagged Commercial by MSZoning has only 10 samples, so I think 81 features for 10 samples is far too many, and in fact about 20 of these 81 features have the same value across all 10 samples. So I tried removing these 20 features. But, and this is my question, the result of my random forest algorithm is worse when I remove these 20 useless features than with all 81. Is this overfitting? Do you have any idea why this odd result occurs?
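The column-removal step described above can be sketched in NumPy as follows. The function name and the tiny example array are hypothetical, and the column names are borrowed from the dataset only as labels for made-up values.

```python
import numpy as np

def drop_constant_columns(X, names):
    """Remove columns whose value is identical in every row of X."""
    X = np.asarray(X)
    # True where at least one row differs from the first row.
    # Note: NaN != NaN, so an all-NaN column counts as varying here.
    varies = (X != X[0]).any(axis=0)
    kept = [name for name, keep in zip(names, varies) if keep]
    return X[:, varies], kept

# Tiny made-up subset: 4 rows, 3 columns
X_subset = np.array([[1, 7, 0],
                     [2, 7, 0],
                     [3, 7, 1],
                     [4, 7, 0]])
X_reduced, kept_names = drop_constant_columns(X_subset, ["LotArea", "Street", "PoolArea"])
print(kept_names)
```

A constant column can never be used for a split, so dropping it mainly changes which features the forest samples at each node. With only 10 rows, it may be worth fixing the forest's random seed before comparing the two runs, so that a real effect can be told apart from run-to-run noise.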

dark bronze
dark bronze
jovial coral
ember portal
#

Hello, I have a question: is it necessary for me to understand every column in this project? There are so many columns.

fallen sandal
hollow jolt
north epoch
#

Hii

covert stag
#

Hi
I am new I just want to learn and do this

brittle trellis
#

Hi

naive cedar
#

Hi everyone.

sturdy cobalt
#

hi guys

azure robin
#

Heyy, tbh im kinda lost trying to do this basic competition rn.. my best entry is 0.14774 as of rn. How do you guys go about feature engineering?

#

I feel like that's what's been messing things up, and I'm not sure how I can add relevant features without just trial and error

random mural
#

would you guys appreciate a notebook baseline template which you can easily iterate on?

hexed pivot
#

Here is my pipeline overview and notebook:
https://www.kaggle.com/code/rommelsharma/adv-house-price-predictions-with-optuna-tuning

Pipeline Overview
Step 1: Load data & remove outliers
Step 2: Domain-aware ordinal encoding (quality scales)
Step 3: Missing value imputation (train stats only - no leakage)
Step 4: Feature engineering: area, age, quality*size crosses, pool features
Step 5: Drop near-constant columns (with protected set for rare signals)
Step 6: OrdinalEncoder on vocabulary union of train+test
Step 7: Neighbourhood target encoding (5-fold out-of-fold)
Step 8: Pairwise EDA plots vs SalePrice
Step 9: Multi-model 5-fold CV comparison
Step 10: Optuna hyperparameter tuning (9 params, 50 trials)
Step 11: Final XGBoost training with early stopping
Steps 12-13: RMSE progression + feature importance charts
Step 14: Save model + preprocessor for reuse
Step 15: Generate submission.csv

I have provided extensive documentation so that it is easy to understand. I hope you can fork the code and get better results. All the best.
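For readers unfamiliar with the out-of-fold target encoding mentioned in step 7, here is a generic NumPy sketch of the idea. It is not taken from the linked notebook; the function name, the fold logic, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def oof_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with the mean target of its category, computed only
    from rows in the other folds, so a row never sees its own target."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    encoded = np.empty(len(target))
    shuffled = np.random.default_rng(seed).permutation(len(target))
    for val_idx in np.array_split(shuffled, n_splits):
        train_mask = np.ones(len(target), dtype=bool)
        train_mask[val_idx] = False
        global_mean = target[train_mask].mean()  # fallback for unseen categories
        means = {c: target[train_mask & (categories == c)].mean()
                 for c in np.unique(categories[train_mask])}
        encoded[val_idx] = [means.get(c, global_mean) for c in categories[val_idx]]
    return encoded

# Synthetic stand-in data: three neighbourhood labels with different price levels
rng = np.random.default_rng(42)
neighbourhood = np.array(["NAmes", "CollgCr", "OldTown"] * 20)
level = {"NAmes": 11.8, "CollgCr": 12.2, "OldTown": 11.6}  # made-up log-price levels
log_price = np.array([level[n] for n in neighbourhood]) + rng.normal(scale=0.1, size=60)

encoded = oof_target_encode(neighbourhood, log_price)
print(encoded[:6])
```

Computing the category means from the other folds only is what keeps a row's own target out of its encoded feature, which would otherwise inflate cross-validation scores.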

humble rune
humble rune