Does anybody have some thoughts/advice they'd be willing to share in terms of how to preprocess the data before feeding it into a model? For instance, there are 57 unique car brands and 1897 models. It doesn't seem like a good idea to one-hot-encode these features. What would be a better method for preprocessing? It's been a while since I worked on a Kaggle competition dataset, so I need to jog my memory! 😁
#playground-series-s4e9
Hello
col_name unique_vals
brand 57
model 1897
fuel_type 7
engine 1117
transmission 52
ext_col 319
int_col 156
accident 2
clean_title 1
Yes, you are right. I am also thinking of something to tackle this.
Hello! The way I approached that column: if you check the data in the "model" column, you will see some specific keywords like "base", "premium", "limited" and so on. Perhaps you can create a new feature from those keywords, which reduces the number of categories, and then apply one-hot encoding.
For the "brand" column, if you check its value_counts you'll see that some rare categories might introduce noise (ofc that's my assumption), so you could decide on a threshold and, if a brand's occurrence count is below it, group it into another value, perhaps "others".
However I'd like to remind you that I'm not an expert, barely a junior DS myself 😅
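A rough sketch of the two ideas above (keyword extraction from "model" and grouping rare "brand" values into "others"), assuming pandas. The helper names `group_rare` and `extract_trim`, the threshold of 2, and the toy rows are all made up for illustration and are not taken from the competition data.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int, other: str = "others") -> pd.Series:
    """Replace categories that occur fewer than min_count times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

def extract_trim(model: pd.Series, keywords=("base", "premium", "limited")) -> pd.Series:
    """Label each model string with the first keyword it contains, else 'none'."""
    lowered = model.str.lower()
    trim = pd.Series("none", index=model.index)
    for kw in keywords:
        # Only fill rows that have not been labelled by an earlier keyword
        mask = lowered.str.contains(kw, regex=False, na=False) & (trim == "none")
        trim[mask] = kw
    return trim

# Toy data, invented for this example
df = pd.DataFrame({
    "brand": ["Ford", "Ford", "Ford", "BMW", "BMW", "Lotus"],
    "model": ["F-150 XLT", "Mustang Premium", "Explorer Base",
              "X5 Base", "M3 Limited", "Evora GT"],
})

df["brand_grouped"] = group_rare(df["brand"], min_count=2)
df["trim"] = extract_trim(df["model"])

# With far fewer categories, one-hot encoding becomes manageable
encoded = pd.get_dummies(df[["brand_grouped", "trim"]])
```

The threshold and the keyword list would need tuning against the real value counts; the ones here are placeholders.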
Did you participate in the competition?
I've joined but haven't submitted anything yet, still exploring & feature engineering the columns
Sorry for the slow response on my end. Thanks for your thoughts here. I did a very simple regression model as a baseline using 3 features, brand, model year, and mileage. I ended up doing a numerical encoding of the brand based on the average sale price. 1 being the cheapest models and 4 being the most expensive. I thought this would do a reasonable job at predicting, but my RMSE values are crap! I did practice doing a prediction of the test data and submitting it. I plan to try out some more advanced models and see what happens. I'd love to hear any breakthroughs anybody had. I'm enjoying working on this, though! 😃
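One possible way to build that kind of 1 to 4 brand encoding, sketched with pandas. The message does not say how the tiers were cut, so the quartile split via `pd.qcut`, the helper name `brand_price_tier`, and the toy prices are assumptions.

```python
import pandas as pd

def brand_price_tier(train: pd.DataFrame, n_tiers: int = 4) -> pd.Series:
    """Map each brand to a tier from 1 (cheapest) to n_tiers (most expensive)."""
    mean_price = train.groupby("brand")["price"].mean()
    # Rank first so that tied means cannot produce duplicate bin edges
    ranks = mean_price.rank(method="first")
    tiers = pd.qcut(ranks, q=n_tiers, labels=False) + 1
    return tiers.astype(int)

# Toy data, invented for this example
train = pd.DataFrame({
    "brand": ["A", "A", "B", "C", "D", "D"],
    "price": [5000, 7000, 15000, 30000, 90000, 110000],
})

tier_map = brand_price_tier(train)
train["brand_tier"] = train["brand"].map(tier_map)
```

The mapping should be computed on the training rows only and then applied to the test set with the same `.map` call; brands that never appear in training come out as NaN and need a fallback value.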
Are we supposed to discuss details like this? I've submitted something using numerical columns, model year and mileage only, the results are far from spectacular. I have some ideas but not sure how much I will manage in the next 10 days.
This is a playground challenge, so I think it's fine for us to discuss. Maybe I'm wrong...
Something I realized after digging into this data some more is that it was generated from a model trained on another dataset with 4009 samples. I think the model must have been crap, because it predicted a Ford F-150 has a sale price of $2.954M! A Dodge Ram also went for the same price. That is ridiculous! Kind of frustrating imho.
do you scale the price column when you train your model?
yep there are certain weird outliers in terms of price, I was struggling to understand what they have in common to get the same high price
anyway, I've done a small improvement from where I was 10 days ago, for me it is very much a playground to improve my knowledge of regression, and to understand how kaggle in general works.
My first Linear Regression model had some negative price predictions. So, what I did was do a log transform of the price data, train a model on that, and then on the predicted data I did the inverse of a log (ie exponential). That way none of my predictions were negative. My best result was with an xgboost model. I'm 1845 on the leaderboard. Not great, but not horrible either.
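A small sketch of the log-transform trick described above, using only NumPy and synthetic data. The exponential price curve, the noise level, and the 400k-mile test point are invented to make the effect visible; they say nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
mileage = rng.uniform(5_000, 150_000, size=200)
# Synthetic prices that decay roughly exponentially with mileage
price = 40_000 * np.exp(-mileage / 80_000) * rng.lognormal(0.0, 0.1, size=200)

# Design matrix: intercept plus mileage in thousands of miles
X = np.column_stack([np.ones_like(mileage), mileage / 1_000])

# Plain linear fit on price: extrapolated predictions can drop below zero
coef_raw, *_ = np.linalg.lstsq(X, price, rcond=None)

# Linear fit on log(price); inverting with exp always gives a positive price
coef_log, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)

# A very high-mileage car, far outside the training range
X_new = np.array([[1.0, 400.0]])
pred_raw = X_new @ coef_raw
pred_log = np.exp(X_new @ coef_log)

print("linear model:", pred_raw[0])
print("log model:   ", pred_log[0])
```

The second model is linear in log(price), so in price terms each extra mile multiplies the prediction by a constant factor instead of subtracting a constant amount.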
but log doesn't just guarantee a positive price, it also makes the relationship between the features and the price non-linear
which is probably correct in this case, but might not be in other cases with a positive-only target
Hello all. I just wanted to know: are late submissions evaluated on both the private set and the public set?
AFAIK they are evaluated on both, but it doesn't change your ranking. Someone can correct me if I'm wrong
That means I will get both the scores simultaneously?
that was my experience with another competition that had already ended, yes
Yes you are right. I am getting both the scores.
Hey everyone,
This is Harsh and I am currently working as an AI researcher and have experience in machine learning. I recently started participating in Kaggle and would really appreciate it if you could take a look at my notebook for the Used Car Regression competition and provide feedback. Here’s the link: https://www.kaggle.com/code/harshsharma1128/used-car-regression/notebook?scriptVersionId=199146028.
Any insights or suggestions would be incredibly helpful. Thanks in advance!
Am I allowed to post pictures here?
I am getting the error "Evaluation metric raised an unexpected error" while submitting my CSV file.
Can anyone please help me?
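The thread does not say what caused this error. One thing that might be worth ruling out is a malformed file, so here is a hedged sanity check that compares a submission against the sample submission. The helper name `check_submission` and the `id`/`price` column names are assumptions; in practice both frames would come from `pd.read_csv` on your own file and on the competition's sample submission.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, sample: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append(f"columns {list(sub.columns)} != expected {list(sample.columns)}")
    if len(sub) != len(sample):
        problems.append(f"{len(sub)} rows, expected {len(sample)}")
    if sub.isna().any().any():
        problems.append("missing values present")
    id_col = sample.columns[0]  # assumes the first column is the row id
    if id_col in sub.columns and set(sub[id_col]) != set(sample[id_col]):
        problems.append(f"values in '{id_col}' do not match the sample submission")
    return problems

# Toy frames standing in for the real CSV files
sample = pd.DataFrame({"id": [1, 2, 3], "price": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [1, 2, 3], "price": [21000.0, 8500.0, 43000.0]})
bad = pd.DataFrame({"id": [1, 2], "price": [21000.0, float("nan")]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

If the check comes back clean, the cause is somewhere else and this sketch will not find it.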
yes you can