#playground-series-s4e9


sick bramble
#

Does anybody have some thoughts/advice they'd be willing to share in terms of how to preprocess the data before feeding it into a model? For instance, there are 57 unique car brands and 1897 models. It doesn't seem like a good idea to one-hot-encode these features. What would be a better method for preprocessing? It's been a while since I worked on a Kaggle competition dataset, so I need to jog my memory! 😁

fast sky
#

Hello

worthy thunder
# sick bramble Does anybody have some thoughts/advice they'd be willing to share in terms of ho...

Hello! Here's how I approached that column: if you check the data in the "model" column, you'll see some specific keywords like "base", "premium", "limited" and so on. Perhaps you could create a new feature from those keywords to reduce the number of categories, and then apply one-hot encoding.
For the "brand" column, if you check its value_counts you'll see that some rare categories might introduce noise (ofc that's my assumption), so you could pick a threshold and, if a brand's occurrence count is below it, group it into another value, perhaps "others".

However I'd like to remind you that I'm not an expert, barely a junior DS myself 😅
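
A minimal sketch of the two ideas above, using only the standard library. The keyword list contains just the three keywords named in the message; the sample brands and models, the function names, and the threshold of 2 are made up for illustration and are not taken from the competition data.

```python
from collections import Counter

# Only the three keywords mentioned above; a real list would likely be longer.
TRIM_KEYWORDS = ("base", "premium", "limited")


def group_rare(values, min_count, other="others"):
    """Replace categories seen fewer than min_count times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def trim_level(model_name, keywords=TRIM_KEYWORDS, default="unknown"):
    """Return the first keyword found in the model string, else a default label."""
    lowered = model_name.lower()
    for keyword in keywords:
        if keyword in lowered:
            return keyword
    return default


# Made-up sample rows (hypothetical, not from the dataset).
brands = ["Ford", "Ford", "Ford", "BMW", "BMW", "Lotus", "Saab"]
models = ["F-150 Limited", "Mustang Premium", "Escape Base", "X5", "M3 Base", "Evora", "9-3"]

grouped_brands = group_rare(brands, min_count=2)
trims = [trim_level(m) for m in models]
```

Both outputs have far fewer distinct values than the raw columns, so one-hot encoding them stays manageable. The threshold is a judgment call and is probably worth checking against validation scores.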

sick bramble
#

Sorry for the slow response on my end. Thanks for your thoughts here. I did a very simple regression model as a baseline using 3 features, brand, model year, and mileage. I ended up doing a numerical encoding of the brand based on the average sale price. 1 being the cheapest models and 4 being the most expensive. I thought this would do a reasonable job at predicting, but my RMSE values are crap! I did practice doing a prediction of the test data and submitting it. I plan to try out some more advanced models and see what happens. I'd love to hear any breakthroughs anybody had. I'm enjoying working on this, though! 😃
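
One way the brand encoding described above could look, as a sketch: brands are ranked by average sale price and split into four tiers, 1 for the cheapest and 4 for the most expensive. The function name, the equal-sized tier split, and the sample prices are assumptions for illustration; the message does not say how the tier boundaries were actually chosen.

```python
from collections import defaultdict


def brand_price_tiers(brands, prices, n_tiers=4):
    """Map each brand to a tier from 1 (cheapest) to n_tiers by average price."""
    totals = defaultdict(lambda: [0.0, 0])  # brand -> [price sum, row count]
    for brand, price in zip(brands, prices):
        totals[brand][0] += price
        totals[brand][1] += 1
    means = {brand: total / count for brand, (total, count) in totals.items()}
    ranked = sorted(means, key=means.get)  # cheapest brand first
    return {brand: 1 + (rank * n_tiers) // len(ranked) for rank, brand in enumerate(ranked)}


# Made-up training rows (hypothetical values).
train_brands = ["Kia", "Kia", "Ford", "Ford", "BMW", "BMW", "Porsche", "Porsche"]
train_prices = [10000, 12000, 20000, 22000, 40000, 42000, 90000, 95000]

tiers = brand_price_tiers(train_brands, train_prices)
encoded = [tiers[b] for b in train_brands]
```

Because the tiers are derived from the target, it is safer to compute them on the training rows only and then look them up for validation and test rows, with a fallback such as `tiers.get(brand, 2)` for brands that never appeared in training.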

distant python
#

Are we supposed to discuss details like this? I've submitted something using numerical columns, model year and mileage only, the results are far from spectacular. I have some ideas but not sure how much I will manage in the next 10 days.

sick bramble
#

Something I realized after digging into this data some more is that it was generated from a model trained on another dataset with 4009 samples. I think the model must have been crap, because it predicted a Ford F-150 has a sale price of $2.954M! A Dodge Ram also went for the same price. That is ridiculous! Kind of frustrating imho.

distant python
#

Anyway, I've made a small improvement over where I was 10 days ago. For me it's very much a playground to improve my knowledge of regression and to understand how Kaggle works in general.

sick bramble
# worthy thunder do you scale the price column when you train your model?

My first Linear Regression model had some negative price predictions. So I log-transformed the price data, trained a model on that, and then applied the inverse of the log (i.e., the exponential) to the predictions. That way none of my predictions were negative. My best result was with an xgboost model. I'm 1845 on the leaderboard. Not great, but not horrible either.
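
The log-transform trick described above can be sketched with a one-feature least-squares fit. The mileage and price numbers are made up, and `fit_line` is a hypothetical helper written for this example rather than the model actually used.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: mileage in thousands of miles, price in dollars.
mileage = [10, 40, 80, 120, 200]
price = [60000, 35000, 18000, 9000, 2500]
new_mileage = 250

# Fit on the raw price: the line can cross zero for high mileage.
slope, intercept = fit_line(mileage, price)
raw_pred = slope * new_mileage + intercept

# Fit on log(price), then undo the transform with exp: always positive.
log_slope, log_intercept = fit_line(mileage, [math.log(p) for p in price])
log_pred = math.exp(log_slope * new_mileage + log_intercept)
```

Here `raw_pred` comes out negative while `log_pred` stays positive, since the exponential of any real number is greater than zero.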

distant python
#

But log doesn't just guarantee a positive price, it also changes the relationship into a non-linear one

#

which is probably correct in this case, but might not be in other cases with a positive-only target
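
To illustrate that point: a model that is linear in log(price) is multiplicative in price, so each extra unit of a feature scales the prediction by a constant factor instead of adding a constant amount. The slope and intercept below are hypothetical values picked for the demonstration.

```python
import math

# Hypothetical coefficients of a line fitted in log space.
slope, intercept = -0.02, 11.0


def predict_price(x):
    """Back-transform a log-space linear prediction to the price scale."""
    return math.exp(slope * x + intercept)


# One extra unit of x at a low value and at a high value.
ratio_low = predict_price(51) / predict_price(50)
ratio_high = predict_price(151) / predict_price(150)

# The same step expressed as a drop in dollars.
drop_low = predict_price(50) - predict_price(51)
drop_high = predict_price(150) - predict_price(151)
```

Both ratios equal `exp(slope)`, while the dollar drop shrinks as the price falls. That is a reasonable shape for car prices, but it is an assumption about the data, not something the log transform guarantees to be right.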

fast sky
#

Hello all. I just wanted to know: are late submissions evaluated on both the private and the public set?

pulsar crown
#

Hey everyone,

This is Harsh and I am currently working as an AI researcher and have experience in machine learning. I recently started participating in Kaggle and would really appreciate it if you could take a look at my notebook for the Used Car Regression competition and provide feedback. Here’s the link: https://www.kaggle.com/code/harshsharma1128/used-car-regression/notebook?scriptVersionId=199146028.

Any insights or suggestions would be incredibly helpful. Thanks in advance!

raven spruce
#

Am I allowed to post pictures here?

#

I am getting an "Evaluation metric raised an unexpected error" error when submitting my CSV file.

#

Can anyone please help me
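
The thread does not say what caused this error, so the following is only a sketch of sanity checks that may be worth running on a submission file before uploading: matching header, numeric and finite predictions, and the expected ids in the expected order. The column names `id` and `price` and the function name are assumptions; compare against the competition's sample submission file.

```python
import csv
import io
import math


def check_submission(submission_text, expected_ids, columns=("id", "price")):
    """Return a list of problems found in a submission CSV given as text."""
    reader = csv.DictReader(io.StringIO(submission_text))
    if tuple(reader.fieldnames or ()) != tuple(columns):
        return [f"header is {reader.fieldnames}, expected {list(columns)}"]

    id_col, target_col = columns
    problems = []
    seen_ids = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        seen_ids.append(row[id_col])
        try:
            value = float(row[target_col])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric {target_col}: {row[target_col]!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"line {line_no}: {target_col} is not finite")

    if seen_ids != [str(i) for i in expected_ids]:
        problems.append("ids differ from the expected ids (missing, extra, or reordered rows)")
    return problems


# Tiny in-memory examples (made-up values).
good = "id,price\n1,100.0\n2,200.5\n"
bad = "id,price\n1,100.0\n2,\n"

good_report = check_submission(good, expected_ids=[1, 2])
bad_report = check_submission(bad, expected_ids=[1, 2])
```

An empty list means none of these particular checks failed; it does not guarantee that the upload will be accepted.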
