Looking for improvement tips | Learn AI Together | Page 1

weak totem Jan 13, 2023, 8:45 PM

#

Looking for improvement tips

warm idol Jan 13, 2023, 9:55 PM

#

The first thing that comes to my mind is that you should keep the day of the year, or at least the month, from the date field.
Also, you should describe how you fed the values to the MLP, whether you normalized your data, things like that. And also the model architecture (number of layers, neurons per layer, etc.)

dire cove Jan 13, 2023, 10:46 PM

#

Start by making a simpler baseline. E.g. take just your hour variable, flatten it out into a one-hot encoding and do simple linear regression. Then make progressively better baselines.
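
A minimal sketch of such a baseline, using only the standard library. Least squares on one-hot hour columns (and nothing else) predicts the mean target for each hour, so the sketch computes those means directly. The function names and the toy numbers are made up for illustration and are not from the thread's dataset.

```python
from collections import defaultdict


def fit_hourly_mean(hours, targets):
    """Return {hour: mean target}. Least squares on one-hot hour
    columns alone yields exactly these per-hour means."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, y in zip(hours, targets):
        sums[hour] += y
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


def predict_hourly_mean(model, hours, default):
    """Look up the per-hour mean; fall back to `default` for unseen hours."""
    return [model.get(hour, default) for hour in hours]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): hour of day and energy produced.
hours = [0, 12, 0, 12, 6, 6]
energy = [0.0, 10.0, 0.0, 12.0, 3.0, 5.0]

model = fit_hourly_mean(hours, energy)
preds = predict_hourly_mean(model, hours, default=sum(energy) / len(energy))
score = r2_score(energy, preds)
```

Note that the score here is computed on the training rows only, so it says nothing about generalization. It is just a sanity check that the baseline runs.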

#

There are a lot of things that could go wrong: features can be encoded in a suboptimal format, or maybe the MLP model is not converging for some reason, or it's too complex and overfitting.

weak totem Jan 14, 2023, 8:15 AM

#

I tried LinearRegression but it's worse. I got an R2 score of 0.20.

weak totem Jan 14, 2023, 4:58 PM

#

Normalization was a good idea, it improved my MLP model from 0.58 to 0.70.
But I'm still looking for advice 🙂
Thanks everyone

warm idol Jan 14, 2023, 5:11 PM

#

your neural network has nowhere near enough neurons, try at least (100,100), with the normalization of course
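
For reference, "normalization" here most likely means standardizing each feature to zero mean and unit variance (what scikit-learn's StandardScaler does). A minimal standard-library sketch, assuming the statistics are fitted on the training rows only and then reused for the test rows. The feature values below are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Compute per-column mean and population std from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has std 0; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical features: temperature and pressure.
train = [[10.0, 1000.0], [20.0, 1010.0], [30.0, 1020.0]]
test = [[40.0, 1010.0]]

means, stds = fit_standardizer(train)       # fit on training data only
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the training statistics
```

Fitting the statistics on the full dataset, test rows included, would leak information into the evaluation.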

dire cove Jan 15, 2023, 12:09 AM

#

It's probably feature design, not the complexity of the network

#

The fact that just simple normalization helped kinda indicates that. On the other hand, the features are not that complex either haha. How is winddir encoded? Is it an angle in radians? Can you show how you feed the time of the day into the model? Is there any chance you can upload the data?

#

You are selecting only 4 features for LR. Do you one-hot encode time?

#

If linear regression works so poorly here, you have bigger problems than the number of neurons or hyperparameter selection. In physics there are a lot of non-linearities but they are generally monotonic and to a good degree linear. How granular is your data? How often do you sample the features and the output? E.g. is cloud cover sampled every hour same as the output? Asking to figure out how much noise you should expect.

#

I would be absolutely shocked if a linear model did a much, much worse job here than a 4-layer network.

weak totem Jan 15, 2023, 10:42 AM

#

Thank you for your comments

@dire cove .
It is true that I have some gaps in my skills in preparing my data.

For LR, I tested a lot of different feature combinations, but I didn't do as much research as for MLP. It was just to test a different algorithm.

I figured that there is a closer relationship between the hour of day and solar energy production than the month or year since I have a relatively small data period (March - November)

Winddir seems to be an angle in degrees

dire cove Jan 15, 2023, 10:52 AM

#

Google one-hot encoding. Use it for conditions and hour of the day.

#

It's very simple, really. You just remove the original column and add 24 (or 23) binary columns, each indicating one hour.
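
A minimal sketch of that encoding in plain Python. The function name and the `drop_first` flag are made up for illustration. The 23-column variant drops one hour as the reference level, which avoids perfectly collinear columns when the linear model also has an intercept.

```python
def one_hot_hour(hour, drop_first=False):
    """Encode an hour (0..23) as 24 indicator values, or 23 if the
    first column (hour 0) is dropped as the reference level."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    vec = [0] * 24
    vec[hour] = 1
    return vec[1:] if drop_first else vec


full = one_hot_hour(13)                       # 24 columns, a single 1 at index 13
reduced = one_hot_hour(13, drop_first=True)   # 23 columns, the 1 is at index 12
midnight = one_hot_hour(0, drop_first=True)   # all zeros: the reference hour
```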

#

I can take a look at the data tomorrow.

#

Also filling NA with 0 may be problematic if not done carefully

#

You can probably encode winddir as its sin and cos

#

Just make sure you are computing it correctly.
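
A minimal sketch of the sin/cos encoding, assuming winddir is in degrees as stated above. The most common mistake is passing degrees straight to `sin`/`cos`, which expect radians. The function name is hypothetical.

```python
import math


def encode_angle_deg(degrees):
    """Map an angle in degrees to (sin, cos) so that 359 and 1 end up close."""
    radians = math.radians(degrees % 360.0)  # convert first: math.sin expects radians
    return math.sin(radians), math.cos(radians)


north = encode_angle_deg(0.0)    # approximately (0, 1)
east = encode_angle_deg(90.0)    # approximately (1, 0)

# 359 and 1 degrees are far apart as raw numbers but close after encoding.
a = encode_angle_deg(359.0)
b = encode_angle_deg(1.0)
gap = math.dist(a, b)
```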

#

Encode all categorical columns via one-hot encoding.
Encode angles as sin and cos.
Intelligently replace missing values with something that depends on the context (e.g. for winddir as an angle, there is no "0" value meaning "missing").
Make sure you have a robust testing setup: ideally use cross-validation, and compute the mean and a 95% confidence interval just so you know how much you can be off.
Make sure a simple baseline works correctly.
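
The mean and 95% interval over cross-validation folds can be sketched as below. This assumes a normal approximation, which is optimistic with only a handful of folds (a Student t quantile would give a wider interval). The fold scores are hypothetical placeholders, not results from this thread.

```python
from statistics import NormalDist, mean, stdev


def mean_and_ci95(scores):
    """Return (mean, lower, upper) using a normal approximation."""
    m = mean(scores)
    sem = stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(0.975)           # about 1.96
    return m, m - z * sem, m + z * sem


# Hypothetical per-fold R2 scores, for illustration only.
fold_scores = [0.68, 0.72, 0.70, 0.66, 0.74]
avg, low, high = mean_and_ci95(fold_scores)
```

If two model variants have overlapping intervals, the difference between them may simply be noise.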

warm idol Jan 15, 2023, 11:09 AM

#

you should also probably one-hot encode the month to gain a bit of accuracy, as the amount of solar irradiance will depend on the month of the year as well

weak totem Jan 15, 2023, 11:12 AM

#

Thank you for all your ideas, I will try it all, I will come back to bring you the news!

warm idol Jan 15, 2023, 2:06 PM

#

did you increase the number of neurons?

weak totem Jan 15, 2023, 2:35 PM

#

Yes, I tried it yesterday when you suggested it (and again now). Unfortunately it had no huge impact (roughly 0.1 better).

weak totem Jan 15, 2023, 5:18 PM

#

Initially, someone around me told me not to shuffle my dataset. I turned shuffling on and got 0.97. I made a submission on Kaggle and got a score of 0.6. This looks like overfitting, but it seems odd to me, since GridSearch applies cross-validation and cross-validation should allow me to detect overfitting.

dire cove Jan 15, 2023, 7:49 PM

#

weak totem Initially, someone around me told me not to shuffle my data set. I activated the...

They might have told you that because you have a time series dataset. If you shuffle, the model sees data from every month and validates well until you give it a completely new time period.
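
A minimal sketch of a time-ordered alternative to shuffling, assuming the rows are sorted by timestamp. Each split trains on the past and validates on the block that follows, similar in spirit to scikit-learn's TimeSeriesSplit. The function name is made up for illustration.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Return (train_idx, test_idx) pairs where every test block
    comes strictly after the rows used for training."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested splits")
    splits = []
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last test block absorbs any leftover rows.
        test_end = n_samples if k == n_splits else train_end + fold
        splits.append((list(range(train_end)), list(range(train_end, test_end))))
    return splits


splits = forward_chaining_splits(n_samples=10, n_splits=3)
# First split: train on rows 0-1, validate on rows 2-3.
```

With shuffled folds, rows from neighbouring hours land on both sides of the split, so the validation score can look much better than the score on a genuinely new period.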

#

Hmm but you have all 12 months anyway

weak totem Jan 15, 2023, 7:53 PM

#

It seems that my model was overfitting. My GridSearch recommended the lbfgs solver. I manually changed to the sgd solver and finally got to a 0.8 score! (with shuffle)

dire cove Jan 15, 2023, 7:55 PM

#

Lol. You must still be overfitting hard if your results are much higher than the validation on Kaggle.

#

Do CV manually

#

I don't know what GridSearch does, but you need to have confidence in your performance measures first.

#

Make a good CV setup, make a good simple baseline. Forget GridSearch and MLP for the baseline or any other advanced functionality (like feature selection).
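
A minimal sketch of a manual CV loop with the simplest possible baseline (always predict the training mean), in plain Python. The folds are contiguous and unshuffled. The function names, the choice of mean absolute error, and the toy targets are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    bounds = [i * n_samples // k for i in range(k + 1)]
    for start, stop in zip(bounds, bounds[1:]):
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_mean_baseline(ys, k):
    """Score a 'predict the training mean' baseline on each fold."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        guess = sum(ys[i] for i in train) / len(train)  # fit on train only
        y_test = [ys[i] for i in test]
        scores.append(mean_absolute_error(y_test, [guess] * len(test)))
    return scores


# Toy targets (hypothetical).
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = cross_validate_mean_baseline(ys, k=3)
```

Any real model should beat this baseline on every fold. If it does not, the problem is in the pipeline rather than in the hyperparameters.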
