#Looking for improvement tips
The first thing that comes to mind is that you should keep the day of the year, or at least the month, from the date field.
Also, you should describe how you fed the values to the MLP, whether you normalized your data, things like that. And also the model architecture (number of layers, neurons per layer, etc.)
Start by making a simpler baseline. E.g. take just your hour variable, flatten it out into a one-hot encoding and do a simple linear regression. Then make progressively better baselines.
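For what it's worth, here is a minimal sketch of that kind of baseline with scikit-learn. The data is synthetic (a daytime hump plus noise), so the names `hour` and `production` and the numbers are assumptions, not the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data (an assumption): production follows
# a daytime hump between 06:00 and 18:00, plus a little noise.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=2000)
production = np.clip(np.sin(np.pi * (hour - 6) / 12), 0, None)
production = production + rng.normal(0, 0.05, size=hour.size)

# One binary column per hour, then ordinary least squares.
baseline = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
X = hour.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
scores = cross_val_score(baseline, X, production, cv=5, scoring="r2")
print("baseline R2 per fold:", np.round(scores, 3))
```

With one-hot hours, the linear model effectively learns one average value per hour, which is a reasonable floor to compare the MLP against.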
There are a lot of things that could go wrong: features can be encoded in a suboptimal format, or maybe the MLP model is not converging for some reason, or it's too complex and overfitting.
I tried LinearRegression but it's worse. I got an R2 score of 0.20.
Normalization was a good idea, it improved my MLP model from 0.58 to 0.70.
But I'm still looking for advice 🙂
Thanks everyone
your neural network has nowhere near enough neurons, try at least (100,100), with the normalization of course
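A rough sketch of what "(100,100) with normalization" could look like in scikit-learn, assuming the model is an `MLPRegressor`. The two features and the target below are synthetic placeholders, chosen only because they live on very different scales.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder features on very different scales (an assumption).
rng = np.random.default_rng(0)
n = 400
pressure = rng.normal(1013, 10, size=n)
cloudcover = rng.uniform(0, 1, size=n)
X = np.column_stack([pressure, cloudcover])
y = 1.0 - cloudcover + 0.01 * (pressure - 1013) + rng.normal(0, 0.05, size=n)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# The scaler is fitted on the training rows only and reused at predict time,
# so no statistics from the test rows leak into training.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("held-out R2:", round(model.score(X_test, y_test), 3))
```

Putting the scaler inside the pipeline also means cross-validation refits it on each training fold.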
It's probably feature design, not the complexity of the network
The fact that just simple normalization helped kinda indicates that. On the other hand, the features are not that complex either haha. How is winddir encoded? Is it an angle in radians? Can you show how you feed the time of the day into the model? Is there any chance you can upload the data?
You are selecting only 4 features for LR. Do you one-hot encode the time?
If linear regression works so poorly here, you have bigger problems than the number of neurons or hyperparameter selection. In physics there are a lot of non-linearities but they are generally monotonic and to a good degree linear. How granular is your data? How often do you sample the features and the output? E.g. is cloud cover sampled every hour same as the output? Asking to figure out how much noise you should expect.
I would be absolutely shocked if a linear model did a much, much worse job here than a 4-layer network.
Thank you for your comments
@dire cove .
It is true that I have some gaps in my skills in preparing my data.
For LR, I tested a lot of different feature combinations, but I didn't do as much research as for the MLP. It was just to test a different algorithm.
I figured that there is a closer relationship between the hour of day and solar energy production than the month or year since I have a relatively small data period (March - November)
Winddir seems to be an angle in degrees
Google one-hot encoding. Use it for conditions and hour of the day.
It's very simple, really. You just remove the original column and add 24 (or 23) binary columns, each indicating one hour.
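In pandas that could look roughly like this. The tiny frame and the `cloudcover` column are made up for illustration; only the `hour` handling matters.

```python
import pandas as pd

# Tiny made-up frame; column names are assumptions.
df = pd.DataFrame({"hour": [0, 5, 12, 23], "cloudcover": [0.1, 0.4, 0.9, 0.2]})

# Declare all 24 possible hours so unseen hours still get a column.
df["hour"] = pd.Categorical(df["hour"], categories=list(range(24)))

# 24 indicator columns; the original "hour" column is dropped.
encoded = pd.get_dummies(df, columns=["hour"], prefix="hour")

# 23 indicator columns: hour_0 becomes the implicit reference level.
encoded_23 = pd.get_dummies(df, columns=["hour"], prefix="hour", drop_first=True)

print(encoded.shape, encoded_23.shape)
```

The 23-column variant avoids perfectly collinear columns when the linear model also fits an intercept.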
I can take a look at the data tomorrow.
Also filling NA with 0 may be problematic if not done carefully
You can probably encode winddir as its sin and cos
Just make sure you are computing it correctly.
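A small sketch of the sin/cos encoding, assuming winddir is in degrees (so it has to be converted to radians first). The function name `encode_angle_degrees` is just illustrative.

```python
import numpy as np

def encode_angle_degrees(degrees):
    """Return (sin, cos) of angles given in degrees."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    return np.sin(radians), np.cos(radians)

winddir = [0, 90, 180, 270, 359]
wind_sin, wind_cos = encode_angle_degrees(winddir)

# 359 degrees and 0 degrees end up close together, unlike the raw values.
print(np.round(wind_sin, 3))
print(np.round(wind_cos, 3))
```

The common mistake is passing degrees straight into `np.sin`, which expects radians.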
- encode all categorical columns via one-hot encoding
- encode angles as sin and cos
- intelligently replace missing values with something that depends on the context (e.g. for winddir as an angle, there is no "0" value)
- make sure you have a robust testing setup - ideally use cross-validation, compute the mean and a 95% confidence interval just so you know how much you can be off.
- make sure a simple baseline works correctly
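For the mean and 95% confidence interval over cross-validation folds, one possible sketch uses a Student's t interval from SciPy. The fold scores below are invented placeholders, and since fold scores are not truly independent, the interval is only a rough guide.

```python
import numpy as np
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    """Mean of the fold scores with a two-sided Student's t interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(scores.size)  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=scores.size - 1)
    return mean, mean - t_crit * sem, mean + t_crit * sem

# Invented fold scores, for illustration only.
fold_scores = [0.62, 0.70, 0.66, 0.74, 0.68]
mean, low, high = mean_with_ci(fold_scores)
print(f"R2 = {mean:.3f} (95% CI {low:.3f} to {high:.3f})")
```

If two models have overlapping intervals, the difference between them is probably within the noise.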
you should also probably one-hot encode the month to gain a bit of accuracy, as the amount of solar irradiance will depend on the month of the year as well
Thank you for all your ideas, I will try it all, I will come back to bring you the news!
did you increase the number of neurons?
Yes, I tried it yesterday when you suggested it (and again now), unfortunately it had no huge impact (roughly 0.1 better).
Initially, someone around me told me not to shuffle my data set. I enabled shuffling and got a score of 0.97. I made a submission on Kaggle and got a score of 0.6. This looks like overfitting, but it seems odd to me, since GridSearch applies cross-validation and cross-validation should let me detect overfitting.
They might have told you that because you have a time series dataset. If you shuffle, the model sees data from every month and validates well until you give it a completely new time period.
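To make the shuffle issue concrete, here is a small sketch comparing a shuffled `KFold` with scikit-learn's `TimeSeriesSplit`. It assumes rows are sorted by timestamp; `leaks_future` is a hypothetical helper, not a library function.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_samples = 20  # rows assumed to be sorted by timestamp
X = np.arange(n_samples).reshape(-1, 1)

def leaks_future(train_idx, test_idx):
    """True if any training row comes after the earliest test row."""
    return train_idx.max() > test_idx.min()

shuffled = KFold(n_splits=4, shuffle=True, random_state=0)
chronological = TimeSeriesSplit(n_splits=4)

shuffled_leaks = [leaks_future(tr, te) for tr, te in shuffled.split(X)]
chrono_leaks = [leaks_future(tr, te) for tr, te in chronological.split(X)]

print("shuffled KFold trains on future rows:", shuffled_leaks)
print("TimeSeriesSplit trains on future rows:", chrono_leaks)
```

With shuffling, neighbouring hours of the same day land in both train and validation, so the validation score can look far better than the score on a genuinely new period.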
Hmm but you have all 12 months anyway
It seems that my model was overfitting. My grid search recommended the lbfgs solver. I manually changed to the sgd solver and finally got to a score of 0.8! (with shuffle)
Lol. You must still be overfitting hard if your results are much higher than the validation on Kaggle.
Do CV manually
I don't know what GridSearch does, but you need to have confidence in your performance measures first.
Make a good CV setup, make a good simple baseline. Forget GridSearch and MLP for the baseline or any other advanced functionality (like feature selection).
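A bare-bones sketch of a manual CV loop with two simple baselines, using only NumPy. The data is synthetic and the helper names (`manual_cv`, `mean_baseline`, `linear_baseline`) are made up for illustration. The folds are contiguous and unshuffled; for strictly chronological evaluation you would train only on rows before each test block.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination on one fold."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def manual_cv(X, y, fit_predict, n_folds=5):
    """Score fit_predict on contiguous, unshuffled folds."""
    all_idx = np.arange(len(y))
    scores = []
    for test_idx in np.array_split(all_idx, n_folds):
        train_idx = np.setdiff1d(all_idx, test_idx)
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(r2(y[test_idx], y_pred))
    return np.array(scores)

def mean_baseline(X_train, y_train, X_test):
    """Always predict the training mean."""
    return np.full(len(X_test), y_train.mean())

def linear_baseline(X_train, y_train, X_test):
    """Least-squares fit with an intercept column."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ coef

# Synthetic data (an assumption): a noisy straight line.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=200)

print("mean baseline R2:", manual_cv(X, y, mean_baseline).round(3))
print("linear model R2:", manual_cv(X, y, linear_baseline).round(3))
```

The mean baseline should score around zero or below. Any real model that cannot beat it on every fold points to a problem in the features or the evaluation rather than in the hyperparameters.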