#Looking for improvement tips

weak totem
#

Looking for improvement tips

warm idol
#

The first thing that comes to my mind is that you should keep the day of the year, or at least the month, from the date field.
Also, you should describe how you fed the values to the MLP, whether you normalized your data, things like that. And also the model architecture (number of layers, neurons per layer, etc.)

dire cove
#

Start by making a simpler baseline. E.g. take just your hour variable, flatten it out into a one-hot encoding and fit a simple linear regression. Then make progressively better baselines.
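
A minimal sketch of such a baseline, assuming scikit-learn and pandas. The data here is synthetic and the column names `hour` and `energy` are made up for illustration; they are not the original poster's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: output peaks around midday.
rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=2000)
energy = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None) + rng.normal(0, 0.1, size=2000)
df = pd.DataFrame({"hour": hours, "energy": energy})

# One-hot encode the hour: one binary column per hour of the day.
X = pd.get_dummies(df["hour"], prefix="hour").astype(float)
y = df["energy"]

# Baseline: plain linear regression, scored with 5-fold cross-validated R2.
baseline_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"baseline R2: {baseline_scores.mean():.2f}")
```

Any later model then has a concrete number to beat.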

#

There are a lot of things that could go wrong: features can be encoded in a suboptimal format, or maybe the MLP model is not converging for some reason, or it's too complex and overfitting.

weak totem
#

I tried LinearRegression, but it's worse. I got an R2 score of 0.20.

weak totem
#

Normalization was a good idea, it improved my MLP model from 0.58 to 0.70.
But I'm still looking for advice 🙂
Thanks everyone

warm idol
#

your neural network has nowhere near enough neurons, try at least (100,100), with the normalization of course
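
A minimal sketch of that setup, assuming the model is scikit-learn's `MLPRegressor` (the thread never names the library, though the GridSearch and solver mentions suggest it). The data and the feature meanings in the comments are made up for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 24, 500),      # hypothetical hour of day
    rng.uniform(0, 100, 500),     # hypothetical cloud cover in percent
    rng.uniform(900, 1100, 500),  # hypothetical pressure
])
y = np.sin(X[:, 0] / 24 * np.pi) * (1 - X[:, 1] / 100)

# Scaling lives inside the pipeline, so it is fit on training data only.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500, random_state=0),
)
model.fit(X, y)
print(f"training R2: {model.score(X, y):.2f}")
```

Keeping the scaler in the pipeline also means it is refit per fold when the pipeline is passed to cross-validation.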

dire cove
#

It's probably feature design, not the complexity of the network

#

The fact that just simple normalization helped kinda indicates that. On the other hand, the features are not that complex either haha. How is winddir encoded? Is it an angle in radians? Can you show how you feed the time of the day into the model? Is there any chance you can upload the data?

#

You are selecting only 4 features for LR. Do you one-hot encode time?

#

If linear regression works so poorly here, you have bigger problems than the number of neurons or hyperparameter selection. In physics there are a lot of non-linearities but they are generally monotonic and to a good degree linear. How granular is your data? How often do you sample the features and the output? E.g. is cloud cover sampled every hour same as the output? Asking to figure out how much noise you should expect.

#

I would be absolutely shocked if a linear model did a much, much worse job here than a 4-layer network.

weak totem
#

Thank you for your comments

@dire cove .
It is true that I have some gaps in my skills in preparing my data.

For LR, I tested a lot of different feature combinations, but I didn't do as much research as for the MLP. I just wanted to test a different algorithm.

I figured that there is a closer relationship between the hour of day and solar energy production than the month or year since I have a relatively small data period (March - November)

Winddir seems to be an angle in degrees

dire cove
#

Google one-hot encoding. Use it for conditions and hour of the day.

#

It's very simple, really. You just remove the original column and add 24 (or 23) binary columns, each indicating one hour.
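
For illustration, a short sketch using pandas `get_dummies`, assuming the hour is stored as an integer column named `hour` (a made-up name, as is `temp`):

```python
import pandas as pd

# Hypothetical frame with an integer hour column (0-23).
df = pd.DataFrame({"hour": list(range(24)) * 2, "temp": [10.0] * 48})

# 24 columns: one indicator per hour; the original column is dropped.
full = pd.get_dummies(df, columns=["hour"], prefix="hour", dtype=float)

# 23 columns: drop the first level to avoid redundancy with an intercept.
reduced = pd.get_dummies(df, columns=["hour"], prefix="hour", drop_first=True, dtype=float)

# Indicator counts, excluding the untouched "temp" column.
print(full.shape[1] - 1, reduced.shape[1] - 1)
```

The 23-column variant matters mostly for linear models with an intercept, where the 24th column is implied by the others.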

#

I can take a look at the data tomorrow.

#

Also filling NA with 0 may be problematic if not done carefully

#

You can probably encode winddir as its sin and cos

#

Just make sure you are computing it correctly.
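
A small sketch of the sin/cos encoding, assuming `winddir` is in degrees as mentioned earlier in the thread. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical wind directions in degrees.
df = pd.DataFrame({"winddir": [0.0, 90.0, 180.0, 359.0]})

# np.sin / np.cos expect radians, so convert from degrees first.
radians = np.deg2rad(df["winddir"])
df["winddir_sin"] = np.sin(radians)
df["winddir_cos"] = np.cos(radians)

# 359 degrees and 0 degrees now sit close together, unlike the raw angle.
gap = np.hypot(
    df.loc[3, "winddir_sin"] - df.loc[0, "winddir_sin"],
    df.loc[3, "winddir_cos"] - df.loc[0, "winddir_cos"],
)
print(f"distance between 0 and 359 degrees after encoding: {gap:.4f}")
```

Skipping the degrees-to-radians conversion is the usual way to compute this incorrectly.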

#

  1. encode all categorical columns via one-hot encoding
  2. encode angles as sin and cos
  3. intelligently replace missing values with something that depends on the context (e.g. for winddir as an angle, there is no "0" value)
  4. make sure you have a robust testing setup - ideally use cross-validation, compute the mean and a 95% confidence interval just so you know how much you can be off.
  5. make sure a simple baseline works correctly
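
As an illustration of point 3, here is one possible way to handle missing values with pandas. The column names and the choice of median fill are assumptions for the sketch, not something the thread prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "cloudcover": [10.0, np.nan, 80.0, 50.0],
    "winddir": [350.0, 10.0, np.nan, 180.0],
})

# Keep a flag so the model can tell imputed rows from measured ones.
df["cloudcover_missing"] = df["cloudcover"].isna().astype(float)
# For a bounded quantity, the median is a more neutral fill than 0.
df["cloudcover"] = df["cloudcover"].fillna(df["cloudcover"].median())

# For an angle, 0 means "north", not "unknown". After sin/cos encoding,
# (0, 0) is not a valid direction, so it can stand in for "missing".
radians = np.deg2rad(df["winddir"])
df["winddir_sin"] = np.sin(radians).fillna(0.0)
df["winddir_cos"] = np.cos(radians).fillna(0.0)
df = df.drop(columns="winddir")
print(df)
```

In a real setup the fill value should be computed on the training fold only, so that no information from the test fold leaks in.
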
warm idol
#

you should also probably one-hot encode the month to gain a bit of accuracy, as the amount of solar irradiance depends on the month of the year as well

weak totem
#

Thank you for all your ideas. I will try them all and come back with news!

warm idol
#

did you increase the number of neurons?

weak totem
#

Yes, I tried it yesterday when you suggested it (and again now). Unfortunately it had no huge impact (about 0.1 better).

weak totem
#

Initially, someone around me told me not to shuffle my dataset. I enabled shuffling and got a score of 0.97. I made a submission on Kaggle and got a score of 0.6. This looks like overfitting, but it seems odd to me, since GridSearch applies cross-validation, and cross-validation should let me detect overfitting.
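
One possible explanation, not confirmed anywhere in this thread, is that shuffled folds on time-ordered data let the model interpolate between neighbouring timestamps, which inflates the cross-validation score. The toy sketch below uses scikit-learn with a made-up random-walk series and a nearest-neighbour model purely to show the effect; it says nothing about the actual dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Toy series: a slowly drifting target indexed by time (a random walk).
rng = np.random.default_rng(0)
t = np.arange(1000, dtype=float).reshape(-1, 1)
y = np.cumsum(rng.normal(size=1000))

model = KNeighborsRegressor(n_neighbors=5)

# Shuffled folds: each test point has training neighbours just before and after it.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = cross_val_score(model, t, y, cv=shuffled_cv, scoring="r2")

# Time-ordered folds: every test point lies strictly after the training data.
ordered = cross_val_score(model, t, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")

print(f"shuffled R2: {shuffled.mean():.2f}, time-ordered R2: {ordered.mean():.2f}")
```

If the Kaggle test set covers a different time period than the training data, the time-ordered score is likely the more honest estimate.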

dire cove
#

Hmm but you have all 12 months anyway

weak totem
#

It seems that my model was overfitting. My grid search recommended the lbfgs solver. I manually changed to the sgd solver and finally got a score of 0.8! (with shuffle)

dire cove
#

Lol. You must still be overfitting hard if your results are much higher than your score on Kaggle.

#

Do CV manually
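
A rough sketch of a manual cross-validation loop with scikit-learn, reporting the mean R2 and an approximate 95% interval. The data is synthetic, and the interval uses a simple normal approximation over fold scores, which is an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

scores = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    # Fit on the training fold only, then score on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

scores = np.array(scores)
mean = scores.mean()
# Rough 95% interval: mean +/- 1.96 standard errors (normal approximation).
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"R2 = {mean:.3f} +/- {half_width:.3f}")
```

Writing the loop by hand makes it explicit which rows are used for fitting and which for scoring, which is harder to see inside a grid search.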

#

I don't know what GridSearch does, but you need to have confidence in your performance measures first.

#

Make a good CV setup, make a good simple baseline. Forget GridSearch and MLP for the baseline or any other advanced functionality (like feature selection).