#predict-energy-behavior-of-prosumers
yea. so stuck in the DEFCON CTF and came here to refresh a bit..
nothing better than a good old-fashioned time series project to cheer up the mood
I like tabular datasets and forecasting problems
Time series. Perfect.
I used to predict electrical load using an LSTM, and the accuracy was good
Would love to see some good NN based solutions. gradient boosting is so dominant in the public notebooks
I don't like NNs so much on this kind of time series. I have the feeling there is not enough data / too many subtleties to get something relevant
I usually don't like NNs for time series, but I have a feeling this one can be different. Enough feature interaction for a NN to get an edge
I got a super newb question
I downloaded the project and got it set up in VSCode.
I tried running the enefit-xgboost-start notebook
however first line fails
Looking in links: /kaggle/input/xgboost-python-package/
WARNING: Location '/kaggle/input/xgboost-python-package/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement xgboost (from versions: none)
ERROR: No matching distribution found for xgboost
anyone know what I am missing?
i tried downloading the package as a whole and placing it in that directory but still same issue
you have xgboost installed in your local python ?
for now I am trying to focus on a clean framework to do feature engineering on the time series. I'd also like to try some corrections based on "physical" behavior, like applying a production profile or stuff like this
What is a production profile?
stuff like the average production for a given day
which is different from month to month. For example the chart above is from the summer (production starts early in the morning and ends late in the evening; logic: it's sunny longer).
In winter the profile is different.
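A production profile as described above can be sketched with a pandas groupby. This is only an illustration on synthetic data: the column names `datetime` and `production`, and the triangular daylight curve, are assumptions made up for the example, not the competition's schema.

```python
import numpy as np
import pandas as pd

# Synthetic hourly production for one year (hypothetical data, not the competition's).
idx = pd.date_range("2022-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
hour = idx.hour.to_numpy()
month = idx.month.to_numpy()
# Daylight window around noon is wider in summer than in winter.
half_width = 4 + 4 * np.sin(np.pi * (month - 1) / 11)
production = np.clip(1 - np.abs(hour - 12) / half_width, 0, None)
df = pd.DataFrame({"datetime": idx, "production": production})

# The profile: mean production for each (month, hour) pair.
df["month"] = df["datetime"].dt.month
df["hour"] = df["datetime"].dt.hour
profile = df.groupby(["month", "hour"])["production"].mean().unstack("hour")

# Attach the profile value back to every row so it can be used as a feature.
df["profile"] = df.groupby(["month", "hour"])["production"].transform("mean")
```

In a real pipeline the profile would have to be computed only from data available before the day being predicted, otherwise it leaks the future.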
Ah I see. Thx for the explanation, makes a lot of sense
That is what I was missing. xgboost install. go figure.
I am thinking about the same thing. A handy option for encoding cyclical information is to use the cos() and sin() done in many public notebooks. I am using that too since it's convenient, but don't feel like that's the best option.
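For reference, the sin/cos encoding mentioned here usually looks something like the sketch below. The helper name `add_cyclical` and the `hour` column are made up for the example.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    """Return a copy of df with sin/cos encodings of a cyclical integer column."""
    angle = 2 * np.pi * df[col] / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

hours = pd.DataFrame({"hour": np.arange(24)})
encoded = add_cyclical(hours, "hour", period=24)

def circle_distance(a, b):
    """Euclidean distance between two hours in the (sin, cos) plane."""
    cols = ["hour_sin", "hour_cos"]
    diff = encoded.loc[a, cols].to_numpy(dtype=float) - encoded.loc[b, cols].to_numpy(dtype=float)
    return float(np.linalg.norm(diff))
```

With this encoding hour 23 and hour 0 are as close as hour 0 and hour 1, which the raw integer does not express. Note that tree models split on one feature at a time, so they see the sin and cos columns separately.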
using the Fourier transform is something else here I think, no? Using profiles seems to improve my score a bit
I'm just at the beginning of my feature engineering. I spent a bit of time building a robust framework that I can easily carry over from training to inference with the submission API
edit: Actually nvm, did not improve much for now.
IDK, I feel like Fourier is similar to sin/cos. I am thinking of something more flexible, like representation learning... That's why I mentioned NN. I haven't tried this idea though.... I feel like I am giving away some secret sauce
.
Is it allowed to train a model locally and then upload it to a kaggle notebook as a "dataset"? or do I have to copy-paste my local code to a kaggle notebook and train from scratch?
yes it's allowed, and it's even recommended
you can also train a model in a kaggle notebook, and access it from another kaggle notebook. When you click on "add data" in the right panel, you have an option to select the output of a notebook.
To use it, you simply save the files you want to pass on from the notebook where you train your model (for example with pickle), and save the notebook
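A minimal sketch of the save/load step with pickle. The dict stands in for a trained model, and a temporary directory stands in for the notebook output folder (on Kaggle that is typically /kaggle/working). Where the output gets mounted in the second notebook depends on how it was attached, so that path is left out here.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model; any picklable object works the same way.
model = {"coef": [2.0], "intercept": 6.0}

# Training notebook: write the model into the notebook's output directory.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Inference notebook: read it back from wherever the output is mounted.
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

Pickled estimators are sensitive to library versions, so the training and inference environments should match as closely as possible.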
wait... is it always reading the latest version of the notebook output, or it has to be pinned to the version when you added the notebook? Can you bypass the internet switch with this?
Damn PTSD from security hackathon..
it's reading the latest version until you save your inference notebook
but to evaluate with updated test data, the notebook will be rerun... the question is, which version can the notebook read when it is rerun during future inference
I posted the same question in the discussion, it may be a stupid question....
ah no !
So basically, you run a model in notebook A and save it as version 1
You run the inference notebook, it runs with version 1 of notebook A.
Later you change notebook A to version 2.
As long as you don't save the inference notebook again, it will stay on version 1 of notebook A. At least that's how I understand it
the internet-off rule is there to stop people from building functions that could leak the test set by sending it remotely. Without it, you could easily save the hidden data to a remote bucket for example. But as long as the notebook is not connected to the internet, there cannot be any exchange of data with the outside
that's my concern. The default setting of importing Kaggle dataset is "Pinned to the latest version"
not sure if that "latest" is as of when the notebook was saved, or as of when the notebook is run
also, if you want to make sure there are no problems like this, you can always fork a notebook when you want to make a change
It's probably illegal to use leaked information, even if feasible. just some security concern because of the CTF PTSD....
yeah but with the no internet policy, there is no way data is leaked anyway.
A thing that is sometimes done by kagglers, when the test dataset is small, is probing
or there are some data leak techniques, like training a model in the inference notebook with the test set added somehow - for example when you fit a PCA, it can sometimes be useful.
Not here anyway, because submitting through the API guarantees no leak from the future
How about this though: "For example, in late April 2024, I can collect the real weather (much more accurate than just the forecast) info throughout the test period, and update the private dataset. The submitted notebook needs to be rerun to evaluate on the updated test anyway. If the submitted notebook can read the latest dataset, then it can "predict" using the leaked weather info."
that's why once a notebook is saved, it is bound to the version it was saved with
but you can try to confirm with the admins
That's exactly what I am going to do later...
but we cannot here, because the data is given and inferred day by day, guaranteeing no leak
of course you can, you can log the revealed target
otherwise you can't engineer time lag features
that's not a leak in that case, you use past data
I guess you are already doing it, no? Otherwise you could not get such a high score, as good features include past targets
hmm, yea. I meant I will probably retrain the model with revealed data, not leaked data
not right now. Currently I am focusing on things that won't prolong the development cycle too much
on my side I prefer to focus first on a robust framework to build features without leaking the future by mistake
problem: it's much slower to try new stuff
advantage: very easy to plug into inference
i'm thinking of implementing a custom cache to calculate the rolling features faster, but i'm a bit lazy atm aha
yea, a big part of this competition is pandas data processing technique...
getting rid of pandas can considerably accelerate the preprocessing
exactly, but I am too lazy to do that now
yea, just learned about that. pretty interesting
i am using native cudf personally, did not try the new way with
%load_ext cudf.pandas
import pandas as pd
internet off, can't install/use cudf during inference. lol...
maybe learning the polars package is the way to go
cudf is preinstalled on kaggle kernels
i am using it
you have two ways to solve your problem:
- Make a dataset with the package which allows you to do %load_ext cudf.pandas (if nobody has done it yet). You can then load that package in your inference notebook and pip install it without internet
- Use cudf without the %load_ext option. In that case you will have to use the cudf functions directly, which are more or less the same as in pandas (you can do the rolling, merging, etc. all the same. The complex stuff comes with custom lambdas and .apply)
ah... so I just need to use cudf directly, not %load_ext cudf.pandas magic
if you want to use the pandas magic (might be convenient in some cases), you need to package the repo in a kaggle dataset and load it in your notebook. Then you can pip install it without internet
you can check whether somebody has already done it. If not, you can try it and upload the dataset as public, it would certainly get some upvotes
Damn... I am learning advanced stuff everyday
cudf is a good one. actually i'm sad they made a magic for pandas, I liked the fact that it was not so popular before, it always made a nice impression in interviews aha
kaggle GPU kernel only has 2 cpu cores
using the GPU kernel makes everything even slower for me...
ah yeah ? i get a x10 increase on my notebook, but i'm doing 100% df operations
if you want to train your model on CPU but the preproc on GPU you can do like this:
maybe my time-consuming operations are mostly silly indexing that can't be sped up a lot?
are you also doing the training in your notebook? that might be what takes more time. In that case you can split it: build the training set in one notebook (with GPU), then train the model in a second notebook on CPU (using the training set as a dataset). Alternatively, the boosting libraries also have GPU training methods
regarding silly indexing, it would be weird. Given the dataset, i'd say that many weird operations can be done smoothly with smart df operations
yes, i need to optimize my df operations, they are really a mess
Guys I have a question
I'm not sure I understand what each observation is.
Is every observation the total amount of energy consumed every hour by businesses (or individuals) in each county for every contract type (product type)?
You need to predict, for each day and each hour, the production and the consumption of electricity of people that have solar panels, for different categories of customers (split by county, business/personal, type of product).
thx Jacky. I am really a noob with competitions where you submit a notebook, these questions may be trivial. Atm I have a codebase that I locally developed on my laptop, with notebooks importing from files which themselves import from other files etc. Is there a recommended way of turning this codebase to a submission, like making a package out of it? Or do you actually write all the code on kaggle notebooks and never use a local system?
there is no recommended way, it is really up to you, and there are different trade-offs depending on what you prefer doing.
personally, I have a set of functions to simulate the API behavior and make sure the features I am building are not leaking information from the future. It makes the preprocessing longer, but more robust.
Then I have a set of functions to do the feature calculations that I can plug directly onto the data from the API or onto my simulated version. I have them copy/pasted into the different notebooks I use because my computer is not powerful enough to work locally with a clean package.
In terms of notebooks, I have one with my simulated API + feature creation package which helps me save new features. When I create a new feature, I store it in an AWS bucket as a row_id / feature pair. It saves time because if I want to try different experiments I already have the features available and don't need to recompute them.
I have a second notebook for model training and feature selection. In this one, I simply load the features I am interested in, train a model, and save the model.
Finally I have an inference notebook in which I load my trained models. I can then calculate the features using my set of functions and simply do the prediction
thanks a lot for the detailed answer. I will try to come up with a setup this afternoon and will report back. I think this is very useful knowledge for beginners
i'm wondering how many people are actually really in the 80- range without copying the top ranked notebook
I really hate people who copy public notebooks and do submissions with it
overinflates the scores
I had a look at the Optiver competition. Looked at the leaderboard and got turned away for the exact same reason...
the rankings are just so crowded
I wonder if you can just farm medals by copying the best public notebook in each competition
damn
seems like an easy thing to fix, bit obvious when 100 people have the same score
for competitions with this many participants, ~top150 get silver. you can copy the public notebook, make some improvements and get an easy silver
bit boring tho
A good and simple fix would be to remove the possibility to fork kernels, and deactivate copy/paste.
It would require much more effort to copy, while at least keeping the philosophy of sharing
Or to force private any notebook that is in a medal spot
Some strange observation, when I do GroupKFold by month, my validation mean error (signed) is obviously negatively biased....
That's strange because one-sided bias is usually caused by some long-term trend. But GroupKFold by month is already somehow using leaked information, because a model trained on newer data predicts older targets...... doesn't make much sense
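For anyone following along, a GroupKFold-by-month split with a signed-error check could look roughly like this. It is a sketch on a made-up frame: the grouping key is assumed to be month-of-year, and the "model" is just the training mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical hourly frame spanning 12 months with a trending target.
idx = pd.date_range("2022-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"datetime": idx, "target": np.arange(len(idx), dtype=float)})
groups = df["datetime"].dt.month  # all rows of a month land in the same fold

signed_errors = []
for train_idx, valid_idx in GroupKFold(n_splits=4).split(df, groups=groups):
    train, valid = df.iloc[train_idx], df.iloc[valid_idx]
    pred = train["target"].mean()  # placeholder model: predict the training mean
    # Mean signed error (prediction minus truth) exposes a one-sided bias per fold.
    signed_errors.append(float((pred - valid["target"]).mean()))
    # No month appears on both sides of the split.
    assert set(train["datetime"].dt.month).isdisjoint(valid["datetime"].dt.month)
```

Because the folds are not ordered in time, each model is trained on data both older and newer than its validation months, which is the leak concern raised above.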
i don't think this is so important here personally. In the end most of the "difficult" part to predict does not come from the variation of the time series itself but from the weather, which is what it is at a given moment in time.
The thing to avoid is having data from the same day in both the train and test split I think, apart from that it should be fine. Our models are most likely learning how weather influences the production/consumption and some very general trends (like the production profiles depending on the month)
indeed, the weather conditions have a lot of impact. It's just that a one-sided bias seems like a low-hanging fruit, yet I can't grab it. (like, I can multiply the prediction by 1.xx and improve the result but that's so dumb). Secondly, figuring out these anomalies usually leads to some insights neglected by others
Yep, I wish lgbm support multi output regression. If that's available, I would do the prediction by day, not by hours..
I heard that too, definitely will try it later
I made an attempt, but so far not successful on my side
Oh I missed NN so much. A transformer or RNN like prediction head to spit out a 24 hour predictions one by one would work so well conceptually
there was a competition where transformers were better than FE+boosting, the RIID one, but the TS were not of the same nature
and there was muuuuch more data
here I think it's as usual, too much variance + lots of complex seasonalities + not enough data
I think given enough development time, NNs will win in most cases, even for small datasets. But the time needed to regularize different features at different layers makes it really hard
i don't have enough experience with successful NNs to tell :p
the feature dimension would grow dramatically if we predict by day, it's a bit tricky to handle. Maybe that's the reason it is not performing well?
yeah to avoid an explosion of features, i stayed generic with weather features at the day level, but it was not enough. and if you try to include all the features hour by hour, well you indeed risk an explosion (not sure the RAM could handle it either)
you've already called predict() in line 23
it is supposed to be a loop, where you iteratively observe -> predict
why are people using historical weather as features when you have the weather forecast? I can't wrap my head around it....
and several historical weather features appear at the top of the feature importance in the latest 72.87 public notebook. That drives me crazy...
The historical weather data is more accurate. When you calculate past features, you want to use this historical data and compare it to the current weather forecast. It's better than using only features derived from the past weather forecasts
Oohh right
Didn't see that. Thank youu
@dense agate are you team multi-models or single models ?
I started on single model, but the more I look at the data, the more I am tempted on splitting the models
exactly, but that public notebook is just feeding everything to the model, not sure the model is smart enough to capture all those nuances.
I tried to compare forecast and historical features, but most of them are not even comparable, they are measured with different methods
multi-models will win. The issue is how to split and how many
yes but it does not really matter, no? I guess a comparison is made between rolling features on the target and rolling features on historical weather to capture the "trend", which is then compared to the value from the forecast to "balance" the "trends" obtained from the historical values and get a deviation
So far, I am considering 2 very distinct models (there is clearly 1 particular cluster which is really different from all the others), but I am also redoing all my feature engineering and calculating some particular coefficients for some clusters.
I don't like the idea of using boosted trees with unbounded trends so much, so I'm trying to normalise everything properly
tree-based algos are very efficient if the data stays within the boundaries of the training set, but they are bad at extrapolating outside of those boundaries, unlike LR (if I'm correct). And here, we are exactly in the case where we will go outside of the boundaries, as the installed capacity increases over time
hmm... you are right. maybe there is some interesting feature based on historical weather that can proxy some "state"/"trend"...
normalizing the target would make sense then.
I want to check something, hold on a sec :p
normalizing targets would mess up the loss though, need to be careful.
aaah
perfect
look at this
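import numpy as np
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
N = 100
X = np.random.random(N)*10  # training inputs in [0, 10)
y = X*2+6+np.random.random(N)
new_X_1 = np.random.random(N)*10  # inside the training range
news_y_1 = new_X_1*2+6+np.random.random(N)
new_X_2 = (np.random.random(N)*2+10)  # in [10, 12), outside the training range
news_y_2 = new_X_2*2+6+np.random.random(N)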
With this code I create a simple linear relation between X and y, and my new_X_2 is outside of the train set (X)
lr = LinearRegression()
lr.fit(X.reshape(-1,1),y)
dt = LGBMRegressor()
dt.fit(X.reshape(-1,1),y)
print(lr.score(new_X_1.reshape(-1,1),news_y_1))
print(dt.score(new_X_1.reshape(-1,1),news_y_1))
print(lr.score(new_X_2.reshape(-1,1),news_y_2))
print(dt.score(new_X_2.reshape(-1,1),news_y_2))
What is going to happen when I score/predict with a LR and a lgbregressor in your opinion? :p
this could make a nice interview question for a junior ds actually aha
wont the lgbm flatline?
if by flatline you mean "predicting always the same value outside of the training range", then yes
yeah i mean if you would plot the preds they would stop increasing at around 25ish
I actually have no idea how gbdt works at a low level, this competition actually introduced me to tree-based methods. My understanding is that it somehow formulates the regression problem as a classification with a lot of bins. But I also saw somewhere that the prediction can go beyond the target range seen in the train set. That's something I don't understand.
the kaggle time series course showed a neat trick: first fit a linear model, then fit a lgbm on the residuals. ols extrapolates, lgbm fits the nuances
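A rough sketch of that two-stage idea, on synthetic data. scikit-learn's GradientBoostingRegressor stands in for LightGBM here so the example only needs numpy and scikit-learn; the data and the helper `hybrid_predict` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2 * X_train[:, 0] + 6 + np.sin(3 * X_train[:, 0]) + rng.normal(0, 0.1, 500)
X_out = rng.uniform(10, 12, size=(200, 1))  # outside the training range
y_out = 2 * X_out[:, 0] + 6 + np.sin(3 * X_out[:, 0])

# Stage 1: the linear model captures the trend and can extrapolate it.
linear = LinearRegression().fit(X_train, y_train)
residuals = y_train - linear.predict(X_train)

# Stage 2: boosted trees fit what the linear model missed.
booster = GradientBoostingRegressor(random_state=0).fit(X_train, residuals)

def hybrid_predict(X):
    """Sum of the linear trend and the tree-based residual correction."""
    return linear.predict(X) + booster.predict(X)

# Baseline for comparison: trees alone flatline beyond the training range.
trees_only = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

mae_hybrid = float(np.mean(np.abs(hybrid_predict(X_out) - y_out)))
mae_trees = float(np.mean(np.abs(trees_only.predict(X_out) - y_out)))
```

Outside the training range the residual model still flatlines, but only on the residual, so the linear trend keeps the prediction on track.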
a decision tree simply splits the training space to optimize a criterion such as entropy.
In the regression case, it averages the target within each subspace
boosted trees combine multiple decision trees, so they are bound by the limits of the individual trees
oh, your comment actually reminded me of the answer to my own question.
yeah i thought that could have been the answer :p
why does the target increase over time? Because the number of clients increases over time, and the proxies for that are eic_count and installed_capacity
we shouldnt have to extrapolate sooo much though, given rolling training
it all depends how you build your features I think
what you want is having a bounded space, it can be done in multiple ways
I think there are many ways of normalizing based on different proxies. The issue is probably how to adjust the cost once you do that.
and you need to be able to reverse the normalisation
but if you divide by a coef, normally you can always multiply your prediction by the same coef
you will probably need a customized loss right? otherwise, you are minimizing a proxy of the MAE, not the real MAE
I think it should be roughly equivalent no ?
I imagine they would be quite different, if the normalization has a meaningful impact.
These things are pretty easy to do with NNs. don't know about boosting packages.
you can build a custom loss in your lgb
that includes the coefficient vector you use to normalize
basically what you want to do is something like:
loss(y, pred, c) = np.abs(y * c - pred * c)
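A quick numeric check of that loss, assuming the coefficient c is strictly positive (for example an installed-capacity proxy). In that case |y * c - pred * c| equals c * |y - pred|, so the MAE in original units is just an L1 loss on the normalised target with each row weighted by c. Everything below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
c = rng.uniform(1.0, 50.0, n)      # per-row scale, assumed > 0
y = c * rng.uniform(0.0, 1.0, n)   # target in original units
t = y / c                          # normalised target the model is trained on
p = t + rng.normal(0.0, 0.05, n)   # pretend predictions in normalised units

mae_original = np.mean(np.abs(y - p * c))   # error after multiplying back by c
mae_normalised = np.mean(np.abs(t - p))     # what an unweighted L1 loss sees
mae_weighted = np.mean(c * np.abs(t - p))   # L1 loss with rows weighted by c
```

So instead of a hand-written objective, one option is to keep the built-in L1 objective and pass c as `sample_weight` when fitting (LightGBM's scikit-learn style `fit` accepts it). Whether that helps on this dataset is untested here.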
that's neat
I settled on the following development workflow (for now): write code locally in some folder, then zip and upload that folder to a Kaggle notebook. Then within the notebook, add the path to the uploaded folder to PATH, now all imports within the folder work, and I can also import the code to the notebook. If a pickled trained model is in that folder, can just unpickle it and it is ready for prediction even. Still have to look at the sklearn version mismatch between kaggle and local, but this seems good enough.
the more I look at the data, the more I am confused
some time-series don't behave at all as they should
Anyone interested in teaming up?
on my side I prefer to try for a solo gold, but happy to discuss ideas and thoughts here
good luck!
Linear Boosting is a two stage learning process. Firstly, a linear model is trained on the initial dataset to obtain predictions. Secondly, the residuals of the previous step are modeled with a decision tree using all the available features. The tree identifies the path leading to highest error (i.e. the worst leaf). The leaf contributing to the error the most is used to generate a new binary feature to be used in the first stage. The iterations continue until a certain stopping criterion is met.
why the heck binary feature.... I am lost. Is it going to overfit after like 10 seconds?
Just started looking into this comp. From what I gathered there are predictions to be made for every county, product type and is_business combination for production and consumption. This would make up for a maximum (not all have to be present) of 16*2*4*2 time series.
The API is necessary to make prediction but confuses me. It provides for every iteration a new row of information of the test set. Per row_id a target should be provided to make a submission.
- does the test dataframe provide the indication what to predict? which county, product type, date and hour etc
- does the test dataframe follow 'after' the last row of the training set? and continues to pour new information with every iteration?
- is there a way (besides using the leader board or the training data) to use the values in the test set for validation? (MAE scores)
- if you would fit a time series model for a individual timeseries (county / product type / is_business etc) is there a direct identifier to connect this timeseries (and thus the prediction) to the row_id in the sample prediction the iteration?
- how would you create lag values of a single time series
I'll try to answer these myself but if someone has some pointers, that would be great.
basically the API is here to make sure you make the predictions before accessing the next values, which would otherwise be a problem in terms of data leaks.
check this notebook: https://www.kaggle.com/code/sohier/enefit-basic-submission-demo it gives a simple routine to gather the test data (iteration by iteration) and submit (iteration by iteration).
At each iteration you will get multiple datasets providing the same data as in the training set (but one day at a time). You need to predict all the rows in "test". Note that the notebook crashes if you miss one or several rows, if you have duplicates, etc. Among the datasets sent by the API, one is called "sample_submission" and shows you the format to respect (the row_ids with the associated target). Then it's up to you to make your model output fit that format (I am personally using a left join and filling NaN with 0, this way I guarantee no duplicates/no missing data).
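The left join + fill trick can be sketched like this. The frames are made up: `preds` deliberately has one missing row and one duplicated row, and its `pred` column name is an assumption for the example.

```python
import pandas as pd

# Hypothetical frames: the API's sample_submission and a messy model output.
sample_submission = pd.DataFrame({"row_id": [10, 11, 12, 13], "target": 0.0})
preds = pd.DataFrame({"row_id": [10, 11, 11, 13], "pred": [1.5, 2.0, 2.1, 0.7]})

# Keep one prediction per row_id so the join cannot multiply rows.
preds = preds.drop_duplicates("row_id", keep="last")
submission = (
    sample_submission[["row_id"]]
    .merge(preds, on="row_id", how="left")  # exactly the requested rows, in order
    .rename(columns={"pred": "target"})
    .fillna({"target": 0.0})                # rows the model missed get 0
)
```

Note the left join alone only protects against missing rows. Duplicates on the prediction side still have to be dropped before the merge, as done above.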
regarding your other questions:
I think the test set will follow the train set but will be provided step by step by the API, otherwise it would not be possible to calculate rolling features. For now there is a 2-day overlap, so you must make sure you have continuity between the train and test sets if you compute rolling features etc...
To create the lag features, everybody has their own method. I personally have a class Enefit which has each dataset from the train set as an attribute. Then at each iteration from the API, I add the rows/format the columns/etc. of each df to its corresponding df in my class. Then, and only then, I calculate the feature vector X using all the data I have compiled with joins/rolling/etc., and use it to make my predictions.
for the identifiers, I think the organizers provided one in the main df. I personally just use multi index/multi columns depending on what I need
no idea, I never used this kind of linear methods with boosting... Don't think I will for this comp either :p
For now, I am looking at each ts, one by one, and trying to separate the ones with strange behaviors from the ones which are very similar
those are the "clean" ones
but there are also some weird specimens that I am considering treating separately
(those are ts smoothed over a week + a little normalisation btw, for those interested)
I think I discovered some method that may be worth a paper...
basically making tree-based method works better for TS regression, after thinking of the issue you raised yesterday @wanton herald
interesting approach. I recall that in the last energy theme regression comp, removing outliers was the key to gold
cool, i see you improved your score!
I'm curious to see where we will be landing by the end of the comp, there is so much to do
I was running a simple experiment this morning, one that all of us should be doing, and a simple forward strategy already gives me an incredibly high score
well.. "incredibly good" = 66 in MAE, but in 3 lines of code and without models
I am going to try some good old TS forecasting methods at some point
that's indeed very good for 3 lines of code
i might put the notebook public, i want to see how much it scores first. I'm just using pivot/melt but in a smart way
After thinking about the limitation of tree-based methods, I managed to improve the LB MAE by ~2.6, at the cost of 7x the runtime though. Now it takes about 2 hours just to get the trained models. Could be a curse for me since it is so early: now everything takes a lot of time before I see the full effect...
well you are far from the 2nd for now, and many people will not even read the stuff that we are discussing (which is super important!)
I like EDA and alternative solutions, much better than all the generic LGBM baselines
yeah the problem is that many people just rush into boosting without even trying to understand the underlying problem, and for time series it's very important
if you want a bit of food for thoughts, check this time series:
df[(df.is_business==1) & (df.product_type==3) & (df.is_consumption==1) & (df.county==13)]
I find it very interesting
exactly. as much as I hated the defcon CTF, the PTSD forced me to read the competition details again and again..... that helped
hmm, county 13 is not on my radar... county 0 is a big trouble for me
it's not particularly county 13 that is the problem but the combo of county 13 and the other features
(df.product_type==3) & (df.is_consumption==1) is supposed to be the trouble anyway
yes, it's true that most of my outliers are coming from that
in this particular one, there is a very interesting drop, and I don't think it can be explained by the added/removed capacity. I think there is some information we don't have
I was wondering if a client is a co-generation power plant (that can produce with gas and solar for example) or a plant that produces hydrogen from electricity. That could generate pretty big outliers in comparison to smaller players
maybe lost a big customer...
product_type==3 is the spot contract. I thought having the electricity cost would help explain a lot of the variance. But the elec price only helped marginally in my model.
yeah but you would expect to see this in the installed capacity too, no?
ah no, yeah, because here the drop is in the consumption time series... So it might be a big energy consumer alone that left. Makes sense
production is easier, because it doesn't make sense to turn it off. that's free money
consumption is another story, there are so many alternative even if the facility is installed
yeah I saw that too
and same observation: elec/gas price not very useful and no obvious correlation on filters like elec > gas or things like this
do we have an idea btw of the benchmark from ENEFIT ? Whats their own baseline ?
I am curious too
it's one of the few industries where forecasting accuracy directly translates into $$, and the host seems to have decent skills
damn, if this competition goes well, after it is done I may as well contact my local electricity provider and ask for a project & funding
ahaha
And there is a guy on the forum asking whether he can give up the chance to receive a prize in return for participating without handing in code. Maybe this is the reason. If a proprietary algo can improve the forecast accuracy considerably, that means a lot of $ for these big corps.
yeah its a huge subject
the comp prize is probably nothing compared to what the big corps can save
and god knows the number of data scientists not doing things properly
and that's one of the few fields where a small gain in accuracy can save a lot
ok I think I managed to put up the submission pipeline with my naive method
time to see how it scores!
how much does it score ?
219
that's what i am doing, but a more evolved version :p
and yes, that should be how to do a baseline
but there is better than simply ffilling the last value
you need to ffill the right value, which is not the last one
not even the same hour of the last day
oh! interesting
there is also a weekly periodicity
lots of stuff works more slowly during the weekend
not too bad including the weekly periodicity :p
I guess by only combining the previous hour + the same hour last week you can already get very high scores
I might be overthinking this sentence. I was thinking of a K-means clustering with many bins, and then doing a classification based on feature distance
I guess it will end up similar but inferior to tree-based methods
I was just saying that there are multiple ways of doing a baseline. The easiest is to propagate the last hour for a given ts.
another method is, for each hour, to propagate the day before
a more advanced method is to propagate, for each hour and each day of the week, the last occurrence
actually we cannot do the first one because we don't have access to the last hour when we do the forecast
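The two usable baselines above (same hour the day before, same hour and weekday the week before) can be sketched with a grouped shift. The frame is synthetic with a weekend dip, and the `series_id` column is an assumed identifier, not the competition's. In practice the lag has to be at least as long as the delay before targets are revealed.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame: one row per (series_id, datetime), no gaps.
idx = pd.date_range("2023-01-02", periods=21 * 24, freq=pd.Timedelta(hours=1))
hour = idx.hour.to_numpy()
weekend = idx.dayofweek.to_numpy() >= 5
frames = []
for series_id, scale in [("a", 1.0), ("b", 5.0)]:
    # Daily cycle, halved on weekends.
    target = scale * (1 + np.sin(2 * np.pi * hour / 24)) * np.where(weekend, 0.5, 1.0)
    frames.append(pd.DataFrame({"series_id": series_id, "datetime": idx, "target": target}))
df = pd.concat(frames, ignore_index=True).sort_values(["series_id", "datetime"])

# Rows are hourly without gaps, so shifting k rows within a series is a k-hour lag.
grouped = df.groupby("series_id")["target"]
df["pred_prev_day"] = grouped.shift(24)       # same hour, previous day
df["pred_prev_week"] = grouped.shift(24 * 7)  # same hour and weekday, previous week

# NaN rows (no history yet) are skipped by mean().
mae_day = (df["target"] - df["pred_prev_day"]).abs().mean()
mae_week = (df["target"] - df["pred_prev_week"]).abs().mean()
```

On this toy data the weekly lag is exact and the daily lag only errs around the weekend transitions. Real series are noisier, so the gap will be smaller.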
K-means is usually much worse than boosted trees
Thanks for the answer. I'll probably have to play around with the data a bit more to fully grasp how to compute the lagged values. But are you modelling individual time series (as there are at least 2 for the production and consumption, but possibly many more distinct time series as you are discussing as well), then making a forecast for each time series, and then mapping the right prediction value to the county/product_type/business/cons_prod?
I'm struggling to see how to come from different models (which are fed row by row on training data) to the row_id target value. Wouldn't it get really messy if prediction_units drop out etc?
you can do it the way you want. In many notebooks, people simply predict the target given all the other information without manipulating the time series themselves (it's somehow handled by the model if you give it information like hour, dayofweek etc...)
If you want to build a forecast for each time series, you indeed need a good layer of pre-processing and post-processing. It's not straightforward and it might require different things depending on your approach. There is no absolute solution for this
No, I understand that it's up to everyone to design a method to approach this. It kinda makes the entry to prediction a bit more difficult, as you need to handle the output in a very narrow and specific way.
The method used with tree algorithms can fit all the data, but if you want to try ARIMA-based models or variants that do not generalize well to exogenous features, then you already need to split the models and match the output to the API's row-based setup. Probably doable but more prone to errors.
By the way, when you were talking about how it doesn't make sense to shut down production: in some countries they have mechanisms in place to reduce production through incentive schemes or penalties when there is too much production. But that's probably an edge case and not applicable here though
my models are so bad in May, and luckily the private test ends in April
wondering what's special about May...
in France we have a lot of days off in may
add 2-3 days off and you can already add a 10% error
lol... are there some statistics on the % of people taking days off during the year? that would be a funny feature
no, but like national days off
in France for example we have 1st May, 8th May, Pentecost, Ascension Thursday...
I even studied the school schedule in Estonia, and made a feature for when schools are open and students are in class. Thought that would make a huge difference, but nothing...
maybe schools are not Enefit's customers, IDK...
Maybe I can do a public notebook of useless features...
it can also be that may is a hard month both for consumption and production
mmh
I had a mistake in my training, I was shifting by two rows instead of 1
when I shift of 1 row only I get a MAE of 35
where is the fluck ? :p
ah I know i forgot a param...
@dense agate what is your current score on training data? out of curiosity?
i'm getting MAE 58 now with my ffill method
GroupKFold by month, MAE around 35.5
I don't know why GroupKFold by month works a bit better on LB... by year-month or by day should make more sense, and gives a much better local MAE, but doesn't improve LB
dont know... The advantage with the ffill method is that I dont need groupKfold :p
now that I fixed a bug, i'm expecting better results, maybe I can sub 100
Good old seasonal forecasting methods could be interesting too https://otexts.com/fpp2/holt-winters.html
I was thinking of doing the H-W algo
@dense agate do you use functions pd.melt/pd.pivot_table ?
not for this competition
they're pretty handy for doing quick feature engineering, you might want to have a look to speed up your code
(worth it for all participants reading this message)
just to expand here on my message in the overfit thread (kaggle website discussions) -- I tried out to predict not the direct target, but e.g. the lagged (log) difference. It does seem to slightly improve the perf in short term (as in if I train until end of March, performance in April is better). But performance is worse long term with those transformations (e.g. May). Really wondering how to transform this problem to either make tree methods work / try out linear trees / get rid of trees completely...
why using log diff and not just diff ?
because with normal diff a few time series are still non stationary (checked using Dickey Fuller test)
with log they are all stationary
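A small sketch of the log-difference target transform and its inverse. Using `log1p`/`expm1` (so that zero targets are handled) and the function names are assumptions; the thread does not say how zeros were treated, and the Dickey-Fuller check itself is not shown here.

```python
import numpy as np


def to_log_diff(target, lagged_target):
    """Encode the target as a difference of log1p values against a lagged target."""
    return np.log1p(np.asarray(target, dtype=float)) - np.log1p(np.asarray(lagged_target, dtype=float))


def from_log_diff(log_diff, lagged_target):
    """Invert to_log_diff: recover the target from a predicted log difference."""
    return np.expm1(np.asarray(log_diff, dtype=float) + np.log1p(np.asarray(lagged_target, dtype=float)))


lagged = np.array([0.0, 10.0, 250.0])   # e.g. the target observed some days earlier
actual = np.array([0.0, 12.0, 200.0])
encoded = to_log_diff(actual, lagged)    # what the model would be trained to predict
decoded = from_log_diff(encoded, lagged) # back to the original scale
```

The model would be fit on `encoded`, and its predictions mapped back with `from_log_diff` before submitting.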
I am personally trying to regress the capacity_installed with the target and use the prediction as a coefficient to normalize the time series
this way I get rid of the drift (the main drift coming from new clients)
Oh okay. I did try to predict target per installed capacity. However, this didn't really work out (> 100 MAE) for me. And did not want to waste more time in that direction with these initial results^^
yeah its not the best, but it help bounding the targets.
This is for example a time series for a given hour for a given dayofweek without / with applying this factor:
so we see that the upward trend is corrected, and its easy to reverse back.
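One possible reading of this normalization, as a rough sketch: fit a single coefficient so that target ≈ coef × installed capacity, divide by that prediction, and multiply back afterwards. The no-intercept fit, the function names and the toy data are all assumptions, not the poster's actual code.

```python
import numpy as np


def fit_capacity_coefficient(target, capacity):
    """Least-squares slope of target ~ coef * capacity (no intercept)."""
    target = np.asarray(target, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    return float(np.dot(capacity, target) / np.dot(capacity, capacity))


def normalize(target, capacity, coef):
    """Divide the target by the capacity-based prediction to remove the drift."""
    return np.asarray(target, dtype=float) / (coef * np.asarray(capacity, dtype=float))


def denormalize(normalized, capacity, coef):
    """Reverse the normalization to get back to the original scale."""
    return np.asarray(normalized, dtype=float) * coef * np.asarray(capacity, dtype=float)


# Toy data: capacity doubles over 40 days, target follows it with a daily cycle.
hours = np.arange(24 * 40)
capacity = np.linspace(100.0, 200.0, hours.size)
target = 0.5 * capacity * (1.0 + 0.1 * np.sin(2 * np.pi * hours / 24))

coef = fit_capacity_coefficient(target, capacity)
normalized = normalize(target, capacity, coef)
restored = denormalize(normalized, capacity, coef)
```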
But for now, it did not improve my analysis so much because the lagged error is not really affected by the long trend of installed capa...
is data augmentation a thing for tree-based methods? I am throwing out random thoughts to solve the "out of boundary" issue. It seems reasonable to synthesize a new sample by combining two samples (two days) with similar features: sum the client features, and average the others
that would solve the out of boundary issue, but could take a toll on training time
I never saw data augmentation working for tabular data
I think there is a much better shot into removing outliers
On my side I managed to get to MAE 55 on the train set with just smoothing + ffill, but I don't manage to infer properly for now
This score is important because it can help in creating stationary ts like Lasse was saying
isn't the Dickey-Fuller test a bit too harsh though? after all, we have many features other than just time
if there existed a test that examines stationarity wrt features, i think the data would pass easily
I'm not using a test, but i want stationary ts to be able to predict the delta from one date to the other using the delta of the other metrics (like temperature, clouds etc..)
I have similar thoughts.
For now i would like my sub with lagged target to work correctly, that would be a nice first step
the last line of my notebooks is printing the number of na counts in the features...lol
the problem is that we cannot see the public test set, so its tough to debug :p
There is one new column in the test file: currently_scored. This is intended to allow you to reliably avoid spending time performing inference on unscored rows delivered by the API, such as the initial rows of train data.
the new column is interesting
If it is what I understand it is, then we can probably delay our training until we hit the first row with currently_scored=True
Well yes of course it's quite hard. The point is that without stationary targets, your tree will often have to predict at its bounds, where by definition it will be less well trained. Linear trees might be a huge step up with proper feature normalization. Also, if you look at the Kaggle M5 competition paper, there are some interesting augmentation strategies for tabular time series data.
For me, I currently do get my best scores by just predicting the actual target. But I do feel that having a stationary target should help. At least for long term quality forecasts. However, this competition seems more and more about who is re-training the most often during inference
i dont want to believe that retraining is the key personnally... :p
We can get a very good score just by playing on the periodicity of the target (55 MAE on the training set by simply using a 2/7 days lag with a 3hrs smoothing is not bad!) and I assume that weather has a large part to play in explaining the remaining error (at least on the production side).
Of course there are many things at play ! You will be able to get a great score without retraining at all ! But the very top submissions will surely do that. In the end we will have to predict 10 months of future data. No way around at least adjusting your model (e.g. by fitting past errors or just retraining the whole thing).
The smoothing is a nice idea. People tend to forget about post-processing
when you say that you use the profiling, do you mean that you created a new variable with the average production per month? Moreover, when using these profiles, did you create the avg production/consumption at month and hourly level?
Hi @wanton herald , thanks for sharing. When you say smoothed over week, do you mean that you have taken the weekly avg to plot the series? By normalization do you mean that for all the points you have subtracted the mean and divided by the std?
For now i am not using the profiles but to visualise them, you can normalise each time series each day by the mean value, and average for each hour and every month.
You can build other features with this ideas or use it as preprocessing when you calculate other features. For example: if you want to smooth target by applying a rolling window, you might want to consider this type of profile before smoothing
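A rough sketch of the profile computation described above: divide each day by its mean, then average the normalized values per month and hour. The `datetime` and `target` column names, the helper name and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def hourly_profile_by_month(df: pd.DataFrame) -> pd.DataFrame:
    """Return a month x hour table of the average day-normalized target."""
    out = df.copy()
    day = out["datetime"].dt.normalize()
    # Mean of each calendar day; zero means become NaN to avoid dividing by zero.
    daily_mean = out.groupby(day)["target"].transform("mean").replace(0.0, np.nan)
    out["normalized"] = out["target"] / daily_mean
    out["month"] = out["datetime"].dt.month
    out["hour"] = out["datetime"].dt.hour
    return out.groupby(["month", "hour"])["normalized"].mean().unstack("hour")


# Toy series: January and February 2023, same daily shape, growing daily scale.
stamps = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(24 * 59), unit="h")
hour = stamps.hour.to_numpy()
shape = 0.1 + np.clip(np.sin(np.pi * (hour - 6) / 12), 0.0, None)
scale = 1.0 + np.arange(stamps.size) // 24
toy = pd.DataFrame({"datetime": stamps, "target": scale * shape})

profile = hourly_profile_by_month(toy)
print(profile.round(2))
```

In practice this would be computed per time series (for example per prediction unit and is_consumption flag), which the sketch leaves out.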
For the graph i did weekly average of daily max, the weekly average is used to smooth weekly seasonal effects and daily max to smooth the daily effect
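The smoothing used for the graph could look roughly like this. Reading "weekly average" as a 7-day rolling mean is an assumption (a calendar-week resample would be another option), as are the helper name and the toy data.

```python
import numpy as np
import pandas as pd


def weekly_mean_of_daily_max(series: pd.Series) -> pd.Series:
    """Daily max removes the intraday cycle; a 7-day rolling mean removes the weekly one."""
    daily_max = series.resample("D").max()
    return daily_max.rolling(window=7).mean()


# Toy hourly series over three weeks: a single midday peak, lower on weekends.
index = pd.Timestamp("2023-01-02") + pd.to_timedelta(np.arange(24 * 21), unit="h")
peak = np.where(index.dayofweek.to_numpy() >= 5, 5.0, 10.0)
values = peak * (index.hour.to_numpy() == 12)
smoothed = weekly_mean_of_daily_max(pd.Series(values, index=index))
```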
Coming back to his point, I also came across some weird cases by plotting the charts. I didn't smooth the series, I just calculated the daily avg of the target. There are more cases where either the consumption / production increases or decreases, breaking all the seasonality and trend patterns. In all of these cases, how would you guys treat each TS?
Hi guys, I don't understand the logic between target and is_consumption in the train dataset. We have 2 observations per prediction_unit_id; is the relationship (target when is_consumption = 1) - (target when is_consumption = 0), or (target when is_consumption = 1) + (target when is_consumption = 0)? This is my first competition, I'd really appreciate your help
Actually sebas, we have two different time series per prediction_unit. One for is_consumption = 0, which means that energy is being produced and stored, and that usually happens in summer months. The other TS associated with the prediction unit is for is_consumption = 1, which means that energy is being consumed. Ideally that happens during the winter months
hello there
in the weather forecast file there are two predictions for almost every hour in all counties. my problem is that the early prediction has a data_block_id that indicates it will not be available until the next day, in other words after a lag of one day, but on the other hand we have the column hours_ahead, which suggests that the data will be available exactly when the day starts. so which one should I trust?
I tend to agree with you. The gain on retraining as a function of retraining times N is likely on the order of logN. Anyway, I would be curious to see what is the initial gain of retraining 1 time compared to no retraining.
At the end, anyone wants a gold would probably need to retrain their model a few times, but unlikely an advantage if you retrain many more times than others.
Hi guys, I'm kinda stuck with a problem submitting via env. I made some adjustments and got a Submission Scoring Error. Checked everything: no NaN, filling NaN with 0.0, float64 column type, but I spent 5 submissions with no results. Read the topic on the forum, but nothing changed. Maybe someone found a way to guarantee submission scoring?
- Generate your results, join on row_id with test_submission, fill nan with a method of your choice
- Encapsulate the loop in a try/except submit test_submission
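The two steps above can be sketched as follows. The helper names are hypothetical; `row_id` and `target` are the submission columns mentioned in the thread, and the call to `env.predict` is deliberately left out so the sketch runs outside Kaggle.

```python
import pandas as pd


def build_submission(sample_prediction, predictions, fill_value=0.0):
    """Left-join predictions onto the rows requested by the API and fill the gaps."""
    preds = predictions[["row_id", "target"]].drop_duplicates("row_id", keep="last")
    out = sample_prediction[["row_id"]].merge(preds, on="row_id", how="left")
    out["target"] = out["target"].fillna(fill_value).astype("float64")
    return out


def safe_submission(sample_prediction, predict_fn, fill_value=0.0):
    """Run the inference code, falling back to a dummy forecast if it raises."""
    try:
        return build_submission(sample_prediction, predict_fn(), fill_value)
    except Exception:
        fallback = sample_prediction[["row_id"]].copy()
        fallback["target"] = float(fill_value)
        return fallback


sample = pd.DataFrame({"row_id": [10, 11, 12], "target": [0.0, 0.0, 0.0]})
good = safe_submission(sample, lambda: pd.DataFrame({"row_id": [12, 10], "target": [3.5, 1.5]}))
bad = safe_submission(sample, lambda: 1 / 0)  # simulated crash in the inference code
```

The returned frame is what would then be passed to `env.predict(...)` inside the iteration loop.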
hm, it possibly will work, so bad that we can't get more traces of error cause of possibly dataleak
thanks!
Yeah its very frustrating... The reason behind that is that having access to the error would allow participant to leak the test dataset
hello guys, are there any cases where data_block_id is misleading and does not represent the true time of availability of the data?
my problem is that according to data_block_id some weather forecasts will be available after a one-day lag, like gas prices, but that does not make sense, since the forecasts are made at the start of the day and should be available before we predict the target
I am really struggling with this
this post by the host may be helpful: https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/455833
If I understand your question correctly, the issue is the actual business process involved. Read this description by the host: "Let's say we are on day D at 11am. We want to predict next day D+1 net consumption from 00 to 23 for every hours."
This is how it works, you don't get to collect all information at 00:01am of D+1 and make prediction after that. The prediction for day D+1 is needed at 11am of day D.
If I understand you correctly you meant that even though the forecast of day D is available on day D we can't use it since we will predict day D at day D-1 where these data will not be available yet
yes. We make the prediction for the 24 hours of day D at 11am of day D-1
Thank you so much for the explanation
Guys, I'm currently looking for a team for this competition, can anyone pull me in? I'm an undergraduate computer science student focusing on data analysis, and I have built some predictive and descriptive models myself as well.
i took a week of break and I'm back in the comp, hard to put my head back in the notebooks!
I haven't started yet, is it too late to get into it?
there is still plenty of time!
I took a two-week break from the other competition. It's closing soon and my LB ranking dropped from like 2nd to 200th because of some public notebook. lol....
It's not too late.
and yet, help me please.
which competition was it ?
did someone really publish a public notebook with a top 5 score?
Kaggle really need to start doing something about this
Open problem 2
It was ensembles of multiple good public solutions. People published the ensembles with scores in the Gold zone, thinking it was just overfitting the public test. And other people kept blending... But now there is speculation that there could be a slight chance some of those ensembles are not just overfitting
Now I have to work my a*s off to make sure my previous efforts are not flushed down the toilet
is the test set big for that ones ? I saw in many competitions very funny shakeup for people blending high public scores
its actually very easy to overfit a testset
the dataset is quite small
and the distribution of train/public test/private test are very different from each other. Personally I think those blends are overfitting. but still... quite scary to see the shakeup just before the closing
blending can be efficient when used correctly, but most people don't, and it often results in heavy overfitting of the test set.
I remember on MOA, which was the first competition i took seriously: at the end of the comp, a lot of people were publishing blends of blends and it shifted the silver/bronze area a lot.
At the end of the day, all the people that used those public notebooks got catapulted to the bottom of the ladder ahah
Is it ok that the score can vary with the same submissions? I suppose the reason is that I get different test cases each time
Or it is just noise?
If you train your model in the same notebook without fixing the random state, yes it is normal, although it should not change much. What is sent by the api is deterministic, so any variation is coming from your side
Hi, I made a notebook so one can use the train data through the API. So, if someone is interested:
https://www.kaggle.com/code/ginkobalboa/run-train-data-with-the-api
I kept encountering this error: ModuleNotFoundError: No module named 'enefit.competition' when running the API, is there any modules that is missing from my part? thank you
the distribution of train/public test/private test are very different from each other <-- Sounds bad.
^--
Why would Kaggle organizers not take more care, to ensure the training and test set are from the same distribution?
hi, I imported from the same folder the folder enefit as
import enefit
but received this error msg:
ModuleNotFoundError Traceback (most recent call last)
Cell In[15], line 1
----> 1 import enefit
File ~/Dropbox/KaggleChallenges/predict-energy-behavior-of-prosumers/enefit/__init__.py:2
----> 2 from .competition import make_env
4 __all__ = ['make_env']
ModuleNotFoundError: No module named 'enefit.competition'
i wonder what could cause this kind of problem? thank you
You should have competition.cpython-310-x86_64-linux-gnu.so and __init__.py files in your enefit folder.
my folder looks like this, so the file is there..
Because distribution shift is a real problem and needs to be solved as part of the solution
Alright, makes sense. Winter is different from summer for solar output. Duh.
Any other causes of drift?
There are many causes, just look data description in competition
Pardon me, the file is in the folder but I still get the ModuleNotFoundError. Any ideas how to fix it? Thank you
you have problems of relative paths. you need to run your notebook (or your scripts) in the same folder in which you have the folder enefit
working_path/
  |- enefit/
  |- notebook.ipynb
If your notebook is not located in the right place, you'll have some errors.
Another way would be to add the folder that contains enefit/ to your Python path (sys.path or PYTHONPATH).
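A minimal sketch of the path approach. The `working_path/enefit` location and the helper name are hypothetical; note that this only addresses the relative-path problem, not the case where the compiled `.so` file cannot be loaded on the current platform.

```python
import sys
from pathlib import Path


def add_parent_of_package(package_dir: str) -> str:
    """Put the folder that contains the package on sys.path so `import <package>` resolves."""
    parent = str(Path(package_dir).resolve().parent)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# Hypothetical location; replace with wherever the enefit/ folder actually lives.
parent = add_parent_of_package("working_path/enefit")
# After this, `import enefit` is looked up inside `parent`.
```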
Ask GPT3.5 for assistance, for this type of problems it is very efficient.
guys what do we know about the format of the testing data that will be used to give a score for the model
let's say that i used Autoregressive Integrated Moving Average model but when they test the model they provided day by day data this will make the model useless
so what do we know about the mechanism of how the data will be provided during testing phase
at each iteration you get a sample of all the dataset for a particular given date.
For example for 2023-11-28 you would get:
- the rows to predict (for 2023-11-28)
- the lagged target (from 2023-11-26)
- the lagged client
- etc...
After its your role to keep in memory those data and do whatever model you want to do with it
my concern is that we are working on a time series problem where historical data is crucial (e.g. the weather of one day is not as informative as the weather of the whole month), but we do not know if the data format will allow us to use historical data or not
you can do everything you are doing during training as long as you organise yourself your time series.
You must keep in memory the historical values, and when a new dataset is released, have a processing that will add to the historical values the new values you received
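A sketch of such a history buffer: append what the API just served, drop duplicated keys, and optionally keep only the most recent days to limit memory. The function name, the key columns and the toy frames are assumptions for illustration.

```python
import pandas as pd


def update_history(history, new_rows, key_cols, time_col="datetime", keep_days=None):
    """Append newly served rows to the history, keeping one row per key."""
    combined = pd.concat([history, new_rows], ignore_index=True)
    # If a key shows up twice, keep the most recently served version.
    combined = combined.drop_duplicates(subset=key_cols, keep="last")
    if keep_days is not None:
        cutoff = combined[time_col].max() - pd.Timedelta(days=keep_days)
        combined = combined[combined[time_col] > cutoff]
    return combined.sort_values(time_col).reset_index(drop=True)


keys = ["prediction_unit_id", "is_consumption", "datetime"]
history = pd.DataFrame({
    "prediction_unit_id": [0, 0],
    "is_consumption": [1, 1],
    "datetime": pd.to_datetime(["2023-05-26 00:00", "2023-05-27 00:00"]),
    "target": [1.0, 2.0],
})
new_rows = pd.DataFrame({
    "prediction_unit_id": [0, 0],
    "is_consumption": [1, 1],
    "datetime": pd.to_datetime(["2023-05-27 00:00", "2023-05-28 00:00"]),
    "target": [2.5, 3.0],
})

history = update_history(history, new_rows, keys)
trimmed = update_history(history, new_rows, keys, keep_days=1)
```

The same function could be called once per served table (targets, client, weather, prices) at each iteration, before computing lag features.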
that will be a bit of a challenge since we know so little about the data nature especially that when you submit a notebook you get only a score or a very general error message
before submitting you can check what data is served. Check the basic submission notebook and try to run the cells: https://www.kaggle.com/code/sohier/enefit-basic-submission-demo
Then you can check one by one the data served by the API
for (test, revealed_targets, client, historical_weather,
forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:
Hi! I have a question regarding the API! I am using Windows so the .so file is not working for me. Is there anyone who could use the .so file on Windows with a workaround?
why don't you use kaggle kernels directly ?
I am quite new to kaggle so I didn't know I could to be honest!
on the top of the competition in the code section you can click on "new notebook"
Thank you so much!:)
good luck!
I did in fact ask GPT4 but it did not work better than your support :). Thank you for your help. I'll try it!
the issue might be due to this .so file, apparently does not work on windows according to natalia
I'd like to clarify the public test data period and the private test data period. 2024-01-31 is the final submission deadline and 2024-04-30 is the competition end date. Does it mean that the private leaderboard will be scored by MAE on actual data generated between 2024-02-01 and 2024-04-30? How is the public leaderboard scored? The competition dataset includes targets between 2021-09-01 and 2023-05-31, and env.iter_test seems to start from 2023-05-28. What is the last datetime the API will provide when I submit my code, and how will the public leaderboard be scored?
Hello dear competition participants!
I would like to understand the data a little better and what we should achieve as an output because I am new to the competition so in short everything you need to know about the competition for more clarification. THANKS
Hello dear competition participants!
i have questions about train and other datasets concerning data_block_id
why train.csv begins with data_block_id = 0 in 2021-09-01
and the others don't begin with 0
for example in the electricity dataset we have forecast_date 2021-09-01 same as train but data_block_id = 1
Also if data_block_id = 637 in both train and electricity
we don't have the same date
in train date = 2023-05-31 and in electricity date = 2023-05-30
i need to understand to merge the data correctly
Hello. I'm new to kaggle and don't know what this time-series API is. I can't install the enefit package to experiment with the provided notebook. Can someone tell me what this api does?
The main folder change ?
another data patch
Good news for most people, not so good news for LB top teams
I just checked the data and I did not find this new main folder
also they mentioned some edits that should be done on the data but I did not find anything new so now I do not know should keep working on the data I have or wait for some update
were you shifting your forecast datasets ?
I'm circling in this competition, trying to do stuff more fancy that just "training vulgar boosting tree + feature engineering" but without much success so far
I had a processing pipeline that shift the data correctly before the patch (through reading and some trial and error). Now the patch is out, which nullify my pipeline. And I honestly don't know what changed and what not. The description for the patch is so brief and not clear, the timestamp in the updated data doesn't look like it is in the right format. And the description in the data tab is not updated... In short, I think the patch is a mess in the current form. I would rather wait for clarification before doing anything.
The data was correct before the patch anyway, just required people to pay attention to the description. Now after the patch, I am not sure the data is even correct.
Thank you @lyric bough
yeah i had a look, did not find any changes either
I didn't download the data before the patch, so I can not compare at all...lol. All I can tell is that rerunning my notebooks give me worse local CV..
i just rerun my current wip notebook and did not see many changes in the score... But my work is chaotic
I remember that the first value in the column datetime in historical weather was 2021-09-01 01:00:00
now it is 2021-09-01 00:00:00, however i am not sure about that
Based on what I saw, I am expecting a patch on the recent patch shortly... not worth working on the current data. Time to take a break.
I see in the sample submission there is three columns: row_id, data_block_id, and target. But when I look at the discussion: Enefit Basic Submission Demo There is no data_block_id column. Is there any reason for this? Just trying to understand what my output should look like and how to leverage the API. I am working locally so not sure how to use the API
hello, @everyone
I think you should omit data_block_id
guys, the lag of the historical target is two days, i.e. its data_block_id should start at value 2 at datetime 2021-09-01 00:00:00
am I right ?
I recommend you read through the competition page and then fork the submission demo. And start from there.
Then when you have a more specific question, the chance of getting an answer would be greater.
I'm really surprised by how hard it is for LGBMRegressor to infer ratios of the target.
For example:
using only the lagged_target_2days as a feature gives better result than using lagged_target smoothed over 24hrs period (including hours and ts metadata as a feature to have different ratio for each hour/ each ts).
using lagged_target_7days gives better results than using lagged_target_2days + dayofweek, which means the model struggles to infer the weekend ratio from dayofweek
I think your results are in line with the results I got.
it's annoying
i was trying to do some renormalizing operations to be able to use the target value at D-2 instead of the one at D-7 for the week-periodicity cases, and while I managed to improve significantly the MAE for those cases (going from MAE 168 to 120 with D-2), I cannot beat the D-7 benchmark (MAE 95)
BTW, I think the current patch is problematic and mishandled the data....
you can't trust any result after the patch.
It's interesting that nobody has flagged it 4 days after the patch, maybe someone is capitalizing it by not revealing the issue....But that's just my opinion, what do I know.
no idea, to be honest I did not look too much into the details yet, there is a lot of other stuff I am trying to figure out before this and which I struggle with
did you manage to retrieve a similar sub by shifting the hours ?
I actually managed to get similar scores for D-2 / D-7 for weekly-periodic time series using a custom correction for each TS
On the way there, still figuring out what exactly have been done wrong for this patch... I have a big picture, but didn't have the time to check some small details.
Hello @everyone ! I need some help with the submission. I dont understand why its giving me an error.... i looked at the output and has the same rows and cols of the naive submission file... has anyone had the same problem?
We have all been there...
try to isolate the problem, e.g. is the submission in the wrong format or the data processing pipeline throws error
okay, maybe not your problem. Many other people on the forum and myself are seeing the same issue now
ah... Another confusing patch. I want to rant so much... If it were not for these patches, we may be seeing sub 60 scores already.
Ah a new patch?
A few pieces of advice:
To make it work you need two things:
- Make sure you make a forecast for each row proposed by the API. This can be achieved by doing a left join on the given row_id with your prediction and a fillna with something.
- Make sure your code doesn't raise an error at execution time because of a forgotten edge case (for example, if you are using a dictionary to look up some historical data, a new key at inference time might raise an error). As a starting point, you can wrap your inference cell in a try: YOUR_CODE except: SUBMIT_DUMMY_FORECAST (for example all 0).
To go further:
The best way to anticipate later issues is to use the training data to replicate the API behavior. This way you can make sure your inference code works on a large amount of new data and you should spot all kinds of bugs early.
And it seems that many people reported (including myself) the submission is broken since the "new patch". Previously successful submissions now throw error. I will stop working on it until an official answer or until someone reported a successful submission
understandable
i might submit something on my side by the end of the week. I am still far from your crossval MAE of 35, but i'm getting closer (i have MAE 40), with a very simple yet difficult to implement methodology
i'm curious to see how much it would score on the LB
does anyone know how to make the online enefit api return "data_block_id" for test data?
you would have to add to column by yourself. Each iteration would be a new data_block_id.
Does anyone know the extent of historical weather information the test will be running with?
Has anyone had a successful submission lately? Kaggle staff is not responding regarding the submission errors yet. Trying to understand what's happening
with the new new update i got my new local CV going from 40 to 37
i was not using any offset before, so it looks like this new update fixed indeed the times of the weather
yep, that's expected
did you try to try/except on your sub ?
I think the patch (when done right) will boost LB score for most people
no.. I am still assuming I did something wrong (or didn't consider some edge cases), and taking the opportunity to make the script more robust.
but no, it didn't work. Submission fails for me
maybe a problem of parsing in a datetime ?
or depending how you preprocess the forecast you might be missing a particular hour ?
yeah, making the script robust has been my biggest challenge... I actually spent most of my time on that too
OMG... the host says the historical weather is in EET/EEST time...
I have a feeling that he is mistaken...
try offsets and see which one gives the best score, ultimately that's what i'll be doing i think
Okay, now I am waiting for a new new new patch to fix the submission error. and a new new new new patch to correct the time in historical weather...
they should just make a rollback to the original dataset
on that I agree...
many people who were there at the beginning of the comp but stopped looking at it later on might be penalized
on the other hand, thats a very good demonstration of a real world usecase where the data specs change every two weeks aha
btw, if you find a minor issue (like an inaccurate description of the data), are you allowed to keep it a secret and capitalize on it by not flagging the issue?
no idea, i guess you could keep it a secret even if it's not exactly the philosophy
i personally like giving this kind of hint as i'm looking for the discussion master rank, and it's better than sharing a notebook that everybody will fork without even looking :p
hello guys
Which option do you believe is better: implementing a model by yourself, or utilizing one of the implementations available in various frameworks?
I understand that there is no definitive answer to such a question, but I am curious about your opinions.
As someone whos a student and new to Kaggle comps I'd love to read about what y'all have found on this so far. Still doing EDA on the data and trying to understand the model logic, lag variables, etc.
I understand that in the forecast weather df the cloud cover is in percentage at a certain altitude AND at the end of the hour not beginning, whereas historical weather is the total volume of the area (from a disc. post). Is this correct?
Utilizing frameworks
It's very rare that you would reinvent the wheel when working on these types of problems for real (except in research maybe). But that's just from my experience
Hey! I recently joined the competition and was wondering if anyone tried out encoder decoder architecture. Specifically, I want to understand will we able to test using a fixed context window of encoded information before starting prediction?
This is so frustrating....
I am doing this at the end of my iter loop:
sample_prediction['target'] = 0.0
env.predict(sample_prediction)
But I am still getting the submission scoring error (submission file in the wrong format), and not a notebook exception error..
btw, does anyone know if the submission run out of memory would return an OOM error or just an submission scoring error?
Just submission scoring error
I made a sub, it works well on my side
Did you try to check the input type of the date column? In my case i had issues while joining because of a datetime casting
If you suspect oom, you should trim your historical data to -X months to limit the size
Also use a notebook for training and one for inference
I tried these two things :
- I explicitly convert sample_prediction['target'] to float64. This once fixed my submission scoring error a long time ago, but it has not worked since the latest patch.
- I explicitly set sample_prediction['target'] = 0.0 right before env.predict(sample_prediction)
with the second one, there should not be any reason for a "submission scoring error"
The datetimes have a different casting also
I would quietly debug or accept failure if the message is "Notebook threw exception"..
For each piece of data i do:
df["datetime"] = pd.to_datetime(df["datetime"]).astype(str) to be sure they are in a uniform format
And for the date ones i just add .dt.date.astype(str)
Add also a fillna(0) at the end just to make sure you are not accidentally sending nan values
I am doing sample_prediction['target'] = 0.0, so I am just sending 0s
still a submission scoring error
You dont have any preproc in the inference loop?
I do. But if that code caused any issue, I would expect a "notebook threw exception" error, because that code has no effect since I have sample_prediction['target'] = 0.0 right before calling env.predict(sample_prediction)
so it might be an OOM ?
did you try to cut your datasets to limit RAM use ?
the oom can also happen if you are performing merging and forgot to filter some duplicates
i had the issue earlier in the comp: while concatenating the historical data with the new ones, i was badly filtering the duplicates, resulting in merging with double the number of rows
According to Kaggle's guide https://www.kaggle.com/code-competition-debugging, OOM should produce a "Notebook Exceeded Allowed Compute" error.
ah i didnt know. If you want I can try to kill one of my subs and generate an OOM to see how the error looks
thank you for offering, but that's not necessary.
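A small sketch of appending new rows without doubling the history, assuming pandas. The helper name `append_new_rows`, the key columns and the toy values are hypothetical.

```python
import pandas as pd

def append_new_rows(history, new_rows, key_cols):
    """Concatenate new rows onto history, keeping the latest copy of each key."""
    combined = pd.concat([history, new_rows], ignore_index=True)
    return combined.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)

# The new batch overlaps the history on 2023-05-31 (made-up values)
history = pd.DataFrame({"datetime": ["2023-05-30", "2023-05-31"], "county": [0, 0], "temp": [10.0, 11.0]})
new_rows = pd.DataFrame({"datetime": ["2023-05-31", "2023-06-01"], "county": [0, 0], "temp": [11.5, 12.0]})

updated = append_new_rows(history, new_rows, ["datetime", "county"])
```

Checking `len(updated)` after each append is a cheap way to notice when rows start doubling.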
still trying to isolate the problem. but with the quota of 5 submission/day, that's a lot of opportunity cost.
did you try to simulate the API behavior ?
def simulate_api(block_id):
    # Rows to predict for this block, shaped like the API's test frame
    sub_df = df[df.data_block_id == block_id].copy()
    sub_df["prediction_datetime"] = sub_df["datetime"]
    sub_df = sub_df.drop(["datetime", "target"], axis=1)
    # Targets from two blocks earlier play the role of the revealed targets
    prev_targets = df[df.data_block_id == block_id - 2]
    subclient = clients[clients.data_block_id == block_id]
    subhist = historical_weather[historical_weather.data_block_id == block_id]
    subforecast = forecast_weather[forecast_weather.data_block_id == block_id]
    subgas = gas_price[gas_price.data_block_id == block_id]
    subelec = elec_price[elec_price.data_block_id == block_id]
    # Dummy prediction frame; copy to avoid writing into a slice of sub_df
    sample_pred = sub_df[["row_id"]].copy()
    sample_pred["target"] = 0
    return (sub_df, prev_targets, subclient, subhist, subforecast, subelec, subgas, sample_pred)
I use something like this
Alright, I just used 1 submission and isolated the problem down to the inference code... I left the training code intact and commented out the inference code, and the error is gone.
If I am lucky, I should be able to locate the issue within fewer than 10 submissions. lol...
Theoretically, I should be able to identify one line of code out of 1000 lines with 10 submissions. lol...
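The arithmetic behind that limit: each submission is one pass/fail probe, so halving the suspect code every time needs about log2(1000) probes, and 2^10 = 1024 >= 1000. A minimal sketch, where `fails_with_prefix` is a hypothetical stand-in for "submit with only the first k lines enabled":

```python
import math

def submissions_needed(n_lines):
    """Worst-case number of pass/fail probes to isolate one bad line by halving."""
    return math.ceil(math.log2(n_lines))

def bisect_bad_line(n_lines, fails_with_prefix):
    """Return (index of the bad line, probes used).

    fails_with_prefix(k) must say whether running only lines [0, k) fails.
    """
    lo, hi, probes = 0, n_lines, 0
    # Invariant: the prefix of length lo passes, the prefix of length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes += 1
        if fails_with_prefix(mid):
            hi = mid
        else:
            lo = mid
    return hi - 1, probes

# Simulated case: line 617 (0-based) is the culprit
bad_line, probes_used = bisect_bad_line(1000, lambda k: k > 617)
```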
good luck, its an art to debug :p
1k line is a lot, there is probably refactoring to do no ?
Luckily, I don't have 1000 lines to debug, just referring to the theoretical limit.
on my side I have a model that performs very well on paper but gives a shitty submission, and I don't find obvious leaks 😢
I get the error "ModuleNotFoundError: No module named 'enefit.competition'" working with an M1 macbook, I suspect because it can't use the file "competition.cpython-310-x86_64-linux-gnu.so". is there any workaround for this?
gosh after 3 days and almost giving up, I finally found the bug that explains why my score during inference was so bad (MAE of 90 while my crossval is around 37 MAE)
i was merging on latitudes/longitudes on the weather dataset, but because of a rounding error on lat/lon, i was not merging new data.
And as data in the api overlap with the train data, I didnt notice the new data was not correctly added to my historical data
that did not break anything, it just put nan values in features where I could not see them, leading to incorrect predictions.
My personal API simulator, based on the training dataset, was not seeing this, because the rounding is correct there.
@wanton herald I should have made that a separate post....
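A sketch of one way to guard against this, assuming pandas: merge on integer coordinate keys instead of raw floats, and count unmatched rows. The column names `latitude`, `longitude`, `county`, the helper names and the toy values are assumptions for illustration.

```python
import pandas as pd

def add_coord_keys(frame, decimals=1):
    """Return a copy with integer lat/lon keys that tolerate small float noise."""
    out = frame.copy()
    scale = 10 ** decimals
    out["lat_key"] = (out["latitude"].astype(float) * scale).round().astype("int64")
    out["lon_key"] = (out["longitude"].astype(float) * scale).round().astype("int64")
    return out

def merge_weather_to_county(weather, county_map, decimals=1):
    """Left-merge weather onto the county map and report unmatched weather rows."""
    left = add_coord_keys(weather, decimals)
    right = add_coord_keys(county_map, decimals).drop(columns=["latitude", "longitude"])
    merged = left.merge(right, on=["lat_key", "lon_key"], how="left", indicator=True)
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns=["_merge"]), unmatched

# New data arrives with float noise in the coordinates (made-up values)
weather = pd.DataFrame({"latitude": [57.599999], "longitude": [21.700001], "temperature": [3.2]})
county_map = pd.DataFrame({"latitude": [57.6], "longitude": [21.7], "county": [1]})

merged, unmatched = merge_weather_to_county(weather, county_map)
naive_matches = len(weather.merge(county_map, on=["latitude", "longitude"], how="inner"))
```

Asserting `unmatched == 0` inside the inference loop would turn the silent nan features into a visible failure.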
exactly what you experienced
ah yeah i did not see that one :p
the worst is that i saw there was this discrepancy and i thought i had handled it already
anyway, i'm at 71 LB with my original method without any fine tuning or feature engineering
any progress in debugging on your side ?
yep... figured out what's wrong with the data
I am pretty sure there's some minor issue with the forecast or historical weather data (in the public test)
Depending on the pipeline, it may have no effect or break the code (in my case)
Will delay flagging that to the host until after I have had time to fix my pipeline.
So that I don't put myself at a disadvantage by wasting so much of my time on what is not supposed to be my job...
can't you make your pipeline robust to it ?
another casting stuff ?
to avoid issues, I am now casting all columns in the format i want (string or float correctly rounded) as a preprocessing step
yes, once the issue is located it is easy to deal with.
It's not casting. It's something that's been flagged for the train data and fixed. Apparently not fixed for the test data.
If you don't have submission error, you are probably not affected.
probably something I handled also, I used to have a few sub errors as well.
anyway now I can finally try to compete with my approach of the problem
this competition is extremely hard from a data processing point of view. Definitely not one I would recommend for beginners
approximately my time spent on: 10% modelling; 20% preparing data; 70% fixing bugs in preparing data
Hey guys is this a multi step forecasting or single step ?
Hi! I've got a file format error when submitting. The funny thing is that the submission.csv file doesn't have any issue... I'm doing a left join just to not miss any data point, and fillna for the target... I can't work out the error...
Any guess?
Hello I'm new to this competition and this area and I want to participate in this competition mostly to learn and understand these type of projects . can someone explain the competition in simple terms ? what should we do and what type of task is this and what is the data
and what is the input and the expected output ?
Thanks
I am new to this, and when doing my feature engineering it goes over the memory limit (ram) over 30gb. Is it possible to do this in jupyter notebooks outside of kaggle and then import it for submission or would that not work?
Welcome to the world of data science! It's not uncommon to run into memory issues when working with large datasets. You can increase the memory limit of your Jupyter notebook by following these steps:
- Generate a config file using the command jupyter notebook --generate-config.
- Open the jupyter_notebook_config.py file located inside the jupyter folder and edit the following property: NotebookApp.max_buffer_size = your desired value.
- Remember to remove the # before the property value.
- Save and run the Jupyter notebook. It should now be able to utilize the set memory value.
Alternatively, you can run the notebook using the following command: jupyter notebook --NotebookApp.max_buffer_size=your_value.
Regarding your question about importing the feature-engineered data for submission, it is possible to do so. You can save the data as a .csv or .pkl file and then load it into your submission notebook using pandas.read_csv() or pandas.read_pickle(), respectively. This way, you can avoid having to re-run the feature engineering code every time you want to make a submission.
I hope this helps! Let me know if you have any other questions.
Source: Conversation with Bing, 25/12/2023
(1) How to increase Jupyter notebook Memory limit? - Stack Overflow. https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit.
(2) Is there any way to increase memory assigned to jupyter notebook. https://stackoverflow.com/questions/51202801/is-there-any-way-to-increase-memory-assigned-to-jupyter-notebook.
(3) [FIXED] How to increase Jupyter notebook Memory limit?. https://www.pythonfixing.com/2022/03/fixed-how-to-increase-jupyter-notebook.html.
I'm curious if it's possible and even advisable to re-train your model during the public test set and eventually the private test set. I can imagine there is some drift and you want to re-train your model based on the new data. This means the data provided needs to be stored/collected along the way. Is this possible?
Hey guys is this a multi step forecasting or single step ?
@dense agate anyone noticed that the county lat lon data is missing some lat/long pairs in the weather DFs?
the prediction interval is daily for every hour of the day. so daily batch forecasting
thanks
so like this :
The lagged target(and other features) as X1 and the target as Y1
X1p ==> ModelP ==> Y1p(pred) (for production )
X1c ==> ModelC ==> Y1c(pred) (for consumption)
?
Why does my prediction code for Enefit raise an error even though I am submitting sample_prediction ?
Hi there, I am participating in this competition. Can anyone here help me out in merging the train and clients mapping datasets? Is there a significance of using data_block_id for merging? Currently I figured out to use date, county, is_business and product_type as keys for merging these two dataframes.
hi I am having problems submitting to the API. I have narrowed it down: it fails in this line of code: y_predict = gbm_upload.predict(test_sub_final.values).clip(0). The model has been instantiated before in the following format: gbm_upload = load('xxxxxxgbm_model.pkl'). My notebook runs fine in Kaggle with no errors, and the predictions are generated fine. Any ideas why the submission fails when calling the model? Thanks
same for me fede, I have raised this to the organizers too but no reply. My notebook runs fine , no errors, no timeouts, and my submission format checked too. I think if the organizers have other versions of packages installed or other dependencies they should tell us otherwise it will continue failing and it does not give same chance to everyone to participate and it is not a fair competition :/ @sinful vine
Out of curiosity, why is consumption even part of this challenge? The description only mentions the inaccuracy of energy production, not consumption.
Feels like having consumption targets adds a lot of hours to this competition even though it is a "solved" challenge for Enefit.
Not sure which part of the description you are referring to, but here's my take. The energy that Enefit needs to produce (P_enefit) needs to align with the net consumption of their prosumers (C_prosumer - P_prosumer)
neither C_prosumer nor P_prosumer is "solved". The inaccuracy of energy production probably refers to P_enefit, not P_prosumer
If (big if) anything is considered "solved", it's probably the consumption pattern of pure consumers, not the consumption pattern of prosumers; these two are different
but if they are a business they produce more than they consume
but we have to predict the amt produced as well as consumed for each prosumer(or active prosumer),right?
guys does anyone have a clue about the period of the testing data in the API (on which day it starts and ends)
Hmm yeah that makes sense tbf,that it's not necessarily production we care about but the prosumer activity as a whole
I have recently joined the competition and can't figure out why do people in kaggle notebooks use this piece of code when joining the clients set to the training set
df_client.with_columns( (pl.col("date") + pl.duration(days=2)).cast(pl.Date) )
Why do they shift it by 2 days?
because during inference, the client data is available with a 2 day delay
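A pandas sketch of the same idea as the polars snippet above: shift the client `date` forward by 2 days so each training row only sees client data that would already be available. The `installed_capacity` column and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Client snapshot dated 2023-05-26 (made-up values)
client = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-26"]),
    "county": [0],
    "installed_capacity": [100.0],
})
# Training row we want to enrich
train = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-05-28 10:00"]),
    "county": [0],
    "target": [1.5],
})

# Shift the client date by the 2-day availability delay, then join on the day
client_shifted = client.assign(date=client["date"] + pd.Timedelta(days=2))
train = train.assign(date=train["datetime"].dt.normalize())
joined = train.merge(client_shifted, on=["county", "date"], how="left")
```

Here the row for 2023-05-28 picks up the client snapshot from 2023-05-26, which matches the delay described above.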
Hello, I just found the discord channel. I am currently in position 27. Now, I am trying different features (like target_diff), the results locally seem to be really good but not in LB.
I really wonder how representative the hidden test set is,
like do we know its size and if it cover different seasons and so on?
There was a discussion post saying it was around 90 days
Last training date is 2023-05-31, so the company could have up to 6 more months
are they from 6/2023 to 9/2023 ?
We don't know, this is the post: https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/463640
Predict Prosumer Energy Patterns and Minimize Imbalance Costs.
the issue is that the target distribution varies very much between seasons, so if you test your model in a certain season you will get quite different results compared to testing it in another season. i found some cases where i got a lower validation loss than the training loss in the first iteration
the reason was that i set the test set to be the last 20% of my data
Yes! In winter, production is easy to guess. Consumption is more difficult to model
i think the reason is that the average of the target decreases significantly, which leads directly to a decrease in the mae loss regardless of your model and data quality
the api and the whole submitting story is to me harder than the problem itself. anyways, i have a small question: if my code succeeds in preprocessing the example data that is revealed by the api, should i expect that the code will also work when i submit my notebook? or are there problems such as discrepancies in column names or data types compared to the example they provided?
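A tiny illustration of that effect with made-up numbers: the same relative error gives a much smaller MAE when the targets themselves are smaller. The `summer` and `winter` values are hypothetical, not taken from the competition data.

```python
def mae(y_true, y_pred):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

summer = [100.0, 200.0, 300.0]  # high-production season (made-up values)
winter = [10.0, 20.0, 30.0]     # low-production season (made-up values)

# Same "model quality" in both seasons: every prediction is 10% too high
summer_mae = mae(summer, [1.1 * y for y in summer])
winter_mae = mae(winter, [1.1 * y for y in winter])

# Dividing by the mean target makes the two seasons comparable
summer_rel = summer_mae / (sum(summer) / len(summer))
winter_rel = winter_mae / (sum(winter) / len(winter))
```

So a lower validation MAE on a winter-heavy split does not by itself mean the model got better.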
API also got me several problems
I don't think you should find any column name discrepancies
to predict a point i use the previous 256 points, so if i want to predict the first point in the test set i will need the last 256 from the training set. my problem is that after testing the model on the first test batch there will be a gap between the training data and the second batch of testing data. this gap is the first batch of testing data, so how can i use the first batch of testing data as context to predict the second batch?
what i mean is that, for example, the testing data that covers the period 6/2023 to 9/2023 should be given as input when we predict the period 10/2023 to 1/2024. is that the case?
yes, that's why you need to concat the new data (revealed targets, forecast weather, etc.) to the existing dataframes, so no nulls appear. So every time you read new data, you store it for the following test days. I don't know if this is what you mean
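A minimal sketch of keeping a fixed-size context across iterations, using only the standard library. The class name `ContextBuffer` is hypothetical; the window of 256 comes from the question above, and the integers stand in for target values.

```python
from collections import deque

class ContextBuffer:
    """Keep the most recent `window` values as context for the next prediction."""

    def __init__(self, history, window=256):
        # deque with maxlen silently drops the oldest values when full
        self._buf = deque(history, maxlen=window)

    def extend(self, revealed):
        """Append newly revealed values from the latest iteration."""
        self._buf.extend(revealed)

    def context(self):
        """Return the current context window, oldest value first."""
        return list(self._buf)

# Seed with 300 training values, then add 2 values revealed during testing
buffer = ContextBuffer(range(300))
first_context = buffer.context()
buffer.extend([300, 301])
second_context = buffer.context()
```

One buffer per series (for example per county, product type and consumption flag) would be needed in practice.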
i am using the code someone made with polars and this problem is handled.
i guess that the case is as follows for the api
version1: patch1
version2: patch1 + patch2
version3: patch1 + patch2+patch3
and so on.
Then, the column currently_scored tells you which rows are already predicted and can be used as input, and which rows are new and should be predicted in the submitted file
can someone please confirm if this is the case or not
Hello there, I'm new with the challenge procedure. I don't know which environment is better to ask my questions so I link here the interrogation that I have, this is about the dates in the example_test_files. https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/467460
if i got you right, then the answer to your question lies in the column data_block_id, which exists in every data frame they provided
by checking the data_block_id for every file you will be able to figure out the delay of each feature
Yes, thank you, it helps me to find the following discussion that explains clearly the time availability of the data. https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/455833
More on that, in the discussion he specifies that he is looking at the availability of the data at 11 am, is it the time at which the prediction will be made ? In fact, if we predict after 2 pm, we then have more info on the next day
i think the model's purpose is to predict energy consumption/production for the next day at 11 am each day, so it is not up to you to decide when to make the predictions
anyone can help me with that guys?
"the column currently_scored guide you to know which rows are already predicted and can be used as input"
I think you get the idea, but I think your statement is not quite precise.
currently_scored=False doesn't necessarily mean it's already "predicted". The column just lets you know which rows are being scored. From what we know now, June, July and August of 2023 are being used for the public test LB, and rows in that period will give you currently_scored=True during the public test runs. But later during the **private** test, after the submission deadline, rows in that period will give you currently_scored=False.
Thanks for the clarification, maybe if i said it is the rows that you should make predictions for it, because the loss will be calculated based on these rows, for now, that will be more accurate.
but my main concern is about if the api is going to provide the data for the period between training data 'ends at 31/5/2023' and the rows that you should predict now
i.e. is the case something like this
version1: patch1
version2: patch1 + patch2
version3: patch1 + patch2+patch3
yes, the data will be provided in continuity
Hello, when trying to submit my notebook in kaggle, I'm struggling to overwrite the submission.csv file by doing output.to_csv('/kaggle/working/submission.csv', index=False), An error is raised saying : PermissionError: [Errno 1] Operation not permitted: '/kaggle/working/submission.csv'. Is there a trick to know how to submit ?
It seems that the for loop over iter_test does the job, but why does it iterate 4 times, with the sample_prediction split in 4 ?
has anyone tried nn's till now?
wanna know how they are working xD
trying some nn's rn
for sure someone has tried lstm
Are people still struggling with the submission scoring error ? I have no NaN, all the entries are floats, and scoring the submission locally gives me a score of 80. What could fail during the scoring process ?
counter = 0
for (test, revealed_targets, client, historical_weather, forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:
    value = prediction(test, revealed_targets, client, historical_weather, forecast_weather, electricity_prices, gas_prices, sample_prediction)
    sample_prediction['target'] = np.array(value["target"])
    env.predict(sample_prediction)
    counter += 1
The function prediction returns exactly the same dataframe that is found in submission.csv (column- and row-wise)
Hi! It seems to me that the forecast date of the electricity price data we receive from the API is always behind a day compared to the other dataframes. Does anyone know why?
so for example the prediction_datetime column of the test dataframe is 2023.05.28 but in the same iteration the forecast date of the electricity price is 2023.05.27
Look at this, it explains how the data in test is given : https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/455833
Thank you! I have already read this and it still seems to me that the electricity price and the gas price returned by the API are wrong. If the prediction date is 2023.05.28, then the forecast date for the electricity and gas should be 2023.05.28 as well and not one day earlier
that data is not yet available at the time you make the predictions, which is why the forecast date does not match the prediction date
you can look at the column data_block_id, which you can find in the training files, to understand when each feature becomes available and with how much delay
ah okay. I got it! Thank you!:)
you're welcome 🥰
did you figure out what was the problem? i think i am having the same problem now
The solution seems to be : np.clip(...,0,np.inf)
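A small sketch of sanitizing predictions before calling `env.predict`, assuming numpy. Only the `np.clip(..., 0, np.inf)` step is what was reported to help here; the float64 cast and NaN replacement follow earlier suggestions in this thread, and the helper name `sanitize_predictions` is hypothetical.

```python
import numpy as np

def sanitize_predictions(values):
    """Cast to float64, replace NaN/inf with 0, and clip negatives to 0."""
    arr = np.asarray(values, dtype=np.float64)
    arr = np.nan_to_num(arr, nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(arr, 0, np.inf)

# Made-up raw model output with a negative value and a NaN
clean = sanitize_predictions([-1.0, float("nan"), 2.5])
```

Usage would look like `sample_prediction['target'] = sanitize_predictions(value['target'])` right before `env.predict(sample_prediction)`.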
Do not ask me why, some magics from the authors
ModuleNotFoundError: No module named 'holidays'
Yesterday every module was fine, but today when I run my script this shows up. Can anyone help? I cannot use the polars module either
the loss is now calculated on new data, which is not the same data as a week ago, am i right?
I read somewhere that the hidden test grows over the three months but they didn't really specify. I guess there are more points in the hidden test set
By the way, does the scoring take a lot of time sometimes? It has been scoring for the past hour for me, which is weird because it was faster in the morning
ah okay the greater the MAE the longer it takes for the scoring to run haha
no, actually the hidden dataset has been updated
they added new test data and i think they also set the old hidden test data to have currently_scored == false
so you don't need to score on those rows which have the currently_scored == false? because it is not relevant anymore?
I can't really tell, each time I read what they said I understand something different
However I think that they added extra data for the period after the original data, and the column currently_scored for this new data is set to false, so you have to account for this to avoid the scoring error
And yet people submitted without accounting for this new data and got successful submissions
https://www.kaggle.com/code/gitfox/working-dummy-submission-after-2024-01-19-update/notebook
https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/469293
I have found these two links which are helpful I think, even though I am still not 100% sure I get it
where did u put this code
did u clip sample prediction's target?
Yes
didnt work for me tho
i think it is related to the hidden dataset update
i guess i should sacrifice my daily submission to debug
guys am i the only one who is facing this problem, or has the prediction_datetime column become like this in the new data generated by the api
in kaggle notebook options, change environment to pin to original environment
I think it is a problem caused by the most recent env in kaggle notebooks, along with other problems I noticed that I think are also caused by the new env
i.e. if you make a new notebook and change environment to pin to original environment this will not solve the problem
Hi, I joined the competition a week ago. It is my first competition. Yesterday, when ready to complete my first submission, I realized that it will take some debugging (and submissions) to successfully submit. Being new to Kaggle and joining late in a competition with a complicated to debug "Submission Scoring Error" issue is not optimal. Something seems to happen when the hidden stuff is executed when running env.predict(sample_prediction) in the iter_test loop. I think it is unfortunate, that there is a 5-a-day limit for first time successful submissions late in the competition, something for you to consider. Thx.
Hello all! I've just posted the following question in the discussion section of the competition. I know it's a bit late, but any help from you would be great
https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers/discussion/472267
Is the competition still going? I thought it was supposed to end in February
submission is closed and ongoing private lb
when can we see the scores of the private leaderboard?
Also, can we still modify our models or is it only for testing the existing one on a new dataset?
For those who have been on kaggle for a long time, can we expect to see a new time series competition any time soon?
I made a summary video of this competition. Let me know if you find it at all useful... https://www.youtube.com/watch?v=rR9i9tO4BIQ
Kaggle's Enfit competition review!
The goal of this competition was to predict energy produced and consumed by customers of a power company who have installed solar panels. Here you'll learn both fundamental and state of the art techniques for building a machine learning model for a time series problem.