#predict-energy-behavior-of-prosumers


mighty rock
#

Hey, is anybody here actively working on this project?

wanton herald
#

โ˜๏ธ

dense agate
#

yea. so stuck in the DEFCON CTF and came here to refresh a bit..

wanton herald
#

nothing better than a good old-fashioned time series project to lift the mood

thin ibex
#

I like tabular datasets and forecasting problems

late monolith
#

Time series. Perfect.

autumn seal
#

I used to predict electrical load using an LSTM, the accuracy was good

dense agate
wanton herald
#

I don't like NNs so much on this kind of time series, I have the feeling there is not enough data / too many subtleties to get something relevant

dense agate
#

I usually don't like NNs for time series, but I've got a feeling this one can be different. There is enough feature interaction for an NN to get an edge

quaint cedar
#

I got a super newb question

#

I downloaded the project and got it set up in VSCode.

#

I tried running the enefit-xgboost-start notebook

#

however first line fails

#

Looking in links: /kaggle/input/xgboost-python-package/
WARNING: Location '/kaggle/input/xgboost-python-package/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement xgboost (from versions: none)
ERROR: No matching distribution found for xgboost

#

anyone know what I am missing?

#

i tried downloading the package as a whole and placing it in that directory but still same issue

thin ibex
#

do you have xgboost installed in your local Python?
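A minimal sketch of why that first cell fails locally, assuming the notebook tells pip to look only in the Kaggle dataset folder (the "Looking in links" line in the log suggests this): that folder does not exist on a local machine, so pip finds no candidates. The helper name `pip_install_args` is hypothetical; it only builds the command and does not run it.

```python
import os
import sys

def pip_install_args(package, wheel_dir):
    """Build a pip command: offline from wheel_dir if it exists, else from the default index."""
    base = [sys.executable, "-m", "pip", "install", package]
    if os.path.isdir(wheel_dir):
        # On Kaggle (internet off): only look at the attached dataset folder.
        return base + ["--no-index", "--find-links", wheel_dir]
    # Locally the folder does not exist, so let pip use the default index.
    return base

cmd = pip_install_args("xgboost", "/kaggle/input/xgboost-python-package/")
print(" ".join(cmd))  # could be run with subprocess.check_call(cmd)
```

Locally, a plain `pip install xgboost` in the environment VSCode uses should be enough to get past that cell.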

wanton herald
sacred steeple
wanton herald
#

stuff like the average production for a given day

#

which differs from month to month: the chart above is from the summer (production starts early in the morning and ends late in the evening, which is logical since it's sunny for longer).
In winter the profile is different.
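A minimal sketch of such a profile feature, on synthetic data only (the day-length formula below is made up for illustration, and the column names `datetime` and `target` are assumptions). It averages the target per month and hour of day, then joins the result back as a feature.

```python
import numpy as np
import pandas as pd

# Synthetic hourly production, used only for illustration (not competition data).
idx = pd.date_range("2022-01-01", "2022-12-31 23:00", freq="h")
day_len = 8 + 8 * np.sin(np.pi * idx.dayofyear.to_numpy() / 365.0)  # longer days mid-year
hours_from_noon = np.abs(idx.hour.to_numpy() - 12)
df = pd.DataFrame({
    "datetime": idx,
    "target": np.clip(day_len / 2 - hours_from_noon, 0, None),
})

df["month"] = df["datetime"].dt.month
df["hour"] = df["datetime"].dt.hour

# Average production per (month, hour): the "profile".
profile_long = (
    df.groupby(["month", "hour"])["target"].mean().rename("profile_mean").reset_index()
)
# Wide view for inspection: one row per month, one column per hour.
profile = profile_long.pivot(index="month", columns="hour", values="profile_mean")

# Join the profile back onto every row as a feature.
df = df.merge(profile_long, on=["month", "hour"], how="left")
```

In a real pipeline the profile would have to be computed from past data only, otherwise it leaks the future into the features.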

sacred steeple
#

Ah I see. Thx for the explanation, makes a lot of sense

quaint cedar
dense agate
wanton herald
#

using the Fourier transform is something else here I think, no? Using profiles seems to improve my score a bit
I'm just at the beginning of my feature engineering, I spent a bit of time building a robust framework that I can easily transpose from training to inference with the submission API

edit: Actually nvm, did not improve much for now.

dense agate
#

IDK, I feel like Fourier is similar to sin/cos. I am thinking of something more flexible, like representation learning... That's why I mentioned NN. I haven't tried this idea though.... I feel like I am giving away some secret sauce :harold:

sacred steeple
#

Is it allowed to train a model locally and then upload it to a kaggle notebook as a "dataset"? or do I have to copy-paste my local code to a kaggle notebook and train from scratch?

wanton herald
#

yes it's allowed, and it's even recommended

#

you can also train a model in a kaggle notebook, and access it from another kaggle notebook. When you click on "add data" in the right panel, you have an option to select the output of a notebook.
To use it, you simply save the files you want to pass on (for example your model, with pickle) in the notebook where you train, and save that notebook
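A minimal sketch of that hand-off with pickle. A temporary folder stands in for the notebook's working directory, the dictionary stands in for a trained model, and the `/kaggle/input/<notebook-name>/` location mentioned in the comment is a placeholder to check in your own notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object works the same way.
model = {"name": "demo_model", "coef": [2.0, 6.0]}

# Training notebook: files written to the working directory become the
# notebook output once the notebook is saved.
out_dir = tempfile.mkdtemp()  # local stand-in for /kaggle/working
model_path = os.path.join(out_dir, "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Inference notebook: after "add data" -> notebook output, the same file is
# read back (on Kaggle from /kaggle/input/<notebook-name>/, a placeholder path).
with open(model_path, "rb") as f:
    loaded = pickle.load(f)
```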

dense agate
#

wait... is it always reading the latest version of the notebook output, or it has to be pinned to the version when you added the notebook? Can you bypass the internet switch with this?

#

Damn PTSD from security hackathon..

wanton herald
#

it's reading the latest version until you save your inference notebook

dense agate
#

but to evaluate on the updated test data, the notebook will be rerun... the question is, which version can the notebook read when it is rerun for future inference

#

I posted the same question in the discussion, it may be a stupid question....

wanton herald
#

ah no!
So basically, you run a model in notebook A and save it as version 1.
You run the inference notebook: it runs with version 1 of notebook A.
Later you update notebook A to version 2.
As long as you don't save the inference notebook again, it will stay on version 1 of notebook A. At least that's how I understand it

#

the internet-off rule is there to prevent people from building functions that could leak the test set by sending it to a remote location. Without it, you could easily save the hidden data to a remote bucket, for example. But as long as the notebook is not connected to the internet, no data can be exchanged with the outside

dense agate
#

that's my concern. The default setting of importing Kaggle dataset is "Pinned to the latest version"

#

not sure if that "latest" means as of when the notebook was saved, or as of when the notebook is run

wanton herald
#

and if you want to make sure there are no problems like this, you can always fork a notebook when you want to make a change

dense agate
#

It's probably illegal to use leaked information, even if feasible. just some security concern because of the CTF PTSD....

wanton herald
#

yeah but with the no-internet policy, there is no way data gets leaked anyway.
One thing kagglers sometimes do, when the test dataset is small, is probing

#

or there are some data leak techniques, like training a model in the inference notebook with the test set added somehow. When you fit a PCA, for example, it can sometimes be useful.
Not here anyway, because submitting through the API guarantees no leak from the future

dense agate
#

How about this though: "For example, in late April 2024, I can collect the real weather info (much more accurate than just the forecast) for the whole test period and update the private dataset. The submitted notebook needs to be rerun to evaluate on the updated test set anyway. If the submitted notebook can read the latest dataset, then it can "predict" using the leaked weather info."

wanton herald
#

that's why once a notebook is saved, it is bound to the dataset version it was saved with

#

but you can try to confirm with the admins

dense agate
wanton herald
#

but we cannot here, because the data is given and inferred day by day, guaranteeing no leak

dense agate
#

of course you can, you can log the revealed target

#

otherwise you can't engineer time lag features

wanton herald
#

that's not a leak in that case, you use past data 🙂

#

I guess you are already doing it, no? otherwise you could not get such a high score, as good features include past targets
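A minimal sketch of a lag feature built from already revealed targets. The column name `unit` is a hypothetical series identifier, the data is a toy example, and the 48-hour minimum lag is an assumption about how late targets are revealed.

```python
import pandas as pd

# Toy history: two series ("unit" is a hypothetical id column), hourly targets.
hist = pd.DataFrame({
    "unit": ["a"] * 72 + ["b"] * 72,
    "datetime": list(pd.date_range("2023-01-01", periods=72, freq="h")) * 2,
    "target": list(range(72)) + list(range(100, 172)),
})

def add_lag(df, hours, name):
    """Attach the target observed `hours` earlier for the same unit."""
    lagged = df[["unit", "datetime", "target"]].copy()
    lagged["datetime"] = lagged["datetime"] + pd.Timedelta(hours=hours)
    lagged = lagged.rename(columns={"target": name})
    return df.merge(lagged, on=["unit", "datetime"], how="left")

# Only lags of 48 hours or more, on the assumption that targets are revealed
# with a two-day delay; a shorter lag would use data not yet available.
feat = add_lag(hist, 48, "target_lag_48h")
```

Joining on a shifted timestamp rather than calling `shift` keeps the lag correct even when some hours are missing from a series.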

dense agate
#

hmm, yea. I meant I will probably retrain the model with revealed data, not leaked data

wanton herald
#

ah

#

yeah you can do a rolling training

dense agate
#

not right now. Currently I am focusing on things that won't prolong the development cycle too much

wanton herald
#

on my side I prefer to focus first on a robust framework to build features without leaking the future by mistake

#

problem: it's much slower to try new stuff
advantage: very easy to plug into inference

#

i'm thinking of implementing a custom cache to compute the rolling features faster, but i'm a bit lazy atm aha

dense agate
#

yea, a big part of this competition is pandas data processing technique...

wanton herald
#

getting rid of pandas can considerably accelerate the preprocessing

dense agate
#

exactly, but I am too lazy to do that now

wanton herald
#

or using cudf from nvidia also

#

cudf+gpu is also a good way to preprocess faster

dense agate
#

yea, just learned about that. pretty interesting

wanton herald
#

i am using native cudf personally, did not try the new way with
%load_ext cudf.pandas
import pandas as pd

dense agate
#

internet off, can't install/use cudf during inference. lol...

#

maybe learning the polars package is the way to go

wanton herald
#

cudf is preinstalled on kaggle kernels

#

i am using it

#

you have two ways to solve your problem:

  1. Make a dataset with the package that allows you to do %load_ext cudf.pandas (if nobody has done it yet). You can then load that package in your inference notebook and pip install it without internet
  2. Use cudf without the %load_ext option. In that case you will have to call the cudf functions directly, which are more or less the same as in pandas (you can do rolling, merging, etc. all the same. The complex stuff comes with custom lambdas and .apply)
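A minimal sketch of the second option, assuming `cudf` is importable on the GPU kernel as stated in the chat. The pandas fallback is an addition for illustration so that the same lines also run on a machine without a GPU; the data and column names are toy assumptions.

```python
try:
    import cudf as xdf  # GPU DataFrame library; assumed present on the GPU kernel
except ImportError:
    import pandas as xdf  # CPU fallback so the same code runs anywhere

df = xdf.DataFrame({
    "unit": [0, 0, 0, 1, 1, 1],
    "target": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
capacity = xdf.DataFrame({"unit": [0, 1], "capacity": [10.0, 100.0]})

# Plain DataFrame operations with the same spelling in both libraries.
merged = df.merge(capacity, on="unit", how="left")
mean_by_unit = merged.groupby("unit")["target"].mean()
```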
dense agate
#

ah... so I just need to use cudf directly, not %load_ext cudf.pandas magic

wanton herald
#

if you want to use the pandas magic (might be convenient in some cases), you need to package the library in a kaggle dataset and load it in your notebook. Then you can pip install it without internet

#

you can check whether somebody has already done it. If not, you can try it and upload the dataset as public, it would certainly get some upvotes

dense agate
#

Damn... I am learning advanced stuff everyday

wanton herald
#

cudf is a good one 🙂 actually i'm sad they made a magic for pandas, I liked the fact that it was not so popular before, it always made a nice impression in interviews aha

dense agate
#

kaggle GPU kernel only has 2 cpu cores :harold:

#

using the GPU kernel makes everything even slower for me...

wanton herald
#

ah yeah? i get a 10x speedup on my notebook, but i'm doing 100% df operations

#

if you want to train your model on CPU but the preproc on GPU you can do like this:

dense agate
#

maybe my time-consuming operations are mostly silly indexing that can't be sped up much?

wanton herald
#

are you also doing the training in your notebook? that might be what takes more time. In that case you can split the work: build the training set in one notebook (with GPU), then train the model in a second notebook on CPU (using the training set as a dataset). Or alternatively, the tree boosting libraries also have GPU methods

#

regarding silly indexing, that would be weird. Given the dataset, i'd say that many weird operations can be done smoothly with smart df operations

dense agate
#

yes, i need to optimize my df operations, they are really a mess

late herald
#

Guys I have a question
I'm not sure I understand what each observation is.
Is every observation the total amount of energy consumed every hour by businesses (or individuals) in each county for every contract type (product type)?

wanton herald
#

You need to predict, for each day and each hour, the production and the consumption of electricity of people who have solar panels, for different categories of customers (split by county, business/personal, type of product).

sacred steeple
# wanton herald you can also train a model in a kaggle notebook, and access it via another kaggl...

thx Jacky. I am really a noob with competitions where you submit a notebook, these questions may be trivial. Atm I have a codebase that I locally developed on my laptop, with notebooks importing from files which themselves import from other files etc. Is there a recommended way of turning this codebase to a submission, like making a package out of it? Or do you actually write all the code on kaggle notebooks and never use a local system?

wanton herald
#

there is no recommended way, it is really up to you, and there are different trade-offs depending on what you prefer doing.
personally, I have a set of functions to simulate the API behavior and make sure the features I am building are not leaking information from the future. It makes the preprocessing longer, but more robust.
Then I have a set of functions for the feature calculations that I can plug directly on the data from the API or on my simulated version. I have them copy/pasted into the different notebooks I use because my computer is not powerful enough to work locally with a clean package.
In terms of notebooks, I have one with my simulated API + feature creation package which helps me save new features. When I create a new feature, I store it in an AWS bucket as a row_id / feature pair. It saves time because if I want to try different experiments I already have the features available and don't need to recompute them.
I have a second notebook for model training and feature selection. In this one, I simply load the features I am interested in, train a model, and save the model.
Finally I have an inference notebook in which I load my trained models; I can then calculate the features using my set of functions and simply do the prediction

sacred steeple
wanton herald
#

i'm wondering how many people are really in the 80- range without copying the top-ranked notebook

dapper edge
#

I really hate people who copy public notebooks and do submissions with it

#

overinflates the scores

dense agate
#

the rankings are just so crowded

dapper edge
#

I wonder if you can just farm medals by copying the best public notebook in each competition

dense agate
#

silver and bronze yes

#

so easy

dapper edge
#

damn

#

seems like an easy thing to fix, bit obvious when 100 people have the same score

dense agate
#

for a competition with this many participants, roughly the top 150 get silver. you can copy the public notebook, make some improvements and get an easy silver

dapper edge
#

bit boring tho

wanton herald
#

A good and simple fix would be to remove the possibility of forking kernels, and deactivate copy/paste.
It would require much more effort to copy, while keeping the philosophy of sharing

dapper edge
#

Or to force private any notebook that is in a medal spot

dense agate
#

A strange observation: when I do GroupKFold by month, my validation mean error (signed) is clearly negatively biased....

#

That's strange because a one-sided bias is usually caused by some long-term trend, but GroupKFold by month already uses leaked information in a way, because newer data is used in training to predict older targets...... doesn't make much sense

wanton herald
#

i don't think this is so important here personally. In the end, most of the "difficult" part to predict does not come from the variation of the time series itself but from the weather, which is what it is at a given moment in time.
The thing to avoid is having data from the same day in different train and test splits I think, apart from that it should be fine. Our models are most likely learning how weather influences production/consumption plus some very general trends (like the production profiles depending on the month)
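A minimal sketch of a split that keeps each calendar day on one side only, using scikit-learn's `GroupKFold` with the day as the group. The data is a placeholder. As noted earlier in the chat, this kind of split does not respect time order, so later days can still end up in training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

ts = pd.date_range("2023-01-01", periods=24 * 30, freq="h")
X = np.arange(len(ts)).reshape(-1, 1)  # placeholder features
y = np.zeros(len(ts))                  # placeholder target
groups = ts.strftime("%Y-%m-%d").to_numpy()  # one group per calendar day

cv = GroupKFold(n_splits=5)
folds = list(cv.split(X, y, groups=groups))

# For each fold, the set of days present on both sides (should be empty).
overlaps = [set(groups[tr]) & set(groups[va]) for tr, va in folds]
```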

dense agate
#

indeed, the weather conditions have a lot of impact. It's just that a one-sided bias seems like low-hanging fruit, yet I can't grab it. (like, I can multiply the prediction by 1.xx and improve the result but that's so dumb). Secondly, figuring out these anomalies usually leads to some insights neglected by others

dense agate
wanton herald
#

xgboost does

#

its a new feature they introduced recently

dense agate
#

I heard that too, definitely will try it later

wanton herald
#

I made an attempt, but so far not successful on my side

dense agate
#

Oh I missed NN so much. A transformer or RNN like prediction head to spit out a 24 hour predictions one by one would work so well conceptually

wanton herald
#

there was a competition where transformers were better than FE+boosting, the Riiid one, but the TS were not of the same nature

#

and there was muuuuch more data

#

here I think it's as usual: too much variance + lots of complex seasonalities + not enough data

dense agate
#

I think given enough development time, NNs will win in most cases, even for small datasets. but the time needed to regularize different features at different layers makes it really hard

wanton herald
#

i don't have enough experience with successful NNs to tell :p

dense agate
wanton herald
#

yeah, to avoid an explosion of features, i stayed generic with weather features at the day level, but it was not enough. and if you try to include all the features hour by hour, well, you do risk an explosion (not sure the RAM could handle it either)

late herald
#

can someone explain this error to me please?

#

i can't seem to predict anymore?

dense agate
#

you've already called predict() in line 23

#

it is supposed to be a loop, where you iteratively observe -> predict
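A minimal sketch of that observe -> predict loop. `MockEnv` is a hypothetical stand-in written for this example and is not the competition module; it only mimics the rule that each batch accepts exactly one `predict()` call before the next batch is served. The real object and its argument names should be taken from the competition's demo notebook.

```python
import pandas as pd

class MockEnv:
    """Hypothetical stand-in for the competition environment, not the real API."""

    def __init__(self, n_days):
        self.n_days = n_days
        self._awaiting_prediction = False
        self.submitted = []

    def iter_test(self):
        for day in range(self.n_days):
            test = pd.DataFrame({"row_id": [2 * day, 2 * day + 1]})
            sample_prediction = test.assign(target=0.0)
            self._awaiting_prediction = True
            yield test, sample_prediction
            if self._awaiting_prediction:
                raise RuntimeError("predict() was not called for this batch")

    def predict(self, prediction):
        if not self._awaiting_prediction:
            raise RuntimeError("predict() already called; fetch the next batch first")
        self.submitted.append(prediction)
        self._awaiting_prediction = False

env = MockEnv(n_days=3)
for test, sample_prediction in env.iter_test():
    sample_prediction["target"] = 1.0  # replace with model output
    env.predict(sample_prediction)     # exactly one call per batch
```

Calling `env.predict(...)` a second time outside the loop raises the mock's error, which is the same situation as the error discussed above.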

dense agate
#

why are people using historical weather as features when you have the weather forecast? I can't wrap my head around it....

#

and several historical weather features appear at the top of the feature importance in the latest 72.87 public notebook. That drives me crazy...

wanton herald
late herald
wanton herald
#

@dense agate are you team multi-models or single models ?
I started on single model, but the more I look at the data, the more I am tempted on splitting the models

dense agate
#

I tried to compare forecast and historical features, but most of them are not even comparable, measured in different methods

dense agate
wanton herald
#

yes but it doesn't really matter, no? I guess a comparison is made between rolling features on the target and rolling features on historical weather to capture the "trend", which is then compared to the forecast value to "balance" the "trends" obtained from the historical values and get a deviation

#

So far, I am considering 2 very distinct models (there is clearly 1 particular cluster which is really different from all the others), but I am also redoing all my feature engineering and calculating some particular coefficients for some clusters.
I don't much like the idea of using boosted trees with unbounded trends, so I'm trying to normalise everything properly

#

tree-based algos are very efficient if the data stays within the boundaries of the training set, but they are bad at extrapolating outside of those boundaries, unlike LR (if I'm correct). And here, we are exactly in the case where we will go outside of the boundaries, as the installed capacity increases over time

dense agate
#

hmm... you are right. maybe there is some interesting feature based on historical weather that can proxy some "state"/"trend"...

dense agate
wanton herald
#

I want to check something, holdon a sec :p

dense agate
#

normalizing the targets would mess up the loss though, need to be careful.

wanton herald
#

aaah

#

perfect

#

look at this

#

import numpy as np

N = 100
X = np.random.random(N) * 10
y = X * 2 + 6 + np.random.random(N)

new_X_1 = np.random.random(N) * 10
news_y_1 = new_X_1 * 2 + 6 + np.random.random(N)

new_X_2 = np.random.random(N) * 2 + 10
news_y_2 = new_X_2 * 2 + 6 + np.random.random(N)

With this code I create a simple linear relation between X and y, and my new_X_2 is outside of the train set (X)

#

from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X.reshape(-1,1),y)
dt = LGBMRegressor()
dt.fit(X.reshape(-1,1),y)

# inside the training range
print(lr.score(new_X_1.reshape(-1,1),news_y_1))
print(dt.score(new_X_1.reshape(-1,1),news_y_1))

# outside the training range
print(lr.score(new_X_2.reshape(-1,1),news_y_2))
print(dt.score(new_X_2.reshape(-1,1),news_y_2))

What is going to happen when I score/predict with a LinearRegression and an LGBMRegressor, in your opinion? :p

#

this could make a nice interview question for a junior ds actually aha

sacred steeple
#

wont the lgbm flatline?

wanton herald
#

if by flatline you mean "predicting always the same value outside of the training range", then yes

sacred steeple
#

yeah i mean if you would plot the preds they would stop increasing at around 25ish

dense agate
#

I actually have no idea how GBDT works at a low level, this competition actually introduced me to tree-based methods. My understanding is that it somehow formulates the regression problem as a classification with a lot of bins. But I also saw somewhere that the prediction can go beyond the target range seen in the train set. That's something I don't understand.

sacred steeple
#

the kaggle time series course showed a neat trick: first fit a linear model, then fit a lgbm on the residuals. ols extrapolates, lgbm fits the nuances
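A minimal sketch of that two-stage idea on synthetic data. scikit-learn's `GradientBoostingRegressor` is swapped in for LightGBM here only to keep the example to a single library; the data, the sine wiggle and the evaluation points are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2 * X_train[:, 0] + 6 + np.sin(3 * X_train[:, 0]) + rng.normal(0, 0.1, 500)

# Stage 1: the linear model captures the trend and can extrapolate it.
lin = LinearRegression().fit(X_train, y_train)
# Stage 2: boosted trees fit what the line misses (the residuals).
gbr = GradientBoostingRegressor(random_state=0).fit(
    X_train, y_train - lin.predict(X_train)
)

def hybrid_predict(X):
    """Linear trend plus tree-based correction."""
    return lin.predict(X) + gbr.predict(X)

# Compare with trees alone on inputs outside the training range.
trees_only = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
X_out = np.array([[15.0], [20.0]])
trend_out = 2 * X_out[:, 0] + 6  # underlying trend at those points
err_hybrid = np.abs(hybrid_predict(X_out) - trend_out).mean()
err_trees = np.abs(trees_only.predict(X_out) - trend_out).mean()
```

Outside the training range the tree part of the hybrid still flatlines, but it only flatlines the residual correction, so the trend itself keeps growing.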

wanton herald
#

a decision tree simply splits the training space to optimize a split criterion (entropy for classification, variance reduction for regression).
In the case of regression, it averages the target within each subspace
a boosted tree model combines multiple decision trees, so it is bound by the limits of the decision trees

dense agate
sacred steeple
wanton herald
#

why does the target increase over time? Because the number of clients increases over time, and the proxies for that are eic_count and installed_capacity

sacred steeple
#

we shouldn't have to extrapolate sooo much though, given rolling training

wanton herald
#

it all depends how you build your features I think

#

what you want is having a bounded space, it can be done in multiple ways

dense agate
#

I think there are many ways of normalizing based on different proxies. The issue is probably how to adjust the cost function once you do that.

wanton herald
#

and you need to be able to reverse the normalisation

#

but if you divide by a coef, normally you can always multiply your prediction by the same coef 🙂

dense agate
#

you will probably need a customized loss right? otherwise, you are minimizing a proxy of the MAE, not the real MAE

wanton herald
#

I think it should be roughly equivalent no ?

dense agate
#

I imagine they would be quite different, if the normalization has a meaningful impact.

#

These things are pretty easy to do with NN. don't know about boosting packages.

wanton herald
#

you can build a custom loss in your lgb

#

that includes the coefficient vector you use to normalize

#

basically what you want to do is something like:
loss(y, pred, c) = np.abs(y * c - pred * c)
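A minimal numeric check of that loss, on made-up data. For a positive coefficient c, |y - pred| = c * |y/c - pred/c|, so the absolute error on the normalised target weighted by c is exactly the real absolute error, while the unweighted version is only a proxy. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.uniform(10, 1000, size=200)   # per-row scale, e.g. installed capacity
y = c * rng.uniform(0, 1, size=200)   # raw target grows with the scale
y_norm = y / c                        # bounded target the model trains on
pred_norm = y_norm + rng.normal(0, 0.05, size=200)  # stand-in for model output

pred = pred_norm * c                  # reverse the normalisation
mae_real = np.mean(np.abs(y - pred))
mae_norm = np.mean(np.abs(y_norm - pred_norm))          # unweighted proxy
mae_weighted = np.mean(c * np.abs(y_norm - pred_norm))  # weighted by c
```

One way to get the same effect without writing a custom loss might be an L1 objective with c passed as per-row sample weights, assuming the boosting library supports both.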

dense agate
#

that's neat

sacred steeple
# sacred steeple thanks a lot for the detailed answer. I will try to come up with a setup this af...

I settled on the following development workflow (for now): write code locally in some folder, then zip and upload that folder to a Kaggle notebook. Then within the notebook, add the path of the uploaded folder to sys.path; now all imports within the folder work, and I can also import the code into the notebook. If a pickled trained model is in that folder, I can just unpickle it and it is ready for prediction. I still have to look at the sklearn version mismatch between kaggle and local, but this seems good enough.
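A minimal sketch of the import step in that workflow. A temporary folder stands in for the uploaded code folder, and `my_features` is a hypothetical module name; Python resolves imports through `sys.path`, so that is the list the folder is added to.

```python
import os
import sys
import tempfile

# Stand-in for the unzipped code folder; on Kaggle it would live under /kaggle/input/.
code_dir = tempfile.mkdtemp()
with open(os.path.join(code_dir, "my_features.py"), "w") as f:  # hypothetical module
    f.write("def double(x):\n    return 2 * x\n")

# Make the folder importable.
if code_dir not in sys.path:
    sys.path.insert(0, code_dir)

import my_features

result = my_features.double(21)
```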

wanton herald
#

the more I look at the data, the more I am confused

#

some time-series don't behave at all as they should

marble trail
#

Anyone interested in teaming up?

wanton herald
#

on my side I prefer trying to go for a solo gold, but happy to discuss ideas and thoughts here

dense agate
#

Linear Boosting is a two stage learning process. Firstly, a linear model is trained on the initial dataset to obtain predictions. Secondly, the residuals of the previous step are modeled with a decision tree using all the available features. The tree identifies the path leading to highest error (i.e. the worst leaf). The leaf contributing to the error the most is used to generate a new binary feature to be used in the first stage. The iterations continue until a certain stopping criterion is met.

#

why the heck a binary feature.... I am lost. Is it going to overfit after like 10 seconds?

indigo cedar
#

Just started looking into this comp. From what I gathered there are predictions to be made for every county, product type and is_business combination, for production and consumption. This would make for a maximum (not all have to be present) of 16*2*4*2 time series.

The API is necessary to make prediction but confuses me. It provides for every iteration a new row of information of the test set. Per row_id a target should be provided to make a submission.

  • does the test dataframe provide the indication what to predict? which county, product type, date and hour etc
  • does the test dataframe follow 'after' the last row of the training set? and continues to pour new information with every iteration?
  • is there a way (besides using the leader board or the training data) to use the values in the test set for validation? (MAE scores)
  • if you would fit a time series model for a individual timeseries (county / product type / is_business etc) is there a direct identifier to connect this timeseries (and thus the prediction) to the row_id in the sample prediction the iteration?
  • how would you create lag values of a single time series

I'll try to answer these myself but if someone would have some pointers, that would great.

wanton herald
# indigo cedar Just started looking into this comp. From what I gathered there are predictions ...

basically the API is here to make sure you make the predictions before accessing the next values, which would otherwise be a problem in terms of data leaks.
check this notebook: https://www.kaggle.com/code/sohier/enefit-basic-submission-demo it gives a simple routine to gather the test data (iteration by iteration) and submit (iteration by iteration).
At each iteration you will get multiple datasets providing the same data as in the training set (but one day at a time). You need to predict all the rows in "test". Note that the notebook crashes if you miss one or several rows, if you have duplicates, etc. So among the datasets sent by the API, one is called "sample_submission" and shows you the format to respect (the row_ids with the associated target). Then it's up to you to make your model output fit that format (I am personally using a left join and filling NaN with 0, this way I guarantee no duplicates/no missing data).
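A minimal sketch of that left join, with toy frames. `model_out` and its `pred` column are hypothetical names for whatever the model produces. The explicit `drop_duplicates` is an addition here, since a left join by itself would repeat a row if the right-hand side contained the same `row_id` twice.

```python
import pandas as pd

sample_prediction = pd.DataFrame({"row_id": [10, 11, 12, 13], "target": 0.0})
# Hypothetical model output: row 12 is missing and row 11 appears twice.
model_out = pd.DataFrame({"row_id": [10, 11, 11, 13], "pred": [1.5, 2.0, 2.2, 4.0]})

# Keep one prediction per row_id, then align on the required rows.
model_out = model_out.drop_duplicates("row_id", keep="last")
submission = sample_prediction[["row_id"]].merge(model_out, on="row_id", how="left")
submission["target"] = submission["pred"].fillna(0.0)
submission = submission[["row_id", "target"]]
```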

#

regarding your other questions:
I think the test set will follow the train set but will be provided step by step by the API; otherwise it would not be possible to calculate rolling features. For now there is a 2-day overlap, so you must make sure you have continuity between the train and test sets if you compute rolling features etc.
To create the lag features, everybody has their own method. I personally have a class Enefit which has each dataset from the train set as an attribute. Then at each iteration from the API, I add the rows/format the columns/etc. of each df to its corresponding df in my class. Then, and only then, I calculate the feature vector X using all the data I have compiled with joins/rolling/etc., and use it to make my predictions.
for the identifiers, I think the organizers provided one in the main df. I personally just use multi-index/multi-columns depending on what I need

wanton herald
#

For now, I am looking at each ts, one by one, and try to separate the ones with strange behaviors from the ones which are very similar

#

those are the "clean" ones

#

but there are also some weird specimens that I am considering treating separately

#

(those are ts smoothed over week + a little normalisation btw for those interested)

dense agate
#

I think I discovered some method that may be worth a paper...

#

basically making tree-based method works better for TS regression, after thinking of the issue you raised yesterday @wanton herald

dense agate
# wanton herald

interesting approach. I recall that in the last energy-themed regression comp, removing outliers was the key to gold

wanton herald
#

cool, i see you improved your score!
I'm curious to see where we will be landing by the end of the comp, there is so much to do

#

I was running a simple experiment this morning, one that all of us should be doing, and a simple forward-fill strategy already gives me an incredibly high score

#

well.. "incredibly high" = 66 in MAE, but in 3 lines of code and without models

dense agate
#

I am going to try some good old TS forecasting methods at some point

#

that's indeed very good for 3 lines of code

wanton herald
#

i might put the notebook public, i want to see how much it scores first. I'm just using pivot/melt but in a smart way

dense agate
#

After thinking about the limitations of tree-based methods, I managed to improve the LB MAE by ~2.6, at the cost of 7x the runtime though. Now it takes about 2 hours just to have the trained models. Could be a curse for me since it is so early; now everything takes a lot of time before I see the full effect...

wanton herald
#

well you are far ahead of the 2nd for now, and many people will not even read the stuff we are discussing here (which is super important!)

dense agate
wanton herald
#

yeah the problem is that many people just rush into boosting without even trying to understand the underlying problem, and for time series it's very important

#

if you want a bit of food for thoughts, check this time series:
df[(df.is_business==1) & (df.product_type==3) & (df.is_consumption==1) & (df.county==13)]

#

I find it very interesting

dense agate
#

exactly. as much as I hated the DEFCON CTF, the PTSD forced me to read the competition details again and again..... that helped

#

hmm, county 13 is not on my radar... county 0 is big trouble for me

wanton herald
#

the problem is not county 13 in particular but the combo of county 13 and the other features

dense agate
#

(df.product_type==3) & (df.is_consumption==1) is supposed to be the trouble anyway

wanton herald
#

yes, it's true that most of my outliers come from that

#

in this particular one, there is a very interesting drop, and I don't think it can be explained by the added/removed capacity. I think there is some information that we don't have

#

I was wondering if a client is a co-generation power plant (that can produce with gas and solar, for example) or a plant that produces hydrogen from electricity. That could generate pretty big outliers in comparison to smaller players

dense agate
#

maybe lost a big customer...

#

product_type==3 is the spot contract. I thought having the electricity cost would help explain a lot of the variance. But the electricity price only helped marginally in my model.

wanton herald
#

yeah but you would expect to see this in the capacity installed also no ?

#

ah no, yeah, because here the drop is on the consumption time series... So it might be a pure energy consumer that left. Makes sense

dense agate
#

production is easier, because it doesn't make sense to turn it off. that's free money

#

consumption is another story, there are so many alternative even if the facility is installed

wanton herald
#

yeah I saw that too

#

and same observation: the elec/gas prices are not very useful, and there is no obvious correlation with filters like elec > gas or things like this

#

do we have an idea btw of the benchmark from ENEFIT ? Whats their own baseline ?

dense agate
#

I am curious too

#

it's one of the few industries where forecasting accuracy directly translates into $$, and the host seems to have decent skills

#

damn, if this competition goes well, after it is done I may as well contact my local electricity provider and ask for a project & funding

wanton herald
#

ahaha

dense agate
#

And there is a guy on the forum asking whether he can give up the chance to receive a prize in return for participating without handing in code. Maybe this is the reason. If a proprietary algo can improve the forecast accuracy considerably, that means a lot of $ for these big corps.

wanton herald
#

yeah its a huge subject

dense agate
#

the compe prize is probably nothing compared to what the big corp can save

wanton herald
#

and god knows the amount of data scientist not doing things properly

#

and that's one of the few fields where a small gain in accuracy can save a lot

#

ok I think I managed to put up the submission pipeline with my naive method

#

time to see how it scores!

dense agate
#

btw this is my favorite baseline

#

you got to do better than this.

wanton herald
#

how much does it score ?

dense agate
#

219

wanton herald
#

thats what i am doing, but in more evolved :p

#

and yes, that should be how to do a baseline

#

but there is something better than simply ffilling the last value

#

you need to ffill the right value, which is not the last

#

not even the last day of a given hour

dense agate
#

oh! interesting

wanton herald
#

there is also a weekly periodicity

#

a lot of stuff runs slower during the weekend

#

not too bad including the weekly periodicity :p

dense agate
#

👍

wanton herald
#

I guess just by combining the previous hour + the same hour from last week you can already get very good scores

dense agate
#

I guess it will end up similar but inferior to tree-based methods

wanton herald
#

I was just saying that there are multiple ways of doing a baseline. The easiest is to propagate, for a given ts, the last hour.
another method is, for each hour, to propagate the value from the day before
a more advanced method is to propagate, for each hour and each day of the week, the last occurrence
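
The three baselines above can be sketched as plain shifts on a single hourly series. This is only an illustration on synthetic data: it assumes an hourly index without gaps, and the helper name `naive_forecasts` and the column names are made up for the example.

```python
import numpy as np
import pandas as pd

def naive_forecasts(target: pd.Series) -> pd.DataFrame:
    """Build three naive forecasts from an hourly series without gaps."""
    return pd.DataFrame({
        "last_hour": target.shift(1),              # value one hour earlier
        "same_hour_prev_day": target.shift(24),    # same hour, day before
        "same_hour_prev_week": target.shift(168),  # same hour, same weekday
    })

# Synthetic series starting on a Monday: daily cycle, lower level on weekends.
idx = pd.date_range("2023-01-02", periods=24 * 28, freq=pd.Timedelta(hours=1))
hours = idx.hour.to_numpy()
weekend = idx.dayofweek.to_numpy() >= 5
values = 100 + 20 * np.sin(2 * np.pi * hours / 24) - 30 * weekend
target = pd.Series(values, index=idx, name="target")

forecasts = naive_forecasts(target)
# Mean absolute error of each baseline (NaN warm-up rows are skipped).
mae = forecasts.sub(target, axis=0).abs().mean()
print(mae)
```

On this toy series the weekly shift is exact because the signal repeats every 168 hours. In the competition the most recent values are not available at forecast time, so the shifts would have to respect the real lag of the revealed targets.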

#

actually we cannot do the first one because we don't have access to the last hour when we do the forecast

wanton herald
indigo cedar
# wanton herald regarding your other questions: I think the test set will follow the train set b...

Thanks for the answer. I'll probably have to play around with the data a bit more to fully grasp how to compute the lagged values. But are you modelling individual time series (as there are at least 2 for the production and consumption, but possibly many more distinct time series, as you are discussing as well), then making a forecast for each time series, and then mapping the right prediction value to the county/product_type/business/cons_prod?
I'm struggling to see how to get from different models (which are fed row by row on training data) to the row_id target value. Wouldn't it get really messy if prediction_units drop out etc?

wanton herald
# indigo cedar Thanks for the answer. I'll probably have to play around with the data a bit mor...

you can do it the way you want. In many notebooks, people simply predict the target given all the other information without manipulating the time series themselves (it's somehow handled by the model if you give it information like hour, dayofweek etc...)
If you want to build a forecast for each time series, you do indeed need a good layer of pre-processing and post-processing. It's not straightforward and it might require different things depending on your approach. There is no absolute solution for this
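
One way to do the post-processing step is a left join from the rows served by the API onto the per-series predictions. The sketch below assumes each series is identified by `prediction_unit_id` and `is_consumption`; the helper `attach_predictions` and the zero fallback are illustrative choices, not something prescribed by the competition.

```python
import pandas as pd

KEYS = ["prediction_unit_id", "is_consumption", "datetime"]

def attach_predictions(test: pd.DataFrame, preds: pd.DataFrame,
                       fallback: float = 0.0) -> pd.DataFrame:
    """Left-join per-series predictions onto the rows that must be scored."""
    out = test[["row_id"] + KEYS].merge(preds[KEYS + ["target"]], on=KEYS, how="left")
    # Series without a model (e.g. a unit that appeared or dropped out) get the fallback.
    out["target"] = out["target"].fillna(fallback)
    return out[["row_id", "target"]]

test = pd.DataFrame({
    "row_id": [10, 11, 12],
    "prediction_unit_id": [0, 0, 7],
    "is_consumption": [0, 1, 1],
    "datetime": pd.to_datetime(["2023-06-01 00:00"] * 3),
})
preds = pd.DataFrame({
    "prediction_unit_id": [0, 0],
    "is_consumption": [0, 1],
    "datetime": pd.to_datetime(["2023-06-01 00:00"] * 2),
    "target": [1.5, 42.0],
})
submission = attach_predictions(test, preds)
print(submission)
```

Because the join starts from the API rows, a unit that has no model still gets exactly one prediction per `row_id`.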

indigo cedar
#

No, I understand that it's up to everyone to design a method to approach this. It kinda makes the entry to prediction a bit more difficult, as you need to handle the output in a very narrow and specific way.
The method used with tree algorithms can fit all the data, but if you want to try ARIMA-based models or variants that do not generalize well to exogenous features, then you already need to split the models and match the output to the row-based API setup. Probably doable but more prone to errors.

#

By the way, when you were saying that it doesn't make sense to shut down production: in some countries they have mechanisms in place to reduce production through incentive schemes or penalties when there is too much production. But that's probably an edge case and not applicable here though

dense agate
#

my models are so bad in May, and luckily the private test ends in April

#

wondering what's special about May...

wanton herald
#

in France we have a lot of days off in may

#

add 2-3 days off you can add already a 10% error

dense agate
#

lol... are there some statistics on the % of people taking days off during the year? that would be a funny feature

wanton herald
#

no but like national days off

#

in France for example we have 1st May, 8th May, Pentecost, Ascension Thursday...

dense agate
#

I even studied the school schedule in Estonia, and made a feature for when schools are open and students are in class. Thought that would make a huge difference, but nothing...

#

maybe schools are not Enefit's customers, IDK...

#

Maybe I can do a public notebook of useless features...

wanton herald
#

it can also be that may is a hard month both for consumption and production

#

mmh

#

I had a mistake in my training, I was shifting by two rows instead of 1

#

when I shift of 1 row only I get a MAE of 35

#

where is the fluck ? :p

#

ah I know i forgot a param...

#

@dense agate what is your current score on training data ? out of curiousity ?

#

i'm getting MAE 58 now with my ffill method

dense agate
#

GroupKFold by month, MAE around 35.5

#

I don't know why GroupKFold by month works a bit better on LB... by year-month or by day should make more sense, and gives a much better local MAE, but doesn't improve LB
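
For readers who have not used it, grouping the folds by calendar month with scikit-learn's `GroupKFold` looks roughly like this. The features and target are placeholders; only the way `groups` is built matters.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

idx = pd.date_range("2022-01-01", "2022-12-31 23:00", freq=pd.Timedelta(hours=1))
X = pd.DataFrame({"hour": idx.hour, "dayofweek": idx.dayofweek}, index=idx)
y = pd.Series(np.arange(len(idx), dtype=float), index=idx)  # placeholder target

# Group by calendar month (1..12): a month never appears in both train and validation.
# Grouping by year-month would use idx.to_period("M") instead, by day idx.normalize().
groups = idx.month.to_numpy()

cv = GroupKFold(n_splits=4)
folds = list(cv.split(X, y, groups=groups))
for train_idx, valid_idx in folds:
    print(sorted(set(groups[valid_idx])))  # months held out in this fold
```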

wanton herald
#

dont know... The advantage with the ffill method is that I dont need groupKfold :p

#

now that I fixed a bug, i'm expecting better results, maybe I can sub 100

dense agate
#

I was thinking of doing the H-W algo

wanton herald
#

@dense agate do you use functions pd.melt/pd.pivot_table ?

dense agate
#

not for this competition

wanton herald
#

it's pretty handy for doing feature engineering quickly, you might want to have a look to speed up your code

#

(worth it for all participants reading this message)
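
A small sketch of the pivot/melt round trip: go wide (one column per series), compute a feature for all series at once, then go back to long and merge. The column names and the one-step lag are just an example.

```python
import pandas as pd

long = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"] * 2),
    "prediction_unit_id": [0, 0, 1, 1],
    "target": [1.0, 2.0, 10.0, 20.0],
})

# Wide layout: one row per timestamp, one column per series.
wide = long.pivot_table(index="datetime", columns="prediction_unit_id", values="target")

# Any feature computed on the wide frame applies to every series at once,
# here a one-step lag.
lagged = wide.shift(1)

# Back to the long layout so it can be merged onto the original rows.
lag_long = lagged.reset_index().melt(
    id_vars="datetime", var_name="prediction_unit_id", value_name="target_lag1"
)
# Make sure the key has the same dtype on both sides of the merge.
lag_long["prediction_unit_id"] = lag_long["prediction_unit_id"].astype("int64")
features = long.merge(lag_long, on=["datetime", "prediction_unit_id"], how="left")
print(features)
```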

timid swan
#

just to expand here on my message in the overfit thread (kaggle website discussions) -- I tried out to predict not the direct target, but e.g. the lagged (log) difference. It does seem to slightly improve the perf in short term (as in if I train until end of March, performance in April is better). But performance is worse long term with those transformations (e.g. May). Really wondering how to transform this problem to either make tree methods work / try out linear trees / get rid of trees completely...

wanton herald
#

why using log diff and not just diff ?

timid swan
#

because with normal diff a few time series are still non stationary (checked using Dickey Fuller test)

#

with log they are all stationary
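
The exact transform is not spelled out here, so the following is only a sketch of a lagged log difference and its inverse. Using `log1p` (so that zero targets stay finite) and a 48-hour lag are assumptions made for the example.

```python
import numpy as np
import pandas as pd

LAG = 48  # hours; assumed lag of the latest target known at forecast time

def to_log_diff(target: pd.Series, lag: int = LAG) -> pd.Series:
    """log1p(target_t) - log1p(target_{t-lag}); log1p keeps zeros finite."""
    logged = np.log1p(target)
    return logged - logged.shift(lag)

def from_log_diff(pred_diff: pd.Series, target: pd.Series, lag: int = LAG) -> pd.Series:
    """Invert the transform using the lagged target, which is already known."""
    return np.expm1(pred_diff + np.log1p(target.shift(lag)))

idx = pd.date_range("2023-01-01", periods=24 * 10, freq=pd.Timedelta(hours=1))
target = pd.Series(np.linspace(0.0, 500.0, len(idx)), index=idx)

y = to_log_diff(target)              # what the model would be trained on
restored = from_log_diff(y, target)  # what gets submitted after inversion
```

The model would predict `y`, and the inverse is applied to its output before submission.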

wanton herald
#

I am personally trying to regress the target on capacity_installed and use the prediction as a coefficient to normalize the time series

#

this way I get rid of the drift (the main drift coming from new clients)

timid swan
#

Oh okay. I did try to predict target per installed capacity. However, this didn't really work out (> 100 MAE) for me. And did not want to waste more time in that direction with these initial results^^

wanton herald
#

yeah it's not the best, but it helps bound the targets.
This is for example a time series for a given hour and a given dayofweek, without / with this factor applied:
so we see that the upward trend is corrected, and it's easy to reverse back.

#

But for now, it did not improve my analysis so much because the lagged error is not really affected by the long trend of installed capa...

dense agate
#

is data augmentation a thing for tree-based methods? I am throwing out random thoughts to solve the "out of boundary issue". It seems reasonable to synthesize a new sample by combining two samples from two days with similar features: sum the client features, and average the others

#

that would solve the out of boundary issue, but could take a toll on training time

wanton herald
#

I never saw data augmentation working for tabular data

#

I think there is a much better shot into removing outliers

#

On my side I managed to get to MAE 55 on the train set with just smoothing + ffill, but I can't get inference to work properly for now

#

This score is important because it can help create stationary ts like Lasse was saying

dense agate
#

isn't Dickey Fuller test a bit too harsh though. after all, we have many features other than just time

#

if there exists a test that examines stationarity with respect to the features, I think the data will pass easily

wanton herald
#

I'm not using a test, but I want stationary ts to be able to predict the delta from one date to the other using the delta of the other metrics (like temperature, clouds etc..)

wanton herald
#

For now I would like my sub with lagged target to work correctly, that would be a nice first step 😅

dense agate
#

the last line of my notebooks prints the number of NaN values in the features...lol

wanton herald
#

the problem is that we cannot see the public test set, so its tough to debug :p

dense agate
#

There is one new column in the test file: currently_scored. This is intended to allow you to reliably avoid spending time performing inference on unscored rows delivered by the API, such as the initial rows of train data.
the new column is interesting

#

If it is what I understand it to be, then we can probably delay our training until we hit the first row with currently_scored=True
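
A minimal sketch of that idea, using a stand-in generator instead of the real competition iterator. `fake_iter_test` and the placeholder predictions are hypothetical; the only thing taken from the data description is the `currently_scored` column.

```python
import pandas as pd

def should_predict(test: pd.DataFrame) -> bool:
    """True when at least one row of this batch counts towards the score."""
    return bool(test["currently_scored"].any())

def fake_iter_test():
    """Stand-in for the competition iterator: two unscored batches, one scored."""
    for day, scored in [("2023-05-28", False), ("2023-05-29", False), ("2023-06-01", True)]:
        yield pd.DataFrame({
            "row_id": [0, 1],
            "prediction_datetime": pd.to_datetime([day, day]),
            "currently_scored": [scored, scored],
        })

expensive_calls = 0
for test in fake_iter_test():
    if should_predict(test):
        expensive_calls += 1        # place for (re)training and real inference
        target = [1.0] * len(test)
    else:
        target = [0.0] * len(test)  # cheap placeholder, a prediction is still returned

print(expensive_calls)
```

The unscored batches are still worth storing, since they carry the history needed for lag features later on.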

timid swan
# dense agate isn't Dickey Fuller test a bit too harsh though. after all, we have many feature...

Well yes of course it's quite hard. The point is that without stationary targets, your tree will often have to predict at its bounds, where by definition it will be less well trained. Linear trees might be a huge step up with proper feature normalization. Also, if you look at the Kaggle M5 competition paper, there are some interesting augmentation strategies for tabular time series data.

#

For me, I currently do get my best scores by just predicting the actual target. But I do feel that having a stationary target should help. At least for long term quality forecasts. However, this competition seems more and more about who is re-training the most often during inference 🙃

wanton herald
#

I don't want to believe that retraining is the key, personally... :p
We can get a very good score just by playing on the periodicity of the target (55 MAE on the training set by simply using a 2/7 days lag with a 3hrs smoothing is not bad!) and I assume that weather has a large part to play in explaining the remaining error (at least on the production side).
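
How the 2-day and 7-day lags and the 3-hour smoothing are combined is not detailed, so the sketch below picks one plausible reading: smooth with a centred 3-hour rolling mean, then average the two lags. Both choices are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def smoothed_lag_forecast(target: pd.Series, window: int = 3) -> pd.Series:
    """Average the 2-day and 7-day lags of a centred rolling mean."""
    smooth = target.rolling(window, center=True, min_periods=1).mean()
    return 0.5 * (smooth.shift(48) + smooth.shift(168))

# Synthetic hourly series: daily cycle plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-02", periods=24 * 35, freq=pd.Timedelta(hours=1))
signal = 100 + 20 * np.sin(2 * np.pi * idx.hour.to_numpy() / 24)
target = pd.Series(signal + rng.normal(0, 5, len(idx)), index=idx)

mae_raw = (target - target.shift(48)).abs().mean()                  # plain 2-day lag
mae_smooth = (target - smoothed_lag_forecast(target)).abs().mean()  # smoothed, two lags
print(mae_raw, mae_smooth)
```

On noisy data the averaging removes part of the noise carried by a single lagged value, which is where the gain over a plain lag comes from.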

timid swan
#

Of course there are many things at play ! You will be able to get a great score without retraining at all ! But the very top submissions will surely do that. In the end we will have to predict 10 months of future data. No way around at least adjusting your model (e.g. by fitting past errors or just retraining the whole thing).

#

The smoothing is a nice idea. People tend to forget about post-processing 😄

marble trail
wanton herald
# marble trail when you say that you use the profiling, do you mean that you created a new vari...

For now I am not using the profiles, but to visualise them you can normalise each time series, day by day, by its daily mean value, and then average for each hour and each month.
You can build other features with this idea or use it as preprocessing when you calculate other features. For example: if you want to smooth the target by applying a rolling window, you might want to take this type of profile into account before smoothing
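
The profile computation described above can be sketched for one series as follows. The function name `hourly_profiles` and the synthetic data are made up for the example.

```python
import numpy as np
import pandas as pd

def hourly_profiles(target: pd.Series) -> pd.DataFrame:
    """Mean normalised daily shape: rows are months, columns are hours of the day."""
    daily_mean = target.groupby(target.index.normalize()).transform("mean")
    normalised = target / daily_mean.where(daily_mean != 0)  # skip all-zero days
    return normalised.groupby([target.index.month, target.index.hour]).mean().unstack()

# Two months of synthetic data whose daily amplitude depends on the month.
idx = pd.date_range("2023-01-01", "2023-02-28 23:00", freq=pd.Timedelta(hours=1))
values = 50 + 10 * idx.month.to_numpy() * np.sin(2 * np.pi * idx.hour.to_numpy() / 24)
target = pd.Series(values, index=idx)

profiles = hourly_profiles(target)
print(profiles.round(2))
```

Each row is a dimensionless shape with mean close to 1, so profiles of large and small producers can be compared or plotted on the same axis.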

wanton herald
marble trail
charred spoke
#

Hi guys, I don't understand the logic between the target and is_consumption in the train dataset. We have 2 observations per prediction_unit_id: is the relationship (target when is_consumption = 1) - (target when is_consumption = 0) or (target when is_consumption = 1) + (target when is_consumption = 0)? This is my first competition, I'd really appreciate your help

marble trail
#

Actually sebas, we have two different time series per prediction_unit. One for is_consumption = 0, which means that energy is being produced and stored, and that usually happens in the summer months. The other TS associated with the prediction unit is for is_consumption = 1, which means that energy is being consumed. Ideally that happens during the winter months

ornate elbow
#

hello there
in the weather forecast file there are two predictions for almost every hour in all counties. my problem is that the early prediction has a data_block_id indicating that it will not be available until the next day, in other words after a lag of one day, but on the other hand we have the column hours_ahead, which suggests that the data will be available exactly when the day starts. so which one should I trust?

dense agate
#

In the end, anyone who wants a gold will probably need to retrain their model a few times, but it's unlikely to be an advantage if you retrain many more times than others.

signal hollow
#

Hi guys, I'm kinda stuck with a problem submitting via the env. I made some adjustments and got a Submission Scoring Error. Checked everything: no NaN, filling NaN with 0.0, float64 column type, but I spent 5 submissions with no results. Read the topic on the forum, but nothing changed. Maybe someone found a way to guarantee submission scoring?

wanton herald
#
  1. Generate your results, join on row_id with test_submission, fill NaN with a method of your choice
  2. Wrap the body of the loop in a try/except, and submit test_submission in the except branch
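
The two steps can be sketched like this. `build_submission` and `predict_safely` are hypothetical helper names and the zero fallback is just one possible fill method; the real loop would pass the result to `env.predict`.

```python
import numpy as np
import pandas as pd

def build_submission(sample_prediction: pd.DataFrame, preds: pd.DataFrame,
                     fallback: float = 0.0) -> pd.DataFrame:
    """Return one finite float64 target for every row_id requested by the API."""
    out = sample_prediction.drop(columns=["target"], errors="ignore").merge(
        preds[["row_id", "target"]].drop_duplicates("row_id"), on="row_id", how="left"
    )
    target = pd.to_numeric(out["target"], errors="coerce").astype("float64")
    out["target"] = target.replace([np.inf, -np.inf], np.nan).fillna(fallback)
    return out

def predict_safely(model_fn, test: pd.DataFrame, sample_prediction: pd.DataFrame) -> pd.DataFrame:
    """Run the model; on any exception fall back to a dummy forecast."""
    try:
        preds = model_fn(test)
    except Exception:
        preds = sample_prediction[["row_id"]].assign(target=np.nan)
    return build_submission(sample_prediction, preds)

sample_prediction = pd.DataFrame({"row_id": [1, 2, 3], "target": [0.0, 0.0, 0.0]})
test = pd.DataFrame({"row_id": [1, 2, 3]})

def broken_model(test):
    raise KeyError("unseen prediction unit")

def partial_model(test):
    return pd.DataFrame({"row_id": [1, 3], "target": [5, np.inf]})

sub_broken = predict_safely(broken_model, test, sample_prediction)
sub_partial = predict_safely(partial_model, test, sample_prediction)
```

A blanket `except` hides bugs, so it is worth logging the exception while debugging locally and keeping the fallback only for the submitted version.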
signal hollow
#

hm, that will possibly work. too bad that we can't get more error traces because of the possible data leak

#

thanks!

wanton herald
#

Yeah its very frustrating... The reason behind that is that having access to the error would allow participant to leak the test dataset

ornate elbow
#

hello guys, are there any cases where data_block_id is misleading and does not represent the true time at which the data becomes available?

#

my problem is that according to data_block_id some weather forecasts will be available after a one-day lag, like gas prices, but that does not make sense, since the forecasts are made at the start of the day and should be available before we predict the target

#

I am really struggling with this 😔

dense agate
#

This is how it works, you don't get to collect all information at 00:01am of D+1 and make prediction after that. The prediction for day D+1 is needed at 11am of day D.

ornate elbow
#

If I understand you correctly you meant that even though the forecast of day D is available on day D we can't use it since we will predict day D at day D-1 where these data will not be available yet

dense agate
#

yes. We make the prediction for the 24 hours of day D at 11am of day D-1

ornate elbow
#

Thank you so much for the explanation 👍

upbeat sparrow
#

Guys, I'm currently looking for a team for this competition, can anyone pull me in? I'm an undergraduate student in computer science focusing on data analysis, and I have built some predictive and descriptive models myself.

wanton herald
#

i took a week of break and I'm back in the comp, hard to put my head back in the notebooks!

dapper edge
#

I haven't started yet, is it too late to get into it?

wanton herald
#

there is still plenty of time!

dense agate
#

I took a two-week break from the other competition. It's closing soon and my LB ranking dropped from like 2nd to 200th because of some public notebook. lol....

late monolith
#

and yet, help me please.

wanton herald
#

is there really someone that published a public notebook with a top 5 score? 🤯

dapper edge
dense agate
#

Open problem 2

#

It was ensembles of multiple good public solutions. People published the ensembles with scores in the Gold zone, thinking they were just overfitting the public test. And other people keep blending... But now there is speculation that there could be a slight chance some of those ensembles are not just overfitting

#

Now I have to work my a*s off to make sure my previous efforts are not flushed down the toilet

wanton herald
#

is the test set big for that ones ? I saw in many competitions very funny shakeup for people blending high public scores

#

its actually very easy to overfit a testset

dense agate
#

the dataset is quite small

#

and the distribution of train/public test/private test are very different from each other. Personally I think those blends are overfitting. but still... quite scary to see the shakeup just before the closing

wanton herald
#

blending can be efficient when used correctly, but most people don't, and it often results in badly overfitting the test set.
I remember on the MOA, which is the first competition I took seriously: at the end of the comp, a lot of people were publishing blends of blends and it shifted the silver/bronze area a lot.
At the end of the day, all the people that used those public notebooks got catapulted to the bottom of the ladder ahah

signal hollow
#

Is it ok that the score can vary between identical submissions? I suppose the reason is that I get different test cases each time

#

Or it is just noise?

wanton herald
#

If you train your model in the same notebook without fixing the random state, yes it is normal, although it should not change much. What is sent by the api is deterministic, so any variation is coming from your side

dull chasm
marble trail
#

I keep encountering this error: ModuleNotFoundError: No module named 'enefit.competition' when running the API. Is there any module missing on my side? thank you

wanton herald
#

you need to import the data from the competition, the api is part of them

amber oasis
#

the distribution of train/public test/private test are very different from each other <-- Sounds bad.

amber oasis
#

Why would Kaggle organizers not take more care, to ensure the training and test set are from the same distribution?

marble trail
# wanton herald

hi, I imported the enefit folder from the same directory with
import enefit

but received this error message:
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[15], line 1
----> 1 import enefit

File ~/Dropbox/KaggleChallenges/predict-energy-behavior-of-prosumers/enefit/__init__.py:2
----> 2 from .competition import make_env
      4 __all__ = ['make_env']

ModuleNotFoundError: No module named 'enefit.competition'
I wonder what could cause this kind of problem? thank you 🙂

dull chasm
#

You should have competition.cpython-310-x86_64-linux-gnu.so and __init__.py files in your enefit folder.

marble trail
#

my folder looks like this, so the file is there..

dense agate
amber oasis
#

Alright, makes sense. Winter is different from summer for solar output. Duh.

#

Any other causes of drift?

signal hollow
marble trail
wanton herald
#

you have a relative path problem. you need to run your notebook (or your scripts) from the folder that contains the enefit folder:
|- working_path/
   |- enefit/
   |- notebook.ipynb
If your notebook is not located in the right place, you'll get errors.
Another way would be to add the folder that contains enefit/ to your Python path.
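
A sketch of the second option. The directory name `working_path` is a placeholder; what has to go on the path is the directory that contains the `enefit/` folder, not `enefit/` itself.

```python
import sys
from pathlib import Path

def add_to_python_path(folder: str) -> str:
    """Put `folder` at the front of sys.path so packages inside it are importable."""
    resolved = str(Path(folder).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved

# Placeholder location: the directory that CONTAINS the enefit/ folder.
working_path = add_to_python_path("./working_path")
# From here on, `import enefit` would be looked up in working_path/enefit/.
```

This only fixes path problems. It does not help when the compiled module inside the package cannot be loaded on the current platform.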

Ask GPT3.5 for assistance, for this type of problems it is very efficient.

ornate elbow
#

guys what do we know about the format of the testing data that will be used to give a score for the model
let's say that i used Autoregressive Integrated Moving Average model but when they test the model they provided day by day data this will make the model useless

#

so what do we know about the mechanism of how the data will be provided during testing phase

wanton herald
#

at each iteration you get a sample of all the dataset for a particular given date.
For example for 2023-11-28 you would get:

  • the rows to predict (for 2023-11-28)
  • the lagged target (from 2023-11-26)
  • the lagged client
  • etc...

After that, it's your job to keep that data in memory and build whatever model you want with it

ornate elbow
#

my concern is that we are working on a time series problem where historical data is crucial (e.g. the weather of one day is not as informative as the weather of the whole month), but we do not know if the data format will allow us to use historical data or not

wanton herald
#

you can do everything you are doing during training as long as you organise yourself your time series.
You must keep in memory the historical values, and when a new dataset is released, have a processing that will add to the historical values the new values you received
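
A minimal sketch of such a history buffer for one of the tables. The class name `History`, the key columns and the `max_age` trimming are illustrative choices; dropping duplicates and bounding the size are there to keep memory under control.

```python
import pandas as pd

class History:
    """Keeps a de-duplicated, size-bounded history of one API table."""

    def __init__(self, keys, time_col="datetime", max_age="365D"):
        self.keys = list(keys)
        self.time_col = time_col
        self.max_age = pd.Timedelta(max_age)
        self.data = None

    def update(self, new_rows: pd.DataFrame) -> pd.DataFrame:
        new_rows = new_rows.copy()
        new_rows[self.time_col] = pd.to_datetime(new_rows[self.time_col])
        frames = [new_rows] if self.data is None else [self.data, new_rows]
        merged = pd.concat(frames, ignore_index=True)
        # Keep the most recent copy when the API re-sends rows already stored.
        merged = merged.drop_duplicates(self.keys + [self.time_col], keep="last")
        # Drop rows older than max_age to bound memory use.
        cutoff = merged[self.time_col].max() - self.max_age
        self.data = merged[merged[self.time_col] > cutoff].reset_index(drop=True)
        return self.data

def batch(times, targets):
    return pd.DataFrame({"prediction_unit_id": 0, "is_consumption": 1,
                         "datetime": times, "target": targets})

history = History(keys=["prediction_unit_id", "is_consumption"], max_age="7D")
history.update(batch(["2023-06-01 00:00", "2023-06-01 01:00"], [1.0, 2.0]))
stored = history.update(batch(["2023-06-01 01:00", "2023-06-04 00:00"], [2.5, 3.0]))
trimmed = history.update(batch(["2023-06-20 00:00"], [4.0]))
```

One instance per table (targets, client, weather, prices) keeps the training-time feature code reusable at inference time.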

ornate elbow
#

that will be a bit of a challenge since we know so little about the data nature especially that when you submit a notebook you get only a score or a very general error message

wanton herald
#

before submitting you can check what data is served. Check the basic submission notebook and try to run the cells: https://www.kaggle.com/code/sohier/enefit-basic-submission-demo

Then you can check one by one the data served by the API
for (test, revealed_targets, client, historical_weather,
forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:

marble trail
#

Hi! I have a question regarding the API! I am using Windows so the .so file is not working for me. Is there anyone who could use the .so file on Windows with a workaround?

wanton herald
#

why don't you use kaggle kernels directly ?

marble trail
#

I am quite new to kaggle so I didn't know I could to be honest!

wanton herald
#

on the top of the competition in the code section you can click on "new notebook"

marble trail
#

Thank you so much!:)

wanton herald
#

good luck!

marble trail
wanton herald
#

the issue might be due to this .so file, apparently does not work on windows according to natalia

plain plume
#

I'd like to clarify the public test data period and the private test data period. 2024-01-31 is the final submission deadline and 2024-04-30 is the competition end date. Does it mean that the private leaderboard will be scored by MAE on actual data generated between 2024-02-01 and 2024-04-30? How is the public leaderboard scored? The competition dataset includes targets between 2021-09-01 and 2023-05-31, and env.iter_test seems to start from 2023-05-28. What is the last datetime the API will provide when I submit my code?

thick pecan
#

Hello dear competition participants!
I would like to understand the data a little better and what we should achieve as an output because I am new to the competition so in short everything you need to know about the competition for more clarification. THANKS

willow swift
#

Hello dear competition participants!
i have questions about train and other datasets concerning data_block_id
why train.csv begins with data_block_id = 0 in 2021-09-01
and the others don't begin with 0
for example in the electricity dataset we have forecast_date 2021-09-01 same as train but data_block_id = 1
Also if data_block_id = 637 in both train and electricity
we don't have the same date
in train date = 2023-05-31 and in electricity date = 2023-05-30
i need to understand to merge the data correctly

gaunt ruin
#

Hello. I'm new to kaggle and don't know what this time-series API is. I can't install the enefit package to experiment with the provided notebook. Can someone tell me what this api does?

jade lark
#

The main folder change ?

dense agate
#

another data patch. Good news for most people, not so good news for LB top teams

ornate elbow
#

also they mentioned some edits that should be done on the data but I did not find anything new so now I do not know should keep working on the data I have or wait for some update

wanton herald
#

I'm going in circles in this competition, trying to do fancier stuff than just "training vulgar boosting tree + feature engineering" but without much success so far

dense agate
# wanton herald were you shifting your forecast datasets ?

I had a processing pipeline that shifted the data correctly before the patch (through reading and some trial and error). Now the patch is out, which nullifies my pipeline. And I honestly don't know what changed and what didn't. The description of the patch is brief and unclear, the timestamp in the updated data doesn't look like it is in the right format, and the description in the data tab is not updated... In short, I think the patch is a mess in its current form. I would rather wait for clarification before doing anything.

#

The data was correct before the patch anyway, just required people to pay attention to the description. Now after the patch, I am not sure the data is even correct.

soft frost
#

Thank you @lyric bough

wanton herald
#

yeah i had a look, did not find any changes either

dense agate
#

I didn't download the data before the patch, so I can not compare at all...lol. All I can tell is that rerunning my notebooks give me worse local CV..

wanton herald
#

i just rerun my current wip notebook and did not see many changes in the score... But my work is chaotic

ornate elbow
#

I remember that the first value in the column datetime in historical weather was 2021-09-01 01:00:00
now it is 2021-09-01 00:00:00, however i am not sure about that

dense agate
#

Based on what I saw, I am expecting a patch on the recent patch shortly... not worth working on the current data. Time to take a break.

glossy bison
#

I see that in the sample submission there are three columns: row_id, data_block_id, and target. But when I look at the discussion "Enefit Basic Submission Demo", there is no data_block_id column. Is there any reason for this? Just trying to understand what my output should look like and how to leverage the API. I am working locally so I'm not sure how to use the API

ornate elbow
#

guys, the lag of the historical target is two days, i.e. its data_block_id should start at value 2 at datetime 2021-09-01 00:00:00
am I right ?

dense agate
#

Then when you have a more specific question, the chance of getting an answer would be greater.

wanton herald
#

I'm really surprised by how hard it is for LGBMRegressor to infer ratios of the target.
For example:

  • using only lagged_target_2days as a feature gives better results than using lagged_target smoothed over a 24hrs period (with hours and ts metadata included as features so that each hour / each ts can have a different ratio).
  • using lagged_target_7days gives better results than using lagged_target_2days + dayofweek, which means the model struggles to infer the weekend ratio from dayofweek

dense agate
wanton herald
#

it's annoying 😅

wanton herald
#

i was trying to do some renormalizing operations to be able to use the target value at D-2 instead of the one at D-7 for the week-periodicity cases, and while I managed to improve significantly the MAE for those cases (going from MAE 168 to 120 with D-2), I cannot beat the D-7 benchmark (MAE 95)

dense agate
#

BTW, I think the current patch is problematic and mishandled the data....

#

you can't trust any result after the patch.

#

It's interesting that nobody has flagged it 4 days after the patch, maybe someone is capitalizing it by not revealing the issue....But that's just my opinion, what do I know.

wanton herald
#

no idea, to be honest I did not look too much into the details yet, there is a lot of other stuff I am trying to figure out before this and that I am struggling with

#

did you manage to retrieve a similar sub by shifting the hours ?

wanton herald
dense agate
errant pagoda
#

Hello @everyone ! I need some help with the submission. I dont understand why its giving me an error.... i looked at the output and has the same rows and cols of the naive submission file... has anyone had the same problem?

dense agate
#

try to isolate the problem, e.g. is the submission in the wrong format or the data processing pipeline throws error

dense agate
#

okay, maybe not your problem. Many other people on the forum and myself are seeing the same issue now

#

ah... Another confusing patch. I want to rant so much... If it were not for these patches, we may be seeing sub 60 scores already.

wanton herald
#

Ah a new patch? 😅

wanton herald
# errant pagoda Hello @everyone ! I need some help with the submission. I dont understand why it...

A few pieces of advice:
To make it work you need two things:

  1. Make sure you make a forecast for each row served by the API. This can be achieved by doing a left join on the given row_id with your prediction and a fillna with something.
  2. Make sure your code doesn't raise an error at execution time because of a forgotten edge case (for example, if you are using a dictionary to look up some historical data, a new key at inference time might raise an error). As a starting point, you can wrap your inference cell in a try: YOUR_CODE except: SUBMIT_DUMMY_FORECAST (for example all 0).

To go further:
The best way to anticipate later issues is to use the training data to replicate the API behavior. This way you can make sure your inference code works on a large amount of new data, and you should spot all kinds of bugs early.
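
A rough sketch of such a replay for the target table only. The real API serves more tables per iteration; the generator `replay_train` and the two-day lag of the revealed targets are assumptions based on what is described in this thread.

```python
import pandas as pd

def replay_train(train: pd.DataFrame, target_lag_days: int = 2):
    """Yield (test, revealed_targets) day by day, mimicking the API on train data."""
    train = train.assign(date=train["datetime"].dt.normalize())
    for day, test in train.groupby("date"):
        revealed_day = day - pd.Timedelta(days=target_lag_days)
        revealed = train.loc[train["date"] == revealed_day, ["row_id", "datetime", "target"]]
        # The rows to predict must not expose the answer.
        yield test.drop(columns=["target", "date"]), revealed

idx = pd.date_range("2023-01-01", periods=24 * 5, freq=pd.Timedelta(hours=1))
train = pd.DataFrame({"row_id": range(len(idx)), "datetime": idx, "target": 1.0})

batches = list(replay_train(train))
for test, revealed in batches:
    # The real inference function would be called here with the same inputs
    # it receives from the competition iterator.
    print(test["datetime"].min(), len(revealed))
```

Running the full inference code over months of replayed batches also gives a local score that can be compared with the leaderboard.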

dense agate
#

And it seems that many people reported (including myself) the submission is broken since the "new patch". Previously successful submissions now throw error. I will stop working on it until an official answer or until someone reported a successful submission

wanton herald
#

understandable

#

i might submit something on my side by the end of the week. I am still far from your crossval MAE of 35, but i'm getting closer (i have MAE 40), with a very simple yet difficult to implement methodology

#

i'm curious to see how much it would score on the LB

dapper wigeon
#

does anyone know how to make the online enefit api return "data_block_id" for test data?

wanton herald
#

you would have to add the column yourself. Each iteration would be a new data_block_id.
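
A sketch with plain DataFrames standing in for the batches of the iterator. The starting value 638 is made up for the example; it would have to match whatever id follows the data already seen.

```python
import pandas as pd

def add_data_block_id(batches, first_id: int):
    """Tag every batch served by the iterator with a running data_block_id."""
    for block_id, test in enumerate(batches, start=first_id):
        yield test.assign(data_block_id=block_id)

batches = [pd.DataFrame({"row_id": [0, 1]}), pd.DataFrame({"row_id": [2, 3]})]
# 638 is a hypothetical starting value, not taken from the competition data.
tagged = list(add_data_block_id(batches, first_id=638))
```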

eager wren
#

Does anyone know the extent of historical weather information the test will be running with?

dense agate
#

Has anyone had a successful submission lately? Kaggle staff have not responded about the submission errors yet. Trying to understand what's happening

wanton herald
#

with the new new update i got my new local CV going from 40 to 37

#

i was not using any offset before, so it looks like this new update fixed indeed the times of the weather

dense agate
#

yep, that's expected

wanton herald
#

did you try to try/except on your sub ?

dense agate
#

I think the patch (when done right) will boost LB score for most people

dense agate
#

but no, it didn't work. Submission fails for me

wanton herald
#

maybe a problem of parsing in a datetime ?

#

or depending how you preprocess the forecast you might be missing a particular hour ?

wanton herald
dense agate
#

OMG... the host says the historical weather is in EET/EEST time...

#

I have a feeling that he is mistaken...

wanton herald
#

try offsets and see which one gives the best score, ultimately that's what I'll be doing I think
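
The offset search can be sketched on synthetic data. Correlation between a radiation column and production is used here as a cheap stand-in for the score; in practice the criterion would more likely be the cross-validated MAE of the model. The function name and the series are made up.

```python
import numpy as np
import pandas as pd

def best_hour_offset(production: pd.Series, radiation: pd.Series, offsets=range(-3, 4)):
    """Return the shift (in hours) of `radiation` that correlates best with `production`."""
    scores = {k: production.corr(radiation.shift(k)) for k in offsets}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
idx = pd.date_range("2023-06-01", periods=24 * 30, freq=pd.Timedelta(hours=1))
# Toy radiation: positive half of a daily sine wave.
radiation = pd.Series(
    np.clip(np.sin(2 * np.pi * (idx.hour.to_numpy() - 6) / 24), 0, None), index=idx
)
# Production follows radiation with a 2-hour timestamp mismatch plus small noise.
production = radiation.shift(2) + rng.normal(0, 0.01, len(idx))

best, scores = best_hour_offset(production, radiation)
print(best)
```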

dense agate
#

Okay, now I am waiting for a new new new patch to fix the submission error. and a new new new new patch to correct the time in historical weather...

wanton herald
#

they should just make a rollback to the original dataset

dense agate
wanton herald
#

many people who were there at the beginning of the comp but stopped looking at it later on might be penalized

#

on the other hand, thats a very good demonstration of a real world usecase where the data specs change every two weeks aha

dense agate
#

btw, if you find a minor issue (like an inaccurate description of the data), are you allowed to keep it a secret and capitalize on it by not flagging the issue?

wanton herald
#

no idea, i guess you could keep it a secret even if its not exactly the philosophy

#

I personally like giving these kinds of hints as I'm aiming for the discussion master rank, and it's better than sharing a notebook that everybody will fork without even looking :p

ornate elbow
#

hello guys
Which option do you believe is better: implementing a model by yourself, or utilizing one of the implementations available in various frameworks?
I understand that there is no definitive answer to such a question, but I am curious about your opinions.

eager wren
#

I understand that in the forecast weather df the cloud cover is in percentage at a certain altitude AND at the end of the hour not beginning, whereas historical weather is the total volume of the area (from a disc. post). Is this correct?

dapper edge
#

It's very rare that you would reinvent the wheel when working on these types of problems for real (except in research maybe). But that's just from my experience

torn musk
#

Hey! I recently joined the competition and was wondering if anyone tried out encoder decoder architecture. Specifically, I want to understand will we able to test using a fixed context window of encoded information before starting prediction?

dense agate
#

This is so frustrating....

I am doing this at the end of my iter loop:

sample_prediction['target'] = 0.0
env.predict(sample_prediction)

But I am still getting the submission scoring error (submission file in the wrong format), not a notebook exception error..

#

btw, does anyone know if the submission run out of memory would return an OOM error or just an submission scoring error?

wanton herald
#

Just submission scoring error

#

I made a sub, it works well on my side

#

Did you try to check the input types of the date columns? In my case I had issues while joining because of a datetime casting

#

If you suspect oom, you should trim your historical data to -X months to limit the size

#

Also use a notebook for training and one for inference

dense agate
#

I tried these two things :

  1. I explicitly convert sample_prediction['target'] to float64. This once fixed my submission scoring error a long time ago, but it has not worked since the latest patch.
  2. I explicitly set sample_prediction['target'] = 0.0 right before env.predict(sample_prediction)
#

with what I did in 2), there should not be any reason for a "submission scoring error"

wanton herald
#

The datetimes have a different casting also

dense agate
#

I would quietly debug or accept failure if the message is "Notebook threw exception"..

wanton herald
#

For each piece of data i do:
df["datetime"] = pd.to_datetime(df["datetime"]).astype(str) to be sure they are in a uniform format

#

And for the date ones i just add .dt.date.astype(str)

#

Add also a fillna(0) at the end just to make sure you are not accidentally sending nan values
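
(A minimal sketch of that casting step, assuming pandas. The helper name normalize_keys and its arguments are made up for illustration and are not part of the competition API. It formats datetimes with an explicit strftime pattern instead of astype(str), so the string layout does not depend on how pandas chooses to print a given column.)

```python
import pandas as pd

def normalize_keys(df, datetime_cols=("datetime",), date_cols=()):
    """Cast join keys to plain strings so every frame uses one format.

    Column names are assumptions; pass the ones your frames really have.
    """
    out = df.copy()
    for col in datetime_cols:
        if col in out.columns:
            # Explicit pattern: same layout whether the input was str or datetime64
            out[col] = pd.to_datetime(out[col]).dt.strftime("%Y-%m-%d %H:%M:%S")
    for col in date_cols:
        if col in out.columns:
            out[col] = pd.to_datetime(out[col]).dt.date.astype(str)
    return out
```

Running every frame (train, weather, prices, and whatever the API hands over) through the same helper before joining should make a silent key mismatch less likely.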

dense agate
#

I am doing sample_prediction['target'] = 0.0, so I am just sending 0s

#

still a submission scoring error

wanton herald
#

You don't have any preprocessing in the inference loop?

dense agate
#

I do. But if that code caused any issue, I would expect a "Notebook threw exception" error, because that code has no effect on the output: I have sample_prediction['target'] = 0.0 right before calling env.predict(sample_prediction)

wanton herald
#

so it might be an OOM ?

#

did you try cutting down your datasets to limit RAM use?

#

the oom can also happen if you are performing merging and forgot to filter some duplicates

#

I had this issue earlier in the comp: while concatenating the historical data with the new data, I was filtering the duplicates badly, resulting in merges with double the number of rows
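
(A rough sketch of a duplicate-safe append, assuming pandas. append_block and key_cols are hypothetical names; which columns uniquely identify a row depends on the table, so treat the key choice as an assumption to check against your own data.)

```python
import pandas as pd

def append_block(history, new_rows, key_cols):
    """Append newly revealed rows, keeping one row per key (latest wins)."""
    combined = pd.concat([history, new_rows], ignore_index=True)
    # Rows that overlap with what is already stored are collapsed here,
    # so later merges do not silently double the row count.
    combined = combined.drop_duplicates(subset=list(key_cols), keep="last")
    return combined.reset_index(drop=True)
```

Comparing len(combined) before and after drop_duplicates is a cheap way to notice when the overlap is bigger than expected.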

dense agate
wanton herald
#

ah I didn't know. If you want I can try to kill one of my subs and generate an OOM to see what the error looks like

dense agate
#

thank you for offering, but that's not necessary.

#

still trying to isolate the problem. but with the quota of 5 submissions/day, that's a lot of opportunity cost.

wanton herald
#

did you try to simulate the API behavior ?

#

def simulate_api(block_id):
    # Rows to predict for this block; copy so the edits below don't touch df
    sub_df = df[df.data_block_id == block_id].copy()
    sub_df["prediction_datetime"] = sub_df["datetime"]
    sub_df = sub_df.drop(["datetime", "target"], axis=1)
    # Revealed targets come from two blocks earlier
    prev_targets = df[df.data_block_id == block_id - 2]
    subclient = clients[clients.data_block_id == block_id]
    subhist = historical_weather[historical_weather.data_block_id == block_id]
    subforecast = forecast_weather[forecast_weather.data_block_id == block_id]
    subgas = gas_price[gas_price.data_block_id == block_id]
    subelec = elec_price[elec_price.data_block_id == block_id]
    sample_pred = sub_df[["row_id"]].copy()
    sample_pred["target"] = 0
    return (sub_df, prev_targets, subclient, subhist, subforecast, subelec, subgas, sample_pred)
I use something like this

dense agate
#

Alright, I just used 1 submission and isolated the problem down to the inference code... I left the training code intact and commented out the inference code, and the error is gone.

#

If I am lucky, I should be able to locate the issue in fewer than 10 submissions. lol...

#

Theoretically, I should be able to identify one line of code out of 1000 lines with 10 submissions. lol...
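
(The 10 comes from bisection: each pass/fail submission halves the suspect range, and log2(1000) is just under 10. A small sketch of the idea. fails_with_prefix is a hypothetical probe standing in for "submit with only the first k lines enabled", and it assumes the failure is deterministic and that every prefix containing the bad line fails.)

```python
import math

def submissions_needed(n_lines):
    """Worst-case number of pass/fail probes to isolate one faulty line."""
    return math.ceil(math.log2(n_lines)) if n_lines > 1 else 0

def bisect_faulty_line(n_lines, fails_with_prefix):
    """Return (index of the faulty line, number of probes used).

    fails_with_prefix(k) should be True if running lines [0, k) fails.
    """
    lo, hi = 0, n_lines  # invariant: prefix of length lo passes, length hi fails
    probes = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes += 1
        if fails_with_prefix(mid):
            hi = mid
        else:
            lo = mid
    return hi - 1, probes
```

With 5 submissions a day that is still two days of quota in the worst case, which is why a local API simulator is worth having first.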

wanton herald
#

good luck, debugging is an art :p

#

1k lines is a lot, there is probably some refactoring to do, no?

dense agate
#

Luckily, I don't have 1000 lines to debug, I was just referring to the theoretical limit.

wanton herald
#

on my side I have a model that performs very well on paper but gives a shitty submission, and I can't find any obvious leaks 😢

noble pagoda
#

I get the error "ModuleNotFoundError: No module named 'enefit.competition'" working with a M1 macbook, I suspect because it can't use the file "competition.cpython-310-x86_64-linux-gnu.so". is there any workaround for this?

wanton herald
#

gosh, after 3 days and almost giving up, I finally found the bug that explains why my score during inference was so bad (MAE of 90 while my crossval is around 37 MAE)

#

I was merging on latitude/longitude in the weather dataset, but because of a rounding error on lat/lon, the new data was not being merged.
And as the data in the API overlaps with the train data, I didn't notice the new data was not correctly added to my historical data

#

that was not introducing any breaking point, just nan values in features I was not able to see, leading to incorrect predictions.
My personal API simulator, based on the training dataset, was not seeing this, because the rounding is correct there.
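
(A sketch of one way to guard against this, assuming pandas and assuming both frames carry latitude and longitude columns. merge_on_coords and the decimals value are illustrative choices, not the fix described above. It rounds both sides the same way and counts the rows that found no partner instead of letting them turn into silent NaN.)

```python
import pandas as pd

def merge_on_coords(weather, stations, decimals=1):
    """Left-merge on rounded coordinates and report unmatched weather rows."""
    w = weather.copy()
    s = stations.copy()
    for frame in (w, s):
        for col in ("latitude", "longitude"):
            # Same dtype and same rounding on both sides of the join
            frame[col] = frame[col].astype("float64").round(decimals)
    merged = w.merge(s, on=["latitude", "longitude"], how="left", indicator=True)
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched
```

Asserting unmatched == 0 (or logging it) inside the inference loop would surface the problem on the first new block rather than on the leaderboard.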

dense agate
#

@wanton herald I should have made that a separate post....

#

exactly what you experienced

wanton herald
#

ah yeah i did not see that one :p

#

the worst part is that I saw there was this discrepancy and I thought I had already handled it

#

anyway, I'm at 71 LB with my original method without any fine-tuning or feature engineering

#

any progress in debugging on your side ?

dense agate
#

yep... figured out what's wrong with the data

#

I am pretty sure there's some minor issue with the forecast or historical weather data (in the public test)

#

Depending on the pipeline, it may have no effect or it may break the code (as in my case)

#

I will hold off on flagging it to the host until I have had time to fix my pipeline.

#

So that I don't put myself at a disadvantage by wasting so much of my time on what is not supposed to be my job...

wanton herald
#

can't you make your pipeline robust to it ?

#

another casting stuff ?

#

to avoid issues, I am now casting all columns in the format i want (string or float correctly rounded) as a preprocessing step

dense agate
#

yes, once the issue is located it is easy to deal with.

#

It's not casting. It's something that's been flagged for the train data and fixed. Apparently not fixed for the test data.

If you don't have submission error, you are probably not affected.

wanton herald
#

probably something I handled also, I used to have a few sub errors as well.

#

anyway now I can finally try to compete with my approach of the problem

#

this competition is extremely hard from a data processing point of view. Definitely not one I would recommend for beginners

dense agate
#

approximately my time spent on: 10% modelling; 20% preparing data; 70% fixing bugs in preparing data

jovial kernel
#

Hey guys is this a multi step forecasting or single step ?

errant pagoda
#

Any guess?

jovial kernel
#

Hello I'm new to this competition and this area and I want to participate in this competition mostly to learn and understand these type of projects . can someone explain the competition in simple terms ? what should we do and what type of task is this and what is the data
and what is the input and the expected output ?

Thanks

flint magnet
#

I am new to this, and when doing my feature engineering it goes over the memory limit (ram) over 30gb. Is it possible to do this in jupyter notebooks outside of kaggle and then import it for submission or would that not work?

formal surge
# flint magnet I am new to this, and when doing my feature engineering it goes over the memory ...

Welcome to the world of data science! It's not uncommon to run into memory issues when working with large datasets. You can increase the memory limit of your Jupyter notebook by following these steps:

  1. Generate a config file using the command jupyter notebook --generate-config.
  2. Open the jupyter_notebook_config.py file located inside the jupyter folder and edit the following property: NotebookApp.max_buffer_size = your desired value.
  3. Remember to remove the # before the property value.
  4. Save and run the Jupyter notebook. It should now be able to utilize the set memory value.

Alternatively, you can run the notebook using the following command: jupyter notebook --NotebookApp.max_buffer_size=your_value.

Regarding your question about importing the feature-engineered data for submission, it is possible to do so. You can save the data as a .csv or .pkl file and then load it into your submission notebook using pandas.read_csv() or pandas.read_pickle(), respectively. This way, you can avoid having to re-run the feature engineering code every time you want to make a submission.

I hope this helps! Let me know if you have any other questions. 😊

Source: Conversation with Bing, 25/12/2023
(1) How to increase Jupyter notebook Memory limit? - Stack Overflow. https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit.
(2) Is there any way to increase memory assigned to jupyter notebook. https://stackoverflow.com/questions/51202801/is-there-any-way-to-increase-memory-assigned-to-jupyter-notebook.
(3) [FIXED] How to increase Jupyter notebook Memory limit?. https://www.pythonfixing.com/2022/03/fixed-how-to-increase-jupyter-notebook.html.

indigo cedar
#

I'm curious whether it's possible, and even advisable, to re-train your model during the public test set and eventually the private test set. I can imagine there is some drift and you want to re-train your model on the new data. This means the data provided needs to be stored/collected along the way. Is this possible?

jovial kernel
#

Hey guys is this a multi step forecasting or single step ?

eager wren
#

@dense agate anyone noticed that the county lat lon data is missing some lat/long pairs in the weather DFs?

indigo cedar
jovial kernel
#

so like this :
The lagged target(and other features) as X1 and the target as Y1
X1p ==> ModelP ==> Y1p(pred) (for production )

#

X1c ==> ModelC ==> Y1c(pred) (for consumption)

#

?

sharp needle
#

Why does my code for predicting in Enefit give an error, even though I am uploading sample_prediction?

soft ingot
#

Hi there, I am participating in this competition. Can anyone here help me out in merging the train and clients mapping datasets? Is there a significance of using data_block_id for merging? Currently I figured out to use date, county, is_business and product_type as keys for merging these two dataframes.

serene silo
#

hi, I am having problems submitting to the API. I have narrowed it down: it fails on this line of code: y_predict = gbm_upload.predict(test_sub_final.values).clip(0). The model was instantiated earlier as follows: gbm_upload = load('xxxxxxgbm_model.pkl'). My notebook runs fine in Kaggle with no errors, and the predictions are generated fine. Any ideas why the submission fails when calling the model? Thanks

serene silo
# errant pagoda Hi 👋! I ve got an format file errot when submitting. The funny thing is that th...

same for me fede, I have raised this to the organizers too but no reply. My notebook runs fine , no errors, no timeouts, and my submission format checked too. I think if the organizers have other versions of packages installed or other dependencies they should tell us otherwise it will continue failing and it does not give same chance to everyone to participate and it is not a fair competition :/ @sinful vine

dapper edge
#

Out of curiosity, why is consumption even part of this challenge? The description only mentions the inaccuracy of energy production, not consumption.

Feels like having consumption targets adds a lot of hours to this competition even though it is a "solved" challenge for Enefit.

dense agate
#

Not sure which part of the description you are referring to, but here's my take. The energy that Enefit needs to produce (P_enefit) needs to align with the net consumption of their prosumers (C_prosumer - P_prosumer)

#

neither C_prosumer nor P_prosumer is "solved". The inaccuracy of energy production probably refers to P_enefit, not P_prosumer

#

If (big if) anything is considered "solved", it's probably the consumption pattern of pure consumers, not the consumption pattern of prosumers. These two are different

dull blaze
#

but we have to predict the amt produced as well as consumed for each prosumer(or active prosumer),right?

ornate elbow
#

guys, does anyone have a clue about the period of the testing data in the API (on which day it starts and ends)?

dapper edge
#

Hmm yeah that makes sense tbf,that it's not necessarily production we care about but the prosumer activity as a whole

timid lodge
#

I have recently joined the competition and can't figure out why do people in kaggle notebooks use this piece of code when joining the clients set to the training set
df_client.with_columns( (pl.col("date") + pl.duration(days=2)).cast(pl.Date) )
Why do they shift it on 2 days?

dense agate
#

because during inference, the client data is available with a 2 day delay
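
(A pandas sketch of the same shift, for anyone not using polars. The join keys county, is_business, product_type and date are the ones mentioned earlier in this channel; shift_client_dates and join_client are hypothetical helper names, and the 2-day default simply mirrors the polars snippet above.)

```python
import pandas as pd

def shift_client_dates(client, days=2):
    """Move each client row to the date on which it can actually be used."""
    out = client.copy()
    out["date"] = pd.to_datetime(out["date"]) + pd.Timedelta(days=days)
    return out

def join_client(train, client, days=2):
    """Left-join train rows to the client snapshot available for that day."""
    keys = ["county", "is_business", "product_type", "date"]
    t = train.copy()
    # Calendar date of each hourly row, as the join key
    t["date"] = pd.to_datetime(t["datetime"]).dt.normalize()
    return t.merge(shift_client_dates(client, days), on=keys, how="left")
```

Shifting at training time means the model only ever sees client features that would also exist at inference time, so train and test features line up.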

cold moat
#

Hello, I just found the discord channel. I am currently in position 27. Now, I am trying different features (like target_diff), the results locally seem to be really good but not in LB.

ornate elbow
cold moat
#

There was a discussion post saying it was around 90 days

#

Last training date is 2023-05-31, so the company could have up to 6 more months

ornate elbow
#

are they from 6/2023 to 9/2023 ?

cold moat
ornate elbow
#

the issue is that the target distribution varies a lot between seasons, so if you test your model in one season you will get quite different results compared to testing it in another season. I found some cases where I get a lower validation loss than the training loss in the first iteration 😂

#

the reason was that i set the test set to be the last 20% of my data

cold moat
#

Yes! In winter, production is easy to guess. Consumption is more difficult to model

ornate elbow
#

I think the reason is that the average of the target decreases significantly, which directly lowers the MAE regardless of your model and data quality

ornate elbow
#

the API and the whole submitting story is harder for me than the problem itself. Anyway, I have a small question: if my code succeeds in preprocessing the example data that is revealed by the API, should I expect the code to work when I submit my notebook, or could there be problems such as discrepancies in column names or data types compared to the example they provided?

cold moat
#

The API also gave me several problems

#

I don't think you should find any column name discrepancies

ornate elbow
#

to predict a point i use the previous 256 points so if i want to predict first point in the test set i will need the last 256 from the training set. my problem is that after testing the model on the first test batch there will be a gap between the training data and the second patch of testing data, this gap is the first patch of testing data so how can i use the first patch of testing data as a context to predict the second patch of testing data?

#

what I mean is that, for example, the testing data that covers the period 6/2023 to 9/2023 should be given as input when we predict the period 10/2023 to 1/2024. Is that the case?

cold moat
#

yes, that's why you need to concat the new data (revealed targets, forecast weather, etc.) to the existing dataframes, so that no nulls appear. So every time you read new data, you store it for the following test days. I don't know if this is what you mean
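
(One possible shape for that storage, as a sketch assuming pandas. RollingStore, time_col and keep_days are made-up names; the 90-day window is only an example of trimming old rows to limit memory, not a recommended value.)

```python
import pandas as pd

class RollingStore:
    """Keeps a trimmed, growing copy of one table across API iterations."""

    def __init__(self, initial, time_col="datetime", keep_days=90):
        self.time_col = time_col
        self.keep = pd.Timedelta(days=keep_days)
        self.data = initial.copy()
        self.data[time_col] = pd.to_datetime(self.data[time_col])

    def update(self, new_rows):
        """Append the rows revealed in this iteration and drop stale ones."""
        new_rows = new_rows.copy()
        new_rows[self.time_col] = pd.to_datetime(new_rows[self.time_col])
        data = pd.concat([self.data, new_rows], ignore_index=True)
        cutoff = data[self.time_col].max() - self.keep
        self.data = data[data[self.time_col] >= cutoff].reset_index(drop=True)
        return self.data
```

You would create one store per table before the loop (seeded with the tail of the training data) and call update with the matching frame in each iteration, so lag features always have their context.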

ornate elbow
#

yes this what i meant

#

but my question is how to store it.

cold moat
#

i am using the code someone made with polars and this problem is handled.

ornate elbow
#

i guess that the case is as follows for the api
version1: patch1
version2: patch1 + patch2
version3: patch1 + patch2+patch3
and so on.
Then, the column currently_scored guide you to know which rows are already predicted and can be used as input and which rows are new and should be predicted in the submitted file

#

can someone please confirm if this is the case or not

mighty zenith
ornate elbow
#

if I got you right, then the answer to your question lies in the column data_block_id, which exists in every data frame they provided

#

by checking the data_block_id for every file you will be able to figure out the delay of each feature

mighty zenith
#

More on that, in the discussion he specifies that he is looking at the availability of the data at 11 am, is it the time at which the prediction will be made ? In fact, if we predict after 2 pm, we then have more info on the next day

ornate elbow
#

i think the model's purpose is to predict energy consumption/production for the next day at 11 am each day, so it is not up to you to decide when to make the predictions

ornate elbow
dense agate
# ornate elbow i guess that the case is as follows for the api version1: patch1 version2: patc...

"the column currently_scored guide you to know which rows are already predicted and can be used as input"
I think you get the idea, but I think your statement is not quite precise.

currently_scored=False doesn't necessarily mean the row is already "predicted". It just lets you know which rows are being scored. From what we know now, June, July and August of 2023 are being used for the public test LB, and rows in that period will give you currently_scored=True during the public test runs. But later during the **private** test, after the submission deadline, rows in that period will give you currently_scored=False.
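
(A sketch of how a loop body might use the flag, assuming the test frame carries a boolean currently_scored column and that test and sample_prediction have their rows in the same order. fill_predictions and predict_fn are hypothetical names. Unscored rows still get a numeric placeholder, since a prediction frame is sent in every iteration either way.)

```python
import pandas as pd

def fill_predictions(test, sample_prediction, predict_fn):
    """Run the model only on scored rows; send 0.0 for the rest."""
    out = sample_prediction.copy()
    out["target"] = 0.0
    scored = test["currently_scored"].to_numpy(dtype=bool)
    if scored.any():
        # predict_fn is a stand-in for your own model call
        out.loc[scored, "target"] = predict_fn(test[scored])
    return out
```

Skipping the model on unscored rows mainly saves run time; the stored history should still be updated with those rows so later blocks have their context.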

ornate elbow
#

Thanks for the clarification. Maybe it would be more accurate to say that these are the rows you should make predictions for, because for now the loss will be calculated on these rows.

#

but my main concern is whether the API is going to provide the data for the period between the end of the training data (31/5/2023) and the rows that you should predict now

#

i.e. is the case something like this
version1: patch1
version2: patch1 + patch2
version3: patch1 + patch2+patch3

dense agate
#

yes, the data will be provided in continuity

mighty zenith
#

Hello, when trying to submit my notebook in kaggle, I'm struggling to overwrite the submission.csv file by doing output.to_csv('/kaggle/working/submission.csv', index=False), An error is raised saying : PermissionError: [Errno 1] Operation not permitted: '/kaggle/working/submission.csv'. Is there a trick to know how to submit ?

mighty zenith
#

It seems that the for loop over iter_test does the job, but why does it iterate 4 times, and why is sample_prediction split in 4?

tiny light
#

has anyone tried nn's till now?
wanna know how they are working xD
trying some nn's rn

dapper edge
#

for sure someone has tried lstm

mighty zenith
#

Are people still struggling with the submission scoring error? I have no NaN, all the entries are float, and the score computed locally is 80. What happens during the scoring process that could fail?
counter = 0
for (test, revealed_targets, client, historical_weather, forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:
    value = prediction(test, revealed_targets, client, historical_weather, forecast_weather, electricity_prices, gas_prices, sample_prediction)
    sample_prediction['target'] = np.array(value["target"])
    env.predict(sample_prediction)
    counter += 1
The function prediction returns exactly the same dataframe that is found in submission.csv (column- and line-wise)

marble trail
#

Hi! It seems to me that the forecast date of the electricity price data we receive from the API is always behind a day compared to the other dataframes. Does anyone know why?

#

so for example the prediction_datetime column of the test dataframe is 2023.05.28 but in the same iteration the forecast date of the electricity price is 2023.05.27

mighty zenith
marble trail
#

Thank you! I have already read this and it still seems to me that the electricity price and the gas price returned by the API are wrong. 😅 If the prediction date is 2023.05.28, then the forecast date for the electricity and gas should be 2023.05.28 as well and not one day earlier

ornate elbow
#

the data will not be available at the time when you make the predictions; this is why the forecast date does not match the date for which you make the predictions

#

you can look at the column data_block_id, which you can find in the training files, to understand when each feature becomes available and with how much delay

marble trail
#

ah okay. I got it! Thank you!:)

ornate elbow
#

you're welcome 🥰

ornate elbow
mighty zenith
#

The solution seems to be : np.clip(...,0,np.inf)

#

Do not ask me why, some magics from the authors
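
(A small sketch that bundles the clip with the NaN and float64 advice from earlier in the channel, assuming NumPy. sanitize_target is a made-up helper, and mapping infinities to 0.0 is my own choice for the example rather than anything the hosts documented.)

```python
import numpy as np

def sanitize_target(values):
    """Return float64 predictions with no NaN/inf and no negative values."""
    arr = np.asarray(values, dtype="float64")
    # Replace NaN and infinities before clipping so nothing non-finite is sent
    arr = np.nan_to_num(arr, nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(arr, 0.0, np.inf)
```

In the loop this would sit right before the env.predict call, e.g. sample_prediction['target'] = sanitize_target(value["target"]).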

indigo summit
#

ModuleNotFoundError: No module named 'holidays'
Yesterday every module was fine, but today when I run my script this shows up. Can anyone help? I cannot use the polars module either

ornate elbow
#

the loss is now calculated on new data, which is not the same data as a week ago, am I right?

marble trail
#

I read somewhere that the hidden test set grows over the three months, but they didn't really specify. I guess there are more points in the hidden test set now

#

By the way, does the scoring sometimes take a lot of time? It has been scoring for the past hour for me, which is weird because it was faster in the morning

#

ah okay, the greater the MAE, the longer it takes for the scoring to run haha

ornate elbow
#

no, actually the hidden dataset has been updated

#

they added new test data and i think they also set the old hidden test data to have currently_scored == false

marble trail
#

so you don't need to score on those rows which have the currently_scored == false? because it is not relevant anymore?

ornate elbow
#

I can't really tell, each time I read what they said I understand something different

#

However I think that they added extra data for the period after the original data and the column currently scored for this new data is set == false so you have to consider this to avoid scoring error

#

And yet people submitted without accounting for this new data and got successful submissions

marble trail
#
rustic osprey
#

did u clip sample prediction's target?

mighty zenith
#

Yes

rustic osprey
#

didnt work for me tho

#

i think it is related to the hidden dataset update

#

i guess i should sacrifice my daily submission to debug

ornate elbow
#

guys, am I the only one who is facing this problem, or has the prediction_datetime column become like this in the new data generated by the API?

rustic osprey
# ornate elbow

in the kaggle notebook options, change the environment to "pin to original environment"

ornate elbow
#

I think it is a problem caused by the most recent env in kaggle notebooks, along with other problems I noticed that I think are also caused by the new env

#

i.e. if you make a new notebook and change the environment to "pin to original environment", this will not solve the problem

dusk gorge
#

Hi, I joined the competition a week ago. It is my first competition. Yesterday, when ready to complete my first submission, I realized that it will take some debugging (and submissions) to successfully submit. Being new to Kaggle and joining late in a competition with a complicated to debug "Submission Scoring Error" issue is not optimal. Something seems to happen when the hidden stuff is executed when running env.predict(sample_prediction) in the iter_test loop. I think it is unfortunate, that there is a 5-a-day limit for first time successful submissions late in the competition, something for you to consider. Thx.

turbid grove
dapper edge
#

Is the competition still going? I thought it was supposed to end in February

rustic osprey
#

submissions are closed and the private LB is ongoing

marble trail
#

when can we see the scores of the private leaderboard?

marble trail
#

Also, can we still modify our models or is it only for testing the existing one on a new dataset?

ornate elbow
#

For those who have been on kaggle for a long time, can we expect to see a new time series competition any time soon?

slow wigeon
#

I made a summary video of this competition. Let me know if you find it at all useful... https://www.youtube.com/watch?v=rR9i9tO4BIQ

Kaggle's Enfit competition review!

The goal of this competition was to predict energy produced and consumed by customers of a power company who have installed solar panels. Here you'll learn both fundamental and state of the art techniques for building a machine learning model for a time series problem.

Chapters:
0:00 Welcome to my channel!
...
