#jane-street-real-time-market-data-forecasting
What is the format for the submission? Does my notebook need to have an output file?
how to submit and what exactly to submit
Per the competition rules, "You must submit to this competition using the provided evaluation API, which ensures that models do not peek forward in time." See: https://www.kaggle.com/code/ryanholbrook/jane-street-rmf-demo-submission
@steel trench
Hello
hi
Hi Everyone. I am new to market data forecasting. I was just reading the problem statement and have a question on terminology. What do we mean by responders? I couldn't find the exact definition anywhere.
@ember meteor Hi! "responder" or more specifically "responder_6" is meant to be your target or the outcome. They don't want to expose any sensitive info so they use the term responder. Lets say you take a buy trade at 7:35 AM. "responder" would be the value at 7:36 AM. That's just an example. Your goal is to predict that value at 7:36 AM.
Thanks a lot for the explanation. So if I got this right, each record in the training dataset is one trading opportunity. The 79 features are time series values of a particular scrip. The 9 responders are the future values of that same time series when extended. Among them, responder_6 is what we are predicting.
I have not dug into the data well enough yet but that sounds right. the features are probably values of different indicators. ex: RSI, MOVING AVERAGE, OTHER SECURITIES, ETC..
Hello everyone this is my first Kaggle challenge, Is there an environment file we can use to set up everything that's needed for the APIs and inference servers ?
This leads me to a question: if we don't know the nature of those features, how can we do feature engineering? E.g. if I'm not sure whether a feature is already a moving average, how can I know whether I should make a moving average out of it or not?
Our task is just to find and use useful relationships between the features that they provided us if there is any. I also prefer to have actual meaningful categories names so we can play with but if we look at this purely from a time series analysis point of view, then we can think more clearly.
Lets say that we know one of them is a moving average, we might be tempted to try every moving average under the sun. We can do our own feature engineering since we know its market. The way they presented it leaves things more focused in my opinion. Puts everyone on the same playing field.
The kaggle page provides a notebook to work from. We should stick to using a Kaggle notebook for all of our work.
Oh sorry, I meant an environment file for dependencies, versions of packages etc... that are needed for said notebook
ahhh
So what you are saying is that I should anyway try the features transformations I believe would add useful signals to the data
I'm a beginner in machine learning and this my first time to deal with anonymous features so I might fail to make the most intuitive conclusion about the problem we have in this competition
The way I like to approach is use what is given, then manipulate/add/remove as needed to get better scores.
I'm pretty new too so no worries. We are all here to learn and gain experience. I only feel slightly more confident because of my experience with a few recent projects and my love for trading the market.
What about missing values NaNs I'm quite sure you didn't drop these rows but only want to confirm
I'm surprised that no one in the discussion talked about it in detail although there's significant number of NaNs in this dataset
I have the impression that the fact that a value is missing in this dataset holds a useful signal and that the testing data will include many NaNs as well so I think imputing or dropping them isn't smart
Handling missing or NaN values for time series can be a bit of an art.
Is that row important?
Can I use the row before it in my model to predict the next row?
All valid questions. In the case of returns, I personally would most likely impute in some way. That may or may not be the best approach in this case but it's up to the programmer 💪.
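One way to act on the idea that missingness itself may carry signal is to keep an explicit indicator next to each imputed feature. A minimal sketch, assuming rows are plain dicts and that `None` or NaN marks a missing value; the `add_null_flags` helper and the `_isnull` suffix are made up for illustration:

```python
import math

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def add_null_flags(row, feature_names, fill_value=0.0):
    """Return a copy of `row` with missing features imputed and flagged.

    For each feature, `<name>_isnull` is 1 if the original value was missing,
    so the model can still see where the gaps were after imputation.
    """
    out = dict(row)
    for name in feature_names:
        missing = is_missing(row.get(name))
        out[f"{name}_isnull"] = 1 if missing else 0
        if missing:
            out[name] = fill_value
    return out

row = {"feature_00": float("nan"), "feature_01": 0.25}
flagged = add_null_flags(row, ["feature_00", "feature_01"])
```

Whether the flags actually help is something to check against validation scores; tree models such as LightGBM can also consume NaNs directly.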
Hey, just starting in this comp so apologies if this is a silly question.
In the test set are we provided lags for each date_id, time_id pair? Or only for the first one?
So in the current example, test and lags have the same number of rows. But if only the first time_id for each date is given for the lags, test.csv might have more rows than lags - right?
There are no stupid question here. I haven't had the chance to dig into the data well enough yet. I'm sorry I'm not sure.
what are tags?
Sorry, new to these type of competitions. I see that there are many submission submitted already. Are teams meant to be making new submissions throughout the project until the deadline ?
During the forecasting phase, is our model forced to be fixed? Can we train dynamically when new information flows in, or feed this new information into our prediction function in anyway via not a serious "training"? In other words, can we use the information of sequentially earlier part of the test set when predicting the sequentially later part of the test set?
Hello there! "weight - The weighting used for calculating the scoring function" what's the scoring function here?
I think it's in the overview -> evaluation
thanks!
im a little confused about what some of the files are for. are the responders.csv and features.csv supposed to be used? or just to look at?
I believe they're just to look at
guys im new i want to try a constant submission just to test but i cant import the module how can i do
Has anyone tried PySpark for submission. Is it even possible? I am new to Pyspark so was thinking of applying it in this contest.
same here.
any updates how to install kaggle_evaluation
FYI I posted this to kaggle as well
https://www.kaggle.com/competitions/jane-street-real-time-market-data-forecasting/discussion/541541
My notebook is failing scoring, but nothing appears in the logs. My guess is that this is due the inference server being in a different thread. Is there an example of how to capture those logs, so I can figure out the source of the failure?
Not sure, but I don't think that's the correct way to import the "kaggle evaluation" module.
Hello, what is the best practice regarding installing packages in kaggle notebooks? I see that there's a default environment with some reasonable package versions (i.e. lightgbm 4.2.0). I tried updating it to 4.5.0 by connecting the notebook to the internet, but it turns out we cannot submit notebooks with connection to the internet.
Curious to hear if there are Kaggle ways to deal with this?
So, in order to generate/create features based on the responders as example, will they be available at each date_id/time_id ? all of them not just responder_6 ? Same goes for features, for each date_id/time_id will they be available as well ?
what is the reason for the Nulls in the data? a decent chunk of some of the features are nulls
Hello I am trying to make submission through API, I tried setting os.environ['KAGGLE_IS_COMPETITION_RERUN'] = "True" and then run the Serve function and the server kept running with no results so far, Can anyone please help me?
yes. you can upload the package as a Kaggle Dataset and then import the dataset and pip install it from the file. this is an example https://www.kaggle.com/datasets/marketneutral/cvxpy-python-package
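The dataset approach above can be sketched as follows. This assumes the wheel files have been uploaded as a Kaggle Dataset and are mounted under a path like `/kaggle/input/my-wheels` (a hypothetical path); `--no-index` and `--find-links` are standard pip options that keep the install fully offline:

```python
import subprocess
import sys

def offline_pip_command(package, wheel_dir):
    """Build a pip command that installs only from local wheel files."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI (internet is off)
        f"--find-links={wheel_dir}",  # look for wheels in the attached dataset
        package,
    ]

def install_offline(package, wheel_dir):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.run(offline_pip_command(package, wheel_dir), check=True)

# Hypothetical dataset path; replace with wherever your wheels are mounted.
cmd = offline_pip_command("lightgbm==4.5.0", "/kaggle/input/my-wheels")
```

Calling `install_offline(...)` at the top of the notebook, before importing the package, would then run the install without internet access.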
features.csv may be used to aid in feature selection (tags may indicate features belonging to the same style or group). multiple responders may be used as additional targets to help regularize the model
this is not correct. the inference server gives you the t-1 responder values. you therefore can train an online model that includes test data
Yeah, I am still confused about that point. It seems that the current "submit a single prediction function" lacks the ability to do any inferences a posteriori or in general, to use any information from the sequentially last or previous datapoints.
Can Anyone please tell me what should I do
Hello EveryOne, Can anyone verify if we need to set KAGGLE_IS_COMPETITION_RERUN env variable before running the server and what how much time does it usually take to complete a submission?
as of my understanding of the data, we have timeseries data for training set, but only the responders (0,[...],8) from T-1 and the features (0, [...],78) from T for the prediction right? so for the prediction we do not have historical data available?
I haven't worked on this part of the problem and won't for a while. But you can cache the test data up to t-1, and on t at time 0 you are given all the historical responders. Hence you could, say, retrain every 100 test days or something like that.
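The caching idea could look roughly like this. It is only a sketch under stated assumptions: rows are represented as plain Python values rather than Polars frames, the `DayBuffer` class is hypothetical, and the retrain interval of 100 days is just the figure floated above:

```python
class DayBuffer:
    """Pairs cached features with responders that arrive one day late."""

    def __init__(self, retrain_every=100):
        self.retrain_every = retrain_every
        self.pending = {}    # (date_id, time_id, symbol_id) -> features
        self.labelled = []   # (features, responder_6) pairs ready for training
        self.days_labelled = 0

    def add_features(self, date_id, time_id, symbol_id, features):
        """Call for every test row as it is served."""
        self.pending[(date_id, time_id, symbol_id)] = features

    def add_lagged_responders(self, prev_date_id, responders):
        """Call once per day, when the previous day's responders are served.

        `responders` maps (time_id, symbol_id) -> responder_6 for prev_date_id.
        Returns True when a retrain is due.
        """
        for (time_id, symbol_id), target in responders.items():
            features = self.pending.pop((prev_date_id, time_id, symbol_id), None)
            if features is not None:
                self.labelled.append((features, target))
        self.days_labelled += 1
        return self.days_labelled % self.retrain_every == 0

buf = DayBuffer(retrain_every=2)
buf.add_features(0, 0, 1, [0.1, 0.2])
due_day_1 = buf.add_lagged_responders(0, {(0, 1): 0.5})
buf.add_features(1, 0, 1, [0.3, 0.4])
due_day_2 = buf.add_lagged_responders(1, {(0, 1): -0.2})
```

Any retraining triggered this way still has to fit inside the per-batch time limit and the notebook's memory, so the buffer would probably need a cap in practice.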
since the data are anonymized, we don't know.
Yeah I actually feel the same. Let me know if it's otherwise
i am facing memory issue while running notebook in kaggle. what's the solution? should i use less partition from train data rather than using all 10, will that impact model's accuracy?
I am an SJTU master's student and a Xiamen University graduate, and I am looking for a Chinese team.
Have you tried this - https://www.kaggle.com/code/yuanzhezhou/jane-street-baseline-lgb-xgb-and-catboost
For every time_id per date_id, will the features be available same goes for responders ? My point is if I want to generate features like rolling mean, lags as well (ignore the files for now)
hi i want to know how long will the calculate takes after a submission? i have waited for 20 mins
forget it. done in 24mins
try running it per partition i guess
This is my first competition. I have made 8 submissions but can't get above a score of 0.0043. I have tried lightgbm and xgb. Should I focus on improving my existing submission or try a different model? Any suggestions?
Hello, I am trying to submit a simple dummy submission just to get a good handle on the submission format. I am getting "Notebook threw Exception" despite passing the assertions in the given "predict" function. All I did was copy the example notebook and edit the predict response, it doesnt even use a model. Any help with navigating the API and submission format would be appreciated.
Example code without any model would be appreciated just to understand submission better
As far as I can tell the demo submission doesn't actually work as is??
When I submit my notebook to the competition it fails again and again, saying "Notebook Inference server error".
not able to submit.. tried so many times..
hey, I've got past the memory issue when loading the data, however this exact notebook seems to fail on the training phase (hitting the RAM limit), specifically on train(model_dict, 'lgb'), did you solve somehow this issue if you had it?
Using all my daily submissions just to debug a submission that isn't even a model
Train on less partition
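A small sketch of picking a subset of partitions to train on. The directory layout (`partition_id=N/part-0.parquet`), the count of 10 partitions, and the idea that higher partition ids are more recent are assumptions based on messages in this thread, not something confirmed here; `recent_partition_paths` is a made-up helper:

```python
from pathlib import PurePosixPath

def recent_partition_paths(root, n_partitions=10, keep_last=3):
    """Return file paths for the last `keep_last` partitions under `root`."""
    first = max(0, n_partitions - keep_last)
    return [
        str(PurePosixPath(root) / f"partition_id={i}" / "part-0.parquet")
        for i in range(first, n_partitions)
    ]

# Hypothetical root directory; adjust to where the training data is mounted.
paths = recent_partition_paths("/kaggle/input/js/train.parquet", keep_last=2)
```

The resulting paths can be read one at a time, or handed to a lazy reader so the full dataset never has to sit in RAM at once. Using fewer partitions may cost some accuracy, which is worth checking on a held-out slice.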
still haven't resolved the issue of inference error... if anyone faced the same. please suggest how to solve
I am facing the same issue here
Do you think it could be an error with the API?
or the environment where the notebook is run
I've used 3 of my 5 submissions trying to debug this error..
i don't think it's an error with api. cause my submissions didn't give any error. now have made some changes and trying to submit but i can't.
Then very likely to do with the environment
I've tried to submit the notebook versions that were submitted and scored last week without any issue, and the notebook inference error keeps popping up.... 🥹
i tried the same thing... i tried to submit notebook which where accepted last week. and that got accepted again. but new one with complex code and higher run time is not being accepted. showing inference error
Hello, I was hoping someone can explain to me the exact way in which the API serves up the test data. We get batches which correspond to specific time_id's, but are consecutive batches also consecutive time_id's? That is, would it be similar to looping over date_id and time_id in the training dataset?
I saw this post up above which seems to imply something of that nature but I'm not entirely sure
I put this inside the "predict" function
global date_id
global time_id
global switched_date
global dates_without_time
if date_id is None:
    date_id = test.select("date_id").max().item()
    time_id = test.select("time_id").max().item()
if date_id != test.select("date_id").max().item():
    switched_date += 1
    date_id = test.select("date_id").max().item()
    time_id = test.select("time_id").max().item()
else:
    assert test.select("time_id").max().item() == time_id + 1
    switched_date = 0
if switched_date > 1:
    dates_without_time += 1
if dates_without_time > 10:
    raise ValueError("Too many dates without time.")
The submission says that the notebook threw an exception. So I guess that means that, even if two batches have the same date_id, two consecutive batches do not generally have the consecutive time_id's. Do correct me if I'm wrong, the notebook may have thrown an exception for some other reason
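As I read it, the snippet above may trip its own assert on the very first batch: once `date_id` and `time_id` are initialised from the batch, the `else` branch compares the batch's `time_id` with `time_id + 1`. A gentler way to probe the serving order is to record transitions and print anomalies instead of raising. A minimal sketch, where `BatchOrderLog` is a hypothetical helper and the ids are passed in as plain integers:

```python
class BatchOrderLog:
    """Records how (date_id, time_id) changes between predict calls."""

    def __init__(self):
        self.prev = None
        self.counts = {"first": 0, "next_time": 0, "new_date": 0, "other": 0}

    def observe(self, date_id, time_id):
        """Classify the step from the previous batch to this one."""
        if self.prev is None:
            kind = "first"
        else:
            prev_date, prev_time = self.prev
            if date_id == prev_date and time_id == prev_time + 1:
                kind = "next_time"
            elif date_id != prev_date:
                kind = "new_date"
            else:
                kind = "other"
                # Printed rather than raised, so the run keeps going.
                print(f"unexpected step: {self.prev} -> {(date_id, time_id)}")
        self.counts[kind] += 1
        self.prev = (date_id, time_id)
        return kind

log = BatchOrderLog()
kinds = [log.observe(d, t) for d, t in [(0, 0), (0, 1), (1, 0), (1, 2)]]
```

Printing `log.counts` every so often from inside `predict` would show whether consecutive batches really advance `time_id` by one, without spending a submission on a hard failure.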
I am guessing that there is something wrong with the scoring environment, as all the submissions are failing afaik
@bitter tapir you can read it above
I also saw that which is why I'm not confident about what caused the error - so if somebody has an insight into the time series that is served up by the API, please do share
I can't try it yet but I think I know why it's breaking
Ran out of submissions need to wait an hourish
Bc it does work on the example notebook with no changes
you can print stuff in notebook and look at logs even if it fails
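Building on the tip about printing to the logs, one option is to wrap `predict` so that any exception prints its full traceback before propagating. This is a sketch only; `log_exceptions` is a made-up decorator, and whether the output is visible for a given failed run is not something this thread confirms:

```python
import functools
import traceback

def log_exceptions(fn):
    """Print the full traceback of any exception, then re-raise it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            print("predict failed:")
            print(traceback.format_exc(), flush=True)
            raise
    return wrapper

@log_exceptions
def predict(test, lags):
    # Placeholder body standing in for real inference code.
    raise ValueError("example failure")
```

Re-raising keeps the failure visible to the scoring system, while the printed traceback gives you a line number to look at.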
Oh? Good to know, I'm new to this
yeah trying to figure out just from if it throws error sounds fucked, I thought it was like that at first too lol
Nvmd idk it didnt work, shit just seems fukt i cant get anything that isnt the literal copy of the example to work
Bro this inference error coming because there is 15 min time limit for inference to start. our code is taking longer than 15 min to execute..
model training should happen in less than 15 min and inference should start.
Read thing it says can use first predict call
I think I've managed to overcome the issue
You have 90 seconds to finish the first predict call, if not, the notebook will launch an inference error. So all the time consuming operations (e.g.: loading the model/s) need to happen before the first call
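To see how close each call gets to the limits, `predict` can be timed from the inside. A rough sketch: the 90-second figure is the one quoted above, the 60-second figure comes from the "within 1 minute" comment in the demo notebook, and neither is verified here; `PredictTimer` is a hypothetical helper:

```python
import time

class PredictTimer:
    """Warns when a predict call uses more than half of its time budget."""

    def __init__(self, first_budget=90.0, batch_budget=60.0,
                 clock=time.perf_counter):
        self.first_budget = first_budget
        self.batch_budget = batch_budget
        self.clock = clock
        self.calls = 0
        self.slow_calls = []  # (call index, elapsed seconds)

    def timed(self, fn):
        """Wrap `fn` so every call is measured against its budget."""
        def wrapper(*args, **kwargs):
            start = self.clock()
            result = fn(*args, **kwargs)
            elapsed = self.clock() - start
            budget = self.first_budget if self.calls == 0 else self.batch_budget
            if elapsed > 0.5 * budget:
                self.slow_calls.append((self.calls, elapsed))
                print(f"call {self.calls} took {elapsed:.1f}s of {budget:.0f}s")
            self.calls += 1
            return result
        return wrapper
```

Model loading would still happen at module level, before the inference server starts; the timer only confirms that nothing expensive has crept into `predict` itself.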
That solved the issue for me 🥹
Hello. On the "Overview" page, under "Code Requirements -> Training Phase", there is a point that states: "Your notebook must use time-series module to make predictions". Could somebody clarify what this means?
I have a question .what is the relation between row_id ,date_id ,time_id and symbol_id.I see symbol_id from 0 to 38 ,and than time_id plus 1,and than symbol_id from 0 to38 again.
I was wondering about this too. Any help would be appreciated!
Anyone looking to collab?
life would be easy if responders are updated intra-day
very true
i had a model that gave an r squared of 0.8 using the previous responder
but then i realized they only update every day
😢
Hello, i'm new here, i got one team
Hi, what is the team size allowed for this competition?
Hi there anyone knows which one is better for gradient boosting?
Hi all, I am looking for a team. I am based in New York City
Getting a submission scoring error after submitting, but the output is the same as my earlier notebooks' outputs, which got accepted.
this happened to me as well a couple times yesterday
Nice to know, been trying to figure this out forever. Can't run the submission notebooks without receiving errors. So there is no "responder_6" field provided, except for previous days?
i believe the test.parquet file is an example of what the test dataframes that are given to the predict method look like
test.parquet has the columns, "row_id", "date_id", "time_id", "symbol_id", "weight", "is_scored" and features 00-78
hi
Hi there
I need help in kaggle jane street as there are so many partition_id which one i should chose to predict responder 6.Thanks
Please help
Does anyone know that in the prediction period, we can have the responder for last time_id?
the responders are given through the lag dataframe in the predict function at the first time_id and it contains all of the responders for all of the time_ids of the previous day
Thanks for sharing! Yes this is very helpful, but also wondering whether we can get the previous time_id's data. For example, for the day N+1 prediction, when time_id = 1, can we get the info for date_id = N+1, time_id = 0? This may be helpful.
Every partition_id are corresponding to part of historical data and I think they (may) are all useful for future prediction
Finally have some time to look at the challenge again - basic question here about catching exception. Any ideas?
In that notebook see right section is it showing Jane street competition if not then that's the issue recreate notebook and copy paste from this to there and rerun the notebook or u might have disabled internet
Create separate notebook for submission only, upload ur models and scaler there and try again, if it is code competition then you need to optimize it
I'm getting negative score on leaderboard
@errant sierra That means your forecast didn't beat the baseline, 0 forecast.
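To make the "worse than a zero forecast" point concrete: assuming the metric on the Evaluation page is a sample-weighted zero-mean R², i.e. 1 - Σ w(y - ŷ)² / Σ w y², a prediction of all zeros scores exactly 0 and anything with larger weighted squared error goes negative. A small sketch with made-up numbers:

```python
def weighted_zero_mean_r2(y_true, y_pred, weights):
    """1 - sum(w * (y - p)^2) / sum(w * y^2), with no mean subtraction."""
    num = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights))
    den = sum(w * y ** 2 for y, w in zip(y_true, weights))
    return 1.0 - num / den

y = [1.0, -1.0, 2.0]
w = [1.0, 1.0, 1.0]

score_zero = weighted_zero_mean_r2(y, [0.0, 0.0, 0.0], w)      # baseline
score_perfect = weighted_zero_mean_r2(y, y, w)                 # exact match
score_flipped = weighted_zero_mean_r2(y, [-1.0, 1.0, -2.0], w)  # wrong sign
```

Here the zero forecast scores 0, the perfect forecast scores 1, and the sign-flipped forecast scores -3, which is why overconfident predictions can end up below the baseline.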
Ohh I see
will there be new unseen symbol id shown in the testing stages?
Is it possible to submit locally rather than using kaggle notebooks? I find them rather cumbersome to use.
Check out the Kaggle API, you can work locally and push your changes to Kaggle.
hello, I cannot submit either choosing notebook and file upload. I already finished the verification though
Hello, Could we re-train our model during Evaluation time???
Ah great! Thanks.
Did anyone fix the Notebook Inference server error? I load a pretrained model outside of predict that seems to run quickly (18 µs with the local test.parquet) but getting the server error when I try to submit
can someone help me understand the data a bit more please by replying to my message. Ive loaded all the data into a pandas data frame and im trying to plot responder_6. Now i cant do this nicely at the minute because of the duplicate date_id. I tried to fix it to one symbol_id to fix the issue but theres still duplicate date_id. Can someone just help me with my understanding of the data please. (this is my first data science project and competition). My thought was that the symbol_id represented a stock ticker lets say and responder_6 was the price lets say, and i thought by fixing the symbol there would be no duplicate date.
i think you are getting multiple rows with the same date_id because there is also a time_id variable that represents around 900 time steps each day for each symbol
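For plotting, that means the x-axis needs both ids: filter to one `symbol_id`, sort by `(date_id, time_id)`, and use the running position as the time index. A minimal sketch with rows as plain dicts; `single_symbol_series` is a made-up helper:

```python
def single_symbol_series(rows, symbol_id, value_key="responder_6"):
    """Return (step, value) pairs for one symbol in chronological order."""
    selected = [r for r in rows if r["symbol_id"] == symbol_id]
    selected.sort(key=lambda r: (r["date_id"], r["time_id"]))
    return [(step, r[value_key]) for step, r in enumerate(selected)]

rows = [
    {"date_id": 1, "time_id": 0, "symbol_id": 7, "responder_6": 0.3},
    {"date_id": 0, "time_id": 1, "symbol_id": 7, "responder_6": 0.2},
    {"date_id": 0, "time_id": 0, "symbol_id": 7, "responder_6": 0.1},
    {"date_id": 0, "time_id": 0, "symbol_id": 8, "responder_6": 9.9},
]
series = single_symbol_series(rows, symbol_id=7)
```

With a dataframe library the same thing is a filter followed by a sort on the two id columns.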
hey how do you know if you've submitted succesfully, I run the code with a saved model on the inference server and get no errors, but can't see any submissions on my end. Thanks so much!
I believe we can
, haven't explored it. The lag data frame shall have target as well.
Only for the previous date_id. So you don't have a target variable to train on, during inference
Can't we just save previous date_id dataframes and calibrate?
No, there are no intraday responders provided
Hi, everyone. I'm a Kaggle novice looking for a team member. I have a deep understanding of RL and causal AI, and I have experience with multi-agent systems, high-frequency trading, and market-neutral strategies. I'm planning to collaborate on the Jane Street competition and can currently put in more than 5 hours on Kaggle competitions.
If you are at Kaggle Expert level or above, that's better, but it's not a necessity.
I'm also planning to do AI research in the near future and hope for a long collaboration.
If you are passionate about AI, please hit me up.
Hey Owen, as advice, I suggest you plot responder_6 for only one symbol and only one file (for example train/partition_id=0/part-0). Just to give you a taste, here is what the distribution of responder_6 looks like
Hey im wondering if anyone can help me in making my first submission. I think im setting up the inference server correctly, and i believe my notebook is in the competition enviroment so im not sure what im not doing.
Hey all. I have a question regarding the datasets. Not quite sure that does the feature.csv show. Can anyone explain me please
Does anyone have issues with installing packages through the Kaggle Package Manger in the submission? My submission aborts with a Server Inference Error after <60 seconds
Looks like someone had a similar issue: https://www.kaggle.com/competitions/jane-street-real-time-market-data-forecasting/discussion/543875#3039358
Submission looks fine when I run pip install and also when I use the import statement, but fails when I use the library
Hi all, what is the lags.parquet file and how should I use it in my model prediction? Should I do something with this data or just pass it as it is in this method?:
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    global lags_
    if lags is not None:
        lags_ = lags
    predictions = test.select(
        'row_id',
        pl.lit(0.0).alias('responder_6'),
    )
    if isinstance(predictions, pl.DataFrame):
        assert predictions.columns == ['row_id', 'responder_6']
    elif isinstance(predictions, pd.DataFrame):
        assert (predictions.columns == ['row_id', 'responder_6']).all()
    else:
        raise TypeError('The predict function must return a DataFrame')
    # Confirm has as many rows as the test data.
    assert len(predictions) == len(test)
    return predictions
is there an upper limit to the total number of submission we can make?
you can make 5 submissions per day
Hey everyone,
I am pretty new to AI/ML and wanna participate in this competition so can anyone guide me on:
What should I learn(prerequisites)?
Any beginner-friendly resources to get started?
Thanks a lot!
So, just reading the description of the data again:
lags.parquet - Values of responder_{0...8} lagged by one date_id. The evaluation API serves the entirety of the lagged responders for a date_id on that date_id's first time_id. In other words, all of the previous date's responders will be served at the first time step of the succeeding date.
How does this even make sense? We know that the responders' values depend not just on the date_id but also on the time_id. So which of the previous day's time_id is served up the following day in the lagged responders? It doesn't yield all of them, as evidenced by the lags.parquet file. Are we just to assume that it's the last value of the previous day?
i think this might be helpful
Maybe they should have actually updated the article to reflect this information, that's a massive difference from what's written
This means you can, in theory, do semi-online learning (if that were computationally feasible)
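As an illustration of what a cheap online update could look like, here is a toy linear model trained by stochastic gradient descent on squared error, one labelled row at a time. It is a stand-in technique chosen for brevity, not anything prescribed by the competition; `OnlineLinearModel` and the learning rate are made up:

```python
class OnlineLinearModel:
    """Linear regressor updated one sample at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_one(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y, sample_weight=1.0):
        """One gradient step on sample_weight * (prediction - y)^2 / 2."""
        error = self.predict_one(x) - y
        step = self.learning_rate * sample_weight * error
        self.weights = [w - step * xi for w, xi in zip(self.weights, x)]
        self.bias -= step

model = OnlineLinearModel(n_features=1)
# Pretend these (feature, target) pairs are yesterday's rows, now labelled.
for _ in range(500):
    for x_value in (0.5, 1.0, 1.5):
        model.update([x_value], 2.0 * x_value)
estimate = model.predict_one([1.0])
```

In the competition setting the updates would run once per day, when the previous day's responders arrive, and would have to stay within the per-batch time limit.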
Thanks for the link, that completely changes my outlook on the problem
Also need to correct the statement I made here, then
How is everyone doing with the competition?
Just started and very confused 
Just started, is anyone running the code on Kaggle? My kernel dies on just reading the parquet files
what does responder.csv mean?
can I close my computer while submitting? If I submitted the version but it is still computing the score, will it break if close my computer?
yes, you can shut down
thanks
From the post we have:
- Lags only covers responders, not features.
- For each time_stamp, we have responder 1 day before, same time stamp.
Am I understanding the post correctly?
This is very confusing.
Yep, given all features plus 1 day lagged responders to work with
just started, have no idea how to do this kind of competition, can anyone give some pointers on how to move forward
?
I am getting this error. This is my first competition, can someone help? Here are a few more screenshots which I think would be relevant.
I joined this discord to confirm this, thanks! Seemed like the case to me, but I was expecting a longer sequence time series challenge, but this is an interesting twist.
So in reality what we have for an observation is features_0..n and t-1 responders for a given date_id and time_id.
Has anyone had success just performing daily predictions? Based on test.parquet, they've only got it at date_id =0 and time_id =0; is it reasonable just to give a daily prediction and call that enough?
the explanation of the problem itself, and the data used is horrible for this project.
Hi guys, does anyone know why doing a test submission works, but the actual submission raises an unhandled error at runtime? Not sure what the exception is. Same thing if I manually check lags for None.
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    predictions = test.select(
        'row_id',
        pl.lit(lags['responder_6_lag_1']).alias('responder_6'))
    return predictions
I was getting that too... I don't know why but I found a certain config that fixed it.
global lags_
if lags is not None:
    lags_ = lags
predictions = test.select(
    'row_id',
    pl.lit(0.0).alias('responder_6').cast(pl.Float64),
)
if not lags is None:
    lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last()  # pick up last record of previous date
    test = test.join(lags, on=["date_id", "symbol_id"], how="left")
else:
    test = test.with_columns(
        (pl.lit(0.0).alias(f'responder_{idx}_lag_1') for idx in range(9))
    )
# Replace this section with your own predictions
predictions = test.select(
    'row_id',
    pl.col('responder_6_lag_1').alias('responder_6').cast(pl.Float64),
)
I think the cast to Float64 is important, but it's hard to debug with only 5 submissions a day. Only difference here is I merged the lags into the test dataframe.
Anyone have any luck with Pandas?
My code runs fine but at the end the submissions section shows that it "Threw Exception".
Completely agree.
Is the symbol_id representing the ticker symbol (e.g. voo, goog) or something else?
I have a question, Do I have to save the submission.parquet file myself or server code will save itself for me?
Running this code actually gives me an error, although the error is different, now it's actually specifying data format as throwing some error
Correct
I think it's done on it's own but someone who got their code to submit properly would need to answer.
I had memory issues with pandas so just using polars
Fudge. Thank you Danila.
is this code block working on its own as a pred?
As general question, if I read in the same file with pandas vs polars, is Polars really using less memory to hold the same amount of data?
It was. let me grab the whole thing.
lags_ : pl.DataFrame | None = None
# Replace this function with your inference code.
# You can return either a Pandas or Polars dataframe, though Polars is recommended.
# Each batch of predictions (except the very first) must be returned within 1 minute of the batch features being provided.
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    """Make a prediction."""
    # All the responders from the previous day are passed in at time_id == 0. We save them in a global variable for access at every time_id.
    # Use them as extra features, if you like.
    global lags_
    if lags is not None:
        lags_ = lags
    predictions = test.select(
        'row_id',
        pl.lit(0.0).alias('responder_6').cast(pl.Float64),
    )
    if not lags is None:
        lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last()  # pick up last record of previous date
        test = test.join(lags, on=["date_id", "symbol_id"], how="left")
    else:
        test = test.with_columns(
            (pl.lit(0.0).alias(f'responder_{idx}_lag_1') for idx in range(9))
        )
    # Replace this section with your own predictions
    predictions = test.select(
        'row_id',
        pl.col('responder_6_lag_1').alias('responder_6').cast(pl.Float64),
    )
    if isinstance(predictions, pl.DataFrame):
        assert predictions.columns == ['row_id', 'responder_6']
    elif isinstance(predictions, pd.DataFrame):
        assert (predictions.columns == ['row_id', 'responder_6']).all()
    else:
        raise TypeError('The predict function must return a DataFrame')
    # Confirm has as many rows as the test data.
    assert len(predictions) == len(test)
    return predictions
Ok I'm running this identical cell with the inference server and it's giving me a slightly frustrating Your notebook generated a submission file with incorrect format
inference_server = kaggle_evaluation.jane_street_inference_server.JSInferenceServer(predict)
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    print("running local gateway - submitting predictions")
    inference_server.run_local_gateway(
        (
            '/kaggle/input/jane-street-real-time-market-data-forecasting/test.parquet',
            '/kaggle/input/jane-street-real-time-market-data-forecasting/lags.parquet',
        )
    )
I'll have to try again tomo, surprised mine is being thrown errors. thx for sharing
Thank you!
Also, does anyone know if the weight column is just used for scoring the model for evaluation and unrelated to the rest of the market data?
Not sure but thats something you can test in your model by removing it to see if the score goes up.
ok thank you!
Are responder tags [tag_0, tag_1..., tag_4] the same as the feature tags [tag_0, tag_1..., tag_4]?
Anyone solve this thrown exception problem during submission? I can't understand the problem. My Predictions, Sample submission, & what gets exported for submission are identical.
All being read in using pandas.
Sample CSV submission:
My Predictions:
Outputted submission parquet file after attempting to submit:
How are others getting this to submit correctly?
Hi, The instructions (Overview > Code Requirements > Training Phase) say, "Your notebook must use THE time-series module to make predictions". What module is this?
All I did was add in my predictions to the given sample code. What you see above is an info dump of the data frame on my personal machine so I could compare.
The third image is from the file that gets spit out after using the server code in a kaggle notebook
The codes runs successfully
The output for is the issue somehow
Does everyone's submission parquet have these attributes?
In your screenshot, have you looked at the very bottom of the message? I have seen that to give some clue about the error in my case.
Hi, does anyone know why the scoring process is so slow? I already put the model loading code before the prediction function. But just a simple lgbm would task over an hour to finish scoring.
Is it always slow? Or just at certain times?
Always! And it's a bit difficult to install some other packages to make the inference faster in the kaggle notebook, so some new ideas engaging feature engineerings could be difficult to evaluate.
Before doing massive feature engineering, did you try to do a simple configuration first to test speed? As mentioned above, the simplest would be responder_6 = responder_6_lag_1.
lags_: pl.DataFrame | None = None
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    """Make a prediction."""
    global lags_
    if lags is not None:
        # Rename `responder_6_lag_1` to `responder_6` if it exists
        if "responder_6_lag_1" in lags.columns:
            lags = lags.rename({"responder_6_lag_1": "responder_6"})
        # Add `responder_6_2` as the square of `responder_6` if it exists
        if "responder_6" in lags.columns:
            lags = lags.with_columns(
                (pl.col("responder_6") ** 2).alias("responder_6_2")
            )
        # Save the processed lags globally for future use
        lags_ = lags
    # Initialize predictions with default values
    predictions = test.select('row_id', pl.lit(0.0).alias('responder_6'))
    if lags is not None and "responder_6" in lags.columns:
        # Ensure alignment between test and lags and update the responder_6 values
        pred = lags["responder_6"].to_numpy()
        predictions = predictions.with_columns(pl.Series("responder_6", pred.ravel()))
    # Return the predictions as a Polars DataFrame
    return predictions
When I try this I'm able to generate a submission file on my own, but it throws an exception when I try to submit on the competition page. Any ideas? I replaced the predict function in the provided 'Jane Street RMF Demo Submission' notebook
This should just return the most recent lag of responder 6
Theres a few similar questions in the discussion section but no answers, was anyone able to debug this?
I'm in the same boat as you. Been spending way too much time on this exact issue rather than improving a prediction model.
had this exact issue, someone was able to fix it by casting preds to the right data type, but that didn't work for me
Hey @west glade @swift harbor.
I was able to solve the Thrown Exception error with the following code:
lags_: pl.DataFrame | None = None
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame:
    if lags is not None:
        lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last()
        test = test.join(lags, on=["date_id", "symbol_id", "time_id"], how="left")
    else:
        test = test.with_columns(
            pl.lit(0.0).alias('responder_6_lag_1')
        )
    # Use the lagged responder_6 value as the prediction
    # Assuming 'responder_6_lag_1' is the column that represents the most recent lag
    predictions = test.select(
        'row_id',
        pl.col('responder_6_lag_1').alias('responder_6')
    )
    # test = test.write_parquet("test_out.parquet")
    return predictions
However, I now get a scoring submission error. Let me know if you guys are able to make any progress
I'm testing your code out now. I'll let you know what happens on my end.
I believe the issue is that the lags are served on the first time step of each date. This means you need to save them outside of the predict function - presumably this is what the global lags_ object is for.
I'm not able to submit anymore today, but this is my code, that I believe should work: https://www.kaggle.com/code/joshlevent/js-null-hypothesis
It worked!
Friends, I'm getting this stubborn error after a few hours or so from submission: "Notebook Inference Server Error
Your submission notebook's inference server was disconnected unexpectedly, or a request timed out. See more debugging tips". Is that the problem due to 60-sec timeout, or could that be memory issues, etc.?
I'm sorry I haven't gotten that far yet. My current notebook is scoring with my own information for the first real time.
@swift harbor @steel lantern Josh's code worked all the way through. I was able to modify it to fit how my own code makes predictions and it's functional. I thought I had done the same as this code right at the start but apparently I missed something and overcomplicated it.
I got it working also, it's great. Although why is the R2 score <0? Otherwise works.
I found that interesting too. As our model gets better I suppose it will move into the positive.
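On the negative score: as far as I understand the evaluation page, the metric is a sample-weighted zero-mean R², i.e. 1 - Σ w·(y - ŷ)² / Σ w·y². Under that definition an all-zero prediction scores exactly 0, so any prediction with more weighted squared error than "always 0" goes negative. A small sketch with made-up numbers (the values below are purely illustrative):

```python
def weighted_zero_mean_r2(y_true, y_pred, weights):
    """1 - sum(w * (y - y_hat)^2) / sum(w * y^2), with no mean subtraction."""
    numerator = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights))
    denominator = sum(w * y ** 2 for y, w in zip(y_true, weights))
    return 1.0 - numerator / denominator


# Hypothetical targets and weights.
y_true = [0.5, -0.4, 0.3, -0.2]
weights = [1.0, 2.0, 1.0, 2.0]

score_zeros = weighted_zero_mean_r2(y_true, [0.0] * 4, weights)
# A "stale" prediction that often has the wrong sign, as a raw lag can.
score_stale = weighted_zero_mean_r2(y_true, [-0.4, 0.3, -0.2, 0.5], weights)
score_perfect = weighted_zero_mean_r2(y_true, y_true, weights)
```

Here `score_stale` comes out below zero, which is consistent with a pure lag baseline scoring under 0 on the leaderboard.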
how long does it take to score, approx?
For me I think it took 2 hours.
ohk!
I didn't do anything too fancy. I only made my prediction on what's existing, then passed that through to return it.
yeah ig it'll depend on model
For an actual model that is making a prediction, I think 2 hours is a good baseline.
Does anyone know what the responder values represent in the dataset?
I understand them to be other securities
If we were talking about crypto currency:
Bitcoin
Ethereum
Dogecoin
etc...
Those would all be responders.
I stopped trying to figure this one out. All the data is based off assumptions they are working by, none of which is necessary. Just give me OHLCV vals
Agreed. So much time spent trying to figure things out but I'm still chugging along to at least get an ok score. OHLCV for the win.
Did anyone try using sequence models? I saw Motono223's preprocessing only creates one lag timestep
I was planning to but I don't have enough time so I'm keeping it a bit more simple although very tuned.
How long does it take to submit your code?
OHLCV is really a mid-frequency thing. What is the open price? Many people refer to the first tick (trade) of an interval. However, you won't be able to obtain it here, unlike in mid-frequency strategies. Also, for illiquid names, the middle of the day is very likely to be quiet with no trades at all.
Glad you got it working, do you understand the lags they are providing? It sounded like we get only one lag from the previous day. And I assume within the submission test dataset we will be able to calculate the lags within a day.
Seems like they will provide the responders of the same time_id yesterday in the lag field.
What I saw from the discussion is that at the end they will provide the whole previous day's data in lag, not just one time_id
Hi guys, if anybody can help me, I wanted to know how I should start this Jane Street modelling challenge. I only have experience with basic models for regression and classification. I have some knowledge about complex models like XGBoost and ensemble methods, and I wanted to learn through this competition. When I tried loading the dataset my VS Code crashed (I have an M2 Pro MacBook). Is it difficult to do this modelling on a local machine? I really appreciate any help. Thank you!
Yeah, the dataset is large for a small to medium laptop/desktop, so it's going to be a tough first one. You'll probably spend most of your time finding tricks to process the data. I used polars, which has a streaming feature, so I could lazily define things, and then I set up some creative data generators that read only a portion of the data at any given time.
Thank you for the response!
Like Shane had said, I assumed that they would always give us the lag so no need to calculate anything.
The entire previous day? 0_o
Did it say that in the competition information anywhere?
So if our code is running at the middle of the day, We would get the entire previous days values plus the last time_id?
Too bad this wasn't more clear.
Still so many questions
Crazy that it took for someone to test this out to try and verify that instead of just getting a clear instruction within the competition notes.
If this is true then what they gave us as an example is crap.
Ah well, not enough time to rework my code to verify this or use it in a meaningful way if it's true.
@keen kite Thanks for pointing this out.
Dang just thinking about it now. That would mean that the previous time_id lag won't be given at each time_id that is not zero. Fudge.
The instructions are just confusing
And for the test set, only one time_id lag is given
I think the goal is to use lag data (the previous day or all previous days) to predict the next day t=0 responders
I have a little pet hypothesis. I think that responder_6 is either implied volatility or is somehow related to implied volatility.
if I'm right, make sure to credit me later. 
lol
I've been working on a very interesting solution to this problem that for now I'm gonna keep secret. Once it's over I'd be glad to share it
would love to discuss how others approached this problem as well
How long is it taking for everyone's code to run?
I'm currently sitting at 6 hours of scoring and getting worried 0_0
hmm if you run your own scoring system, does it take 6 hrs to run through it?
on your own pc, that is
@steel parcel It would be great to discuss after the competition!
I had a little trouble getting the kaggle evaluation package to work for me so I just focused on my model on my PC and figured I'll let the kaggle notebook do the scoring.
do you load the entire dataset at once?
I did at the very start yes.
ya that might be it. what if you fractionalized the dataset and ran through it procedurally?
But then I soon realized that I needed to work with it in pieces probably like everyone else.
Do you know how the scoring portion actually works?
they tell us how it's calculated
Are they scoring on the data that they've given us to train on?
Yeah whatever they are using to get our Public score. Are they scoring our model on the data that they've already given us that we are training off of.
At the start of the forecasting phase, the unscored public test set will be extended up to the final day of the model training phase and the private set updated roughly every two weeks. Submissions will be rescored at the time of each update.
During the forecasting phase, the evaluation API will serve test data from the beginning of the public set to the end of the private set. You must make predictions at every timestep, but, in this phase, only predictions on the private set are scored. (You may predict 0.0 on the unscored segments, if you like.)
from the data tab
Hmmm...Seems maybe I should have spent more time getting the evaluation package to function.
hey! does anyone know what the forecasting window is supposed to be? ik this is a simple question but I'm still confused
aka 1 time step into the future or n time steps
the test set they provide is 38 rows long, does this mean the window is just whatever length test set they give us?
It's 1 time step. The rows correspond to different symbol_ids.
Very welcome.
So in regards to this lags situation, lets say we are at time_id = 25. I won't have any access to any of the lags of any responders for the past 5 steps?
you can store the data in memory or in file data that can be read dynamically, right?
@west glade did you try pre-training your model before you upload it and loading it into memory to save time?
I was thinking of that but... we won't ever get any of the actual values until the next time_id = 0, which is the next day.
just trying to think of ways to make your work easier
try storing useful relevant data into a readable format, unless they wipe the data clean on your system it should stay... right?
I think in my case I'm building several contingencies for situational chaos
I completely agree with you, as long as they provide the last time_id's lag at every step rather than only giving us yesterday's lags at the beginning of a new day.
I am trying to confirm that there is something for me to store aside from what I was given for yesterday.
you might be able to extract current datetime & utilize that as a tool
no guarantee on that, though
I think that for the most part, they're shopping on kaggle for potential hires
ahhhhh
I don't think they care very much about the results,
obviously the results are useful & any code/algorithms/models they can obtain are probably worth a few pennies
& 50k is definitely just a few pennies to these guys
agreed.
I think that's why they aren't being more open about what can & can't be done
if you really cared strongly about the results, you'd be more specific with available options
They are probably enjoying watching us stress out lol
hey I find it fun
anonymized column data is actually kinda interesting, it got me thinking about how I can work around that
which I'd be really eager to discuss later
Same!
the math olympiad competition seems pretty challenging in a not-fun way
I read through it and it seemed more challenging than arc 2024
and not nearly as "nice"
I could be wrong though
I wish I had time to check out the other comps. I found that the spine MRI image classification was pretty cool but couldn't do it at the time.
I just realized that I didn't answer this question. Sorry. Yes I did.
And it failed at the 8th hour unfortunately.
Another question: in the sample test parquet they gave us, the is_scored field is all true, but this is the very first date and time step (which is part of the public set). they claim only the private set predictions are scored. what am I misunderstanding?
anyone having 'too many requests' issue?
Does anyone know how we can get the previous day's features? I saw that the lag dataset only contains responders
I believe you can store them in a global variable manually
I am kinda confused about the data they provide. Please point out if I am wrong in this post
Everything seems right except what you said about test.parquet. As I understand it (take with a grain of salt) You will get a single date_id & time_id each time your predict is called. It will contain features for all symbol_ids. You are right about the lags in that at time_id = 0, you get the previous "date_id's" lags.
What I don't know is whether at, let's say, time_id = 25, we will get the lag for the previous time_id step (the responder value at time_id_{t-1}).
Someone can correct me if any of that is wrong.
Does anyone have a problem called "submission format errors"? I tried many times and it still fails, and I can't find the issue in my code
This is my prediction code
Thanks for pointing that out. I think you are correct. We know 'what time is it now' from the test df passed in
I think you get the previous day's lag response all together at the next day t0. So extra work to cache the features and do df join
When you run your code on the test and lags parquet files that we were given, do you get something that looks like this:
yeah. I fixed issues later by changing the logic of merging. I need to groupby first and merge
man, this submission process is driving me crazy
Your submission notebook's inference server was disconnected unexpectedly, or a request timed out. See more debugging tips
and not a single clue as to why.
and I'm not sure how to troubleshoot this, either. the test submission ran perfectly
making the features anonymized/indiscriminate, unintelligible without context, IMO is a deficit to building the most optimal model for predicting markets
less realistic to exclude domain expertise and knowledge, better for challenge innovation though
I would also love to know, but how they generated these features and responders are probably worth a lot of money
What's the frequency of this data (hourly, daily or weekly)?
Did I miss where it was mentioned, or is it hidden with no way to figure it out?
it says in the data section "It's important to note that the real time differences between each time_id are not guaranteed to be consistent" which doesn't exactly say much but there are around 968 time_ids per date_id
one theory i've seen is that its roughly minutely and includes trading hours and after hours trading
Has anyone's notebook run successfully, only for scoring to inevitably result in "Notebook thrown exception"?
I'm at my wits end, can't get any logs or prints to debug this. I've sprinkled defensive checks everywhere, types, shapes, bounds, anything I can think of, the code looks like messy spaghetti.
and "Notebook thrown exception" always laughs in my face, I'm literally pulling my hair out
Hello Kaggle Team, Kaggle Community, and Competition hosts,
Our team participated in the Jane Street Real-Time Market Data Forecasting competition, and we encountered a critical issue during the forecasting period where none of our submissions were properly scored on the Private Leaderboard.
During the Public Leaderboard phase, our submissions were successfully evaluated, and we had no issues. However, once the competition transitioned into the forecasting period, every submission we made, whether it was our own developed solution, a publicly available notebook solution, or a combination of both, failed to be scored correctly. This happened regardless of whether we selected those solutions as our final submissions.
The submission logs indicate that all our submissions were marked as "Succeeded," yet they were not evaluated on the Private LB. The attached image provides evidence of this issue.
We want to clarify that our team did not engage in any rule violations or unethical practices. Given that multiple solutions were affected, we believe this could be a technical issue rather than a problem specific to our team.
Could the Kaggle team and Community please investigate this matter and provide clarification on why our submissions were not evaluated on the Private LB? We would appreciate any insights or possible resolutions.
Jane Street unfortunately will always suffer from how they frame the approach, and from how they framed the problem. I may contact their clients rather than seek the prize money
You should post in the forums, there is no support through discord.
Can we join this competition now
Says, "New entrants are currently not allowed. You will be able to accept the rules and submit late predictions after the competition completes."
Hi All, I am looking for Kaggle Grandmasters who have won competition who can mentor me. I am willing to pay for mentorship. Thank you!
hmm its a forecasting comp ofc not anytime u wake up n think u can join
Machine Learning Algorithms You Never Knew Existed, But Are Quite Useful https://medium.com/pythoneers/machine-m. D
They do not exist btw
Now we don't know they exists fr, lol
Yes I'm omniscient

I knew you were going to say that
No more
