#jane-street-real-time-market-data-forecasting

livid vortex
livid elbow
#

What is the format for the submission? Does my notebook need to have an output file?

steel trench
#

how to submit and what exactly to submit

lapis marsh
#

@steel trench

south willow
#

Hello

brittle crown
#

hi

ember meteor
#

Hi Everyone. I am new to market data forecasting. I was just reading the problem statement and have a question on terminology. What do we mean by responders? I couldn't find the exact definition anywhere.

west glade
#

@ember meteor Hi! "responder", or more specifically "responder_6", is your target, i.e. the outcome you predict. They don't want to expose any sensitive info, so they use the term responder. Let's say you take a buy trade at 7:35 AM; "responder" would be the value at 7:36 AM. That's just an example. Your goal is to predict that value at 7:36 AM.

ember meteor
west glade
#

I haven't dug into the data well enough yet, but that sounds right. The features are probably values of different indicators, e.g. RSI, moving averages, other securities, etc.

gloomy spear
#

Hello everyone, this is my first Kaggle challenge. Is there an environment file we can use to set up everything that's needed for the APIs and inference servers?

simple shoal
west glade
# simple shoal This lead me to a question, if we don't know the nature of those features how ca...

Our task is just to find and use useful relationships between the features they provided, if there are any. I'd also prefer to have actual meaningful category names to play with, but if we look at this purely from a time series analysis point of view, we can think more clearly.

Let's say we knew one of them was a moving average: we might be tempted to try every moving average under the sun. We can still do our own feature engineering, since we know it's market data. The way they presented it leaves things more focused, in my opinion, and puts everyone on a level playing field.

west glade
gloomy spear
west glade
#

ahhh

simple shoal
#

I'm a beginner in machine learning and this is my first time dealing with anonymous features, so I might fail to draw the most intuitive conclusions about the problem we have in this competition.

west glade
#

The way I like to approach it is to use what is given, then manipulate/add/remove as needed to get better scores.

west glade
simple shoal
#

What about missing values (NaNs)? I'm quite sure you didn't drop these rows, but I just want to confirm.

#

I'm surprised that no one in the discussion has talked about it in detail, although there's a significant number of NaNs in this dataset.

#

I have the impression that the fact that a value is missing in this dataset holds a useful signal, and that the test data will include many NaNs as well, so I think imputing or dropping them isn't smart.

west glade
#

Handling missing or NaN values for time series can be a bit of an art.
Is that row important?
Can I use the row before it in my model to predict the next row?
All valid questions. In the case of returns, I personally would most likely impute in some way. That may or may not be the best approach in this case, but it's up to the programmer 💪 🤓.

narrow nest
#

Hey, just starting in this comp so apologies if this is a silly question.

In the test set are we provided lags for each date_id, time_id pair? Or only for the first one?

So in the current example, test and lags have the same number of rows. But if only the first time_id for each date is given for the lags, test.csv might have more rows than lags - right?

west glade
#

There are no stupid questions here. I haven't had the chance to dig into the data well enough yet, so I'm sorry, I'm not sure.

crisp hawk
#

what are tags?

warm oriole
#

Sorry, new to this type of competition. I see that many submissions have been made already. Are teams meant to keep making new submissions throughout the competition until the deadline?

knotty spade
#

During the forecasting phase, is our model forced to be fixed? Can we train dynamically as new information flows in, or feed this new information into our prediction function in some way short of serious "training"? In other words, can we use information from the sequentially earlier part of the test set when predicting the sequentially later part?

modern bronze
#

Hello there! "weight - The weighting used for calculating the scoring function" what's the scoring function here?

knotty spade
modern bronze
#

thanks!

solid bobcat
#

I'm a little confused about what some of the files are for. Are responders.csv and features.csv supposed to be used, or just to look at?

lofty ferry
#

I believe they're just to look at

rugged summit
#

Guys, I'm new. I want to try a constant submission just to test, but I can't import the module. What can I do?

meager garnet
#

Has anyone tried PySpark for a submission? Is it even possible? I'm new to PySpark, so I was thinking of applying it in this contest.

thick creek
abstract vale
#

My notebook is failing scoring, but nothing appears in the logs. My guess is that this is due to the inference server running in a different thread. Is there an example of how to capture those logs, so I can figure out the source of the failure?

wise spire
# rugged summit

Not sure, but I don't think that's the correct way to import the "kaggle_evaluation" module.

burnt forge
#

Hello, what is the best practice regarding installing packages in kaggle notebooks? I see that there's a default environment with some reasonable package versions (i.e. lightgbm 4.2.0). I tried updating it to 4.5.0 by connecting the notebook to the internet, but it turns out we cannot submit notebooks with connection to the internet.

Curious to hear if there are Kaggle ways to deal with this?

somber mauve
#

So, in order to generate/create features based on the responders as example, will they be available at each date_id/time_id ? all of them not just responder_6 ? Same goes for features, for each date_id/time_id will they be available as well ?

fiery sage
#

what is the reason for the Nulls in the data? a decent chunk of some of the features are nulls

compact sedge
#

Hello, I am trying to make a submission through the API. I tried setting os.environ['KAGGLE_IS_COMPETITION_RERUN'] = "True" and then running the serve function, and the server has kept running with no results so far. Can anyone please help me?

rich sorrel
rich sorrel
#

this is not correct. the inference server gives you the t-1 responder values. you therefore can train an online model that includes test data

knotty spade
compact sedge
compact sedge
#

Hello everyone, can anyone verify whether we need to set the KAGGLE_IS_COMPETITION_RERUN env variable before running the server, and how much time it usually takes to complete a submission?

queen trench
#

As I understand the data, we have time series data for the training set, but for prediction only the responders (0, [...], 8) from T-1 and the features (0, [...], 78) from T, right? So for the prediction we do not have historical data available?

rich sorrel
rich sorrel
still thunder
strange canyon
#

I am facing a memory issue while running my notebook on Kaggle. What's the solution? Should I use fewer partitions from the train data rather than all 10, and will that impact the model's accuracy?

tame urchin
#

I'm an SJTU master's student and a Xiamen University graduate, looking for a Chinese team.

somber mauve
#

For every time_id per date_id, will the features be available same goes for responders ? My point is if I want to generate features like rolling mean, lags as well (ignore the files for now)

worthy needle
#

Hi, I want to know how long the scoring takes after a submission. I have waited for 20 mins.

fiery sage
strange canyon
#

This is my first competition. I have made 8 submissions but can't get above a score of 0.0043. I have tried lightgbm and xgb. Should I focus on improving my existing submission or try a different model? Any suggestions?

gritty summit
#

Hello, I am trying to submit a simple dummy submission just to get a good handle on the submission format. I am getting "Notebook threw Exception" despite passing the assertions in the given "predict" function. All I did was copy the example notebook and edit the predict response; it doesn't even use a model. Any help with navigating the API and submission format would be appreciated.

#

Example code without any model would be appreciated just to understand submission better

gritty summit
#

As far as I can tell the demo submission doesn't actually work as is??

strange canyon
#

When I submit my notebook to the competition, it fails again and again with "Notebook Inference Server Error".

strange canyon
#

not able to submit.. tried so many times..

sonic gust
gritty summit
#

Using all my daily submissions just to debug a submission that isn't even a model 😭

strange canyon
strange canyon
stuck zealot
#

Do you think it could be an error with the API?

#

or the environment where the notebook is run

#

I've used 3 of my 5 submissions trying to debug this error..

strange canyon
stuck zealot
#

I've tried to submit the notebook versions that were submitted and scored last week without any issue, and the notebook inference error keeps popping up... 🥹

strange canyon
bitter tapir
#

Hello, I was hoping someone can explain to me the exact way in which the API serves up the test data. We get batches which correspond to specific time_id's, but are consecutive batches also consecutive time_id's? That is, would it be similar to looping over date_id and time_id in the training dataset?

bitter tapir
bitter tapir
#

I put this inside the "predict" function

    global date_id
    global time_id
    global switched_date
    global dates_without_time
    
    if date_id is None:
        date_id = test.select("date_id").max().item()
        time_id = test.select("time_id").max().item()
    
    if date_id != test.select("date_id").max().item():
        switched_date += 1
        date_id = test.select("date_id").max().item()
        time_id = test.select("time_id").max().item()
    else:
        assert test.select("time_id").max().item() == time_id + 1
        switched_date = 0
    if switched_date > 1:
        dates_without_time += 1
    if dates_without_time > 10:
        raise ValueError("Too many dates without time.")

The submission says that the notebook threw an exception. So I guess that means that, even if two batches have the same date_id, two consecutive batches do not generally have consecutive time_ids. Do correct me if I'm wrong; the notebook may have thrown an exception for some other reason.
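Reading the snippet as posted, and assuming the globals start as None/0, the very first call sets time_id to the batch's time_id and then falls into the else branch, where the assertion compares that same value against time_id + 1. So it would raise on the first batch whatever order the API uses, and time_id is never advanced afterwards. A minimal sketch of a tracker that avoids both problems is below; the class name `BatchOrderTracker` and its counters are hypothetical, and it only records what it sees rather than raising.

```python
class BatchOrderTracker:
    """Record how consecutive batches relate, without raising."""

    def __init__(self):
        self.prev = None            # (date_id, time_id) of the last batch
        self.date_changes = 0       # batches that started a new date_id
        self.non_consecutive = 0    # same date_id, but time_id != prev + 1

    def update(self, date_id, time_id):
        if self.prev is not None:
            prev_date, prev_time = self.prev
            if date_id != prev_date:
                self.date_changes += 1
            elif time_id != prev_time + 1:
                self.non_consecutive += 1
        # Always advance, so the next comparison uses the latest batch.
        self.prev = (date_id, time_id)

tracker = BatchOrderTracker()
for date_id, time_id in [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)]:
    tracker.update(date_id, time_id)
```

Inside predict one would call `tracker.update(...)` with the batch's date_id and time_id, and then decide separately how to surface the counters, since a failed submission exposes very little detail.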

stuck zealot
#

I am guessing that there is something wrong with the scoring environment, as all the submissions are failing afaik

#

@bitter tapir you can read it above

bitter tapir
#

I also saw that which is why I'm not confident about what caused the error - so if somebody has an insight into the time series that is served up by the API, please do share

gritty summit
#

I can't try it yet, but I think I know why it's breaking

#

Ran out of submissions need to wait an hourish

#

Bc it does work on the example notebook with no changes

gritty summit
bitter tapir
#

Oh? Good to know, I'm new to this

gritty summit
#

yeah trying to figure out just from if it throws error sounds fucked, I thought it was like that at first too lol

gritty summit
#

Nvmd idk it didnt work, shit just seems fukt i cant get anything that isnt the literal copy of the example to work

strange canyon
#

model training should happen in less than 15 min and inference should start.

gritty summit
stuck zealot
#

I think I've managed to overcome the issue

#

You have 90 seconds to finish the first predict call; if you don't, the notebook raises an inference server error. So all the time-consuming operations (e.g. loading the model/s) need to happen before the first call.

#

That solved the issue for me 🥹
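The pattern described here, doing the slow work at notebook top level and keeping predict cheap, can be sketched in plain Python. Everything in this block is a stand-in: `load_model` is a hypothetical placeholder for whatever loading step a notebook needs, and `predict_rows` is a simplified substitute for the real predict(test, lags) signature.

```python
import time

def load_model():
    """Hypothetical stand-in for a slow step such as reading model files."""
    time.sleep(0.01)  # pretend this is the expensive part
    # The "model" is a constant predictor returning 0.0 per row.
    return lambda row_ids: [0.0 for _ in row_ids]

# Runs once when the notebook cell executes, before any predict call.
MODEL = load_model()

def predict_rows(row_ids):
    """Per-batch work only: reuse the already-loaded MODEL."""
    return list(zip(row_ids, MODEL(row_ids)))

first_batch = predict_rows([10, 11])
```

The point of the design is that nothing inside the per-batch function depends on disk or on training, so the first call costs about the same as every later one.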

thin verge
#

Hello. On the "Overview" page, under "Code Requirements -> Training Phase", there is a point that states: "Your notebook must use time-series module to make predictions". Could somebody clarify what this means?

tame urchin
#

I have a question: what is the relation between row_id, date_id, time_id and symbol_id? I see symbol_id go from 0 to 38, then time_id increases by 1, and then symbol_id goes from 0 to 38 again.
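The layout described in this message suggests that a row is identified by the triple (date_id, time_id, symbol_id), so a single symbol still has many rows per date_id, one for each time_id. A minimal sketch of turning such rows into one chronological series per symbol; the rows are made up and the helper name `series_by_symbol` is hypothetical.

```python
from collections import defaultdict

def series_by_symbol(rows):
    """Group rows into one time-ordered series per symbol_id.

    `rows` is an iterable of dicts with the keys date_id, time_id,
    symbol_id and responder_6 (toy data here, not the real files).
    """
    series = defaultdict(list)
    # Order by (date_id, time_id) so each symbol's list is chronological.
    for row in sorted(rows, key=lambda r: (r["date_id"], r["time_id"])):
        key = (row["date_id"], row["time_id"])
        series[row["symbol_id"]].append((key, row["responder_6"]))
    return dict(series)

rows = [
    {"date_id": 0, "time_id": 1, "symbol_id": 0, "responder_6": 0.2},
    {"date_id": 0, "time_id": 0, "symbol_id": 1, "responder_6": -0.1},
    {"date_id": 0, "time_id": 0, "symbol_id": 0, "responder_6": 0.1},
    {"date_id": 1, "time_id": 0, "symbol_id": 0, "responder_6": 0.3},
]
by_symbol = series_by_symbol(rows)
```

Plotting against the pair (date_id, time_id), rather than date_id alone, is what removes the apparent duplicates for a fixed symbol.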

tame urchin
stuck zealot
coarse herald
#

Anyone looking to collab?

harsh relic
#

life would be easy if responders are updated intra-day

grim oriole
#

i had a model that gave an r squared of 0.8 using the previous responder

#

but then i realized they only update every day

#

😢

hallow heart
#

Hello, i'm new here, i got one team

keen crest
#

Hi, what is the team size allowed for this competition?

harsh relic
#

Hi there anyone knows which one is better for gradient boosting?

solid ibex
#

Hi all, I am looking for a team. I am based in New York City

strange canyon
#

Getting a submission scoring error after submitting, but the output is the same as my earlier notebooks' outputs, which were accepted.

fiery sage
strange canyon
bitter tapir
grim oriole
#

i believe the test.parquet file is an example of what the test dataframes that are given to the predict method look like

#

test.parquet has the columns, "row_id", "date_id", "time_id", "symbol_id", "weight", "is_scored" and features 00-78

exotic smelt
#

hi

#

Hi there
I need help with the Kaggle Jane Street competition: there are so many partition_ids, which one should I choose to predict responder_6? Thanks.

exotic smelt
#

Please help

bronze storm
#

Does anyone know whether, in the prediction period, we get the responder for the last time_id?

grim oriole
#

the responders are given through the lag dataframe in the predict function at the first time_id and it contains all of the responders for all of the time_ids of the previous day

bronze storm
bronze storm
abstract vale
errant sierra
# rugged summit

In that notebook, check whether the right-hand section shows the Jane Street competition. If not, that's the issue: recreate the notebook, copy-paste from this one into it, and rerun. Or you might have disabled internet.

errant sierra
#

I'm getting negative score on leaderboard

glossy vigil
#

@errant sierra That means your forecast didn't beat the baseline, 0 forecast.
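The competition metric is a sample-weighted zero-mean R² on responder_6, R² = 1 − Σ wᵢ(yᵢ − ŷᵢ)² / Σ wᵢ yᵢ², which is why an all-zero forecast scores exactly 0 and anything worse than it goes negative. A small dependency-free sketch; the function name and the toy numbers are illustrative.

```python
def weighted_zero_mean_r2(y_true, y_pred, weight):
    """Sample-weighted R² measured against a zero forecast, not the mean."""
    num = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weight))
    den = sum(w * y ** 2 for y, w in zip(y_true, weight))
    return 1.0 - num / den

y = [1.0, -2.0, 0.5]
w = [1.0, 2.0, 1.0]

score_zero = weighted_zero_mean_r2(y, [0.0, 0.0, 0.0], w)      # zero forecast
score_perfect = weighted_zero_mean_r2(y, y, w)                 # exact match
score_wrong_sign = weighted_zero_mean_r2(y, [-1.0, 2.0, -0.5], w)
```

Predicting the right magnitudes with the wrong sign is penalised much more heavily than predicting nothing, which matches the negative leaderboard scores mentioned in the thread.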

cedar gyro
#

will there be new unseen symbol id shown in the testing stages?

lofty ferry
#

Is it possible to submit locally rather than using kaggle notebooks? I find them rather cumbersome to use.

worthy girder
torpid wolf
#

Hello, I cannot submit, whether I choose notebook or file upload. I've already finished the verification, though.

blissful axle
#

Hello, Could we re-train our model during Evaluation time???

feral hare
#

Did anyone fix the Notebook Inference Server Error? I load a pretrained model outside of predict that seems to run quickly (18 µs with the local test.parquet), but I'm getting the server error when I try to submit.

deft moat
#

Can someone help me understand the data a bit more, please, by replying to my message? I've loaded all the data into a pandas data frame and I'm trying to plot responder_6. I can't do this nicely at the minute because of the duplicate date_id values. I tried fixing it to one symbol_id, but there are still duplicate date_ids. (This is my first data science project and competition.) My thought was that symbol_id represents, let's say, a stock ticker and responder_6 is, let's say, the price, so I thought that by fixing the symbol there would be no duplicate dates.

grim oriole
tiny gate
#

Hey, how do you know if you've submitted successfully? I run the code with a saved model on the inference server and get no errors, but can't see any submissions on my end. Thanks so much!

glossy vigil
bitter tapir
glossy vigil
bitter tapir
#

No, there are no intraday responders provided

toxic spindle
#

Hi, everyone. I'm looking for a team member, and I'm a Kaggle novice. I have a deep understanding of RL and causal AI. Before that, I worked with multi-agent systems, high-frequency trading, and market-neutral strategies. I'm going to take part in the Jane Street competition, and lately I can put in more than 5 hours on Kaggle competitions.
If you are a Kaggle Expert or higher, that's better, but it's not a necessity.
Also, I'm going to try AI research in the near future and hope for long-term collaborations.
If you are passionate about AI, please hit me up.

shy cliff
deft moat
#

Hey, I'm wondering if anyone can help me make my first submission. I think I'm setting up the inference server correctly, and I believe my notebook is in the competition environment, so I'm not sure what I'm doing wrong.

dusk rampart
#

Hey all. I have a question regarding the datasets. I'm not quite sure what features.csv shows. Can anyone explain it to me, please?

queen trench
#

Does anyone have issues with installing packages through the Kaggle Package Manager in the submission? My submission aborts with a Server Inference Error after <60 seconds.
Looks like someone had a similar issue: https://www.kaggle.com/competitions/jane-street-real-time-market-data-forecasting/discussion/543875#3039358

Submission looks fine when I run pip install and also when I use the import statement, but fails when I use the library

hardy canyon
#

Hi all, what is the lags.parquet file and how should I use it in my model prediction? Should I do something with this data, or just pass it as it is in this method?

def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    global lags_
    if lags is not None:
        lags_ = lags

    predictions = test.select(
        'row_id',
        pl.lit(0.0).alias('responder_6'),
    )

    if isinstance(predictions, pl.DataFrame):
        assert predictions.columns == ['row_id', 'responder_6']
    elif isinstance(predictions, pd.DataFrame):
        assert (predictions.columns == ['row_id', 'responder_6']).all()
    else:
        raise TypeError('The predict function must return a DataFrame')
    # Confirm has as many rows as the test data.
    assert len(predictions) == len(test)

    return predictions
feral hare
#

is there an upper limit to the total number of submission we can make?

grim oriole
#

you can make 5 submissions per day

full sail
#

Hey everyone,
I am pretty new to AI/ML and wanna participate in this competition so can anyone guide me on:

What should I learn(prerequisites)?
Any beginner-friendly resources to get started?
Thanks a lot!

bitter tapir
#

So, just reading the description of the data again:

lags.parquet - Values of responder_{0...8} lagged by one date_id. The evaluation API serves the entirety of the lagged responders for a date_id on that date_id's first time_id. In other words, all of the previous date's responders will be served at the first time step of the succeeding date.

How does this even make sense? We know that the responders' values depend not just on the date_id but also on the time_id. So which of the previous day's time_ids is served up the following day in the lagged responders? It doesn't yield all of them, as evidenced by the lags.parquet file. Are we just to assume that it's the last value of the previous day?

grim oriole
#

i think this might be helpful

bitter tapir
#

This means you can, in theory, do semi-online learning (if that were computationally feasible)

#

Thanks for the link, that completely changes my outlook on the problem

bitter tapir
west glade
#

How is everyone doing with the competition?

hollow cedar
#

Just started and very confused harold

waxen inlet
#

Just started, is anyone running the code on Kaggle? My kernel dies on just reading the parquet files

near shuttle
#

what are responders?

#

What are their relation with features?

near shuttle
#

what does responder.csv mean?

tired dagger
#

can I close my computer while submitting? If I submitted the version but it is still computing the score, will it break if close my computer?

tired dagger
#

thanks

glossy vigil
#

From the post we have:

  1. Lags only covers responders, not features.
  2. For each time_stamp, we have responder 1 day before, same time stamp.
    Am I understanding the post correctly? peepoTea This is very confusing.
west glade
#

Yep, given all features plus 1 day lagged responders to work with

lavish tusk
#

just started, have no idea how to do this kind of competition, can anyone give some pointers on how to move forward

#

?

vocal void
#

I am getting this error. This is my first competition; can someone help? Here are a few more screenshots which I think are relevant.

heady condor
oak star
#

Has anyone had success just performing daily predictions? Based on test.parquet, they've only got it at date_id =0 and time_id =0; is it reasonable just to give a daily prediction and call that enough?

fossil wharf
#

the explanation of the problem itself, and the data used is horrible for this project.

swift harbor
#

Hi guys, does anyone know why doing a test submission works, but the actual submission raises an unhandled error at runtime? Not sure what the exception is. Same thing if I manually check lags for None.

def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    predictions = test.select(
        'row_id',
        pl.lit(lags['responder_6_lag_1']).alias('responder_6'))
    return predictions
heady condor
# swift harbor Hi guys, does anyone know why doing a test submission works, but the actual subm...

I was getting that too... I don't know why but I found a certain config that fixed it.

    global lags_
    if lags is not None:
        lags_ = lags

    predictions = test.select(
        'row_id',
        pl.lit(0.0).alias('responder_6').cast(pl.Float64),
    )    
        
    if lags is not None:
        lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last() # pick up last record of previous date
        test = test.join(lags, on=["date_id", "symbol_id"], how="left")
    else:
        test = test.with_columns(
            (pl.lit(0.0).alias(f'responder_{idx}_lag_1') for idx in range(9))
        )

    # Replace this section with your own predictions
    predictions = test.select(
        'row_id',
        pl.col('responder_6_lag_1').alias('responder_6').cast(pl.Float64),
    )

I think the cast to Float64 is important, but it's hard to debug with only 5 submissions a day. Only difference here is I merged the lags into the test dataframe.

west glade
#

Anyone have any luck with Pandas?

#

My code runs fine but at the end the submissions section shows that it "Threw Exception".

rocky totem
#

Is the symbol_id representing the ticker symbol (e.g. voo, goog) or something else?

static rampart
#

I have a question, Do I have to save the submission.parquet file myself or server code will save itself for me?

swift harbor
west glade
swift harbor
west glade
#

Fudge. Thank you Danila.

swift harbor
west glade
#

As general question, if I read in the same file with pandas vs polars, is Polars really using less memory to hold the same amount of data?

heady condor
# swift harbor is this code block working on its own as a pred?

It was. let me grab the whole thing.

lags_ : pl.DataFrame | None = None


# Replace this function with your inference code.
# You can return either a Pandas or Polars dataframe, though Polars is recommended.
# Each batch of predictions (except the very first) must be returned within 1 minute of the batch features being provided.
def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    """Make a prediction."""
    # All the responders from the previous day are passed in at time_id == 0. We save them in a global variable for access at every time_id.
    # Use them as extra features, if you like.
    global lags_
    if lags is not None:
        lags_ = lags

    predictions = test.select(
        'row_id',
        pl.lit(0.0).alias('responder_6').cast(pl.Float64),
    )    
        
    if lags is not None:
        lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last() # pick up last record of previous date
        test = test.join(lags, on=["date_id", "symbol_id"], how="left")
    else:
        test = test.with_columns(
            (pl.lit(0.0).alias(f'responder_{idx}_lag_1') for idx in range(9))
        )

    # Replace this section with your own predictions
    predictions = test.select(
        'row_id',
        pl.col('responder_6_lag_1').alias('responder_6').cast(pl.Float64),
    )

    if isinstance(predictions, pl.DataFrame):
        assert predictions.columns == ['row_id', 'responder_6']
    elif isinstance(predictions, pd.DataFrame):
        assert (predictions.columns == ['row_id', 'responder_6']).all()
    else:
        raise TypeError('The predict function must return a DataFrame')
    # Confirm has as many rows as the test data.
    assert len(predictions) == len(test)

    return predictions
swift harbor
# heady condor It was. let me grab the whole thing. ```python lags_ : pl.DataFrame | None = No...

OK, I'm running this identical cell with the inference server and it's giving me a slightly frustrating "Your notebook generated a submission file with incorrect format".

import os
import kaggle_evaluation.jane_street_inference_server

inference_server = kaggle_evaluation.jane_street_inference_server.JSInferenceServer(predict)
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    print("running local gateway - submitting predictions")
    inference_server.run_local_gateway(
        (
            '/kaggle/input/jane-street-real-time-market-data-forecasting/test.parquet',
            '/kaggle/input/jane-street-real-time-market-data-forecasting/lags.parquet',
        )
    )

I'll have to try again tomo, surprised mine is being thrown errors. thx for sharing

rocky totem
rocky totem
#

Also, does anyone know if the weight column is just used for scoring the model for evaluation and unrelated to the rest of the market data?

west glade
#

Not sure but thats something you can test in your model by removing it to see if the score goes up.

rocky totem
#

ok thank you!

weary wolf
#

Are responder tags [tag_0, tag_1..., tag_4] the same as the feature tags [tag_0, tag_1..., tag_4]?

west glade
#

Anyone solve this thrown exception problem during submission? I can't understand the problem. My Predictions, Sample submission, & what gets exported for submission are identical.

#

All being read in using pandas.
Sample CSV submission:

#

My Predictions:

#

Outputted submission parquet file after attempting to submit:

#

How are others getting this to submit correctly?

weary wolf
#

Hi, The instructions (Overview > Code Requirements > Training Phase) say, "Your notebook must use THE time-series module to make predictions". What module is this?

west glade
#

All I did was add in my predictions to the given sample code. What you see above is an info dump of the data frame on my personal machine so I could compare.

#

The third image is from the file that gets spit out after using the server code in a kaggle notebook

#

The codes runs successfully

#

The output for is the issue somehow

#

Does everyone's submission parquet have these attributes?

jolly parcel
# vocal void

In your screenshot, have you looked at the very bottom of the message? I have seen that to give some clue about the error in my case.

wooden gazelle
#

Hi, does anyone know why the scoring process is so slow? I already put the model loading code before the prediction function, but even a simple lgbm takes over an hour to finish scoring.

west glade
wooden gazelle
#

Always! And it's a bit difficult to install other packages that would make inference faster in the Kaggle notebook, so new ideas involving heavier feature engineering can be difficult to evaluate.

west glade
steel lantern
# west glade Before doing massive feature engineering, did you try to do a simple configurati...
lags_: pl.DataFrame | None = None

def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame | pd.DataFrame:
    """Make a prediction."""
    global lags_
    if lags is not None:
        # Rename `responder_6_lag_1` to `responder_6` if it exists
        if "responder_6_lag_1" in lags.columns:
            lags = lags.rename({"responder_6_lag_1": "responder_6"})
        
        # Add `responder_6_2` as the square of `responder_6` if it exists
        if "responder_6" in lags.columns:
            lags = lags.with_columns(
                (pl.col("responder_6") ** 2).alias("responder_6_2")
            )
        
        # Save the processed lags globally for future use
        lags_ = lags

    # Initialize predictions with default values
    predictions = test.select('row_id', pl.lit(0.0).alias('responder_6'))

    if lags is not None and "responder_6" in lags.columns:
        # Ensure alignment between test and lags and update the responder_6 values
        pred = lags["responder_6"].to_numpy()
        predictions = predictions.with_columns(pl.Series("responder_6", pred.ravel()))

    # Return the predictions as a Polars DataFrame
    return predictions

When I try this I'm able to generate a submission file on my own, but it throws an exception when I try to submit on the competition page. Any ideas? I replaced the predict function in the provided 'Jane Street RMF Demo Submission' notebook.

#

This should just return the most recent lag of responder 6

steel lantern
#

There are a few similar questions in the discussion section but no answers. Was anyone able to debug this?

west glade
swift harbor
steel lantern
#

Hey @west glade @swift harbor.
I was able to solve the Thrown Exception error with the following code:

lags_: pl.DataFrame | None = None

def predict(test: pl.DataFrame, lags: pl.DataFrame | None) -> pl.DataFrame:
    if lags is not None:
        lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last()
        test = test.join(lags, on=["date_id", "symbol_id", "time_id"], how="left")
    else:
        test = test.with_columns(
            pl.lit(0.0).alias('responder_6_lag_1')
        )

    # Use the lagged responder_6 value as the prediction
    # Assuming 'responder_6_lag_1' is the column that represents the most recent lag
    predictions = test.select(
        'row_id',
        pl.col('responder_6_lag_1').alias('responder_6')
    )
    # test = test.write_parquet("test_out.parquet")
    return predictions

However, I now get a scoring submission error. Let me know if you guys are able to make any progress

west glade
nocturne shore
#

I believe the issue is that the lags are served on the first time step of each date. This means you need to save them outside of the predict function - presumably this is what the global lags_ object is for.

nocturne shore
eternal terrace
#

Friends, I'm getting this stubborn error after a few hours or so from submission: "Notebook Inference Server Error
Your submission notebook's inference server was disconnected unexpectedly, or a request timed out. See more debugging tips". Is that the problem due to 60-sec timeout, or could that be memory issues, etc.?

west glade
west glade
#

@swift harbor @steel lantern Josh's code worked all the way through. I was able to modify it to fit how my own code makes predictions, and it's functional. I thought I had done the same as this code right at the start, but apparently I missed something and overcomplicated it.

swift harbor
west glade
#

I found that interesting too. As our model gets better I suppose it will move into the positive.

tawny ledge
#

How long does it take to score, approximately?

west glade
#

For me I think it took 2 hours.

tawny ledge
#

ohk!

west glade
#

I didn't do anything too fancy. I only made my prediction on what's existing, then passed that through to return it.

tawny ledge
#

yeah ig it'll depend on model

west glade
#

For an actual model that is making a prediction, I think 2 hours is a good baseline.

rocky totem
#

Does anyone know what the responder values represent in the dataset?

west glade
#

I understand them to be other securities

#

If we were talking about crypto currency:
Bitcoin
Ethereum
Dogecoin
etc...

Those would all be responders.

fossil wharf
#

i stopped trying to figure this one out. All the data is based on assumptions they are working by, none of which are necessary. Just give me OHLCV vals

west glade
#

Agreed. So much time spent trying to figure things out but I'm still chugging along to at least get an ok score. OHLCV for the win.

keen kite
#

Did anyone try using sequence models? I saw that Motono223's preprocessing only creates one lagged time step

west glade
ocean shadow
#

How long does it take to submit your code?

glossy vigil
heady condor
glossy vigil
keen kite
eternal lotus
#

Hi guys, if anybody can help me: I wanted to know how I should start this Jane Street modelling challenge. I only have experience with basic models for regression and classification, plus some knowledge of more complex models like XGBoost and ensemble methods, and I want to learn through this competition. When I tried loading the dataset, my VS Code crashed (I have an M2 Pro MacBook). Is it difficult to do this modelling on a local machine? I'd really appreciate any help. Thank you!

heady condor
west glade
west glade
west glade
keen kite
#

check out this thread

west glade
#

Still so many questions 😂
Crazy that it took someone testing this out to verify it, instead of us just getting a clear instruction in the competition notes.
If this is true, then what they gave us as an example is crap.
Ah well, not enough time to rework my code to verify this, or to use it in a meaningful way if it's true.
@keen kite Thanks for pointing this out.

#

Dang just thinking about it now. That would mean that the previous time_id lag won't be given at each time_id that is not zero. Fudge.

keen kite
#

The instructions are just confusing

#

And for the test set, only one time_id lag is given

#

I think the goal is to use lag data (the previous day or all previous days) to predict the next day t=0 responders

steel parcel
#

I have a little pet hypothesis. I think that responder_6 is either implied volatility or is somehow related to implied volatility.

#

if I'm right, make sure to credit me later. KEKW

west glade
#

lol

steel parcel
#

I've been working on a very interesting solution to this problem that for now I'm gonna keep secret. Once it's over I'd be glad to share it

#

would love to discuss how others approached this problem as well

west glade
#

How long is it taking for everyone's code to run?
I'm currently sitting at 6 hours of scoring and getting worried 0_0

steel parcel
#

hmm if you run your own scoring system, does it take 6 hrs to run through it?

#

on your own pc, that is

west glade
#

@steel parcel It would be great to discuss after the competition!

#

I had a little trouble getting the kaggle evaluation package to work for me so I just focused on my model on my PC and figured I'll let the kaggle notebook do the scoring.

steel parcel
west glade
#

I did at the very start yes.

steel parcel
#

ya that might be it. what if you fractionalized the dataset and ran through it procedurally?

west glade
#

But then I soon realized that I needed to work with it in pieces probably like everyone else.
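
A rough sketch of working through the data in pieces: the statistics are accumulated chunk by chunk, so only one chunk is in memory at a time. The helper names are hypothetical and the input is a plain iterable of floats, standing in for however you actually stream rows from disk:

```python
from itertools import islice
from typing import Iterable, Iterator


def chunks(values: Iterable[float], size: int) -> Iterator[list[float]]:
    """Yield successive lists of at most `size` values."""
    it = iter(values)
    while chunk := list(islice(it, size)):
        yield chunk


def running_mean_std(values: Iterable[float], size: int = 1000) -> tuple[float, float]:
    """Mean and population std computed from per-chunk sums."""
    n, total, total_sq = 0, 0.0, 0.0
    for chunk in chunks(values, size):
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    if n == 0:
        raise ValueError("no values")
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, var ** 0.5
```

The same pattern applies to anything that can be expressed as a running sum, such as feature means for normalization.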

#

Do you know how the scoring portion actually works?

steel parcel
#

they tell us how it's calculated
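
For reference, as I understand the evaluation page, the metric is a sample-weighted zero-mean R² on responder_6: R² = 1 - Σ wᵢ(yᵢ - ŷᵢ)² / Σ wᵢyᵢ². A small sketch (the function name is my own) shows that predicting all zeros scores exactly 0 and that poor predictions go negative:

```python
def weighted_zero_mean_r2(y_true, y_pred, weights):
    """1 - sum(w * (y - p)^2) / sum(w * y^2); the baseline is predicting 0, not the mean."""
    num = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights))
    den = sum(w * y ** 2 for y, w in zip(y_true, weights))
    return 1.0 - num / den


y = [1.0, -1.0]
w = [1.0, 1.0]
score_zeros = weighted_zero_mean_r2(y, [0.0, 0.0], w)     # all-zero prediction
score_perfect = weighted_zero_mean_r2(y, y, w)            # exact prediction
score_flipped = weighted_zero_mean_r2(y, [-1.0, 1.0], w)  # wrong sign everywhere
```

Here `score_zeros` is 0.0, `score_perfect` is 1.0, and `score_flipped` is -3.0.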

west glade
#

Are they scoring on the data that they've given us to train on?

steel parcel
#

but if you're talking about the data they feed in...

west glade
#

Yeah, whatever they are using to get our Public score. Are they scoring our model on the data that they've already given us and that we are training on?

steel parcel
#

At the start of the forecasting phase, the unscored public test set will be extended up to the final day of the model training phase and the private set updated roughly every two weeks. Submissions will be rescored at the time of each update.

During the forecasting phase, the evaluation API will serve test data from the beginning of the public set to the end of the private set. You must make predictions at every timestep, but, in this phase, only predictions on the private set are scored. (You may predict 0.0 on the unscored segments, if you like.)
#

from the data tab

west glade
#

Hmmm...Seems maybe I should have spent more time getting the evaluation package to function.

steel parcel
#

you still got plenty of time

#

get in there, soldier

stable onyx
#

hey! does anyone know what the forecasting window is supposed to be? ik this is a simple question but I'm still confused

#

aka 1 time step into the future or n time steps

#

the test set they provide is 38 rows long, does this mean the window is just whatever length test set they give us?

west glade
#

It's 1 time step. The rows correspond to different symbol_ids.
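
A quick way to check this on any batch is to count distinct symbol_ids per (date_id, time_id). The batch below is synthetic: 38 rows to match the sample test set mentioned above, with made-up symbol_ids 0 to 37:

```python
from collections import defaultdict


def symbols_per_time_step(rows):
    """Map each (date_id, time_id) to the number of distinct symbol_ids seen."""
    seen = defaultdict(set)
    for r in rows:
        seen[(r["date_id"], r["time_id"])].add(r["symbol_id"])
    return {step: len(symbols) for step, symbols in seen.items()}


# Synthetic batch: 38 symbols, all at the same (date_id, time_id).
batch = [{"date_id": 0, "time_id": 0, "symbol_id": s} for s in range(38)]
counts = symbols_per_time_step(batch)
```

A single key in `counts` means the batch covers exactly one time step, with one row per symbol.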

stable onyx
#

ohh makes sense

#

awesome tysm

west glade
#

Very welcome.

#

So with regard to this lags situation, let's say we are at time_id = 25. I won't have access to any of the lags of any responders for the past 5 steps?

steel parcel
#

@west glade did you try pre-training your model before you upload it and loading it into memory to save time?

west glade
#

I was thinking of that but... we won't ever get any of the actual values until the next time_id = 0, which is the next day.

steel parcel
#

just trying to think of ways to make your work easier

west glade
#

I hope I'm wrong.

#

Thanks 🙂

steel parcel
#

try storing useful relevant data into a readable format, unless they wipe the data clean on your system it should stay... right?

#

I think in my case I'm building several contingencies for situational chaos

west glade
#

I completely agree with you, as long as they provide the last time_id's lag at every step rather than only giving us yesterday's lags at the beginning of a new day.

#

I am trying to confirm that there is something for me to store aside from what I was given for yesterday.

steel parcel
#

you might be able to extract current datetime & utilize that as a tool

#

no guarantee on that, though

#

I think that for the most part, they're shopping on kaggle for potential hires

west glade
#

ahhhhh

steel parcel
#

I don't think they care very much about the results,

#

obviously the results are useful & any code/algorithms/models they can obtain are probably worth a few pennies

#

& 50k is definitely just a few pennies to these guys

west glade
#

agreed.

steel parcel
#

I think that's why they aren't being more open about what can & can't be done

#

if you really cared strongly about the results, you'd be more specific with available options

west glade
#

They are probably enjoying watching us stress out lol

steel parcel
#

hey I find it fun

#

anonymized column data is actually kinda interesting, it got me thinking about how I can work around that

#

which I'd be really eager to discuss later

west glade
#

Same!

steel parcel
#

the math olympiad competition seems pretty challenging in a not-fun way

#

I read through it and it seemed more challenging than arc 2024

#

and not nearly as "nice"

#

I could be wrong though

west glade
#

I wish I had time to check out the other comps. I found that the spine MRI image classification was pretty cool but couldn't do it at the time.

west glade
west glade
#

And it failed at the 8th hour unfortunately.

stable onyx
#

Another question: in the sample test parquet they gave us, the is_scored field is all true, but this is the very first date and time step (which is part of the public set). they claim only the private set predictions are scored. what am I misunderstanding?

keen kite
#

anyone having 'too many requests' issue?

keen kite
#

Does anyone know how we can get the previous day's features? I saw that the lag dataset only contains responders

wheat pumice
#

I believe you can store them in a global variable manually
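
Something along these lines, as a sketch only: a bounded buffer that keeps the feature rows of the last few time steps. The names `feature_history`, `remember`, and `previous_value` are hypothetical, `feature_00` is just a placeholder column, and lists of dicts stand in for DataFrames:

```python
from collections import deque

# Keep the rows of the most recent N time steps; raise N to cover a full day.
HISTORY_STEPS = 3
feature_history: deque = deque(maxlen=HISTORY_STEPS)


def remember(batch: list[dict]) -> None:
    """Store the current time step; the oldest step is dropped automatically."""
    feature_history.append(batch)


def previous_value(symbol_id: int, feature: str, steps_back: int = 1):
    """Look up a feature `steps_back` steps before the newest stored step."""
    if steps_back >= len(feature_history):
        return None  # not enough history yet
    for row in feature_history[-1 - steps_back]:
        if row["symbol_id"] == symbol_id:
            return row[feature]
    return None  # symbol was absent at that step
```

You would call `remember(batch)` at the end of every predict call, so the buffer always holds the steps leading up to the current one.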

keen kite
keen kite
# keen kite

I am kinda confused about the data they provide. Please point out anything I got wrong in this post

west glade
# keen kite I am kinda confused about the data they provide. Please point me out if I am wron...

Everything seems right except what you said about test.parquet. As I understand it (take this with a grain of salt), you will get a single date_id and time_id each time your predict function is called, containing the features for all symbol_ids. You are right about the lags: at time_id = 0, you get the previous date_id's lags.

What I don't know is whether at, let's say, time_id = 25, we will get the lag for the previous time_id step (the responder value at time_id_{t-1}).

Someone can correct me if any of that is wrong.

split sphinx
#

Has anyone had a problem called "submission format errors"? I have tried many times and it still fails, and I can't find the issue in my code

#

This is my prediction code

keen kite
keen kite
west glade
split sphinx
#

yeah. I fixed the issue later by changing the merge logic. I needed to group by first and then merge
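
I don't know the exact code involved, but a common cause of this kind of format error is a merge that duplicates test rows when the lag table has several rows per symbol. A sketch of the group-then-merge order, with a hypothetical helper and plain dicts in place of DataFrames:

```python
def attach_lag_mean(test_rows: list[dict], lag_rows: list[dict]) -> list[dict]:
    """Aggregate lags to one value per symbol_id, then left-join onto the test rows."""
    # Step 1: group by symbol_id and take the mean of responder_6_lag_1.
    sums, counts = {}, {}
    for r in lag_rows:
        sid = r["symbol_id"]
        sums[sid] = sums.get(sid, 0.0) + r["responder_6_lag_1"]
        counts[sid] = counts.get(sid, 0) + 1
    lag_mean = {sid: sums[sid] / counts[sid] for sid in sums}
    # Step 2: merge; the output always has exactly len(test_rows) rows.
    return [{**row, "lag_mean": lag_mean.get(row["symbol_id"], 0.0)} for row in test_rows]
```

Because the lookup table has one entry per symbol, the merged output cannot end up with more rows than there are row_ids.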

steel parcel
#

man, this submission process is driving me crazy

#
Your submission notebook's inference server was disconnected unexpectedly, or a request timed out. See more debugging tips
#

and not a single clue as to why.

#

and I'm not sure how to troubleshoot this, either. the test submission ran perfectly

zealous lance
#

making the features anonymized and unintelligible without context is, IMO, a handicap when building the best possible model for predicting markets

#

less realistic to exclude domain expertise and knowledge, better for challenge innovation though 😉

hallow dust
unreal cobalt
#

what's the frequency of this data (hourly, daily or weekly)?
Did I miss where it was mentioned, or is it hidden with no way to figure it out?

grim oriole
#

it says in the data section "It's important to note that the real time differences between each time_id are not guaranteed to be consistent" which doesn't exactly say much but there are around 968 time_ids per date_id

#

one theory i've seen is that it's roughly minutely and includes trading hours and after-hours trading
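
Some back-of-the-envelope arithmetic on that theory. The 968 figure is the one quoted above; the 6.5-hour regular session is my own assumption (typical for US equities), and nothing in the data confirms the asset class:

```python
TIME_IDS_PER_DAY = 968            # figure quoted above
REGULAR_SESSION_MIN = 6.5 * 60    # assumption: 6.5-hour regular trading session

# If each time_id were exactly one minute apart:
hours_if_minutely = TIME_IDS_PER_DAY / 60
extra_minutes = TIME_IDS_PER_DAY - REGULAR_SESSION_MIN

# If instead all time_ids fell inside the regular session:
seconds_per_step = REGULAR_SESSION_MIN * 60 / TIME_IDS_PER_DAY

print(f"{hours_if_minutely:.1f} h per day if minutely")
print(f"{extra_minutes:.0f} min beyond the regular session")
print(f"{seconds_per_step:.1f} s per step if confined to the regular session")
```

So one-minute steps would span about 16 hours, which only fits if extended hours are included; confined to a regular session, the steps would be roughly 24 seconds apart.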

hallow dust
#

has anyone else had a notebook that runs successfully, but scoring inevitably ends in "Notebook thrown exception"?

I'm at my wit's end, can't get any logs or prints to debug this. I've sprinkled defensive checks everywhere (types, shapes, bounds, anything I can think of) and the code looks like messy spaghetti.
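
One pattern that keeps the checks in a single place is a wrapper around the model call that logs the traceback and falls back to 0.0. This is a sketch with hypothetical names (`safe_predict`, `model_fn`), and whether the logs are visible during the hidden rerun is not something I can confirm:

```python
import logging
import traceback

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("predict")


def safe_predict(model_fn, batch: list[dict]) -> list[dict]:
    """Call model_fn; on any exception or length mismatch, log it and predict 0.0."""
    try:
        preds = model_fn(batch)
        if len(preds) != len(batch):
            raise ValueError(f"expected {len(batch)} predictions, got {len(preds)}")
    except Exception:
        log.error("predict failed:\n%s", traceback.format_exc())
        preds = [0.0] * len(batch)
    # Always return exactly one float prediction per row_id.
    return [{"row_id": row["row_id"], "responder_6": float(p)} for row, p in zip(batch, preds)]
```

The fallback hides real bugs, so it is better suited to finding out whether the exception comes from the model code than to a final submission.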

and "Notebook thrown exception" always laughs in my face, I'm literally pulling my hair out

hasty musk
#

Hello Kaggle Team, Kaggle Community, and Competition hosts,

Our team participated in the Jane Street Real-Time Market Data Forecasting competition, and we encountered a critical issue during the forecasting period where none of our submissions were properly scored on the Private Leaderboard.

During the Public Leaderboard phase, our submissions were successfully evaluated, and we had no issues. However, once the competition transitioned into the forecasting period, every submission we made (whether it was our own developed solution, a publicly available notebook solution, or a combination of both) failed to be scored correctly. This happened regardless of whether we selected those solutions as our final submissions.

The submission logs indicate that all our submissions were marked as "Succeeded," yet they were not evaluated on the Private LB. The attached image provides evidence of this issue.

We want to clarify that our team did not engage in any rule violations or unethical practices. Given that multiple solutions were affected, we believe this could be a technical issue rather than a problem specific to our team.

Could the Kaggle team and Community please investigate this matter and provide clarification on why our submissions were not evaluated on the Private LB? We would appreciate any insights or possible resolutions.

fossil wharf
#

categories of assets

#

unfortunately, not much use for competition

fossil wharf
#

Jane Street unfortunately will always suffer from how they frame the approach, and from how they framed the problem. I may contact their clients rather than seek the prize money

worthy girder
tawny geyser
#

Can we join this competition now?

trail wedge
#

Says, "New entrants are currently not allowed. You will be able to accept the rules and submit late predictions after the competition completes."

fleet igloo
#

Hi all, I am looking for Kaggle Grandmasters who have won competitions and can mentor me. I am willing to pay for mentorship. Thank you!

marsh flint
trail wedge
fading warren
quiet fog
marsh flint
#

Yes I'm omniscient

tired lark
fossil wharf
fossil wharf
marsh flint
#

No more
