#march-machine-learning-mania-2025
Hi, can someone explain these two features (FirstD1Season, LastD1Season) ? thank you all.
First and last D1 season refer to the team's first and last season as a Division 1 team. There are five levels of college basketball, which roughly follow this order, best division first: Division 1 > Division 2 > Division 3 > NAIA > NJCAA. Only Division 1 schools are eligible to play in March Madness, so if a team's first year as a D1 school was 2004, it does not necessarily mean they were not good enough to make the 2003 tournament; it just means they were not eligible.
I will note that it is quite rare for a school to switch divisions; it is not like sporting leagues with relegation for bad teams.
I’m so grateful for your detailed answer! @cunning ore
you're welcome!
In the sample submission file, each matchup in a year (YYYY_TeamID_TeamID) occurs exactly one time. However, when I check the regular season results, there are some matchups that occur two times. This makes sense, as teams can play each other more than once in a season. But how is the Brier score calculated in this case? Does a matchup that occurs twice count twice?
We are only being evaluated on tournament results, and in the March Madness tournament, once you lose you are out, so it's impossible to have the same matchup twice.
So the predictions are only your guess of what would happen if the two teams met in the tournament, not what you believe would happen in the regular season.
The vast majority of data rows in the submission file will be ignored during the scoring, since there will only be 63 men's scores and 63 women's scores, in a given year, that will ultimately be scored. We do not score the "play-in" tournament games that occur at the very start, because those are played while it is still legal to submit predictions. Nor do we score any games that occurred during the regular season, or in the "secondary" tournaments like the NIT and WNIT that happen in parallel with the NCAA tournament. The only games that matter are those final 63 in the men's NCAA tournament and those final 63 in the women's NCAA tournament.
Will data from Nate Silver's Silver Bulletin be allowed this year, assuming that it's behind a paywall again?
Hi, I'm looking for a team. If you are interested, feel free to DM me 👍
Hello, I would like to double-check whether we are required to submit a notebook along with our predictions csv, or if we can just submit our csv files. Thanks!
Just the csv is fine
It doesn’t have to be available for free, just a “reasonable” fee. And I know that’s subjective
Thanks Jeff. One final question: The submitted csv should have all possible matchups for both men and women all the way back to 2021 (2021-2025)? Or just all possible matchups for the year 2025?
We are currently in Stage 1, which means you submit predictions for seasons 2021, 2022, 2023, and 2024, in accordance with the Stage 1 sample submission file. There are no prizes or medals for Stage 1 because, if you wanted, you could submit a perfect set of predictions just by consulting the contest data files. Once we move to Stage 2, you will submit predictions for 2025 in accordance with the Stage 2 sample submission file, which we will be releasing next week.
And we can’t start scoring those Stage 2 submissions until the tournament starts on March 20th
Ah I see! Thank you for clarifying
Hi all. Just joined and excited to mess around. First time kaggler. I think I have a pretty good idea of the setup. Only questions I have so far:
- what is the benefit of 100 submissions a day if you can just validate locally based on the 21-24 data and 25 pre-tournament data as it comes?
- you ultimately choose 2 submissions. Are they evaluated totally separately and your leaderboard position is your highest-ranking submission? What if someone has both the 1st and 2nd best submissions, for example, in that case?
-The 100 per day limit does not really matter, it just allows you a lot of tries to tune your model without setting up a manual validation scheme
-Your leaderboard position will be whichever of your submissions is the highest
Cool thank you.
Yes you will tag your 2 preferred submissions, and then when we score it, each person’s leaderboard score is just the better performing of their own 2. So you cannot win both 1st and 2nd place
It is important for people to remember to tag their two preferred submissions, since you might not like the automated way that they get picked, if you forget to tag them. We will be reminding people of this frequently over the course of the competition
Why are there like 3 copy pasted + chatgpt output public notebooks with over 10 upvotes...😭
Only difference is one notebook's LLM changed all the "" to '' 👍
there are many upvote farmers/groups active on kaggle
He atleast improved the code😄
I'm looking for team for this competition. Interested individuals dm me
wth, can someone let me know why the leaderboard has so many 0.000000 scores?
The LB for Stage 1 is for practice only. All the answers are public knowledge. Final competition results will be based on a different test data set from the NCAA Basketball tournaments in March
ty
@winter lagoon Does the SampleSubmissionStage1.csv contain all the possible matchups? Is it enough if I parse it?
Yes and yes
Thanks.
The data set doesn't have anything about players on each team - is that correct?
That’s right
Hi, could I ask what is the ddl for stage 1?
I'm not sure what ddl means? Deadline? There is no real deadline for Stage 1, no prizes or medals or anything, since the solutions for Stage 1 could easily be looked up in the competition data files. It's more for practice and model development. Stage 2 started a few days ago and constitutes the real competition. The submission deadline is the morning of March 20 (but check your time zone please!)
I think ddl also means data definition language, in which case I don't know what you mean...
Then what's the difference between the file of stage1 and 2? Which one should I parse? I want to get medals & no other goals.
In stage 2, you are making predictions prior to the 2025 season's tournament. The stage 2 submission file only contains predictions for the 2025 season. For practice, we also provide a stage 1 submission file, which includes predictions for seasons 2021-2024. This helps you understand the submission format. However, there are no prizes or medals for doing well, because the tournament game results for seasons 2021-2024 are part of the competition dataset.
sorry i just mean deadline, thanks for the clarification
Does this mean that I only have to predict the matchups in the Stage 2 submission file and nothing else? I care about medals and prizes only. 🔥
Yes that is correct, the sample stage 2 submission file indicates the matchups that must be predicted. If you like, you can parse through the sample stage 2 submission file and it will tell you which matchups to predict, and so you can prepare your own submission file that way. Once you have made any and all of your submissions, you will be able to pick two of them as your "selected submissions". That must all be done prior to the submission deadline early on March 20th (remember to check precise time, for your time zone!) Then the actual 2025 tournament starts on March 20th, and as the tournament progresses (across 2+ weeks) then everyone's scores will get updated based on what happens in the real life games. Assuming you selected two of them, the better-scoring of the two will determine your placement on the leaderboard and ultimately whether you get a medal or a prize.
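For anyone who wants to see what "parsing the sample submission file" could look like, here is a minimal sketch. It assumes the file has an `ID` column in the YYYY_TeamID_TeamID format plus a `Pred` column; the helper names and the team IDs in the toy data are made up for illustration, so check them against the real file.

```python
import csv
import io


def parse_matchup_id(matchup_id):
    """Split an ID like '2025_1101_1102' into (season, first_team, second_team)."""
    season, team_a, team_b = (int(part) for part in matchup_id.split("_"))
    return season, team_a, team_b


def read_matchups(csv_file):
    """Yield one parsed matchup per data row of a sample-submission-style CSV."""
    for row in csv.DictReader(csv_file):
        yield parse_matchup_id(row["ID"])  # "ID" column name is an assumption


# Tiny in-memory stand-in for the real file (team IDs are invented).
sample = io.StringIO("ID,Pred\n2025_1101_1102,0.5\n2025_1101_1103,0.5\n")
matchups = list(read_matchups(sample))
print(matchups)
```

With the real file you would pass `open("SampleSubmissionStage2.csv", newline="")` instead of the in-memory sample (the exact filename is a guess based on the Stage 1 name mentioned earlier).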
Thanks a lot. It's helpful.
Im starting to suspect you care abt medals and prizes only..
Is there anything to learn from a competition which is only about logistic regression?
Its not only about logistic regression, and you can always learn something from any competition anyways
And I meant it as a joke bc you repeated urself again saying you only cared abt medal and prize
Yes. And I'm learning of course. But the main objective is achievement @wicked nymph
Unless you actually innovate (by learning something new) it is pretty much a gamble if you want to get a prize at least
The validity of this statement depends on a person's situation. You won't understand.
The stage 1 file is extremely useful as a test set. For anyone new to the competition I would highly recommend using the Stage 1 file as a pure test set and filter years 2021+ out of all of your training and validation data during your model development process. This competition can be extremely easy to overfit and/or introduce data leakage from your validation. Having a test set during development is very helpful if you want to have a Stage 2 model that performs reliably.
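A rough sketch of that holdout idea, assuming each game is a record with a `Season` field (the function name and the toy rows are hypothetical): everything from 2021 onward is set aside and never touched during training or validation.

```python
def split_by_season(rows, first_holdout_season=2021):
    """Return (development_rows, holdout_rows) split on the Season field."""
    development = [r for r in rows if r["Season"] < first_holdout_season]
    holdout = [r for r in rows if r["Season"] >= first_holdout_season]
    return development, holdout


# Toy rows standing in for real game records.
games = [
    {"Season": 2019, "Result": 1},
    {"Season": 2021, "Result": 0},
    {"Season": 2024, "Result": 1},
]
development, holdout = split_by_season(games)
print(len(development), "development rows,", len(holdout), "holdout rows")
```

The same filter has to be applied to any derived features (ratings, rolling averages), otherwise 2021+ information can still leak into the development data.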
Can someone please explain in more details how the regional interpretations work (maybe with an example)? the explanation in the data section just makes it even more confusing for me.
The MSeasons and WSeasons files explicitly list the names of the regions associated with the WXYZ encoding that you will find in both the seeds and the slots files. Unless you are going to visualize the competition layout, the real names don't matter too much, because the regions are encoded so that regions W and X always play each other in the semifinals, and the same goes for Y and Z. Knowing this, you can say that any matchup between an X seed and a Y seed, or a W seed and a Z seed, is a finals matchup. Determining other rounds and tournament paths is a bit more complicated, but it can be done using a combination of the seeds and slots files. Does that help?
Real example: Here is a link to the 2024 NCAA women's bracket where the Albany 1 region should be region X in the Kaggle data. Because of how the regions are defined I know the semifinals match of South Carolina vs NC State would be seeds W01 vs X03. Of course you can also just look up the seeds by team after the fact. The understanding of W plays X and Y plays Z is only necessary for predictive simulations. https://www.ncaa.com/brackets/basketball-women/d1/2024
Okay, so the regions are more like how the brackets are gonna be set up with the winner coming out of Region W playing X and that coming out of Y playing Z, then winners between these matchups face each other in the finals, right?
Yes exactly. It's all about the tournament structure/flow
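As a small illustration of that structure, the sketch below classifies where two seeds could meet using only the region letter at the start of the seed string (for example `W01` or `X03`). The function name and the stage labels are invented for this example; working out the exact round inside a region still needs the slots file.

```python
# Regions that meet in the national semifinals, per the W-vs-X and Y-vs-Z rule.
SEMIFINAL_PAIRS = {frozenset("WX"), frozenset("YZ")}


def meeting_stage(seed_a, seed_b):
    """Classify the stage at which two seeds (e.g. 'W01', 'X03') could meet."""
    region_a, region_b = seed_a[0], seed_b[0]
    if region_a == region_b:
        return "regional"  # same region: some round before the semifinals
    if frozenset((region_a, region_b)) in SEMIFINAL_PAIRS:
        return "national semifinal"
    return "championship"  # one team from W/X against one from Y/Z


print(meeting_stage("W01", "X03"))
print(meeting_stage("X02", "Y01"))
print(meeting_stage("W01", "W16"))
```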
What's the best Brier score you have achieved?
Varies a lot by year, but mine and some others are here. https://www.kaggle.com/competitions/march-machine-learning-mania-2025/discussion/562253#3120917
In regards to overfitting, is it generally recommended to use 2021-2024 as testing data and previous seasons as training, or some other year(s)?
There is a lot of variance in the tournament results, so even 4 years isn't really all that much data. I would suggest a more robust validation scheme and only using those last 4 years as a test check. Even then, it's totally possible that you have a model that validates well but performs slightly worse than another model in the last 4 years. The hard part about this competition is that you probably need to be slightly overfit and lucky to have a chance at a strong medal.
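One possible reading of "a more robust validation scheme" (an assumption on my part, not necessarily what was meant) is leave-one-season-out validation on the pre-2021 seasons: each season takes a turn as the validation fold while the rest are used for training. A minimal sketch with a hypothetical helper:

```python
def leave_one_season_out(seasons):
    """Yield (held_out_season, train_indices, valid_indices) for each season."""
    for held_out in sorted(set(seasons)):
        train_idx = [i for i, s in enumerate(seasons) if s != held_out]
        valid_idx = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, train_idx, valid_idx


# Season label of each training row (toy data, one entry per game).
seasons = [2017, 2017, 2018, 2019]
folds = list(leave_one_season_out(seasons))
for held_out, train_idx, valid_idx in folds:
    print(held_out, "train rows:", train_idx, "validation rows:", valid_idx)
```

Averaging the per-season scores across folds gives a less noisy estimate than a single 4-year check, at the cost of training on seasons that come after the validation season.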
How can I back test on the brier loss function like you detailed in your comment? Is there a convenient way to do this on Kaggle or do I just need to write a Python script?
You can use `from sklearn.metrics import brier_score_loss`. Basically, get the compact results csv and join it to the submission so you have the result. Then do something like this to go through and calc the results.
```
# calculate the brier score for each season
for season in sub_score["Season"].unique():
    bs = mmf.calculate_brier_score(
        sub_score[sub_score["Season"] == season]
    )
    print(f"Season {season} Brier Score: {bs}")
```
mmf is my functions module:
```
def calculate_brier_score(sub):
    y_true = sub["Result"]
    y_pred = sub["Pred"]
    return brier_score_loss(y_true, y_pred)
```
Cool, thanks Anthony!
Almost forgot: you need to convert the winning and losing team IDs in the results into the lower and higher team ID format of the submission. np.where is your friend for that job.
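Here is a rough sketch of that conversion and join. It assumes the results have `Season`, `WTeamID` and `LTeamID` columns and the submission has `ID` and `Pred` columns; those names and all the numbers below are assumptions for illustration, so adjust them to the real files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a compact results file (column names assumed).
results = pd.DataFrame({
    "Season": [2024, 2024],
    "WTeamID": [1200, 1105],
    "LTeamID": [1150, 1300],
})

# Reorder each game so the lower team ID comes first, as in the submission IDs.
winner_is_low = results["WTeamID"] < results["LTeamID"]
results["LowID"] = np.where(winner_is_low, results["WTeamID"], results["LTeamID"])
results["HighID"] = np.where(winner_is_low, results["LTeamID"], results["WTeamID"])

# Result is 1 when the lower-ID (first-listed) team won, else 0.
results["Result"] = np.where(winner_is_low, 1, 0)
results["ID"] = (
    results["Season"].astype(str)
    + "_" + results["LowID"].astype(str)
    + "_" + results["HighID"].astype(str)
)

# Toy submission: the third row is a matchup that was never played.
sub = pd.DataFrame({
    "ID": ["2024_1150_1200", "2024_1105_1300", "2024_1105_1150"],
    "Pred": [0.3, 0.8, 0.5],
})

# The inner join keeps only the matchups that actually happened.
sub_score = sub.merge(results[["ID", "Season", "Result"]], on="ID", how="inner")
brier = ((sub_score["Pred"] - sub_score["Result"]) ** 2).mean()
print(sub_score)
print("Brier score:", brier)
```

The resulting `sub_score` frame has the `Season`, `Result` and `Pred` columns that the per-season loop above expects.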
How are y'all accounting for the Flagg injury? I’m thinking you have to use some metric, like the point spread for the first game versus expected, to do an efficiency margin correction that values player worth.
I have accepted I have to just gloss over injuries and it's gonna hurt my score
Flagg injury 😳?
Wow
But it doesn’t look too terrible
Based on my immense medical expertise
im thinkin this too lol next year though I want to have a more comprehensive model
I don't know how to do this without a massive dataset of individual player statistics. Does one exist that isn't behind a paywall?
Somebody posted a link in the Discussion forum to a dataset that I think was hosted on Kaggle and I know had individual stats in it. I am struggling to find the link though; I believe it was about a week ago
OK I found it. Here's the forum posting about it, and it references a dataset called ncaahoopR : https://www.kaggle.com/competitions/march-machine-learning-mania-2025/discussion/562473#3144445
Direct link to the dataset is here: https://github.com/lbenz730/ncaahoopR_data
I have no idea about the quality or usefulness of the data; it was just something I noticed in passing
I noticed in MNCAATourneyCompactResults.csv, there are only 66 lines starting with the year 2021, whereas the other years (like 2018, 2019, 2022, 2023, 2024) have 67 (first 4 + 32 + 16 + 8 + 4 + 2 + 1). I'm checking with grep "^2021" MNCAATourneyCompactResults.csv | wc -l. I don't see anything different with the format of 2021 bracket: https://www.espn.com/mens-college-basketball/bracket/_/season/2021/2021-ncaa-tournament
I'm planning on leaning on Vegas odds as a modifier to my model, with the hope that it will reflect any recent injuries... Not sure how obviously the odds would reflect an injury in a lopsided 1 v 16 matchup, though.
There was a COVID forfeit that year, so one game was not played
Yeah maybe it’s best to lean on overall odds to win it all versus odds versus a 16 seed. But as a UVA fan who saw us lose to UMBC after we lost DeAndre Hunter that season, maybe there’s something to it
I'm getting 0.1706 average brier over 2021-2024 mens. Surely overfit or did I stumble on a gem?
@winter lagoon Can I get permission for my child, who is under 18, to participate in this competition please? I have given my permission from my end. He is good at his studies and also wants to show his skills at ML. I believe that in this rapidly changing world, skills are the most important thing, as much as academic grades. Your permission can help him to upskill himself.
I'm not really a Kaggle person who can grant that, sorry! I think maybe if he competes then he wouldn't be eligible for prize money, but you would have to check the rules.
This seems like a good score! My model is around 0.175 (and my model is certainly NOT the best possible). It's still possible you are slightly overfit, but you kind of need to be somewhat overfit to have a chance of winning in this competition.
are we not predicting the two men's & women's play-in games that happen tomorrow & wednesday?
They don’t count toward your score. The submission deadline is not until Thursday morning, and the games that are scored will start on Thursday
Will those games appear in the data?
They will not. The second half of the women's play-in games actually happens after tip-off of the men's first round, which makes trying to incorporate play-in data pretty difficult.
gotcha thanks!
do you know the typical highest scores? this is my first year, I dont expect to place by any means but I am curious, it feels like im shootin in the dark!
It varies a lot, because it really depends on how many upsets there are each year
To follow up on my question that you replied to @mild pebble, there indeed was a bug in my model. I'm doing much, much worse when I verify against 2021-2024 men's now (above 0.2 😢 ) but at least that's more accurate! Didn't want to give the impression I'm crushing it
That makes a lot of sense
Definitely better that it’s more accurate though!
good luck to anybody and everybody still grinding in the final hours!! been a fun and challenging project, can't wait to see some results
Sorry I'm late responding but you should also look at this thread including responses from Jeff! https://www.kaggle.com/competitions/march-machine-learning-mania-2025/discussion/562816
Bummer! I have made this mistake many times in this competition, but the challenge of this data set is what keeps me coming back every year! And the madness 🙃
I'm a bit confused about the submission format: am I supposed to put the results for every possible matchup between all teams in WTeams.csv and MTeams.csv?
Or can I just submit a file with the teams in the tournament, now that we know what they are?
Your submission file should contain all possible matchups between men's teams, concatenated with all possible matchups between women's teams. An overwhelming majority of those matchups will never happen in real life (so your prediction values for them don't really matter as they won't be scored).
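To make "all possible matchups" concrete, here is a minimal sketch that builds the IDs for one season from a list of team IDs, lower ID first. The function name and the team IDs are invented; in practice the sample submission file remains the authoritative list of rows to predict.

```python
from itertools import combinations


def all_matchup_ids(season, team_ids):
    """Build 'Season_LowID_HighID' strings for every pair of teams."""
    return [
        f"{season}_{low}_{high}"
        for low, high in combinations(sorted(team_ids), 2)
    ]


men = [1103, 1101, 1102]  # made-up men's team IDs
women = [3102, 3101]      # made-up women's team IDs

# Men's matchups concatenated with women's matchups; no men-vs-women rows.
ids = all_matchup_ids(2025, men) + all_matchup_ids(2025, women)
print(ids)
```

For n teams this gives n * (n - 1) / 2 rows per gender, which is why the file is so much larger than the 63 games per tournament that end up being scored.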
No problem, thanks for the response regardless! Ill check it out!
Does anyone know how to look at the score for both of your submissions? I can only see the score on the top submission.
Thanks!
did you guys choose to train your final model on just previous years post-season games or on both previous years regular-season and post-season games
I trained on regular season data with tournament games being the target
When using a typical ML approach I normally use only NCAA tournament games for training. This year I'm using a power ratings style approach that only uses regular season data as the input
Hey all, does anyone know why there are two scores on the leaderboard
Can somebody explain to me how the leaderboard works? I’m interested in knowing who’s in first place… Is it by accuracy of picks? Who is the highest? I’d love to know the strategy behind the top leaders.
Everyone submitted predictions for all possible game matchups. A prediction is a decimal number somewhere between 0 and 1. So if you submit 0.75 for that row, you are saying a 75% chance for the first team to beat the second team. Then when all the actual games get played, the result is either a 1 or a 0, depending on whether the first team beat the second team. We square the difference between the prediction and the actual result, and evaluate each player's two selected submission files. The leaderboard shows everyone's better-performing submission out of the two they selected.
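A small worked example of that scoring, as a sketch: the standard Brier score is the average of the squared differences over the scored games, and lower is better. The function name and the numbers are made up for illustration.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 results."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Two games, both predicted at 0.75 for the first-listed team.
# First game: the first team wins (1). Second game: it loses (0).
score = brier_score([0.75, 0.75], [1, 0])
print(score)

# Predicting 0.5 everywhere always scores 0.25, a useful baseline.
baseline = brier_score([0.5, 0.5], [1, 0])
print(baseline)
```

Here the correct pick costs (0.75 - 1)^2 = 0.0625 and the wrong one costs (0.75 - 0)^2 = 0.5625, so confident misses are punished much more than confident hits are rewarded.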
Hello Everyone!
I am Shashank, with 3+ years of experience in the domain of data - I am very much interested in creating and deploying end-to-end machine learning models.
Going deeper, models related to Artificial Intelligence also fascinate me, both to work on and as a way to deliver proper solutions for businesses.
Students/professionals with the same interests can connect with me on LinkedIn: https://www.linkedin.com/in/snkp0018
Happy Learning!
Best,
Shashank Pandey
one
i was just testing kernel bot reports it or not