#march-machine-learning-mania-2024
Can anyone confirm that, of MTeams.csv and WTeams.csv, only MTeams.csv has the first and last D1 season columns?
The WTeams.csv file does not have D1 dates in it. You would need to look in the WTeamConferences.csv file as the easiest way to figure out which teams are present in which seasons.
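In case it helps, here is a minimal sketch of that lookup in Python. The column names (Season, TeamID, ConfAbbrev) are my assumption about the file layout, and the rows are made up so the snippet runs on its own; point it at the real file contents instead.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up stand-in for WTeamConferences.csv.
# Assumed columns: Season, TeamID, ConfAbbrev.
SAMPLE = """Season,TeamID,ConfAbbrev
2023,3101,a_sun
2023,3102,swac
2024,3101,a_sun
"""

def teams_by_season(csv_text):
    """Map each season to the set of TeamIDs listed for that season."""
    seasons = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        seasons[int(row["Season"])].add(int(row["TeamID"]))
    return dict(seasons)

present = teams_by_season(SAMPLE)
print(sorted(present[2023]))
print(sorted(present[2024]))
```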
hello there!
I have a question, if the submissions will be graded against a dummy solution then how can we check the quality of our solution?
Hi, you are not supposed to check the quality with the test data. In some competitions the leaderboard appears in real time, but you're not supposed to use this information. Otherwise, that would bias your model.
You need to use previous competitions to train your model. You can split the data into training and test data, and from that, estimate the generalization capability of your model.
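A rough sketch of one way to do that split, holding out a whole past season as the test set. The game rows here are invented for illustration, and the helper name is hypothetical.

```python
# Made-up game rows; only the Season field matters for the split.
games = [
    {"Season": 2021, "WTeamID": 1101, "LTeamID": 1102},
    {"Season": 2022, "WTeamID": 1103, "LTeamID": 1101},
    {"Season": 2023, "WTeamID": 1102, "LTeamID": 1103},
]

def split_by_season(games, holdout_season):
    """Train on seasons before the holdout season, test on the holdout season."""
    train = [g for g in games if g["Season"] < holdout_season]
    test = [g for g in games if g["Season"] == holdout_season]
    return train, test

train, test = split_by_season(games, 2023)
print(len(train), len(test))
```

Repeating this with several different holdout seasons gives a less noisy estimate than a single split.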
The score of the public LB is obtained from testing the model on 2023 season am I right?
Yes, you are correct. But that doesn't mean you need to follow this. You can use another set of test data if you want, since the actual evaluation will be done on data that is yet to come (after the competition is closed).
Submission deadline is the 21st, but Selection Sunday is on the 17th. So Selection Sunday is before the deadline. However, this conflicts with this description of the data: "After Selection Sunday, when the competition has closed, we will replace this file with the actual 2024 tournament selections and rescore your submissions against the 2024 results." Any ideas?
There is a warmup phase; that's probably what "the competition" refers to there.
The actual bracket will be released on 3/17 (Selection Sunday) and the first games begin on 3/21. The cutoff should be before the first games begin on March 21
Please please please I am begging for the evaluation metric to change. This scoring function does not incentivize predictions that reflect true unbiased opinions!
We got an update to the scoring format, it's now bracket brier score wooo
The leaderboard does seem a bit out of sorts at the moment though. I know a lot of people have just submitted perfect brackets, but I would have figured that scoring brackets based on prediction values from last year would get you a brier score around 0.25 while chalk would get you to 0.3 or so (based on this 538 forecast evaluation https://fivethirtyeight.com/features/how-fivethirtyeights-ncaa-tournament-forecasts-did/). I'm currently getting 0.075 for last year's winner's prediction values and 0.08 for the median expert predictions. Unless I'm misunderstanding how the scoring is done.
Are you remembering to divide by six?
Can someone explain how the score is being calculated? I'm new to this competition and trying to understand it.
From the above chat, is it true that this competition was held last year too?
Yeah this is about the 7th time for it
It’s because that 538 article did not score their Brier scores how they describe. They only evaluated each round's Brier score based on the teams that were left. Thankfully, the competition scoring is evaluating all 64 teams every round, so having a bunch of 16 seeds with ~0% title odds decreases the Brier score in a way 538 didn’t do. If Kaggle didn’t score like this, there would once again be a major incentive to simply guess the champion correctly, so good on them for not doing exactly what 538 did.
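To make the difference concrete, here is a toy sketch with 4 teams and 2 rounds. This is only my reading of the two approaches, not Kaggle's or 538's actual scoring code, and all the probabilities and results are made up.

```python
# pred[team][r]: forecast probability that the team wins its round-r game.
# actual[team][r]: 1 if the team really won its round-r game, else 0.
pred = {
    "A": [0.8, 0.5],
    "B": [0.2, 0.1],
    "C": [0.6, 0.3],
    "D": [0.4, 0.1],
}
actual = {
    "A": [1, 1],
    "B": [0, 0],
    "C": [1, 0],
    "D": [0, 0],
}

def round_brier(pred, actual, rnd, teams=None):
    """Mean squared error for one round, over all teams or a chosen subset."""
    teams = list(pred) if teams is None else teams
    return sum((pred[t][rnd] - actual[t][rnd]) ** 2 for t in teams) / len(teams)

# Final round scored over every team, including the ones already eliminated.
all_teams = round_brier(pred, actual, 1)
# Final round scored only over the teams still alive (A and C).
survivors_only = round_brier(pred, actual, 1, teams=["A", "C"])
# Averaging the per-round scores over the number of rounds.
overall = sum(round_brier(pred, actual, r) for r in range(2)) / 2

print(all_teams, survivors_only, overall)
```

The eliminated teams with near-zero forecasts contribute almost nothing to the sum but still count in the denominator, which is why the all-teams number comes out lower.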
Hi, I am a little confused about the submission format. Am I submitting probabilities for this year's tournament or last year's? Thanks.
You are practicing against last year's data for now, since you can only be measured against last year's tournament results. But since the "ground truth" is known, there are no prizes at stake yet. Once the tournament teams are known for this year, you will have a few days to submit for this year instead.
But you are not directly submitting probabilities, you are submitting one or more simulated bracket results, and the probabilities will be inferred from the averages of the multiple brackets you submit.
So a bit similar to persisting all of the iterations of a Monte Carlo simulation?
Correct
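Roughly, the averaging could look like the sketch below. It treats each bracket as a plain dict from slot name to winning team, which is a simplification of the real submission file, and the team IDs are invented.

```python
from collections import Counter, defaultdict

# Four made-up simulated brackets, each mapping slot -> winning TeamID.
brackets = [
    {"R1W1": 1101},
    {"R1W1": 1101},
    {"R1W1": 1101},
    {"R1W1": 1116},
]

def slot_probabilities(brackets):
    """Fraction of brackets in which each team wins each slot."""
    counts = defaultdict(Counter)
    for bracket in brackets:
        for slot, team in bracket.items():
            counts[slot][team] += 1
    n = len(brackets)
    return {
        slot: {team: c / n for team, c in ctr.items()}
        for slot, ctr in counts.items()
    }

probs = slot_probabilities(brackets)
print(probs["R1W1"])
```

With a single bracket, every team that appears gets a fraction of exactly 1.0, which is why submitting just one bracket amounts to claiming certainty.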
There are two types of dataset, compact and detailed. Which one should I use for prediction?
respond
The compact dataset has more years of data but fewer variables (only winning score and losing score). The detailed data has many other variables in addition to winning and losing score, but is only available for more recent years
Hello everyone, could someone clarify if we are allowed to collect a train set (all data is public available) on a local machine and use it for training in a Kaggle notebook?
Yes that is fine
@keen quiver There seems to be some incorrect data in 2024_tourney_seeds.csv. See this thread: https://www.kaggle.com/competitions/march-machine-learning-mania-2024/discussion/483162 I checked it as well and there do seem to be some incorrect seeds. Thanks for all the support on this competition!
Forecast the 2024 College Basketball Tournaments
OK thanks, I will look at it
My workflow has been to do feature engineering, model training, and predictions all locally, then to put those predictions into a bracket simulation on Kaggle and run it there to generate submissions. All the data that I'm adding on top of efficiency metrics generated from the team box scores is free and publicly available (preseason AP polls, play-by-play data, All-American teams, etc.). However, I'm assuming that if you end up in the prize money, you'll have to open up what you do locally for auditing.
I am not quite sure what the verification process is, but yeah the bar is that people should be able to replicate the use of your data without significant expense
Would it be more advantageous to submit 100k brackets with binary outcomes for each team's round advancement, so that the scoring process inherently averages them all out, or to submit one bracket with just model odds? In reality they should be nearly identical with N=100k, no?
If you only submit one bracket then you are saying 100% chance for that outcome, since each bracket can only have 0/1 result in each game
Oh I must have missed that the entries couldn't have floats. Thanks!
One more thing, when simulating the slots stay consistent and basically you’re just mapping to the winning team that overtakes that higher seed right? Like the champ will always be R5WX vs R5YZ and so to simulate you effectively simulate taking over that ID?
Sorry I’m actually having a lot of trouble figuring out how to simulate a game and propagate to the next round, are there any sample notebooks for that?
Yes, if you look at the discussion topic "Return of the Seed Benchmark"
If you look at the NCAA Tourney Slots file from the data, you see that for example slot R4W1 is made up of the "stronger seed slot" R3W1 and the "weaker seed slot" R3W2. If all the strongest seeds win their games then it will indeed be W01 against W02 in that game, whereas if for example W15 had defeated W02 in the first round they did indeed "take over their seed", since they would then face all the teams in order that W02 would have faced. You can think of it as "taking over the seeds" if you like, but I think it might be better to think of it as "winning the slot", where the slots are just usefully-named games. So the slot R1W2 is a first round game where W02 faces W15, and the winner (whether it be W02 or W15) advances to be the "strong seed" in the slot R2W2, which is a second round game between the winners of the slots R1W2 and R1W7.
It is certainly true that "strong" and "weak" get muddled up once there's an upset, and also that it's not really meaningful to talk about a stronger seed in the final two rounds, but it's still arranged that way in the NCAA Tourney Slots file so that the whole structure is captured.
There's nothing wrong with thinking about it in terms of taking over the seed from the higher seed team (which itself might have taken the seed over from another stronger-seeded team in an upset); that works too.
In your eventual submission file, you are not submitting floats or 0/1 integers, you are identifying the teams that win each slot (also could be called "each game") for each bracket you are submitting
And so if you just submit one bracket, then whoever you have as winning the R4W1 slot is basically being given a 100% chance to get to the final four.
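Here is a minimal sketch of the "winning the slot" idea on a hypothetical 4-team mini region. The slot rows follow the Slot / StrongSeed / WeakSeed layout described above, but the slot pairings, team IDs, and the coin-flip `win_prob` are all placeholders rather than real data or a real model.

```python
import random

# Hypothetical mini region: (Slot, StrongSeed, WeakSeed), ordered by round.
SLOTS = [
    ("R1W1", "W01", "W04"),
    ("R1W2", "W02", "W03"),
    ("R2W1", "R1W1", "R1W2"),
]
# Made-up seed -> TeamID assignments.
SEED_TO_TEAM = {"W01": 1101, "W02": 1102, "W03": 1103, "W04": 1104}

def win_prob(team_a, team_b):
    """Placeholder model: chance that team_a beats team_b."""
    return 0.5

def simulate_bracket(slots, seed_to_team, prob_fn, rng):
    """Play every slot once and return slot -> winning TeamID."""
    occupant = dict(seed_to_team)  # seed name or slot name -> team holding it
    results = {}
    for slot, strong, weak in slots:
        a, b = occupant[strong], occupant[weak]
        winner = a if rng.random() < prob_fn(a, b) else b
        occupant[slot] = winner  # the winner now "owns" this slot
        results[slot] = winner
    return results

rng = random.Random(0)
brackets = [simulate_bracket(SLOTS, SEED_TO_TEAM, win_prob, rng) for _ in range(1000)]
print(brackets[0])
```

Because later slots refer to earlier slot names rather than seeds, an upset winner is automatically carried forward into every game the beaten team would have played.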
What all files are we supposed to use?
It's in the Data section
How are yall working around the play-in games? Are you just waiting until they’re all played on Tuesday/Wednesday to update your simulated brackets for whichever of the teams win? Or are you simulating the winner of those play-in games in advance and then matching up whoever won that simulation in their respective game slots? Just thinking of teams like 2021’s UCLA men’s team who won their 11 seed play-in (in OT!!) and advanced all the way to the Final Four.
The moment of truth is here folks. The selection committee put 3/4 of the final 4 teams from last year all in the same half of their region, already off to a spicy start
Yeah I'm thinking I'm going to have things mostly set up and then submit a final prediction after the play-ins. Think this may be the first time they've put them at the 10 seed
How do I join season-level team variables to a dataset where each row is a head to head matchup that is split into team 1 and team 2 game specific variables in R studio?
I don't think I can help you much since I don't know anything about R studio. But most people seem to randomly assign team 1 versus team 2 for each matchup, since if you just use what's in the Compact Results data, the first team (WTeamID) is the one that won the game and the second team (LTeamID) is the one that lost the game. But conceptually, I would assume you need two joins to that same season-level team variables table, once from Team 1 and once from Team 2
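Not R, but here is the same idea sketched in Python: randomly decide which team is Team 1, then look up the season-level table twice. The stats table, the `WinRatio` feature, and the output column names are made up for illustration.

```python
import random

# Made-up season-level features keyed by (Season, TeamID).
team_stats = {
    (2023, 1101): {"WinRatio": 0.80},
    (2023, 1102): {"WinRatio": 0.55},
}
# Compact-results style row: the winner is always listed first.
games = [{"Season": 2023, "WTeamID": 1101, "LTeamID": 1102}]

def build_rows(games, team_stats, rng):
    """Build one matchup row per game with randomized team order."""
    rows = []
    for g in games:
        t1, t2, label = g["WTeamID"], g["LTeamID"], 1  # label 1: Team 1 won
        if rng.random() < 0.5:  # swap so the winner is not always Team 1
            t1, t2, label = t2, t1, 0
        s1 = team_stats[(g["Season"], t1)]  # first join, for Team 1
        s2 = team_stats[(g["Season"], t2)]  # second join, for Team 2
        row = {"Season": g["Season"], "Team1": t1, "Team2": t2, "Team1Won": label}
        row.update({f"T1_{k}": v for k, v in s1.items()})
        row.update({f"T2_{k}": v for k, v in s2.items()})
        rows.append(row)
    return rows

rows = build_rows(games, team_stats, random.Random(0))
print(rows[0])
```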
Because of how many automatic qualifiers had upsets in the conference tournaments, it pushed the seed line for the Last Four In to the 10 seed and had a bunch of bubble teams get bounced. My Virginia Cavaliers were lucky to make the tournament, considering I had no hope they would make it over teams like St. John's and OU. Glad to have made it, but I don't like how I'll actually be sweating the play-in games when they're usually just a test point for the model I'm putting together.
That helps, thank you!
Do we know when the 2024 data (minus First Four) will be available?
Should be by tomorrow morning (Monday morning US time)
I just checked, and 2024_tourney_seeds still seems to have last year's seeding
Should we expect the updated data to be dropped shortly?
so the 2024_tourney_seeds isn't updated yet but I just checked MNCAATourneySeeds and 2024 data is in there
Oh bet thanks!
There is a forum topic about the latest release: https://www.kaggle.com/competitions/march-machine-learning-mania-2024/discussion/484889
Just wanted to double check, it doesn't matter if we start RowId at 0 or 1? I ask because I see it differently in the sample submission file and on the website
Hello. In the data description for Section 1, for the TeamName column, it says that there are 68 teams in the NBA, but when I merged the Section 1 datasets for men there were 233 unique values in TeamName. Why is that?
I built a simple and basic logistic regression model with 3 variables, WinRatio, GapAvg, and SeedDiff thanks to the help of a Kaggle notebook to get me started. I've always wanted to build a working March Madness predictive model but dealing with sports data is hard for me since I have little experience. Now I have a working model I can improve on next year. I had trouble setting up my model in a way to give predictions in the submission format but I copied the data and coefficients for each team into Excel and filled out a bracket manually 🙂
How do yall adjust for player injuries like McCullar for Kansas? He was just ruled out for the tournament and he’s their best player
What I usually do is avert my eyes and grit my teeth
As a UVA fan, I still get PTSD from DeAndre Hunter being ruled out right before the tournament and then losing to 16 seed UMBC. I’m just extra wary of injuries because they can tank a team’s chances in a way our models can’t really quantify. A way to work around it would be using Vegas lines to extrapolate team ratings and adjust for those teams. Easy to say, harder to execute, especially if you’re backfilling game spreads back to 2003
Just be glad it was only one player. I'm still traumatized from Kristin Folkl and Vanessa Nygaard tearing their respective ACLs right before #1 Stanford played #16 Harvard.
Could someone please help me out with this. 
The regular season compact results contain all the games played among Division I college teams, of which there are about 360 men's teams and 360 women's teams. At the end of the regular season, 68 men's teams and 68 women's teams are picked to play in the NCAA tournaments. So the NCAA Tourney Seeds files tell you those 68 teams.
I think maybe what you are seeing with the 233 is that there are 233 different teams that have made the tournament across different years. For this year you need to look at the data rows where Season=2024
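A toy illustration of why the all-years count is bigger than the single-season count. The rows are invented, in the shape I assume for the NCAA Tourney Seeds file (Season, Seed, TeamID).

```python
# Made-up rows shaped like the NCAA Tourney Seeds file.
seeds = [
    {"Season": 2023, "Seed": "W01", "TeamID": 1101},
    {"Season": 2023, "Seed": "W02", "TeamID": 1102},
    {"Season": 2024, "Seed": "W01", "TeamID": 1102},
    {"Season": 2024, "Seed": "W02", "TeamID": 1103},
]

# Distinct teams that appear in any season.
all_time = {r["TeamID"] for r in seeds}
# Distinct teams that appear in the 2024 rows only.
this_year = {r["TeamID"] for r in seeds if r["Season"] == 2024}

print(len(all_time), len(this_year))
```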
So basically my merging would be then incorrect right because merging just season-1 teams and regular season teams would be inaccurate.
Each year there is both regular season and NCAA tourney. In both cases, this year's data has Season=2024
It won't make any sense to just merge Division I teams and regular season teams, right?
All that will do is tell you the names of the teams that played in the regular season games
Yeah, but then it becomes a very large dataset and you just can't plot anything; it all gets blurred.
For the most part you can just use the TeamID to uniquely identify each team, and that is used in the regular season games, the NCAA tourney games, and the NCAA tourney seeds. The general problem you are trying to solve is how to take the Season=2024 Regular Season Compact Results data (for each team ID) along with the Season=2024 NCAA Tourney Seeds data (for each team ID) and predict what will happen when those teams participate in the 2024 tournament. To help you in developing a predictive model, you can look at the Regular Season Compact Results, NCAA Tourney Seeds, and NCAA Tourney Compact Results from previous years (where Season < 2024)
Oh okay thank you very much for your help.
Hey! I Just want to thank the organizers for putting this together, this is the first time I have joined a Kaggle competition and it was super fun. Thanks, everyone!
Bit confused: why does it say on Kaggle that the closing date for the competition is April 11th when the actual tournament started Tuesday?
Because it takes a few weeks for all the games to be played, so we won't know the contest winners for a while. Submission deadline is Thursday morning. It's complicated because there's a few games played on Tuesday and Wednesday
Technically the games Tuesday and Wednesday are play in games to get into the field of 64, then round 1 starts
Are there any sample workbooks on getting the seed matchup for historical tourney games? Having a lot of trouble with this but would like to incorporate it in my model.
Moreover, what’s a decent pure accuracy score for my own edification?
With an augmented dataset I’m stuck at low 70s for pure accuracy which seems identical to just taking the higher seed so I’m quite curious if I’m doing something very wrong in prepping my training data
You can join the two teams in a historical matchup to the NCAA Tourney Seeds record for that season for each of the two teams, and that tells you their seed
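A small sketch of that lookup, assuming seed strings look like a region letter, a two-digit seed number, and an optional play-in letter (for example "W01" or "X16a"). The lookup table and team IDs are made up.

```python
def seed_number(seed):
    """Pull the numeric seed out of a string like 'W01' or 'X16a'."""
    return int(seed[1:3])

# Made-up (Season, TeamID) -> Seed table built from the seeds file.
seed_lookup = {
    (2023, 1101): "W01",
    (2023, 1116): "W16a",
}

def matchup_seeds(season, team1, team2, lookup):
    """Return both numeric seeds and their difference for one matchup."""
    s1 = seed_number(lookup[(season, team1)])
    s2 = seed_number(lookup[(season, team2)])
    return s1, s2, s1 - s2

print(matchup_seeds(2023, 1101, 1116, seed_lookup))
```

Keying on Season as well as TeamID matters because the same team usually has a different seed each year it qualifies.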
Hi @keen quiver
When will 2024_tourney_seeds.csv get updated with the new play-in game winners? I still see Boise St in the file.
Approximately how long should scoring take to run? I made a submission over 20 minutes ago and was hoping to get it in before the competition closes in a few minutes, but I'm not sure if it will be counted as one of my submissions if it doesn't score before 4 UTC?
Are you able to mark its checkbox to be one of your two selected?
No, because it does not have a score yet, I don't see the checkbox
I don't really have the access to help you on this. Maybe you could delete all your other ones? I know everything will get rescored once the deadline passes
do you know how to delete version 5 without deleting the entire notebook?
I don't really know what I am talking about with the notebooks, so I probably shouldn't suggest anything. I have barely ever done notebooks at all
I don't know if we can do that but you got your request out in time at least
I hadn’t made any submission on the new updated 2024 data until yesterday morning, but my top scoring submissions (on the 2023 test data) were auto selected despite being submitted like 2 weeks ago. This was an error and oversight on my part, as I was not aware until after the deadline that we had to select submissions after hitting submit in the notebook. I thought it was done automatically, especially because these were the only two submissions I sent after Selection Sunday. I was used to the old system where you just uploaded the CSV file and the selection box was right there. I was traveling yesterday, so in all the hectic mess I neglected to select those two most recent submission notebooks.
Do you think it’d be possible to switch my submissions to these picks because there was no way I intended to use the submissions I selected before selection Sunday even occurred! Heck, I submitted them before the scoring criteria was changed.
I’ve been doing this competition for 4 years and I put a lot of work into this year, so I’d be incredibly disappointed if my relevant work wasn’t included. 😦
I doubt it as round 1 is almost over now.
Yeah but they are the only two submissions that were made in a window that makes sense
Probably organisers can comment on it, I am also facing some issues
Can anyone tell me why this is the case? I mean, we had a score on the public LB, we could execute the complete code on Kaggle, and in fact we have our submission files with us as well, but now this error?
Also, there are no comments on our other solution, nor can we see ourselves on the current leaderboard
Have any of the submissions been released or will they at some point? Wondering who to root for or how stable the top 10 will be...
We normally don't reveal submission details because then some people will analyze and point out that only three out of a thousand people aren't already eliminated from first place chances. That's fun to do but also disheartening. So we preserve the suspense a little bit. Sometimes people high up on the leaderboard will share details with each other about their predictions, so they know specifically who to root for.
Has anyone tried LSTM for march madness or having any opinions on using it for predicting sports results?
It wouldn’t not work, but seems like it adds unnecessary complexity to me
Do all games count the same or are the final games more important?
The final games are more important; check the description
I read that in robmulla's YouTube video. It's more like
4, 8, 16, 32
Any updates on finalizing the LB? It's been 9 business days...