#march-machine-learning-mania-2024 | Kaggle | Page 1

uneven chasm Feb 28, 2024, 7:54 AM

#

Very excited for the competition this year and overhauling my code for the new submission format. It seems more applicable for actual bracket pools, but I did like the predicted win probabilities of every potential tournament matchup.

umbral siren Feb 28, 2024, 1:27 PM

#

Can anyone confirm the data for MTeams.csv and WTeams.csv? Is it right that only MTeams.csv has the first and last D1 season columns?

keen quiver Feb 29, 2024, 6:58 AM

#

umbral siren can anyone confirm the data set for MTeams.csv and WTeams.csv, only MTeams.csv h...

The WTeams.csv file does not have D1 dates in it. You would need to look in the WTeamConferences.csv file as the easiest way to figure out which teams are present in which seasons.
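
A minimal pandas sketch of that lookup, assuming WTeamConferences.csv has `Season`, `TeamID`, and `ConfAbbrev` columns. The rows below are made-up stand-ins for the real file, and `span` is just an illustrative name.

```python
import pandas as pd

# Toy stand-in for WTeamConferences.csv; the column names are an
# assumption about the file layout and the values are invented.
conf = pd.DataFrame({
    "Season": [1998, 1999, 2000, 1999, 2000],
    "TeamID": [3101, 3101, 3101, 3102, 3102],
    "ConfAbbrev": ["aaa", "aaa", "bbb", "ccc", "ccc"],
})

# First and last season in which each team appears in the file.
span = (
    conf.groupby("TeamID")["Season"]
    .agg(FirstSeason="min", LastSeason="max")
    .reset_index()
)
print(span)
```

With the real file, `pd.read_csv` would replace the hand-built frame.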

wet oar Mar 1, 2024, 11:40 AM

#

hello there!
I have a question, if the submissions will be graded against a dummy solution then how can we check the quality of our solution?

pulsar hare Mar 1, 2024, 3:26 PM

#

wet oar hello there! I have a question, if the submissions will be graded against a dum...

Hi, you are not supposed to check the quality with the test data. In some competitions the leaderboard appears in real time, but you're not supposed to use this information. Otherwise, it would bias your model.

pulsar hare Mar 1, 2024, 3:31 PM

#

wet oar hello there! I have a question, if the submissions will be graded against a dum...

You need to use previous competitions to train your model. You can split the data into training and test data, and from that, estimate the generalization capability of your model.
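
One way to sketch that split is to hold out the most recent season and score predictions on it. Everything here is hypothetical: the rows, the `predict` placeholder, and the choice of Brier score as the hold-out metric.

```python
# Hypothetical rows: (season, seed_diff, team1_won), where seed_diff is
# team 1's seed number minus team 2's. Real rows would be built from the
# historical tournament results files.
games = [
    (2021, -15, 1), (2021, -3, 0), (2022, -7, 1),
    (2022, 1, 0), (2023, -9, 1), (2023, 5, 1),
]

def predict(seed_diff):
    """Placeholder model: favor the better (lower-numbered) seed."""
    return 0.7 if seed_diff < 0 else 0.3

def brier(rows):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((predict(d) - y) ** 2 for _, d, y in rows) / len(rows)

holdout_season = 2023
train = [g for g in games if g[0] < holdout_season]
test = [g for g in games if g[0] == holdout_season]

# A real model would be fit on `train` only; this placeholder has no
# parameters, so the split just shows where fitting and scoring happen.
print(len(train), "training games; hold-out Brier:", round(brier(test), 3))
```

Repeating this with each past season as the hold-out gives a rough picture of how stable the estimate is.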

wet oar Mar 1, 2024, 3:38 PM

#

The score of the public LB is obtained from testing the model on the 2023 season, am I right?

pulsar hare Mar 1, 2024, 5:48 PM

#

wet oar The score of the public LB is obtained from testing the model on 2023 season am ...

Yes, you are correct. But that doesn't mean you need to follow this. You can use a different set of test data if you want, since the actual evaluation will be done on data that is yet to come (after the competition is closed).

pulsar hare Mar 1, 2024, 6:33 PM

#

The submission deadline is the 21st, but Selection Sunday is on the 17th, so Selection Sunday falls before the deadline. However, this conflicts with this description of the data: "After Selection Sunday, when the competition has closed, we will replace this file with the actual 2024 tournament selections and rescore your submissions against the 2024 results." Any ideas?

keen quiver Mar 1, 2024, 7:11 PM

#

There is a warmup phase that's probably what's being referenced by "the competition"

sand canyon Mar 3, 2024, 11:34 PM

#

pulsar hare Submission deadline is 21, but selection Sunday is on 17. So it means that selec...

The actual bracket will be released on 3/17 (Selection Sunday) and the first games begin on 3/21. The cutoff should be before the first games begin on March 21

open harbor Mar 4, 2024, 4:11 PM

#

Please please please I am begging for the evaluation metric to change. This scoring function does not incentivize predictions that reflect true unbiased opinions!

uneven chasm Mar 7, 2024, 3:09 AM

#

We got an update to the scoring format, it's now bracket brier score wooo

#

The leaderboard does seem a bit out of sorts at the moment though. I know a lot of people have just submitted perfect brackets, but I would have figured that scoring brackets based on prediction values from last year would get you a brier score around 0.25 while chalk would get you to 0.3 or so (based on this 538 forecast evaluation https://fivethirtyeight.com/features/how-fivethirtyeights-ncaa-tournament-forecasts-did/). I'm currently getting 0.075 for last year's winner's prediction values and 0.08 for the median expert predictions. Unless I'm misunderstanding how the scoring is done.

keen quiver Mar 7, 2024, 3:20 AM

#

Are you remembering to divide by six?

winter osprey Mar 7, 2024, 2:10 PM

#

Can someone explain how the score is being calculated? I'm new to this competition and trying to understand it.

winter osprey Mar 7, 2024, 2:26 PM

#

From the above chat, is it true that this competition was held last year too?

keen quiver Mar 7, 2024, 5:05 PM

#

Yeah this is about the 7th time for it

open harbor Mar 7, 2024, 8:40 PM

#

uneven chasm The leaderboard does seem a bit out of sorts at the moment though. I know a lot ...

It’s because that 538 article did not score their Brier scores how they describe… they only evaluated each round’s Brier score based on the teams that were left. Thankfully, the competition scoring is evaluating all 64 teams every round, so having a bunch of 16 seeds with ~0% title odds decreases the Brier score in a way 538 didn’t do. If Kaggle didn’t score like this, there would once again be a major incentive to simply guess the champion correctly, so good on them for not doing exactly what 538 did
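
A rough sketch of the scoring as described in this thread: a Brier score per round over every team, then an average across rounds (six in the real tournament, which is the "divide by six" mentioned earlier). This is an assumption drawn from the discussion, not the official metric code, and the numbers are invented.

```python
def bracket_brier(pred, actual):
    """Average of per-round Brier scores.

    pred[r][t]:   predicted probability that team t wins in round r
    actual[r][t]: 1 if team t actually won in round r, else 0
    Every team is scored in every round, including eliminated teams.
    """
    round_scores = []
    for p_round, a_round in zip(pred, actual):
        sq_err = [(p - a) ** 2 for p, a in zip(p_round, a_round)]
        round_scores.append(sum(sq_err) / len(sq_err))
    return sum(round_scores) / len(round_scores)

# Toy 4-team, 2-round bracket (teams A, B, C, D).
pred = [
    [0.8, 0.2, 0.6, 0.4],   # round 1: A plays B, C plays D
    [0.5, 0.1, 0.3, 0.1],   # round 2: chance of winning the final
]
actual = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
print(round(bracket_brier(pred, actual), 4))
```

Long shots that lose contribute very small squared errors, so including them in every round pulls the average down, which matches the point about 16 seeds above.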

hot marsh Mar 8, 2024, 3:13 AM

#

Hi, I am a little confused on the submission format. Am I submitting probabilities for this year's tournament or last year's? Thanks.

keen quiver Mar 8, 2024, 3:15 AM

#

You are practicing against last year's data for now, since you can only be measured against last year's tournament results. But since the "ground truth" is known, there's no prizes at stake yet. Once the tournament teams are known for this year, you will have a few days for submitting for this year instead.

#

But you are not directly submitting probabilities, you are submitting one or more simulated bracket results, and the probabilities will be inferred from the averages of the multiple brackets you submit.

hot marsh Mar 8, 2024, 4:34 AM

#

So a bit similar to persisting all of the iterations of a Monte Carlo simulation?

keen quiver Mar 8, 2024, 4:35 AM

#

Correct
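
To illustrate the idea, here is a small simulation of a hypothetical 4-team bracket in which each simulated bracket has a single winner per game and advancement probabilities are recovered as frequencies across brackets. The win probabilities and the coin-flip final are placeholders, not anyone's actual model.

```python
import random
from collections import Counter

# Hypothetical model win probabilities for the two semifinals:
# P(first team beats second team).
P_WIN = {("A", "B"): 0.8, ("C", "D"): 0.6}

def p_beats(a, b):
    """Placeholder head-to-head model for the final: a coin flip."""
    return 0.5

def simulate(rng):
    """Play one bracket; every game has exactly one winner."""
    finalists = [a if rng.random() < p else b for (a, b), p in P_WIN.items()]
    a, b = finalists
    champion = a if rng.random() < p_beats(a, b) else b
    return finalists, champion

rng = random.Random(0)
n = 20_000
reached_final = Counter()
for _ in range(n):
    finalists, _champion = simulate(rng)
    reached_final.update(finalists)

# The share of brackets in which a team advances is its implied probability.
implied = {team: reached_final[team] / n for team in "ABCD"}
print(implied)
```

With a single bracket (`n = 1`), every implied probability collapses to 0 or 1.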

copper fossil Mar 8, 2024, 12:12 PM

#

There are two types of dataset, i.e., compact and detailed. Which should we use for prediction?

#

respond

errant spoke Mar 8, 2024, 3:01 PM

#

copper fossil there are two types of dataset i.e , compact and detailed which to use for predi...

The compact dataset has more years of data but fewer variables (only winning score and losing score). The detailed data has many other variables in addition to winning and losing score, but is only available for more recent years

teal geode Mar 12, 2024, 4:36 PM

#

Hello everyone, could someone clarify if we are allowed to collect a train set (all data is public available) on a local machine and use it for training in a Kaggle notebook?

keen quiver Mar 12, 2024, 5:11 PM

#

Yes that is fine

gentle agate Mar 12, 2024, 7:11 PM

#

@keen quiver There seems to be some incorrect data in 2024_tourney_seeds.csv. See this thread: https://www.kaggle.com/competitions/march-machine-learning-mania-2024/discussion/483162 I checked it as well and there do seem to be some incorrect seeds. Thanks for all the support on this competition!

keen quiver Mar 13, 2024, 7:12 PM

#

OK thanks, I will look at it

uneven chasm Mar 14, 2024, 7:48 PM

#

teal geode Hello everyone, could someone clarify if we are allowed to collect a train set (...

My workflow has been to do feature engineering, model training, and predictions all locally, then to put those predictions into a bracket simulation on Kaggle and run it there to generate submissions. All the data that I'm adding on top of efficiency metrics generated from the team box scores is free and publicly available (preseason AP polls, play-by-play data, All-American teams, etc.). However, I'm assuming that if you end up in the prize money, you'll have to open up what you do locally for auditing.

keen quiver Mar 14, 2024, 7:50 PM

#

I am not quite sure what the verification process is, but yeah the bar is that people should be able to replicate the use of your data without significant expense

hot marsh Mar 15, 2024, 2:54 PM

#

Would it be more advantageous to submit 100k brackets with binary outcomes for each team's round advancement, so that the scoring process inherently averages them all out, OR to submit one bracket with just model odds? In reality they should be nearly identical with N=100k, no?

keen quiver Mar 15, 2024, 4:17 PM

#

If you only submit one bracket then you are saying 100% chance for that outcome, since each bracket can only have 0/1 result in each game

hot marsh Mar 15, 2024, 4:21 PM

#

Oh I must have missed that the entries couldn't have floats. Thanks!

#

One more thing: when simulating, the slots stay consistent and basically you’re just mapping to the winning team that overtakes that higher seed, right? Like the champ will always be R5WX vs R5YZ, and so to simulate you effectively simulate taking over that ID?

#

Sorry, I’m actually having a lot of trouble figuring out how to simulate a game and propagate to the next round. Are there any sample notebooks for that?

keen quiver Mar 15, 2024, 4:41 PM

#

Yes, if you look at the discussion topic "Return of the Seed Benchmark"

#

If you look at the NCAA Tourney Slots file from the data, you see that for example slot R4W1 is made up of the "stronger seed slot" R3W1 and the "weaker seed slot" R3W2. If all the strongest seeds win their games then it will indeed be W01 against W02 in that game, whereas if for example W15 had defeated W02 in the first round they did indeed "take over their seed", since they would then face all the teams in order that W02 would have faced. You can think of it as "taking over the seeds" if you like, but I think it might be better to think of it as "winning the slot", where the slots are just usefully-named games. So the slot R1W2 is a first round game where W02 faces W15, and the winner (whether it be W02 or W15) advances to be the "strong seed" in the slot R2W2, which is a second round game between the winners of the slots R1W2 and R1W7.

It is certainly true that "strong" and "weak" get muddled up once there's an upset, and also that it's not really meaningful to talk about a stronger seed in the final two rounds, but it's still arranged that way in the NCAA Tourney Slots file so that the whole structure is captured.

There's nothing wrong with thinking about it in terms of taking over the seed from the higher seed team (which itself might have taken the seed over from another stronger-seeded team in an upset); that works too.
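
The "winning the slot" view can be sketched as follows, using only the three slots named in the explanation above (R1W2, R1W7, R2W2). The `win_prob` function is a made-up seed-based placeholder, and the tuple layout (slot, strong side, weak side) is an assumed simplification of the NCAA Tourney Slots file.

```python
import random

# (slot, strong seed or earlier slot, weak seed or earlier slot),
# listed so that earlier rounds come first.
SLOTS = [
    ("R1W2", "W02", "W15"),
    ("R1W7", "W07", "W10"),
    ("R2W2", "R1W2", "R1W7"),
]

def win_prob(seed_a, seed_b):
    """Hypothetical model: P(seed_a beats seed_b) from seed numbers only."""
    a, b = int(seed_a[1:3]), int(seed_b[1:3])
    return b / (a + b)

def simulate(slots, rng):
    """Return {slot: seed label of the team that won that slot}."""
    occupant = {}
    for slot, strong, weak in slots:
        a = occupant.get(strong, strong)   # a plain seed resolves to itself
        b = occupant.get(weak, weak)       # a slot resolves to its winner
        occupant[slot] = a if rng.random() < win_prob(a, b) else b
    return occupant

bracket = simulate(SLOTS, random.Random(1))
print(bracket)
```

Because each later slot looks up whoever won the earlier slot, an upset winner automatically inherits the path of the team it beat.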

#

In your eventual submission file, you are not submitting floats or 0/1 integers, you are identifying the teams that win each slot (also could be called "each game") for each bracket you are submitting

#

And so if you just submit one bracket, then whoever you have as winning the R4W1 slot is basically being given a 100% chance to get to the final four.

winter osprey Mar 16, 2024, 7:46 PM

#

What all files are we supposed to use?

keen quiver Mar 16, 2024, 7:53 PM

#

It's in the Data section

uneven chasm Mar 16, 2024, 9:08 PM

#

How are yall working around the play-in games? Are you just waiting until they’re all played on Tuesday/Wednesday to update your simulated brackets for whichever of the teams win? Or are you simulating the winner of those play-in games in advance and then matching up whoever won that simulation in their respective game slots? Just thinking of teams like 2021’s UCLA men’s team who won their 11 seed play-in (in OT!!) and advanced all the way to the Final Four.

uneven chasm Mar 17, 2024, 10:10 PM

#

The moment of truth is here folks. The selection committee put 3/4 of the final 4 teams from last year all in the same half of their region, already off to a spicy start

errant spoke Mar 17, 2024, 11:32 PM

#

uneven chasm How are yall working around the play-in games? Are you just waiting until they’r...

Yeah I'm thinking I'm going to have things mostly set up and then submit a final prediction after the play-ins. Think this may be the first time they've put them at the 10 seed

ionic tree Mar 18, 2024, 12:34 AM

#

How do I join season-level team variables to a dataset where each row is a head-to-head matchup that is split into team 1 and team 2 game-specific variables in RStudio?

keen quiver Mar 18, 2024, 1:27 AM

#

I don't think I can help you much since I don't know anything about R studio. But most people seem to randomly assign team 1 versus team 2 for each matchup, since if you just use what's in the Compact Results data, the first team (WTeamID) is the one that won the game and the second team (LTeamID) is the one that lost the game. But conceptually, I would assume you need two joins to that same season-level team variables table, once from Team 1 and once from Team 2
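
For reference, the two-join idea looks like this in pandas (not R). The `WinRatio` column and all values are hypothetical, and this sketch orders each pair by TeamID rather than assigning team 1 and team 2 at random, which is a deterministic alternative to what is described above.

```python
import pandas as pd

# Toy stand-ins for the game results and a season-level team table.
games = pd.DataFrame({
    "Season": [2023, 2023],
    "WTeamID": [1101, 1103],
    "LTeamID": [1102, 1101],
})
team_stats = pd.DataFrame({
    "Season": [2023, 2023, 2023],
    "TeamID": [1101, 1102, 1103],
    "WinRatio": [0.60, 0.45, 0.80],
})

# Order each pair by TeamID so the winner is not always Team1.
games["Team1"] = games[["WTeamID", "LTeamID"]].min(axis=1)
games["Team2"] = games[["WTeamID", "LTeamID"]].max(axis=1)
games["Team1Won"] = (games["WTeamID"] == games["Team1"]).astype(int)

# Join the same season-level table twice, once per side of the matchup.
t1 = team_stats.rename(columns={"TeamID": "Team1", "WinRatio": "T1_WinRatio"})
t2 = team_stats.rename(columns={"TeamID": "Team2", "WinRatio": "T2_WinRatio"})
matchups = games.merge(t1, on=["Season", "Team1"]).merge(t2, on=["Season", "Team2"])
print(matchups)
```

In R the same shape would be two joins keyed on season plus the respective team ID.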

uneven chasm Mar 18, 2024, 1:46 AM

#

errant spoke Yeah I'm thinking I'm going to have things mostly set up and then submit a final...

Because of how many automatic qualifiers had upsets in the conference tournament, it pushed the seed line for Last Four In to the 10 seed and had a bunch of bubble teams get bounced. My Virginia Cavaliers were lucky to make the tournament, considering I had no hope they would make it over teams like St. John's and OU. Glad to have made it, but I don't like how I'll actually be sweating the play-in games when they're usually just a test point for the model I'm putting together.

ionic tree Mar 18, 2024, 1:52 AM

#

keen quiver I don't think I can help you much since I don't know anything about R studio. B...

That helps, thank you!

cloud vault Mar 18, 2024, 2:04 AM

#

Do we know when the 2024 data (minus First Four) will be available?

keen quiver Mar 18, 2024, 2:06 AM

#

Should be by tomorrow morning (Monday morning US time)

torn sorrel Mar 18, 2024, 4:21 PM

#

I just checked, and 2024_tourney_seeds still seems to be last year's seeding

#

Should we expect the updated data to be dropped shortly?

uneven chasm Mar 18, 2024, 4:34 PM

#

so the 2024_tourney_seeds isn't updated yet but I just checked MNCAATourneySeeds and 2024 data is in there

torn sorrel Mar 18, 2024, 5:24 PM

#

Oh bet thanks!

keen quiver Mar 18, 2024, 5:38 PM

#

There is a forum topic about the latest release: https://www.kaggle.com/competitions/march-machine-learning-mania-2024/discussion/484889

errant spoke Mar 19, 2024, 3:25 PM

#

Just wanted to double check, it doesn't matter if we start RowId at 0 or 1? I ask because I see it differently in the sample submission file and on the website

languid fossil Mar 19, 2024, 8:01 PM

#

Hello. In the data description for section-1, the TeamName column says that there are 68 teams in the NBA, but when I merged the section-1 datasets for men there were 233 unique values in TeamName. Why is that?

pallid wharf Mar 20, 2024, 12:02 AM

#

I built a simple and basic logistic regression model with 3 variables, WinRatio, GapAvg, and SeedDiff thanks to the help of a Kaggle notebook to get me started. I've always wanted to build a working March Madness predictive model but dealing with sports data is hard for me since I have little experience. Now I have a working model I can improve on next year. I had trouble setting up my model in a way to give predictions in the submission format but I copied the data and coefficients for each team into Excel and filled out a bracket manually 🙂

uneven chasm Mar 20, 2024, 12:16 AM

#

How do yall adjust for player injuries like McCullar for Kansas? He was just ruled out for the tournament and he’s their best player

keen quiver Mar 20, 2024, 12:19 AM

#

What I usually do is avert my eyes and grit my teeth

uneven chasm Mar 20, 2024, 12:26 AM

#

As a UVA fan, I still get PTSD from DeAndre Hunter being ruled out right before the tournament and then losing to 16 seed UMBC. I’m just extra wary of injuries because they can tank a team’s chances in a way our models can’t really quantify. A way to work around this would be using Vegas lines to extrapolate team ratings and adjust for those teams. Easy to say, harder to execute, especially if you’re backfilling game spreads back to 2003

keen quiver Mar 20, 2024, 12:31 AM

#

Just be glad it was only one player. I'm still traumatized from Kristin Folkl and Vanessa Nygaard tearing their respective ACLs right before #1 Stanford played #16 Harvard.

languid fossil Mar 20, 2024, 12:48 AM

#

languid fossil Hello. in the data description for section-1, the TeamName column it says that t...

Could someone please help me out with this.

keen quiver Mar 20, 2024, 12:51 AM

#

languid fossil Could someone please help me out with this.

The regular season compact results contain all the games played among Division I college teams, of which there are about 360 men's teams and 360 women's teams. At the end of the regular season, 68 men's teams and 68 women's teams are picked to play in the NCAA tournaments. So the NCAA Tourney Seeds files tell you those 68 teams.

#

I think maybe what you are seeing with the 233 is that there are 233 different teams that have made the tournament across different years. For this year you need to look at the data rows where Season=2024
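
A small pandas illustration of the difference between counting teams across all seasons and counting only this year's field. The frame is a made-up stand-in for the NCAA Tourney Seeds file, assuming `Season`, `Seed`, and `TeamID` columns.

```python
import pandas as pd

# Toy stand-in for the NCAA Tourney Seeds data; values are invented.
seeds = pd.DataFrame({
    "Season": [2023, 2023, 2024, 2024, 2024],
    "Seed": ["W01", "W02", "W01", "W02", "W03"],
    "TeamID": [1101, 1102, 1101, 1103, 1104],
})

all_years = seeds["TeamID"].nunique()                      # teams across every season
this_year = seeds.loc[seeds["Season"] == 2024, "TeamID"]   # only the 2024 field
print(all_years, this_year.nunique())
```

On the real file the second count would be the 68 teams mentioned above, while the first grows with every season included.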

languid fossil Mar 20, 2024, 12:52 AM

#

So basically my merging would be incorrect then, right? Because merging just season-1 teams and regular season teams would be inaccurate.

keen quiver Mar 20, 2024, 12:53 AM

#

Each year there is both regular season and NCAA tourney. In both cases, this year's data has Season=2024

languid fossil Mar 20, 2024, 12:53 AM

#

keen quiver I think maybe what you are seeing with the 233 is that there are 233 different t...

It won't make any sense to just merge Division 1 teams and Regular Season teams, right?

keen quiver Mar 20, 2024, 12:54 AM

#

All that will do is tell you the names of the teams that played in the regular season games

languid fossil Mar 20, 2024, 12:58 AM

#

Yeah, but then it becomes a very large dataset, and you just can't plot anything; it gets blurred.

keen quiver Mar 20, 2024, 12:58 AM

#

For the most part you can just use the TeamID to uniquely identify each team, and that is used in the regular season games, the NCAA tourney games, and the NCAA tourney seeds. The general problem you are trying to solve is how to take the Season=2024 Regular Season Compact Results data (for each team ID) along with the Season=2024 NCAA Tourney Seeds data (for each team ID) and predict what will happen when those teams participate in the 2024 tournament. To help you in developing a predictive model, you can look at the Regular Season Compact Results, NCAA Tourney Seeds, and NCAA Tourney Compact Results from previous years (where Season < 2024)
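
As one possible starting point, per-team season summaries can be derived from the compact results by viewing each game once from the winner's side and once from the loser's side. The `WinRatio` and `AvgMargin` feature names are illustrative choices, and the rows are invented.

```python
import pandas as pd

# Toy stand-in for the Regular Season Compact Results; values are invented.
results = pd.DataFrame({
    "Season": [2024, 2024, 2024],
    "WTeamID": [1101, 1101, 1102],
    "WScore": [70, 65, 80],
    "LTeamID": [1102, 1103, 1103],
    "LScore": [60, 64, 70],
})

# One row per team per game, seen from that team's side.
wins = results.rename(columns={"WTeamID": "TeamID"}).assign(
    Win=1, Margin=results["WScore"] - results["LScore"])
losses = results.rename(columns={"LTeamID": "TeamID"}).assign(
    Win=0, Margin=results["LScore"] - results["WScore"])
per_game = pd.concat([wins, losses])[["Season", "TeamID", "Win", "Margin"]]

# Collapse to one row per team per season.
team_season = (
    per_game.groupby(["Season", "TeamID"])
    .agg(WinRatio=("Win", "mean"), AvgMargin=("Margin", "mean"))
    .reset_index()
)
print(team_season)
```

The resulting table is keyed by Season and TeamID, so it can be joined to seeds and tournament games for any year.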

languid fossil Mar 20, 2024, 12:59 AM

#

Oh okay thank you very much for your help.

summer vector Mar 20, 2024, 4:50 PM

#

Hey! I Just want to thank the organizers for putting this together, this is the first time I have joined a Kaggle competition and it was super fun. Thanks, everyone!

devout cairn Mar 21, 2024, 12:25 AM

#

Bit confused: why does Kaggle say the closing date for the competition is 11th April when the actual tournament started Tuesday?

keen quiver Mar 21, 2024, 12:32 AM

#

Because it takes a few weeks for all the games to be played, so we won't know the contest winners for a while. Submission deadline is Thursday morning. It's complicated because there's a few games played on Tuesday and Wednesday

pallid wharf Mar 21, 2024, 12:33 AM

#

Technically the games Tuesday and Wednesday are play-in games to get into the field of 64, then round 1 starts

hot marsh Mar 21, 2024, 1:43 AM

#

Are there any sample workbooks on getting the seed matchup for historical tourney games? Having a lot of trouble with this but would like to incorporate it in my model.

Moreover, what’s a decent pure accuracy score for my own edification?

hot marsh Mar 21, 2024, 2:04 AM

#

With an augmented dataset I’m stuck at low 70s for pure accuracy which seems identical to just taking the higher seed so I’m quite curious if I’m doing something very wrong in prepping my training data

keen quiver Mar 21, 2024, 2:16 AM

#

You can join the two teams in a historical matchup to the NCAA Tourney Seeds record for that season for each of the two teams, and that tells you their seed
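
A sketch of that join in pandas, with invented rows. It assumes seed strings like `W01` or `W16a`, where characters 2 and 3 hold the seed number, and it also computes the accuracy of always picking the better seed as a reference point.

```python
import pandas as pd

# Toy stand-ins for the NCAA Tourney Seeds and tourney results data.
seeds = pd.DataFrame({
    "Season": [2023] * 4,
    "Seed": ["W01", "W16a", "X05", "X12"],
    "TeamID": [1101, 1102, 1103, 1104],
})
games = pd.DataFrame({
    "Season": [2023, 2023],
    "WTeamID": [1101, 1104],
    "LTeamID": [1102, 1103],
})

# "W16a" -> 16: strip the region letter and any play-in suffix.
seeds["SeedNum"] = seeds["Seed"].str[1:3].astype(int)

# Join the seeds table twice, once for the winner and once for the loser.
w_seeds = seeds.rename(columns={"TeamID": "WTeamID", "SeedNum": "WSeed"})[
    ["Season", "WTeamID", "WSeed"]]
l_seeds = seeds.rename(columns={"TeamID": "LTeamID", "SeedNum": "LSeed"})[
    ["Season", "LTeamID", "LSeed"]]
games = games.merge(w_seeds, on=["Season", "WTeamID"]).merge(
    l_seeds, on=["Season", "LTeamID"])

# Accuracy of always picking the better (lower-numbered) seed; equal-seed
# games are excluded because the rule makes no pick there.
decided = games[games["WSeed"] != games["LSeed"]]
baseline_acc = (decided["WSeed"] < decided["LSeed"]).mean()
print(games, baseline_acc)
```

Comparing a model's accuracy against `baseline_acc` on the same games shows whether added features are contributing anything beyond seed.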

marsh storm Mar 21, 2024, 8:31 AM

#

Hi @keen quiver

When will the 2024_tourney_seeds.csv get updated with the new play-in game winners? I still see Boise St in the file.

toxic oar Mar 21, 2024, 3:47 PM

#

Approximately how long should scoring take to run? I submitted a submission over 20 minutes ago and was hoping to get it in before the competition closes in a few minutes, but I'm not sure whether it will be counted as one of my submissions if it doesn't score before 4 UTC.

keen quiver Mar 21, 2024, 3:47 PM

#

Are you able to mark its checkbox to be one of your two selected?

toxic oar Mar 21, 2024, 3:49 PM

#

No, because it does not have a score yet, I don't see the checkbox

keen quiver Mar 21, 2024, 3:49 PM

#

I don't really have the access to help you on this. Maybe you could delete all your other ones? I know everything will get rescored once the deadline passes

toxic oar Mar 21, 2024, 3:50 PM

#

do you know how to delete version 5 without deleting the entire notebook?

keen quiver Mar 21, 2024, 3:51 PM

#

I don't really know what I am talking about with the notebooks, so I probably shouldn't suggest anything. I have barely ever done notebooks at all

#

I don't know if we can do that but you got your request out in time at least

uneven chasm Mar 22, 2024, 2:20 PM

#

I hadn’t made any submission on the new updated 2024 data until yesterday morning, but my top scoring submissions (on the 2023 test data) were auto-selected despite being submitted like 2 weeks ago. This was an error and oversight on my part, as I was not aware until after the deadline that we had to select submissions after hitting submit in the notebook. I thought it was done automatically, especially because these were the only two submissions I sent after Selection Sunday. I was used to the old system where you just uploaded the CSV file and the selection box was right there. I was traveling yesterday, so in all the hectic mess, I neglected to select those two most recent submission notebooks.

Do you think it’d be possible to switch my submissions to these picks because there was no way I intended to use the submissions I selected before selection Sunday even occurred! Heck, I submitted them before the scoring criteria was changed.

I’ve been doing this competition for 4 years and I put a lot of work into this year, so I’d be incredibly disappointed if my relevant work wasn’t included. 😦

snow zinc Mar 22, 2024, 9:18 PM

#

uneven chasm I hadn’t made any submission on the new updated 2024 data until yesterday mornin...

I doubt it as round 1 is almost over now.

uneven chasm Mar 22, 2024, 9:40 PM

#

snow zinc I doubt it as round 1 is almost over now.

Yeah but they are the only two submissions that were made in a window that makes sense

snow zinc Mar 23, 2024, 7:28 PM

#

Probably organisers can comment on it, I am also facing some issues

#

Can anyone tell me why this is the case? I mean, we had a score on the public LB, we could execute the complete code on Kaggle, and in fact we have our submission files with us as well, but now this error?

Also, there are no comments on our other solution, nor can we see ourselves on the current leaderboard

[Attachment: Screenshot_2024-03-24-00-57-07-42_40deb401b9ffe8e1df2f1cc5ba480b12.jpg]

bitter ravine Mar 24, 2024, 7:52 PM

#

Have any of the submissions been released or will they at some point? Wondering who to root for or how stable the top 10 will be...

keen quiver Mar 24, 2024, 7:55 PM

#

We normally don't reveal submission details because then some people will analyze and point out that only three out of a thousand people aren't already eliminated from first place chances. That's fun to do but also disheartening. So we preserve the suspense a little bit. Sometimes people high up on the leaderboard will share details with each other about their predictions, so they know specifically who to root for.

devout cairn Mar 30, 2024, 8:18 PM

#

Has anyone tried an LSTM for March Madness, or does anyone have opinions on using one for predicting sports results?

fickle roost Apr 1, 2024, 4:59 PM

#

devout cairn Has anyone tried LSTM for march madness or having any opinions on using it for p...

It wouldn’t not work, but seems like it adds unnecessary complexity to me

bitter ravine Apr 5, 2024, 5:30 PM

#

Do all games count the same or are the final games more important?

viral spade Apr 6, 2024, 12:29 PM

#

bitter ravine Do all games count the same or are the final games more important?

The final games are more important; check the description.

I read that in robmulla's YouTube video. It's more like

4, 8, 16, 32

dapper stone Apr 20, 2024, 5:04 AM

#

Any updates on finalizing the LB? It's been 9 business days...

viral spade Apr 23, 2024, 5:12 AM

#

dapper stone Any updates on finalizing the LB? It's been 9 business days...

lol