#march-machine-learning-mania-2025


winter lagoon
#

Welcome everyone! The lack of posts so far was getting ominous...

muted heath
#

Hi, can someone explain these two features (FirstD1Season, LastD1Season) ? thank you all.

cunning ore
# muted heath Hi, can someone explain these two features (FirstD1Season, LastD1Season) ? thank...

First and Last D1 season refer to the team's first and last season as a Division 1 team. There are five levels of college basketball, roughly in this order from strongest to weakest: Division 1 > Division 2 > Division 3 > NAIA > NJCAA. Only Division 1 schools are eligible to play in March Madness, so if a team's first year as a D1 school was 2004, it does not necessarily mean they were not good enough to make the 2003 tournament; it just means they were not eligible.

#

I will note that it is quite rare that a school will switch divisions, it is not like sporting leagues with relegation for bad teams

muted heath
#

I’m so grateful for your detailed answer! @cunning ore

radiant temple
#

In the sample submission file, each matchup in a year (YYYY_TeamID_TeamID) occurs exactly once. However, when I check the regular season results, some matchups occur twice. This makes sense, as teams can play each other more than once in a season. But how is the Brier score calculated in this case? Does a matchup that occurs twice count twice?

cunning ore
#

So, the predictions are only your guess if they were to match up in the tournament, not what you believe would happen in regular season

winter lagoon
#

The vast majority of data rows in the submission file will be ignored during the scoring, since there will only be 63 men's scores and 63 women's scores, in a given year, that will ultimately be scored. We do not score the "play-in" tournament games that occur at the very start, because those are played while it is still legal to submit predictions. Nor do we score any games that occurred during the regular season, or in the "secondary" tournaments like the NIT and WNIT that happen in parallel with the NCAA tournament. The only games that matter are those final 63 in the men's NCAA tournament and those final 63 in the women's NCAA tournament.

outer snow
#

Will data from Nate Silver's Silver Bulletin be allowed this year, assuming that it's behind a paywall again?

carmine marlin
#

Hi, Im looking for a team, if you are interested feel free to message me in my DM 👍

last rain
#

Hello, I would like to double-check whether we are required to submit a notebook along with our predictions csv, or if we can just submit our csv files. Thanks!

winter lagoon
last rain
# winter lagoon Just the csv is fine

Thanks Jeff. One final question: The submitted csv should have all possible matchups for both men and women all the way back to 2021 (2021-2025)? Or just all possible matchups for the year 2025?

winter lagoon
# last rain Thanks Jeff. One final question: The submitted csv should have all possible matc...

We are currently in Stage 1, which means you do submissions of seasons 2021, 2022, 2023, and 2024, in accordance with the Stage 1 sample submission file. There’s no prizes or medals for stage 1 because if you wanted you could submit a perfect set of predictions just by consulting the contest data files. Once we move to Stage 2, you will submit predictions for 2025 in accordance with the Stage 2 sample submission file, which we will be releasing next week.

#

And we can’t start scoring those Stage 2 submissions until the tournament starts on March 20th

last rain
#

Ah I see! Thank you for clarifying

cursive phoenix
#

Hi all. Just joined and excited to mess around. First time kaggler. I think I have a pretty good idea of the setup. Only questions I have so far:

  • what is the benefit of 100 submissions a day if you can just validate locally based on the 21-24 data and 25 pre-tournament data as it comes?
  • you ultimately choose 2 submissions. Are they evaluated totally separately and your leaderboard position is your highest-ranking submission? What if someone has both the 1st and 2nd best submissions, for example, in that case?
cunning ore
cursive phoenix
#

Cool thank you.

winter lagoon
#

Yes you will tag your 2 preferred submissions, and then when we score it, each person’s leaderboard score is just the better performing of their own 2. So you cannot win both 1st and 2nd place

#

It is important for people to remember to tag their two preferred submissions, since you might not like the automated way that they get picked, if you forget to tag them. We will be reminding people of this frequently over the course of the competition

wicked nymph
#

Why are there like 3 copy pasted + chatgpt output public notebooks with over 10 upvotes...😭

#

Only difference is one notebook's LLM changed all the "" to '' 👍

naive garden
copper grove
dark stump
#

I'm looking for team for this competition. Interested individuals dm me

hasty tapir
#

wth, can someone let me know why the leaderboard has so many 0.000000 scores?

loud spruce
dark stump
#

Hello all I'm looking for team members for this competition

#

Interested dm me

bright pulsar
#

@winter lagoon Does the SampleSubmissionStage1.csv contain all the possible matchups? Is it enough if I parse it?

winter lagoon
#

Yes and yes

bright pulsar
granite comet
#

The data set doesn't have anything about players on each team - is that correct?

winter lagoon
#

That’s right

high thistle
#

Hi, could I ask what is the ddl for stage 1?

winter lagoon
#

I'm not sure what ddl means? Deadline? There is no real deadline for Stage 1, no prizes or medals or anything, since the solutions for stage 1 could easily be looked up in the competition datafiles. It's more for practice and model development. Stage 2 started a few days ago and constitutes the real competition. The submission deadline is the morning of March 20 (but check your timezone please!)

I think ddl also means data definition language, in which case I don't know what you mean...

bright pulsar
# winter lagoon Yes and yes

Then what's the difference between the file of stage1 and 2? Which one should I parse? I want to get medals & no other goals.

winter lagoon
#

In stage 2, you are making predictions prior to the 2025 season's tournament. The stage 2 submission file only contains predictions for the 2025 season. For practice, we also provide a stage 1 submission file, which includes predictions for seasons 2021-2024. This helps you understand the submission format. However, there are no prizes or medals for doing well, because the tournament game results for seasons 2021-2024 are part of the competition dataset.

high thistle
bright pulsar
winter lagoon
# bright pulsar Do this mean that I have to predict the same matches as in the stage 2 submissio...

Yes that is correct, the sample stage 2 submission file indicates the matchups that must be predicted. If you like, you can parse through the sample stage 2 submission file and it will tell you which matchups to predict, and so you can prepare your own submission file that way. Once you have made any and all of your submissions, you will be able to pick two of them as your "selected submissions". That must all be done prior to the submission deadline early on March 20th (remember to check precise time, for your time zone!) Then the actual 2025 tournament starts on March 20th, and as the tournament progresses (across 2+ weeks) then everyone's scores will get updated based on what happens in the real life games. Assuming you selected two of them, the better-scoring of the two will determine your placement on the leaderboard and ultimately whether you get a medal or a prize.
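
For anyone who wants to parse the sample file as described, here is a minimal sketch. It assumes the file has an `ID` column in the YYYY_TeamID_TeamID format plus a `Pred` column; the team IDs and the inline `SAMPLE` text are made-up stand-ins for the real file.

```python
import csv
import io

# Toy stand-in for a sample submission file (the real file is far larger).
# The column names "ID" and "Pred" are an assumption in this sketch.
SAMPLE = """ID,Pred
2025_1101_1102,0.5
2025_1101_1103,0.5
2025_1102_1103,0.5
"""

def parse_matchups(csv_text):
    """Return a list of (season, first_team_id, second_team_id) tuples."""
    matchups = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        season, team_a, team_b = (int(part) for part in row["ID"].split("_"))
        matchups.append((season, team_a, team_b))
    return matchups

matchups = parse_matchups(SAMPLE)
print(matchups[0])
```

With a real file you would read the text from disk instead of using the inline string, then fill in your own prediction for each parsed matchup.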

wicked nymph
bright pulsar
wicked nymph
wicked nymph
bright pulsar
#

Yes. And I'm learning of course. But the main objective is achievement @wicked nymph

wicked nymph
bright pulsar
loud spruce
#

The stage 1 file is extremely useful as a test set. For anyone new to the competition I would highly recommend using the Stage 1 file as a pure test set and filter years 2021+ out of all of your training and validation data during your model development process. This competition can be extremely easy to overfit and/or introduce data leakage from your validation. Having a test set during development is very helpful if you want to have a Stage 2 model that performs reliably.
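
A rough sketch of that hold-out idea, assuming game rows that carry a season number. The rows and the `split_by_season` helper are invented for illustration, not part of the competition data or any library.

```python
# Toy rows of (season, winning_team_id, losing_team_id); all values are made up.
games = [
    (2019, 1101, 1102),
    (2020, 1103, 1101),
    (2021, 1102, 1103),
    (2024, 1101, 1103),
]

TEST_START = 2021  # first season covered by the Stage 1 file

def split_by_season(rows, test_start=TEST_START):
    """Split rows into (development, held_out_test) by season."""
    dev = [r for r in rows if r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return dev, test

dev_games, test_games = split_by_season(games)
print(len(dev_games), len(test_games))
```

The point of splitting first is that any feature engineering, training, and validation only ever sees `dev_games`, so the 2021+ seasons stay untouched until the final check.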

zealous elm
#

Can someone please explain in more details how the regional interpretations work (maybe with an example)? the explanation in the data section just makes it even more confusing for me.

loud spruce
#

The MSeasons and WSeasons files explicitly list the names of the regions associated with the WXYZ encoding that you will find in both the seeds and the slots files. Unless you are going to visualize the competition layout, the real names don't matter too much, because the regions are encoded so that regions W and X always play each other in the semifinals, and the same goes for Y and Z. Knowing this, you can say that any matchup between a W or X seed and a Y or Z seed is a finals matchup. Determining other rounds and tournament paths is a bit more complicated but can be done using a combination of the seeds and slots files. Does that help?

loud spruce
#

Real example: Here is a link to the 2024 NCAA women's bracket where the Albany 1 region should be region X in the Kaggle data. Because of how the regions are defined I know the semifinals match of South Carolina vs NC State would be seeds W01 vs X03. Of course you can also just look up the seeds by team after the fact. The understanding of W plays X and Y plays Z is only necessary for predictive simulations. https://www.ncaa.com/brackets/basketball-women/d1/2024
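
As a small sketch of the W-plays-X, Y-plays-Z rule: the function below only looks at the region letter at the start of each seed string. The `meeting_stage` name and its return labels are made up for this example, and it does not work out the exact round for two teams from the same region.

```python
# Seeds look like "W01" or "X03": a region letter followed by the seed number.
# Regions W and X feed one national semifinal, Y and Z feed the other.
SEMIFINAL_OPPONENT = {"W": "X", "X": "W", "Y": "Z", "Z": "Y"}

def meeting_stage(seed_a, seed_b):
    """Return the earliest stage at which two seeds could meet."""
    region_a, region_b = seed_a[0], seed_b[0]
    if region_a == region_b:
        return "regional"   # same region: they meet before the semifinals
    if SEMIFINAL_OPPONENT[region_a] == region_b:
        return "semifinal"  # W vs X, or Y vs Z
    return "final"          # one team from W/X, the other from Y/Z

print(meeting_stage("W01", "X03"))
print(meeting_stage("W01", "Z02"))
```

Pinning down the exact round inside a region still needs the slots file, as mentioned above.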

zealous elm
loud spruce
#

Yes exactly. It's all about the tournament structure/flow

bright pulsar
loud spruce
dim bear
loud spruce
# dim bear In regards to overfitting, is it generally recommended to use 2021-2024 as testi...

There is a lot of variance in the tournament results, so even 4 years isn't really all that much data. I would suggest a more robust validation scheme and only using those last 4 years as a test check. Even then, it's totally possible that you have a model that validates well but performs slightly worse than another model in the last 4 years. The hard part about this competition is that you probably need to be slightly overfit and lucky to have a chance at a strong medal.

heady mason
#

anyone still looking to join team or looking for a member?

last rain
compact thorn
# last rain How can I back test on the brier loss function like you detailed in your comment...

You can use `from sklearn.metrics import brier_score_loss`. Basically, get the compact results csv and join it to the submission so you have the result. Then do something like this to go through and calculate the results:

```python
# calculate the brier score for each season
for season in sub_score["Season"].unique():
    bs = mmf.calculate_brier_score(
        sub_score[sub_score["Season"] == season]
    )
    print(f"Season {season} Brier Score: {bs}")
```

compact thorn
last rain
#

Cool, thanks Anthony!

compact thorn
# last rain Cool, thanks Anthony!

Almost forgot: you need to convert the winning and losing team ids in the results into the lower and higher team id format of the submission. np.where is your friend for that job.
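
Here is a plain-Python sketch of that whole backtest, under the assumption that each prediction is the probability that the lower-ID team wins. The helper names, the toy results, and the toy predictions are all invented; with pandas, a `np.where` on the winner and loser id columns would do the same lower/higher conversion in one step.

```python
def matchup_id(season, w_team, l_team):
    """Build the submission-style ID with the lower team id first."""
    low, high = min(w_team, l_team), max(w_team, l_team)
    return f"{season}_{low}_{high}"

def outcome(w_team, l_team):
    """1.0 if the lower-ID team won the game, else 0.0."""
    return 1.0 if w_team < l_team else 0.0

def brier_by_season(results, preds):
    """Mean squared error per season.

    results: iterable of (season, winning_team_id, losing_team_id)
    preds:   dict mapping submission ID -> P(lower-ID team wins)
    """
    sq_errors = {}
    for season, w_team, l_team in results:
        p = preds[matchup_id(season, w_team, l_team)]
        sq_errors.setdefault(season, []).append((p - outcome(w_team, l_team)) ** 2)
    return {s: sum(errs) / len(errs) for s, errs in sq_errors.items()}

# Toy data: made-up team ids and predictions.
results = [(2023, 1102, 1101), (2023, 1101, 1103), (2024, 1103, 1102)]
preds = {"2023_1101_1102": 0.3, "2023_1101_1103": 0.8, "2024_1102_1103": 0.5}

scores = brier_by_season(results, preds)
for season, bs in sorted(scores.items()):
    print(f"Season {season} Brier Score: {bs:.4f}")
```

Only games that actually appear in `results` contribute, which matches how the unplayed matchups in a submission are ignored.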

marsh prairie
#

How are y'all accounting for the Flagg injury? I'm thinking you have to use some metric like the point spread for the first game versus expected to do an efficiency margin correction that values player worth

dim bear
#

I have accepted I have to just gloss over injuries and it's gonna hurt my score

winter lagoon
#

Flagg injury 😳?

#

Wow

#

But it doesn’t look too terrible

#

Based on my immense medical expertise

mild pebble
dim bear
#

I don't know how to do this without a massive dataset of individual player statistics. Does one exist that isn't behind a paywall?

winter lagoon
#

Somebody posted a link in the Discussion forum to a dataset that I think was hosted on Kaggle and I know had individual stats in it. I am struggling to find the link though; I believe it was about a week ago

winter lagoon
#

I have no idea about the quality or usefulness of the data; it was just something I noticed in passing

tawdry kernel
#

I noticed in MNCAATourneyCompactResults.csv, there are only 66 lines starting with the year 2021, whereas the other years (like 2018, 2019, 2022, 2023, 2024) have 67 (first 4 + 32 + 16 + 8 + 4 + 2 + 1). I'm checking with `grep "^2021" MNCAATourneyCompactResults.csv | wc -l`. I don't see anything different with the format of the 2021 bracket: https://www.espn.com/mens-college-basketball/bracket/_/season/2021/2021-ncaa-tournament
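
The same per-season count can be sketched in Python, which makes it easy to compare every season against the expected 67 at once. The inline CSV below is a tiny made-up stand-in and its header is an assumption; only the `Season` column is used.

```python
import csv
import io
from collections import Counter

# Expected games in a full bracket: First Four, then each round down to the final.
EXPECTED_GAMES = 4 + 32 + 16 + 8 + 4 + 2 + 1

# Tiny made-up stand-in for the compact results file.
TOY_RESULTS = """Season,DayNum,WTeamID,WScore,LTeamID,LScore
2021,136,1101,70,1102,60
2021,137,1103,65,1101,64
2022,136,1102,80,1103,75
2022,136,1101,66,1102,61
2022,138,1103,72,1101,70
"""

def games_per_season(csv_text):
    """Count result rows per season."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["Season"]) for row in reader)

counts = games_per_season(TOY_RESULTS)
for season, n in sorted(counts.items()):
    flag = "" if n == EXPECTED_GAMES else " (differs from expected)"
    print(f"{season}: {n} games{flag}")
```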

loud spruce
winter lagoon
marsh prairie
dim bear
#

I'm getting 0.1706 average brier over 2021-2024 mens. Surely overfit or did I stumble on a gem?

bright pulsar
#

@winter lagoon Can I get permission for my child, who is under 18, to participate in this competition please? I have given my permission on my end. He is good at his studies and also wants to show his skills at ML. I believe that in this rapidly changing world, skills are the most important thing, as much as academic grades. Your permission can help him to upskill himself.

winter lagoon
loud spruce
red cairn
#

are we not predicting the two men's & women's play-in games that happen tomorrow & wednesday?

winter lagoon
#

They don’t count toward your score. The submission deadline is not until Thursday morning, and the games that are scored will start on Thursday

dim bear
#

Will those games appear in the data?

loud spruce
mild pebble
winter lagoon
#

It varies a lot, because it really depends on how many upsets there are each year

dim bear
#

To follow up on my question that you replied to @mild pebble, there indeed was a bug in my model. I'm doing much, much worse when I verify against 2021-2024 men's now (above 0.2 😢 ) but at least that's more accurate! Didn't want to give the impression I'm crushing it

mild pebble
mild pebble
#

good luck to anybody and everybody still grinding in the final hours!! been a fun and challenging project, can't wait to see some results

loud spruce
loud spruce
red cairn
#

I'm a bit confused by the submission format. Am I supposed to put the results for every possible matchup between all teams in WTeams.csv and MTeams.csv?

#

or can I just submit a file with the teams in the tournament now that we know what they are?

dim bear
#

Your submission file should contain all possible matchups between men's teams, concatenated with all possible matchups between women's teams. An overwhelming majority of those matchups will never happen in real life (so your prediction values for them don't really matter as they won't be scored).
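
A short sketch of building that full list of IDs, assuming the YYYY_TeamID_TeamID format with the lower team id first. The team ids below are made up, and `all_matchup_ids` is just an illustrative helper.

```python
from itertools import combinations

def all_matchup_ids(season, team_ids):
    """Every unordered pair of teams, lower id first, as submission-style IDs."""
    return [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]

# Made-up ids; in practice these would come from the men's and women's team lists.
men = [1103, 1101, 1102]
women = [3101, 3102]

# Men's and women's matchups are generated separately, then concatenated.
ids = all_matchup_ids(2025, men) + all_matchup_ids(2025, women)
print(len(ids), ids[0])
```

For n teams this gives n * (n - 1) / 2 rows per gender, which is why the real file is so much larger than the set of games that ends up being scored.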

mild pebble
slim yew
#

Does anyone know how to look at the score for both of your submissions? I can only see the score on the top submission.

last rain
slim yew
#

Thanks!

red cairn
#

did you guys choose to train your final model on just previous years post-season games or on both previous years regular-season and post-season games

mild pebble
loud spruce
quasi grove
#

Hey all, does anyone know why there are two scores on the leaderboard

hardy kernel
#

Can somebody explain to me how the leaderboard works? I'm interested in knowing who's in first place. Is it by accuracy of picks? Who is the highest? I'd love to know the strategy behind the top leaders.

winter lagoon
#

Everyone submitted predictions for all possible game matchups. A prediction is a decimal number somewhere between 0 and 1. So if you submit 0.75 for that row, you are saying a 75% chance for the first team to beat the second team. Then when all the actual games get played, the result is either a 1 or a 0, depending on whether the first team beat the second team. We square the difference between the prediction and the actual result, and evaluate each player's two selected submission files. The leaderboard shows everyone's better-performing submission out of the two they selected.
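
As a worked example of that arithmetic (a sketch, with made-up numbers): the per-game penalty is the squared difference, a Brier score is the average of those over the scored games, and lower is better, so the leaderboard value is the smaller of the two selected scores.

```python
def squared_error(pred, first_team_won):
    """Squared difference between the prediction and the 1/0 result."""
    actual = 1.0 if first_team_won else 0.0
    return (pred - actual) ** 2

# A confident 0.75 pick is rewarded if right and punished if wrong.
print(squared_error(0.75, True))   # 0.0625
print(squared_error(0.75, False))  # 0.5625
# Always predicting 0.5 costs 0.25 per game either way.
print(squared_error(0.5, True))    # 0.25

def leaderboard_score(score_a, score_b):
    """Better (lower) of the two selected submissions' scores."""
    return min(score_a, score_b)

print(leaderboard_score(0.19, 0.21))
```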

wicked trail
#

Hello Everyone!
I am Shashank, with 3+ years of experience in the domain of data - I am very much interested in creating and deploying end-to-end machine learning models.
Going deeper, models related to artificial intelligence also fascinate me, both to work on and as a way to deliver proper solutions for businesses.

Students/professionals with the same interests can connect with me on my LinkedIn: https://www.linkedin.com/in/snkp0018

Happy Learning!
Best,
Shashank Pandey

finite lily
#

everyone

#

everyone

#

every

brave stratus
finite lily