A New Madness to March: Creating a Bracket Optimization Model

Submitted by mgoDAB on March 16th, 2022 at 3:09 AM

Hi all,

Happy March Madness! Hopefully your brackets are all starting to come together by now. But before they do, I wanted to share a bracket optimization model I’ve put together. It’s something that began as a silly hobby that utilized a nonsensical cauldron of random stats and facts for each team, and it has morphed into a full-blown obsession of trying to solve the enigma of March Madness using advanced analytics. For years, I’ve diligently studied the work of great sports writers and statisticians (Ken Pomeroy, Bart Torvik, the writers here at MGoBlog, Ed Feng, Michael Lewis’ Moneyball (of course!), and TeamRankings to name a few) to better understand analytics in sports.

All of this modeling is done using Excel and Visual Basic (which is the programming language that exists within Excel). Without boring you with the details, I make models like these for a living. And when I have the spare time, I enjoy learning and writing about sports statistics.

Getting on with it, the model (which I sarcastically named The Formula) consists of three major components:

1. I created a predictive ensemble model consisting of the most widely used advanced team rating systems including Kenpom, Torvik, and BPI. And in doing so, I can derive win probabilities for every possible matchup and examine the percentage chance for every team to advance to each round of the tournament.

This ensemble method isn’t too dissimilar from the analysis FiveThirtyEight does each year. Each rating system measures, in some form, how many points better a team is than the average NCAA team. But each model is slightly different and has its own quirks, so combining several of them is generally considered more predictive than relying on any single one.

Now if I had all the time and data in the world, I would backtest these weights against game-log data from the regular season. But this’ll do for now. I chose weights of 2 for Kenpom, 2 for Torvik, and 1 for BPI because the former two are generally open book (data is easily accessible and they each have blogs dedicated to explaining in extreme detail how their rating systems work). BPI, on the other hand, doesn’t explain much about how the sausage is made, but its rating system is still easy enough to extrapolate and include in the ensemble model.

Voila!
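For anyone who wants to tinker outside of Excel, here is a minimal Python sketch of the weighting described above. The 2/2/1 weights come from the post; everything else is an assumption made for illustration: the function names, the made-up ratings, the idea that all three ratings are already on a common points scale, and the normal-curve conversion from rating gap to win probability with an 11-point standard deviation. It is not the workbook's actual formula.

```python
import math

# Weights from the post: 2 for KenPom, 2 for Torvik, 1 for BPI.
WEIGHTS = {"kenpom": 2.0, "torvik": 2.0, "bpi": 1.0}

# Assumption: how far actual margins scatter around the predicted margin, in points.
MARGIN_STD_DEV = 11.0


def ensemble_rating(ratings):
    """Weighted average of one team's ratings (points better than an average team)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / total_weight


def win_probability(rating_a, rating_b, std_dev=MARGIN_STD_DEV):
    """Chance that team A beats team B on a neutral court, via a normal CDF (assumed link)."""
    expected_margin = rating_a - rating_b
    return 0.5 * (1.0 + math.erf(expected_margin / (std_dev * math.sqrt(2.0))))


# Made-up ratings for two hypothetical teams, not real 2022 numbers.
team_a = ensemble_rating({"kenpom": 20.0, "torvik": 20.0, "bpi": 10.0})
team_b = ensemble_rating({"kenpom": 12.0, "torvik": 14.0, "bpi": 13.0})
print(team_a, team_b, round(win_probability(team_a, team_b), 3))
```

Backtesting the weights, as mentioned above, would amount to trying different values in `WEIGHTS` and checking which set best predicts past games.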

2. Having the advanced analytics as a foundation is a great start, but it isn’t necessarily enough to maximize the chances of winning a pool. The funny thing about March Madness is that it’s not like you’re going up against Vegas, which is efficiently priced and designed to win over the long run. No, the majority of people you’re actually competing against are novices that are susceptible to significantly over- and under-valuing teams. It’s important to exploit those misalignments in value in order to stack the odds in your favor.

ESPN publishes data on the frequency at which the public is selecting each team to advance to each round. And from there, I can run a series of simulations consisting of:

  • A simulation of the NCAA Tournament using the predictive ensemble model
  • Simulations of competing bracket entries assuming pool sizes of 10, 25, 50, 80, and 100 people. From there, I can assign point values to those competing bracket entry simulations corresponding to how they performed relative to the tournament simulation from the predictive ensemble model.

All together, I run 100,000 simulations of competing bracket entries.

3. The final step in the process is creating an input page whereby I can make selections for my personal bracket entry. And in doing so, I can compare how my bracket entry performs against thousands of simulated pools and derive probabilities for how a specific bracket entry would perform in pools of 10, 25, 50, 80, and 100 people.

And through this process I can examine, round by round, the calculated risks that are statistically justified in improving my odds of winning a pool.

I use a series of dropdown lists to make my selections, and then refresh the bracket entry to be compared to the thousands of simulations performed.
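Steps 2 and 3 can be sketched in a few dozen lines of Python. This is a toy version under stated assumptions, not the Excel/VBA model itself. The 4-team field, the ratings, the public pick rates, and the 10/20 scoring are all made up. The rule that a simulated opponent picks each game in proportion to the two teams' public pick rates is my own simplification. Ties for first place are split evenly, which is also an assumption. All function and variable names are hypothetical.

```python
import math
import random

# Toy 4-team field in bracket order: A plays B, C plays D, then the winners meet.
TEAMS = ["A", "B", "C", "D"]
RATING = {"A": 20.0, "B": 12.0, "C": 18.0, "D": 15.0}  # made-up ensemble ratings
# Made-up public pick rates: share of entries picking each team to win round 0 and round 1.
PUBLIC = {"A": [0.90, 0.70], "B": [0.10, 0.02], "C": [0.60, 0.20], "D": [0.40, 0.08]}
ROUND_POINTS = [10, 20]  # assumed points per correct pick in each round


def model_win_prob(a, b, std_dev=11.0):
    """Assumed normal-CDF link from the rating gap to a win probability."""
    gap = RATING[a] - RATING[b]
    return 0.5 * (1.0 + math.erf(gap / (std_dev * math.sqrt(2.0))))


def play_bracket(choose, rng):
    """Fill in a bracket round by round; choose(a, b, rnd, rng) names each game's winner."""
    alive, winners_by_round, rnd = list(TEAMS), [], 0
    while len(alive) > 1:
        alive = [choose(alive[i], alive[i + 1], rnd, rng) for i in range(0, len(alive), 2)]
        winners_by_round.append(set(alive))
        rnd += 1
    return winners_by_round


def actual_game(a, b, rnd, rng):
    """One simulated real game, decided by the model's win probability."""
    return a if rng.random() < model_win_prob(a, b) else b


def public_pick(a, b, rnd, rng):
    """Assumption: an opponent picks between two teams in proportion to public pick rates."""
    pa, pb = PUBLIC[a][rnd], PUBLIC[b][rnd]
    return a if rng.random() < pa / (pa + pb) else b


def score(entry, result):
    """Points for an entry: round value times the number of correctly picked winners."""
    return sum(pts * len(picked & won) for pts, picked, won in zip(ROUND_POINTS, entry, result))


def pool_win_rate(my_entry, pool_size, n_sims, seed=0):
    """Share of simulated pools that my_entry wins, splitting ties for first evenly."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_sims):
        result = play_bracket(actual_game, rng)
        mine = score(my_entry, result)
        rivals = [score(play_bracket(public_pick, rng), result) for _ in range(pool_size - 1)]
        best_rival = max(rivals, default=-1)
        if mine > best_rival:
            wins += 1.0
        elif mine == best_rival:
            wins += 1.0 / (1 + rivals.count(best_rival))
    return wins / n_sims


# My entry: A and C win their first games, C wins the title (a less popular champion).
MY_ENTRY = [{"A", "C"}, {"C"}]
for n in (10, 25):
    print(n, round(pool_win_rate(MY_ENTRY, n, 2000), 3), "baseline", round(1 / n, 3))
```

The post's pool sizes (10, 25, 50, 80, 100) and 100,000 simulations can be swapped in for the small values used here. Changing `MY_ENTRY` and rerunning plays the same role as changing the dropdown selections and refreshing.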

 

Conventional wisdom would be that someone should have a 1/n chance of winning a pool, where n = the number of people in a pool (e.g., if I’m in a pool of 10 people, I should assume that I have a 1/10 or 10% chance of winning). Using this model, I multiply those odds of winning by 3-5 times on average. Beyond just looking at simulation results, though, I’ve also assembled a short list of what I’ll call “advanced bracket selection stats”.

None of these is more important than:

  • E[P] (Expected Points) = (probability that a team reaches a certain round) * (point value of that round)
  • E[PAP] (Expected Points Against the Public) = (probability that a team reaches a specific round) * (1 – rate at which the Public is selecting the team to reach the corresponding round) * (point value of the round)

E[PAP] can also be interpreted as the expected incremental point value gained over the national average bracket entry. This is where we distinguish the statistically justified picks that can maximize the odds of winning a pool.
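The two formulas above translate directly into code. Here is a small Python sketch. The two picks and their public pick rates are hypothetical, and the 80-point value for reaching the Final 4 is an assumed round value used only for illustration.

```python
def expected_points(p_reach, round_points):
    """E[P]: probability of reaching the round times the points for that round."""
    return p_reach * round_points


def expected_points_against_public(p_reach, public_rate, round_points):
    """E[PAP]: E[P] scaled by the share of the public NOT making the same pick."""
    return p_reach * (1.0 - public_rate) * round_points


# Hypothetical inputs: a popular favorite and a less popular contender.
FINAL_FOUR_POINTS = 80  # assumed point value for reaching the Final 4
picks = {
    "favorite": {"p_reach": 0.24, "public_rate": 0.40},
    "contender": {"p_reach": 0.22, "public_rate": 0.10},
}
for name, pick in picks.items():
    ep = expected_points(pick["p_reach"], FINAL_FOUR_POINTS)
    epap = expected_points_against_public(
        pick["p_reach"], pick["public_rate"], FINAL_FOUR_POINTS
    )
    print(f"{name}: E[P]={ep:.2f}  E[PAP]={epap:.2f}")
```

With these made-up numbers, the contender gives up 1.6 expected points (17.6 vs. 19.2) but gains about 4.3 points against the public (15.84 vs. 11.52), which is the kind of trade-off E[PAP] is meant to surface.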

Here is the full glossary of statistics that I track:

 

But what's a diary without any predictions??

As a case study, let’s consider the South Region of this year's bracket. Below are the probabilities and point-related statistics corresponding to each team’s outlook of reaching the Final 4. 

South Region - To Reach the Final 4

Arizona has the best odds of coming out of the region at 24.0%.

And before I go much further, we need to address Houston, which has a 22.8% chance of making the Final 4. While only a 5 seed and certainly lacking stellar wins, the advanced team rating systems love Houston (currently rated #4 overall on Kenpom, #2 on Torvik, and #2 on BPI). Houston is also #3 in the NET Rankings.

And Tennessee, meanwhile, ranks #3 in my ensemble ratings and has a 22.2% chance of making the Final 4.

In the cases of both Houston and Tennessee, the E[P] for selecting these teams to reach the Final 4 is slightly lower than Arizona’s. However, the E[PAP] for these two picks is considerably greater than Arizona’s.

Thus, I should expect that by swapping out Arizona for either Houston or Tennessee, my odds of winning the pool will increase.

Final 4 – Arizona Replaced with Houston

Before:

After:

Final 4 – Arizona Replaced with Tennessee

Before:

After:

Indeed!

I apply this method across all possible selections in the tournament until I’ve maximized the probability of winning a pool. Pretty neat! It’ll certainly be interesting this year with Gonzaga as a clear favorite to win the whole thing at a 25.4% chance, followed immediately by 7 teams that are within a 5.5%-7.5% chance to win it all.

If you made it this far through the diary, thank you for reading! If you are at all interested in using the tool to help with your own brackets, you can leave your email in the comments and I’m happy to share. It's a Macro-Enabled Excel file, so unfortunately I cannot upload it here to MGoBlog. You’ll receive an email from me ([email protected]) and the model will include instructions on how to use it.

Good luck with your pickings!

Comments

MGoStretch

March 16th, 2022 at 9:52 PM ^

This is pretty slick, it's too late for me to use in my pools, but you have to come back and let us know how the practical application worked out for you. I hope you win 'em all.

Minent Domain

March 18th, 2022 at 4:49 PM ^

I'd love to check this out: [email protected].

Curious how you'd think about playing in multiple pools; if only the number 1 gets a prize, so you'd highly prioritize winning either, what level of overlap is advisable (because high probability outcomes), or how helpful is it to mix up the finalists?

Wally Llama

March 15th, 2023 at 2:49 PM ^

Thanks for sharing your hard work! I'm geeking out on this!

The only question I have so far is about how this tool accounts for different scoring systems. I assume that a 1-2-4-8-16-32 system will have a better optimal selection than a 1-1-1-1-1-1 system.

Does this tool have an input for scoring per round? If not, what does it use?

ETA: I've found it: it's a built-in 10-20-40-80-160-320 system.