OT: Team that wins game X wins series Y% (best-of-7)

Submitted by johnthesavage on

Hello, mgoblog. I've always wondered about these statistics, which we hear about this time of year re: NBA and NHL playoff series, which say something like the team that wins game one (in the NBA) wins the series 78% of the time -- this is actually something I've heard tonight. Of course, the team that wins any particular game wins the series most of the time, so without knowing what % we would expect for each game purely by chance, we can't assess the significance of something like this.

Tonight, I decided to solve this problem. It's very easy to do, and maybe others have done it, but I thought I'd share it here, because I'm a U-M grad, and I'd guess someone here will find it interesting. Cheers.

I wrote a simple code to simulate best-of-seven series in which the result of each game was pure chance -- a 50% coinflip. I don't know how to analytically solve this problem, but luckily in the modern age that's not an important skill. You just have to find a way to brute force it stupidly with a computer. So I simulated 100 million best-of-7 series in this way, in which the result of each game was totally random. Then I asked what is the probability of winning the series, given that you have won game X?
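For anyone who wants to try this at home, here is a minimal Python sketch of that kind of simulation. It is not the code used for the numbers in this post (that code was not shared); the function name `simulate`, the fixed seed, and the much smaller default of 100,000 series are all assumptions chosen so it runs in a second or two.

```python
import random

def simulate(num_series=100_000, seed=0):
    """Estimate P(win series | won game g) for g = 1..7 with 50/50 games."""
    rng = random.Random(seed)
    played = [0] * 7    # number of series in which game g was played
    matched = [0] * 7   # ...and the winner of game g also won the series
    for _ in range(num_series):
        wins = [0, 0]
        winners = []                 # winner of each game actually played
        while max(wins) < 4:         # stop as soon as one team has 4 wins
            w = rng.randrange(2)     # pure coin flip
            wins[w] += 1
            winners.append(w)
        champion = 0 if wins[0] == 4 else 1
        for g, w in enumerate(winners):
            played[g] += 1
            if w == champion:
                matched[g] += 1
    return [matched[g] / played[g] for g in range(7)]

if __name__ == "__main__":
    for g, rate in enumerate(simulate(), start=1):
        print(f"Game {g}: {rate:.2%}")
```

Each game is only counted in series where it was actually played, which is what makes games 5 through 7 behave differently from games 1 through 4.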

I don't know why this is the answer, but this is the answer:

Game 1: 65.6%

Game 2: 65.6%

Game 3: 65.6%

Game 4: 65.6%

Game 5: 67.85%

Game 6: 75.00%

Game 7: 100%

I'm not sure why these numbers look like they do; I'd be interested if anyone here feels like wading into the theory. But it seems like, for the first few games of a best-of-seven series, pure chance suggests that winning a game predicts a 65% chance of winning the series. Anything significantly above that would suggest that a particular game -- like game one in the NBA -- carries unusual importance. Of course, this is only a preliminary, superficial analysis that does not take into account important factors such as home advantage. But it is an important first step which I have never seen anyone actually take.

jakegoblue

April 20th, 2014 at 2:24 AM ^

Pretty sure that's the percentage of teams that won in series prior to that date, and that's where their stat is coming from. Not from conceptual math, just the data.

johnthesavage

April 20th, 2014 at 5:30 PM ^

One final follow up.

I modified the code to make some changeable assumptions:

1) Home teams win more (for this post I assumed a 20% increase in win probability), and the series follows a 2-2-1-1-1 format.

2) The team with home advantage is better (for this post I assumed they would win 65.6% of neutral court games)

This is probably a more analogous situation to the early rounds of the NBA playoffs, which often feature mismatches, and of course home court advantage is significant.

With these assumptions, simulating 100 million series gives:

Game 1: 77.35%

Game 2: 77.36%

Game 3: 61.01%

Game 4: 61.02%

Game 5: 77.98%

Game 6: 80.53%

Game 7: 100%

In short, turning up either of these parameters (the size of home-court advantage, and the disparity between the two teams) increases the correlation between game 1 wins and series wins. However, the effects are different for other games. Increasing the disparity between the two teams increases the correlation for ALL games, as the better team just wins them all more frequently, and wins the series more frequently. Increasing the size of the home-court advantage increases the series win correlation only for games in which the better team is at home -- it will increase the correlations for games 1 and 2 but actually decrease the correlations for games 3 and 4.

So, if the NBA has a very high game 1 correlation, it could be due to either of these factors (large home-court advantage, or typically large disparity between the teams). If the correlations continue to be above chance for all games, it suggests the real culprit is the disparity between the teams, and less so the home court advantage effect.

What's clear is that we should expect some very large correlations, which might vary as the series goes on, even if every game is an independent event. Concluding that game one is an especially important game, because the winner wins 78% of series, is probably the wrong idea.
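A rough Python sketch of this kind of model, for anyone who wants to play with the two parameters. Several things here are assumptions rather than the setup actually used above: the "20% increase" is read as multiplying the home team's neutral-court win probability by 1.2, the seeded team hosts games 1, 2, 5 and 7, and the numbers are computed by exact enumeration of all 128 win/loss patterns instead of by simulation. The function names are made up for illustration, and the output need not match the figures quoted above if the original model was defined differently.

```python
from itertools import product

HOME_GAMES = (1, 2, 5, 7)  # 2-2-1-1-1 format: games hosted by the seeded team

def seeded_win_prob(game, p_neutral, home_factor):
    """Seeded team's chance in one game (ASSUMED multiplicative home boost)."""
    if game in HOME_GAMES:
        return min(1.0, p_neutral * home_factor)
    # On the road, the opponent gets the boost instead.
    return 1.0 - min(1.0, (1.0 - p_neutral) * home_factor)

def game_winner_series_rates(p_neutral=0.656, home_factor=1.2):
    """P(winner of game g wins the series), given game g is played, g = 1..7."""
    probs = [seeded_win_prob(g, p_neutral, home_factor) for g in range(1, 8)]
    played = [0.0] * 7
    matched = [0.0] * 7
    # Pretend all 7 games are always played; the first team to 4 wins is
    # then the same team that wins at least 4 of the 7.
    for outcome in product((True, False), repeat=7):
        weight = 1.0
        for seeded_won, p in zip(outcome, probs):
            weight *= p if seeded_won else 1.0 - p
        seeded_takes_series = sum(outcome) >= 4
        seeded_wins = other_wins = 0
        for g, seeded_won in enumerate(outcome):
            if seeded_wins == 4 or other_wins == 4:
                break  # series already decided, later games never happen
            played[g] += weight
            if seeded_won == seeded_takes_series:
                matched[g] += weight
            if seeded_won:
                seeded_wins += 1
            else:
                other_wins += 1
    return [m / pl for m, pl in zip(matched, played)]

if __name__ == "__main__":
    for g, rate in enumerate(game_winner_series_rates(), start=1):
        print(f"Game {g}: {rate:.2%}")
```

Setting `p_neutral=0.5` and `home_factor=1.0` collapses this back to the pure coin-flip case, which is a handy sanity check before turning either knob.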

 

johnthesavage

April 20th, 2014 at 11:26 AM ^

Yes, so that's another thing that could be considered. For example, people often say something like the team that wins game 5 of a tied series blah-blah .. something like that is trivial to calculate. These problems could be solved the same way.

For example, suppose I want to consider the same question, but for a series that is already tied 1-1. To simulate that, I just ran 100,000,000 best-of-five series and got these results:

Game 3: 68.75%

Game 4: 68.76%

Game 5: 68.75%

Game 6: 75.00%

Game 7: 100%

So we can see that, given that the first two games are split, the third game does become *more* predictive of the result of the series, but only very slightly, and no more so than game four.

Many similar analyses could be done to answer similar questions. But we can see here, that already just by chance, in a 1-1 series winning each game predicts nearly a 70% chance of winning the series.
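The game 3 number can also be checked without any simulation. Winning game 3 of a 1-1 series puts a team up 2-1, so the question becomes the chance of winning a coin-flip series from a given score. Below is a small Python sketch of that recursion; the function name `series_win_prob` is made up for illustration, and it assumes every game is an independent 50/50.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob(wins, losses, target=4):
    """Chance of reaching `target` wins first from this score, all games 50/50."""
    if wins == target:
        return Fraction(1)
    if losses == target:
        return Fraction(0)
    half = Fraction(1, 2)
    # Either we win the next game or we lose it, each with probability 1/2.
    return (half * series_win_prob(wins + 1, losses, target)
            + half * series_win_prob(wins, losses + 1, target))

print(series_win_prob(2, 1))         # 11/16
print(float(series_win_prob(2, 1)))  # 0.6875
```

The same function answers "what if the series is already 2-0" type questions by changing the starting score.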

johnthesavage

April 20th, 2014 at 11:49 AM ^

Fair enough, but actually I'd like to defend the method. Another poster has beautifully demonstrated the analytical answer, and I appreciate that. However, it's a rather involved calculation, and it must be redone every time a similar question comes up.

The brute-force method is much faster and simpler, and actually gets you the same answer. It is also flexible, as I can easily change a few lines of code and simulate a series that is already 2-0 for team1, or any such thing. I can put in assumptions about home advantage and quickly see what happens.

One more benefit which would be fun to see someone like Nate Silver actually do (he doesn't seem very interesting at the sports stuff so far), is that this method is a ready made significance test for the real data. If I REALLY want to know how important a particular game is in an NBA series, I might want to actually have a p-value associated with my observation of game 1 -- specifically that the game 1 winner wins the series 78% of the time. What is the significance of that?

If I know how many series have been "sampled" (the total number of NBA game 1 results that go into that 78% number), then I set my simulation to run that many series, and I see what I get for game 1. For example, let me take a wild guess that this number is based on 100 series. I simulate 100 series, and see what the correlation is for game 1 wins and series wins. Then I do this over and over. I get numbers like --

First run: Game 1 = 70%

Second run: Game 1 = 59%

Third run: Game 1 = 66%

Fourth run: Game 1 = 60%

Now I simply ask how many of these numbers are as extreme (away from the expected 65.6% which I also brute-forced) as the observed value -- 78%. If it's less than 5%, voila, I have a p-value of less than .05 and thus a significant result, according to typical standards. Stupid yes, but probably stupider not to use tools like this.
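A minimal Python sketch of that significance test, under the same coin-flip null. The 100-series sample size is the wild guess from above, not a real count, and the function names, seed, and number of repeated runs are assumptions for illustration.

```python
import random

NULL_RATE = 0.65625  # game 1 winner's series win rate if every game is 50/50

def game1_winner_takes_series(rng):
    """Play out one coin-flip series from 1-0 and report if the leader wins."""
    leader, trailer = 1, 0
    while leader < 4 and trailer < 4:
        if rng.random() < 0.5:
            leader += 1
        else:
            trailer += 1
    return leader == 4

def p_value(observed=0.78, num_series=100, num_runs=5_000, seed=0):
    """Share of simulated samples at least as far from NULL_RATE as observed."""
    rng = random.Random(seed)
    observed_gap = abs(observed - NULL_RATE)
    extreme = 0
    for _ in range(num_runs):
        hits = sum(game1_winner_takes_series(rng) for _ in range(num_series))
        if abs(hits / num_series - NULL_RATE) >= observed_gap:
            extreme += 1
    return extreme / num_runs

if __name__ == "__main__":
    print(f"two-sided p-value: {p_value():.4f}")
```

This only tests against the pure coin-flip model; a null with home advantage or mismatched teams would need the single-series function swapped out accordingly.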

swan flu

April 20th, 2014 at 8:30 AM ^

Alternatively, you could look at it as a binomial distribution question and compute P(A wins series | A wins game X).

 

Which, on Excel, would be "=1-binomdist(2,6,0.5,true)"

 

2=number of successes

6=number of trials

0.5=p(success)

true=cumulative.

 

This will give you the Probability of A winning between 0 and 2 games, meaning Team A loses the series... So you do 1-that answer to get the P(A wins series).

 

Doing it this way removes the need to do different calculations for the series ending in 5, 6, or 7 games, because we use the "at least" term. The computer will still compute all 6 remaining games, but as the numbers show, the results are the same... and it's a lot less work.
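For those without Excel handy, the same calculation can be sketched in Python with the standard library. The helper name `binom_cdf` is made up here to mirror the cumulative form of the Excel function, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(at most k successes in n independent trials with success chance p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Team A already has one win, so it needs at least 3 of the other 6 games.
print(1 - binom_cdf(2, 6, 0.5))  # 0.65625
```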

 

swan flu

April 20th, 2014 at 9:32 AM ^

Assuming that we are talking about a series of independent and identically distributed trials (which sports are surely not), it won't produce extraneous results. Here's why:

 

By using "at least" what you're saying is that everything after the 4th win is irrelevant, the computer is basically multiplying its answer by 1 for every game after that 4th win since a win or a loss by team A still results in team A winning the series.

 

The order is incorporated into the binomial distribution calculations done by Excel.

 

Considering the answer given by Excel is identical to the one done by hand, and identical to the one produced by the OP's code, I'd say it's pretty accurate.

 

 

m1jjb00

April 21st, 2014 at 6:57 PM ^

At least figuring it out for game 5. The question can be re-asked as: what is the probability of winning at least 3 of the other 6 games, conditional on having to play game X? For games 1-4, the condition of having to play game X doesn't matter, because every seven-game series reaches at least game 4. But it's different for game 5. In that case you have to keep in mind that you've already ruled out having gone 0-4 or 4-0 in the first four games.

Game 6 is pretty easy. In order to have to play game 6, you would have to be 2-3 or 3-2 through the first 5 games, and we can reason that the probability of having gone 2-3 and 3-2 is each 1/2. The probability of winning a series having won game 6 = prob of winning the series having gone 2-3 through the first 5 and then winning game 6 + prob of winning the series having gone 3-2 through the first 5 and then winning game 6. That = prob of going 3-2 times prob of winning series after winning game 6 and having gone 3-2 + prob of going 2-3 times prob of winning series after winning game 6 and having gone 2-3. That = 1/2 * 1 + 1/2 * 1/2 = 3/4.
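The same reasoning (average over the possible scores going into game X, skipping any score where the series is already over) can be sketched for every game at once in Python. The function name `win_series_given_won_game` is made up for illustration, and it assumes all games are independent 50/50s.

```python
from fractions import Fraction
from math import comb

def win_series_given_won_game(x):
    """P(win best-of-7 | won game x and game x was played), all games 50/50."""
    before = x - 1       # games played before game x
    remaining = 7 - x    # games that could still follow game x
    reached = Fraction(0)   # chance that game x is played at all
    total = Fraction(0)     # chance it is played and its winner takes the series
    for w in range(before + 1):
        losses = before - w
        if w >= 4 or losses >= 4:
            continue     # series already decided, so game x never happens
        weight = Fraction(comb(before, w), 2**before)
        need = 4 - (w + 1)   # wins still needed after taking game x
        # Pretend every remaining game is played: need at least `need` of them.
        win_after = Fraction(
            sum(comb(remaining, i) for i in range(need, remaining + 1)),
            2**remaining,
        )
        reached += weight
        total += weight * win_after
    return total / reached

for game in range(1, 8):
    print(game, win_series_given_won_game(game))
# games 1-4: 21/32, game 5: 19/28, game 6: 3/4, game 7: 1
```

19/28 is about 67.86%, which lines up with the simulated figure for game 5.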

LSAClassOf2000

April 20th, 2014 at 7:14 AM ^

The theoretical probability is indeed 0.65625 as the OP says, but this made me recall that there is also a site out there where someone tries to add another layer to the analysis (by specific series record and by sport) and then provided historic probabilities of series victory as well. The table also shows the difference when the winner of the previous game is the home team or the visiting team too. The table for "Leading 1-0" is here - LINK - and the links to the other tables are on the sidebar.

For example, in the NBA, as far as this person's data goes back anyway, the Game 2 record for teams that played game 1 at home is 228-108 regardless of playoff round and the historic win probability for that team in such a series is 0.848. Interesting data. 

alum96

April 20th, 2014 at 8:11 AM ^

Outside of this statistical analysis, I always thought game 4 of most series was the most critical. If you exclude series where there is a heavy favorite up 3-0, most series will be 2-1 at that point, and game 4 means either that one team will have a huge statistical advantage up 3-1, or that the teams will be 2-2 and it's basically a best of 3 from there. So it always struck me as a huge game in any series where the opponents were relatively competitive, because based on that game either one team has no margin for error for 3 games or it's even steven.