You asked for it, you got it.
A few thoughts:
- I think that this is an interesting (and relevant) topic to study, and I think you have made a good start at it.
- Your mega-table at the end should be a Google Doc and you can give it to us as a link. As it stands right now, I'm not sure why you put it in there. It seems like it is simply your dataset, and if you want to encourage me to play with it, then put it in a play-able format and share it that way.
I'm not sure of the value of the Tourney PPG stat. It is an average of 2 numbers (3 in the case of VCU and La Salle, the First Four teams that have made the Sweet 16). I would guess that the average should go up for 2 reasons:
- Mostly you get higher seeds. #1 seeds often crush #16 seeds and usually make the Sweet 16.
- They just won 2 games. I'm guessing that if you randomly sample two wins from a team's season and average them you will often get a number higher than the season scoring average.**
This seems like it might be a good application for a generalized linear model. Your basic question is:
Does the number of points scored depend on whether a game is played in a dome or an arena?
So you'd probably build a model like this:
pts in S16 = avg pts + opp (def) avg pts + ARENA
And ask if the ARENA variable is significant.
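To make that concrete, here is a rough sketch of the idea, not a finished analysis. It fits the model by ordinary least squares (the simplest special case of a generalized linear model) in plain Python and reports a t statistic for the ARENA term. Everything in it is an assumption for illustration: the games are synthetic, the column names avg_pts, opp_def_pts and dome are made up, and the 4-point dome penalty is planted so there is something to detect.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients and standard errors for y ~ X.

    X must already contain a column of 1.0 for the intercept.
    """
    n, k = len(X), len(X[0])
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(XtX, Xty)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(k)) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)  # residual variance
    se = []
    for j in range(k):
        # j-th diagonal entry of (X'X)^-1, obtained by solving against a unit vector.
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append((sigma2 * solve(XtX, unit)[j]) ** 0.5)
    return beta, se

# Synthetic games (NOT real tournament data), with a planted dome effect of -4 points.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    avg_pts = rng.gauss(70, 5)       # team's season scoring average (hypothetical)
    opp_def_pts = rng.gauss(65, 5)   # opponent's points allowed per game (hypothetical)
    dome = rng.randint(0, 1)         # ARENA indicator: 1 = dome, 0 = regular arena
    X.append([1.0, avg_pts, opp_def_pts, float(dome)])
    y.append(0.5 * avg_pts + 0.5 * opp_def_pts - 4.0 * dome + rng.gauss(0, 5))

beta, se = ols(X, y)
t_arena = beta[3] / se[3]  # |t| well above 2 suggests the ARENA term matters
print("ARENA coefficient:", round(beta[3], 2), "t statistic:", round(t_arena, 2))
```

With your real data you would swap the synthetic loop for one row per Sweet 16 team-game. A stats package would also give you a proper p-value and let you try other link functions.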
** I decided to investigate this statement further. I took Michigan's results this season to look at stability of averages. Here are a few facts I found:
- The average number of points Michigan scored this season (75.1 ppg) was not significantly different from the average number of points Michigan scored in their wins (77.9 ppg). A Student's t-test does not support the hypothesis that avgppg in wins > avgppg (p=0.1603).
- If we randomly sample two winning scores and compare that average to the average ppg, we find that the hypothesis avg2winppg > avgppg is supported (p<10^-15). This tells me that the mean of 2 data points is worthless: picking two wins is enough, by itself, to push the average up.
- The max point differential in my bootstrap sample (I sampled 1000 trials) was +22.4, and the minimum point differential in my bootstrap sample was -15.1. These are similar to the values that you found in your data (+21.8 and -13.5).
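If you want to repeat this kind of check yourself, a minimal sketch follows. The scores in it are invented for illustration and are not Michigan's actual results. It also assumes the two wins are drawn with replacement, which is the usual bootstrap convention. The function name two_win_diffs is just a placeholder.

```python
import random
import statistics

# Hypothetical scores, invented for illustration (not Michigan's real season).
win_scores = [79, 91, 67, 73, 94, 80, 71, 88, 62, 83,
              75, 68, 95, 76, 70, 81, 74, 66, 85, 77]
loss_scores = [58, 72, 53, 67, 61, 78, 63]
season_avg = statistics.mean(win_scores + loss_scores)

def two_win_diffs(wins, baseline, trials=1000, seed=0):
    """Draw two winning scores per trial and return (pair mean - baseline) for each trial."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        pair = rng.choices(wins, k=2)  # sampling with replacement (assumption)
        diffs.append(statistics.mean(pair) - baseline)
    return diffs

diffs = two_win_diffs(win_scores, season_avg)
print("season avg:", round(season_avg, 1))
print("mean differential:", round(statistics.mean(diffs), 1))
print("min / max differential:", round(min(diffs), 1), "/", round(max(diffs), 1))
```

The spread between the min and max differential is the thing to look at: if it is as wide as the differences in your table, a 2-game average cannot tell you much.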
In the end, my point is that I think you should try some sort of generalized linear model to support your analysis. Going back as far as you have, I believe, gives you the data set to explore this question. I think you should do it!