Update at bottom
Update 2 at bottom
Note: This is a long and complex read. I know that. I'm looking for assistance with a project I'm working on that I know everyone will be interested in. If you wish to skip all of the reading, I have summarized everything in bullet points at the bottom.
I had hoped to keep this my little secret until I was completely done and I could unveil everything at once, but I no longer believe that I could do this project as efficiently without some other input. As an engineer, I require myself to do everything with as high efficiency as possible so I must petition the MGoBlog community for help.
As many (more likely all since you're on a site like this) of you are aware, there have been more and more threads being posted which essentially go down as so:
Poster 1: "We're going after slot-dot X and he's only 3 stars! Argh! Doesn't RichRod understand he's not at WVU anymore and he needs to get MICHIGAN quality recruits? RichRod=Fail."
Poster 2: "Stars don't matter, obviously RichRod thinks that he's good enough and that's good enough for me."
Poster 3: "Rankings are early, they'll change, just settle down for now"
Poster 2: "He's only 3 stars but look at his offer sheet, I'd take someone that's 3 stars with offers from USC, OSU, UF, 'Bama, etc. over a 5 star with offers from us and the MAC."
Poster 1: "Stars do matter, you need talent!"
Poster 2: "Mike Hart, Braylon Edwards... nuff said"
And so on and so on.
So, I started thinking about rankings and their usefulness at predicting future college and pro success. To that end, I'm going to undertake what I believe will be the largest statistical analysis of recruiting rankings to date. But I need some help.
Let me describe what I'm planning on doing, what I've already done to accomplish that goal, and what I still need to do. Then I'll finally be able to show everyone what I need help with. You'll also be informed enough to offer criticisms, advice, and ask questions if necessary.
1- What I plan on doing
I'm going to take all recruiting data from Scout and Rivals from 2002-2009. As of right now, that includes: name, positional rank, number of stars, HT/WT/40, position, hometown, and home state. I'm then going to also compile data on how many starts each player had in each year of his career, if he redshirted, if he left early for the draft (manifested as number of years of eligibility remaining), the number of All-Conference honors received, and the number of All-American honors received. I will also take information on if they were drafted, what round they were drafted in, what overall number they were drafted as, what position they were drafted for, and what team they went to.
Once I have all of that data, I will first do a top-level analysis to see, independent of everything else, how star rankings alone are at predicting collegiate and pro success as defined by the stats that I will have collected above.
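A minimal sketch of what that top-level pass might look like, assuming each player has been reduced to a star rating and a single yes/no outcome (here, whether he was drafted). The records below are made-up placeholders, not real recruits, and `draft_rate_by_stars` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical toy records: (player, stars, was_drafted). Not real data.
players = [
    ("Player A", 5, True),
    ("Player B", 5, False),
    ("Player C", 4, True),
    ("Player D", 4, False),
    ("Player E", 4, False),
    ("Player F", 3, False),
    ("Player G", 3, True),
    ("Player H", 3, False),
    ("Player I", 3, False),
]


def draft_rate_by_stars(records):
    """Return {stars: fraction of players with that rating who were drafted}."""
    totals = defaultdict(int)
    drafted = defaultdict(int)
    for _name, stars, was_drafted in records:
        totals[stars] += 1
        if was_drafted:
            drafted[stars] += 1
    return {stars: drafted[stars] / totals[stars] for stars in totals}


rates = draft_rate_by_stars(players)
for stars in sorted(rates, reverse=True):
    print(f"{stars} stars: {rates[stars]:.0%} drafted")
```

The same grouping would work for any other outcome (starts, All-Conference honors) by swapping what the third field means.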
From then on, I will keep trying to dig further to get more and more relevant models and conclusions. This will include but will not be limited to how the average ranking of a player's teammates (independent of that player's own ranking) affects collegiate/pro success, the number of blue-chip recruits that completely fail, the number of blue-chip recruits that leave their home state, the average team ranking, success of rankings at predicting success at each individual position, the effect of positional ranking on future success, etc.
I'm going to try to come up with as many ways as possible to analyze the data that either decouples the data or gives conclusions that are independent of coupling. Figuring out how to do that will be difficult but fun.
As a side note, this will also let me eventually compare Scout and Rivals to say with some authority, whose [final] rankings are more accurate.
Of course, I will also apply standard statistical analysis procedures to determine if my conclusions could be deemed statistically significant or not (I don't know with what percent confidence yet so don't ask).
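One standard procedure that would fit a comparison like "are 4-stars drafted more often than 3-stars" is a two-proportion z-test. The sketch below is only an illustration of that test, not necessarily the one that will end up being used, and the counts in the example call are invented.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Assumes both samples are large enough for the
    normal approximation and that the pooled proportion is not 0 or 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal, via the complementary
    # error function: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 30 of 100 four-stars drafted vs. 50 of 400 three-stars.
z, p = two_proportion_z_test(30, 100, 50, 400)
print(f"z = {z:.2f}, p = {p:.5f}")
```

A small p-value here would say the gap between the two groups is unlikely to be chance alone; it says nothing about why the gap exists.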
2- What I have done
It's all well and good to have thought all of this out (I'd be willing to wager that at least one other person currently reading this has thought about it too), but thinking alone won't get any of us anywhere. So, I've started to do a lot of the grunt work as a sign of my commitment so that people will understand that I'm dedicated enough to make helping me worth their time.
I have already collected all of the information from Rivals for every class and every player.
So, for the classes from 2002-2009, I have every name, positional rank, Rivals Rating (RR), star rating, position (as Rivals breaks it down), and what school they committed to.
I have also created an Excel spreadsheet template that will allow me (once I get all of that data) to merely copy and paste a few things from Rivals and all of the data that I have on every player will be retrieved. With that, I will be able to create a spreadsheet for every BCS team (as Rivals only has complete listings for BCS teams) which will have every class and all of the data for each kid in every class all in one spot. Then I'll be able to do my analyses more easily.
3- What I still need to do
Obviously I'm still not done with the collecting data/grunt work as I still have to take all of Scout's data. It's taking a little while because of the way that they format their data compared to Rivals. Fortunately, I have solved the problem and can now do the usual copy and paste (followed by several other things to make it all work).
I'm also considering grabbing data from ESPN but I'm really not sure if it's even worth it. They only have data from 2007-2009 (I believe), so that doesn't even include a class that has been drafted yet.
More importantly, I need to find a source for the other data that I'm trying to collect. I need to find some place(s) that list all of the following:
- If a player redshirted
- Number of starts each year
- Every All-Conference team (not just first team) for all BCS conferences starting from 2002
- Every All-American team starting from 2002
- Every transfer since 2002
- What position each player was drafted for
- Individual player positional statistics (e.g. completion percentage, interceptions, tackles for loss, etc.)
There is also some other data that I’m going to try and collect but I already have sources for that so it need not be listed here.
4- What I need help with
I need help finding the data that I list above. Pieces of it are available everywhere but I haven’t found a single site that has a repository of all the information implied in even just one of those points above.
Additionally, getting individual statistics is extremely hard, even though it would allow more comparisons than possibly anything else. There are literally tens of thousands of players. There were over 1000 wide receivers in 2009 alone! There are simply too many players to go to each player's individual profile page somewhere and collect the data. I, unfortunately, require lists. That is, unless there is some tool or way to automate the data collection process. I myself know of no such way, but that is one of the reasons that I’m asking the MGoBlog community for help: I don’t necessarily know everything that I could do to make this project as easy as possible (at least on the data collection front).
I’d also like to find a way to collect data on all of the schools that have officially offered a kid a scholarship to see if there is some way to show whether stars or scholarship offers are, statistically speaking, the best measure of a kid’s future ability. Again, I can’t go to every Rivals profile page to try and collect that data. This is one area where I feel that since the pages are so similar, it might be possible to write some sort of script to do the work for me. Unfortunately, I’m a ChemE and MSE person, not a CSE person (for those of you outside engineering, that’s Chemical Engineering, Materials Science and Engineering, and Computer Science and Engineering respectively), so I don’t know what tool or utility I would use to accomplish that. I am in Tech. Services so I’m sure that if someone pointed out to me the appropriate tool and maybe some documentation on how to use it then I wouldn’t have any problems.
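For what it's worth, one possible shape for such a script is sketched below using only Python's standard library. Everything about the page layout is an assumption: the `<li class="offer">` markup and the sample page are invented for illustration, and a real profile page would need its own HTML inspected first (and the site's terms of use checked) before anything like this could be pointed at it.

```python
from html.parser import HTMLParser


class OfferListParser(HTMLParser):
    """Collect the text of every <li class="offer"> element.

    The class name "offer" is a made-up placeholder, not the markup of
    any real recruiting site.
    """

    def __init__(self):
        super().__init__()
        self.offers = []
        self._in_offer = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "offer") in attrs:
            self._in_offer = True
            self.offers.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_offer = False

    def handle_data(self, data):
        if self._in_offer:
            self.offers[-1] += data


def extract_offers(html):
    """Return the list of schools found in one page's HTML text."""
    parser = OfferListParser()
    parser.feed(html)
    parser.close()
    return [offer.strip() for offer in parser.offers]


# Invented sample page standing in for one player's profile.
sample_page = """
<html><body>
<h1>Hypothetical Recruit</h1>
<ul>
  <li class="offer">Michigan</li>
  <li class="offer">Ohio State</li>
  <li class="other">Not an offer</li>
</ul>
</body></html>
"""

print(extract_offers(sample_page))
```

Because the profile pages are so similar to one another, the same function could in principle be run in a loop over a list of saved pages, writing one row per player to a spreadsheet-friendly file.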
I know that what I wrote above was long so here’s the summary (whether you read everything preceding this or not).
I’m going to perform a statistical analysis on Scout and Rivals to determine how good their final star ratings and positional rankings are at predicting future success both in college and the pros. To do so, I have already collected the data from Rivals and am currently working on collecting data from Scout. I will probably not take data from ESPN although that is not a certainty.
To determine collegiate success I will take data that includes but is not limited to All-Conference honors, All-American honors, and the number of starts. To determine pro success I will take into consideration where a player was drafted and for what position.
I know where to acquire some of the information that I need but I still need help finding useful places to take large amounts of data on:
- All-Conference teams
- What position each player was drafted for
- Number of starts by each player
I would also like to find a way to automate data collection, specifically with an eye towards collecting data on what schools offered each kid a scholarship. Since there are tens of thousands of kids this cannot be done individually but must somehow be automated. I do not know how to do that and am thus asking for help. The same situation applies for collecting individual, position-specific statistics on each kid.
If anyone would like to help me out with what I have asked, then I would greatly appreciate it. Any criticisms will be well-received (or at least as well-received as I can) and taken into account. Any comments or other thoughts are also welcome and appreciated.
For more information, read the sections above.
Since so many people have responded with helpful ideas, if you wish to contact me with anything that you either don't want to post in the comments, is too long and complicated for the comments, or that you wish to have a more private dialogue about then email me at:
That's not my main email so I won't check it as often (i.e. not every 20 minutes) but I'll try to check it at least once a day. If you want to send me anything, links or other work that you've done that might help me, then send it there.
Thanks for all the great ideas and please keep them coming. I'm still thinking about ways to handicap teams that have a lot or a little talent relative to the average (for reasons that are too long to fully explain in this update, although there are some interesting thoughts on why and how in the comments below). I'm also looking for ways to automate the data collection process. There are a few suggestions below but I'm going to be looking for more so please tell me.
Again, I prefer using the comments if possible but if not then email me.
Update 2: 3-27-09
Well, it's been pointed out in the comments and confirmed by me that the email address listed above doesn't work. That's because I had a small typo. Of course, small typos in email addresses are big typos.
Anyways, the correct email address is: Bleedin9Blue@gmail.com
If you tried emailing me earlier with the previous email address then please try again. I appreciate your patience.
Maybe one day I'll bother to do an introductory post... but it's the internet so who cares who I am.
Anyways, I thought that if you're going to write something about sports, why not make it about one of the most stupid parts of sports. That's right, the much maligned BCS system needs replacing and it's high time someone insulted my idea as well as the current one.
Obviously the biggest problem is that the BCS creates an MNC rather than an NC (for those of us who don't like acronyms, that's Mythical National Champion vs. National Champion (a for-realsy one)). This comes about because there are 119 DI teams and more are being added (I can think of a non-DI team that's been pretty good...). Thus, it's difficult to truly compare teams to one another. As we saw last year, it's possible to lose 2 games in the SEC and make the National Championship Game and yet have a bevy of 1-loss (and in Hawaii's case 0-loss) teams not make the cut. Is the SEC better than C-USA? Of course. Are they better than the ACC? Yep. Big East? Also yes. Pac 10... are we counting USC? Big Ten... depends upon if you look at Michigan's or OSU's record... The Big 12? Who knows? (It's not like the right team from the Big 12 will get to the BCS!)
The point is, as we already knew, when you have people vote it's perception that matters rather than how good a team actually is. So, we need a system that will be acceptable but take care of that problem.
Being "acceptable" entails two things:
1- Not a playoff
2- Preserve the "sanctity" of the bowls
By "sanctity" I mean that the Rose Bowl and to some degree the other BCS bowls have to not lose their uniqueness and luster. What that comes down to meaning is that you can't use the bowls as playoff bowls (explained below) with different names. You have to keep the Pac 10 vs. Big Ten, and other such "Big 6" conference tie-ins.
What I mean by you can't use the BCS bowls as playoff bowls is that you can't institute a playoff system with 8 teams and then stick them in the various bowl locations according to their seeding then continue until you get your champion.
No, there has to be a system that keeps the tie-ins while actually accomplishing something.
Quite possibly the biggest problem with the BCS as it stands is that pollsters have to choose a number 1 and number 2 team. Often (i.e. every year but one since the BCS started) there have been at least 3 teams that can legitimately claim to be the second best or best team in the country. But y'all already knew that.
My idea hinges on the thought that it'll be pretty rare that there are more than 8 teams that might be good enough to be 1 or 2. Thus, we go back to just putting the conference champs in their respective bowl. That would put the Pac 10 and Big Ten champs in the Rose Bowl, the SEC champ in the Sugar Bowl, the ACC champ in the Orange Bowl, and the Big 12 champ in the Fiesta Bowl. The Big East then is pitted against either the ACC, SEC, or Big 12 champ by taking an at large bid (as they currently do). I would consider changing this but part of the appeal of this plan is how little must actually be changed.
For the final 2 at large bids, if Notre Dame finishes in the top 8 they'll automatically get an at large bid. If a non-"Big 6" team is in the top 8, they too will get an automatic bid. If there are two non-"Big 6" teams in the top 8, only the higher ranked of the two will get the automatic bid.
Then, whether there be one or two at large bids left, those bids will go to the highest ranked teams that aren't already going to a BCS bowl (regardless of whether they're in the "Big 6" or not). The only stipulation would be that no conference could have more than 2 teams in the BCS bowls. This would only come into effect if Notre Dame finished ranked 9 or lower and there was no non-"Big 6" team in the top 8 and one conference had the two highest ranked teams that weren't conference champs.
I probably won't go back and check the standings, but I doubt that this has ever happened nor is it likely to happen.
Let me show what would've happened using last year's final BCS rankings before the bowl games were played. Note, for space I'll just show the top 10; nobody below there would affect this system.
1. Ohio State
2. LSU
3. Virginia Tech
4. Oklahoma
5. Georgia
6. Missouri
7. USC
8. Kansas
9. West Virginia
10. Hawaii
The Champs were:
Big Ten: Ohio State
SEC: LSU
ACC: Virginia Tech
Big 12: Oklahoma
Pac 10: USC
Big East: West Virginia
Taking out the conference champs, the list would look like this:
5. Georgia
6. Missouri
8. Kansas
10. Hawaii
Therefore Georgia and Missouri would take the 2 at large bids since neither Notre Dame nor a non-"Big 6" team finished in the top 8 and they [Georgia and Missouri] were the two highest ranked teams left. If Georgia for whatever reason had been ranked lower (say 12), then Missouri and Hawaii would've taken the two at large bids. Kansas would've been skipped over because if Kansas got the final at large bid, the Big 12 would've had 3 teams in the BCS with Oklahoma, Missouri, and Kansas.
Who the at large bids play is based on their final rank: the highest ranked at large bid team will play the lowest ranked conference champ from the ACC, Big 12, or SEC. The lowest ranked at large bid will play the highest ranked conference champ.
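The at large rules can be written down as a short procedure. The sketch below is one reading of them, not an official algorithm: `pick_at_large` is a hypothetical name, the conference labels are my own tagging, the top ten is the final 2007 BCS standings, and independents are assumed to be exempt from the two-per-conference cap.

```python
BIG_SIX = {"ACC", "Big 12", "Big East", "Big Ten", "Pac 10", "SEC"}


def pick_at_large(rankings, champions, bids=2, cap=2):
    """Choose the at large teams.

    rankings:  list of (team, conference), best-ranked first.
    champions: dict mapping each "Big 6" conference to its champion.
    """
    champs = set(champions.values())
    per_conf = {conf: 1 for conf in champions}  # each champ already holds a spot
    chosen = []

    def take(team, conf):
        chosen.append(team)
        per_conf[conf] = per_conf.get(conf, 0) + 1

    top8 = [(t, c) for t, c in rankings[:8] if t not in champs]

    # Automatic bid: Notre Dame, if it finished in the top 8.
    for team, conf in top8:
        if team == "Notre Dame":
            take(team, conf)

    # Automatic bid: only the highest ranked non-"Big 6" team in the top 8.
    for team, conf in top8:
        if conf not in BIG_SIX and team != "Notre Dame":
            take(team, conf)
            break

    # Remaining bids: best teams left, at most `cap` teams per conference.
    for team, conf in rankings:
        if len(chosen) >= bids:
            break
        if team in champs or team in chosen:
            continue
        if conf != "Independent" and per_conf.get(conf, 0) >= cap:
            continue
        take(team, conf)
    return chosen


final_2007 = [
    ("Ohio State", "Big Ten"), ("LSU", "SEC"), ("Virginia Tech", "ACC"),
    ("Oklahoma", "Big 12"), ("Georgia", "SEC"), ("Missouri", "Big 12"),
    ("USC", "Pac 10"), ("Kansas", "Big 12"), ("West Virginia", "Big East"),
    ("Hawaii", "WAC"),
]
champions = {
    "Big Ten": "Ohio State", "SEC": "LSU", "ACC": "Virginia Tech",
    "Big 12": "Oklahoma", "Pac 10": "USC", "Big East": "West Virginia",
}

print(pick_at_large(final_2007, champions))  # ['Georgia', 'Missouri']
```

Moving Georgia to the bottom of the list reproduces the alternate scenario: Missouri takes the first bid, Kansas is skipped because the Big 12 would have three teams, and Hawaii takes the second.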
Thus, the final matchups (using the above final rankings) would be:
Rose: OSU vs. USC
Sugar: LSU vs. West Virginia
Orange: Virginia Tech vs. Missouri
Fiesta: Oklahoma vs. Georgia
The fact that ranking matters so much makes every game important. Even if a team has already won its conference and still has games left, it'll still want to win all of those games to make sure that it faces the "easiest" opponent possible.
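The pairing rule can be checked with a few lines of code. This is a sketch under one assumption that is my own reading: the Big East champ is treated as a third "visitor" alongside the two at large teams, which is what reproduces the matchups listed above. `pair_bowls` is a hypothetical name and the ranks are from the final 2007 BCS standings. The Rose Bowl is left out because it is always Big Ten vs. Pac 10.

```python
def pair_bowls(hosts, visitors, rank):
    """Pair the best-ranked visitor with the worst-ranked host champ, and so on.

    hosts:    dict mapping bowl name to its tied-in conference champ.
    visitors: the two at large teams plus the Big East champ (assumption).
    rank:     dict mapping team to final ranking (1 is best).
    """
    bowls_worst_first = sorted(hosts, key=lambda bowl: rank[hosts[bowl]], reverse=True)
    guests_best_first = sorted(visitors, key=lambda team: rank[team])
    return {
        bowl: (hosts[bowl], guest)
        for bowl, guest in zip(bowls_worst_first, guests_best_first)
    }


rank = {
    "LSU": 2, "Virginia Tech": 3, "Oklahoma": 4,
    "Georgia": 5, "Missouri": 6, "West Virginia": 9,
}
hosts = {"Sugar": "LSU", "Orange": "Virginia Tech", "Fiesta": "Oklahoma"}
visitors = ["Georgia", "Missouri", "West Virginia"]

matchups = pair_bowls(hosts, visitors, rank)
for bowl in ("Sugar", "Orange", "Fiesta"):
    home, away = matchups[bowl]
    print(f"{bowl}: {home} vs. {away}")
```

Sorting the two sides in opposite directions is what rewards a higher final ranking with the "easiest" available opponent.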
From those 4 games, you will then have 4 winners (obviously). The national championship game will still be held a week later. The two teams that go to the National Championship game are again chosen by the pollsters. There will be a final poll which ranks only those 4 winners, the top 2 get to play in the National Championship game. The other two don't play any more.
It's notable that this could allow two teams from the same conference to play each other. For example, say the winners of the bowls were USC, LSU, Missouri, and Georgia. Then the pollsters would have to rank those four teams. If the rankings were:
Then the National Championship game would be between LSU and Georgia.
This avoids it being a "playoff". It would essentially be the same old bowl games with an extra game afterwards. There will still be controversy, but probably significantly less controversy. Pollsters should have a very good idea of who the two best teams are of the four winners.
Although, I should admit ranking just those 4 teams was very hard and I think 3 out of the 4 would be good enough to play in the National Championship game.
The location of the National Championship game must, unfortunately, still be chosen before the teams playing in it are. This is simply because it's difficult to demand that cities prepare for a National Championship game that they will never host if their city isn't chosen.
I would consider saying that the National Championship game cannot be played at either of the bowls of the winners, but I think that people would be less likely to agree to that.
Does this still have controversy? Yes. Lots. Does it solve some problems? Yes. It doesn't solve them all, but it solves some. I advocate this system not because I believe it's the best, but because I believe it's the best that the people in control of such things would accept. The bowls would still be identical to how they are now in terms of their tie-ins (most important for the Rose Bowl) and there wouldn't be a playoff. No new games would need to be added.
The only rule that might need to be added is that the at large bids cannot allow 2 teams from the same conference to play each other in the BCS games. If that was added, then if two teams from the same conference would play each other, the two at large bids that aren't taken by the Big East champ would simply switch who they're playing.