Statistical analysis assistance

Submitted by Bleedin9Blue on March 25th, 2009 at 11:01 AM

Hello MGoBlog,

Update at bottom

Update 2 at bottom

Note: This is a long and complex read. I know that. I'm looking for assistance with a project I'm working on that I know everyone will be interested in. If you wish to skip all of the reading, I have summarized everything in bullet points at the bottom.

I had hoped to keep this my little secret until I was completely done and I could unveil everything at once, but I no longer believe that I could do this project as efficiently without some other input. As an engineer, I require myself to do everything with as high efficiency as possible so I must petition the MGoBlog community for help.

As many (more likely all since you're on a site like this) of you are aware, there have been more and more threads being posted which essentially go down as so:

Poster 1: "We're going after slot-dot X and he's only 3 stars! Argh! Doesn't RichRod understand he's not at WVU anymore and he needs to get MICHIGAN quality recruits? RichRod=Fail."
Poster 2: "Stars don't matter, obviously RichRod thinks that he's good enough and that's good enough for me."
Poster 3: "Rankings are early, they'll change, just settle down for now"
Poster 2: "He's only 3 stars but look at his offer sheet, I'd take someone that's 3 stars with offers from USC, OSU, UF, 'Bama, etc. over a 5 star with offers from us and the MAC."
Poster 1: "Stars do matter, you need talent!"
Poster 2: "Mike Hart, Braylon Edwards... nuff said"

And so on and so on.

So, I started thinking about rankings and their usefulness at predicting future college and pro success. To that end, I'm going to undertake what I believe will be the largest statistical analysis of recruiting rankings to date. But I need some help.

Let me describe what I'm planning on doing, what I've already done to accomplish that goal, and what I still need to do. Then I'll finally be able to show everyone what I need help with. You'll also be informed enough to offer criticisms, advice, and ask questions if necessary.

1- What I plan on doing

I'm going to take all recruiting data from Scout and Rivals from 2002-2009. As of right now, that includes: name, positional rank, number of stars, HT/WT/40, position, hometown, and home state. I'm then going to also compile data on how many starts each player had in each year of his career, if he redshirted, if he left early for the draft (manifested as number of years of eligibility remaining), the number of All-Conference honors received, and the number of All-American honors received. I will also take information on if they were drafted, what round they were drafted in, what overall number they were drafted as, what position they were drafted for, and what team they went to.

Once I have all of that data, I will first do a top-level analysis to see, independent of everything else, how star rankings alone are at predicting collegiate and pro success as defined by the stats that I will have collected above.
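A minimal sketch of what that top-level view could look like, assuming each player is reduced to a star rating and a drafted/not-drafted flag. The records below are made-up placeholders, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (star rating, was the player drafted?).
# These rows are invented placeholders, not real recruiting data.
players = [
    (5, True), (5, True), (5, False),
    (4, True), (4, False), (4, False), (4, False),
    (3, False), (3, True), (3, False), (3, False), (3, False),
]

def draft_rate_by_stars(records):
    """Return {stars: fraction drafted} for an iterable of (stars, drafted)."""
    totals = defaultdict(int)
    drafted = defaultdict(int)
    for stars, was_drafted in records:
        totals[stars] += 1
        if was_drafted:
            drafted[stars] += 1
    return {stars: drafted[stars] / totals[stars] for stars in totals}

rates = draft_rate_by_stars(players)
for stars in sorted(rates, reverse=True):
    print(f"{stars}-star: {rates[stars]:.0%} drafted")
```

The same grouping works for any other outcome (starts, All-Conference honors) by swapping the flag for a count and averaging it.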

From then on, I will keep trying to dig further to get more and more relevant models and conclusions. This will include but will not be limited to how the average rankings of the players around a given player (independent of that player's own rankings) affect collegiate/pro success, the number of blue-chip recruits that completely fail, the number of blue-chip recruits that leave their home state, the average team ranking, success of rankings at predicting success at each individual position, the effect of positional ranking on future success, etc.

I'm going to try to come up with as many ways as possible to analyze the data that either decouple the data or give conclusions that are independent of coupling. Figuring out how to do that will be difficult but fun.

As a side note, this will also let me eventually compare Scout and Rivals to say with some authority, whose [final] rankings are more accurate.

Of course, I will also apply standard statistical analysis procedures to determine if my conclusions could be deemed statistically significant or not (I don't know with what percent confidence yet so don't ask).
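As one illustration of what such a significance check could look like, the sketch below runs a two-sided, two-proportion z-test on whether two star tiers get drafted at different rates. The choice of test and the counts are assumptions for illustration only; nothing here is settled yet.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value) using the pooled-proportion standard error.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 30 of 100 five-stars drafted vs. 60 of 600 three-stars.
z, p_value = two_proportion_z_test(30, 100, 60, 600)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value would only say the two draft rates differ; it says nothing about why they differ.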

2- What I have done

It's all well and good to have thought all of this out (I'd be willing to wager that at least one other person currently reading this has thought about it), but thinking alone won't get any of us anywhere. So, I've started to do a lot of the grunt work as a sign of my commitment so that people will understand that I'm dedicated enough to make helping me worth their time.

I have already collected all of the information from Rivals for every class and every player.

So, for the classes from 2002-2009, I have every name, positional rank, Rivals Rating (RR), star rating, position (as Rivals breaks it down), and what school they committed to.

I have also created an Excel spreadsheet template that will allow me (once I get all of that data) to merely copy and paste a few things from Rivals and all of the data that I have on every player will be retrieved. With that, I will be able to create a spreadsheet for every BCS team (as Rivals only has complete listings for BCS teams) which will have every class and all of the data for each kid in every class all in one spot. Then I'll be able to do my analyses more easily.

3- What I still need to do

Obviously I'm still not done with the collecting data/grunt work as I still have to take all of Scout's data. It's taking a little while because of the way that they format their data compared to Rivals. Fortunately, I have solved the problem and can now do the usual copy and paste (followed by several other things to make it all work).

I'm considering also grabbing data from ESPN but I'm really not sure if it's even worth it. They only have data from 2007-2009 (I believe) so that doesn't even include a class that has been drafted yet.

More importantly, I need to find a source for the other data that I'm trying to collect. I need to find some place(s) that lists all of the following:

  1. If a player redshirted
  2. Number of starts each year
  3. Every All-Conference team (not just first team) for all BCS conferences starting from 2002
  4. Every All-American team starting from 2002
  5. Every transfer since 2002
  6. What position each player was drafted for
  7. Individual player positional statistics (e.g. completion percentage, interceptions, tackles for loss, etc.)

There is also some other data that I’m going to try and collect but I already have sources for that so it need not be listed here.

4- What I need help with

I need help finding the data that I list above. Pieces of it are available everywhere but I haven’t found a single site that has a repository of all the information implied in even just one of those points above.

Additionally, getting individual statistics is extremely hard, even though it would allow more comparisons than possibly anything else. There are literally tens of thousands of players. There were over 1000 wide receivers in 2009 alone! There are simply too many players to go to each player’s individual profile page somewhere and collect the data. I, unfortunately, require lists. That is, unless there is some tool or way to automate that data collection process. I myself know of no such way, but that is one of the reasons that I’m asking the MGoBlog community for help: I don’t necessarily know everything that I could do to make this project as easy as possible (at least on the data collection front).

I’d also like to find a way to collect data on all of the schools that have officially offered a kid a scholarship to see if there is some way to show whether stars or scholarship offers are, statistically speaking, the better measure of a kid’s future ability. Again, I can’t go to every Rivals profile page to try and collect that data. This is one area where I feel that, since the pages are so similar, it might be possible to write some sort of script to do the work for me. Unfortunately, I’m a ChemE and MSE person, not a CSE person (for those of you outside engineering, that’s Chemical Engineering, Materials Science and Engineering, and Computer Science and Engineering respectively), so I don’t know what tool or utility I would use to accomplish that. I am in Tech. Services, so I’m sure that if someone pointed out to me the appropriate tool and maybe some documentation on how to use it then I wouldn’t have any problems.

Summary

I know that what I wrote above was long so here’s the summary (whether you read everything preceding this or not).

I’m going to perform a statistical analysis on Scout and Rivals to determine how good their final star ratings and positional rankings are at predicting future success both in college and the pros. To do so, I have already collected the data from Rivals and am currently working on collecting data from Scout. I will probably not take data from ESPN although that is not a certainty.

To determine collegiate success I will take data that includes but is not limited to All-Conference honors, All-American honors, and the number of starts. To determine pro success I will take into consideration where a player was drafted and for what position.

I know where to acquire some of the information that I need but I still need help finding useful places to take large amounts of data on:

  1. Transfers
  2. All-Conference teams
  3. What position each player was drafted for
  4. Number of starts by each player

I would also like to find a way to automate data collection, specifically with an eye towards collecting data on what schools offered each kid a scholarship. Since there are tens of thousands of kids this cannot be done individually but must somehow be automated. I do not know how to do that and am thus asking for help. The same situation applies for collecting individual, position-specific statistics on each kid.

If anyone would like to help me out with what I have asked, then I would greatly appreciate it. Any criticisms will be well-received (or at least as well-received as I can manage) and taken into account. Any comments or other thoughts are also welcome and appreciated.

For more information, read the sections above.

Update: 3-26-09

Since so many people have responded with helpful ideas, if you wish to contact me with anything that you either don't want to post in the comments, is too long and complicated for the comments, or that you wish to have a more private dialogue about then email me at: [email protected]

That's not my main email so I won't check it as often (i.e. not every 20 minutes) but I'll try to check it at least once a day. If you want to send me anything, links or other work that you've done that might help me, then send it there.

Thanks for all the great ideas and please keep them coming. I'm still thinking about ways to handicap teams that have a lot or a little talent relative to the average (for reasons that are too long to fully explain in this update, although there are some interesting thoughts on why and how in the comments below). I'm also looking for ways to automate the data collection process. There are a few suggestions below but I'm going to be looking for more so please tell me.

Again, I prefer using the comments if possible but if not then email me.

Update 2: 3-27-09

Well, it's been pointed out in the comments and confirmed by me that the email address listed above doesn't work. That's because I had a small typo. Of course, small typos in email addresses are big typos.

Oops.

Anyways, the correct email address is: [email protected]

If you tried emailing me earlier with the previous email address then please try again. I appreciate your patience.

Comments

Tacopants

March 25th, 2009 at 12:20 PM ^

A couple of tips for you on your analysis:

I think you're making this a little more complicated than you have to. A lot of your statistics will shed a good deal of light on recruiting rankings, but some of the measurables won't really tell you anything. FAKE 40 times are of course fake, and should not be counted unless you get inside info on all 119 teams. Other things like comp%, ypc, tackles may be indicative of overall team success rather than individual accomplishment. A 3* 4 year starter at Tulane at MLB will accumulate more tackles than a 4* 2 year starter at USC. Likewise, you'll get mundane stats from meaningless games (Yay 20 TFL, Boo 18 of them being against MAC teams) that really won't show you that much.

Also, the more factors you put in, the more muddled your final result will probably turn out to be. A good example of this would be redshirting. Sometimes a player is redshirted only to preserve his eligibility. Other times, the player needs development. Then there's the possibility that the player really deserved a redshirt ability-wise, but the team couldn't afford it. And still other times the team unwisely burns a redshirt year for special teams play or some such nonsense. Finally, you have the possibility that the player was ready to play and the team had a need for that player to play. Until you can figure all of these things out, knowing if a player redshirted will give you minimal information for the amount of work required.

Lastly, make some IOE friends. This is what we do. MINITAB is great software, available on CAEN or ITCS computers, and has a 30 day free trial for you to mess around with.

Good Luck.

Bleedin9Blue

March 25th, 2009 at 12:32 PM ^

One of the reasons that I'm first going to do a top-level analysis is to just show those results and hope that such a large amount of data will even out the extremes.

I agree, there are a lot of factors that essentially cannot be compared only to each other. But, they might still be useful. For the individual stats for example, if I could compare against the same competition then it might be useful. This would start to wander away from star rankings but could still be interesting and still be brought back to stars. Of course, that would then require game-by-game stats which makes everything even more complicated. But one of the reasons I'm doing this as my pet side-project is because it's complex.

I'll definitely check out MINITAB.

As for the 40 times, I have that data because it would actually be harder to not take that data. You're right, as little stock as possible should probably be put into them.

Asquaredroot

March 26th, 2009 at 5:12 AM ^

for you, but I'd think the most relevant data would be to plot the number of years spent on an NFL roster in relation to the number of years as a professional athlete (i.e., after college) against rivals and scout ratings from high school.

That's it. There's your baseline indicator of rankings accuracy, regardless of position.

Anyway, good luck, and parse to your heart's content.

blueheron

March 26th, 2009 at 12:44 PM ^

Assuming that the data can be acquired and analyzed in a rigorous sort of way, I'll be shocked if he doesn't find that the SEC's recruiting base winds up being slightly overrated relative to that of, say, Minnesota.

The biggest fans of college football "anything" are down south. Rivals and Scout know this and probably do everything they can to please those customers. Rating six SEC recruiting classes in the top ten every year will help.

Bleedin9Blue

March 26th, 2009 at 12:54 PM ^

That brings up something else I meant to mention but completely forgot when I wrote above. I'm also going to see where (state wise) each recruit comes from. I know this has already been done (there's a map of where all 5-stars ever came from, it's what you'd expect). But, I don't think anyone has taken it further than that, I want to see how each state does with their top end recruits actually succeeding in the pros and college.

As for plotting time in the NFL versus rankings, I'm not sure that it would entirely work as Scout/Rivals only started in 2002 so they've only had kids in the NFL that they actually rated for a few years. Several of those top-end kids will be in the NFL for several more years so it might give an inaccurate view. For instance, say 5 stars average a longer span in the NFL. That would only show up to a minimal degree since the end of each of those players' careers hasn't come. So if the average 4 star makes it 4 years but the average 5 star makes it 7 years in reality, we wouldn't know that because 7 years hasn't elapsed since the first kids that Scout/Rivals ranked made it to the NFL. Instead, it would look like 4 and 5 stars are actually closer to each other when in reality they aren't.
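The problem described above (careers that have not ended yet) is usually called right-censoring, and one standard way to handle it is a Kaplan-Meier survival estimate. The sketch below is a bare-bones version under that assumption; the durations and the function name are hypothetical, and nothing in this thread commits to this method.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time where a career end was observed.

    durations: seasons on an NFL roster so far, one entry per player
    observed:  True if the career has ended, False if the player is
               still active (right-censored)
    """
    event_times = sorted({t for t, ended in zip(durations, observed) if ended})
    survival = 1.0
    curve = []
    for t in event_times:
        # Players whose careers lasted at least t seasons are still "at risk".
        at_risk = sum(1 for d in durations if d >= t)
        ended = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - ended / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical careers: three have ended, three are still active.
durations = [1, 2, 2, 3, 4, 4]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"still on a roster after season {t}: {s:.2f}")
```

Active players still count toward the "at risk" pool for the seasons they have completed, so a 5 star who is four seasons in is not treated as if his career ended at four.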

Bleedin9Blue

March 26th, 2009 at 8:35 PM ^

That [simply counting the number of players of each star rating that got drafted and/or are still on an NFL roster] is a valid point and was one of the first things I thought of when doing this. It will be one of the findings that I will present but it won't be good enough, in my mind at least, to fully reveal the accuracy of rankings and their usefulness in predicting future success.

It will be a very good top level view, especially if I figure out the percentage of each star rating at each position that makes it to the NFL, but that doesn't show collegiate success. Also, even though there are many players, there would only be about 4 years worth of data which can be thought of as only 4 data points. 4 data points isn't enough to draw authoritative conclusions from.

Don't worry though, I'm going to be presenting many results and many conclusions so hopefully everyone will be happy... or at least not unhappy.

Asquaredroot

March 26th, 2009 at 11:12 PM ^

I just hope you don't bite off more than you can chew and leave it to sit on the back burner without ever presenting any of the data.

I'm getting overwhelmed just reading about all the data you're going to parse. I wish you joy of it.

Please start with that top level view so the lazy folks like myself can read about it before you start drilling down deeper.

El Jeffe

March 25th, 2009 at 2:05 PM ^

Being a professional data geek myself, I like the basic idea behind what you're proposing. However, it strikes me that you have to be careful with the structure of your models.

(Much of the following could be interpreted as pedantic, so let's agree that I'm just helping you with your thinking, not telling you things you don't already know).

Essentially, stars/position rankings are variables that sit in the middle of a very long causal chain, starting with player characteristics coming out of high school (only some of which are measurable) and ending with college stats, pro draft position, or pro success, depending on how you define the dependent variable(s).

So then the question becomes, what do stars/position rankings actually measure? One thought is that they should be largely, if not entirely, correlated with pre-college stuff, and therefore should have no independent effect on the dependent variables, once all that pre-college stuff is accounted for.

One could, I suppose, argue that if a 5* is taken by some school, that will affect the likelihood that the coach gives him playing time (thereby allowing him to perform on the field and increasing his chances of being drafted--in demography we would say that playing time "exposes a player to the risk" of success). After all, he's a fucking 5*! But, maybe that doesn't happen too much (see Grady, Kevin).

So therefore, the independent "effect" of star ranking should theoretically be zero, assuming you've specified the predictors of star rankings correctly. To the extent that star ranking shows up as a significant predictor of football outcomes, it suggests that you haven't done this (correctly specified the model).

This implies, then, that you should not include those predictors, so that you're really just assessing the accuracy of the rankings. Maybe this is what you had in mind all along, but the whole "player pages from Scout and Rivals with fake 40 times" led me to think otherwise.

I also wonder whether you can simplify your analysis somewhat by using two sorts of dependent variables:

1. Statistical rankings of players each year in college. To do this, you would want several variables for each position. You would want to control for years of experience here, obviously, and also probably a dummy variable for each team-year (i.e., UM-2005, UM-2006, etc.). These two controls would net out the fixed effects of team characteristics, such as schedule strength and overall team performance.

2. Pro draft round (probably a Tobit model because some kids don't get drafted). Here you would be making the assumption that guys who get drafted probably did pretty well in college, else they wouldn't have been drafted. This essentially would allow you to ignore all the college stat stuff, which will be a big pain in the arse to gather, and would also allow you to look at non-skill players, who don't have easily available stats (I'm sure teams and scouts keep track of pancake blocks and shit, but it's probably rough to get those data).
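The team-year dummy idea in item 1 can be sketched without a regression package: subtracting each team-year's mean from every variable is the usual "within" shortcut for fitting one dummy per team-year. The example below shows that step for a single stat, using invented rows and a hypothetical function name.

```python
from collections import defaultdict

# Hypothetical rows: (player, team-year, tackles). Not real stats.
rows = [
    ("A", "UM-2005", 80), ("B", "UM-2005", 60),
    ("C", "Tulane-2005", 120), ("D", "Tulane-2005", 100),
]

def demean_by_group(records):
    """Subtract each team-year's mean stat from its players' stats."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _, group, value in records:
        sums[group] += value
        counts[group] += 1
    return {
        player: value - sums[group] / counts[group]
        for player, group, value in records
    }

adjusted = demean_by_group(rows)
print(adjusted)
```

After the adjustment, players are compared only against teammates from the same season, so a raw tackle gap between two different teams no longer shows up as an individual difference.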

Finally, are you sure this hasn't been done already by some economist in some sports economics journal? There's a guy named Alan Sanderson at the University of Chicago who was a sports economist. Might be worth googling him and asking him if he's aware of similar analyses. Wouldn't be as much fun, but it might save you a lot of time.

Best of luck!

Bleedin9Blue

March 25th, 2009 at 5:21 PM ^

First off, thank you very much for this comment, there's a lot in it to consider.

I mean this analysis to first show how good star rankings are at predicting future success. I entirely agree that the star rankings themselves SHOULD have no effect on the eventual success (or failure) of a player. This would be supported by coaches like RichRod essentially always saying that he'll play the best athletes and no starting position is ever safe. This is not always the case though. Sometimes the best athletes don't get played. This could be because a coach thinks that a player has so much potential that it's worth the initial drop in talent for the eventual upturn when that potential begins to be realized, a player might do something to make themselves unable to play (a broken leg, breaking team rules so there's a suspension, having to show the players who's boss, etc.), etc. All of that is independent of star rating, definitely, but it is something to consider.

I've actually been trying to figure out how best to account for the possibility of a good player A sitting behind even better player B. Normally A would play but B is so good that the normal situation isn't there. Which is one of the reasons that I'll definitely have to find who transfers where.

Anyways, any model I make should always show that stars have no effect on the outcome but I do expect that they will be a predictor of the outcome (that's the whole point of this analysis and the whole point of the stars). And you're right, stars are basically a normalized aggregate of all of the high school factors (size, weight, build, speed, football IQ, mechanics, discipline, work ethic, etc.). Thus, it probably wouldn't be a good idea to include those factors that go into creating the star rankings into any model that uses the star rankings as predictors of success. Doing so would essentially "double count" certain attributes which is not the goal.

But, I'm still going to take that data because there's no reason not to and because it might be fun to try and find a model that is better at predicting future success independent of stars. Of course, since high school data is so scarce, it would mainly have to rely on college stats which begins to make it somewhat redundant because so many people already do that. But, then I could compare my model to the predictions from the stars and see what happens.

In sum, I think that I already agreed with your points, you merely enumerated them in a way which I either could not or did not.

As for your final suggestion, I have gotten Alan Sanderson's email address and will email him informing him of what I'm doing and if he knows of anything similar to it. Thanks for pointing me towards him.

Any other suggestions would be great as you obviously know at least a little bit about data analysis.

big gay heart

March 25th, 2009 at 2:47 PM ^

R is a statistical package that is free. I've heard good things about it. Also, SPSS is the social science industry standard.

You could use these programs in a variety of ways, but they would be most helpful in determining direction and relationship strength.

Bleedin9Blue

March 25th, 2009 at 4:32 PM ^

I have a limited knowledge of it and even some experience. Fortunately, I have access to it so I'll give it a try. As I said, right now I'm still gathering all possible data (just as a variety of Excel sheets) so I don't necessarily need SPSS yet. I know that I can just import my data from Excel though so it's not a big concern right now.

I will definitely at least look at SPSS once the analysis part really gets going.

Tully Mars

March 25th, 2009 at 11:20 PM ^

I've used R for a bit of my research and it's a nice (free!) program to use for all kinds of statistical analysis. It's open source and built by statisticians, which is a bonus. It is the program that all statisticians tend to use (as opposed to SPSS, which is what social scientists who do statistics use). It is very similar to Matlab in syntax and organization. So if you're familiar with Matlab you'll have no problem picking it up.

Rodriguez Fami…

March 26th, 2009 at 4:46 PM ^

SPSS is archaic, but hard to mess up if you know what statistical tests you need to run and how to interpret them. I'd recommend SAS version 9.2, which you have access to as a student. Can be purchased for $40 (student price) at the union if you have a PC. SAS is the standard statistical analysis tool.

bluebrains98

March 25th, 2009 at 3:19 PM ^

UM is such an awesome school. Find one other college football blog in this country with people who are this number-savvy and care this much about football.

And, I should say, you were right Bleedin9Blue in that a lot of us have thought of this. And, I am glad you are getting good feedback.

My two cents is that even if the star-ratings are confounded by the likelihood a coach plays them or there is some sort of self-fulfilling prophecy on the part of the programs getting 5* recruits, that will hash itself out in your analysis. Such confounds will result in a minimal correlation between star-ratings and performance, thereby taking down Rivals/Scout and letting me be much more productive at work. Thanks for doing this!

Bleedin9Blue

March 25th, 2009 at 4:36 PM ^

Would I really WANT to take down Scout and Rivals? No. That's what I do a lot at work too. And, thanks to a new phone, what I do while on buses and anytime I'm waiting for something.

My Firefox tabs are ordered thematically with football being the last theme. I have them in this order: MGoBlog, Scout, Rivals, Varsity Blue, GBMW, and WLA. If I somehow dismantled Scout/Rivals with this, my life would get a lot more boring.

Thus the only logical conclusion is to stop all this now and destroy all my data.

AC1997

March 25th, 2009 at 3:49 PM ^

I like Mini-Tab for this as well. Some people also like JMP ("jump").

I agree with the response saying that you don't want to bite off more than you can chew. Start with a reasonable base for your analysis - you can always add columns to your spreadsheet later. I worry with something this daunting whether you'll get overwhelmed in the data collection and by the time you start analyzing you've lost passion for it.

Pick some of the critical attributes you want to compare, check to make sure the raw data is reliable (# of starts should be reliable, 40-times will not), and then look for trends.

Good luck!

Bleedin9Blue

March 25th, 2009 at 4:39 PM ^

Originally I was only going to gather data on, and compare star ratings to, number of starts and where a player is drafted. It was only after I started gathering data that I began to make this more and more complex.

Fortunately, gathering data has only heightened my passion and the positive responses and ideas that I've gotten here also only help.

Unfortunately, but fortunately for everyone else, I won't have a job at the end of April so I'll have a lot more time to spend on this (when I'm not doing job hunting). So, I'd expect that that's when some good conclusions might start to appear.

joeyb

March 25th, 2009 at 4:18 PM ^

What language are you going to do this in?

Also, ESPN had a contest where they gave you all the data in CSVs and wanted you to predict winners of games each week.

http://winningformula.espn.com/

That's the site, but you will need to find the data because I don't remember where it was. I think it goes back to 2004.

a non emu

March 25th, 2009 at 4:49 PM ^

I have been thinking about doing something like this for a while now. But instead of manually compiling these things, I was going to take the shortcut out and write a script (I am reasonably comfortable with Python, but this can be anything) to harvest all the data from rivals and scout. I have a few ideas on where to go from here, but nothing's concrete. One of the first things that I thought about was having the script look up each person on wikipedia. If a page exists, you automatically get a few points. In addition, parse the page to collect NFL draft and round information, if any, and assign points based on if, and what round, the player got drafted in.

The part that I was still missing was the comprehensive database on player college stats. I am sure something of the sort exists somewhere, so if someone knows please let me know. I would like to think that coding this up would take maybe a couple of weeks. Doing that I avoid the pain of manually entering all of this into a spreadsheet and staring at the data. Plus, it would be a fun coding project and I would learn some useful skills.

So yeah, parts missing - an acceptable formula to define "success" - some W1*A1+W2*A2... = X, where W's are the weights you assign to each quantifiable indicator of success A, and a good player stats database.
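The W1*A1+W2*A2... formula above is just a weighted sum, so it is easy to prototype once the indicators are chosen. A minimal sketch follows; the indicator names and weights are placeholders picked for illustration, not an agreed definition of success.

```python
def success_score(indicators, weights):
    """Weighted sum W1*A1 + W2*A2 + ... over the named indicators."""
    return sum(weights[name] * value for name, value in indicators.items())

# Hypothetical weights (W) and one player's indicator values (A).
weights = {"starts": 1.0, "all_conference": 5.0, "all_american": 10.0, "drafted": 20.0}
player = {"starts": 24, "all_conference": 1, "all_american": 0, "drafted": 1}

score = success_score(player, weights)
print(score)
```

Keeping the weights in one dictionary makes it cheap to rerun the whole analysis under several competing definitions of success.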

Bleedin9Blue

March 25th, 2009 at 4:49 PM ^

Looking up each person on wikipedia I don't think would necessarily work for this endeavor because the only people on wikipedia would be college legends, college kids that got huge hype and are still in college (Pryor), or NFL guys. Then I'd only have data on the guys that made it and then mostly just NFL stats (although at least some college stats would have to be there). For better or worse, there are thousands of kids that no one outside of the super-fans for that team knows exist and thus they don't have a wikipedia page.

If it was possible to create a script that could just go to: http://michigan.rivals.com/viewprospect.asp?sport=1&pr_key=80311 where the last number simply incremented 1 every time the script went and collect all of the data on the page, including offers, then that would do a lot of the data collection for me.

If you, or anyone else, could recommend a way to do that, then I know that I could start this analysis much more quickly.

As for the formula for success... this project is going to have to have multiple conclusions based around the idea of the different ways of categorizing success. Certainly being taken high in the draft will be one of those ways, as will having a high number of starts, but it might also be prudent of me to try and create a dynamic model that takes guys that I define as being successful at a specific position and creating a formula out of that that can predict success for future kids. Part of that would be what you were talking about with the W1*A1+W2*A2...

Most importantly though, if someone could help me at least start on how to figure out how to write a script to collect all of the data, then that would be greatly appreciated.
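A rough sketch of the incrementing pr_key idea, using only the Python standard library. The URL pattern is the one quoted above; everything about the page layout is an assumption (here, that the interesting values sit in plain table cells), so the parsing step would need to be adapted after looking at a real profile page.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://michigan.rivals.com/viewprospect.asp"

def profile_url(pr_key):
    """Build the profile URL for one pr_key value."""
    return f"{BASE_URL}?{urlencode({'sport': 1, 'pr_key': pr_key})}"

class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

def extract_cells(html):
    """Return the non-empty, stripped text of each table cell."""
    parser = CellTextParser()
    parser.feed(html)
    return [cell.strip() for cell in parser.cells if cell.strip()]

def fetch(pr_key, delay_seconds=1.0):
    """Download one profile page, pausing first to go easy on the server."""
    time.sleep(delay_seconds)
    with urlopen(profile_url(pr_key)) as response:
        return response.read().decode("utf-8", errors="replace")

# Offline demonstration on a made-up HTML snippet (no network access needed).
sample = "<table><tr><td>Offers</td><td>Michigan, Ohio State</td></tr></table>"
print(profile_url(80311))
print(extract_cells(sample))
```

Looping `fetch` over a range of pr_key values and writing `extract_cells` output to a CSV would give something that imports straight into Excel.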

a non emu

March 25th, 2009 at 5:07 PM ^

"college legends, college kids that got huge hype and are still in college (Pryor), or NFL guys"

That is precisely why I think Wikipedia might be a good idea. I am not suggesting that you should attempt to extract stats from Wikipedia, but every player that made it into the NFL will have a Wikipedia page. I certainly wasn't implying that a Wikipedia page should be the main A with the most W, but I was just looking for anything that would indicate success (of course in certain cases, hype supersedes success). Something that would help in the winnowing process.

As for the script, I will try to put something together over the coming weekend. But shouldn't you have some cutoff point for the analysis? Say, only players rated 3-star and higher make the list or something. Or only those who signed scholarships with D1-A schools. That, I'd think, is a reasonable limit to stop the data size from becoming unwieldy. Also, you could just treat the ones who made it into the NFL from I-AA, or unranked players that made the NFL, as outliers. And every statistical study needs outliers :)

Bleedin9Blue

March 25th, 2009 at 5:29 PM ^

But part of the point of this (in my mind at least) is to use such an overwhelmingly large amount of data that any conclusions that are found to be statistically significant are essentially undeniable. If I were to restrict the analysis to only NFL players then I'd only be looking at the very best of the best, nearly all of them 4 stars or higher. That would deny me the chance to find out the probability that players with only 3 stars (or fewer) make it to the NFL. Besides, I think that determining success by number of starts and the other factors that I listed might be a better metric. And there's always the fact that Wikipedia can be edited by anyone, so there could just be some incorrect data. There's a reason you can't cite Wikipedia in papers.

Don't worry about outliers, the Pat Whites of the world will make sure that I'm ripping my hair out trying to figure out how to account for them.

If you want to work on a script I'd appreciate it and be more than happy to help you in any way. I have some fairly limited coding ability (C++ and Java) but I'm actually fairly quick at learning that stuff when necessary.

Tacopants

March 25th, 2009 at 6:22 PM ^

Visual Basic + Microsoft Access. You can use the MSDNAA and get Visual Studio and Access for free since you're still a student.

VB is a pretty intuitive language; it meshes pretty well if you retained any residual Engin 101 C++, and it has a good UI. Access is Excel on steroids if you're doing database analysis. VB is also built into Microsoft Excel; that's how you write macros.

For your purposes though, there are tons of stock data miners out there written in VB for Excel. It shouldn't be too hard to change the address from a stock website to Scout and have it read schools instead of ticker prices, volumes, etc.

FingerMustache

March 25th, 2009 at 5:09 PM ^

I definitely think it is important to break down the star reliance based on position. For instance, a 5-star WR is probably more likely to succeed than a 5-star O-lineman.
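As a rough illustration of that kind of breakdown, here is a small Python sketch that groups players by position and star rating and computes the share that "succeeded" in each group. The field names and the three sample players are hypothetical, and what counts as success is left to whatever definition ends up being used.

```python
from collections import defaultdict


def success_rate_by_group(players):
    """Map (position, stars) -> fraction of players in that group flagged as successful."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for p in players:
        key = (p["position"], p["stars"])
        totals[key] += 1
        if p["success"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


# Made-up records purely to show the shape of the input.
sample = [
    {"position": "WR", "stars": 5, "success": True},
    {"position": "WR", "stars": 5, "success": False},
    {"position": "OL", "stars": 5, "success": False},
]
rates = success_rate_by_group(sample)
```

With real data, groups containing only a handful of players would need to be read with caution, since one player can swing the rate a lot.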

FingerMustache

March 25th, 2009 at 5:23 PM ^

My only concern is that where a player goes to school often affects the number of starts. Consider the USC QBs over the Palmer, Leinart, Cassel (not so much) years. Each rode the bench behind a future NFL QB. But I don't think that takes away from their developed abilities. Now look at Michigan, where all one needed was an arm that was partially attached to one's torso to get playing time at QB during the 2008 season.

And after all, 5* players tend to go to better schools with better depth and greater competition.

With this in mind, I'm not entirely sure that # of starts is a telling factor as to how good the player turned out. Although I really can't offer an alternative.

Bleedin9Blue

March 25th, 2009 at 5:36 PM ^

Believe me I've been thinking about all that stuff.

I think that there might be a way to essentially rate a school and use that rating as a handicap. For instance, Beaver went to Tulsa. He's obviously going to be a big fish in a small pond, so he's going to start there a lot. Now imagine that he came here and lost the QB battle to Forcier; he wouldn't start nearly as much. But the school that he's going to would have a higher rating (i.e. handicap), so not starting as much wouldn't count against him as much, because higher rated schools have better athletes and the odds of a good athlete sitting for a while are comparatively higher there than at low rated schools.
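One very simple way to try this, sketched in Python: scale each player's raw starts by a rating for his school. Everything here is an assumption for illustration, including the multiplicative form, the function name, and the ratings themselves (which would have to come from somewhere, such as NFL draftees per school or win percentage).

```python
def adjusted_starts(starts, school_rating, baseline_rating=1.0):
    """Scale raw starts by school strength relative to a baseline school.

    A rating above the baseline boosts each start (harder to earn);
    a rating below it discounts each start.
    """
    if baseline_rating <= 0:
        raise ValueError("baseline_rating must be positive")
    return starts * (school_rating / baseline_rating)


# Hypothetical ratings: 1.5 for a deep roster, 0.75 for a thinner one.
big_school = adjusted_starts(12, 1.5)     # fewer starts, each worth more
small_school = adjusted_starts(30, 0.75)  # more starts, each worth less
```

Whether a straight multiplier is the right shape for the handicap is an open question; it is just the easiest version to test against the data first.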

I'm not sure if that'll work (or even be all that feasible) but it's something I'm going to try.

If someone else has a better way of taking that into account, then I'd be happy to hear it.

Also, if someone else can think of a stat to measure collegiate success by, other than number of starts, which someone (probably me) can find somewhere on the internet for every player that signs an LOI, then I'd be very excited to hear that.

guanxi

March 26th, 2009 at 3:20 PM ^

My only suggestion would be to make the data useful, as much as possible, for other research. Allow others to add fields, etc (maybe with approval). If possible, maybe use an online database that others can access.

The purpose of my post, however, is to point you to research others have done on similar questions, which may inform or guide what you are doing. I've been bookmarking these for a little while to satisfy my own curiosity. Most of these are an evaluation of prior recruiting rankings, but the first one will tell you who the clear top ranked recruiting school has been since 1995.

* http://michigan.scout.com/a.z?s=162&p=2&c=834176
* http://michigan.scout.com/a.z?s=162&p=2&c=836034
* http://sportsillustrated.cnn.com/2009/writers/andy_staples/02/16/2006-c…
* http://rivals.yahoo.com/ncaa/football/blog/dr_saturday/post/Hug-your-fr…
* http://rivals.yahoo.com/ncaa/football/blog/dr_saturday/post/Hug-your-fr…

Also, this one compares NFL and high school combine performance for many players. Apparently, it's all downhill after you turn 18:
http://footballrecruiting.rivals.com/content.asp?CID=917643

northpj2

March 26th, 2009 at 5:25 PM ^

You are going to have a lot of omitted variable bias if you don't control for factors like quality of school (as measured by how many kids go to the NFL, some statistic that reflects their record, etc.). In addition, you need to define success. Is it getting to the NFL, or specifically getting drafted? You could use an indicator variable and then fit a Probit or Logit model to examine this. But many other predictors are difficult to manage. You will also need to understand the dynamics of the position. A 4 year starter at Quarterback is unlikely (if they are good enough to start three years they probably will depart for the draft), so you must determine what your measure of quality is. (Also remember that if you use a number like how many games started or played, those are COUNT VARIABLES, and so you must use something like a Poisson Regression or a Negative Binomial Regression.)
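To make the count-variable point concrete, here is a minimal pure-Python sketch of a Poisson regression with a single predictor, fitted by Newton-Raphson. It models log(expected starts) = b0 + b1 * stars. The function name and the numbers in the example are invented; a real analysis would use a statistics package and more predictors.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Fit log(E[y]) = b0 + b1*x by Newton-Raphson and return (b0, b1).

    Assumes the counts are not all zero and the xs are not all identical.
    """
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Negative Hessian (a symmetric 2x2 matrix).
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up data: star rating vs. career starts for eight players.
stars = [2, 2, 3, 3, 4, 4, 5, 5]
starts = [3, 5, 8, 10, 14, 18, 25, 30]
b0, b1 = fit_poisson(stars, starts)
```

Here exp(b1) is the multiplicative change in expected starts per extra star. If the variance of the counts turns out to be much larger than their mean, that is the usual sign to move to the Negative Binomial instead.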

Once you have figured out your measure, let me know about what you are doing with your regression and I can try to help.

GO BLUE.

mth822

March 26th, 2009 at 9:08 PM ^

It is tough to put a stat on the raw physicality that translates to the level of play Michigan Football is at. To coach or play at Michigan is one TOUGH ordeal. The mental prep alone for one practice would shell shock some bloggers. Not all but some, probably me included. There is something gooey you look for, "beyond the numbers." Hart showed it in the infamous YouTube video, "The Run." And his toughness was evident in his rallying and sideline demeanor after a 3 and out. His hubris was there at moments as well. Take the good with the bad. It's tough to put a stat on mental toughness. And really that is the only stat that counts. But at this level you need some physical prowess and I think college football fans are lost in the lore of the physicality. Recruiting is a crap shoot; no one believes the players' verbals and no one believes the coaches really mean what they say. But in every class there are those kids who show the unmeasurable stats that get you to the W. But I like what you're doing with this research and it will be cool to see how it comes out.

brccli

March 27th, 2009 at 1:55 AM ^

The ASA has a Statistics in Sports section ( http://www.amstat.org/sections/sis/ ). They have a list of relevant journals that you might want to check for previous work.

I highly recommend R. It's the standard in the statistics world. It has sort of a steep learning curve since it's really a high-level programming language (on the order of Matlab, but with a stat focus), but the upside is lots of flexibility and plotting capability. I can recommend some learning material if you're interested.

I would definitely advise doing lots of exploratory data analysis. Just make a bunch of plots and think about what you're seeing for a while. This goes a long way toward building good models.

Depending on how this goes, it's probably publishable. I have three degrees in statistics from our fair university, so I know a little bit about data analysis. I'd definitely be willing to read anything you write or discuss any ideas you have. Good luck.

Bleedin9Blue

March 27th, 2009 at 10:06 AM ^

This is an extremely helpful post, thank you very much. I will definitely check out those journals to see which techniques I should investigate for possible usefulness.

And it sounds like R is the way to go for my analysis once I get the data. I don't mind a steep learning curve, although I admit that I hated Matlab compared to other similar programs simply because I didn't like treating everything as a matrix.

I actually contacted an old Stats professor of mine to inquire into potential "publishability". His response was... less than useful. Most of all, I would greatly appreciate someone with a lot of stat knowledge and experience looking over some of my data, methods, and conclusions at some point.

If you don't mind (although it's obviously perfectly understandable if you do), how about you send me an email at the address posted above and we can start a dialogue. Then, if I come up with any ideas or anything that I want someone with an actual statistics background to look at, I can simply email you.