OT: Extracting play-by-play data

Submitted by Bleedin9Blue on

Due to all of the debate regarding the Wisconsin game and the quality of the 2010 offense in general, I've been thinking about stats a fair bit.  So I went looking for more information on how FEI is calculated.  I didn't find what I needed, so I emailed Brian Fremeau to see if he can provide some illumination (although I believe the actual formula he uses is proprietary, so I don't expect to learn too much).

The end result is that I've become curious about how Brian Fremeau and others who create advanced stats from play-by-play or drive-by-drive data collect that data.

The NCAA team reports have game-by-game play-by-play data, but extracting the necessary information from them seems difficult since it's all text based.  I'm guessing that it just looks complicated to me since I'm not a CS or CE person.  But, I'm still interested in how the data is extracted.

So, is there a better site than the NCAA team reports for getting play-by-play data to extract and distill down into the necessary components (pass, rush, yards, player(s), etc.), or is the NCAA site the best and it just takes some coding to make it work efficiently?

I wonder what kind of advanced stats the MGoCommunity could come up with given access to years' worth of distilled data from every team in the country...

Thanks.

BlockM

July 10th, 2011 at 12:10 AM

I've been working on getting the UFR data into a tractable database format for a while now, in very short, very spread-out bursts.  I don't think that's the data you'd want to use for this, though, and I'm not anywhere close to having something useful yet because I've been busy with other stuff.

The NCAA team reports could be pretty easy to extract if they're all structured the same way every time. That was the biggest problem with the UFRs: each season, and even each game, can have a different format, so it needs its own parsing function.

Do you have a format you'd like to see it in?

Bleedin9Blue

July 10th, 2011 at 12:24 AM

I'm not so much looking for UFR data as for a method of extracting play-by-play data from either the NCAA game reports or some other site.  This could then be used to get as much data as is reasonably possible on every game that's been tracked.  Obviously the UFRs go into much greater detail than the NCAA game reports, but UFRs are specific to Michigan (and to whatever other team-specific sites do the equivalent of UFRs).

What I'd like to do is create a database of all play-by-play data that can be extracted from the game reports, and then unleash the MGoCommunity on it.  I personally can't fully trust FEI since I don't know how it's calculated.  But, if I had access to the database, then I could start trying to create my own stats.  Or, other people on the site could create their own stats and share their formulation.  The results could then be vetted by the community.  Of course, nobody would come up with a stat to make everyone happy, but it would definitely be something interesting to work on and discuss during the off season.
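As a rough illustration of what such a database might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the column names, and the sample rows are all made up for illustration; none of them come from the NCAA reports, and a real schema would likely need more fields (down, distance, field position, and so on).

```python
import sqlite3

# Hypothetical schema: one row per play. Column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS plays (
    game_id   TEXT    NOT NULL,
    offense   TEXT    NOT NULL,
    play_type TEXT    NOT NULL,   -- e.g. 'rush' or 'pass'
    player    TEXT,
    yards     INTEGER
)
"""


def yards_per_play(conn):
    """Return {(offense, play_type): average yards} over all stored plays."""
    rows = conn.execute(
        "SELECT offense, play_type, AVG(yards) FROM plays "
        "GROUP BY offense, play_type"
    )
    return {(team, kind): avg for team, kind, avg in rows}


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample rows, not real game data.
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?, ?)",
    [
        ("demo-game", "Team A", "rush", "Player X", 4),
        ("demo-game", "Team A", "rush", "Player X", 8),
        ("demo-game", "Team A", "pass", "Player Y", 15),
    ],
)
print(yards_per_play(conn))
```

Once the plays are in a table like this, anyone proposing a new stat could share it as a single query, which would make the formulation easy for others to check.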

Fortunately, the NCAA database for play-by-play reports appears to be identical starting from 2004 (the first year that I found any play-by-play reports for Michigan v. OSU at least) until the present.  So, if something could be written to extract all possible information from any 1 game report, then it should work on all of them.

Unfortunately, my coding skills are nearly nonexistent (I'm a ChemE/MSE), so my value to this part of a project like this is rather small.  Thus, if someone else wants to take up the banner, I'd be delighted.  Until then, I'm using my l33t HZXOR Excel knowledge to try to extract the data.

I'd say that I can appreciate the difficulty in making the UFRs into an easily navigated database, but I really can't.

Vincent

July 10th, 2011 at 12:52 AM

Maybe we can split the work. I am typically able to extract data from webpages and format it into a nice database if it all looks the same. But I really don't have time to figure out what the HTML naming convention is. Tell me the set of HTML pages I have to loop over, and how these URLs map to teams/seasons/opponents, and I should be able to download them (well, I'll try anyway).

Bleedin9Blue

July 10th, 2011 at 10:30 AM

I'm starting to examine the URLs and source code of the play-by-play pages for both the ESPN and NCAA game reports.

I've found that ESPN labels their games with a "gameID" whose formulation I'm unable to determine.  I tried just going through the gameIDs and incrementing them by 1 to see what would happen.  Unfortunately, if a gameID is used [in the URL] which doesn't exist, you are redirected to the college football home page.

The NCAA team report is based on 4 things:

  1. Year game was played
  2. Home team ID
  3. Away team ID
  4. Full date game was played

Thus, the generalized URL for the NCAA play-by-play game reports is:

web1.ncaa.org/mfb/driveSummary.jsp?expand=A&acadyr=YEAR&h=HOMETEAMID&v=AWAYTEAMID&date=DATE(FORMAT:DD-MMM-YY)&game=YEAR(FORMAT:YYYY)000000HOMETEAMIDDATE(FORMAT:YYYYMMDD)

Example: the 2010 Michigan v. UConn game was played on 2010/9/4.  Thus, the URL for that game is:

web1.ncaa.org/mfb/driveSummary.jsp?expand=A&acadyr=2010&h=418&v=164&date=04-SEP-10&game=201000000041820100904

Thus, it would be a fairly simple operation to find the URL for every game as long as a schedule of games could be found and distilled into a usable format.  I've already extracted the team IDs.
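The URL pattern described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name and its parameters are made up, the `http://` prefix and the `h=`/`v=` query form follow the example URLs posted in this thread, and the 9-digit zero padding of the home team ID inside `game` is inferred from those same examples rather than from any NCAA documentation.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://web1.ncaa.org/mfb/driveSummary.jsp"

# Spelled out by hand so the result does not depend on the machine's locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def drive_summary_url(season, home_id, away_id, game_date, expand=None):
    """Build a play-by-play report URL from season, team IDs, and game date."""
    month = MONTHS[game_date.month - 1]
    short_date = f"{game_date.day:02d}-{month}-{game_date.year % 100:02d}"
    # Assumption: season + home ID zero-padded to 9 digits + YYYYMMDD.
    game_key = f"{season}{home_id:09d}{game_date:%Y%m%d}"
    params = {}
    if expand is not None:
        params["expand"] = expand
    params.update({
        "acadyr": season,
        "h": home_id,
        "v": away_id,
        "date": short_date,
        "game": game_key,
    })
    return f"{BASE_URL}?{urlencode(params)}"


# Michigan (418) hosting UConn (164) on 2010/9/4, using the IDs given above.
print(drive_summary_url(2010, 418, 164, date(2010, 9, 4), expand="A"))
```

The season is passed separately from the date because a bowl game played in January presumably still belongs to the previous season; that is a guess and would need checking against a real bowl game report.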

But, when viewing the HTML for the NCAA site, it is essentially identical to what is written in the plain-view report.  That is to say, the NCAA site doesn't use any useful tagging system that would allow easy data extraction.  Thus, I believe that anything used to extract data would have to be a semi-intelligent brute force method instead of simply looking for useful tags.
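One common way to do this kind of "semi-intelligent brute force" extraction is with regular expressions. The sketch below is only a starting point: the sentence patterns ("rush for N yards", "pass complete to ... for N yards") are assumptions about how a play description might be worded, not the confirmed wording of the NCAA reports, and the function and field names are made up.

```python
import re

# Hypothetical play wordings; the real report text may be phrased differently.
PASS = re.compile(
    r"^(?P<player>.+?) pass complete to (?P<target>.+?) "
    r"for (?P<yards>-?\d+) yards?"
)
RUSH = re.compile(r"^(?P<player>.+?) rush for (?P<yards>-?\d+) yards?")


def parse_play(text):
    """Turn one play description into a dict; unknown lines are kept raw."""
    text = text.strip()
    m = PASS.match(text)
    if m:
        return {"type": "pass", "player": m["player"],
                "target": m["target"], "yards": int(m["yards"])}
    m = RUSH.match(text)
    if m:
        return {"type": "rush", "player": m["player"],
                "yards": int(m["yards"])}
    # Anything unrecognized is returned untouched so it can be reviewed later.
    return {"type": "unknown", "raw": text}


print(parse_play("Player X rush for 12 yards to the 30"))
print(parse_play("Player Y pass complete to Player Z for 8 yards"))
print(parse_play("Timeout Team A"))
```

Keeping the unmatched lines instead of dropping them makes it easy to see which wordings still need a pattern (incompletions, penalties, kicks, and so on).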

As previously mentioned, my coding skills are rusty at best.  Therefore, I'm simply trying to perform all of the necessary extractions using Excel.  I am fully confident that I can come up with a flexible system that will get all of the data out of a game report.  But the problem is that iterating that over thousands of games using Excel would require a significant amount of labor; there must be a better program to use for such a task.  If someone could point me towards it, then I might be able to learn the most rudimentary portions of it and make it work.
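For the "thousands of games" part, a short script in a general-purpose language such as Python is one option. The sketch below only shows the looping and saving; the function names, the file naming scheme, and the one-second pause are all arbitrary choices made for illustration, and it assumes a list of report URLs has already been built.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, out_dir, fetcher=fetch, delay=1.0):
    """Save each URL to out_dir as game_00000.html, game_00001.html, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        target = out / f"game_{i:05d}.html"
        if not target.exists():       # skip pages saved by an earlier run
            target.write_bytes(fetcher(url))
            time.sleep(delay)         # pause between requests to be polite
        saved.append(target)
    return saved
```

A call such as `download_all(list_of_urls, "reports")` would then leave one HTML file per game on disk, ready to be parsed offline. Files are named by position in the list, so the list order has to stay the same between runs for the skip check to make sense.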

umich1

July 10th, 2011 at 1:50 PM

Last year, for Michigan and their opponents, I extracted all of the reports from the NCAA, like

http://web1.ncaa.org/mfb/driveSummary.jsp?acadyr=2010&h=31&v=692&date=04-SEP-10&game=201000000003120100904

into Excel, and imported them into SAS.  From there, I built a predictive model to aid in my selection of scores for a football pick'em league I was in.

If the MGoBlogeratti is able to develop a better way, I'm very interested to learn myself.

HouseThatYostBuilt

July 10th, 2011 at 12:55 AM

I don't know who this Fremeau guy is, but how can the statistic he fabricated be trusted if he hasn't even published the formula for it? For all we know, he could be pulling numbers out of his ass. I'm surprised that so many people take FEI as gospel when they don't even know how it's calculated.

redwhiteandMGOBLUE

July 10th, 2011 at 5:56 AM

I wish I could help but you guys are way too smart for me.

This is how I feel when the discussion turns to FEI, advanced metrics, data mining, etc:

 

I really hate being dumb...