UFR Database
This has been an extremely long offseason for me, and I've been struggling to find ways to get my Wolverine Football fix, as I'm sure just about everyone who frequents MGoBlog has. My internship has also been very slow, so I've had a lot of desk time to sit and do nothing.
While perusing the site I began to wonder if there was some sort of UFR database that could be queried in order to research different trends, etc. I'm a computer engineering major with a math minor and statistics are not exactly my cup of tea, but I would at least like to poke through that information if it was all in one place.
After emailing Brian and finding that there was no such database, I found myself pondering how one might go about making such a database. I started writing an application in C++ that would take the HTML table and convert it into XML that could be imported into Access. I was relatively close to getting that to work, but it was a bit of a hassle, so I decided to try something else.
I decided to write a Perl script that would parse the HTML, clean out all the tags (<table>, <tr>, <td>, etc), and put in their place more useful XML tags (<line>, <player>, <analysis>, etc). The result was a pretty nifty script that does almost all the work for me. After converting each UFR table into XML, I needed to add some extra information like drive IDs, play IDs, and things of that nature so that the database would be easily queryable. I did some of this by hand (lots of search and replace), and wrote another script to take care of a bunch of the nit-picky stuff that would have taken hours. I didn't really want to spend the rest of the day adding <PLAYID>some number</PLAYID> to over a thousand plays... I decided to be an engineer because I'm lazy and there's ALWAYS an easier way to do something.
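For a rough idea of what the script does (sketched here in Python rather than Perl, and with the simplifying assumption that every table row has exactly three cells in a fixed order, which real UFR tables don't always follow):

```python
import re

def html_rows_to_xml(html):
    """Sketch of the tag-swapping idea: pull each <tr> out of the
    UFR table, wrap its cells in the XML tags mentioned in the post
    (<line>, <player>, <analysis>), and stamp a sequential <PLAYID>
    on each play so the database ends up easily queryable."""
    fields = ["line", "player", "analysis"]
    plays = []
    for play_id, row in enumerate(re.findall(r"<tr>(.*?)</tr>", html, re.S), 1):
        cells = re.findall(r"<td>(.*?)</td>", row, re.S)
        tagged = "".join(
            f"<{f}>{c.strip()}</{f}>" for f, c in zip(fields, cells)
        )
        plays.append(f"<play><PLAYID>{play_id}</PLAYID>{tagged}</play>")
    return "\n".join(plays)
```

A real version would also need to handle nested tags, entities, and rows that span multiple plays, which is where the hand editing came in.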
If you're interested in seeing the Perl scripts, especially if you're proficient in Perl, let me know and I'll send them to you. This was my first useful Perl script (I taught myself, and hadn't ever done anything more significant than Hello World!) and a lot of my techniques came straight off a Google search and probably aren't following best practices. Suggestions/critiques would be greatly appreciated.
After some trial and error, I ended up with what I think is a pretty nifty database that includes information about every game, drive, and play that was charted over the course of last season. The OSU game is missing because Brian didn't do a UFR, along with halves of some games (a couple were only charted for offense or defense, not both) and some drives (I believe a couple games were so painful that the final drives were non-UFR-able).
I've hosted the database file on FileFactory, along with the final XML file for the complete season. Hopefully that works...
If I get really bored in the future, I'd ideally like to find a way to parse the text to record all the +/- information for each player, but I haven't thought about that much... seems like it would be quite a bit of work, and I might not have time for it now that my boss and my manager are back from vacation.
So anyway, feel free to download and look through that. Do whatever you'd like with it. I didn't really do this for any reason other than I wanted something to do, and it seemed more useful than Facebooking all of our potential recruits for the next five years. Sorry about all the parentheses... no idea how they've become so prevalent in my writing over the years, but everything sounds choppy when I write without them.
Go Blue!
EDIT: As was mentioned in the comments, the Miami game, the Northwestern game, and the other half of the PSU and Illinois games were actually available. These have been added to the database and the final file is located HERE. (The corresponding XML has been updated as well.)
"I might not have time for it now that my boss and my manager are back from vacation."
Nonsense. Look at how well that worked out for me.
I have not looked at the database and likely will never look at it, but what an awesome resource. You are to be commended for the effort and I look forward to guys bashing each other with arguments they pulled out of your database. Thank you.
That's what I'm hoping for as well. Should be interesting to see if anything comes of it.
I would have to imagine that the amount of lost productivity due to visiting this blog during working hours amounts to several thousand dollars daily, in the aggregate.
probably a dumb question around here, but what application opens .mdb and .xml files? I don't know how to utilize your database without it looking like an engineering script in textedit.
The XML is readable with Notepad. I had to remove all the line breaks because they were causing some problems with queries in the database. I didn't need to remove all of them, but it was quicker than going through 20,000 lines of text and only removing what I had to.
You can also open the XML file in Internet Explorer or Firefox, and that will show the data structure quite nicely.
The .mdb file is readable with Microsoft Access. You can also open it with OpenOffice, which is free at www.OpenOffice.org. Instructions for doing that are at http://salahuddin66.blogspot.com/2007/09/mdb-file-in-openofficeorg.html, and there are some extra steps described in the comments at the bottom of that blog post. I just did a quick Google search, so there may be better resources for figuring out how to do that.
I was hoping you would at least say VIM. But yeah, notepad is alright. I guess... :P
But thanks for the amazing work! Stuff like this is what keeps me coming back to this site (apart from Brian's posts of course).
It opens in VIM (I had never heard of VIM before, but I just downloaded it and tried it... it seems to work, which makes sense since it's just a text file), and I actually did any hand editing in Notepad++.
vim is this famous (I thought :D) text editor that a lot of the *NIX community swears by. I just like using it because I do a lot of code editing, and it has this really powerful regular expressions engine that you can do a lot of clever things with (things like search, search/replace etc are a breeze). It has a bit of a learning curve, but once you get used to it, editing with anything else becomes next to impossible.
Having said that, I like Notepad++ too, and have used it quite a bit. But I don't want to hijack this thread; there are plenty of other places on the web to discuss the merits/demerits of various code editors :)
Once again, great post! I like the idea of turning this into an HTML page as someone had suggested. Will definitely make it a lot more accessible for people not used to working with raw DBs/Access... You could maybe even get Brian to host it on the site. I am sure he will find it useful as well (probably more so than most of us).
Gotcha. As an Electrical/Computer Engineering student I don't get as deep into the CS side of things as I'd like. I'll play around with it at some point just so I can have it on my list of "things I've at least tried to use."
I am an EE/CE too. I just happen to work on the firmware level. Sort of straddling that boundary between EE, CE and CS. But yeah, perl is a great skill to have, and knowing perl, you might appreciate vim even more :)
Thanks for posting, and the follow up. This is really cool.
If there is any interest I could convert this to a SQL database and create a simple HTML user interface to make it easier for people to search all this information from their favorite browser.
That would be amazing. Just out of curiosity... where would you host something like that?
I have a business site that I could put it on to avoid hosting fees. I was also wondering how complicated it would be to add it on mgoblog somewhere (and keep it updated after each UFR). That may not be feasible or even wanted by Brian. Either way, I'm sure something could be put up in the next few weeks.
Keeping it updated after each game would be fairly easy... I'd have to polish up the Perl stuff to make it one automated step, but then it's just a matter of copy/pasting the table HTML into a file, converting it, and importing it into Access (or whatever type of database it ends up in).
Like you said though, if it's housed on MGoBlog, it's up to Brian.
That was exactly what I was thinking. I would be happy to help with this project if you need it.
See what a little boredom can do?
Unfortunately, I have a bunch of "real" work to do today, but I will definitely check this out tonight and over the weekend. Thanks!
and the interaction with the programming community that makes me sad that the furthest I got with programming was 101 before I fell into ChemE.
Cocktails on an epic scale.
I was thinking about doing something like this, but definitely did not have the drive to do it. I have done similar things in the past and the way I have done it is to use Regular Expressions and create objects from the regex results. Then, in Access (I use SQL Server) create an identity on the tables which will automatically add the id's as you insert records.
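For anyone unfamiliar with the identity-column trick, here's a minimal sketch using SQLite (SQL Server IDENTITY syntax differs slightly, and the table/column names here are made up):

```python
import sqlite3

# In SQLite, an INTEGER PRIMARY KEY column auto-assigns an increasing
# value on insert, playing the same role as an IDENTITY column in
# SQL Server or an AutoNumber field in Access: the database numbers
# the plays itself, so nobody has to hand-stamp a thousand IDs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plays (
        play_id INTEGER PRIMARY KEY,
        line    TEXT,
        player  TEXT
    )
""")
conn.executemany(
    "INSERT INTO plays (line, player) VALUES (?, ?)",
    [("M28 1 10", "Forcier"), ("O40 2 7", "Minor")],
)
rows = conn.execute(
    "SELECT play_id, player FROM plays ORDER BY play_id"
).fetchall()
```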
Very cool. Thank you for putting the effort into this so that I may reap the benefits.
I was just importing the file into my database and noticed which games you had missing. I did a quick search, and it looks like there are UFRs for everything except the OSU game. Would you mind either running the script again with those included, or posting the script so that I can do it myself? I want to make use of your work, but I think it would be more useful with everything included.
Thanks
Sure. I'll get them up tomorrow at some point. I'd send you the script, but there's some tedious manual stuff that needs to be done at this point and it'll be easier this way.
Thanks for the heads up!
EDIT: Or tonight... gosh I'm bored lately. Enjoy!
Apparently WLA did the OSU UFR; I would check over there.
I don't think we want this in the database.
http://mgoblog.com/content/upon-further-review-michigan-vs-ohio-state
Yeah, it's a UFR of the UFR... I wouldn't recommend incorporating it into the database.
Overall a good post though.
I will be more than happy to help with this as well. I know stats, SQL, XML, HTML, other initials. This would be hella fun and useful.
Edit: Please use this not Access!
I can play around a bit with the SQL Express and see what I can do, Access was just what I had at the time and I know that's not ideal for most situations. With all of the available data in there now, feel free to do any analysis/tweaking you'd like.
Hmm, once I'm at work tomorrow I'll import to SQL 2008 and I'll post the MDF/LDF files for loading into SQL Express.
Edit: Thanks again for all the work you put in.
I already have a SQL script to import the raw data. Right now, I am working on a script to normalize the data and convert it into proper values.
The problem I have is that, for example, the down is not always 1-4; sometimes it's "2pt". The TOGO is sometimes "G" for goal, and the line of scrimmage is given as "M28" or "O36".
Basically, getting the data in is not that difficult, but making it usable is a little more challenging.
Yeah, not sure exactly how to go about fixing that...
Also, FYI, I realized that I had missed the Northwestern game in the newer updated version of the database. I'm changing the link in the diary again, but here it is as well:
http://www.filefactory.com/file/ahd2a8b/n/08_UFR_mdb
I have a few thoughts on this.
First, tag 2pt plays with "Special Teams" and leave it at that. It's an anomaly that is easier excluded. For TOGO with G, just change 1st and G at the 5 to 1st and 5. The line could range from 0 (our goal line) to 100 (their goal line).
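Those rules are easy to express in code; a quick sketch for illustration (function names are mine, not from either script):

```python
def normalize_line(line):
    """Map 'M28' / 'O36' onto a single 0-100 scale:
    0 = our goal line, 100 = theirs."""
    side, yard = line[0], int(line[1:])
    return yard if side == "M" else 100 - yard

def normalize_down(down):
    """Downs are normally 1-4; '2pt' conversions get flagged
    separately so they can be excluded from down-based queries."""
    return None if down == "2pt" else int(down)

def normalize_togo(togo, line):
    """'G' (goal to go) becomes the actual distance to the goal line."""
    return 100 - normalize_line(line) if togo == "G" else int(togo)
```

So "1st and G at the O5" comes out as 1st and 5 from field position 95, and every play ends up on one consistent scale.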
Once I get a chance to look at the database, I'll see if I can come up with schema for efficient querying. I just need a couple free hours at work with a whiteboard.
Those were my thoughts as well. I have already made my way through most of it. I will try to finish the script tonight and get it up for you to see.
gChat me at [email protected] and we can work through it on a shared google doc if you want. I'll be online from 830pm EST until midnight.
Also, I have a formatted/indented version of the above XML and also have the Access DB converted into SQL 2008 if you'd like either.