UFR Database

Submitted by BlockM on July 14th, 2009 at 11:06 AM
This has been an extremely long offseason for me up to this point, and I've been struggling to find ways to get my Wolverine Football fix, as I'm sure just about everyone that frequents MGoBlog has. My internship has also been extremely slow, so I've had a lot of desk time to sit and do nothing.

While perusing the site I began to wonder if there was some sort of UFR database that could be queried in order to research different trends, etc. I'm a computer engineering major with a math minor and statistics are not exactly my cup of tea, but I would at least like to poke through that information if it was all in one place.

After emailing Brian and finding that there was no such database, I found myself pondering how one might go about making such a database. I started writing an application in C++ that would take the HTML table and convert it into XML that could be imported into Access. I was relatively close to getting that to work, but it was a bit of a hassle, so I decided to try something else.

I decided to write a Perl script that would parse the HTML, clean out all the tags (<table>, <tr>, <td>, etc), and put in their place more useful XML tags (<line>, <player>, <analysis>, etc). The result was a pretty nifty script that does almost all the work for me. After converting each UFR table into XML, I needed to add some extra information like drive IDs, play IDs, and things of that nature so that the database would be easily queryable. I did some of this by hand (lots of search and replace), and wrote another script to take care of a bunch of the nit-picky stuff that would have taken hours. I didn't really want to spend the rest of the day adding <PLAYID>some number</PLAYID> to over a thousand plays... I decided to be an engineer because I'm lazy and there's ALWAYS an easier way to do something.

If you're interested in seeing the Perl scripts, especially if you're proficient in Perl, let me know and I'll send them to you. This was my first useful Perl script (I taught myself, and hadn't ever done anything more significant than Hello World!) and a lot of my techniques came straight off a Google search and probably aren't following best practices. Suggestions/critiques would be greatly appreciated.

After some trial and error, I ended up with what I think is a pretty nifty database that includes information about every game, drive, and play that was charted over the course of last season. OSU is missing because Brian didn't do a UFR, along with halves of some games (a couple were only charted for offense or defense, but not both), and some drives (I believe a couple games were so painful that the final drives were non-UFR-able).

I've hosted the database file on FileFactory, along with the final XML file for the complete season. Hopefully that works...

If I get really bored in the future, I'd ideally like to find a way to parse the text to record all the +/- information for each player, but I haven't thought about that much... seems like it would be quite a bit of work, and I might not have time for it now that my boss and my manager are back from vacation.

So anyway, feel free to download and look through that. Do whatever you'd like with it. I didn't really do this for any reason other than I wanted something to do, and it seemed more useful than Facebooking all of our potential recruits for the next five years. Sorry about all the parentheses... no idea how they've become so prevalent in my writing over the years, but everything sounds choppy when I write without them.

Go Blue!

EDIT: As was mentioned in the comments, the Miami game, the Northwestern game, and the other half of the PSU and Illinois games were actually available. These have been added to the database and the final file is located HERE. (The corresponding XML has been updated as well.)

Comments

Brian

July 14th, 2009 at 11:27 AM ^

"I might not have time for it now that my boss and my manager are back from vacation."

Nonsense. Look at how well that worked out for me.

NYWolverine

July 14th, 2009 at 12:23 PM ^

probably a dumb question around here, but what application opens .mdb and .xml files? I don't know how to utilize your database without it looking like an engineering script in textedit.

BlockM

July 14th, 2009 at 12:40 PM ^

The XML is readable with Notepad. I had to remove all the line breaks because they were causing some problems with queries in the database. I didn't need to remove all of them, but it was quicker than going through 20,000 lines of text and only removing what I had to.

You can also open the XML file in Internet Explorer or Firefox, and that will show the data structure quite nicely.

The .mdb file is readable with Microsoft Access. You can open it using OpenOffice as well, which is free at www.OpenOffice.org. Instructions for doing that are at http://salahuddin66.blogspot.com/2007/09/mdb-file-in-openofficeorg.html. There are some extra things you may need to do in the comments at the bottom of the blog post as well... I just did a quick Google search, so there may be better resources for figuring out how to do that.

a non emu

July 14th, 2009 at 2:46 PM ^

vim is this famous (I thought :D) text editor that a lot of the *NIX community swears by. I just like using it because I do a lot of code editing, and it has this really powerful regular expressions engine that you can do a lot of clever things with (things like search, search/replace etc are a breeze). It has a bit of a learning curve, but once you get used to it, editing with anything else becomes next to impossible.

Having said that, I like Notepad++ too, and have used it quite a bit. But, don't want to hijack this thread though, there are plenty of other places on the web to discuss the merits/demerits of various code editors :)

Once again, great post! I like the idea of turning this into an HTML page as someone had suggested. Will definitely make it a lot more accessible for people not used to working with raw DBs/Access... You could maybe even get Brian to host it on the site. I am sure he will find it useful as well (probably more so than most of us).

a non emu

July 14th, 2009 at 2:59 PM ^

I am an EE/CE too. I just happen to work on the firmware level. Sort of straddling that boundary between EE, CE and CS. But yeah, perl is a great skill to have, and knowing perl, you might appreciate vim even more :)

AnnarRankings

July 14th, 2009 at 1:22 PM ^

If there is any interest I could convert this to a SQL database and create a simple HTML user interface to make it easier for people to search all this information from their favorite browser.

AnnarRankings

July 14th, 2009 at 2:29 PM ^

I have business site that I could put it on to avoid hosting fees. I was also wondering how complicated it would be to add it on mgoblog somewhere (and keep it updated after each UFR). That may not be feasible or even wanted by Brian. Either way, I'm sure something could be put up in the next few weeks.

BlockM

July 14th, 2009 at 2:44 PM ^

Keeping it updated after each game would be fairly easy... I'd have to polish up the Perl stuff to make it one automated step, but then it's just a matter of copy/pasting the table HTML into a file, converting it, and importing it into Access (or whatever type of database it ends up in).

Like you said though, if it's housed on MGoBlog, it's up to Brian.

jc

July 14th, 2009 at 2:43 PM ^

Unfortunately, I have a bunch of "real" work to do today, but I will definitely check this out tonight and over the weekend. Thanks!

benjahen

July 14th, 2009 at 3:39 PM ^

and the interaction with the programming community that makes me sad that all the further I got with programming was 101 and then I fell into ChemE.

Cocktails on an epic scale.

joeyb

July 14th, 2009 at 5:25 PM ^

I was thinking about doing something like this, but definitely did not have the drive to do it. I have done similar things in the past and the way I have done it is to use Regular Expressions and create objects from the regex results. Then, in Access (I use SQL Server) create an identity on the tables which will automatically add the id's as you insert records.

Very cool. Thank you for putting the effort into this so that I may reap the benefits.

joeyb

July 14th, 2009 at 9:21 PM ^

I was just importing the file into my database and noticed which games you had missing. I did a quick search and it looks like there are UFRs for everything except the OSU game. Would you mind either running the script again with those in there or posting the script so that I can do it. I want to make use of your work, but I think it might be more useful with everything included.

Thanks

joeyb

July 15th, 2009 at 10:31 AM ^

I already have a SQL script to import the raw data. Right now, I am working on a script to normalize the data and convert it into proper values.

The problem that I have is that, for example, the down is not always 1-4, but sometimes 2pt. Or, the TOGO is sometimes G for goal. The line is M28 or O36.

Basically getting the data in is not that difficult but making it usable is a little more challenging.

ShockFX

July 15th, 2009 at 10:57 AM ^

I have a few thoughts on this.

First, tag 2pt with "Special Teams" and leave it at that. It's an anomaly that is easier excluded. For TOGO with G, just change 1st and G at the 5 to 1st and 5. The line could range from 0 (our goaline) to 100 (their goaline).

Once I get a chance to look at the database, I'll see if I can come up with schema for efficient querying. I just need a couple free hours at work with a whiteboard.