I bet I know what you’re thinking. I love baseball stats and this new statistical revolution in my favorite sport has really enhanced my understanding. But, darn it, sites like Fangraphs simply aren’t geeky enough for me.
I mean, who hasn’t had that thought?
Fangraphs is a fantastic site. I can look at things like player A’s wRAA and see that linear weights suggest he was worth a certain number of batting runs last season. I can see what percentage of pitches Player B swung out that were outside the strike zone. I can look at hundreds of different calculations and can analyze pretty much to my heart’s desire.
But, I’m selfish. What if I want more? I want the ability to calculate things that aren’t available at Fangraphs, using tons of data. Maybe I simply want to learn how to use pitchFX, that new tool that you see everywhere with them fancy complicated graphs? Well, this fanpost is a very basic look at a tool that’s been making it’s way from StatSpeak to the Book Blog to the Hardball Times to Beyond the Boxscore.
This offseason I’ve probably overextended myself. I started by creating a series that I’ve
kinda completely left unfinished 20% of the way through. I also started reading most of the book Mathletics by Wayne Winston. Winston’s book was interesting* and inspired me. The book presented some of the things one could do with baseball data in Excel.
*If you want to read a review of it from a sabermetrician’s perspective, go here.
Over the course of the offseason, I’ve spent many an hour developing Excel sheets analyzing our national past time. To give you a taste, I can tell you that a lineup made entirely of 9 Barry Bondses from the year 2004 would score approximately 4,074 runs in a season based on a Monte Carlo estimation. I can tell you that based on CHONE, Marcel, Oliver and ZiPS, Marcel was the best at estimating wRAA. I can tell you that Raj Davis performed in the 80th percentile* in projection systems.
*How I defined this was I ranked all players by the formula
proj wRAA – actual wRAA
in each projection system. I averaged these ranks and ranked those averages. Raj Davis outperformed 79.8% of players this way. The highest on the 2009 A’s team: Adam Kennedy, 85th percentile.
The point is I messed around with Excel for ages and wrote a ton of formulas that were pretty much useless, although quite entertaining. Well now I give you a better way to waste one’s time writing formulas: SQL!
WARNING: To follow along with the rest of this fanpost will require quite a few downloads but not a penny of money. Every download mentioned is completely free (of viruses and of the need for cash). Some of my links may open files you may need to download. These links can be figured out through context. If you don't wish to actually try SQL, you can follow along once we get to the commands by looking at my tables which are exported from SQL to Excel to HTML.
SQL (pronounced sequel) is a database language. In fact, it stands for structured query language. Sound geeky enough for you?
But don’t fret yet. I had never heard of it before I decided to learn it. I’m far from a master, but I’ve been messing with it only a few days and have gotten a pretty good grasp of the basics so far. Let’s start where I started. This article by Colin Wyers is one of the best introductions to using SQL for baseball.
I’ve managed to get through the entire article except his last bit of code (the Marcel projections). I recommend opening Colin's article in a new window. I’ll give you the basics of the article as it never hurts to get a second wording of information. I figure, as a neophyte, my explanations should not be overly complex.
Onto the first step: the software. Also, as a note, I'm using PC and so is Colin in his article. Any other OS might need to use different software. In this case, the first software needed is SQL itself. Colin suggests MySQL in his article. Here is the appropriate section on what to download:
First, you need to download and run the installer for MySQL 5.4 CE Essentials. Choose a "typical" install. You'll typically have to wait a few minutes for files to copy. Then there will be some ads. Skip over them. Make sure that "Configure the MySQL server now" is checked. (I uncheck the box about registering the software with MySQL, but that's up to you.) Chose "Detailed configuration." Then you'll go through a set of pages asking you question. Here are the answers: 1.Developer Machine 2. Non-Transactional Database Only 3.Decision Support (DSS)/OLAP Continue on, using the defaults. When prompted, either add a password or uncheck the box to alter security settings. (This will leave your root password blank.) Once you're finished, you shouldn't have to start MySQL at all—it should be ever vigilant, waiting for you to command it to do your sabermetric bidding.
The next thing we need is a GUI to use MySQL. The link in Colin’s article links to a download page that only has free trials. I use the community version. You can download it here. Now that we have SQLyog and SQL, all we need is data. Again, I’ll quote Colin’s article on how to get some. In this case, we’ll work with Baseball Databank’s data:
First, download the ZIP file. Extract the .SQL file from the ZIP, and remember where you put it.
Now go into SQLyog. In SQLyog there will be three panes. The left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let's add a database. Right-click in the pane, and select "Create Database." Call it "BDB," leaving off the punctuation. Leave the rest as the defaults, and click okay.
Now, let's populate the database. Right-click on the database, and select "Restore from SQL Dump." There should be a button with an ellipsis (...) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It'll ask you if you want to execute the queries in database BDB. You do. This should take a little while.Once that's done, go into your database and click on the folder that says Tables. If nothing is there, try hitting "F5" on the top row of your keyboard to refresh the browser. Go ahead and click on batting, and then slide on over to the right and click on "Table Data." Scroll around for a bit. If you've ever used a spreadsheet or looked at a table before, the layout should feel familiar.
Here is where you may hit your first roadblock. Unzipping files can be a bit of a pain. If you can’t unzip the files by simply double clicking, you may need to download another program. One of the better ones for this task is 7-Zip.
You may also have trouble with the tables loading. Most of the time these will import, but may not show up. Refreshing multiple times is very helpful.
So, now you should have the toolbox of tools needed to reside in the top level of geekiness, even among sabermetricians. Now what do you do with these tools? In layman's terms, we have a database with lots of spreadsheets in it. To interact with this database, we need to talk to it. We do this task by writing queries.
To learn basic (but powerful!) queries you only need a few commands in SQL. I'll (attempt to) teach you these queries by teaching you what each term means (although most are self-explanatory), giving you a code, and showing you the results. You can copy-paste my code but I recommend typing it just so you get the hang of it.
The first of the commands we need is SELECT. SELECT is a very flexible command that will start almost all of your first queries. You tell the system what you want it to return. FROM is the second most basic command; it tells the computer where to search for what you’ve selected.
Open SQLyog. Browse through some of the tables you now have. Go to batting. It should have a bunch of columns named "playerid, yearid, stint…" Now we want to write a query. Go to the box in the top right where it says "1". If we simply want the database to return the entire batting table, we type:
SELECT * FROM BATTING;
Click "F9". This will execute the query. The asterisk tells the computer to select everything. The semicolon tells the computer that the query is finished. The query should display the entire batting table, which is too large to reproduce here.
Now, if we wanted to narrow our table and have only the specific columns playerid, yearid, and HR, we would write:
SELECT playerid, yearid, HR FROM BATTING;
That's much more manageable if still pretty useless. Next we have the commands AND and WHERE. WHERE allows us to be more specific with our requests. AND combines requests:
SELECT playerid, yearid, HR FROM BATTING WHERE teamid= 'oak' AND yearid = 2009;
You must put the OAK in single quotes. You can copy-paste this command into SQL if you want, but typing it out will help you learn it better. This command returns a list of all the 2009 A’s and their corresponding HRs:
However, we don't really care that Trevor Cahill hit 0 HRs. It would be much nicer to the eyes if we could order them. The command for this request is simply ORDER BY:
SELECT playerid, yearid, HR FROM BATTING WHERE teamid= 'oak' AND yearid = 2009 ORDER BY HR DESC;
The ORDER BY command defaults to ordering ascendingly. Adding DESC makes the table prettier to us, as we normally think of the A’s HR leaders in descending order.
Here's what the query gives:
The next command that will take a little work to master is joins. There are multiple ways to join, but I’ll explain the way that makes the most sense to me. The point of a join is to combine certain statistics from different tables. For example, if we wanted a player’s actual name instead of his player id we would do this:
SELECT master.nameLast, master.nameFirst, batting.yearid, batting.HR FROM batting, master WHERE master.playerid=batting.playerid AND teamid='oak' AND yearid = 2009 ORDER BY HR DESC;
Players’ actual names are stored under the "master" table, while the "batting" table only carries their playerids. However these playerids are also stored in the master table. Using the command
allows us to combine multiple columns within the master table as well as the batting table.
Here is the result of the final query:
A couple more functions and their use:
AS: Allows you to use a formula and renames the column
SELECT playerid, yearid, AB, H/AB AS AVG FROM BATTING
WHERE teamid= 'oak' AND yearid = 2009 ORDER BY AVG DESC;
HAVING: Allows one to set a limitation on a group.
SELECT playerid, yearid, AB, H/AB AS AVG FROM BATTING WHERE teamid= 'oak' AND yearid = 2009 HAVING AB>=50 ORDER BY AVG DESC;
GROUP BY, SUM: When combined with a SUM command, GROUP BY combines a player’s stats when he is in your query more than once. If a player changes teams in a single year, this will combine his stats for each team and give you his year-long stats.
SELECT playerid, SUM(HR) AS HR, SUM(RBI) AS RBI FROM batting WHERE playerid='hollima01' AND yearid=2009 GROUP BY playerid;
With just these simple commands, you should now be able to run a bunch of basic queries. With these you can accomplish a ton.
For example, every Oakland A All-Star:
SELECT m.nameLast, m.nameFirst, b.yearid, b.teamid FROM master m, allstar a, batting b WHERE a.playerid=m.playerid AND b.playerid=m.playerid AND b.yearid=a.yearid HAVING b.teamid='oak' ORDER BY a.yearid DESC;
Every playoff series the A's have ever won in any city:
SELECT yearid, round, teamidloser, wins, losses, ties FROM seriespost WHERE teamidwinner IN('OAK','PHA','KCA');
EDIT: 'KCA' in the above command should be 'KC1'.
HOFers ranked by the percentage they were voted in by:
SELECT m.namefirst, m.namelast, h.ballots, h.votes, h.votes/h.ballots AS PCTG FROM hofold h, MASTER m WHERE m.hofid=h.hofid AND h.ballots IS NOT NULL ORDER BY PCTG DESC;
Career HR leaders as Oakland A’s:
SELECT playerid, SUM(HR) AS HR FROM batting WHERE teamid='oak' GROUP BY playerid ORDER BY HR DESC LIMIT 5;
Here's what it will look like:
These examples are only the tip of the iceberg. One can link SQL to R or Access, not to mention the other databases we can play around with (including Retrosheet and pitchFX). If there’s interest I’ll give tutorials on more stuff we can do with SQL including building Marcels, heat maps, and developing linear weights.
If you want to practive, here’s what should be almost a review in the form of a SQL tutorial. If you want to get ahead, a couple of glances at connecting SQL to the pitchFX databaseand to R. Here's one that talks about installing it for Retrosheet.
Use this thread (all one or two of you who actually want to try SQL) as a place to post queries that you can’t figure out or tips you may pick up. You can ask me questions and I’ll do my best to answer them. Remember we’re all beginners here and no question is stupid.
Will you or did you try using SQL?
Looks way too complicated for me, but I'd still like to see what it can do with another FP (11 votes)
My head hurts, don't do this again (13 votes)
I tried it and liked it. I'd like to see another FP (17 votes)
I tried it and it was too hard; No more (1 vote)
42 total votes