PRESENTED BY
PRESENTED BY

An introduction to SQL or the Nerdiest Fanpost you'll see on AN

I bet I know what you’re thinking. I love baseball stats and this new statistical revolution in my favorite sport has really enhanced my understanding. But, darn it, sites like Fangraphs simply aren’t geeky enough for me.

I mean, who hasn’t had that thought?

Fangraphs is a fantastic site. I can look at things like player A’s wRAA and see that linear weights suggest he was worth a certain number of batting runs last season. I can see what percentage of pitches Player B swung out that were outside the strike zone. I can look at hundreds of different calculations and can analyze pretty much to my heart’s desire.

But, I’m selfish. What if I want more? I want the ability to calculate things that aren’t available at Fangraphs, using tons of data. Maybe I simply want to learn how to use pitchFX, that new tool that you see everywhere with them fancy complicated graphs? Well, this fanpost is a very basic look at a tool that’s been making it’s way from StatSpeak to the Book Blog to the Hardball Times to Beyond the Boxscore.

This offseason I’ve probably overextended myself. I started by creating a series that I’ve kinda completely left unfinished 20% of the way through. I also started reading most of the book Mathletics by Wayne Winston. Winston’s book was interesting* and inspired me. The book presented some of the things one could do with baseball data in Excel.

*If you want to read a review of it from a sabermetrician’s perspective, go here.

Over the course of the offseason, I’ve spent many an hour developing Excel sheets analyzing our national past time. To give you a taste, I can tell you that a lineup made entirely of 9 Barry Bondses from the year 2004 would score approximately 4,074 runs in a season based on a Monte Carlo estimation. I can tell you that based on CHONE, Marcel, Oliver and ZiPS, Marcel was the best at estimating wRAA. I can tell you that Raj Davis performed in the 80th percentile* in projection systems.

*How I defined this was I ranked all players by the formula

proj wRAA – actual wRAA

in each projection system. I averaged these ranks and ranked those averages. Raj Davis outperformed 79.8% of players this way. The highest on the 2009 A’s team: Adam Kennedy, 85th percentile.

The point is I messed around with Excel for ages and wrote a ton of formulas that were pretty much useless, although quite entertaining. Well now I give you a better way to waste one’s time writing formulas: SQL!

WARNING: To follow along with the rest of this fanpost will require quite a few downloads but not a penny of money. Every download mentioned is completely free (of viruses and of the need for cash).  Some of my links may open files you may need to download.  These links can be figured out through context.  If you don't wish to actually try SQL, you can follow along once we get to the commands by looking at my tables which are exported from SQL to Excel to HTML.

SQL (pronounced sequel) is a database language. In fact, it stands for structured query language. Sound geeky enough for you?

But don’t fret yet. I had never heard of it before I decided to learn it. I’m far from a master, but I’ve been messing with it only a few days and have gotten a pretty good grasp of the basics so far. Let’s start where I started. This article by Colin Wyers is one of the best introductions to using SQL for baseball.

I’ve managed to get through the entire article except his last bit of code (the Marcel projections). I recommend opening Colin's article in a new window. I’ll give you the basics of the article as it never hurts to get a second wording of information. I figure, as a neophyte, my explanations should not be overly complex.

Onto the first step: the software. Also, as a note, I'm using PC and so is Colin in his article. Any other OS might need to use different software. In this case, the first software needed is SQL itself. Colin suggests MySQL in his article. Here is the appropriate section on what to download:

The next thing we need is a GUI to use MySQL. The link in Colin’s article links to a download page that only has free trials. I use the community version. You can download it here. Now that we have SQLyog and SQL, all we need is data. Again, I’ll quote Colin’s article on how to get some. In this case, we’ll work with Baseball Databank’s data:

First, download the ZIP file. Extract the .SQL file from the ZIP, and remember where you put it.

Now go into SQLyog. In SQLyog there will be three panes. The left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let's add a database. Right-click in the pane, and select "Create Database." Call it "BDB," leaving off the punctuation. Leave the rest as the defaults, and click okay.

Now, let's populate the database. Right-click on the database, and select "Restore from SQL Dump." There should be a button with an ellipsis (...) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It'll ask you if you want to execute the queries in database BDB. You do. This should take a little while.

Once that's done, go into your database and click on the folder that says Tables. If nothing is there, try hitting "F5" on the top row of your keyboard to refresh the browser. Go ahead and click on batting, and then slide on over to the right and click on "Table Data." Scroll around for a bit. If you've ever used a spreadsheet or looked at a table before, the layout should feel familiar.

Here is where you may hit your first roadblock. Unzipping files can be a bit of a pain. If you can’t unzip the files by simply double clicking, you may need to download another program. One of the better ones for this task is 7-Zip.

You may also have trouble with the tables loading. Most of the time these will import, but may not show up. Refreshing multiple times is very helpful.

So, now you should have the toolbox of tools needed to reside in the top level of geekiness, even among sabermetricians.  Now what do you do with these tools? In layman's terms, we have a database with lots of spreadsheets in it. To interact with this database, we need to talk to it. We do this task by writing queries.

To learn basic (but powerful!) queries you only need a few commands in SQL.  I'll (attempt to) teach you these queries by teaching you what each term means (although most are self-explanatory), giving you a code, and showing you the results.  You can copy-paste my code but I recommend typing it just so you get the hang of it.

The first of the commands we need is SELECT. SELECT is a very flexible command that will start almost all of your first queries. You tell the system what you want it to return. FROM is the second most basic command; it tells the computer where to search for what you’ve selected.

Open SQLyog. Browse through some of the tables you now have. Go to batting. It should have a bunch of columns named "playerid, yearid, stint…" Now we want to write a query. Go to the box in the top right where it says "1". If we simply want the database to return the entire batting table, we type:

```SELECT *
FROM BATTING;```

Click "F9". This will execute the query. The asterisk tells the computer to select everything. The semicolon tells the computer that the query is finished. The query should display the entire batting table, which is too large to reproduce here.

Now, if we wanted to narrow our table and have only the specific columns playerid, yearid, and HR, we would write:

```SELECT playerid, yearid, HR
FROM BATTING;```

The result:

 playerid yearid HR abercda01 1871 0 addybo01 1871 0 …. …. ….

That's much more manageable if still pretty useless.  Next we have the commands AND and WHERE. WHERE allows us to be more specific with our requests. AND combines requests:

```SELECT playerid, yearid, HR
FROM BATTING
WHERE teamid= 'oak' AND yearid = 2009;```

You must put the OAK in single quotes. You can copy-paste this command into SQL if you want, but typing it out will help you learn it better. This command returns a list of all the 2009 A’s and their corresponding HRs:

 playerid yearid HR cahiltr01 2009 0 outmajo01 2009 0 cunniaa01 2009 1 mazzavi01 2009 0 ..... ..... ....

However, we don't really care that Trevor Cahill hit 0 HRs.  It would be much nicer to the eyes if we could order them. The command for this request is simply ORDER BY:

```SELECT playerid, yearid, HR
FROM BATTING
WHERE teamid= 'oak' AND yearid = 2009
ORDER BY HR DESC; ```

The ORDER BY command defaults to ordering ascendingly. Adding DESC makes the table prettier to us, as we normally think of the A’s HR leaders in descending order.

Here's what the query gives:

 playerid yearid HR custja01 2009 25 suzukku01 2009 15 hollima01 2009 11 giambja01 2009 11 kennead01 2009 11 ...... ...... ......

The next command that will take a little work to master is joins. There are multiple ways to join, but I’ll explain the way that makes the most sense to me. The point of a join is to combine certain statistics from different tables. For example, if we wanted a player’s actual name instead of his player id we would do this:

```SELECT master.nameLast, master.nameFirst, batting.yearid, batting.HR
FROM batting, master
WHERE master.playerid=batting.playerid
AND teamid='oak' AND yearid = 2009
ORDER BY HR DESC;```

Players’ actual names are stored under the "master" table, while the "batting" table only carries their playerids. However these playerids are also stored in the master table. Using the command

WHERE master.playerid=batting.playerid

allows us to combine multiple columns within the master table as well as the batting table.

Here is the result of the final query:

 nameLast nameFirst yearid HR Cust Jack 2009 25 Suzuki Kurt 2009 15 Holliday Matt 2009 11 Giambi Jason 2009 11 ...... ....... ....... ......

A couple more functions and their use:

AS: Allows you to use a formula and renames the column

```SELECT playerid, yearid, AB, H/AB AS AVG
FROM BATTING         WHERE teamid= 'oak' AND yearid = 2009
ORDER BY AVG DESC;```
 playerid yearid AB AVG gonzaed01 2009 2 1 cahiltr01 2009 2 0.5 davisra01 2009 390 0.3051 sweenry01 2009 484 0.2934 kennead01 2009 529 0.2892 ....... ........ ....... ........

HAVING: Allows one to set a limitation on a group.

```SELECT playerid, yearid, AB, H/AB AS AVG
FROM BATTING
WHERE teamid= 'oak' AND yearid = 2009
HAVING AB>=50
ORDER BY AVG DESC;```
 playerid yearid AB AVG davisra01 2009 390 0.3051 sweenry01 2009 484 0.2934 kennead01 2009 529 0.2892 patteer01 2009 94 0.2872 hollima01 2009 346 0.2861 ......... ......... ....... .......

GROUP BY, SUM: When combined with a SUM command, GROUP BY combines a player’s stats when he is in your query more than once. If a player changes teams in a single year, this will combine his stats for each team and give you his year-long stats.

```SELECT playerid, SUM(HR) AS HR, SUM(RBI) AS RBI
FROM batting
WHERE playerid='hollima01' AND yearid=2009
GROUP BY playerid;```
 playerid HR RBI hollima01 24 109

With just these simple commands, you should now be able to run a bunch of basic queries. With these you can accomplish a ton.

For example, every Oakland A All-Star:

```SELECT m.nameLast, m.nameFirst, b.yearid, b.teamid
FROM master m, allstar a, batting b
WHERE a.playerid=m.playerid AND b.playerid=m.playerid
AND b.yearid=a.yearid
HAVING b.teamid='oak'
ORDER BY a.yearid DESC;```
 nameLast nameFirst yearid teamid Bailey Andrew 2009 OAK Duchscherer Justin 2008 OAK Haren Danny 2007 OAK Zito Barry 2006 OAK Duchscherer Justin 2005 OAK Hudson Tim 2004 OAK Mulder Mark 2004 OAK Foulke Keith 2003 OAK Hernandez Ramon 2003 OAK Mulder Mark 2003 OAK ..... ..... ...... .......

Every playoff series the A's have ever won in any city:

```SELECT yearid, round, teamidloser, wins, losses, ties
FROM seriespost
WHERE teamidwinner IN('OAK','PHA','KCA');```

EDIT: 'KCA' in the above command should be 'KC1'.

Result:

 yearID round teamidloser wins losses 1910 WS CHN 4 1 1911 WS NY1 4 2 1913 WS NY1 4 1 1929 WS CHN 4 1 1930 WS SLN 4 2 1972 ALCS DET 3 2 1972 WS CIN 4 3 1973 ALCS BAL 3 2 1973 WS NYN 4 3 1974 ALCS BAL 3 1 1974 WS LAN 4 1 1980 ALCS NYA 3 0 1981 AWDIV KCA 3 0 1985 ALCS TOR 4 3 1985 WS SLN 4 3 1988 ALCS BOS 4 0 1989 ALCS TOR 4 1 1989 WS SFN 4 0 1990 ALCS BOS 4 0 2006 ALDS2 MIN 3 0

HOFers ranked by the percentage they were voted in by:

```SELECT m.namefirst, m.namelast, h.ballots, h.votes, h.votes/h.ballots AS PCTG
FROM hofold h, MASTER m
WHERE m.hofid=h.hofid AND h.ballots IS NOT NULL
ORDER BY PCTG DESC; ```

Result:

 nameFirst nameLast ballots votes PCTG Tom Seaver 430 425 0.9884 Nolan Ryan 497 491 0.9879 Cal Ripken 545 537 0.9853 Ty Cobb 226 222 0.9823 George Brett 497 488 0.9819 …………. …………. …………. …………. ………….

Career HR leaders as Oakland A’s:

```SELECT playerid, SUM(HR) AS HR
FROM batting
WHERE teamid='oak'
GROUP BY playerid
ORDER BY HR DESC
LIMIT 5; ```

Here's what it will look like:

 playerid HR mcgwima01 363 jacksre01 268 cansejo01 254 chaveer01 229 giambja01 198

These examples are only the tip of the iceberg. One can link SQL to R or Access, not to mention the other databases we can play around with (including Retrosheet and pitchFX). If there’s interest I’ll give tutorials on more stuff we can do with SQL including building Marcels, heat maps, and developing linear weights.

If you want to practive, here’s what should be almost a review in the form of a SQL tutorial.  If you want to get ahead, a couple of glances at connecting SQL to the pitchFX databaseand to R. Here's one that talks about installing it for Retrosheet.

Use this thread (all one or two of you who actually want to try SQL) as a place to post queries that you can’t figure out or tips you may pick up. You can ask me questions and I’ll do my best to answer them. Remember we’re all beginners here and no question is stupid.

Recent FanPosts

A's Team Performance

Shortcuts to mastering the comment thread. Use wisely.

C - Next Comment
X - Mark as Read