| Sign Up | Google+

Fanposts

PRESENTED BY
PRESENTED BY

An introduction to SQL or the Nerdiest Fanpost you'll see on AN

 

I bet I know what you’re thinking. I love baseball stats and this new statistical revolution in my favorite sport has really enhanced my understanding. But, darn it, sites like Fangraphs simply aren’t geeky enough for me.

I mean, who hasn’t had that thought?

Fangraphs is a fantastic site. I can look at things like player A’s wRAA and see that linear weights suggest he was worth a certain number of batting runs last season. I can see what percentage of pitches Player B swung out that were outside the strike zone. I can look at hundreds of different calculations and can analyze pretty much to my heart’s desire.

But, I’m selfish. What if I want more? I want the ability to calculate things that aren’t available at Fangraphs, using tons of data. Maybe I simply want to learn how to use pitchFX, that new tool that you see everywhere with them fancy complicated graphs? Well, this fanpost is a very basic look at a tool that’s been making it’s way from StatSpeak to the Book Blog to the Hardball Times to Beyond the Boxscore.

This offseason I’ve probably overextended myself. I started by creating a series that I’ve kinda completely left unfinished 20% of the way through. I also started reading most of the book Mathletics by Wayne Winston. Winston’s book was interesting* and inspired me. The book presented some of the things one could do with baseball data in Excel.

*If you want to read a review of it from a sabermetrician’s perspective, go here.

Over the course of the offseason, I’ve spent many an hour developing Excel sheets analyzing our national past time. To give you a taste, I can tell you that a lineup made entirely of 9 Barry Bondses from the year 2004 would score approximately 4,074 runs in a season based on a Monte Carlo estimation. I can tell you that based on CHONE, Marcel, Oliver and ZiPS, Marcel was the best at estimating wRAA. I can tell you that Raj Davis performed in the 80th percentile* in projection systems.

*How I defined this was I ranked all players by the formula

proj wRAA – actual wRAA

in each projection system. I averaged these ranks and ranked those averages. Raj Davis outperformed 79.8% of players this way. The highest on the 2009 A’s team: Adam Kennedy, 85th percentile.

The point is I messed around with Excel for ages and wrote a ton of formulas that were pretty much useless, although quite entertaining. Well now I give you a better way to waste one’s time writing formulas: SQL!

WARNING: To follow along with the rest of this fanpost will require quite a few downloads but not a penny of money. Every download mentioned is completely free (of viruses and of the need for cash).  Some of my links may open files you may need to download.  These links can be figured out through context.  If you don't wish to actually try SQL, you can follow along once we get to the commands by looking at my tables which are exported from SQL to Excel to HTML.

SQL (pronounced sequel) is a database language. In fact, it stands for structured query language. Sound geeky enough for you?

But don’t fret yet. I had never heard of it before I decided to learn it. I’m far from a master, but I’ve been messing with it only a few days and have gotten a pretty good grasp of the basics so far. Let’s start where I started. This article by Colin Wyers is one of the best introductions to using SQL for baseball.

I’ve managed to get through the entire article except his last bit of code (the Marcel projections). I recommend opening Colin's article in a new window. I’ll give you the basics of the article as it never hurts to get a second wording of information. I figure, as a neophyte, my explanations should not be overly complex.

Onto the first step: the software. Also, as a note, I'm using PC and so is Colin in his article. Any other OS might need to use different software. In this case, the first software needed is SQL itself. Colin suggests MySQL in his article. Here is the appropriate section on what to download:

First, you need to download and run the installer for MySQL 5.4 CE Essentials. Choose a "typical" install. You'll typically have to wait a few minutes for files to copy. Then there will be some ads. Skip over them. Make sure that "Configure the MySQL server now" is checked. (I uncheck the box about registering the software with MySQL, but that's up to you.) Chose "Detailed configuration." Then you'll go through a set of pages asking you question. Here are the answers: 1.Developer Machine 2. Non-Transactional Database Only 3.Decision Support (DSS)/OLAP Continue on, using the defaults. When prompted, either add a password or uncheck the box to alter security settings. (This will leave your root password blank.) Once you're finished, you shouldn't have to start MySQL at all—it should be ever vigilant, waiting for you to command it to do your sabermetric bidding.

The next thing we need is a GUI to use MySQL. The link in Colin’s article links to a download page that only has free trials. I use the community version. You can download it here. Now that we have SQLyog and SQL, all we need is data. Again, I’ll quote Colin’s article on how to get some. In this case, we’ll work with Baseball Databank’s data:

First, download the ZIP file. Extract the .SQL file from the ZIP, and remember where you put it.

Now go into SQLyog. In SQLyog there will be three panes. The left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let's add a database. Right-click in the pane, and select "Create Database." Call it "BDB," leaving off the punctuation. Leave the rest as the defaults, and click okay.

Now, let's populate the database. Right-click on the database, and select "Restore from SQL Dump." There should be a button with an ellipsis (...) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It'll ask you if you want to execute the queries in database BDB. You do. This should take a little while.

Once that's done, go into your database and click on the folder that says Tables. If nothing is there, try hitting "F5" on the top row of your keyboard to refresh the browser. Go ahead and click on batting, and then slide on over to the right and click on "Table Data." Scroll around for a bit. If you've ever used a spreadsheet or looked at a table before, the layout should feel familiar.

Here is where you may hit your first roadblock. Unzipping files can be a bit of a pain. If you can’t unzip the files by simply double clicking, you may need to download another program. One of the better ones for this task is 7-Zip.

You may also have trouble with the tables loading. Most of the time these will import, but may not show up. Refreshing multiple times is very helpful.

So, now you should have the toolbox of tools needed to reside in the top level of geekiness, even among sabermetricians.  Now what do you do with these tools? In layman's terms, we have a database with lots of spreadsheets in it. To interact with this database, we need to talk to it. We do this task by writing queries.

To learn basic (but powerful!) queries you only need a few commands in SQL.  I'll (attempt to) teach you these queries by teaching you what each term means (although most are self-explanatory), giving you a code, and showing you the results.  You can copy-paste my code but I recommend typing it just so you get the hang of it.

The first of the commands we need is SELECT. SELECT is a very flexible command that will start almost all of your first queries. You tell the system what you want it to return. FROM is the second most basic command; it tells the computer where to search for what you’ve selected.

Open SQLyog. Browse through some of the tables you now have. Go to batting. It should have a bunch of columns named "playerid, yearid, stint…" Now we want to write a query. Go to the box in the top right where it says "1". If we simply want the database to return the entire batting table, we type:

SELECT *
     FROM BATTING;

Click "F9". This will execute the query. The asterisk tells the computer to select everything. The semicolon tells the computer that the query is finished. The query should display the entire batting table, which is too large to reproduce here.

Now, if we wanted to narrow our table and have only the specific columns playerid, yearid, and HR, we would write:

SELECT playerid, yearid, HR
 	FROM BATTING;

The result:

playerid           yearid                HR
abercda01 1871 0
addybo01 1871 0
…. …. ….

 

That's much more manageable if still pretty useless.  Next we have the commands AND and WHERE. WHERE allows us to be more specific with our requests. AND combines requests:

SELECT playerid, yearid, HR
 	FROM BATTING
        WHERE teamid= 'oak' AND yearid = 2009;

You must put the OAK in single quotes. You can copy-paste this command into SQL if you want, but typing it out will help you learn it better. This command returns a list of all the 2009 A’s and their corresponding HRs:

playerid yearid HR
cahiltr01 2009 0
outmajo01 2009 0
cunniaa01 2009 1
mazzavi01 2009 0
.....
.....
....

 

However, we don't really care that Trevor Cahill hit 0 HRs.  It would be much nicer to the eyes if we could order them. The command for this request is simply ORDER BY:

SELECT playerid, yearid, HR
 	FROM BATTING
        WHERE teamid= 'oak' AND yearid = 2009
        ORDER BY HR DESC; 

The ORDER BY command defaults to ordering ascendingly. Adding DESC makes the table prettier to us, as we normally think of the A’s HR leaders in descending order.

Here's what the query gives:

playerid          yearid                HR
custja01 2009 25
suzukku01 2009 15
hollima01 2009 11
giambja01 2009 11
kennead01 2009 11
...... ......
......

 

The next command that will take a little work to master is joins. There are multiple ways to join, but I’ll explain the way that makes the most sense to me. The point of a join is to combine certain statistics from different tables. For example, if we wanted a player’s actual name instead of his player id we would do this:

SELECT master.nameLast, master.nameFirst, batting.yearid, batting.HR
 	FROM batting, master 
        WHERE master.playerid=batting.playerid 
              AND teamid='oak' AND yearid = 2009
        ORDER BY HR DESC;

Players’ actual names are stored under the "master" table, while the "batting" table only carries their playerids. However these playerids are also stored in the master table. Using the command

WHERE master.playerid=batting.playerid

allows us to combine multiple columns within the master table as well as the batting table.

Here is the result of the final query:

nameLast nameFirst          yearid                HR
Cust Jack 2009 25
Suzuki Kurt 2009 15
Holliday Matt 2009 11
Giambi Jason 2009 11
...... ....... ....... ......

 

A couple more functions and their use:

 

AS: Allows you to use a formula and renames the column

SELECT playerid, yearid, AB, H/AB AS AVG
 	FROM BATTING 
WHERE teamid= 'oak' AND yearid = 2009 ORDER BY AVG DESC;
playerid           yearid                 AB              AVG
gonzaed01 2009 2 1
cahiltr01 2009 2 0.5
davisra01 2009 390 0.3051
sweenry01 2009 484 0.2934
kennead01 2009 529 0.2892
.......
........
.......
........

 

HAVING: Allows one to set a limitation on a group.

SELECT playerid, yearid, AB, H/AB AS AVG
 	FROM BATTING 
        WHERE teamid= 'oak' AND yearid = 2009
        HAVING AB>=50
        ORDER BY AVG DESC;
playerid           yearid
                AB              AVG
davisra01 2009 390 0.3051
sweenry01 2009 484 0.2934
kennead01 2009 529 0.2892
patteer01 2009 94 0.2872
hollima01 2009 346 0.2861
.........
.........
.......
.......

GROUP BY, SUM: When combined with a SUM command, GROUP BY combines a player’s stats when he is in your query more than once. If a player changes teams in a single year, this will combine his stats for each team and give you his year-long stats.

SELECT playerid, SUM(HR) AS HR, SUM(RBI) AS RBI
 	FROM batting
 	WHERE playerid='hollima01' AND yearid=2009
 	GROUP BY playerid;
playerid HR RBI
hollima01 24 109

 

With just these simple commands, you should now be able to run a bunch of basic queries. With these you can accomplish a ton.

For example, every Oakland A All-Star:

SELECT m.nameLast, m.nameFirst, b.yearid, b.teamid
 	FROM master m, allstar a, batting b
 	WHERE a.playerid=m.playerid AND b.playerid=m.playerid
             AND b.yearid=a.yearid
 	HAVING b.teamid='oak'
 	ORDER BY a.yearid DESC;
nameLast nameFirst yearid teamid
Bailey Andrew 2009 OAK
Duchscherer Justin 2008 OAK
Haren Danny 2007 OAK
Zito Barry 2006 OAK
Duchscherer Justin 2005 OAK
Hudson Tim 2004 OAK
Mulder Mark 2004 OAK
Foulke Keith 2003 OAK
Hernandez Ramon 2003 OAK
Mulder Mark 2003 OAK
.....
.....
......
.......

 

 

Every playoff series the A's have ever won in any city:

SELECT yearid, round, teamidloser, wins, losses, ties
 	FROM seriespost
 	WHERE teamidwinner IN('OAK','PHA','KCA');

EDIT: 'KCA' in the above command should be 'KC1'.

Result:

yearID round teamidloser wins losses
1910 WS CHN 4 1
1911 WS NY1 4 2
1913 WS NY1 4 1
1929 WS CHN 4 1
1930 WS SLN 4 2
1972 ALCS DET 3 2
1972 WS CIN 4 3
1973 ALCS BAL 3 2
1973 WS NYN 4 3
1974 ALCS BAL 3 1
1974 WS LAN 4 1
1980 ALCS NYA 3 0
1981 AWDIV KCA 3 0
1985 ALCS TOR 4 3
1985 WS SLN 4 3
1988 ALCS BOS 4 0
1989 ALCS TOR 4 1
1989 WS SFN 4 0
1990 ALCS BOS 4 0
2006 ALDS2 MIN 3 0

 

HOFers ranked by the percentage they were voted in by:

SELECT m.namefirst, m.namelast, h.ballots, h.votes, h.votes/h.ballots AS PCTG
 	FROM hofold h, MASTER m
 	WHERE m.hofid=h.hofid AND h.ballots IS NOT NULL
 	ORDER BY PCTG DESC; 

Result:

nameFirst nameLast ballots votes PCTG
Tom Seaver 430 425 0.9884
Nolan Ryan 497 491 0.9879
Cal Ripken 545 537 0.9853
Ty Cobb 226 222 0.9823
George Brett 497 488 0.9819
…………. …………. …………. …………. ………….

 

Career HR leaders as Oakland A’s:

SELECT playerid, SUM(HR) AS HR
 	FROM batting
 	WHERE teamid='oak'
 	GROUP BY playerid
 	ORDER BY HR DESC
 		LIMIT 5; 

Here's what it will look like:

playerid               
HR
mcgwima01 363
jacksre01 268
cansejo01 254
chaveer01 229
giambja01 198

 

These examples are only the tip of the iceberg. One can link SQL to R or Access, not to mention the other databases we can play around with (including Retrosheet and pitchFX). If there’s interest I’ll give tutorials on more stuff we can do with SQL including building Marcels, heat maps, and developing linear weights.

If you want to practive, here’s what should be almost a review in the form of a SQL tutorial.  If you want to get ahead, a couple of glances at connecting SQL to the pitchFX databaseand to R. Here's one that talks about installing it for Retrosheet. 

Use this thread (all one or two of you who actually want to try SQL) as a place to post queries that you can’t figure out or tips you may pick up. You can ask me questions and I’ll do my best to answer them. Remember we’re all beginners here and no question is stupid.

                                                                                                                                                                                                               

Recent FanPosts

View All Fan Posts

The Next FanPosts

There are 64 Comments. Load Now. Loading

Shortcuts to mastering the comment thread. Use wisely.

C - Next Comment
X - Mark as Read

R - Reply
Z - Mark Read & Next

Shift + C - Previous
Shift + A - Mark All Read

Comment Settings

Live comment alert: Hide it!

Comments for this post are closed.

tracking_pixel_5351_tracker