FanPost

R: Or How YOU Can Make Pretty Pictures

More and more, on the internet and on AN, I find that the most thoughtful and enlightening baseball posts are built around interesting graphics. On AN, we have a few experts who share unique and thought-provoking plots.

However, many people believe that they simply don't have the time or skills to create such great works of art as a danmerquery, elcroata, the wizards at BTB, or countless others across the internet.  I'm here to tell you that you're wrong.  If I can do it, you can do it.

This fanpost is here to give you one of the most powerful graphical tools you can use and to make it relatively easy to comprehend*: R. 

*I hope

R is statistical software that you encounter more and more as you advance into the statistical world.  You can download R onto your machine by going here.

Scroll to "Getting Started", click the CRAN Mirror link, pick the mirror closest to you, and then choose your operating system and any other required options.  This process shouldn't be too hard, and there are FAQs on the CRAN site that should help if you need them.

Once you've downloaded R, you have a ton of power in your hands.  Anything a statistician can dream of is available in R, can be programmed in it, or can be connected to it.  There are packages to connect R with Excel, MySQL, Google Earth, Google Maps, anything.  But what we want to use R for is its graphics. 

The first, and most important, part of creating a good graphic is knowing what to plot. An informative plot needs to portray information in a new and interesting way. Coming up with this idea should be the hardest part of creating a graphic. 

 I believe that the quickest way of learning how to write R code is seeing relatively simple examples.   So below are a few ideas of different graphs you can create quickly and without too much work. 

But first we need our data.  Unfortunately, data can be saved in many different ways, and it can sometimes take effort to coerce it into the shape we want.  Thankfully, quite a few websites, including Fangraphs and Baseball Reference, allow you to download data in csv format*.  I went here and then clicked the link highlighted below to get the first data I worked with:

*CSV is a very nice format as you can also save an Excel sheet as CSV by simply going to File> Save As.

[Screenshot: Baseball Reference page with the CSV link highlighted]

 

Next I simply copy-pasted this csv data into Notepad (or any other simple text editor or even Excel) and saved it as a csv file (I saved the file as "AsRecords.csv").  Make sure to save this file into R's working directory.  You can find out what directory that is by typing this command into R:

getwd()

You can also change the directory under the File menu in R.  Now that we have the file in our working directory, we simply type:

records = read.csv("AsRecords.csv")

Using this command, we've created a new variable (records) and set it equal to the values of the csv.  If you type "records" now, R will return the data frame with all the data we got from BBRef.  We can use some simple commands such as class, head, and names to take our first peek at the data:

[Screenshot: output of class, head, names, and records$W]

First we check the class of our variable and, as expected, it's a data frame.  Then we look at the first 6 rows of data.  This is always a good idea to make sure the data transferred properly.  It did.  We also can look at the names of each column.   The next trick was to use the dollar sign to subset.  We chose the column "W" to look at the number of wins each year (We could get the same result with the " records[["W"]] " command).   Subsetting will be important later. 
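In case the screenshot doesn't come through, here is a rough sketch of those first-look commands. To keep it self-contained, it reads three seasons (the ones quoted later in this post) from an inline string instead of AsRecords.csv. The header row, including "W-L%", is my assumption about what the BBRef export looks like.

```r
# Stand-in for AsRecords.csv: three seasons quoted later in this post.
# The header names are an assumption about the BBRef export.
csv_text <- "Year,W,L,W-L%
1916,36,117,.235
1931,107,45,.704
2002,103,59,.636"

records <- read.csv(text = csv_text)

class(records)    # what kind of object is it?
head(records)     # first rows (up to 6)
names(records)    # read.csv rewrites "W-L%" as "W.L."
records$W         # one column, via the dollar sign
records[["W"]]    # same column, via double brackets
```

Note the column name: by default read.csv turns characters that aren't allowed in R names into dots, which is why the winning percentage column ends up as W.L. rather than W-L%.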

There are further commands like mean, sd, median, max, min, and summary that you could use to analyze the data, but we've gone way too long without getting to the graph part of this post, right?
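For the curious, a quick sketch of a few of those commands is below. The three-row data frame is only a stand-in built from seasons quoted later in this post, not the full BBRef file, so the numbers it prints are not the real franchise averages.

```r
# Stand-in data: three seasons quoted later in this post, not the full file.
records <- data.frame(Year = c(1916, 1931, 2002),
                      W    = c(36, 107, 103),
                      L    = c(117, 45, 59))

mean(records$W)      # average wins
median(records$W)    # middle value
sd(records$W)        # spread
max(records$W)       # most wins
min(records$W)       # fewest wins
summary(records$W)   # quartiles plus the mean, all at once

# Which season had the most wins?
best <- records$Year[which.max(records$W)]
best
```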

Try the following command:

plot(records$W)

The result is the following bland graph of wins over the years:

[Figure: default plot of wins]

 

That doesn't look like something that would make the front page of AN, does it?  It's confusing, unclear, and generally ugly.  What we can do instead is make the graph easier to understand, and more interesting to look at  simply by adding in a couple extra variables.  Like this:

plot(records$Year, records$W.L., type ="l",

                main = "A's Winning Percentage",

                xlab = "Year", ylab = "Winning %",

                 col="forestgreen", lwd=2)

Our result:

[Figure: line graph of the A's winning percentage by year]

Our graph is no longer quite so bland.  We've changed it into a line graph instead of a dot plot.  Instead of wins, it's a graph of winning percentage.  Instead of just a graph of one variable, it has dates on the x axis.  The axes are labeled.  It's a different color.   But how did I know what variables to add to my command to produce so many pretty additions?

One of the most valuable tools in R is the help menu.  To access the help file for a specific command simply type:

?plot

R opens a help file that tells you what the command does, and if you scroll down, the arguments.  In the arguments section, one can find variables like type, main, and xlab, and their uses. All the arguments I used are in the plot help file or the par help file.
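If you'd rather check from the prompt which help file an argument lives in, something like this sketch should work. It assumes the arguments you care about belong to plot.default (the method behind an ordinary plot of x against y) or to par. The pdf(NULL) line is just there so that par() has a throwaway device to query.

```r
# Interactive lookups (type these at the R prompt):
# ?plot
# ?par

# Arguments of the default plot method
plot_args <- names(formals(graphics::plot.default))
c("type", "main", "xlab", "ylab") %in% plot_args

# col and lwd are graphical parameters, documented under ?par
pdf(NULL)                    # throwaway device, nothing is written to disk
par_names <- names(par())
invisible(dev.off())
c("col", "lwd") %in% par_names
```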

We've now created a reasonable-looking graph, but one that could use a couple more additions.  If you didn't close the graph window between the first plot command I gave you and the second, you will have noticed that R automatically cleared the first graph and drew the new one in its place.

While this process holds true for many functions, it doesn't for all.  Functions such as text(), points(), lines(), abline(), symbols(), segments(), polygon(), rect(), box(), legend(), and axis() all add to the current plot.  You could look up the help file for each of those functions to see what they do, but like most of the functions we've used, their names give you at least some idea.  I added a few of them to my previous code to come up with:

plot(records$Year, records$W.L., type ="l",

                main = "A's Winning Percentage",

                xlab = "Year", ylab = "Winning %",

                 col="forestgreen", lwd=2)

abline(h=.5)

rect(1850,0.4, 2050,0.6, col="gold", density=25)

legend(1960, 0.33, title="Middle of the Pack:",

                 legend=".400-.600 win %\n65-97 W/162 G", fill="gold")

text(1931, .704, pos=4, "Max%: .704 \n 1931: 107-45", cex=.75)

text(1916, .235, pos=4, "Min%: .235 \n 1916: 36-117", cex=.75)

text(2002,.636, pos=3, "Last time > .600 \n 2002: 103-59 (.636)", cex=.7)

I think this is a pretty good graph (if I do say so myself): 

[Figure: A's winning percentage by year with the .400-.600 band, legend, and annotations]

It's starting to get a bit crowded, but I think it conveys a lot of good information.  One can see just how many ups and downs the A's have had as a franchise.   The rectangle highlights just how high and how low those winning percentages are.  It tells us (in the legend) what those percentages are per 162 games.  Text on the graph shows the A's franchise record highs and lows and gives information about the years they were achieved, as well as the last time the A's passed .600.
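If you'd rather not hunt for the highs and lows by eye, something like the following should find them and build the label for you. It's only a sketch: the three-row data frame is a stand-in for the full records table, and pct() is a little helper I made up for this example. It also shows where the 65-97 in the legend comes from.

```r
# Stand-in data: the three seasons annotated on the graph.
records <- data.frame(Year = c(1916, 1931, 2002),
                      W    = c(36, 107, 103),
                      L    = c(117, 45, 59),
                      W.L. = c(.235, .704, .636))

# The .400 and .600 band edges as wins in a 162-game season
band_wins <- round(c(.400, .600) * 162)
band_wins

# Rows holding the highest and lowest winning percentage
hi <- records[which.max(records$W.L.), ]
lo <- records[which.min(records$W.L.), ]

# Hypothetical helper: format 0.704 as ".704"
pct <- function(x) sub("^0", "", sprintf("%.3f", x))

hi_label <- sprintf("Max%%: %s \n %d: %d-%d", pct(hi$W.L.), hi$Year, hi$W, hi$L)
lo_label <- sprintf("Min%%: %s \n %d: %d-%d", pct(lo$W.L.), lo$Year, lo$W, lo$L)

# Draw a bare-bones version and place the labels at the data points
plot(records$Year, records$W.L., type = "l", ylim = c(0.2, 0.8))
text(hi$Year, hi$W.L., pos = 4, hi_label, cex = .75)
text(lo$Year, lo$W.L., pos = 4, lo_label, cex = .75)
```

The upside of doing it this way is that the labels stay correct if you re-download the data after another season.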

This entire graph was basically written by one command (our original plot function); the remaining lines of code simply make the graph prettier and add a bit more info.

Let's move onto a different dataset.  The next one I tried was the A's team stats on Fangraphs.  I selected all years between 1901 and 2010.  Here's the link.

Just like BBRef, Fangraphs provides a link to save as CSV.  I opened this download in Notepad and saved it as a CSV in my working directory.  Here's where we hit our first snag with downloading CSV data into R.  Fangraphs saved the names in the csv file with links to their player pages. 

We have to get rid of the html code to properly use the data. If you know a clever way to do this in Excel, by all means use it.  I used R to split the expressions by the > and < tags. 

Here's what I used:

 

 

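WAR = read.csv("AsWARLeaders.csv", stringsAsFactors=FALSE)

#stringsAsFactors=FALSE keeps the names as plain text; strsplit won't work on factors

for (i in seq_len(nrow(WAR))){
    #keep what follows the first ">": the name plus the closing tag
    WAR[i,1] = strsplit(WAR[i,1], ">")[[1]][2]
    #then keep what comes before the "<" of the closing tag
    WAR[i,1] = strsplit(WAR[i,1], "<")[[1]][1]
}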
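A loop is not the only way to do this. As a rough alternative, gsub can strip anything that looks like a tag from the whole column at once. The linked names below are made up for illustration; the real Fangraphs markup may look different, so treat the pattern as a starting point.

```r
# Hypothetical linked names; the actual Fangraphs markup may differ.
linked <- c('<a href="/player/1">Player One</a>',
            '<a href="/player/2">Player Two</a>')

# Remove every <...> chunk from each element of the vector
clean <- gsub("<[^>]*>", "", linked)
clean
```

On the real data this would be something along the lines of WAR[,1] = gsub("<[^>]*>", "", WAR[,1]).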

 

Now we can create a graph with the data we used.  I decided to create a barplot of the top 15 performances of players in an A's uniform:

 

par(mar = c(7,4,4,2))
barplot(WAR$WAR[1:15], names.arg=WAR$Name[1:15], las=2,
cex.names=.6, space=0, col=c("forestgreen", "gold"),
main = "Career fWAR Leaders in A's Uniform", ylim=c(0,80))

text(1:15-.5, WAR$WAR[1:15]+2, WAR$WAR[1:15], cex=.7)

 

Here is the result:

[Figure: bar plot of career fWAR leaders in an A's uniform]

 

Again, this graph is one real line of code with four extra to prettify it.  A couple of the extra variables included in the commands you can find by going to ?par rather than ?barplot.  
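One detail worth spelling out: the 1:15-.5 in the text() call works because, with space=0, each bar is one unit wide, so the bar centers sit at 0.5, 1.5, 2.5, and so on. barplot also hands those centers back as its return value, so you can let R do the arithmetic. Here's a small sketch with made-up values (these are not real fWAR numbers):

```r
# Hypothetical values, just to show the mechanics.
war <- c(30, 20, 10)
who <- c("Player One", "Player Two", "Player Three")

par(mar = c(7, 4, 4, 2))    # extra bottom margin for the rotated names
mids <- barplot(war, names.arg = who, space = 0, las = 2,
                cex.names = .6, col = c("forestgreen", "gold"),
                ylim = c(0, 40))

# Put each value just above the center of its bar
text(mids, war + 2, war, cex = .7)

mids    # the bar centers barplot used
```

Using the returned centers means the labels still line up if you later change space or the bar widths.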

 

I've now given you an example of a graph from Fangraphs and BBRef.  There are endless combinations of data you can now import into R using just those two sites and CSV downloads.  If you want to look at some examples of other R and baseball graphs, here is BTB on how to link PitchFX to R,  create a heat map, and perform cluster analysis.

If you want to see some other graphics R can create, you can browse through here and even read the source code for all the images.

The graphs I presented are just two examples of the different types of graphs you can make with R's base graphics.  There is a whole host of other packages one can work through.  Seriously, go to Packages>Install Packages in R to check it out.  If anyone has any questions about R and baseball, feel free to ask.  If anyone tries a graph, feel free to post it.

Maybe I'll even be able to do a follow-up post that I actually follow through with this time if I get enough positive response.
