Alright, I had a little free time this morning and, unfortunately for you guys, that means I decided to write about the A's. And since the debate du jour seems to be how to predict defensive performance, here's my take.
I first took all the UZR data from FanGraphs from 2005 to 2010 for fielders with at least 500 innings in the field. For 2010, that was 220 players, led by Brett Gardner with a 22.3 UZR and trailed by Matt Kemp at -24. For each player, I then matched the 2010 number with his corresponding scores in previous seasons dating back to 2005. My goal was to use the five years 2005-2009 to predict a player's 2010 UZR*.
*I did all of this in R of course. If anyone wishes to view my code, feel free to email me.
Of course, if a player didn't play the previous year or didn't play the same position the previous year, they were eliminated. Unsurprisingly, by the end, only 38 players remained. Now obviously that is not a big enough sample size to conclude much. So I relaxed the five-year requirement and created two batches: 2005-2008 to predict 2009 and 2006-2009 to predict 2010. The first batch had 53 players and the second 56.
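For anyone who wants to try the filtering themselves, here is a minimal sketch of how one batch might be built. It is not my actual script: the `seasons` data frame, its column names, and every number in it are invented for illustration, and it assumes the data starts in a long format with one row per player-season.

```r
# Hypothetical long-format data: one row per player-season (all values invented)
seasons <- data.frame(
  player   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  season   = c(2007, 2008, 2009, 2007, 2008, 2009, 2007, 2008, 2009),
  position = c("CF", "CF", "CF", "SS", "SS", "3B", "LF", "LF", "LF"),
  innings  = c(900, 1100, 1000, 800, 950, 700, 1200, 300, 1000),
  uzr      = c(5.1, 7.3, 2.0, -3.2, -1.0, 4.4, 0.5, 1.1, -2.5)
)

# Step 1: keep only seasons with at least 500 innings in the field
qualified <- seasons[seasons$innings >= 500, ]

# Step 2: spread to one row per player-position, one UZR column per season
wide <- reshape(
  qualified[, c("player", "position", "season", "uzr")],
  idvar = c("player", "position"),
  timevar = "season",
  direction = "wide"
)

# Step 3: drop anyone missing a season at that position
batch <- wide[complete.cases(wide), ]
print(batch)
```

In this toy example only player A survives: B changed positions in 2009 and C fell short of 500 innings in 2008, which is the same kind of attrition that cut the real sample down so quickly.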
From that data, I took each player's mean and median UZR score and used both to predict the next year's UZR score. Below are pictures:
So what do these tell us? The closer to the y=x line the points are, the more accurate the prediction is. Which do you think is better in each case?
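If you want to draw this kind of plot yourself, the sketch below shows one way to do it. The `uzr` data frame and its numbers are made up, and the RMSE at the end is my own addition here as one possible way to put a number on "distance from the y=x line"; it is not something the plots above were built on.

```r
# Toy data: one row per player, one column per season (values are invented)
uzr <- data.frame(
  player = c("A", "B", "C", "D"),
  `2005` = c(4.0, -3.0, 10.0,  1.0),
  `2006` = c(6.0, -1.0, 12.0, -2.0),
  `2007` = c(2.0, -8.0, -5.0,  0.0),
  `2008` = c(5.0, -2.0, 11.0,  3.0),
  `2009` = c(3.0, -4.0,  9.0,  1.5),
  check.names = FALSE
)

# Each player's mean and median UZR over the four predictor seasons
past <- c("2005", "2006", "2007", "2008")
uzr$means <- rowMeans(uzr[, past])
uzr$meds  <- apply(uzr[, past], 1, median)

# Predicted vs. actual, with the y = x line for reference
plot(uzr$means, uzr[["2009"]],
     xlab = "Mean UZR, 2005-2008", ylab = "2009 UZR")
abline(a = 0, b = 1, lty = 2)

# Root mean squared error: typical distance between prediction and result
rmse <- function(pred, actual) sqrt(mean((actual - pred)^2))
rmse_mean <- rmse(uzr$means, uzr[["2009"]])
rmse_med  <- rmse(uzr$meds,  uzr[["2009"]])
```

Player C is the interesting one in the toy data: a single bad season (-5.0) drags his mean down to 7.0 while his median stays at 10.5, which is exactly the situation where the two predictors disagree.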
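Lastly, I will give you the correlation coefficients. A correlation coefficient measures how strong the linear relationship is between the prediction and the actual result; square it and you get the percent of the variance that can be explained by our model: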
> cor(UZR2005_2009$"2009", UZR2005_2009$meds)
> cor(UZR2005_2009$"2009", UZR2005_2009$means)
From 2005-2009, using the mean UZR from 2005-2008 was more accurate than using the median score.
> cor(UZR2006_2010$"2010", UZR2006_2010$meds)
> cor(UZR2006_2010$"2010", UZR2006_2010$means)
From 2006-2010, using the median UZR from 2006-2009 was slightly better. Interestingly, the correlation from 2006-2009 to 2010 was much better than from 2005-2008 to 2009.
> cor(UZR2005_2010$"2010", UZR2005_2010$meds)
> cor(UZR2005_2010$"2010", UZR2005_2010$means)
For the full 2005-2010 set (the 38 players with every season), means won by a fair amount.
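To make the link between the correlation and "variance explained" concrete, here is a small sketch on invented numbers. The `toy` data frame is a stand-in shaped like the ones in the `cor()` calls above; none of its values come from the real UZR data.

```r
# Stand-in data frame: actual 2009 UZR plus two predictors (all values invented)
toy <- data.frame(
  `2009` = c(3.0, -4.0, 9.0, 1.5, -6.0, 7.5),
  means  = c(4.25, -3.5, 7.0, 0.5, -2.0, 5.0),
  meds   = c(4.5, -2.5, 10.5, 0.5, -1.0, 4.0),
  check.names = FALSE
)

# Correlation coefficients, computed the same way as in the post
r_mean <- cor(toy$"2009", toy$means)
r_med  <- cor(toy$"2009", toy$meds)

# Squaring r gives the share of variance explained by a one-predictor linear fit
r2_mean <- r_mean^2

# Cross-check against the R-squared that lm() reports
fit   <- lm(`2009` ~ means, data = toy)
r2_lm <- summary(fit)$r.squared

print(c(r = r_mean, r_squared = r2_mean, lm_r_squared = r2_lm))
```

The squared correlation and the `lm()` R-squared agree, so a correlation of 0.5, for example, means only 25 percent of the variance is explained.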
So, can we conclude anything? Probably not, although if you're deciding between using mean and median, I suppose you lean towards choosing the mean. It is unlikely that it will make much of a difference. But this is AN, where even our tiniest differences are overanalyzed.
With more time, I would add in a time series model that weighted more recent years more heavily and see how that compares to the mean and median models. I might also see if TZ or Dewan +/- or the Fans Scouting Report would help our model. But in 2 hours how much can you expect of me?
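As a starting point for that recency-weighted idea, here is a rough sketch. The 1-2-3-4 weights are pulled out of thin air purely for illustration, as are the UZR values; a real version would want to fit the weights from the data.

```r
# Toy predictor seasons for two players (values invented), oldest season first
past <- matrix(
  c( 4,  6,  2,  5,
    10, 12, -5, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A", "C"), c("2006", "2007", "2008", "2009"))
)

# Hypothetical weights: the most recent season counts four times the oldest
w <- c(1, 2, 3, 4)

# Weighted mean per player, alongside the plain mean for comparison
weighted <- apply(past, 1, weighted.mean, w = w)
plain    <- rowMeans(past)

print(rbind(plain, weighted))
```

Setting every weight to 1 gets you back the plain mean, so the mean model is just a special case of this one.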