Grab a snack and pull up a chair. This will be long.
I think I've been called, indirectly at least, a small sample size Nazi. I don't have the patience to discuss with anybody who insists on using small sample sizes as the basis for their argument. It's dangerous, it's wrong, and you should never, ever do it. Let's talk about why that is.
In baseball, the danger of using small sample sizes boils down to a phenomenon called regression to the mean. Most of you have at least heard of this concept, and practically everybody utilizes it intuitively in their everyday lives. Regression to the mean, simply put, is the tendency for any extreme observation to be less extreme on subsequent observations.
For example, let's look at the top of the batting average leaderboard for 2001 (min 500 PA), and see how they did in 2002:
[Table: Player, 2001 AVG, 2002 AVG]
Finishing near the top of the batting average leaderboard is an extreme observation. And, in aggregate these players performed worse in the following year. That is, our observation of their batting averages was less extreme.
You can do the same thing for the worst hitters and you'd find the exact same thing. You can use batting average, OPS, Runs Created - it really doesn't matter.
In this case, we had the extreme observation that each of these players had very high batting averages in 2001. And the next year, by and large, their batting averages decreased - in other words, the observation in 2002 was less extreme than the observation in 2001.
Why does this happen? The reason is that for any player, you have "true skill" and "actual performance." We too often think that "actual performance" is "true skill." That's incorrect. A player has a true skill, and he utilizes that true skill to accumulate actual performance data. But those data are only random samples from his true skill.
Any observation we make will have the tendency to regress to the mean, and the amount that it regresses depends on two things:
1. How much performance data we have. When you have almost no data, you regress all the way. So if the A's call up Matt Sulentic for one day and he goes 1-1, we observe his batting average to be 1.000. But we know that is not his true skill, and the fact that we have only one data point means that, if we want to estimate his true skill, we have to regress very heavily to the mean (like, all the way).
If he were to continue to accumulate plate appearances, we would regress smaller and smaller amounts. We need a lot of performance data before the actual performance matches the true skill. This is what people mean when they talk about "small sample size."
2. The spread in skills among the general MLB population. We regress to the mean more if the spread in skills among the general population is small, and we regress less if the spread in skills among the general population is large. If there were a skill for which there were zero variation among the general MLB population, we would regress all the way to the mean, no matter what the performance data indicated.
Sitting on a dock on a Bayes...
Let's back up a minute and talk philosophy.
Pick a hitter - say, Travis Buck. You watch Travis Buck play in 2007 and observe that he has a .377 OBP. What do we know about Travis Buck's ability to get on base? Here's a list:
1. Travis Buck had a .377 OBP in 334 plate appearances.
Now, statisticians will tell you that there's a margin of error there. I'll skip the derivations, but you can get the standard deviation using the equation:

SD = sqrt[ OBP * (1 - OBP) / PA ]
So, we observed that Travis Buck had a .377 OBP +/- .026. We think that Travis Buck's "true OBP" is somewhere between .351 and .403. If you're not familiar with standard deviation, then think of it as a "margin of error" of sorts. The true talent is 68% likely to fall between plus or minus one standard deviation.
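As a quick sanity check, here's a short sketch that computes that margin of error, treating each plate appearance as an independent binomial trial (the same assumption behind the equation above):

```python
import math

def obp_sd(obp, pa):
    """Standard deviation of an observed OBP over a given number of
    plate appearances, treating each PA as an independent binomial trial."""
    return math.sqrt(obp * (1 - obp) / pa)

# Travis Buck: .377 OBP in 334 PA
print(f"{obp_sd(0.377, 334):.4f}")  # → 0.0265, i.e. roughly the ±.026 above
```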
There's just one problem: our conclusion that Buck's true OBP skill is .377 +/- .026 is hopelessly and utterly wrong. Why? Because our list sucked.
We actually know two things about Travis Buck:
1. Travis Buck had a .377 OBP in 334 plate appearances.
2. Travis Buck is a major league baseball player, and the average major league baseball player has a .330 OBP.
The second point is absolutely critical. Buck isn't some random dude picked off the street; he is part of an overall population. We know two things about him: what we observed about him, and what we observed about people who are like him.
Taking the second item into account is the same as regression to the mean.
The league-average OBP is .330. Through some complicated statistics, we know that the variation in true skill (the standard deviation) across all major league hitters is 0.025. That is, 68% of all major leaguers have a true OBP skill between .305 and .355.
Now we have two measurements corresponding to our two observations regarding Travis Buck:
1. We observed his OBP to be .377 +/- .026
2. Travis Buck is a major leaguer, and major leaguers' collective OBP is .330 +/- .025.
We must take into account both of these measurements. Must! We combine these measurements in such a way that the measurement with the least uncertainty is given more consideration, and the measurement with the most uncertainty is given less consideration.
Mathematically, we do this by weighting by the inverse of the square of the standard deviation.
The equation for combining two measurements with two different standard deviations is:
true skill = (m1/s1^2 + m2/s2^2)/(1/s1^2 + 1/s2^2)
where m1 is measurement 1, m2 is measurement 2, s1 is the standard deviation in measurement 1, and s2 is the standard deviation in measurement 2.
This equation is how we regress to the mean, and we're about to do it for Travis Buck.
true OBP skill = [.377/(.026^2) + .330/(.025^2)]/(1/.026^2 + 1/.025^2) = .353
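That calculation is easy to script. Here's a minimal sketch of the inverse-variance weighting, reproducing the .353 figure with the numbers from the text:

```python
def combine(m1, s1, m2, s2):
    """Inverse-variance weighted average of two measurements: the
    measurement with the smaller uncertainty gets the larger weight."""
    w1, w2 = 1 / s1**2, 1 / s2**2
    return (m1 * w1 + m2 * w2) / (w1 + w2)

# Travis Buck: observed .377 ± .026, league population .330 ± .025
print(round(combine(0.377, 0.026, 0.330, 0.025), 3))  # → 0.353
```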
Notice that our estimate of his true OBP skill is basically halfway between our first measurement (.377) and our second measurement (.330). This is because the uncertainty in our first measurement is almost equal to the uncertainty in the second measurement.
One can play lots of games with regression to the mean, and I won't go into them here. But I do want to point out that our second measurement, that Buck is a major leaguer, is somewhat arbitrary. We could just as easily have chosen any population to which Buck belongs: all American men, major leaguers with really nice butts, people who are 6'2"/200 lbs, etc. Choosing the right population - statisticians call this "choosing the correct prior" - is something that forecasters have spent many hours pondering. But in the absence of any other information, a good population to use is "all major leaguers." Forecasting performance based on regressing to the major league mean is actually remarkably accurate. Projecting a player's performance by simple regression to the league mean is almost as good as systems like PECOTA, Chone, or ZiPS.
Also, we must regress any measurement we make, including, for example, splits. Let's say you look at Eric Chavez's awful splits against portsiders. You must - must! - regress those splits to the mean performance of all lefty-on-lefty matchups. The same thing goes for home/road splits, splits by lineup order, splits by position, splits by month, etc. What you find is that the many splits that are touted as important ("he hits .312 after the All-Star Break!") are actually meaningless.
The take-home message
"Small sample size" depends on which skill is being discussed. Maybe later, I - or maybe in the comments, you - will discuss how aggressively we regress different skills. Until then, let's look at some examples.
As Buck accumulates more and more plate appearances, our uncertainty in his actual performance will decrease. The more certain we are about his actual performance, the less we have to regress to the mean.
The following chart shows how our estimate of his true OBP skill changes with the number of plate appearances over which we observe him to have a .377 OBP.
If Buck had a .377 OBP one month and we wanted to estimate his true talent based only upon the 100 PA he got in that month, then regression to the mean would be very strong and we would estimate his true OBP skill to be .338. If Buck accumulated 5000 PA over eight years and had a .377 OBP, we would regress only a small amount and estimate his true OBP skill to be .373 (assuming no change in skill over eight years, which is a bad assumption). Here, we see that the "small sample size" argument is inextricably linked to regression to the mean.
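Putting the two pieces together, a short sketch can regenerate estimates like those, assuming the same .330/.025 league prior used earlier (small rounding differences from the figures above are expected):

```python
import math

LEAGUE_OBP, LEAGUE_SD = 0.330, 0.025  # spread in true OBP skill across MLB

def true_obp_estimate(obs_obp, pa):
    """Regress an observed OBP toward the league mean, weighting each
    measurement by the inverse of its variance."""
    obs_sd = math.sqrt(obs_obp * (1 - obs_obp) / pa)
    w_obs, w_lg = 1 / obs_sd**2, 1 / LEAGUE_SD**2
    return (obs_obp * w_obs + LEAGUE_OBP * w_lg) / (w_obs + w_lg)

# The more PA behind the same .377 OBP, the less we regress
for pa in (100, 334, 1000, 5000):
    print(pa, round(true_obp_estimate(0.377, pa), 3))
```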
Some skills have very wide variation among the MLB population and other skills have a very narrow variation. If a skill has a very wide variation, then the standard deviation will be very large. This results in regression to the mean being fairly weak. If a skill has a very narrow variation, then the standard deviation will be quite small and the regression will be very aggressive.
One skill that has a wide variation, for example, is hitting home runs. You will know a lot about a hitter's ability to hit home runs after only a few hundred plate appearances. So "small sample size" when discussing home run hitting ability is anything less than a few hundred plate appearances.
A skill that has a narrow distribution, for example, is batting average on balls in play (for pitchers). After nearly 3700 balls in play (about 5 years of full-time pitching) we still have to regress 50% of the way in order to estimate a pitcher's true BABIP skill. So if anybody mentions that a pitcher has had a consistently low BABIP, don't believe them unless they come armed with several years worth of data.
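An equivalent way to express how aggressively a skill regresses is to pad the observed sample with "phantom" league-average trials; the padding constant k is the sample size at which we regress exactly halfway. Taking the article's claim at face value, k for pitcher BABIP would be about 3700 balls in play (the .300 league BABIP here is an assumed illustrative value, not from the text):

```python
def regress(observed, n, league_mean, k):
    """Regress an observed rate toward the league mean by padding the
    sample with k 'phantom' league-average trials. k is the sample size
    at which we regress exactly halfway to the mean."""
    return (observed * n + league_mean * k) / (n + k)

# Per the text, ~3700 balls in play still gets regressed 50%, so k ≈ 3700.
# A .260 BABIP over 3700 BIP lands halfway between .260 and the mean:
print(round(regress(0.260, 3700, 0.300, 3700), 3))  # → 0.28
```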
If you remember nothing else, remember this: we must always regress to the mean when figuring true player skills, and the amount we regress is based on how much performance data we have and what the spread in skills is among the general MLB population.
If you'd like to learn more about regression to the mean as it applies to baseball, I highly recommend purchasing and reading The Book by Tom Tango, Mitchel Lichtman, and Andy Dolphin. You should also read their blog.
Thanks for reading.