salb918 article at Beyond the Boxscore
i don't know why he hasn't posted it here yet (out of modesty, maybe?), but our fellow ANer sal wrote a pretty interesting guest article over at beyond the boxscore. the more stat-oriented folks here at AN should definitely check it out. we're lucky to have such smart people as a's fans!
i've only read 1/3 of it so far, and there are tons of excellent points that i had never even thought about...
38 comments
Comments
i'm glad i posted this
i think my head's going to start hurting pretty soon...
by xbhaskarx on Jul 1, 2005 5:45 PM PDT reply actions 0 recs
Damn!
Recommending this though - this is the kind of thing we should all read. Not necessarily because we'd understand it, but this is the kind of thinking that goes on in our front office all the time.
by RickeySteals on Jul 1, 2005 6:04 PM PDT reply actions 0 recs
I know just enough statistics
by salb918 on Jul 1, 2005 8:42 PM PDT up reply actions 0 recs
I swear...
Not taking anything away from that post though. Very well written.
by chri5 on Jul 1, 2005 6:28 PM PDT reply actions 0 recs
Can you give me
by salb918 on Jul 1, 2005 10:40 PM PDT up reply actions 0 recs
needle in a haystack
by chri5 on Jul 1, 2005 10:47 PM PDT up reply actions 0 recs
cant find it
http://baseballprospectus.com/article.php?articleid=4004
http://baseballprospectus.com/article.php?articleid=3766
http://baseballprospectus.com/article.php?articleid=3779
I swear I read something where they looked at a better number than OPS, and used similar methodology to your project. hmmm...
by chri5 on Jul 1, 2005 11:03 PM PDT up reply actions 0 recs
I read the
by salb918 on Jul 1, 2005 11:10 PM PDT up reply actions 0 recs
good job sal!
One tweak you might consider: your model assesses the predictive power of different coefficients against a reference model of 9 clones batting. It seems likely to me that you would get something different if there were different random mixes of players, with different profiles, hitting around a player... e.g. 8 Ichiros with 1 power hitter would probably score more runs than the weighted average of 8 all-Ichiro teams and 1 all-slugger team.
Of course, I'm not the one who has to mess with Matlab...
by Apricot on Jul 1, 2005 6:47 PM PDT reply actions 0 recs
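The clones-versus-mixed-lineup question raised above can be poked at with a toy simulator. This is only a rough sketch and not Sal's MATLAB program: the base-running rules are deliberately crude (a hit moves every runner exactly as many bases as the batter, a walk only forces runners), and the `CONTACT` and `SLUGGER` profiles are made-up probabilities, not real players.

```python
import random

def advance(bases, n):
    """Move the batter and every runner n bases; return (new_bases, runs)."""
    runs = 0
    new = [False, False, False]
    for i, occupied in enumerate(bases):
        if occupied:
            if i + n >= 3:
                runs += 1            # runner crosses home
            else:
                new[i + n] = True
    if n >= 4:
        runs += 1                    # home run: the batter scores too
    else:
        new[n - 1] = True            # batter stops at base n
    return new, runs

def walk(bases):
    """Batter takes first; runners move only when forced."""
    first, second, third = bases
    runs = 1 if (first and second and third) else 0
    return [True, second or first, third or (first and second)], runs

def simulate_game(lineup, rng, innings=9):
    """lineup: list of (p_walk, p_single, p_double, p_triple, p_hr) tuples.
    Whatever probability is left over is an out."""
    for probs in lineup:
        if sum(probs) >= 1.0:
            # A batter who never makes an out would keep the inning going forever.
            raise ValueError("each batter needs a nonzero chance of making an out")
    runs, spot = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            p_bb, p_1b, p_2b, p_3b, p_hr = lineup[spot % len(lineup)]
            spot += 1
            r = rng.random()
            if r < p_bb:
                bases, scored = walk(bases)
            elif r < p_bb + p_1b:
                bases, scored = advance(bases, 1)
            elif r < p_bb + p_1b + p_2b:
                bases, scored = advance(bases, 2)
            elif r < p_bb + p_1b + p_2b + p_3b:
                bases, scored = advance(bases, 3)
            elif r < p_bb + p_1b + p_2b + p_3b + p_hr:
                bases, scored = advance(bases, 4)
            else:
                outs, scored = outs + 1, 0
            runs += scored
    return runs

def mean_runs(lineup, games=2000, seed=0):
    """Average runs per game over many simulated games."""
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games

# Hypothetical profiles (invented numbers, for illustration only).
CONTACT = (0.05, 0.27, 0.04, 0.01, 0.01)
SLUGGER = (0.12, 0.12, 0.05, 0.00, 0.07)

if __name__ == "__main__":
    clones_contact = mean_runs([CONTACT] * 9)
    clones_slugger = mean_runs([SLUGGER] * 9)
    mixed = mean_runs([CONTACT] * 8 + [SLUGGER])
    weighted = (8 * clones_contact + clones_slugger) / 9
    print("8 contact + 1 slugger:", mixed)
    print("weighted average of clone teams:", weighted)
```

Comparing the two printed numbers is the experiment described above; how big the gap is depends on the invented profiles and on how crude the base-running model is, so treat any result as suggestive only.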
interesting
- There's no good mathematical reason to demand that the two variables you use in the regression be minimally correlated. For example, when you run your regression using AVG and ISO, you could get an identical answer using AVG and SLG instead. If it's not a lot of work, I'd like to see what you get using OBP and SLG on the same data. I suspect it would be as good a fit as the one using OBP and ISO, or slightly better. (Geek talk: the reason to avoid variables that are too highly correlated with each other is numerical stability. If two variables are close to being linearly dependent, you end up trying to invert a matrix that is close to being singular, which can end up giving the wrong answer. But I don't think that will happen here, and trying to avoid "double counting" is really an aesthetic issue, not a mathematical one.)
- If you're trying to judge a player's true value, it's probably better to see how a lineup of that player + 8 average players (or that player + a typical lineup of 8 other players) would perform. That's not really an issue for this study, though, and probably only makes a big difference in extreme cases. For example, as you note, a lineup full of players who got on base with a walk or a single every time would score an infinite number of runs, but if you add such a player to a more typical lineup, his actual value would depend mostly on the ability of others on the team to drive him in, though it would still be very high.
- Again, it probably doesn't make much difference in this case, but as we discussed before, if you're looking for smaller effects (on the order of a few runs over the course of a season) 1500 games still turns out to be way too small a sample size. matlab=slow.
- Miggy had 150 RBI! You need to add a clutchness variable. ;-)
by andeux on Jul 1, 2005 7:13 PM PDT reply actions 0 recs
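A short sketch of why the AVG/ISO and AVG/SLG regressions in the first point give the same answer, assuming ISO = SLG - AVG as it is defined elsewhere in this thread. The coefficient names a, b, c and the predicted-runs symbol are just labels picked for illustration:

```latex
\begin{align*}
\hat{R} &= a\,\mathrm{AVG} + b\,\mathrm{ISO} + c \\
        &= a\,\mathrm{AVG} + b\,(\mathrm{SLG} - \mathrm{AVG}) + c \\
        &= (a - b)\,\mathrm{AVG} + b\,\mathrm{SLG} + c
\end{align*}
```

Any fit written in one pair of variables can be rewritten exactly in the other, so the best least-squares fit makes the same predictions and has the same r^2 either way; only the coefficients change. The same trick does not carry over exactly to OBP/ISO versus OBP/SLG, since ISO subtracts AVG rather than OBP, so those two fits should be close but need not be identical.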
Good thoughts
- You're right. I did do the regression with OBP and SLG and you get a pretty good r^2 value, around what you get for OBP and ISO. I need to look at the results more closely, but I still think that avoiding the double counting is a good reason not to use SLG.
- You're right again. I have tried this and it is difficult to implement. The result, a sort of MLVr, is hard to interpret because, honestly, the values for many players are bunched together and differ only at the third decimal place. I would like to refine this and get an MLVr out of it.
- You're right again, again. MATLAB is slow, and is a big reason I chose only 1500 games. ANer Genaro tried converting my program to a faster language. We are still working on this front.
- Don't even get me started.
by salb918 on Jul 1, 2005 8:40 PM PDT up reply actions 0 recs
keep it up
by anahola fan on Jul 1, 2005 10:46 PM PDT up reply actions 0 recs
::Rummaging through back pocket::
That's all I got.
by salb918 on Jul 1, 2005 11:03 PM PDT up reply actions 0 recs
too bad...
by anahola fan on Jul 2, 2005 12:48 AM PDT up reply actions 0 recs
Stats
by LowcountryJoe on Jul 2, 2005 7:09 AM PDT reply actions 0 recs
OPS makes no sense
I like the attempts by people like Sal to come up with a reasonable way to measure hitting both in consistency and power. Most current attempts, like this one, involve setting up target data like actual runs (or in this case a run simulation) to predict using this magic new statistic.
It seems weird and unlikely to me that this new statistic would be a linear combination of OBP and SLG, but this is a good first shot at it. I'd imagine it's more complicated and is coupled with the performance profile of the rest of the lineup in some nonlinear way.
Hey Sal, have you thought about crazier curve fits, like quadratic regressions, etc etc?
by Apricot on Jul 2, 2005 9:43 AM PDT up reply actions 0 recs
Thought about crazier fits, yes
It seems to me that while baseball has a lot of sample space, most of the interesting phenomena occur over a small range of that space and a first-order Taylor expansion (i.e., linear fits) is sufficient. The effort expended in getting an incremental increase in information is, well, let's just say there are diminishing returns.
Using a linear combination is just the easiest way to go.
Like I said in my article, using the run simulator as my target data is wide open to criticism, since I have no real defense of that. But I do think it is better than say, Runs Created. RC uses what I feel like is an arbitrary set of weights and factors. EqA and VORP are kind of inaccessible. It would be nice if there were a simple stat, based on easily available information, that could be manipulated and understood by the casual fan. That's what I was after, I don't know if I succeeded, but I think I took baby steps in the right direction.
"I'd imagine it's more complicated and is coupled with the performance profile of the rest of the lineup in some nonlinear way." --- I would like to look into this further. If only I had more time...
by salb918 on Jul 2, 2005 11:49 AM PDT up reply actions 0 recs
linear in one variable probably enough
(off the top of my head).
And after all, The James Pythagorean Thm is an approximately quadratic relationship between Runs Scored, Runs Allowed and Winning Pct.
by Apricot on Jul 2, 2005 4:17 PM PDT up reply actions 0 recs
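For anyone who hasn't seen it, the James Pythagorean formula mentioned above estimates winning percentage as RS^2 / (RS^2 + RA^2). A minimal sketch follows; the function name and the `exponent` parameter are my own choices, with 2 being the exponent in James's original version:

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

if __name__ == "__main__":
    # A made-up team that scores 800 runs and allows 700.
    pct = pythagorean_win_pct(800, 700)
    print(round(pct, 3), "expected wins over 162 games:", round(162 * pct))
```

A team that scores exactly as many runs as it allows comes out at .500, and the estimate moves nonlinearly as the run differential grows, which is the point being made about non-linear relationships.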
Could be.
by salb918 on Jul 2, 2005 4:27 PM PDT up reply actions 0 recs
Ow! My head hurts....
I'm glad to have you on my team.
by Masaryk on Jul 2, 2005 9:54 AM PDT reply actions 0 recs
OPS
by skwid on Jul 2, 2005 10:23 AM PDT reply actions 0 recs
someone who isn't in calculus
good. [feels less stupid]
by Jjjsixsix on Jul 2, 2005 11:26 AM PDT reply actions 0 recs
the basic idea is simple
So in theory BA is helpful if it helps you decide the probability that a hitter is going to hit safely.
OBP and SLG are helpful in that they help predict how often the batter will be safe, and how many extra base hits the hitter gets. But they don't quite predict how many runs a batter contributes, which is after all the whole point.
Some people use OPS = OBP + SLG to account for both skills. It doesn't predict that well either, mainly because, when used in models, it over-counts OBP.
So Sal has decided to see if he can tweak the formula to predict better by multiplying the SLG by some magic number "S" (for Sal) to help count SLG like it should be. How to do this? He said, I'm going to use a computer program to run simulated games where 9 clones of a batter hit for a game. Then I'm going to see which number S helps me predict the number of simulated runs best. He came up with a number.
At that point Billy Beane said, Sal why don't you come and program on our Linux cluster which is located right behind Our Black Muslim Bakery in the Coliseum. But Sal refused because he knew he would be distracted with the smell of the fish sandwiches.
The end.
by Apricot on Jul 2, 2005 4:24 PM PDT up reply actions 0 recs
a couple corrections
First "it over-counts OBP" should be "under-counts"...
Second, I think I simplified Sal's work too much. The key idea (I think) is that instead of adding multiples of OBP and SLG, let's find two measures of hitting that are less related, since if you have high SLG, you tend to have high OBP. So he compared how related a bunch of different measures were. He decided that AVG and ISOlated power (SLG - AVG) and OBP and ISO were least related. Then he fussed to find the right numbers that linearly related them.
by Apricot on Jul 2, 2005 4:44 PM PDT up reply actions 0 recs
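To make the stats in this sub-thread concrete, here is a small sketch that computes AVG, OBP, SLG and ISO from a counting-stat line and then forms a weighted OPS of the shape OBP + S * SLG described above. The function names are mine, the stat line is invented, and no particular value of S is implied; S = 1 just gives back plain OPS.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp=0, sf=0):
    """Return (AVG, OBP, SLG, ISO) from basic counting stats."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    iso = slg - avg              # isolated power: extra bases per at-bat
    return avg, obp, slg, iso

def weighted_ops(obp, slg, s=1.0):
    """OBP + s * SLG; s is the hypothetical tuning weight 'S'."""
    return obp + s * slg

if __name__ == "__main__":
    # Invented season line, for illustration only.
    avg, obp, slg, iso = rate_stats(ab=500, h=150, doubles=30, triples=5,
                                    hr=20, bb=60, hbp=5, sf=5)
    print("AVG %.3f  OBP %.3f  SLG %.3f  ISO %.3f" % (avg, obp, slg, iso))
    print("OPS %.3f" % weighted_ops(obp, slg))
```

Because ISO is SLG with the singles-driven AVG part taken out, it overlaps less with AVG and OBP than SLG does, which is the "less related" idea in the correction above.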
Hey!
by salb918 on Jul 2, 2005 5:14 PM PDT up reply actions 0 recs
you look like that kind of guy
by Apricot on Jul 2, 2005 5:16 PM PDT up reply actions 0 recs
more on OPS (and similar stats)
http://remarque.org/~grabiner/baseball.html
This one, by David Grabiner, gives a good mathematical justification for why taking a linear combination of OBP and SLG really is a sensible thing to do.
http://baseballprospectus.com/article.php?articleid=2596
This one, from the BPro Basics series compares the correlation of a number of common stats to team run scoring (similar to what Sal is doing, but using historical data rather than simulations). OPS does much better than AVG, OBP, or SLG by themselves. Stats like EqA and RC/27 are only a little better than OPS, at the expense of much greater complexity. So for most purposes, OPS strikes the right balance between simplicity and accuracy.
by andeux on Jul 2, 2005 12:16 PM PDT reply actions 0 recs
eh
But if OPS misses the 'real' statistic by a term of .2 * OBP, that means that comparisons of straight OPS can typically be off by .080 for each batter in 'real' weight... it just makes OPS fuzzier to make comparisons with, I think.
by Apricot on Jul 2, 2005 4:53 PM PDT up reply actions 0 recs
a couple more articles
http://www.aarongleeman.com/2003_11_23_baseballblog_archive.html#106974007971391611
Tangotiger on optimal weights:
http://www.tangotiger.net/ops.html
http://www.tangotiger.net/ops2.html
His articles on linear weights and "How are runs really created" are also relevant.
by andeux on Jul 2, 2005 5:12 PM PDT reply actions 0 recs
Thanks for all the articles
Thanks again.
by salb918 on Jul 2, 2005 5:16 PM PDT up reply actions 0 recs
From what I see
by Marc Normandin on Jul 3, 2005 9:16 AM PDT up reply actions 0 recs
Way over my head
Not that there's anything wrong with that but I like to watch the game when I pay to watch it :)
BTW, there is an article on Kielty being some kind of math whiz over at the official site. They mentioned him doing calculus etc. but then he said something about not using math much anymore for playing baseball.... this reminded me of that article a little.
by streetfan on Jul 2, 2005 11:26 PM PDT reply actions 0 recs
Bobby Kielty
by Larry E on Jul 3, 2005 9:50 AM PDT reply actions 0 recs
