We live in an incredible age for baseball statistics. Even as recently as a few years ago, if you wanted any statistics more advanced than ERA, batting average, and RBIs, you had to calculate them yourself. Defensive metrics and exact plate discipline data were a pipe dream. Today, you can log onto Fangraphs or Baseball-Reference and get up-to-the-second data on everything from a team's win probability in a game currently in progress to the precise location of Bartolo Colon's last strike. In a word: we are spoiled. Because statistics are so easy to access in 2013, it's easy to forget that there are men and women working tirelessly behind the scenes, in front of TV screens, to make those statistics available to the masses.
The most prominent company providing these statistics is Baseball Info Solutions (BIS). I have always been very curious as to how exactly those statistics get from the baseball diamond to our computers, and Dan Lependorf, a research and development intern at BIS, kindly agreed to share a bit of that information with me and all the readers at AN.
Longtime readers of AN might know Dan Lependorf better as danmerqury. Dan was a front-page columnist here for 2 years, and a columnist at the excellent baseball site The Hardball Times for a year and a half. After getting a BS in chemical physics at UCSD and working in the field for a year and a half, he saw a job posting in the R&D department at BIS, and the rest is history. The following is the transcript of my interview with Dan:
Sam: Thanks so much for talking to me today, Dan... as somebody employed in the baseball stats industry, you've really got the insider scoop. Can you give us a general overview of the different services that a company like Baseball Info Solutions provides?
Dan: Baseball Info Solutions provides statistics to lots of different customers: teams, agents, things like that. It's for teams who want data collection that, a lot of the time, they can't do themselves. We have an army of video scouts who watch every single game, and they have for the last, I believe, ten years. They watch every single inning, and they catalogue... what they end up doing is, essentially, it's almost like a giant version of Gameday. Or, if you brought your scorecard to the game, it's like an interactive, extensive version of that. And they catalogue absolutely everything you can possibly think of. There's also an R&D side that works on delivering the data to the customers. We can customize stuff for customers if they want. If we collect it, we can package it for them however they want. It's really kind of an incredible company.
S: You mentioned customers... who exactly is using the services that BIS provides?
Dan: Major league teams, media outlets, sports agents... I'm not sure if I can give specifics.
S: Of course. I totally understand. Can you walk us through a typical day in the life of a stats intern?
Dan: Sure. I'm actually responsible for helping out the R&D department. It can be stuff like... if a customer calls us and says "Hey, it would be really nice if you could get us this kind of data", or "What do you think about this, is it something we could work up?", we'll try our best to actually get it done.
S: For the people who are charting the games: are they typically given one task, like, say charting the batted balls or charting balls and strikes, or does one person chart everything for a given game?
Dan: We usually go through, I believe, three passes for any given game, and those are three different people. One guy is responsible for scoring which is essentially a little bit more advanced version of a scorebook or Gameday. Then you have another guy who will do pitch charting, and a third guy who will essentially double check everything. We'll also do some other passes and other accuracy checks on top of that.
S: How much training do stats interns get, or are you kind of expected to know the ropes going in?
Dan: There's absolutely a lot of training, because we absolutely need to make sure that the data we collect is accurate. So of all the interns... I actually wasn't here, I started in the middle of the season, but from what I'm told, they start a little before the season starts and they get trained. They watch games from previous years, and then we compare them to the data that was actually recorded, and stuff like that.
S: So how are those new interns screened?
Dan: They look for quite a bit. They look for someone, obviously, with a lot of baseball knowledge, and a lot of passion for baseball. They also look for scoring experience, pitch identification experience, that kind of thing.
S: Is there a sort of exam or anything like that that they take?
Dan: There is, actually. When I started... I'm not technically a video scout, even though I do it part of the time... there was a test during my interview, before I moved out here [BIS is based in Allentown, PA]. They gave me a test: here's a clip of videos, watch this, catalogue what you see, identify these, how would you catalogue this. Stuff like that.
S: Oftentimes in baseball, things are "borderline". So how do you differentiate between, say, a fly ball and a line drive if it's close, or if a ball's right on the edge of the strike zone? Do you have specific guidelines, or is it up to your own discretion?
Dan: We do have specific guidelines for stuff like that. Again, I'm not entirely sure what I can say... but we do time every batted ball. So we have definitions based on that time, so we can give a description of the ball in flight and that kind of thing.
S: So it's objective rather than subjective.
Dan: Yeah, we try to be as objective as possible. I know it's not always possible to be 100% objective, but we try our best.
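To make the idea of time-based definitions concrete, here is a minimal sketch of how a timed batted ball might be sorted into a category. Everything specific in it is an assumption for illustration: the 1.5-second cutoff, the `bounced_in_infield` flag, and the function name are hypothetical, since Dan did not share the actual BIS definitions.

```python
# Hypothetical cutoff in seconds; the real BIS definitions were not disclosed.
LINE_DRIVE_MAX_HANG_S = 1.5


def classify_batted_ball(hang_time_s: float, bounced_in_infield: bool) -> str:
    """Label a batted ball from its timed flight, using illustrative rules only."""
    if hang_time_s < 0:
        raise ValueError("hang time cannot be negative")
    # A ball that hits the dirt inside the infield is treated as a grounder
    # regardless of how long it was in the air.
    if bounced_in_infield:
        return "ground ball"
    # Short time in the air suggests a low, hard trajectory.
    if hang_time_s < LINE_DRIVE_MAX_HANG_S:
        return "line drive"
    return "fly ball"


print(classify_batted_ball(1.0, False))
print(classify_batted_ball(3.2, False))
```

The point of a rule like this is that two scouts timing the same ball should land on the same label, which is what makes the classification closer to objective than a judgment call by eye.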
S: When you're charting pitches, it's often difficult to tell the difference between, say, a 2-seam fastball, a cutter, and a sinker. Do you match pitch classifications based on a pitcher's past performance, or do you just go by what you see?
Dan: Oh, absolutely. If you have a pitcher and you know he throws a 4-seam, a 2-seam, a curveball and a slider, that is incredibly valuable information to know. We absolutely... well, we cheat and we get a little bit of help wherever we can so that we're not out in a vacuum or that kind of thing.
S: That's really good to know. So for plate discipline stats like Zone percentage, how do you know where exactly the strike zone is, and do you ever double-check yourself with pitch f/x?
Dan: I'm not entirely sure I'm the right guy to ask that, to be honest, because I'm a part time guy [with regards to video scouting]. I actually haven't done much pitch charting.
S: Ah, ok. Do you know if there's a marked difference between pitch f/x and what the person saw, do they double-check? Is there any method for going through and seeing what's correct?
Dan: Well, we're always going through and trying to improve the accuracy of our data. There are certainly times where we're going through and we'll say "Wait a minute, that doesn't quite look right", or something is not quite lining up. Part of the R&D team's job is to do that kind of thing. We'll try to figure out if something isn't coming out the way it should be, why is that?
S: Personally, how do you account for the fact that plate discipline stats on, say, Fangraphs from BIS differ wildly from the same stats based on pitch f/x? Do you think that one is more accurate than the other?
Dan: Well... I do know that if you watch Gameday for a little bit, if you just park yourself down and watch it for say, 2 games, you'll notice that every once in a while it'll take a pitch off. It may be a computer glitch, or it may be a problem with the Gameday stringer, but every once in a while... normally on Gameday you'll get the flight path and all the spin and the pfx numbers, but every once in a while you'll just get a location and that's it. You'll just get ball or strike, and that's it... you won't get pitch type or anything like that. That's because the pitch f/x system just, for whatever reason, missed that pitch, and the stringer would have to do that manually. For stuff like that, that's a reason we may have slightly different numbers. As far as accuracy, between our numbers and pitch f/x, it's kind of hard to say. We don't know exactly what was right and what was wrong. All I can say really is that we try to do the best we can, and I know we do a very good job.
S: So in general, how much accountability is there for any given classification? If someone makes a mistake, what steps are there to make sure it gets caught?
Dan: We are always comparing to Gameday, to make sure we didn't make a big mistake. Like, say, you misclicked and somebody ended up at second base when they advanced to third, stuff like that. So we're always monitoring that kind of thing. We do have customers that want this data live, almost immediately after it happens, so we try to be quick and yet as accurate as possible.
S: What steps does BIS take to help eliminate human bias, and what do you think is the biggest source of bias in the data?
Dan: Certainly... we have about 20 video scouts, and of course they all have different fan-hoods and different biases, and that kind of thing. So if you were a Red Sox fan... you don't want to put the Red Sox fan on the Boston game all the time. It would be fun for him, but the quality of the data may not come out so good. We try to rotate as much as we can to try to eliminate as many of those systematic biases as possible.
S: Have you had the chance to chart a lot of A's games yourself this year?
Dan: I actually have not! I'm kind of bummed about that... if it was purely random, I guess I'd get an A's game one out of every fifteen games. So let's see... I think I've done close to 10 so far, but I haven't hit it yet.
S: Soon enough, hopefully.
Dan: [Laughs] Soon enough... I did have to do a Giants game, which I was not so pleased about.
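Dan's one-in-fifteen estimate makes it easy to check how unlucky he has actually been. The sketch below assumes, as he does, that assignments are purely random and independent, which the rotation he described earlier may not strictly satisfy; the function name is made up for illustration.

```python
def prob_no_target_game(p_per_game: float, games: int) -> float:
    """Chance of never drawing one team's game in `games` independent assignments."""
    if not 0.0 <= p_per_game <= 1.0:
        raise ValueError("p_per_game must be a probability")
    if games < 0:
        raise ValueError("games cannot be negative")
    # Each assignment misses the team with probability (1 - p), so missing
    # every time multiplies that factor once per game.
    return (1.0 - p_per_game) ** games


# Dan's numbers: a 1-in-15 chance per assignment, about 10 games charted.
p = prob_no_target_game(1 / 15, 10)
print(f"Chance of zero A's games in 10 assignments: {p:.1%}")
```

Under those assumptions the result is roughly a coin flip, so going ten games without an A's assignment is not much of a drought.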
S: [Laughing] Are there any BIS stats that you feel are underused by the sabermetrics community, and are there any you'd like to see used more?
Dan: Well certainly, I think DRS is one that gets ignored quite a bit in the sabermetric community. It seems like it's been cast aside in favor of UZR. I personally feel that DRS is a very strong defensive stat... I know that defensive stats as a whole get kind of a bad rap, because (and it's true) they aren't as robust as offensive stats. But we have UZR, we have DRS. I think the best thing to do would be to take both... I don't know if you'd average them, but at least take the two into consideration. I feel like a lot of people ignore DRS in favor of UZR, which may not be the wisest choice.
S: Can you explain the difference between DRS and UZR?
Dan: Yeah, absolutely. They actually both come from our batted ball locations... both UZR and DRS come from the BIS hit locations, but the data workup is just a little different. We also add in factors for good plays and missed plays, which our video stringers catalogue manually. There's certainly a difference in the calculation methods, but at their core, they're basically the same plus/minus defensive measurement based on batted ball locations.
S: What's the weirdest stat that you've come across? Maybe one you didn't know about before you started at BIS?
Dan: [Laughs] I actually have a very strong answer for that!
S: [Laughing] Ok...
Dan: The Houdini!
S: The... what?
Dan: The Houdini!
S: [Laughs] What is that?!
Dan: I think it was actually Bill James himself who had the idea to start tracking what we call "Houdinis", which are situations where a pitcher finds himself in a bases-loaded, no-out situation and successfully escapes from it. I was actually going through those numbers the other day... It turns out the A's lead the league in Houdinis by quite a bit. I'm sorry... not Houdinis... reverse-Houdinis! I guess the term would be "getting Houdini'd"?
Dan: They've loaded the bases six times and failed to score this year alone.
S: That somehow does not surprise me.
Dan: [Laughs] Well, the record... and I believe we've been tracking this since 2002... the all-time record for a season is 8. The A's are already at 6... and it's August 7th.
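For anyone who wants to count Houdinis from their own play-by-play notes, here is a minimal sketch of the check. It reads "escapes" as "no further runs score in the half-inning after the bases are loaded with nobody out", which is my interpretation of Dan's description rather than a confirmed BIS rule; it also ignores pitching changes, and the `GameState` record is a hypothetical stand-in for real play-by-play data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameState:
    outs: int           # outs recorded before the next plate appearance (0-2)
    bases_loaded: bool  # True when runners are on first, second, and third
    runs_so_far: int    # runs scored in this half-inning up to this point


def is_houdini(states: List[GameState], final_runs: int) -> bool:
    """True if some bases-loaded, no-out spot was followed by no more runs."""
    return any(
        s.outs == 0 and s.bases_loaded and final_runs == s.runs_so_far
        for s in states
    )


# A half-inning where the bases fill up with nobody out and no one scores.
escape = [
    GameState(outs=0, bases_loaded=False, runs_so_far=0),
    GameState(outs=0, bases_loaded=True, runs_so_far=0),
    GameState(outs=1, bases_loaded=True, runs_so_far=0),
    GameState(outs=2, bases_loaded=True, runs_so_far=0),
]
print(is_houdini(escape, final_runs=0))
```

From the hitting side, the same check applied to a team's own half-innings would count the times it "got Houdini'd".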
S: Wow. Ok. That's really... awesome? Just one last question: do you have any advice for aspiring sabermetricians who want to get involved in the field? I know you've been involved in various capacities in the past.
Dan: Absolutely! This may be a three-part answer.
First, you need to know on a conceptual level why stats are calculated the way they are. If I'm looking for someone in the field, I don't need someone who specifically knows the calculus or the mathematics involved. But if you can explain, say, why the wOBA weights are the way they are on a conceptual level, that says a lot. It says that you understand it, even if you can't do the math yourself.
Second, LEARN SQL! SQL is a database language, and it's something I didn't know very well coming into this job and I really wish I had. It's incredibly useful, and at this point, it's become practically a requirement if you want to be in this field. I was actually at the SABR convention last weekend, and Sean Lahman had a pretty fantastic presentation on big data and how everything was trending toward big data analytics. He had an example of Target, which had figured out that a girl was pregnant before her father did, just based on what she was buying. It's all about the big data analytics, and knowing SQL is an incredible leg up.
As far as the third part: just a passion for baseball. I would say that in this field more than a lot of fields, having that passion and being able to show that you love what you're doing goes a long way.
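For readers wondering what Dan's SQL advice looks like in practice, here is a small sketch that computes Zone percentage (the share of a pitcher's pitches that land in the strike zone) with a single query. It uses Python's built-in `sqlite3` module so it runs anywhere; the `pitches` table, its columns, and the pitcher names are invented for illustration and are not the BIS schema.

```python
import sqlite3
from typing import List, Tuple


def zone_percentages(pitch_rows: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Return (pitcher, Zone%) pairs from rows of (pitcher, in_zone flag)."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Hypothetical table: one row per pitch, in_zone is 1 or 0.
        conn.execute("CREATE TABLE pitches (pitcher TEXT, in_zone INTEGER)")
        conn.executemany("INSERT INTO pitches VALUES (?, ?)", pitch_rows)
        query = """
            SELECT pitcher,
                   ROUND(100.0 * SUM(in_zone) / COUNT(*), 1) AS zone_pct
            FROM pitches
            GROUP BY pitcher
            ORDER BY pitcher
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up sample data: Pitcher A throws 3 of 4 in the zone, Pitcher B 1 of 2.
sample = [
    ("Pitcher A", 1), ("Pitcher A", 1), ("Pitcher A", 0), ("Pitcher A", 1),
    ("Pitcher B", 0), ("Pitcher B", 1),
]
for pitcher, pct in zone_percentages(sample):
    print(pitcher, pct)
```

The `GROUP BY` clause is the part worth learning first: it collapses every pitch into one summary row per pitcher, which is the shape most baseball stat questions take.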
S: Thank you so much, Dan. This was really informative.
Dan: Of course. Oh! My boss did want me to say that we are hiring! If anyone wants to become a video scout, we would absolutely love to look at your resume, so please send them in!
S: I'll let everyone know! [Hey everyone, this is me letting you know! Go forth and be a video scout!] Thanks so much for your time, Dan, I really appreciate it.
Dan: No problem!