Note: I gave a presentation at the 2013 SABR convention in Philadelphia called “Baseball in the Age of Big Data.” Many people have asked whether audio or video was available — and the answer is not yet. But for now, I’m posting my slides and notes for those who are interested. You can also download the slides as a PDF.
Baseball in the Age of Big Data
Why the revolution will be televised
What is big data? Everybody’s heard the buzzword, but what does it mean? Today I’m going to talk about what we mean when we say “big data,” how it’s transforming the field of information technology, and how I think it’s going to impact baseball research in the next few years.
“Big data” does not simply mean a lot of data. It really means collecting all available data… every scrap and morsel of information that exists. When we talk about big data, we’re talking about a quantity of information so high that it might as well be infinity.
It can be a tough concept to wrap your head around. A recent example would be the NSA’s Prism project, where they collected information on every call and every text from every cell phone in the United States.
But the retail world is where big data has really made the most visible impact. The “Big Box” stores understand consumer behavior in ways that were never before possible, by collecting data on every transaction, on every item in every store.
You may recall this article, about how Target mined its own customer data to figure out which customers were pregnant. They asked their analysts to figure out what patterns would help them identify customers who were expecting — maybe they bought pregnancy tests, or prenatal vitamins, or maternity clothes.
There’s big money in selling cribs and car seats, to say nothing of diapers and baby formula. Big data makes it possible to do this sort of analysis.
Another good example is Netflix, which uses actual customer viewing data to understand what their customers do.
They have 25 million users for their streaming service. They deliver 30 million video views a day. And they don’t just keep track of what you watch. They know when you pause or rewind. They know when you give up on a movie after five minutes, and they know when you watch 12 episodes of “The Office” in one sitting.
Big data gives them an incredible competitive advantage compared to broadcast networks, which rely on Nielsen ratings. The Nielsen ratings rely on surveys of very small samples. Participants keep a diary of what they watch, and who knows whether they’re telling the truth, or whether they accurately represent everyone else’s viewing habits.
It’s fascinating, for example, for Netflix to observe the difference between what people say they want to watch and what they actually watch. Folks put “Citizen Kane” and “Casablanca” in their instant queue, but they watch “Breaking Bad” and the “Hangover” movies.
Big data has revolutionized the business world. Retailers are not simply guessing or sampling what’s happening. They are working with actual data in real time, which can be sliced in an infinite number of ways.
Traditionally when we have talked about working with data, we think of very well defined information, rows and columns of numbers in spreadsheets, or structured tables in a relational database.
That model is becoming outdated, because with big data we’re often talking about collections of data that aren’t well structured, but are more like amorphous blobs. Rather than investing the work into organizing the data on the front end (what we call normalizing the data), we are increasingly moving towards systems that use artificial intelligence to extract answers.
This is Watson, a supercomputer that IBM built to play a television game show. And even though it played with and beat human opponents at Jeopardy!, it wasn’t a novelty. It was a dramatic proof of concept.
Watson used a combination of machine learning and natural language processing, combined with vast stores of information — 4 terabytes of data, including all of Wikipedia. IBM researchers were able to build a computer that could answer questions and not just return an answer, but calculate the likelihood that the answer it came up with was right. If the confidence level was too low, it wouldn’t buzz in. And the best part was that when it got an answer wrong, it learned, so it could improve the next time.
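To make the “don’t buzz in unless you’re confident” idea concrete, here is a minimal sketch. It is not IBM’s implementation: the `Candidate` class, the `decide_to_buzz` function, and the 0.5 threshold are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A possible answer plus the system's estimate that it is right."""
    answer: str
    confidence: float  # hypothetical score between 0.0 and 1.0


def decide_to_buzz(candidates, threshold=0.5):
    """Return the top answer if its confidence clears the threshold.

    Returns None (stay silent) when there are no candidates or when the
    best candidate is not confident enough. The threshold value is an
    assumption chosen for this example.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.confidence)
    return best.answer if best.confidence >= threshold else None


# A confident guess is returned; a shaky one is withheld.
confident = [Candidate("Who is Bill James?", 0.92), Candidate("Who is Pete Palmer?", 0.05)]
shaky = [Candidate("What is Toronto?", 0.14), Candidate("What is Chicago?", 0.11)]
print(decide_to_buzz(confident))
print(decide_to_buzz(shaky))
```

The point of the design is that the system reports how sure it is, not just what it thinks, so a wrong answer can be avoided by staying silent.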
We are teaching computers to think, not just to process a list of canned commands in order but to analyze information in an abstract way. And that’s what’s behind the push for vast, limitless collections of data to feed them.
So that’s a quick overview of big data… Nearly infinite amounts of data, with a focus on collecting every possible detail that can be recorded, and using powerful computers to analyze and deliver answers.
I want to show you a few data points that will help illustrate the scope of what I’m talking about.
This is 1951, the year Turkin and Thompson published the first Barnes baseball encyclopedia. We had roughly 1,800 data points per season.
In 1969, when the first edition of the Macmillan baseball encyclopedia (the “Big Mac”) came out, we had about 12,000 data points per season.
And 1980, when Bill James was in the midst of publishing his first Baseball Abstracts. Those books were based largely on his own research with box scores – giving him roughly 200,000 data points per season.
If you go back and look at those early Baseball Abstracts, much of Bill’s early analysis was not even really analysis; it was a call for improved data gathering. James introduced things like pitcher run support, umpiring statistics, and stolen base stats for catchers. He would pore over box scores and compile data that wasn’t being compiled by anyone else. None of these things involved the creation of new formulas. He was simply counting things that weren’t being counted, building data sets and pulling out interesting bits.
As far as I’m concerned, that was the real genius of Bill James. He clearly understood that to make advances in our understanding of the game, we needed to make a quantum leap in terms of the amount of information we had available.
Bill helped launch Project Scoresheet, as many of you know, which started collecting and sharing play-by-play data. Shortly thereafter, pitch-by-pitch data started to become available. Increasingly larger data sets.
And what happened?
Once we had the play-by-play data, a whole new world opened up when we started looking at player splits and situational stats: Lefty vs. righty matchups and batting with runners in scoring position. Those differences began to become apparent because new data sets were available.
The availability of pitch-by-pitch data opened our eyes to pitch counts, and the fact that maybe it’s not a great idea for your 21-year-old phenom to throw 135 pitches every fifth day.
And here’s where we are with Pitch f/x data, which Major League Baseball has collected for every game since 2007.
If you’re not familiar, the Pitch f/x system measures the speed and trajectory of every pitch thrown. MLB makes this data openly available for researchers.
This was made possible because of technological advances in our ability to gather, store, and share this volume of data.
We have created more data about games played in the last five years than in the 140-odd years before that combined.
Every time there has been a surge in the amount of data available, there has been a corresponding surge in the quality of analysis and thus our understanding of the game.
And I would argue that we are in a golden era of baseball analysis.
But we are just beginning to scratch the surface. Technology is advancing so fast. In 3 or 4 years, we’ll look back at the Pitch f/x data and scoff at how primitive it was.
Here’s why: video.
Video has been a boon for fans. Access through MLB.TV or MLB At Bat has given more people more access to more games than ever before. That’s a great thing for fans, but an opportunity that we as a research community have not really begun to exploit.
But teams have. They’ve embraced video in a big way. Phillies CEO Dave Montgomery talked yesterday about what a huge role video plays today, with players watching video of their plate appearances between innings.
A lot of the work teams are doing is proprietary, but here’s a peek at one project that’s really going to help blow the lid off things.
This is the Field f/x system, which MLB launched this year. The data hasn’t been publicly released, but it reportedly captures over one million points of data for every game.
Field f/x records high resolution shots 15 times a second, identifying every human on the field. Each image is time stamped, and the computer recognizes and records events that occur on the field: when the pitcher releases the ball, the batter hits the ball, the fielder gains possession of a ball, and the fielder throws the ball.
That comes out to something like 2.4 billion data points per season.
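As a rough check on that figure, here is a minimal sketch of the arithmetic. It assumes a 30-team league playing a 162-game regular season (no postseason) and takes the reported one million points per game at face value; how the estimate was originally derived is not stated, so treat this as one plausible reconstruction.

```python
# Assumptions for this sketch: 30 teams, 162 regular-season games each.
TEAMS = 30
GAMES_PER_TEAM = 162

# The reported Field f/x volume, rounded to one million points per game.
POINTS_PER_GAME = 1_000_000

# Each game involves two teams, so divide the team-game total by two.
games_per_season = TEAMS * GAMES_PER_TEAM // 2

points_per_season = games_per_season * POINTS_PER_GAME

print(f"{games_per_season} games per season")
print(f"{points_per_season / 1e9:.2f} billion data points per season")
```

Under those assumptions the total lands at about 2.4 billion, in line with the figure above.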
It would be incredibly labor intensive for a human being to go through the video of a single game and make all of those measurements. And there would be human error and variability. But the Field f/x system isn’t constrained by those human limits. All of those measurements are made by the computer.
And you can imagine the insights such a data set might yield: true measures of fielding range, reaction time of fielders, runners’ speed from first to third.
Some of you may have seen the presentation Thursday by Mike Eckstein of KinaTrax. His company is in the early stages of deploying a motion capture system used to generate biometric analysis of pitchers’ throwing motions. He described how it uses high-speed cameras to capture 10-12 gigabytes of video for each pitch. That’s 1.5 to 1.8 terabytes per pitcher per game.
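Those two ranges can be tied together with a little arithmetic. The sketch below works backward from the per-game totals to the number of captured pitches they imply. It assumes decimal units (1 terabyte = 1,000 gigabytes); the function name is made up for illustration, and the resulting pitch count is only what the quoted numbers imply, not a figure from the presentation.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1,000 GB


def implied_pitches(tb_per_game, gb_per_pitch):
    """Number of captured pitches implied by a per-game and per-pitch volume."""
    return tb_per_game * GB_PER_TB / gb_per_pitch


# Low end of both ranges, then high end of both ranges.
low_end = implied_pitches(1.5, 10)
high_end = implied_pitches(1.8, 12)

print(f"Low end:  about {round(low_end)} pitches captured per pitcher per game")
print(f"High end: about {round(high_end)} pitches captured per pitcher per game")
```

Both ends of the range work out to roughly 150 captured pitches per pitcher per game.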
This is the future of data analysis. It’s not increasingly larger spreadsheets. It’s raw video of games and smart computer systems that can analyze them.
How many people here today have an iPad?
That technology did not even exist 5 years ago. Today 90 million people have them, one third of Internet users in the US.
It used to be that a player had to go into the video room and watch cassette tapes of opposing pitchers. The theory remains the same, but the delivery systems for video have vastly improved.
When I was covering the NFL ten years ago, teams were cutting video after every game and practice and passing out DVDs to every player. Now, pro and college teams have moved their playbooks to tablets. A coach can insert a new play from his desk and have it show up instantly in his team’s hands. And because the systems are interactive, he knows which players have seen it.
In my day job I write about technology for Gannett. I get to talk to some of the top research labs in the world, both at universities and large companies. I’ve been surprised at how many of them are working on video analysis.
They’re working on advances in computer vision – teaching computers to look at images and understand what they see.
You’re probably familiar with things like facial recognition software, but here are a few of the other cutting edge technologies I’ve encountered in my reporting.
License plate readers, mounted on police cars, can read the number from a passing car and check to see if the vehicle is stolen or has an expired registration. These systems can read four vehicles a second while moving at full speed.
Researchers at Xerox developed software that recognizes human gestures. They can tell, for example, that a patient who’s had hip surgery is trying to get out of bed.
At MIT, they’re using motion amplification and color amplification to detect heartbeat and respiration from a video image. It’s not infrared or some other sort of special video. The technology can be applied to existing videos; I saw a demo using footage from the latest Batman film.
Microsoft is working on visual tracking, teaching computers to identify people or objects and follow them. In the UK, where they have 1.85 million CCTV cameras, they’re teaching computers to recognize when a human passenger separates from his backpack. At UCSD, they have computers that learn how to drive by observing how people do it.
But for the civilian research community, we are not there. We don’t have access to data from systems like Field f/x — either the raw video footage or the advanced metrics that come out of it. We don’t have access to cutting edge technologies like high speed cameras or to supercomputers.
What we do have is broadcast footage of major league games back to 2010, and while it’s a cumbersome process, you can use the Pitch f/x data or play-by-play data to identify plays and then go look up the video. And the results can be pretty impressive, even though we are in the very earliest days of digital video as a tool for research.
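Here is a minimal sketch of that workflow: filter pitch records down to the plays of interest, then produce a list of game and inning pairs to go find in the broadcast footage. Everything in it is hypothetical, including the records, the field names, and the strike zone cutoff; it does not reflect the real Pitch f/x schema.

```python
# Hypothetical pitch records. Field names and values are invented for
# illustration only.
PITCHES = [
    {"game_id": "game-001", "inning": 2, "call": "called_strike", "px_feet": 0.20},
    {"game_id": "game-001", "inning": 5, "call": "called_strike", "px_feet": 1.10},
    {"game_id": "game-002", "inning": 1, "call": "ball", "px_feet": 1.20},
    {"game_id": "game-002", "inning": 7, "call": "called_strike", "px_feet": -1.05},
]

# Assumed cutoff: horizontal distance from the center of the plate, in feet,
# beyond which a pitch is treated as off the plate in this example.
ZONE_HALF_WIDTH_FEET = 0.83


def video_watch_list(pitches, half_width=ZONE_HALF_WIDTH_FEET):
    """Return (game_id, inning) pairs for called strikes thrown off the plate."""
    return [
        (p["game_id"], p["inning"])
        for p in pitches
        if p["call"] == "called_strike" and abs(p["px_feet"]) > half_width
    ]


for game_id, inning in video_watch_list(PITCHES):
    print(f"Look up video: {game_id}, inning {inning}")
```

The data narrows thousands of pitches down to a handful; the video then shows what actually happened on each one.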
There’s been some good work in this area, particularly by Ben Lindbergh of Baseball Prospectus.
Here’s an example from last year where he looked at the effect of the new balk rule, which outlawed the fake-to-third, throw-to-first move.
Ben’s also published some interesting video studies on pitch framing… a topic I don’t think anyone was even talking about a few years ago. And rather than just pontificating, he uses the Pitch f/x data and video to show how some catchers get a pitch called a strike that others don’t.
And maybe my favorite piece, with Evan Brunell, when they analyzed video footage of manager ejections and, through lip reading, were able to create transcripts of the conversations that took place between managers and umpires.
So I’ll close with this and then take some questions.
In the classic ’60s film “The Graduate,” Dustin Hoffman’s character was trying to figure out what to do with his life, and he got some advice about what the next hot thing was going to be. “Plastics.”
Today, I’m here to tell you the next big thing… is video.