By David Grabiner

**I. What is sabermetrics?**

Bill James defined sabermetrics as “the search for objective knowledge about baseball.” Thus, sabermetrics attempts to answer objective questions about baseball, such as “which player on the Red Sox contributed the most to the team’s offense?” or “How many home runs will Ken Griffey hit next year?” It cannot deal with the subjective judgments which are also important to the game, such as “Who is your favorite player?” or “That was a great game.”

Since statistics are the best objective record of the game available, sabermetricians often use them. Of course, a statistic is only useful if it is properly understood. Thus, a large part of sabermetrics involves understanding how to use statistics properly, which statistics are useful for what purposes, and similar things. This does not mean that you need to know a lot about mathematics to understand sabermetrics, only that you need to have some idea of how statistics can be used and misused.

The statistics which are available in baseball are a collected record of observations. An individual fan, sportswriter, or even a player or manager will see most teams thirteen or fewer times during the year. His observations may be of some interest, but they are a small (and often biased) sample. In thirteen games, the difference between a great hitter and a poor hitter is just five hits; thus, if the observer happens to see a mediocre player’s two best games of the season, he would get an incorrect impression of the player’s ability.

In contrast, a player’s statistics are a record obtained from all of his games, as observed by the official scorers in the league. This is a much larger collection of observations, and it is converted to a form which can be easily understood; few fans could get a good idea of a player’s batting average by watching his 600 plate appearances.

And since sabermetrics is an objective study of the game, it is necessary to use logical reasoning in sabermetric arguments. Thus, a hypothesis can be developed from the information you have, either from statistics or observation; a claim which cannot be directly tested can be evaluated by studying the conclusions which would follow.

A good example is the statement “Pitching is X% of baseball,” which has been said with X between 15 and 80. Suppose you want to test the claim “Pitching is 75% of baseball.” If this were true, you would conclude that the teams with the best pitching would be much more likely to win the pennant than the teams with the best hitting. However, this isn’t the case. The league leaders in fewest runs allowed (which is both pitching and fielding) win the pennant about half the time; the league leaders in runs scored (which includes all of hitting) win just as often. (Note the definition of offense here: if you measure hitting by an incomplete measure such as batting average, you would conclude that pitching is much more important.) Other unreasonable conclusions also follow; for example, a team with 75% of its value in pitching would never trade a regular pitcher for a regular hitter. Thus the claim must be rejected. But if 75% is replaced by a number close to 40%, the conclusions become reasonable. This is how a sabermetric argument works.

**II. General principles**

The goal of a baseball team is to win more games than any other team. Since one team has very little control over the number of games other teams win, the goal is essentially to win as many games as possible. Therefore, it is of interest to measure the player’s contribution to the team’s wins.

There is a clear relationship between a team’s runs scored and allowed and its wins and losses. This relationship isn’t perfect, but it is very strong. A good formula, determined empirically from the data by Bill James, is that a team’s ratio of wins to losses will be equal to the square of the ratio between its runs scored and allowed. Thus a team which scores and allows the same number of runs will win and lose the same number of games, finishing at .500; a team which scores 800 runs and allows 700 will win 64 games for every 49 it loses, which projects to a 92-70 record over a season. This formula comes very close to the actual records of most teams.
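The arithmetic in the 800-700 example can be checked with a short sketch. The function name `pythagorean_record` and the 162-game default are assumptions made for illustration; only the squared-ratio rule itself comes from the text.

```python
def pythagorean_record(runs_scored, runs_allowed, games=162):
    """Project a (wins, losses) record from the squared run ratio.

    The rule from the text: wins / losses = (runs scored / runs allowed) ** 2.
    The 162-game default is an assumption, not part of the rule.
    """
    wins_per_loss = (runs_scored / runs_allowed) ** 2
    # Convert "wins per loss" into a winning percentage.
    win_pct = wins_per_loss / (1 + wins_per_loss)
    wins = round(win_pct * games)
    return wins, games - wins


print(pythagorean_record(800, 700))  # (92, 70), the record quoted above
print(pythagorean_record(750, 750))  # (81, 81), a .500 team
```

Squaring 800/700 gives 64/49, so the team wins 64 of every 113 games, or about 91.75 of 162, which rounds to the 92-70 record above.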

The basic goal of sabermetrics is to evaluate a measure for a given purpose. The most common uses of statistics are to evaluate past performance (such as to determine who should win the MVP award) and to predict future performance (such as to evaluate a trade that was just made). In both cases, we are interested in measuring contribution to games won and lost.

The reasons that such analysis is possible are the same reasons that make statistics more interesting in baseball than in other sports. Baseball statistics can measure individual performance, independent of what other players do. And while the importance of an individual event depends on the situation, the effect of the situations on the importance of the statistic over a large sample such as a season is not great.

When a batter hits a single, this describes what he did; when a quarterback throws a ten-yard pass, the guard who took out a linebacker gets no statistical credit. And the batter who hit a single is properly credited for a success; the ten-yard pass may have been a failure if it was third down with 13 yards to go. Thus it is reasonable for the goal of a baseball statistic to be to measure a player’s individual contribution to runs or wins.

Given the goal, it is possible to evaluate a statistic. Baseball statistics can be evaluated in the same way as non-baseball statistics; they can have the same types of flaws, or be misused or misinterpreted in the same ways.

The first natural question to ask about a statistic is, “Does the statistic measure an important contribution to that goal?” For example, ERA measures the number of runs a pitcher allows, which is almost all a pitcher contributes to winning games. Batting average does fairly well because it counts hits, but it ignores power and walks, which are also important parts of offense. Few statistics fail badly here; those which do measure things which happen only rarely (such as HBP), have little to do with winning games (such as the fraction of a batter’s outs which are strikeouts), or both. As a non-baseball example, the number of crimes in a city last year is important if you want to know something about the safety of the city; the number of crimes on a single street says very little about the safety of the whole city.

The second, and usually most important, question to ask is, “How well does the statistic measure the player’s own contribution?” There are many ways that a statistic, baseball-related or not, can fail here. Virtually every statistic fails in some way to some extent, so the best statistics are those with only minor failings, and relatively few of them.

For example, a player should be evaluated for what he does, not for what his teammates or manager do. This is a major problem with such statistics as runs scored. Unless the batter hits a home run or steals home, he needs his teammates’ contribution to actually score a run, and he cannot do much to cause them to get hits once he is on base. Thus, if you bat in front of the best home-run hitters in the league, you will score a lot of runs, whether or not you have a good ability to score runs. If your manager decides to bat you eighth on an NL team, you won’t score many runs when you do get on base.

Likewise, a good statistic should not measure outside effects over which the player has no control, such as the park. A good non-baseball example of this problem is the high death rate in Miami. The population of Miami is older than the population of most other cities; thus, regardless of the quality of medical care in Miami, you would expect a high death rate.

Likewise, it is easier to score runs in Fenway Park than in Oakland. Therefore, a pitcher with a 3.60 ERA in Oakland could pitch just as well in Fenway, helping his team win games just as much, but have a 4.00 ERA. You will sometimes see a discussion of park-adjusted numbers, designed to eliminate this effect; for example, the pitcher above might have a 3.80 park-adjusted ERA in either park. Note that this is adjusting for the value of the pitcher’s performance, not the actual performance; the 4.00 ERA for a Red Sox pitcher is just as valuable to his team regardless of how it is split between home and road games.
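As a rough sketch of how such an adjustment could work, suppose each park has a single multiplicative run factor (1.0 for a neutral park) and the adjusted ERA is the raw ERA divided by that factor. This simple model, the function name `park_adjusted_era`, and the two factors below are all assumptions; the factors are chosen only to reproduce the numbers in the example above and are not measured values.

```python
def park_adjusted_era(era, park_factor):
    """Scale an ERA to a neutral run environment.

    park_factor is a hypothetical multiplier describing how much the
    pitcher's run environment inflates (>1.0) or deflates (<1.0) scoring.
    """
    return era / park_factor


# Illustrative factors only, back-solved from the 3.60 / 4.00 / 3.80 example.
OAKLAND_FACTOR = 3.60 / 3.80  # a little under 1.0: runs are harder to score
FENWAY_FACTOR = 4.00 / 3.80   # a little over 1.0: runs are easier to score

print(round(park_adjusted_era(3.60, OAKLAND_FACTOR), 2))  # 3.8
print(round(park_adjusted_era(4.00, FENWAY_FACTOR), 2))   # 3.8
```

Real park adjustments are estimated from scoring data and must allow for the fact that a pitcher plays only about half his games at home; the sketch shows only the direction and rough size of the correction.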

If a player’s statistics change considerably when he changes teams, parks, or lineup positions, this suggests that the outside effect has a major effect on the statistics. If the statistic remains consistent when outside conditions change, this means that it is measuring the player’s own contribution. Pitchers with good ERA’s tend to keep them when they change teams, so the park effect is not a serious problem. Hitters who score a lot of runs in the leadoff spot will score many fewer runs if they are dropped to sixth in the lineup, which means that the runs scored were mostly created by the lineup position rather than the batter.

In addition to these problems with outside effects, there can be problems with measurement. For example, no statistic can be useful without proper context, a measure of opportunities. There were more crimes committed in New York than in Boston last year, but this doesn’t say much about the relative safety of the cities; to make such a comparison, you would need to compare crime rates.

If a batter has 150 hits, what does that mean? Well, if he has 500 at-bats, he is good at getting hits; if he has 650 at-bats, he is poor.

This is a problem with most counting statistics. Batting average places hits in a reasonable context, and this is recognized because the batting title goes to the player with the highest average, not the player with the most hits.

Similarly, a statistic may not be useful if it tries to measure something with a very small sample size or number of occurrences. The best pitchers at throwing shutouts often don’t lead the league in shutouts, because the league leader normally has about five, and it’s quite common for a pitcher who usually throws three shutouts a year to get seven in one year. In contrast, the best strikeout pitchers do lead the league in strikeouts (or strikeouts per nine innings), because their totals are in the hundreds, and a pitcher who is capable of getting 250 strikeouts in 240 innings might get 230, but not 150.

Again, the same problem comes up with non-baseball statistics. If 2/3 of the people polled in your city plan to vote Democratic, that means nothing if it was four of six, and not much if it was forty of sixty, but quite a lot if it was 400 of 600. This is the major flaw with many of the statistics that are often used on TV; a statistic such as, “Wade Boggs is hitting .154 against Baltimore pitchers with runners in scoring position” means nothing because the sample is probably two hits in thirteen at-bats.

Sabermetricians agree with most fans that such stats are ridiculous; they are there only to hold the interest of the (mostly statistically illiterate) television audience.
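The effect of sample size in the polling example can be made concrete with the usual normal-approximation margin of error for a proportion. This is a standard textbook formula rather than anything specific to the essay, the helper name `proportion_margin` is hypothetical, and the approximation is crude for a sample as small as six.

```python
import math


def proportion_margin(successes, n, z=1.96):
    """Return (observed proportion, approximate 95% margin of error).

    Uses the normal approximation z * sqrt(p * (1 - p) / n); this is an
    assumption that becomes unreliable for very small n.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


for successes, n in [(4, 6), (40, 60), (400, 600)]:
    p, margin = proportion_margin(successes, n)
    print(f"{successes} of {n}: {p:.3f} +/- {margin:.3f}")
```

The observed proportion is the same in all three cases, but the margin shrinks from roughly 0.38 to 0.12 to 0.04, which is why only the 600-person poll says much. A two-for-thirteen split is in the same position as the six-person poll.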

Now, once you have some idea of how well the statistic measures the player’s own contribution to the goal, the final question to ask is, “Is there a better way to measure the same thing?” A statistic which has problems with the other questions but has no reasonable alternative measurement may still be useful. In contrast, a statistic such as runs scored, which can be replaced by other statistics, is of very little value. A player’s own contribution to his total of runs scored can be measured by his ability to get on base (measured very well by on-base percentage) and, to a lesser extent, to advance himself once he gets on base (measured by extra-base hits, and by stolen bases and caught stealing).

Now, given these criteria, you can evaluate a statistical conclusion. If you dispute the conclusion, your argument may be valid if it is based on these criteria. That is, you need to find something which is not measured by the statistic, or is measured but shouldn’t be. For example, you can argue that Mike Schmidt is a good hitter, even though his career average is .267, because he hit 548 home runs and drew 1507 walks. These are valid arguments, because batting average gives the same value to homers and singles, and does not count walks at all. Likewise, Ozzie Smith is not a great offensive player, but he is still an excellent player, because of his defense; no offensive statistic measures his overall value.

But you cannot dispute a statistical conclusion with a claim which is based on something which is already included in the statistic, or something which is improperly measured by your claim. It isn’t reasonable to say that Brooks Robinson was great at getting hits because of his 2848 hits; the correct measure of how well he got hits is his .267 batting average, which led to such a high hit total because his other skills allowed him to have a very long career. Turning one of the above examples around, you can’t claim that Schmidt could not possibly be a great hitter, despite his .527 SLG, by looking at his batting average; the batting average is already counted in the slugging average.

**III. Sabermetric stats**

A good, complete measure of individual offense would satisfy the criteria above for a valuable statistic better than any of the traditional offensive measures. Therefore, sabermetricians often use or develop such statistics. (For measuring pitching, there is less need for such a statistic, because ERA and runs allowed already count the number of runs allowed by a pitcher.)

At the team level, a good measure of offense should have a strong correlation with runs scored. This means that it should be possible to predict runs scored reasonably well from the measure; the best teams by this measure should score a lot of runs, while the worst teams should score very few. Measures such as batting average do not do this; it is common for the team with the best batting average to be below average in runs scored. Runs scored itself obviously measures team offense very well, but it creates a problem when you try to measure individual contributions; it isn’t easy to measure directly how much a batter helped or hurt his team score runs.

There are several ways to develop a statistic which measures team offense. Probably the most natural way is to say that a team scores runs by getting runners on base, and then advancing them. Thus, a team’s runs scored should be proportional to the number of runners it gets on base, and to the frequency with which it advances the runners. On-base percentage measures the number of runners on base, while slugging average is one way to measure advancement. (Note that an out reduces slugging average, because it makes it less likely that any runners on base will be advanced.) Thus team runs should be correlated with OBP*SLG.

The test of a statistic of this type is how well it agrees with reality. If you compare teams’ OBP*SLG to their runs scored, you find a very good correlation; the standard error is just 24 runs. For comparison, the standard deviation of runs scored in one season is 70 runs (this is the error you would get if you predicted that all teams would be average in runs scored), while batting average alone has a standard error of 54 runs. The 24-run standard error covers everything which OBP*SLG does not measure or measures improperly; this includes such factors as baserunning and imperfections in the formula, but much of the difference is chance.

Now, we need to make an individual statistic by measuring a player’s contribution; OBP*SLG is not the correct measure for a player because he usually doesn’t drive himself in. Instead, you want to multiply his OBP by the team’s SLG, and his SLG by the team’s OBP. Since the league SLG (and individual teams’ SLG) are usually about 1.2 times the OBP, each point of a player’s OBP has 1.2 times the effect on OBP*SLG that a point of his SLG has. Thus our measure is (1.2*OBP)+SLG. For simplicity, we often ignore the factor of 1.2 and refer to OPS, On-base Plus Slugging. When using this statistic, remember that OBP is slightly undervalued, and that stolen bases have not been counted.
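A minimal sketch of the two measures follows. The OBP and SLG formulas are the standard official definitions, which the essay does not spell out; the function names and the sample stat line are hypothetical.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Standard definition: times on base divided by the chances counted for OBP."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def slugging_average(singles, doubles, triples, home_runs, at_bats):
    """Standard definition: total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base_plus_slugging(obp, slg, obp_weight=1.0):
    """OPS when obp_weight is 1.0; the text's (1.2*OBP)+SLG when it is 1.2."""
    return obp_weight * obp + slg


# Hypothetical season: 500 AB, 150 H (100 1B, 30 2B, 5 3B, 15 HR), 60 BB, 5 HBP, 5 SF.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5, at_bats=500, sac_flies=5)
slg = slugging_average(singles=100, doubles=30, triples=5, home_runs=15, at_bats=500)

print(f"OBP {obp:.3f}  SLG {slg:.3f}")
print(f"OPS {on_base_plus_slugging(obp, slg):.3f}")
print(f"1.2*OBP + SLG {on_base_plus_slugging(obp, slg, obp_weight=1.2):.3f}")
```

Keeping the weight as a parameter makes it easy to see how little the ranking of most players changes between plain OPS and the weighted version, which is the reason the simpler form is usually quoted.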

Using the same process for other models of offense gives other measures, which give slightly different values for different elements of offense. The choice of which measure to use depends on which ones you have handy, the purpose for which you want to use it, and some personal preferences. But if you use any well-designed measure of offense, you won’t be wrong. You may find that a player who has two more Runs Created than another is .003 worse in OPS, but such differences aren’t important; either way, you will reach the reasonable conclusion that they are very close.

The complete measures of offense give a good estimate of the value of the individual categories, such as walks, home runs, and outs, which make them up. The value of a player’s home runs is the effect that they have on OPS or any similar statistic, and the importance of home runs thus depends on this value and their frequency.

**IV. Evaluating official statistics**

We can now apply the criteria to the official statistics. While it isn’t reasonable to go through the arguments for every statistic, it is useful to look at the statistics which cause the most frequent arguments.

RBI’s are commonly used as a measure of a player’s offense, because they are the only statistics which are easily available which look like a complete measure. (As a result, the MVP winner is more likely to be the league leader in RBI than in any other category.) Of course, they aren’t a complete measure; the ability to drive in runs is an important part of offense, but not the whole thing. This does not make RBI’s meaningless, only incomplete.

But the real problem with RBI’s is the second question; they measure a lot of things which are not the player’s own contribution. You cannot drive in runners who are not on base (except with home runs), but your own batting doesn’t put them there; if you bat behind good players, you will get a lot of chances. In fact, the league leaders in RBI are much more likely to be the players who batted with the most teammates on base or in scoring position (not the batter’s contribution) than those who hit the best with runners on base or in scoring position. Thus RBI are a better measure of who had the most chances to drive in runners than of who was the best at driving in runners.

And now, we try the third test; there is a better measure of the ability to drive in runners. Hits drive runners in from scoring position; therefore, a player who gets a lot of hits is good at this part of driving runners in. Likewise, extra-base hits drive runners in from first base, and home runs drive them in from home plate. Slugging average measures a player’s ability to get hits, extra-base hits, and home runs, so it measures his ability to drive in runs, with park effects the only significant bias. Thus RBI’s are not useful as a measure of offense, or even as a measure of the ability to drive in runs.

The other statistic which is subject to many of the same problems is a pitcher’s won-lost record; we will compare it to ERA. Both measure something which is clearly important, since a pitcher’s goal is to win games, and the way he does this is by preventing the opponents from scoring. But both have some problems measuring the pitcher’s own contribution; a comparison of their value depends on these problems.

The first problem is that runs are allowed by the whole defense, not just by the pitcher. This is slightly more of a problem with W-L; ERA eliminates runs due to errors, but not due to fielders who are out of position, run slowly, or make weak throws. At the major-league level, it isn’t a serious problem; good pitchers can still have good ERA’s (and runs allowed) even with teams of poor fielders.

Won-lost record is one of the few categories which is immune to park effects; there is one win in every game in every park. ERA has a slight problem with park effects, which makes it more useful with a park adjustment.

But the most important factor is the effect of the team offense. Offense has almost no effect on ERA, but it has a considerable effect on W-L. A game is not won just by the pitcher (despite the name of the statistic), but by the team which scores more runs than it allows. In a single season, the pitcher with the best W-L record in the league is just as likely to be the pitcher with the best run support as the pitcher with the fewest runs allowed. And the run support is not the pitcher’s contribution (except for batting in the NL). If there were pitchers who could cause their teammates to score more runs for them, it would make sense to give the pitchers some of the credit. But this doesn’t happen; there is no tendency for pitchers who had support better than their team’s average in one season to have it again in the following season. Nor does a pitcher have any control over whether he gets to pitch on a good offensive team.

Because of the effect of run support, single-season W-L records are not a good measure of a pitcher’s own value. ERA is available, and it is a better measure of what you actually want to know. However, a career W-L reduces the luck in run support by using a much larger sample size. In addition, pitchers rarely spend their full careers with poor or good teammates. Thus a career W-L record for a long career (several hundred decisions) is a decent measure of a pitcher’s own performance; it’s about as useful as a career ERA without park adjustments.

Since we have now dealt with the most common measures of batting and pitching, it makes sense to deal with the most common measure of fielding. Fielding average has its problem with the first test; while defense is important, an incomplete measure of defense is not. The league leader in errors at third usually makes about 30; the leader in fielding average makes about 10. There aren’t enough plays to make a difference of very many runs. The more important part of fielding is the ability to prevent hits; if the third baseman can’t reach a ball in the hole, or knocks it down but has no play, he won’t be charged with an error, but the batter will get a hit which has the same effect.

Errors are about as useful as a measure of defense as strikeouts are as a measure of batting average. They measure one way to fail to make a play; while it is the most obvious failure, all failures count the same on the scoreboard. A fielder with poor range will be a poor fielder whether he makes few or many errors, just as a hitter who hits too many routine grounders or popups can be a poor hitter even though he puts the ball in play.

While fielding average also has problems with park effects and scorer’s biases, the incompleteness is the most serious problem. Still, since it does measure something useful, and fielders who are good at other things tend not to make errors (fielding percentage has a good correlation with games won), it would be a useful measure in the absence of anything else. It still has some value, particularly in concluding that players with very low fielding averages can’t handle their positions, but it should be used in conjunction with putouts, assists, and an attempt to understand any biases in the numbers.

But for recent players, we have a better measure of overall defense, Defensive Average (abbreviated DA), which makes fielding average unnecessary. The basis for DA is a division of the playing field into zones of responsibility for the fielders. When a ball is hit into a fielder’s zone, it is charged as an opportunity for that fielder; if the fielder turns it into an out, he receives credit for a play made. Thus, all ground balls near third base are charged as chances for the third baseman; a good third baseman will make plays on most of them. If he fails to make a play, the effect is the same whether his throw is wild (scored an error) or late (scored a single), so fielding average does not tell you anything more.
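The bookkeeping described above can be sketched in a few lines. The outcome labels, the function name `defensive_average`, and the two made-up third basemen are assumptions for illustration; the actual DA system defines its zones and scoring rules in more detail than the essay gives.

```python
from collections import Counter


def defensive_average(outcomes):
    """Plays made divided by opportunities for one fielder's zone.

    outcomes is an iterable of strings, one per ball hit into the zone:
    "out" (play made), "hit", or "error". Hits and errors count the same:
    both are opportunities that were not turned into outs.
    """
    counts = Counter(outcomes)
    opportunities = sum(counts.values())
    if opportunities == 0:
        raise ValueError("no opportunities recorded")
    return counts["out"] / opportunities


# Two hypothetical third basemen, 400 balls in zone each (invented numbers).
sure_hands = ["out"] * 240 + ["hit"] * 150 + ["error"] * 10
wide_range = ["out"] * 270 + ["hit"] * 100 + ["error"] * 30

print(defensive_average(sure_hands))  # 0.6
print(defensive_average(wide_range))  # 0.675
```

In this invented comparison the fielder with three times as many errors still converts more balls into outs, which is the situation fielding average alone would get backwards.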

Defensive average should be put to the same tests as any other statistic. It does reasonably well in the first test. It measures a player’s ability to turn balls in play into outs, which covers most of his defensive play but not all of it; such skills as turning the double play and throwing out runners trying to stretch hits are not counted. It also does well in the second test, although it still has some problems, mostly with park effects. Pitchers cannot introduce bias simply by being left-handed (and thus allowing a lot of ground balls to third base and fly balls to left), but good pitchers may help their fielders’ DA slightly by allowing fewer hard-hit balls. Fielders do not have a great effect on each other’s DA, although there will be a small effect for plays such as the low throws that a good first baseman can handle. (All of these effects will cause problems with almost any measure of fielding.) And for the third test, DA is the best measure of the ability to make the play in the field that we have; it isn’t perfect, but it is complete enough and accurate enough to be useful.

Thus the established statistics, used for reasons of tradition, may be good measures (such as ERA) or poor measures (such as RBI’s). Their value does not depend on their tradition or their names; it depends on how well they meet the basic tests of any statistic.

**V. Other sabermetric arguments**

Similar analysis must also be used in evaluating a hypothesis which depends on a statistical argument. If the hypothesis leads to conclusions which don’t correspond with the real game of baseball, then it needs to be revised.

For example, a natural question in predicting a player’s future performance in the major league is how useful his minor-league numbers will be in a prediction. There are problems with using minor-league numbers because there are extreme park effects and differences between leagues. However, once you adjust a player’s minor-league numbers for these effects, and then make a specific adjustment for the difference between AA or AAA ball and the majors, you may have something meaningful. There is a method for making these corrections; the result is called the MLE, Minor-League Equivalency. This will be useful if it works, tested against the real world. In fact, it works almost as well as past major-league performance in predicting future major-league performance. Most players with MLE’s which say they will hit .300 will hit close to .300 as rookies, just as most players who hit .300 last year will. (Of course, neither prediction is perfect.)

Another issue which sabermetricians have studied (and often discussed) is the existence of clutch hitters. Clutch hits themselves certainly exist; when Bobby Thomson, Carlton Fisk, Bucky Dent, Kirk Gibson, and Joe Carter hit their famous home runs, or when Ken Griffey singled in the tying run in the eighth inning in May, they got hits when it was important. But many players have reputations as players who will hit their best with the game on the line, and this is a hypothesis which can be tested; are there any players with such an ability?

Again, it is necessary to look at what actually happens, and what would happen if there were no clutch ability at all or if clutch hitting was a significant ability. Even if a .250 hitter were just a pair of coins which got a hit when they were both heads, some .250 hitters would hit .400 during one season in the late innings of close games (a 3% chance in 80 AB), so the existence of such numbers doesn’t prove anything. But if there is an ability, players who hit well in the clutch in the past will continue to do so. This can be tested, and has been; there is only very weak evidence of an ability, and it is clear that whatever ability there is does not mean much in baseball terms. There may be .267 hitters who are actually as valuable as .268 hitters because of their good clutch numbers, but if you replace .268 with .275, you have a conclusion which is inconsistent with what actually happens.

**VI. Conclusion**

Baseball statistics are useful only if they enhance your understanding of the game. Therefore, they should be judged by how well they measure what actually happens in the game. Meaningless statistics should be ignored or replaced; deficient statistics should be improved. And well-designed statistics should be used as an important part of discussion about the game and its players.

**Bibliography**

Bill James, _The Baseball Abstract_, published annually from 1980 to 1988 by Ballantine Books.

John Thorn and Pete Palmer, _The Hidden Game of Baseball_, New York: Doubleday, 1985.

John Thorn and Pete Palmer, eds., _Total Baseball_, New York: HarperCollins, 1993.