Statistical Studies of Baseball
John F. Jarvis

(Links to specific projects follow a short introduction.)

In the world of professional team sports, baseball is unique in several ways. The discrete nature of the game, in which individual plays can be readily categorized into a modest number of possible outcomes, leads to a very complete statistical record. Likewise, the discrete nature of the game enables detailed simulations to be carried out. The game's statistical record provides accurate parametrization of the simulator, and insight into the game can be obtained from these simulations. A number of statistical studies based on simulator results and full season play-by-play descriptions are linked to this page.

Baseball questions amenable to numerical analysis can be approached in two general ways. Actual season records can be combed for particular events relevant to the question at hand. These can be used to compare teams or individuals in the same or in different seasons. This approach has the advantage that it relates directly to the seasons as played. However, "what if" questions are difficult to answer because a season can never be replayed. A second difficulty with this approach is that there may be too few events of a particular kind. When the number of events gets small, rates and probabilities derived from them become quite uncertain.
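The effect of small event counts can be made concrete with the usual binomial standard error. The sketch below is illustrative only: the function name and the example counts are assumptions, not values taken from any season record.

```python
import math

def rate_with_standard_error(successes, trials):
    """Return (rate, standard error) for an observed proportion,
    using the binomial approximation sqrt(p * (1 - p) / n)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rate = successes / trials
    return rate, math.sqrt(rate * (1.0 - rate) / trials)

# Made-up counts: the same .300 rate observed over 10 and over 500 at bats.
for hits, at_bats in [(3, 10), (150, 500)]:
    rate, se = rate_with_standard_error(hits, at_bats)
    print(f"{hits}/{at_bats}: rate {rate:.3f}, standard error {se:.3f}")
```

With 10 at bats the standard error is roughly seven times larger than with 500, since it shrinks only as the square root of the number of events.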

The second generic approach to analysis is simulation. Simulation easily deals with the problems of insufficient data and readily allows modification of parameters enabling the "what if" form of analysis. A significant problem with simulations arises from the difficulty of reproducing certain kinds of game situations. A common technique is to use season average rates for events rather than attempt to implement tactical decisions of the kind made by managers.
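A minimal sketch of the season-average technique follows. It is not the simulator described on this page: the event rates are hypothetical, and the base-running rule (every runner advances exactly as many bases as the batter, and a walk moves only forced runners) is a deliberate simplification chosen to keep the example short.

```python
import random

# Hypothetical per-plate-appearance rates (they sum to 1.0); a real study
# would derive them from the season record.
RATES = {"out": 0.68, "walk": 0.08, "single": 0.16,
         "double": 0.045, "triple": 0.005, "homer": 0.03}
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "homer": 4}

def simulate_half_inning(rates, rng):
    """Play one half inning and return the runs scored."""
    outs, runs = 0, 0
    bases = [False, False, False]  # first, second, third
    events = list(rates)
    weights = [rates[e] for e in events]
    while outs < 3:
        event = rng.choices(events, weights=weights)[0]
        if event == "out":
            outs += 1
        elif event == "walk":
            # Only forced runners move up.
            if bases[0] and bases[1] and bases[2]:
                runs += 1
            if bases[0] and bases[1]:
                bases[2] = True
            if bases[0]:
                bases[1] = True
            bases[0] = True
        else:
            gained = BASES_GAINED[event]
            new_bases = [False, False, False]
            for i in (2, 1, 0):
                if bases[i]:
                    if i + gained >= 3:
                        runs += 1          # runner crosses the plate
                    else:
                        new_bases[i + gained] = True
            if gained >= 4:
                runs += 1                  # batter scores on a home run
            else:
                new_bases[gained - 1] = True
            bases = new_bases
    return runs

def mean_runs_per_inning(rates, trials=20000, seed=1):
    """Average runs per half inning over many simulated innings."""
    rng = random.Random(seed)
    return sum(simulate_half_inning(rates, rng) for _ in range(trials)) / trials

print(mean_runs_per_inning(RATES))
```

A "what if" question then amounts to changing an entry in RATES (keeping the total at 1.0) and rerunning with the same seed to compare the averages.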

Complete season play-by-play accounts, "events files", are available for the Major Leagues. Given these complete accounts, any team or player quantity that can be defined can be computed from this record. Retrosheet, Inc. is collecting, entering into a computer, and distributing full season play-by-play descriptions of seasons prior to 1992.

For the events files to be useful, programs are needed to process the play-by-play data they contain into appropriate statistical summaries. Programs performing this task are available from Retrosheet. However, I have written my own program, called the parser, for doing this analysis.
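To give a sense of what processing an events file involves, here is a small sketch that tallies the leading event code of each play record. It is not the parser described above. It assumes the comma-separated Retrosheet layout in which a play record has seven fields ending with the event text, and it treats any event that begins with a fielder number as a fielded out; the function name and the sample records (including the player identifiers) are made up for illustration.

```python
import re
from collections import Counter

def tally_play_codes(lines):
    """Count play records by the leading letter code of the event field
    (for example S, D, T, HR, W, K, SB)."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",", 6)
        if len(fields) < 7 or fields[0] != "play":
            continue  # skip id, info, start, sub, com and other records
        match = re.match(r"[A-Z]+", fields[6])
        # Assumption: events starting with digits (e.g. 63/G) are fielded outs.
        counts[match.group(0) if match else "fielded out"] += 1
    return counts

# Hypothetical records in the assumed layout:
# play,inning,home flag,player id,count,pitches,event
sample = [
    "info,visteam,XXX",
    "play,1,0,aaaa001,12,CBFX,S8",
    "play,1,0,bbbb001,00,,63/G",
    "play,1,0,cccc001,32,BBCBFB,W",
    "play,1,1,dddd001,01,CX,HR/F7",
    "play,1,1,eeee001,00,,SB2",
]
print(tally_play_codes(sample))
```

Taking the whole leading run of capital letters, rather than the first character, keeps codes such as SB (stolen base) and WP (wild pitch) from being counted as singles and walks.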

I have also written a simulator that can be used for a variety of studies.

(January 1, 2009: stat pages for both leagues, 1954-2008 are available.) Team Season Statistics. An extensive collection of statistics covering team performance for entire seasons.

Research Papers

(New July 2004) How Many World Series Should the Braves Have Won?, presented at the 2004 SABR convention, uses simulations to evaluate post season performance for the 12 seasons 1991-2003. I must conclude that the post season is no better than a lottery in picking the WS winner.

(New July 17, 2003) Trends, Exceptions and Results of IBB Usage, presented at the 2003 SABR convention, again examines the value of the IBB. I claim that 96% of the IBBs issued in MLB are detrimental to the defense. The IBB is only justified for the very best players. This paper won the 2003 USA Today Sports Weekly Award for best Poster Presentation.

(July 8, 2002) Career Summaries and Projections, presented at the 2002 SABR Convention in Boston, addresses the concept of projecting a player's career: estimating how he will do in one or more future seasons based on what he has done so far in his career. A careful evaluation of the Brock2 and two other projection systems suggests that more sophisticated career modelling leads to poorer projections.

(Updated September 30, 2003: The analysis now includes the 2002 season.) Since Interleague (IL) play began in 1997, the National League teams have won 53.6% of their home IL games compared to 52.2% in non-IL home games. The corresponding values for the American League are 55.1% and 54.2%. In Hitting Asymmetries in Interleague Games, presented at the SABR 2001 Conference, I argued that this additional disparity in home winning percentage in both leagues is due to the use of the Designated Hitter.

Do managers use a consistent strategy when calling for the Intentional Base on Balls? The IBB is often used in game turning point situations. How does the pressure in such situations affect the game? These and similar questions are considered in my SABR 2000 Convention presentation Hitting in IBB Situations. This paper describes the use of a neural network to define IBB situations.

It seems, according to TV play by play announcers, that the Intentional Base on Balls (IBB) is a tactic that should be invoked any time first base is open. I have performed an analysis of where and how many runs are saved by this technique. I was privileged to give this paper at the 1999 SABR annual convention in Phoenix, Arizona.

Stealing bases is one of the most visible tactical parts of the game. It is also one of the most exciting plays from the standpoint of a fan. I have used my simulator to study the effects of both stealing second base and having a runner caught in the attempt on a team's won-loss record. Information about batting averages and on base percentages when a stolen base event takes place during a plate appearance can be obtained from the event files and is also presented.

(January 1, 2009: The top 150 APW pitchers pages now include both leagues, 1954-2008.) I presented a paper at the SABR 1997 Annual Convention titled "Apportioned Wins and Losses. An Alternative measure of Pitching Performance". The web version of the paper contains more extensive data summaries than the printed version.

In their 1977 paper, "An Offensive Earned-Run Average for Baseball", Operations Research, Vol 25 No. 5, September-October 1977, pp 729-740, Thomas Cover and Carroll Keilers formulate an effective method for estimating the expected future runs, EFR, for any inning state from six readily available hitting parameters: at bats, hits, doubles, triples, home runs and bases on balls. Implementation of their calculation is somewhat more complex than normally encountered in the field of baseball statistics. Details and source code are given in Implementing the Cover-Keilers Offensive Earned Run Average.

(January 1, 2009: All data from the 1954-2008 seasons used in the analysis.) In A Survey of Baseball Player Performance Evaluation Measures I present a systematic evaluation of offensive player evaluation methods including runs created and linear weights. I apply the same measures to evaluating the defense and also show that assists cannot easily be included in a linear weights formulation.

(February 13, 2008: All data from the 1956-2007 seasons used in the analysis.) In Too Many Intentional Bases on Balls? I offer an analysis of the Intentional Base on Balls showing that this defensive technique leads to additional runs scored for the majority of times it is used.

Other Sites of Interest

Gary Jarvis's Minor League Ballpark Photos

SABR, the Society for American Baseball Research, provides many resources on baseball related topics including a page (SABR ONLINE) containing links to member web sites.

John Skilton's Baseball Links is perhaps the most complete index of baseball related topics on the World Wide Web.


Copyright 1997-2009, John F. Jarvis