A Baseball Simulator

John F. Jarvis

An accurate simulator can be used to investigate the efficacy of different tactical options such as batting orders and stolen bases, assess the effects of errors and gain insight into how improbable a particular season outcome is.

The discrete nature of the outcome of an at bat is the key to creating Monte Carlo simulations of the game. At any point in the game, one of several outcomes is possible. These are characterized by their probability of occurring, and a random number is used to choose one outcome based on the probabilities.

Three strong assumptions have guided the design and implementation of the simulator.

First, no attempt is made to include strategy or tactics in the program. All types of events included in the simulator take place at their season average rates. Given the intense tactical game projected by major league baseball, I feel this is a very strong assumption.

Second, I assume there is no correlation between outcomes of successive at bats. Sports commentators focus on clutch hitting (or the lack of it), hitting streaks and other short term trends. However, statistical support for the notion of better hitting with a runner in scoring position is weak. There is significantly more support for a home and away difference in team performance. Still, relying on season averages seems to be the simplest and best simulator implementation strategy.

Third, no attempt is made to model individual players. Hitting is modeled at the level of detail of the batting order using season averages at each position. Base running is accomplished using team averages. Pitching is summarized by a single parameter, runs allowed per game. If individual player statistics were to be used, the simulator would have to implement a substitution strategy. This would contradict the first assumption.

In the discussion of the simulator that follows, reference will be made to a large number of detailed statistics characterizing each team. These performance parameters are derived from the full season play-by-play descriptions for the leagues and are prepared by a parsing program.

Errors are implemented using defensive team averages. Two classes of errors are used: those that allow the batter to advance to first and those that allow runners on base to advance. Values used for an error class are the sum of all categories that produce the same result. For example, to get the number of runner advancing errors used by the simulator, the official runner advancing class of errors is added to the number of balks, wild pitches and passed balls that the team made. The implementation rationale is that if the result is the same, the event is not distinguished as a separate category.

Simulated games follow the flow of an actual game. The visiting team bats first and continues until three outs are made. The home team follows and does the same. If the home team is ahead after the middle of the ninth inning, the game is won. When the score is tied after nine innings, extra innings are played until the tie is broken. If the home team wins in the ninth or in extra innings, only as much of its last inning as is needed to win is played. Runs scored in the home half of the last inning are credited according to the Major League rules. Each team plays half its games as the home team and half as the visiting team. While there is no home field advantage programmed in the simulator, alternating in this way prevents biases in the simulation statistics due to the rules for ending a game. A complete set of games defined by actual season pairings defines a simulated season.
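
The game flow described above can be sketched in C++ roughly as follows. This is an illustration only, not the simulator's code: GameResult, HalfInning and playGame are hypothetical names, and the half inning itself is left to a caller-supplied function. The runsToWin argument is an assumed way of letting the home team's last inning stop once the winning run scores; crediting runs under the Major League rules is left to that function.

```cpp
#include <functional>

// Hypothetical result record for one simulated game.
struct GameResult {
    int visitorRuns = 0;
    int homeRuns = 0;
    int innings = 0;
};

// Assumed interface: returns the runs scored in one half inning.
// homeBatting selects the batting team; runsToWin > 0 asks the half inning
// to stop as soon as that many runs have scored, 0 means play to three outs.
using HalfInning = std::function<int(bool homeBatting, int runsToWin)>;

GameResult playGame(const HalfInning& halfInning) {
    GameResult g;
    for (int inning = 1;; ++inning) {
        g.innings = inning;
        g.visitorRuns += halfInning(false, 0);
        // Home team ahead after the top of the ninth or later: game over.
        if (inning >= 9 && g.homeRuns > g.visitorRuns) break;
        // From the ninth on, the home half ends once the lead is taken.
        const int runsToWin =
            (inning >= 9) ? g.visitorRuns - g.homeRuns + 1 : 0;
        g.homeRuns += halfInning(true, runsToWin);
        // A ninth or later inning that ends untied decides the game.
        if (inning >= 9 && g.homeRuns != g.visitorRuns) break;
    }
    return g;
}
```

Passing the half inning in as a function keeps the end-of-game rules separate from the at bat model, so the rules can be checked with simple scripted innings.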

A plate appearance is modeled by first determining if a stolen base is possible. If one is possible, attempts to steal third and second are considered separately, in that order, and the result of each is determined. The batting team season averages are used to determine the possible outcomes, which are no attempt, success or caught stealing. The simulator can create double steals, with runners on first and second stealing second and third on the same play. However, these are artificially rare as the simulator does not generate a true double steal. Caught stealing rates include pick off plays at the starting base. Stealing home occurs so infrequently that it was not programmed into the simulator. Runner advancing errors can occur on stolen base attempts.
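
A minimal sketch of this stolen base step is given here. The names (Bases, StealRates, Steal, resolveSteal, stealPhase) are hypothetical, and expressing the team averages as per-opportunity probabilities of "success" and "caught" is an assumption made for the example. The condition for stealing third (runner on second, third empty) is assumed by analogy with the condition for stealing second. Runner advancing errors on the attempt are not modeled.

```cpp
// Hypothetical base occupancy flags.
struct Bases {
    bool first;
    bool second;
    bool third;
};

// Assumed per-opportunity rates for one kind of steal (team season averages).
struct StealRates {
    double success;  // probability a steal is attempted and succeeds
    double caught;   // probability a steal is attempted and the runner is out
};

enum class Steal { NoAttempt, Success, Caught };

// Map one uniform draw u in [0, 1) onto the three outcomes.
Steal resolveSteal(const StealRates& r, double u) {
    if (u < r.success) return Steal::Success;
    if (u < r.success + r.caught) return Steal::Caught;
    return Steal::NoAttempt;
}

// Consider a steal of third, then a steal of second. Returns the outs made.
// nextUniform is any callable returning a uniform number in [0, 1).
template <class Uniform>
int stealPhase(Bases& b, const StealRates& toThird, const StealRates& toSecond,
               int outsSoFar, Uniform&& nextUniform) {
    int outs = 0;
    if (b.second && !b.third) {  // a steal of third is possible
        const Steal s = resolveSteal(toThird, nextUniform());
        if (s != Steal::NoAttempt) b.second = false;
        if (s == Steal::Success) b.third = true;
        if (s == Steal::Caught) ++outs;
    }
    // Skip the second check if the inning has just ended.
    if (outsSoFar + outs < 3 && b.first && !b.second) {  // steal of second
        const Steal s = resolveSteal(toSecond, nextUniform());
        if (s != Steal::NoAttempt) b.first = false;
        if (s == Steal::Success) b.second = true;
        if (s == Steal::Caught) ++outs;
    }
    return outs;
}
```

With runners on first and second, a successful steal of third empties second, so the following check can send the runner on first to second. In this sketch that is the only way a double steal arises, which matches the remark that such plays are artificially rare.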

Following the stolen base evaluation, an at bat is simulated. First, a check is made for a defensive error allowing the batter to proceed to first. Error rates are based on the fielding team statistics. If an error occurs, all runners on base also advance one base. If no error occurs, which is the most likely case, the batting outcome is simulated. Events that can occur are walks, singles, doubles, triples, home runs and outs. A separate set of probabilities is maintained for each batting order position. Since the probabilities for these events must total 1.0, there is an implied "out" column containing 1.0. Probabilities are based on all plate appearances, not just official at bats, so they are only proportional to batting averages and on base percentages. The entries are cumulative: the probability for an event is the difference between the value in its column and the value in the preceding column (0 for home runs). These probabilities for a single team, displayed as an array, follow:

Position    Home Run   Triple   Double   Single     Walk
       1:     0.0195   0.0255   0.0540   0.2099   0.2924
       2:     0.0215   0.0246   0.0631   0.2277   0.3446
       3:     0.0377   0.0425   0.0833   0.2358   0.3569
       4:     0.0434   0.0450   0.0916   0.2428   0.3585
       5:     0.0461   0.0510   0.0872   0.2155   0.3487
       6:     0.0438   0.0489   0.1180   0.2580   0.3592
       7:     0.0467   0.0536   0.0813   0.2388   0.3339
       8:     0.0071   0.0160   0.0410   0.1925   0.2870
       9:     0.0092   0.0129   0.0460   0.1526   0.2096
Table 1. Cumulative AT BAT outcome probabilities for the 1986 Mets

The first step in determining an at bat result is to generate a number that can be compared to the hitting probability table. A value, S, that represents the effects of the opposing pitching and defense in general is computed:

(1) S = 1 + a ( 1 - Tra/Lra )

In (1), Lra is the league average for runs allowed per game and Tra is the same quantity for the defensive team. When the team and league runs allowed per game are the same, S = 1. If Tra < Lra, that is, pitching and overall defense are better than the league average, then S > 1, which will decrease the probability of hits and walks. Similarly, when Tra > Lra, S < 1 and the probability of getting on base increases. The constant a determines the strength of this change and is determined by minimizing the chi-square statistic for team runs allowed evaluated for all the teams in a league. Separate minimizations have been done for each season data set. While the values determined for a differ slightly, the differences are sufficiently small that a single value is used for all simulations: a = 0.44. Also, total runs allowed are used, not the more commonly referenced "earned runs". All runs count equally, earned or not, and the intent of the simulation is to reproduce season results, not choose between pitchers. This method is entirely empirical and does not purport to represent the actual hitter - pitcher interaction. Justification for it is the very small values of the runs allowed chi-square statistic achieved. The value of S needs to be determined for each team just once per game.

Given S, the quantity to be used to choose the at bat result is computed:

(2) R = S * randomf()

The function randomf() returns a pseudo random number in the range 0.0 to 1.0. The scaled random number, R, is compared to the array of probabilities (Table 1) to determine the result of the at bat. If R is greater than the appropriate value in the walk column, the at bat result is an out. If it is not an out, the same R is compared to the singles column. Again, if R is greater than this value, a walk is the at bat result. The interpretation of the entries is the probability of the particular event added to the probabilities of the previous events. Continuing in sequence, if R is less than the value under the home run column, a home run is the result. The order of testing runs from more likely to less likely outcomes (triples are slightly out of order) to minimize the number of tests needed. When S > 1, a larger fraction of the 0-1 range of the RNG is greater than the on base event thresholds, reducing the probability of getting on base. With S < 1 the converse is true.
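
The comparison sequence just described can be sketched as a chain of threshold tests. AtBat and atBatResult are hypothetical names; the uniform draw u stands in for the value returned by randomf(), and the row is one line of Table 1.

```cpp
#include <array>

enum class AtBat { HomeRun, Triple, Double, Single, Walk, Out };

// cum holds one row of Table 1: cumulative probabilities for
// home run, triple, double, single, walk (in that order).
// s is the scale factor S from (1); u is a uniform number in 0.0 to 1.0.
AtBat atBatResult(const std::array<double, 5>& cum, double s, double u) {
    const double r = s * u;  // equation (2)
    // Test from the most likely outcome (out) toward the least likely.
    if (r > cum[4]) return AtBat::Out;
    if (r > cum[3]) return AtBat::Walk;
    if (r > cum[2]) return AtBat::Single;
    if (r > cum[1]) return AtBat::Double;
    if (r > cum[0]) return AtBat::Triple;
    return AtBat::HomeRun;
}
```

For the first batting position in Table 1 with S = 1, a draw of 0.25 falls between the single threshold (0.2099) and the walk threshold (0.2924) and so is a walk. The same draw with S = 1.2 gives R = 0.30, which exceeds the walk threshold and becomes an out.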

For each possible at bat result, the appropriate base running, some of it conditional, is done. Home runs are the simplest at bat outcome to process. The batter and all runners on base score. Triples are almost as easy. All base runners score and the batter goes to third. Doubles present a slightly more complicated situation. Runners on second and third score unconditionally. The slight chance (Table 2) for a runner on second not scoring is ignored. There are three possibilities for a runner on first: score, go to third or be out trying to score. The choice is made using a random number to select one of the three possibilities. Singles are processed in the same general way although there are more possibilities. The runner on third scores. Base runners on first and second have probabilistic outcomes. Table 2 tabulates the overall advance patterns for the 1995 American League. Team values are used for these quantities in the simulator except for improbable events such as a runner on first scoring (1-h) or being out (1-x) after a single, where the league average is used for all teams. Read the headings as the starting base to the final base on the play. An x indicates an out made while advancing from the specified base and h indicates a score.

Errors may occur on any hit except the home run and advance all runners one base. Error rates used are from defensive team statistics and include all event types that can advance a base runner (fielding, balk, wild pitch and passed ball).

lead runner advance on single
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x   3-3   3-h   3-x
  1951   951    23    36    19   687  1365    54    12  1629     1

runner on third, single, next runner advances
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x
  1008   525    22    33     8   232   460    18

lead runner advance on double ------------,  on double play
   1-3   1-h   1-x   2-3   2-h   2-x   3-h   2-2   2-3   3-3   3-h
   497   321    28    10   636     1   463     5   153    10    82
Table 2. 1996 AL Total Runner Advances
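
Counts such as those in Table 2 can drive the probabilistic base running choice directly. The sketch below uses hypothetical names (pickByCount) and the league totals from the first block of the table for a lead runner on first after a single; the simulator itself is described as using team values for the common outcomes.

```cpp
#include <array>
#include <cstddef>

// Choose an outcome index with probability proportional to its count.
// u is a uniform number in [0, 1).
template <std::size_t N>
std::size_t pickByCount(const std::array<int, N>& counts, double u) {
    long total = 0;
    for (int c : counts) total += c;
    const double target = u * static_cast<double>(total);
    long running = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        running += counts[i];
        if (target < static_cast<double>(running)) return i;
    }
    return N - 1;  // the last outcome takes whatever range remains
}

// Outcome order for a lead runner on first after a single: 1-2, 1-3, 1-h, 1-x.
const std::array<int, 4> kFirstOnSingle{{1951, 951, 23, 36}};
```

With these counts the runner stops at second about 66% of the time (1951 of 2961) and reaches third about 32% of the time, while scoring or being put out together account for about 2%.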

An at bat yielding an out presents the greatest number of base running possibilities. The simulator does not generate different kinds of outs such as fly, ground or strike outs. The type of out is implicit in the base running choices made following the out. Force outs can occur at any base. For example, if there are runners on first and second, the runner at second is out. Double plays are possible if there are fewer than two outs in the inning. The only type of double play included in the simulator is the common ground ball double play, with the runner on first out at second and the batter out at first. If the third out was not made, base runner advances on outs are possible. These include sacrifice hits (advancing a runner a single base, both first to second and second to third are possible) and sacrifice flies (scoring a runner). Either of these can occur when the lead runner has not made an out. All of these possibilities are conditional, with rates determined by individual team averages. The rates used include all runner advances of the specified kind, not just the officially tabulated sacrifices.

Establishing an event probability requires two quantities: the number of times the event happened and the number of chances there were for the particular event to occur. In some cases, all possible outcomes can be counted. A runner on second following a single has just four possibilities: stay on second, advance to third, score, or be out trying to advance. Many other event possibilities require evaluating particular runner configurations to determine if the event can take place. A stolen second base requires that a runner be on first and none on second. Hitting rates require the number of plate appearances. While the latter quantity can be derived from the event files, the former is not so easily available. The intent of the simulator is to reproduce the numbers of these events. If rates were determined entirely by actual season results, the simulated numbers of events would differ from the actual ones whenever the simulator chances differ from the actual season chances. Therefore, to establish the category chances in the simulator, including plate appearances, the following iterative process has been used. The simulator identifies the particular runner configurations corresponding to these events and counts them. This is done by team. These counts become the basis for the event probabilities in the next iteration.

Statistics for the simulation are accumulated at the season and multiple season level. Adequately determining distributions requires accumulating data from multiple seasons. Typically, 100 - 10000 seasons are simulated depending on the analysis being done.

The random number generator (RNG) is a key component of the simulation. The one used is random() from the GNU software libraries which provides a 31 bit mantissa. It was carefully evaluated using the frequency, two and three number serial tests and run tests described by Knuth (Knuth, D. E. "The Art of Computer Programming. Vol. 2. Seminumerical Algorithms", Addison-Wesley 1969). It satisfactorily passes all these tests. The two and three number serial tests correspond most closely to the usage of the RNG in the simulator. A further test was to compare the simulation results using random() with a second RNG, drand48(), also from the GNU libraries. This RNG uses a different algorithm and provides a number with a 48 bit mantissa. No differences besides expected statistical variations were seen in spite of the much larger mantissa. The routine random() is used because it is significantly faster.

Debugging a simulator is notoriously difficult. Distinguishing between relatively rare events and program bugs is challenging. The final series of tests performed was the generation of event files in the Retrosheet, Inc. format from single season simulations. These files were then processed by the events file analysis program. Lack of error messages from the consistency checking done during event file processing gives considerable additional confidence in the simulator.



1996 American League East Division
     %wins\pl    1    2    3    4    5   wins  std min max act   a-s
 BAL 0.542:    462  336  157   45    0   87.8  6.0  68 105  88   0.2
 NYA 0.539:    409  351  183   57    0   87.3  6.3  65 111  92   4.7
 BOS 0.498:     97  210  397  296    0   80.7  6.2  59 100  85   4.3
 TOR 0.476:     32  103  263  602    0   77.1  6.4  52 103  74  -3.1
 DET 0.289:      0    0    0    0 1000   46.8  5.6  28  65  53   6.2
1996 American League Central Division
     %wins\pl    1    2    3    4    5   wins  std min max act   a-s
 CLE 0.606:    639  324   32    5    0   97.6  6.3  74 120  99   1.4
 CHA 0.581:    351  564   74   10    1   94.2  6.2  73 113  85  -9.2
 MIL 0.485:      5   56  370  322  247   78.6  6.6  58 100  80   1.4
 MIN 0.479:      3   40  297  343  317   77.7  6.5  55  97  78   0.3
 KCA 0.468:      2   16  227  320  435   75.3  6.3  52  95  75  -0.3
1996 American League West Division
     %wins\pl    1    2    3    4   wins  std min max act   a-s
 TEX 0.595:    825  160   15    0   96.4  6.1  80 116  90  -6.4
 SEA 0.545:    163  649  179    9   87.7  6.3  67 110  85  -2.7
 OAK 0.485:     12  182  711   95   78.5  6.2  62  99  78  -0.5
 CAL 0.412:      0    9   95  896   66.3  6.1  45  86  70   3.7
team wins chisq   3.102 prob  0.997
Table 4. 1000 Season Simulation Results 1996 AL (Wins)

Overall results from applying the simulator to the 1996 American League season are given in Table 4. Teams are indicated by a three letter code representing their home city. %wins is the fraction of simulated games won. The next five or four columns, depending on the division, represent the number of times a team finished in that particular place in their division. "wins" is the average number of wins during a season. "std" is the standard deviation of the wins and is followed by the minimum and maximum number of wins during the simulation. The column labeled "act" is the number of wins during the actual season. Finally, "a-s" is the difference between actual and averaged simulated wins. The sign of the result was chosen to be positive when the actual season results had more wins than the simulation average.

The last line in the table is the chi-square of the actual and simulated wins using simulated wins as the expected number. Probability is computed on the basis of 13 degrees of freedom. Other seasons yield similar results.
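
The chi-square in the last line of Table 4 can be reproduced approximately from the tabulated columns. The sketch below uses hypothetical names (winsChiSquare, kSimWins, kActWins) and the rounded "wins" and "act" values from the table, listed in table order.

```cpp
#include <cstddef>
#include <vector>

// Sum of (actual - expected)^2 / expected, with simulated wins as expected.
double winsChiSquare(const std::vector<double>& actual,
                     const std::vector<double>& expected) {
    double chi2 = 0.0;
    for (std::size_t i = 0; i < actual.size() && i < expected.size(); ++i) {
        const double d = actual[i] - expected[i];
        chi2 += d * d / expected[i];
    }
    return chi2;
}

// Table 4, in table order: BAL NYA BOS TOR DET CLE CHA MIL MIN KCA TEX SEA OAK CAL.
const std::vector<double> kSimWins{87.8, 87.3, 80.7, 77.1, 46.8, 97.6, 94.2,
                                   78.6, 77.7, 75.3, 96.4, 87.7, 78.5, 66.3};
const std::vector<double> kActWins{88, 92, 85, 74, 53, 99, 85,
                                   80, 78, 75, 90, 85, 78, 70};
```

Evaluating winsChiSquare(kActWins, kSimWins) gives a value of about 3.09, close to the 3.102 reported. The small gap is presumably due to the rounding of the tabulated averages.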

Table 5 presents simulated and actual season distributions of three run related quantities: season team runs scored, team runs allowed and runners left on base. The column gms is the number of games played during the season. Simulated results are the average of 1000 seasons. The "std" columns correspond to the simulated results. Three chi-square values and resulting probabilities are also given. The low chi-square value for the runs allowed category reflects the minimization procedure outlined in the discussion of the hitting model. In this case the probabilities are evaluated for 13 degrees of freedom. Simulations for the other three seasons produce comparable results.

1996          runs scored              runs allowed             left on base
team  gms     in   sim sm-in   std     in   sim sm-in   std     in   sim sm-in
 BAL  162    949   942    -7  43.7    903   867   -36  40.8   1154  1175    21
 BOS  162    928   913   -15  42.8    921   912    -9  42.3   1251  1273    22
 CAL  161    762   790    28  38.7    943   945     2  44.7   1209  1169   -40
 CHA  162    898   926    28  42.6    794   780   -14  39.0   1231  1252    21
 CLE  161    952   961     9  44.4    769   768    -1  39.3   1224  1238    14
 DET  162    783   735   -48  38.4   1103  1150    47  47.7   1040  1084    44
 KCA  161    746   730   -16  36.4    786   784    -2  37.6   1117  1166    49
 MIL  162    894   868   -26  41.8    899   895    -4  42.1   1198  1244    46
 MIN  162    877   848   -29  41.3    900   882   -18  41.5   1194  1228    34
 NYA  162    871   860   -11  42.1    787   788     1  38.4   1258  1277    19
 OAK  162    861   859    -2  39.9    900   883   -17  41.0   1175  1209    34
 SEA  161    993   959   -34  42.4    895   873   -22  43.7   1238  1277    39
 TEX  162    928   957    29  43.6    799   783   -16  37.5   1253  1260     7
 TOR  162    766   768     2  39.3    809   808    -1  39.6   1169  1185    16
 tot 2264  12208 12116   -92 159.3  12208 12116   -92 159.3  16711 17037   326
sim. runs scored           chi-sq   9.81e+00 prob    0.709
sim. runs allowed          chi-sq   5.39e+00 prob    0.966
sim. left on base          chi-sq   1.18e+01 prob    0.546
Table 5. 1000 Season Run Production Summary 1996 AL

The remaining distributions evaluated as tests on the simulator are runs per game, runs per inning and game length in innings (Table 6).

Runs per game distribution
team     0     1     2     3     4     5     6     7     8     9  >=10
 sim    76   160   232   274   288   273   239   197   154   115   258
 act    79   162   248   280   289   260   237   166   142   116    85
x-sq   0.2   0.0   1.2   0.1   0.0   0.6   0.0   4.9   0.9   0.0
for league chisq:     23.7 prob:    0.127 (17 DOF)
Runs per inning distribution
         0     1     2     3     4     5     6     7     8     9  >=10
 sim 13870  3396  1614   789   369   163    71    30    13     5     4
 act 14015  3211  1592   778   411   172    92    34    12     2     2
x-sq   1.5  10.1   0.3   0.1   4.7   0.5   6.3   0.6   0.1
for league chisq:     24.2 prob:    0.004 (9 DOF)
Game length in innings distribution
         9    10    11    12    13    14    15    16    17  >=18
 sim  2065    99    49    25    13     6     3     2     1     1
 act  2040   120    34    38    14     4     8     2     0     0
x-sq   0.3   4.3   4.7   6.5   0.1
for league chisq:     15.9 prob:    0.007 (5 DOF)
Table 6. Three Distributions, 1996 AL

Table 6 tabulates league totals, not individual team results. Simulated (sim), actual season (act) and individual chi-square (x-sq) contributions are given for each of the three distributions. The chi-square is only tabulated where there are > 5 runs or innings depending on the distribution in the particular histogram bin. The total degrees of freedom used in the probability calculation are also given. The simulated values are the averages of a 1000 season simulation.
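
The per-bin contributions in Table 6 can be checked with a short calculation. In this sketch the names (BinnedChiSquare, binnedChiSquare, kSimRunsPerInning, kActRunsPerInning) are hypothetical, and the "> 5" cutoff is applied to the simulated count in each bin, which is one reading of the rule stated above. The data are the runs per inning rows of the table.

```cpp
#include <cstddef>
#include <vector>

struct BinnedChiSquare {
    double total = 0.0;
    int binsUsed = 0;
};

// Adds (act - sim)^2 / sim for every bin whose simulated count exceeds minCount.
BinnedChiSquare binnedChiSquare(const std::vector<double>& sim,
                                const std::vector<double>& act,
                                double minCount = 5.0) {
    BinnedChiSquare result;
    for (std::size_t i = 0; i < sim.size() && i < act.size(); ++i) {
        if (sim[i] <= minCount) continue;  // bin too sparse to include
        const double d = act[i] - sim[i];
        result.total += d * d / sim[i];
        ++result.binsUsed;
    }
    return result;
}

// Runs per inning distribution from Table 6 (bins 0 through 9, then >=10).
const std::vector<double> kSimRunsPerInning{13870, 3396, 1614, 789, 369, 163,
                                            71, 30, 13, 5, 4};
const std::vector<double> kActRunsPerInning{14015, 3211, 1592, 778, 411, 172,
                                            92, 34, 12, 2, 2};
```

For these rows nine bins pass the cutoff and the total comes to about 24.1, in line with the 24.2 shown in the table. The largest single contribution, about 10.1, comes from the one run bin.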

Using the simulator to investigate batting order and the effects of varying team parameters such as the relative value of stolen bases and runners caught stealing requires sufficient season simulations to produce a standard error about 5 times smaller than the smallest change in season wins considered significant. Since the standard error is the standard deviation divided by the square root of the number of simulated seasons, it takes 1000 simulated seasons to produce a standard error of 0.2 wins allowing win differences of 1 to be considered significant. Ten thousand season simulations yield a minimum significant difference of 0.3 wins per season.
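
The arithmetic behind these figures is brief enough to write out. The function names are hypothetical, and the factor of 5 is the ratio stated above between the smallest significant difference and the standard error.

```cpp
#include <cmath>

// Standard error of the mean season wins over a number of simulated seasons.
double standardError(double stdDev, int seasons) {
    return stdDev / std::sqrt(static_cast<double>(seasons));
}

// Smallest change in season wins treated as significant.
double minSignificantDifference(double stdDev, int seasons,
                                double factor = 5.0) {
    return factor * standardError(stdDev, seasons);
}
```

With a typical standard deviation of 6.2 wins from Table 4, 1000 seasons give a standard error of 6.2 / 31.6, or about 0.2 wins, and a minimum significant difference of about 1 win. Ten thousand seasons give 6.2 / 100 = 0.062 and a minimum difference of about 0.3 wins.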

The simulator was designed and coded for time performance. Typically, a 14 team 162 game season simulation requires 217000 random number evaluations and takes approximately 0.6 seconds on a Power Mac 6100/66 system. Even with this relatively quick simulation time, many hours of computation are needed to provide repeatable results for changes in strategy that might produce a 1 run per season difference. The implementation language is C++. The simulator software compiles and executes on the Macintosh as well as GNU and Silicon Graphics, Inc. UNIX systems.


Copyright 1997, John F. Jarvis