An accurate simulator can be used to investigate the efficacy of
different tactical options such as batting orders and stolen bases,
assess the effects of errors and gain insight into how improbable a
particular season outcome is.
The discrete nature of the outcome of an at bat is the key to creating Monte Carlo simulations of the game. At any point in the game, one of several outcomes is possible. These are characterized by their probability of occurring, and a random number is used to choose one outcome based on the probabilities.
Three strong assumptions have guided the design and implementation of the simulator.
First, no attempt is made to include strategy or tactics in the program. All types of events included in the simulator take place at their season average rates. Given the intense tactical game projected by major league baseball, I feel this is a very strong assumption.
Second, I assume there is no correlation between outcomes of successive at bats. Sports commentators focus on clutch hitting (or the lack of it), hitting streaks and other short term trends. However, statistical support for the notion of better hitting with a runner in scoring position is weak. There is significantly more support for a home and away difference in team performance. Still, relying on season averages seems to be the simplest and best simulator implementation strategy.
Third, no attempt is made to model individual players. Hitting is modeled at the level of detail of the batting order using season averages at each position. Base running is accomplished using team averages. Pitching is summarized by a single parameter, allowed runs per game. If individual player statistics were to be used, the simulator would have to implement a substitution strategy. This would contradict the first assumption.
In the discussion of the simulator that follows, reference will be made to a large number of detailed statistics characterizing each team. These performance parameters are derived from the full season play-by-play descriptions for the leagues and are prepared by a parsing program.
Errors are implemented using defensive team averages. Two classes of errors are used: those that allow the batter to advance to first and those that allow runners on base to advance. Values used for an error class are the sum of all categories that produce the same result. For example, to get the number of runner advancing errors used by the simulator, the official runner advancing class of errors is added to the number of balks, wild pitches and passed balls that the team made. The implementation rationale is that if the result is the same, the event is not distinguished as a separate category.
Simulated games follow the flow of an actual game. The visiting team bats first and continues until three outs are made. The home team follows and does the same. If the home team is ahead after the middle of the ninth inning, the game is won. When the score is tied after nine innings, extra innings are played until the tie is broken. If the home team wins in the ninth or in extra innings, only as much of its last inning as is needed to win is played. Runs scored in the home half of the last inning are credited according to the Major League rules. Each team plays half its games as the home team and half as the visiting team. While there is no home field advantage programmed in the simulator, alternating in this way prevents biases in the simulation statistics due to the rules for ending a game. A complete set of games defined by actual season pairings defines a simulated season.
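The game flow just described can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the simulator's actual code: the names GameResult, HalfInning and playGame are hypothetical, and the half inning itself is left as a caller-supplied function that is told how many runs the home team needs to end the game (0 when no such limit applies).

```cpp
#include <cassert>
#include <functional>

// Hypothetical result record for one simulated game.
struct GameResult {
    int visitorRuns = 0;
    int homeRuns = 0;
    int innings = 0;
};

// Hypothetical half-inning callback: arguments are (homeBatting, runsToWin).
// runsToWin is 0 unless the home team can end the game in this half inning,
// in which case it is the number of runs needed to take the lead.
using HalfInning = std::function<int(bool, int)>;

GameResult playGame(const HalfInning& halfInning) {
    GameResult g;
    for (int inning = 1;; ++inning) {
        g.innings = inning;
        g.visitorRuns += halfInning(false, 0);   // visiting team bats first
        const bool canEnd = inning >= 9;
        // Home team ahead after the middle of the ninth (or later): game over.
        if (canEnd && g.homeRuns > g.visitorRuns) break;
        const int runsToWin = canEnd ? g.visitorRuns - g.homeRuns + 1 : 0;
        assert(runsToWin >= 0);
        g.homeRuns += halfInning(true, runsToWin);
        // After nine or more full innings, any difference in score ends the game.
        if (canEnd && g.homeRuns != g.visitorRuns) break;
    }
    return g;
}
```

Passing the run deficit into the home half is one way to let the half-inning code stop as soon as the winning run scores; how the original program handles this is not stated.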
A plate appearance is modeled by first determining if a stolen base is possible. If one is possible, attempts to steal third and second are considered separately, in that order, and the result of each is determined. The batting team season averages are used to determine the possible outcomes, which are no attempt, success or caught stealing. The simulator can create double steals, with runners on first and second stealing second and third on the same play. However, these are artificially rare as the simulator does not generate a true double steal. Caught stealing rates include pick off plays at the starting base. Stealing home occurs so infrequently that it was not programmed into the simulator. Runner advancing errors can occur on stolen base attempts.
Following the stolen base evaluation, an at bat is simulated. First, a check is made for a defensive error allowing the batter to proceed to first. Error rates are based on the fielding team statistics. If an error occurs, all runners on base also advance one base. If no error occurs, which is the most likely case, the batting outcome is simulated. Events that can occur are walks, singles, doubles, triples, home runs and outs. A separate set of probabilities is maintained for each batting order position. Since the probabilities for these events must total 1.0, there is an implied "out" column containing 1.0. Probabilities are based on all plate appearances, not just official at bats, so they are only proportional to batting averages and on base percentages. The probability for an event is the difference between the value in its column and the preceding column (0 for home runs). These probabilities for a single team, displayed as an array, follow:
Position  Home Run  Triple  Double  Single  Walk
1:        0.0195    0.0255  0.0540  0.2099  0.2924
2:        0.0215    0.0246  0.0631  0.2277  0.3446
3:        0.0377    0.0425  0.0833  0.2358  0.3569
4:        0.0434    0.0450  0.0916  0.2428  0.3585
5:        0.0461    0.0510  0.0872  0.2155  0.3487
6:        0.0438    0.0489  0.1180  0.2580  0.3592
7:        0.0467    0.0536  0.0813  0.2388  0.3339
8:        0.0071    0.0160  0.0410  0.1925  0.2870
9:        0.0092    0.0129  0.0460  0.1526  0.2096
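Because the array holds cumulative values, the probability of a single event has to be recovered by differencing adjacent columns. A minimal C++ sketch of that step follows; the function name eventProbabilities and the constant kEvents are illustrative and not taken from the simulator.

```cpp
#include <array>
#include <cassert>

// Column order follows the array above: home run, triple, double, single, walk.
constexpr int kEvents = 5;

// Converts one cumulative row into six individual probabilities:
// the five on base events followed by the implied out.
std::array<double, kEvents + 1> eventProbabilities(
        const std::array<double, kEvents>& cumulative) {
    std::array<double, kEvents + 1> p{};
    double previous = 0.0;                 // preceding column is 0 for home runs
    for (int i = 0; i < kEvents; ++i) {
        assert(cumulative[i] >= previous); // cumulative values must not decrease
        p[i] = cumulative[i] - previous;
        previous = cumulative[i];
    }
    p[kEvents] = 1.0 - previous;           // implied "out" column is 1.0
    return p;
}
```

For batting position 1 this gives 0.0195 for a home run, 0.0060 for a triple, 0.0285 for a double, 0.1559 for a single, 0.0825 for a walk and 0.7076 for an out.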
The first step in determining an at bat result is to generate a number that can be compared to the hitting probability table. A value, S, that represents the effects of the opposing pitching and defense in general is computed:
In (1) Lra is the league average for runs allowed per game. Tra is
the same quantity for the defensive team. When the team and league
runs allowed per game are the same, S = 1. If Tra < Lra, that is
pitching and overall defense are better than the league average, then
S > 1 which will decrease the probability of hits and walks.
Similarly, when Tra > Lra, S < 1 and the probability of getting
on base increases. The constant a determines the strength of this
change and is determined by minimizing the chi-square statistic for
team runs allowed evaluated for all the teams in a league. Separate
minimizations have been done for each season data set. While the
values determined for a differ slightly, the differences are
sufficiently small that a single value is used for all simulations:
a = 0.44. Also, total runs allowed are used, not the more commonly
referenced "earned runs". All runs count equally, earned or not, and
the intent of the simulation is to reproduce season results, not
choose between pitchers. This method is entirely empirical and does
not purport to represent the actual hitter-pitcher interaction.
Justification for it is the very small values of the runs allowed
chi-square statistic achieved. The value of S needs to be determined
for each team just once per game.
Given S, the quantity to be used to choose the at bat result is computed:
The function randomf() returns a pseudo random number in the range
0.0 to 1.0. The scaled random number, R, is compared to the array of
probabilities (Table 1) to determine the result of the at bat. If R
is greater than the appropriate value in the walk column the at bat
result is an out. If not an out the same R is compared to the singles
column. Again, if greater than this value, a walk is the at bat
result. The interpretation of the entries is the probability of the
particular event added to the probabilities of the previous events.
Continuing in sequence, if R is less than the value under home run
column a home run is the result. The order of testing is done from
more likely to less likely outcomes (triples are slightly out of
order) to minimize the number of tests needed. When S > 1, a larger
fraction of the 0-1 range of the RNG is greater than the on base
event thresholds, reducing the probability of getting on base. With
S < 1 the converse is true.
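The comparison sequence can be sketched in C++ as shown here. The sketch assumes that the scaled random number is the product R = S * randomf(), which is consistent with the description of how S > 1 reduces the chance of reaching base but is not confirmed by the text. The names AtBat, CumulativeRow, atBatResult and scaledRandom are hypothetical.

```cpp
#include <array>
#include <cassert>

enum class AtBat { HomeRun, Triple, Double, Single, Walk, Out };

// One row of the cumulative array: home run, triple, double, single, walk.
using CumulativeRow = std::array<double, 5>;

// Assumed scaling (not confirmed by the source): R = S * u,
// where u stands for the value returned by randomf().
double scaledRandom(double s, double u) {
    assert(s > 0.0 && u >= 0.0 && u <= 1.0);
    return s * u;
}

// Tests run from the walk column down to the home run column, so the
// most likely result (an out) is decided by the first comparison.
AtBat atBatResult(const CumulativeRow& row, double r) {
    if (r > row[4]) return AtBat::Out;
    if (r > row[3]) return AtBat::Walk;
    if (r > row[2]) return AtBat::Single;
    if (r > row[1]) return AtBat::Double;
    if (r > row[0]) return AtBat::Triple;
    return AtBat::HomeRun;
}
```

With the position 1 row and a raw random value of 0.2, S = 1 gives R = 0.2 and a single, while S = 1.5 gives R = 0.3, which exceeds the walk column value of 0.2924 and so produces an out.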
For each possible at bat result, the appropriate base running, some of it conditional, is done. Home runs are the simplest at bat outcome to process. The batter and all runners on base score. Triples are almost as easy. All base runners score and the batter goes to third. Doubles present a slightly more complicated situation. Runners on second and third score unconditionally. The slight chance (Table 2) for a runner on second not scoring is ignored. There are three possibilities for a runner on first: score, go to third or be out trying to score. The choice is made using a random number to select one of the three possibilities. Singles are processed in the same general way, although there are more possibilities. The runner on third scores. Base runners on first and second have probabilistic outcomes. Table 2 tabulates the overall advance patterns for the 1995 American League. Team values are used for these quantities in the simulator, except for improbable events such as a runner on first scoring (1-h) or being out (1-x) after a single, where the league average is used for all teams. Read the headings as the starting base to the final base on the play. An x indicates an out made while advancing from the specified base and h indicates a score.
Errors may occur on any hit except the home run and advance all runners one base. Error rates used are from defensive team statistics and include all event types that can advance a base runner (fielding, balk, wild pitch and passed ball).
Lead runner advance on single
 1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x   3-3   3-h   3-x
1951   951    23    36    19   687  1365    54    12  1629     1

Runner on third, single, next runner advances
 1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x
1008   525    22    33     8   232   460    18

Lead runner advance on double
 1-3   1-h   1-x   2-3   2-h   2-x   3-h
 497   321    28    10   636     1   463

Lead runner advance on double play
 2-2   2-3   3-3   3-h
   5   153    10    82
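As an illustration of how such counts can drive a probabilistic base running choice, the following C++ sketch selects the fate of a runner on first after a double. It uses the league counts for 1-3, 1-h and 1-x on a double (497, 321 and 28) purely as example inputs; the simulator itself is described as using team values. The names FirstOnDouble and runnerOnFirstAfterDouble are hypothetical.

```cpp
#include <cassert>

enum class FirstOnDouble { ToThird, Scores, OutTryingToScore };

// u is a uniform random number in [0, 1). The three counts are the
// observed numbers of each outcome (for example 497, 321 and 28).
FirstOnDouble runnerOnFirstAfterDouble(double u, int toThird, int scores, int out) {
    assert(u >= 0.0 && u < 1.0);
    const double total = static_cast<double>(toThird) + scores + out;
    assert(total > 0.0);
    // Cumulative thresholds built from the counts.
    if (u < toThird / total) return FirstOnDouble::ToThird;
    if (u < (toThird + scores) / total) return FirstOnDouble::Scores;
    return FirstOnDouble::OutTryingToScore;
}
```

With the example counts the thresholds are about 0.587 and 0.967, so a runner on first stops at third a little under 59% of the time and is thrown out about 3% of the time.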
An at bat yielding an out presents the greatest number of base running possibilities. The simulator does not generate different kinds of outs such as fly, ground or strike outs. The type of out is implicit in the base running choices made following the out. Force outs can occur at any base. For example, if there are runners on first and second, the runner at second is out. Double plays are possible if there are fewer than two outs in the inning. The only type of double play included in the simulator is the common ground ball double play, with the runner on first out at second and the batter out at first. If the third out was not made, base runner advances on outs are possible. These include sacrifice hits (advancing a runner a single base, both first to second and second to third are possible) and sacrifice flies (scoring a runner). Either of these can occur when the lead runner has not made an out. All of these possibilities are conditional, with rates determined by individual team averages. The rates used include all runner advances of the specified kind, not just the officially tabulated sacrifices.
Establishing an event probability requires two quantities, the
number of times the event happened and the number of chances there
were for the particular event to occur. In some cases, all possible
outcomes can be counted. A runner on second following a single has
just four possibilities: stay on second, advance to third, score, or
be out trying to advance. Many other event possibilities require
evaluating particular runner configurations to determine if the event
can take place. A stolen second base requires that a runner be on
first and none on second. Hitting rates require the number of plate
appearances. While the latter quantity can be derived from the event
files, the former is not so easily available. The intent of the
simulator is to reproduce the numbers of these events. If the chances
arising in the simulator differ from the actual season chances, rates
determined entirely from actual season results would produce
different numbers of events. Therefore, to establish the
category chances in the simulator, including plate appearances, the
following iterative process has been used. The simulator identifies
the particular runner configurations corresponding to these events
and counts them. This is done by team. These counts become the basis
for the event probabilities in the next iteration.
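One possible reading of this iterative process is sketched below in C++. It assumes that each event rate is the actual season event count divided by the number of chances counted in the simulator, and that the chance count is refreshed by rerunning the simulation with the updated rate. The function calibrateRate and the callback simulateChances are hypothetical stand-ins; the original procedure may differ in detail.

```cpp
#include <cassert>
#include <functional>

// simulateChances(rate) is a hypothetical callback returning the average
// number of chances per season that the simulator counts when the event
// is generated at the given rate.
double calibrateRate(double actualEvents, double initialChances,
                     const std::function<double(double)>& simulateChances,
                     int iterations) {
    assert(actualEvents >= 0.0 && initialChances > 0.0 && iterations >= 0);
    double rate = actualEvents / initialChances;   // starting estimate
    for (int i = 0; i < iterations; ++i) {
        const double chances = simulateChances(rate);
        assert(chances > 0.0);
        rate = actualEvents / chances;             // rate for the next iteration
    }
    return rate;
}
```

The point of the loop is that the rate is tied to the chances the simulator actually generates, so that the expected number of simulated events matches the actual season count.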
Statistics for the simulation are accumulated at the season and multiple season level. Adequately determining distributions requires accumulating data from multiple seasons. Typically, 100 - 10000 seasons are simulated depending on the analysis being done.
The random number generator (RNG) is a key component of the simulation. The one used is random() from the GNU software libraries, which provides a 31 bit mantissa. It was carefully evaluated using the frequency, two and three number serial tests and run tests described by Knuth (Knuth, D. E. "The Art of Computer Programming. Vol. 2. Seminumerical Algorithms", Addison-Wesley 1969). It satisfactorily passes all these tests. The two and three number serial tests correspond most closely to the usage of the RNG in the simulator. A further test was to compare the simulation results using random() with a second RNG, drand48(), also from the GNU libraries. This RNG uses a different algorithm and provides a number with a 48 bit mantissa. No differences besides expected statistical variations were seen in spite of the much larger mantissa. The routine random() is used because it is significantly faster.
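For readers unfamiliar with the frequency test mentioned above, a minimal C++ sketch follows. It is not the test code used for the simulator: it substitutes the standard library generator std::mt19937 for random() so that it is portable, and the function name frequencyChiSquare and its parameters are illustrative. The test bins uniform values and computes the chi-square of the bin counts against equal expected counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Frequency test: chi-square of observed bin counts against a uniform
// expectation. std::mt19937 is a stand-in for the generator under test.
double frequencyChiSquare(std::uint32_t seed, int bins, int samples) {
    assert(bins > 1 && samples > 0);
    std::mt19937 gen(seed);
    std::vector<long> counts(static_cast<std::size_t>(bins), 0L);
    for (int i = 0; i < samples; ++i) {
        const double u = gen() / 4294967296.0;   // 32 bit output mapped to [0, 1)
        ++counts[static_cast<std::size_t>(u * bins)];
    }
    const double expected = static_cast<double>(samples) / bins;
    double chi = 0.0;
    for (long c : counts) {
        const double d = static_cast<double>(c) - expected;
        chi += d * d / expected;
    }
    return chi;   // compare with a chi-square distribution, bins - 1 DOF
}
```

A generator that passes should give a value that is neither far above nor suspiciously far below the number of degrees of freedom.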
Debugging a simulator is notoriously difficult. Distinguishing between relatively rare events and program bugs is challenging. The final series of tests performed was the generation of event files in the Retrosheet, Inc. format from single season simulations. These files were then processed by the events file analysis program. The lack of error messages from the consistency checking done during event file processing gives considerable additional confidence in the simulator.
1996 American League East Division
     %wins\pl     1     2     3     4     5   wins   std   min   max   act    a-s
BAL   0.542:    462   336   157    45     0   87.8   6.0    68   105    88    0.2
NYA   0.539:    409   351   183    57     0   87.3   6.3    65   111    92    4.7
BOS   0.498:     97   210   397   296     0   80.7   6.2    59   100    85    4.3
TOR   0.476:     32   103   263   602     0   77.1   6.4    52   103    74   -3.1
DET   0.289:      0     0     0     0  1000   46.8   5.6    28    65    53    6.2

1996 American League Central Division
     %wins\pl     1     2     3     4     5   wins   std   min   max   act    a-s
CLE   0.606:    639   324    32     5     0   97.6   6.3    74   120    99    1.4
CHA   0.581:    351   564    74    10     1   94.2   6.2    73   113    85   -9.2
MIL   0.485:      5    56   370   322   247   78.6   6.6    58   100    80    1.4
MIN   0.479:      3    40   297   343   317   77.7   6.5    55    97    78    0.3
KCA   0.468:      2    16   227   320   435   75.3   6.3    52    95    75   -0.3

1996 American League West Division
     %wins\pl     1     2     3     4   wins   std   min   max   act    a-s
TEX   0.595:    825   160    15     0   96.4   6.1    80   116    90   -6.4
SEA   0.545:    163   649   179     9   87.7   6.3    67   110    85   -2.7
OAK   0.485:     12   182   711    95   78.5   6.2    62    99    78   -0.5
CAL   0.412:      0     9    95   896   66.3   6.1    45    86    70    3.7

team wins chisq 3.102 prob 0.997
Overall results from applying the simulator to the 1996 American League season are given in Table 4. Teams are indicated by a three letter code representing their home city. %wins is the fraction of simulated games won. The next five or four columns, depending on the division, represent the number of times a team finished in that particular place in its division. "wins" is the average number of wins during a season. "std" is the standard deviation of the wins and is followed by the minimum and maximum number of wins during the simulation. The column labeled "act" is the number of wins during the actual season. Finally, "a-s" is the difference between actual and averaged simulated wins. The sign of the result was chosen to be positive when the actual season results had more wins than the simulation.
The last line in the table is the chi-square of the actual and simulated wins using simulated wins as the expected number. Probability is computed on the basis of 13 degrees of freedom. Other seasons yield similar results.
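The chi-square described here can be written out as a short C++ function. This is a sketch of the standard formula, the sum over teams of (actual - simulated)^2 / simulated, with simulated wins as the expected values; the function name chiSquare is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Chi-square of observed values against expected values:
// sum over i of (observed[i] - expected[i])^2 / expected[i].
double chiSquare(const std::vector<double>& observed,
                 const std::vector<double>& expected) {
    assert(observed.size() == expected.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < observed.size(); ++i) {
        assert(expected[i] > 0.0);
        const double d = observed[i] - expected[i];
        sum += d * d / expected[i];
    }
    return sum;
}
```

Applied to the 14 "act" and "wins" values printed in the table, this gives approximately 3.09, close to the tabulated 3.102. The small gap is presumably due to the simulated averages being printed to one decimal place.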
Table 5 presents simulated and actual season distributions of three run related quantities: season team runs scored, team runs allowed and runners left on base. The column gms is the number of games played during the season. Simulated results are the average of 1000 seasons. The "std" columns correspond to the simulated results. Three chi-square values and resulting probabilities are also given. The low chi-square value for the runs allowed category reflects the minimization procedure outlined in the discussion of the hitting model. In this case the probabilities are evaluated for 13 degrees of freedom. Simulations for the other three seasons produce comparable results.
1996              runs scored                  runs allowed              left on base
team   gms     in    sim  sm-in    std      in    sim  sm-in    std      in    sim  sm-in
BAL    162    949    942     -7   43.7     903    867    -36   40.8    1154   1175     21
BOS    162    928    913    -15   42.8     921    912     -9   42.3    1251   1273     22
CAL    161    762    790     28   38.7     943    945      2   44.7    1209   1169    -40
CHA    162    898    926     28   42.6     794    780    -14   39.0    1231   1252     21
CLE    161    952    961      9   44.4     769    768     -1   39.3    1224   1238     14
DET    162    783    735    -48   38.4    1103   1150     47   47.7    1040   1084     44
KCA    161    746    730    -16   36.4     786    784     -2   37.6    1117   1166     49
MIL    162    894    868    -26   41.8     899    895     -4   42.1    1198   1244     46
MIN    162    877    848    -29   41.3     900    882    -18   41.5    1194   1228     34
NYA    162    871    860    -11   42.1     787    788      1   38.4    1258   1277     19
OAK    162    861    859     -2   39.9     900    883    -17   41.0    1175   1209     34
SEA    161    993    959    -34   42.4     895    873    -22   43.7    1238   1277     39
TEX    162    928    957     29   43.6     799    783    -16   37.5    1253   1260      7
TOR    162    766    768      2   39.3     809    808     -1   39.6    1169   1185     16
----
tot   2264  12208  12116    -92  159.3   12208  12116    -92  159.3   16711  17037    326

sim. runs scored  chi-sq 9.81e+00 prob 0.709
sim. runs allowed chi-sq 5.39e+00 prob 0.966
sim. left on base chi-sq 1.18e+01 prob 0.546
The remaining distributions evaluated as tests on the simulator
are runs per game, runs per inning and game length in innings (Table 6).

Runs per game distribution
team      0      1      2      3      4      5      6      7      8      9   >=10
sim      76    160    232    274    288    273    239    197    154    115    258
act      79    162    248    280    289    260    237    166    142    116     85
x-sq    0.2    0.0    1.2    0.1    0.0    0.6    0.0    4.9    0.9    0.0
for league chisq: 23.7 prob: 0.127 (17 DOF)

Runs per inning distribution
          0      1      2      3      4      5      6      7      8      9   >=10
sim   13870   3396   1614    789    369    163     71     30     13      5      4
act   14015   3211   1592    778    411    172     92     34     12      2      2
x-sq    1.5   10.1    0.3    0.1    4.7    0.5    6.3    0.6    0.1
for league chisq: 24.2 prob: 0.004 (9 DOF)

Game length in innings distribution
          9     10     11     12     13     14     15     16     17   >=18
sim    2065     99     49     25     13      6      3      2      1      1
act    2040    120     34     38     14      4      8      2      0      0
x-sq    0.3    4.3    4.7    6.5    0.1
for league chisq: 15.9 prob: 0.007 (5 DOF)
Table 6 tabulates league totals, not individual team results.
Simulated (sim), actual season (act) and individual chi-square (x-sq)
contributions are given for each of the three distributions. The
chi-square is only tabulated where there are > 5 runs or innings
depending on the distribution in the particular histogram bin. The
total degrees of freedom used in the probability calculation are also
given. The simulated values are the averages of a 1000 season simulation.
Using the simulator to investigate batting order and the effects of varying team parameters, such as the relative value of stolen bases and runners caught stealing, requires sufficient season simulations to produce a standard error about 5 times smaller than the smallest change in season wins considered significant. Since the standard error is the standard deviation divided by the square root of the number of simulated seasons, it takes 1000 simulated seasons to produce a standard error of 0.2 wins, allowing win differences of 1 to be considered significant. Ten thousand season simulations yield a minimum significant difference of 0.3 wins per season.
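The arithmetic behind these figures can be checked with a few lines of C++. The sketch assumes a standard deviation of about 6.3 wins per season, a typical value from the "std" column of the wins table, and uses the factor of 5 stated above. The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Standard error of the mean over a number of simulated seasons.
double standardError(double stdDev, int seasons) {
    assert(stdDev >= 0.0 && seasons > 0);
    return stdDev / std::sqrt(static_cast<double>(seasons));
}

// Smallest difference in season wins treated as significant, taken
// as a multiple (5 in the text) of the standard error.
double minSignificantDifference(double stdDev, int seasons, double factor = 5.0) {
    return factor * standardError(stdDev, seasons);
}
```

With a standard deviation of 6.3, 1000 seasons give a standard error of about 0.20 wins and a minimum significant difference of about 1.0 win. For 10000 seasons the values are about 0.063 and 0.3.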
The simulator was designed and coded for time performance. Typically, a 14 team 162 game season simulation requires 217000 random number evaluations and takes approximately 0.6 seconds on a Power Mac 6100/66 system. Even with this relatively quick simulation time, many hours of computation are needed to provide repeatable results for changes in strategy that might produce a 1 run per season difference. The implementation language is C++. The simulator software compiles and executes on the Macintosh as well as GNU and Silicon Graphics, Inc. UNIX systems.