John F. Jarvis

In their 1977 paper, "An Offensive Earned-Run Average for Baseball", Operations Research, Vol. 25, No. 5, September-October 1977, pp. 729-740, Thomas Cover and Carroll Keilers formulate an effective method for estimating the expected future runs, EFR, for any inning state from six readily available hitting parameters: at bats, hits, doubles, triples, home runs and bases on balls. The on-base component of the state is determined by the base runners, with a runner on third assigned a value of 4, a runner on second 2 and a runner on first 1. Base values range from 0 (no one on base) to 7 (bases loaded). The inning state value is the base value + 8*outs. State 0 is nobody out and nobody on base, while state 23 is bases loaded with 2 outs. My state 0 is the same as the Cover-Keilers state 1. Assignment of the remaining 23 states to particular base runner configurations and outs is an implementation task.
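The state numbering just described can be sketched in a few lines. This is an illustration in Python rather than the C/C++ of the linked functions, and the names encode_state and decode_state are hypothetical.

```python
def encode_state(first, second, third, outs):
    """Return the inning state number: base value + 8 * outs."""
    bases = (1 if first else 0) + (2 if second else 0) + (4 if third else 0)
    return bases + 8 * outs


def decode_state(state):
    """Split a state number back into (first, second, third, outs)."""
    outs, bases = divmod(state, 8)
    return bool(bases & 1), bool(bases & 2), bool(bases & 4), outs


print(encode_state(False, False, False, 0))  # 0: nobody on, nobody out
print(encode_state(True, True, True, 2))     # 23: bases loaded, two outs
print(decode_state(13))                      # (True, False, True, 1)
```

State 13, for example, decodes to runners on first and third with one out.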

The Offensive Earned Run Average, OERA, is defined as 9 times the expected runs starting from the state with no outs and the bases empty. The interpretation of this quantity is the estimated number of runs scored in a nine inning game.

Two different implementations of this method have been made. Source code for both of the functions is available and links to them will be given at the end of this document.

The paper derives a matrix equation of order 24 and indicates that the 24 EFR values, E, are obtained from the solution to the matrix equation:

    E = Q E + R

which can be written

    (I - Q) E = R

Q is the 24*24 matrix of probabilities of moving from one state to another on a single plate appearance. R gives, for each state, the expected number of runs that score on that plate appearance. Vectors E and R have a length of 24. I is the identity matrix: 1 on its diagonal, 0 everywhere else. With Q' = I - Q, solving equation sets of the form Q'E = R is a standard numerical analysis procedure.

The OERA paper provides a minimal (section 3), but adequate, description of how to specify the Q matrix and R vector. Compare the code in the numerical solution for building matrix Q' and vector R with their description.

Does this implementation work? Cover and Keilers include many examples. Following is one from their paper. The lifetime hitting statistics for Ted Williams are 7706 at bats, 2654 hits, 525 doubles, 71 triples, 521 home runs and 2018 walks. Applying the OERA calculation to these numbers gives the OERA as 13.20 runs (remember OERA is 9*e[0]), the same value given in the paper. The interpretation is that a team of 9 Ted Williams would score an average of 13.2 runs per game.
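The calculation described above can be sketched as follows. This is a minimal Python illustration, not the linked C/C++ code. The runner advancement rules in advance() are not spelled out in this document; they are assumptions based on a reading of the Cover-Keilers model (walks advance runners only when forced, a single scores runners from second and third and moves a runner from first to third, longer hits clear the bases, and runners hold on outs). Plain Gaussian elimination stands in for lusolve(), and all function names are hypothetical.

```python
def advance(bases, event):
    """Return (new_bases, runs_scored) for a non-out event (assumed rules)."""
    on1, on2, on3 = bases & 1, (bases >> 1) & 1, (bases >> 2) & 1
    runners = on1 + on2 + on3
    if event == "walk":                      # runners move only when forced
        if bases == 7:
            return 7, 1
        if on1 and on2:
            return 7, 0
        return (bases | 2, 0) if on1 else (bases | 1, 0)
    if event == "single":                    # first to third, others score
        return (5 if on1 else 1), on2 + on3
    if event == "double":
        return 2, runners
    if event == "triple":
        return 4, runners
    return 0, runners + 1                    # home run


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, n):
            f = m[k][col] / m[col][col]
            for c in range(col, n + 1):
                m[k][c] -= f * m[col][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        tail = sum(m[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (m[k][n] - tail) / m[k][k]
    return x


def expected_future_runs(ab, hits, doubles, triples, homers, walks,
                         outs_per_inning=3):
    """Return the EFR vector E, indexed by base value + 8 * outs."""
    pa = ab + walks                          # plate appearances in the model
    p = {
        "out": (ab - hits) / pa,
        "walk": walks / pa,
        "single": (hits - doubles - triples - homers) / pa,
        "double": doubles / pa,
        "triple": triples / pa,
        "homer": homers / pa,
    }
    n = 8 * outs_per_inning
    q = [[0.0] * n for _ in range(n)]
    r = [0.0] * n
    for state in range(n):
        outs, bases = divmod(state, 8)
        if outs + 1 < outs_per_inning:       # the last out ends the inning
            q[state][state + 8] += p["out"]
        for event in ("walk", "single", "double", "triple", "homer"):
            new_bases, runs = advance(bases, event)
            q[state][8 * outs + new_bases] += p[event]
            r[state] += p[event] * runs
    q_prime = [[(i == j) - q[i][j] for j in range(n)] for i in range(n)]
    return solve(q_prime, r)


williams = expected_future_runs(7706, 2654, 525, 71, 521, 2018)
print("Ted Williams OERA: %.2f" % (9 * williams[0]))
```

If the assumed rules match the paper, the printed value should agree with the 13.20 quoted above. The outs_per_inning argument only changes the number of states, so the same sketch covers innings of any length.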

Pete Palmer and John Thorn in "The Hidden Game of Baseball", page 153, provide an often referenced table of expected future runs, EFR, developed from the 1900-1977 season data. EFR is useful for comparing various game strategies. For example, the EFR after a successful stolen base can be compared to the EFR following a caught stealing, providing an estimate of the fraction of stolen base attempts that must be successful to reach the break even point. Rearranging the Hidden Game table into the form I use:

outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     0.454  0.783  1.068  1.380  1.277  1.639  1.946  2.254
1     0.249  0.478  0.699  0.888  0.897  1.088  1.371  1.546
2     0.095  0.209  0.348  0.457  0.382  0.494  0.661  0.798

Inning State: - indicates the base is empty, a number indicates a runner is on the specified base.
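The stolen base comparison mentioned above can be worked directly from this table. The sketch below is an illustration only: the function name break_even_rate is hypothetical, and it looks at a steal of second with nobody out, using the three table entries noted in the comments.

```python
def break_even_rate(efr_before, efr_success, efr_failure):
    """Success rate at which attempting the play leaves expected runs unchanged."""
    return (efr_before - efr_failure) / (efr_success - efr_failure)


rate = break_even_rate(
    efr_before=0.783,   # runner on first, 0 outs
    efr_success=1.068,  # runner on second, 0 outs
    efr_failure=0.249,  # bases empty, 1 out
)
print(round(rate, 3))  # 0.652
```

On these numbers a steal of second with nobody out has to succeed roughly 65% of the time to break even.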

Cover/Keilers OERA using all data from 1900 to 1977 seasons:

outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     0.475  0.884  1.075  1.497  1.075  1.497  1.689  2.190
1     0.252  0.526  0.709  0.993  0.709  0.993  1.175  1.537
2     0.092  0.224  0.354  0.492  0.354  0.492  0.622  0.816

Ratios: OERA Estimated EFR/Palmer observed EFR (1900-1977)

outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     1.046  1.129  1.007  1.085  0.842  0.913  0.868  0.972
1     1.012  1.100  1.014  1.118  0.790  0.913  0.857  0.994
2     0.968  1.072  1.017  1.077  0.927  0.996  0.941  1.023

Ratios larger than 1 indicate that the OERA calculation is an overestimate compared to the observed values. The OERA calculation tends to overestimate scoring except when a runner is on third where it mostly underestimates scoring.

The numeric values used to create the OERA table are at bats = 7090038, hits = 1857653, doubles = 294558, triples = 73893, home runs = 113567 and walks = 660126. The totals were obtained from TB5 season totals. I have made some comparisons indicating OERA is slightly less accurate than the Bill James Runs Created (Tech-1) formula. RC-1 uses additional parameters beyond those used by OERA, so its better performance is not surprising. Consequently the OERA might be most useful to those sabermetricians doing "what if?" analysis based on expected future runs. The calculation can be done for a player, team, league or for the input stats summed over several years.

A second comparison is obtained from hitting data for the 1967 AL and for both leagues for the 1980 - 1986 and 1992 - 1996 seasons obtained from analyzing full season play-by-play data. For this analysis, hit by pitch totals have been added to walks. This yields the following table of expected future runs using the OERA calculation.

OERA Predicted Expected Future Runs from all PA

      Base runners on:
outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     0.524  0.945  1.123  1.561  1.123  1.561  1.739  2.260
1     0.284  0.569  0.740  1.037  0.740  1.037  1.207  1.588
2     0.107  0.247  0.369  0.514  0.369  0.514  0.636  0.844

The same events files show the following distribution of inning states. That is, the listed state occurred that many times during the analyzed seasons. The careful reader will observe that the number of 0 out bases empty states exceeds the number of innings played. Any state can occur more than once. A home run with no outs creates another instance of the no out bases empty state.

Observed count of inning states

      Base runners on:
outs     ---     1--     -2-     12-     --3     1-3     -23     123
0     466816  121719   36394   27578    6862   12045    6682    6858
1     333344  138538   65900   50260   22451   24397   17088   17449
2     264740  139007   78751   63447   32133   31637   18784   20993

Besides the inning states, runs scored following each state to the end of the inning were also tabulated. Dividing these run totals by the number of times the inning state occurred provides the observed expected future runs from each inning state.

Observed Expected Future Runs

      Base runners on:
outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     0.493  0.871  1.121  1.486  1.373  1.750  2.004  2.311
1     0.261  0.517  0.682  0.902  0.953  1.163  1.386  1.552
2     0.099  0.224  0.328  0.441  0.376  0.507  0.603  0.774

One way to compare the OERA predicted future runs and the actually observed values is to divide each estimated value by the corresponding observed value.

Ratio: OERA estimated EFR/Observed EFR

      Base runners on:
outs    ---    1--    -2-    12-    --3    1-3    -23    123
0     1.062  1.085  1.002  1.050  0.818  0.892  0.868  0.978
1     1.087  1.101  1.084  1.149  0.776  0.892  0.871  1.023
2     1.081  1.101  1.123  1.166  0.980  1.014  1.055  1.091

The weighted average of the ratios is 1.070 with a standard deviation of 0.058 using the number of times the corresponding inning state occurs as the weighting term.
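The weighting can be reproduced from the two tables above. This sketch makes two assumptions: it uses the ratios as printed (rounded to three places), and it takes the standard deviation to be the weighted, population-style one with no sample correction. The last digit may therefore differ slightly from the figures quoted.

```python
import math

# Observed state counts and estimated/observed ratios, rows = 0, 1, 2 outs,
# columns = base states ---, 1--, -2-, 12-, --3, 1-3, -23, 123.
counts = [
    [466816, 121719, 36394, 27578, 6862, 12045, 6682, 6858],
    [333344, 138538, 65900, 50260, 22451, 24397, 17088, 17449],
    [264740, 139007, 78751, 63447, 32133, 31637, 18784, 20993],
]
ratios = [
    [1.062, 1.085, 1.002, 1.050, 0.818, 0.892, 0.868, 0.978],
    [1.087, 1.101, 1.084, 1.149, 0.776, 0.892, 0.871, 1.023],
    [1.081, 1.101, 1.123, 1.166, 0.980, 1.014, 1.055, 1.091],
]

weights = [c for row in counts for c in row]
values = [x for row in ratios for x in row]
total = sum(weights)

# Weighted mean, then weighted variance about that mean.
mean = sum(w * x for w, x in zip(weights, values)) / total
variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
std = math.sqrt(variance)

print("weighted mean %.3f, standard deviation %.3f" % (mean, std))
```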

Comparing this set of ratios to the ones computed from the Palmer EFR data shows considerable similarity.

Interestingly, the numerical solution can easily be generalized to an arbitrary number of outs per inning. This allows estimating what game scores would be given a different number of outs per inning and innings per game. The following table shows four different ways that inning length and number of innings can give 27 outs, computed from the 1900-1977 totals. The third out nips a lot of rallies.

outs/inning  innings    e[0]  runs/game
     1          27     0.092     2.48
     3           9     0.475     4.28
     9           3     2.655     7.97
    27           1    11.164    11.16
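The table can be checked without a linear equation solver by repeatedly applying E = R + QE until the values settle. This is a different technique from the lusolve() route and is offered only as a cross-check. The runner advancement rules coded here are assumptions based on a reading of the Cover-Keilers model (forced advances only on walks, singles score runners from second and third and send a runner on first to third, longer hits clear the bases, runners hold on outs). The function name and the sweep count are hypothetical choices.

```python
LEAGUE_1900_1977 = dict(ab=7090038, h=1857653, d=294558, t=73893,
                        hr=113567, bb=660126)


def efr_by_iteration(ab, h, d, t, hr, bb, outs_per_inning, sweeps=200):
    """Return the EFR vector by in-place fixed-point sweeps of E = R + QE."""
    pa = ab + bb
    p_out, p_walk = (ab - h) / pa, bb / pa
    p_single = (h - d - t - hr) / pa
    # Hits that clear the bases: batter ends on second (2), third (4) or home (0).
    p_clear = {2: d / pa, 4: t / pa, 0: hr / pa}
    e = [0.0] * (8 * outs_per_inning)
    for _ in range(sweeps):
        for state in reversed(range(len(e))):
            outs, bases = divmod(state, 8)
            on1, on2, on3 = bases & 1, (bases >> 1) & 1, (bases >> 2) & 1
            level = 8 * outs
            # Out: same runners, one more out; the final out ends the inning.
            total = p_out * e[state + 8] if outs + 1 < outs_per_inning else 0.0
            # Walk: runners advance only when forced.
            if bases == 7:
                walk_bases, walk_runs = 7, 1
            elif on1 and on2:
                walk_bases, walk_runs = 7, 0
            elif on1:
                walk_bases, walk_runs = bases | 2, 0
            else:
                walk_bases, walk_runs = bases | 1, 0
            total += p_walk * (walk_runs + e[level + walk_bases])
            # Single: runner on first goes to third, the others score.
            total += p_single * (on2 + on3 + e[level + (5 if on1 else 1)])
            # Double, triple, home run: every runner scores.
            for batter_base, prob in p_clear.items():
                runs = on1 + on2 + on3 + (1 if batter_base == 0 else 0)
                total += prob * (runs + e[level + batter_base])
            e[state] = total
    return e


for outs, innings in ((1, 27), (3, 9), (9, 3), (27, 1)):
    e0 = efr_by_iteration(**LEAGUE_1900_1977, outs_per_inning=outs)[0]
    print("%2d outs x %2d innings: e[0] = %.3f, runs/game = %.2f"
          % (outs, innings, e0, innings * e0))
```

Each sweep updates the states from the most outs down to the fewest, so values for later outs are already refreshed when earlier ones are recomputed. If the assumed rules match the original implementation, the printed figures should be close to the table.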

A direct numeric solution to the Cover-Keilers matrix equation is the natural approach to this problem. The numeric values of the Q' matrix and R vector elements are computed. The resulting 24th order set of equations is solved using a standard numerical method (lusolve() from Numerical Recipes, 2nd ed., Press et al, Cambridge, 1992). The solution is the E vector, and each element is the expected future runs in an inning starting from the corresponding game state. The numerical solution is computationally easy. A 27 out inning, which has 8*27 = 216 states, is readily solved by lusolve(). Perhaps the most important role of the numerical solution is the clear description of the algorithm it affords. The code implementing the numeric solution will need to be modified to replace a private set of matrix/vector classes and to properly interface to the numeric linear equation solver.

The set of 24 equations can be solved symbolically yielding an expression for each component of the expected future runs, E, vector in terms of the probabilities for the six batting outcomes. A file of commands for the symbolic algebra program, Mathematica, was created by modifying a copy of the numeric solution code. The expressions for the matrix and vector elements are made from the numeric code by the simple expedient of enclosing all numeric expressions in double quotes thus making them strings. The matrix and vector base type must all be changed to string ("char *" in C/C++). The resulting matrices and vectors are written to a file according to Mathematica syntax rules. After the solution found by Mathematica is simplified, the solution is written to another file which is further edited into the "symbolic" evaluation C function. This function is a numerical computation of course. The term "symbolic" refers to how the function was created. However, all mathematical insight about the structure of the problem is lost in this process. I am indebted to my colleague, Steven King, for his efforts in managing the Mathematica phase of the project.

Using a 66 MHz Power PC the execution time for the numeric function is 3780 microseconds. On the same system the symbolic approach provides an execution time of 18 microseconds, more than 200 times faster. The performance of a Power PC 601 CPU is comparable to a Pentium with the same CPU clock rate. Processing time should decrease proportionally as the CPU clock rate increases. The compiler used is Metrowerks CodeWarrior with all optimizations enabled.

The numerical solution is recommended if outs is a needed parameter. In the more likely case of always using 3 out innings, the speed of the symbolic based function and the fact that it does not require any specialized numerical libraries makes it the method of choice. If only a few of the EFR values are needed the unneeded ones may be deleted further increasing the speed of the function.

Either or both functions can be requested as web pages. The details of the calling sequence are best obtained from the source files. After loading one of them with a Web Browser, save the page as a text file and edit the saved file into legal C/C++ source code by removing the beginning and ending HTML sequences. The links to the two functions are:

Get the C++ source code for the "Numerical" Function:

Use caution when editing either function, but especially the symbolic version. There is no way to debug this version other than a character by character comparison of a working and a non-working version. (And if you have a working version, why bother?) If the routine fails to work, the best procedure would be to redo all editing using a fresh copy.

If you successfully use either of these functions, please drop me a note.

**Revision History**

February 3, 1998: Original Posting to the Web.

August 19, 1998: Added the comparison to observed expected future
runs for the 1967 AL and both leagues 1980-1986 and 1992-1996. A
paper copy of this version has been sent to the SABR Archives.

Copyright 1997-1998, John F. Jarvis