This study addresses the concept of projecting a player's career: estimating how he will do in one or more future seasons based on what he has done so far. A second possible use of projection systems is interpolation: estimating what might have been accomplished when a player misses all or most of a season.
Perhaps the best-known system of this kind is Brock2, introduced by Bill James. I will also describe a system of my own design and the simplest projection system: assume a player will do the same next year as he did last year. Projections made by each of these systems will be compared with the results actually achieved.
2) Player Evaluation Measures
Since a single number is being used to represent player achievement, the difficult question of what to use must be faced. Many different quantities have been proposed for this purpose. Counting stats, as well as various combinations of them, can be used. The counting stats used are RUNS, RBIS, hits, and home runs. The combinations include James' Runs Created (BASIC, see Total Baseball 7th Ed. pp 2498-2499), (RUNS+RBIS)/2 and two linear weights formulas, BWOE and SDTHWOE. Each letter in the linear weights formulas represents a term in the formula. SDTH represents the numbers of singles, doubles, triples and home runs. B is total bases, E is errors and O is outs, defined as AB - Hits. The weights for the terms were determined by linear regression using team data from 1978-2001. When a formula is used as a measure of a player's contribution, the error term is not included (hence SDTHWO rather than SDTHWOE). Both linear weights formulas estimate the season runs contributed by the player. The SDTHWO measure is used in this study. Details on the linear weights formulas and the efficacy of some of the other measures are available in my web published study A Survey of Baseball Player Performance Evaluation Measures. In this study SDTHWO is referred to as BASIC.
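As a minimal sketch of how a linear weights measure of the SDTHWO form is evaluated for one player season, each letter contributes one weighted count. The function name, the dictionary keys and the weights used in the usage note are placeholders for illustration only; the fitted regression weights are not reproduced here, W is assumed to stand for walks, and no intercept term is assumed.

```python
def sdthwo(season, weights):
    """Weighted sum of singles, doubles, triples, home runs, walks and outs.

    `season` and `weights` are plain dicts; their key names are hypothetical.
    """
    hits = (season["singles"] + season["doubles"]
            + season["triples"] + season["home_runs"])
    counts = {
        "singles": season["singles"],
        "doubles": season["doubles"],
        "triples": season["triples"],
        "home_runs": season["home_runs"],
        "walks": season["walks"],            # assumption: W means walks
        "outs": season["at_bats"] - hits,    # outs as defined in the text: AB - Hits
    }
    return sum(weights[name] * count for name, count in counts.items())
```

Called with a season dict and a dict of six weights, the function returns a single season value in the units of the weights (season runs when the weights come from a runs regression).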
3) Data Source
The source of the batting data used in this study is version 4.5 of the Sean Lahman database. Using the years 1910 to 2001 from this database, batting careers were assembled for players whose careers are complete within the years 1911 to 2000. Pitcher records are included. Completeness is defined as not having played in 1910 or 2001. This is a weak definition of completeness, especially at the beginning of a career. This version of the database does not include fielding position information, which is needed by the Brock2 system.
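The completeness rule can be sketched as a simple filter. This is an illustration under assumptions: the function name and the input layout (a dict mapping a player identifier to the years in which he has a batting record) are hypothetical and are not the layout of the Lahman database.

```python
def complete_careers(seasons_by_player, first_excluded=1910, last_excluded=2001):
    """Keep players with no recorded season in either boundary year."""
    kept = {}
    for player_id, years in seasons_by_player.items():
        if first_excluded in years or last_excluded in years:
            continue  # career may extend beyond the 1911 to 2000 window
        kept[player_id] = sorted(years)
    return kept
```

A player who appears in 1910 or 2001 is dropped entirely, even if most of his career lies inside the window, which is why the definition is weak rather than strict.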
4) The Career Function
I feel very strongly that any formula containing arbitrary or data related numeric constants that is used to represent or evaluate player achievements must derive these constants from the game's statistical record. Among the formulas included in this category are the multitudinous linear weights batting evaluation formulas and the projection formulas discussed in this study. The following discussion of my career function will describe one method for determining values for seemingly arbitrary constants.
The form of the career summary function I have developed, CFTN, is suggested by the typical ML career: a period of development, mid career at full capabilities, followed by the inevitable decline due to age. Each individual career is characterized by two constants, mpeak and shape. Shape is related to the length of career and mpeak is a measure of the peak ability of a player. Given a player age, CFTN returns a value indicating his ability, as measured by CFTN, at the given age. The ability is given in the same units that were used in the creation of the function. CFTN is the product of three terms, one for each of the three general periods of the career, given by the following computational expressions.
1) cftn_b = cftn_ba + cftn_bb*shape
2) cftn_c = cftn_ca - cftn_cb*shape
3) cftn_t1 = cftn_t1z - cftn_t1s*shape
4) cftn_t2 = cftn_t2z + cftn_t2s*shape
5) a = 1.0/(1.0 + exp(-cftn_a*(age-cftn_t1)))
6) b = 1.0 + cftn_b*(age-20)
7) c = 1.0/(1.0 + exp(cftn_c*(age-cftn_t2)))
8) CFTN = mpeak*a*b*c
The nine constants cftn_a, cftn_ba, cftn_bb, cftn_ca, cftn_cb, cftn_t1z, cftn_t1s, cftn_t2z and cftn_t2s are global, the same for all players. Determination of these values is the key to CFTN and will be described in detail.
Expression 5) determines the behavior of CFTN at the beginning of a career, 6) the middle and 7) the end. Expression 8) gives the value of the function at a given age. An illustration, computed from the parameters given in Table 1, is shown in Figure 1.
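Expressions 1) to 8) translate directly into a short function. The following is a sketch, not the original implementation. The Table 1 values are mapped onto the nine constant names used in the expressions, and that mapping is an assumption. The Python names are illustrative.

```python
import math

# Global constants, taken from Table 1 (mapping to these names is assumed).
CFTN_A = 0.864861
CFTN_BA, CFTN_BB = -0.031985, -0.007588
CFTN_CA, CFTN_CB = 0.937970, 0.774984
CFTN_T1Z, CFTN_T1S = 33.893580, 11.006860
CFTN_T2Z, CFTN_T2S = 27.122430, 24.778990

def cftn(age, shape, mpeak=1.0):
    """Career function value at `age` for a player with the given shape and mpeak."""
    b_slope = CFTN_BA + CFTN_BB * shape                    # expression 1
    c_rate = CFTN_CA - CFTN_CB * shape                     # expression 2
    t1 = CFTN_T1Z - CFTN_T1S * shape                       # expression 3
    t2 = CFTN_T2Z + CFTN_T2S * shape                       # expression 4
    a = 1.0 / (1.0 + math.exp(-CFTN_A * (age - t1)))       # expression 5: development
    b = 1.0 + b_slope * (age - 20)                         # expression 6: mid career
    c = 1.0 / (1.0 + math.exp(c_rate * (age - t2)))        # expression 7: decline
    return mpeak * a * b * c                               # expression 8
```

Because the value is a plain product with mpeak as one factor, doubling mpeak doubles the result at every age. The function does not clip negative values, so any clipping for plotting has to be applied by the caller.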
The function CFTN is linear in the parameter mpeak. Thus, for any set of values of the global constants and a value for the parameter shape, mpeak can be calculated:
9) mpeak = sum(measure*CFTN) / sum(CFTN*CFTN)
In 9) the sums are over the seasons in a player's career. Measure is the quantity chosen to represent a player's achievement. For this calculation CFTN is evaluated with mpeak set to 1. Varying shape until a minimum in the variance (the sum of squared differences between the actual value and the estimated value of the measure) is found for the particular player determines the two parameters. To minimize numerical problems in these calculations the shape parameter is constrained to the range -0.25 to 1.50. The units of CFTN are the units of the measure used. If the measure specifies season runs then mpeak will have the same interpretation. For some players and some sets of the global parameters more than one minimum in the shape range of -0.25 to 1.50 is possible.
The constant cftn_bb has a negative value that, when combined with values of shape near the upper limit (1.5), can result in a negative value for CFTN towards the end of a career. In the plots included in this report negative values of CFTN have been set to 0.
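The per-player fit can be sketched as a scan over shape with mpeak obtained in closed form from expression 9). The scan on an evenly spaced grid is an assumption made for this illustration; the text does not say how the one-dimensional search over shape was carried out. The argument unit_curve is a hypothetical callable that returns CFTN with mpeak set to 1 for a given age and shape.

```python
def fit_player(ages, measures, unit_curve, shape_lo=-0.25, shape_hi=1.50, steps=175):
    """Return (shape, mpeak, variance) with the smallest variance on the shape grid."""
    best = None
    for k in range(steps + 1):
        shape = shape_lo + (shape_hi - shape_lo) * k / steps
        curve = [unit_curve(age, shape) for age in ages]   # CFTN with mpeak = 1
        denom = sum(c * c for c in curve)
        if denom == 0.0:
            continue                                        # expression 9 undefined
        # Expression 9: least squares value of mpeak for this shape.
        mpeak = sum(m * c for m, c in zip(measures, curve)) / denom
        # Variance: sum of squared differences between actual and estimated values.
        variance = sum((m - mpeak * c) ** 2 for m, c in zip(measures, curve))
        if best is None or variance < best[2]:
            best = (shape, mpeak, variance)
    return best
```

Scanning the whole allowed range rather than descending from a single starting point is one way to cope with the possibility of more than one minimum. Passing only the first N seasons of a career gives the parameters used for a projection.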
From the complete player career records, a training set was selected of players having at least 2000 at bats, at least 6 seasons played and no more than 3 missing seasons. 1230 players satisfied these criteria. Using this data set, the global parameters are determined by minimizing the total variance, which is the sum of the variances resulting from a determination of mpeak and shape for each player in the training set. For this variance computation negative values of CFTN are not set to zero. This global minimization is accomplished by a standard numerical procedure given in section 10.5 of Numerical Recipes, 2nd Edition, Press et al, 1992. Missing seasons within a career are not used in the computation of variance for the optimization procedure. Typically, a single global optimization takes 30 minutes on a 500 MHz INTEL Pentium III processor.
To determine the parameters used, an evaluation measure was specified (SDTHWO) and a number of CFTN global parameter optimizations were performed. To avoid ending the optimizations in the same local minimum, slightly different initial values for the CFTN parameters were used for each starting point. The obvious criterion for choosing a parameter set is the smallest total variance. Since the goal is to project a career, another selection criterion is which parameter set provides the most accurate projections, defined as the highest correlation. Details on correlations are given in Section 7. There is a generally increasing accuracy of projection with smaller variance but the relationship is noisy. After carefully examining several parameter sets, one providing the best projections based on eight seasons was chosen and is given in Table 1. This parameter set provides a low but not the minimum variance observed and is used with the SDTHWO evaluation measure for all the CFTN computations given in this report. The R2 vs Final Variance plot, Figure 11, shows the results of 258 determinations of CFTN global constants for the first year projections based on 4, 8 and 12 seasons. The general trend that lower variance corresponds to higher correlation is evident. There are also some clearly poor CFTN parameter sets for any final variance when the evaluation criterion is correlation. The form of CFTN is only slightly dependent on the player measure used.
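The training set criteria can be sketched as a predicate on one career. The function name and input layout (a dict from season year to at bats) are hypothetical, and a missing season is assumed here to mean a year between the first and last recorded seasons with no batting record.

```python
def in_training_set(at_bats_by_year, min_ab=2000, min_seasons=6, max_missing=3):
    """True if a career meets the at bat, season count and missing season limits."""
    if not at_bats_by_year:
        return False
    years = sorted(at_bats_by_year)
    seasons = len(years)
    # Assumed definition: gaps inside the span from first to last season played.
    missing = (years[-1] - years[0] + 1) - seasons
    total_ab = sum(at_bats_by_year.values())
    return total_ab >= min_ab and seasons >= min_seasons and missing <= max_missing
```

Applying this predicate to every complete career would give the list of players whose individual variances are summed into the total variance.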
Figure 2 displays the resulting function for six values of the shape parameter using the global parameters given in Table 1. Each curve in Figure 2 is evaluated with mpeak set to 1. As shape increases the general effect is towards a lower starting age for the career. Once shape is above about 0.5 all careers end around age 45. Values of shape less than 0.2 describe careers that are fairly short with the peak age decreasing as shape decreases. The set of shape curves obtained from the minimum variance parameter set is not qualitatively different.
While the parameter shape is related to the length of a player's career, mpeak is a measure of its peak value. However, as can be seen in Figure 2, mpeak depends on shape. If player mpeak values are to be compared, a correction needs to be made to mpeak to compensate for this shape dependency. mpeak' (mpeak prime) is this corrected value:
10) mpeak' = mpeak * max(CFTN(shape,1.0))
where CFTN is evaluated from age 18 to 45.
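Expression 10) amounts to scaling mpeak by the highest point of the unit curve. In the sketch below the names are illustrative, unit_curve is a hypothetical callable returning CFTN with mpeak set to 1, and evaluating only at whole-number ages is an assumption, since the text gives the age range but not the step.

```python
def mpeak_prime(mpeak, shape, unit_curve, ages=range(18, 46)):
    """Expression 10: mpeak times the maximum of the unit curve over ages 18 to 45."""
    return mpeak * max(unit_curve(age, shape) for age in ages)
```

The result is the peak season value implied by the fit, in the units of the measure, which is what makes it comparable between players with different shape values.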
Figure 3 shows a plot of (mpeak', shape) pairs for the 1921 complete careers having at least 1000 AB and at least 6 seasons in the majors. The evaluation measure for this plot is the linear weights formula SDTHWO, which can be interpreted as runs contributed to the team during a season. Table 2 lists the ten players having the highest mpeak' values from this data set. While there will always be debate on the question of which ten players have the highest peak values, this list is suggestive that the values have some connection with
Figure 4 shows career and CFTN data for Rod Carew (shape=1.33, mpeak'=93.8) using the measure SDTHWO. The solid circles show his career values and the four solid curves display CFTN values. The curve labeled "entire" is based on his complete career. The large variation in season to season output is typical and is the root of the difficulty in creating an accurate representation of a career.
Projections can be made with CFTN by limiting the data used for a player to fewer seasons than a complete career: the sums in 9) are simply evaluated for the first N seasons of a career. Figure 4 includes CFTN curves for Carew using 4, 8 and 12 seasons in the shape and mpeak determinations. The way normal season to season variation affects projections based on a few seasons is also evident.
Quantitative results on the ability of these systems to predict careers will be given after a discussion of the Brock2 and Most Recent Seasons systems.
5) The Brock2 Projection System
Bill James introduced this method in his 1985 Baseball Abstract, pp 301-305. While the James description allows implementation of the scheme, no details on how well it works were presented. Brock2 requires up to four seasons of data, depending on the player's age, to estimate one or more following seasons. Games played, at bats, singles, doubles, triples, home runs, walks, runs and RBIs are needed as input. These same quantities are individually projected by the calculations. To compare with actual data and CFTN I evaluate the same offensive measure for Brock2 as for CFTN. Up to age 27 Brock2 projects a slightly increasing ability. Thereafter, a smaller ability is projected until a threshold is reached which effectively terminates a career. The change in ability from season to season depends only on player age and does not depend on when a career starts. Brock2 generates projections using the preceding two seasons, either actual or projected. If the season immediately preceding the one being projected is missing, a projection cannot be made. If the second season preceding the projection is missing, the projection will be seriously in error. For an example of this effect see the eight year base Brock2 projection for Ted Williams' career in Figure 9.
I have used the David Grabiner C language implementation of 1997 as the starting point for this effort. I added the extensions needed to interface the Grabiner code with my evaluation framework and packaged it in a C++ class. The Brock2 system was easily modified to project from more than the minimum specified number of seasons. The Brock2 algorithm was not changed.
An important part of the Brock2 calculation is the player sustenance level, which primarily affects the career termination part of the calculations. Determining the initial value for this requires knowledge of the fielding position played and league runs/game information. The yearly runs/game values are readily available. However, player position information is not readily available and often changes during the course of a career. I have used an initial position correction (-0.452) to the sustenance level that is the same for all players. This constant value was chosen so that, when used for all players in a data set extracted from the Lahman 2.0 database, it gives the same average results as were obtained using the position information included in the 2.0 version. This position independent constant was used for the projections given in this study, which are based on player data from the Lahman 4.5 database.
Figure 5 shows Brock2 projections based on 4, 8 and 12 years for Rod Carew. The quantity plotted is SDTHWO, the same as in Figure 4. The Brock2 curves are essentially parallel showing the same decrease from their starting points which are dependent on the last two actual career data points. The effect of the career termination calculation can be seen in the 4 and 8 year based projections in Figure 5.
Brock2 separately projects each of the stats it uses. A thorough test would separately test the accuracy of each of these quantities.
6) Most Recent Season
The simplest projection scheme that can be devised is to assume a player will continue to produce exactly as he did during the most recent season that he completed. This scheme will be referred to by the acronym MRS. MRS thus is a limited form of autocorrelation. I have not provided a plot similar to Figures 4 and 5 for MRS. Such a plot would consist of horizontal lines drawn from the last season of actual data used to make the projections. While MRS obviously lacks any concept of how a career progresses, the comparisons with the more complex Brock2 and CFTN methods will prove useful.
The evaluations described in this section use the same player set as was used for Figure 3: complete player careers with at least 1000 AB and at least 6 seasons played.
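MRS is simple enough to state in a few lines. In this sketch the function and argument names are illustrative, and the list of season values is assumed to be in career order with no missing seasons.

```python
def mrs_projection(measures, n_base, n_future):
    """Repeat the value of the last base season for each projected season."""
    last = measures[n_base - 1]     # most recent completed season in the base
    return [last] * n_future
```

For example, mrs_projection(values, 4, 3) projects seasons 5 to 7 of a career as three copies of the season 4 value.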
The first comparison to be shown is of the averages of the player season runs (using the SDTHWO measure) for the three projection methods and the actual data. Perhaps the simplest criterion for success in projections is that the projection and actual data give the same value. The comparison in this case is crude: Figure 6 displays ratios of the data set averages for a projection normalized (divided) by the data set average actually observed. Ratios greater than 1 indicate an overestimate by the projection method. In the figure legends, B2 (red) shows Brock2 results, CF (blue) is used for CFTN and MS (green) for the MRS projections. The number following the colon and symbol indicates how many seasons were used for the projections. MRS projections for 8 and 12 years show clearly the overestimate resulting from the lack of modeling an age related decline of ability. The MRS projection based on 4 years suggests that player ability continues to improve a little in the second 4 year segment of a career. While it is possible that some of this improvement may be due to marginal players retiring, the selection criteria for the player data set minimize this. Brock2 systematically underestimates projected careers. CFTN does a reasonable job of reproducing the projected averages for three or four years. This is not surprising since it was designed to match averages.
A stronger comparison of the projections and actual data results from comparing each player season projection to what was actually achieved. The linear correlation ("R") coefficient has been computed for the data used to create Figure 6. Figure 7 displays R2, the square of the correlation coefficient R, using the same conventions as Figure 6. In the legend for Figure 7 the circumflex "^" is used to suggest correlation; otherwise the interpretation of the legends is the same. The interpretation of R2 is the fraction of variance in one of the variables explained by the variance of the other. The projections based on 4 seasons clearly show the difficulty of making predictions early in a career. Except for the first year estimate, Brock2 is clearly superior to CFTN for all the projections. Distressingly, for the most important next year projections, neither Brock2 nor CFTN provides as much correlation with what is actually accomplished as the trivially simple MRS method.
Ted Williams' career offers an example of the ability of these systems to do interpolations. Fortuitously, his WW II absence occurred after 4 seasons, enabling the 4 year base Brock2 to estimate what he might have accomplished. The entire career CFTN estimate (shape=1.27, mpeak'=155.7) also agrees with what might have been. The entire career CFTN is appropriate here as a projection is not being made.
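The ratio of averages for one projected season can be sketched as follows. The names are illustrative, and the handling of gaps is an assumption: a player is skipped when either his projected or his actual value is missing (None), so that both averages are taken over the same players.

```python
def average_ratio(projected, actual):
    """Average projected value divided by average actual value over matched pairs."""
    pairs = [(p, a) for p, a in zip(projected, actual)
             if p is not None and a is not None]
    mean_projected = sum(p for p, _ in pairs) / len(pairs)
    mean_actual = sum(a for _, a in pairs) / len(pairs)
    return mean_projected / mean_actual   # > 1 means the method overestimates
```

A method can score well on this ratio while being wrong for most individual players, since errors in opposite directions cancel in the averages.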
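The R2 statistic is the squared Pearson correlation between projected and actual values. A minimal sketch using only built-in arithmetic, with illustrative names, and assuming both lists have the same length and neither is constant:

```python
def r_squared(projected, actual):
    """Square of the linear correlation coefficient between two equal-length lists."""
    n = len(projected)
    mean_p = sum(projected) / n
    mean_a = sum(actual) / n
    s_pa = sum((p - mean_p) * (a - mean_a) for p, a in zip(projected, actual))
    s_pp = sum((p - mean_p) ** 2 for p in projected)
    s_aa = sum((a - mean_a) ** 2 for a in actual)
    return (s_pa * s_pa) / (s_pp * s_aa)
```

Unlike the ratio of averages, this statistic is unchanged if every projection is scaled by a constant or shifted by a constant, so it measures how well a method ranks and spaces player seasons rather than whether its overall level is right.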
Brock2 contains thresholds a player must exceed at the beginning of his career if he is to reach the status of a "regular". Pitchers generally fail to meet these requirements so Brock2 is of no use in estimating pitcher batting. CFTN has no such restriction. Figure 10 summarizes Steve Carlton's NL batting career and indicates the projections made for it by CFTN (shape=1.36, mpeak'=5.04). It also demonstrates dangers inherent in projections based on small sets of data.
The low value for autocorrelation (MRS projection) is indicative of the great amount of variability in individual careers. This seems to be intrinsic and it is unlikely that a projection system can, in principle, do much better than displayed in this study. While it is obvious that baseball playing careers follow a pattern, the results of this study seem to indicate that the more carefully age effects are modelled the poorer the prediction system does. It also suggests that any player career projections be treated with a considerable amount of skepticism. This study also emphasizes the importance of relating any statistical evaluation system to the actual historical statistical record and its proper understanding.
Back to the J. F. Jarvis baseball page.
The following list provides access to Brock2 and CFTN career summary and projection plots similar to Figures 4 and 5 for Rod Carew and Figures 8 and 9 for Ted Williams.
cftn_a   = 0.864861
cftn_ba  = -0.031985   cftn_bb  = -0.007588
cftn_ca  = 0.937970    cftn_cb  = 0.774984
cftn_t1z = 33.893580   cftn_t1s = 11.006860
cftn_t2z = 27.122430   cftn_t2s = 24.778990
Table 1. CFTN global parameters
Player             shape   mpeak'
Ruth, Babe         0.88    188.9
Gehrig, Lou        0.99    177.8
Foxx, Jimmie       1.10    163.0
Williams, Ted      1.27    155.7
Mantle, Mickey     1.13    148.2
Hornsby, Rogers    1.09    142.8
Waner, Paul        1.04    138.0
Boggs, Wade        0.85    135.4
Schmidt, Mike      0.80    135.2
Musial, Stan       1.32    135.1
Table 2. Players with top ten mpeak' values