Baseball Event Files Parsing
John F. Jarvis

The best way to understand the need for and function of the events file processing program, the "parser", is to examine the data it processes. Following is a single game from the 1983 season (San Diego at Atlanta, April 8, 1983), obtained from Retrosheet, Inc. and coded using the pure textual scoring system adopted by Retrosheet, Inc. and the Baseball Workshop.

id,ATL8304080
version,1
info,inputprogvers,"version 7RS(19) of 07/07/92"
info,visteam,SDN
info,hometeam,ATL
info,site,ATL01
info,date,1983/04/08
info,number,0
info,starttime,0:00PM
info,daynight,unknown
info,usedh,false
info,umphome,"Quick"
info,ump1b,"Pallone"
info,ump2b,"Engel"
info,ump3b,"Runge"
info,scorer,"Braves"
info,translator,"C. Chestnut"
info,inputter,"C. Chestnut"
info,inputtime,1995/02/07 9:01PM
info,edittime,1996/09/28 9:00PM
info,howscored,park
info,pitches,count
info,temp,0
info,winddir,unknown
info,windspeed,-1
info,fieldcond,unknown
info,precip,unknown
info,sky,unknown
info,timeofgame,0
info,attendance,32737
info,wp,bedrs001
info,lp,chiff001
info,save,garbg001
info,gwrbi,
start,richg001,"Gene Richards",0,1,7
start,bonij001,"Juan Bonilla",0,2,4
start,garvs001,"Steve Garvey",0,3,3
start,kennt001,"Terry Kennedy",0,4,2
start,lezcs001,"Sixto Lezcano",0,5,9
start,joner002,"Ruppert Jones",0,6,8
start,tempg001,"Garry Templeton",0,7,6
start,salal001,"Luis Salazar",0,8,5
start,showe001,"Eric Show",0,9,1
start,butlb001,"Brett Butler",1,1,8
start,ramir001,"Rafael Ramirez",1,2,6
start,washc001,"Claudell Washington",1,3,9
start,murpd001,"Dale Murphy",1,4,7
start,hornb001,"Bob Horner",1,5,5
start,chamc001,"Chris Chambliss",1,6,3
start,hubbg001,"Glenn Hubbard",1,7,4
start,beneb001,"Bruce Benedict",1,8,2
start,campr001,"Rick Camp",1,9,1
play,1,0,richg001,12,,S9
play,1,0,bonij001,00,,S9.1-2
play,1,0,garvs001,01,,64(1)3/GDP.2-3
play,1,0,kennt001,22,,S6.3-H
play,1,0,lezcs001,12,,3/FL
play,1,1,butlb001,11,,S6
play,1,1,ramir001,00,,CS2(24)
play,1,1,ramir001,32,,7
play,1,1,washc001,11,,S7
play,1,1,murpd001,01,,S7.1-3
play,1,1,hornb001,31,,W.1-2
play,1,1,chamc001,10,,8
play,2,0,joner002,20,,63
play,2,0,tempg001,32,,K/C
play,2,0,salal001,10,,S9
play,2,0,showe001,22,,K/C
play,2,1,hubbg001,01,,8
play,2,1,beneb001,10,,63
play,2,1,campr001,11,,3
play,3,0,richg001,22,,8
play,3,0,bonij001,00,,7
play,3,0,garvs001,00,,63
play,3,1,butlb001,20,,63
play,3,1,ramir001,21,,9
play,3,1,washc001,22,,K/C
play,4,0,kennt001,00,,31
play,4,0,lezcs001,10,,5/FL
play,4,0,joner002,20,,43
play,4,1,murpd001,32,,6
play,4,1,hornb001,22,,K
play,4,1,chamc001,12,,S9.BX2(96)
play,5,0,tempg001,12,,K
play,5,0,salal001,01,,2/G
play,5,0,showe001,12,,K
play,5,1,hubbg001,00,,4
play,5,1,beneb001,10,,S8
play,5,1,campr001,01,,HP.1-2
play,5,1,butlb001,00,,S8.2-H;1-3
play,5,1,ramir001,00,,S8.3-H;1-2
play,5,1,washc001,00,,46(1)/FO.2-3
play,5,1,murpd001,21,,13
play,6,0,richg001,30,,W
play,6,0,bonij001,32,,W.1-2
play,6,0,garvs001,22,,K/C
play,6,0,kennt001,00,,NP
sub,falcp001,"Pete Falcone",1,9,1
play,6,0,kennt001,20,,E5.2-3;1-2
play,6,0,lezcs001,00,,NP
sub,mahlr001,"Ricky Mahler",1,9,1
play,6,0,lezcs001,00,,9/SF.3-H(UR);2-3
play,6,0,joner002,00,,7
play,6,1,hornb001,32,,W
play,6,1,chamc001,21,,C/E2.1-2
play,6,1,hubbg001,00,,BK.2-3;1-2
play,6,1,hubbg001,12,,K/C
play,6,1,beneb001,10,,9
play,6,1,mahlr001,00,,NP
sub,pocob001,"Biff Pocoroba",1,9,11
play,6,1,pocob001,10,,9
play,7,0,tempg001,00,,NP
sub,bedrs001,"Steve Bedrosian",1,9,1
play,7,0,tempg001,00,,63
play,7,0,salal001,02,,S8
play,7,0,showe001,00,,SB2.1-3(E2/TH2)
play,7,0,showe001,00,,NP
sub,lefej001,"Joe Lefebvre",0,9,11
play,7,0,lefej001,01,,3/FL
play,7,0,richg001,10,,63
play,7,1,butlb001,00,,NP
sub,chiff001,"Floyd Chiffer",0,9,1
play,7,1,butlb001,00,,9
play,7,1,ramir001,22,,9
play,7,1,washc001,10,,D7
play,7,1,murpd001,21,,S4.2-3
play,7,1,hornb001,02,,64(1)/FO
play,8,0,bonij001,22,,K/C
play,8,0,garvs001,12,,S7
play,8,0,kennt001,11,,S8.1-2
play,8,0,lezcs001,32,,8.2-3
play,8,0,joner002,12,,K/C
play,8,1,chamc001,31,,W
play,8,1,hubbg001,10,,S9.1-3
play,8,1,beneb001,12,,D7.3-H;1-3
play,8,1,bedrs001,00,,NP
sub,smitk102,"Ken Smith",1,9,11
play,8,1,smitk102,00,,NP
sub,lucag001,"Gary Lucas",0,9,1
play,8,1,smitk102,00,,NP
sub,watsb001,"Bob Watson",1,9,11
play,8,1,watsb001,32,,K/C
play,8,1,butlb001,22,,43.3-H
play,8,1,ramir001,00,,63
play,9,0,tempg001,00,,NP
sub,garbg001,"Gene Garber",1,9,1
play,9,0,tempg001,01,,S9
play,9,0,salal001,20,,S8.1-2
play,9,0,lucag001,00,,NP
sub,turnj101,"Jerry Turner",0,9,11
play,9,0,turnj101,00,,3(B)6(1)/GDP.2-3
play,9,0,richg001,20,,3/G
data,er,showe001,2
data,er,chiff001,2
data,er,lucag001,0
data,er,campr001,1
data,er,falcp001,0
data,er,mahlr001,0
data,er,bedrs001,0
data,er,garbg001,0
 
 

The complete set of files for a season, one for each team containing all of its home games, contains a description of every play that took place. These complete season files total 8 to 12 MBytes. The scoring system is entirely textual, thus readable, but the amount of detail is so great that help from a computer is needed if accurate statistical summaries for a team, a player or a season are desired.

It is the function of the events file analysis program, usually referred to as the parser, to process game or season descriptions and produce the reports or files containing the data needed for other projects or programs. A set of programs for parsing this data is available from Retrosheet, Inc.

I have written my own parsing program. There is no better way to understand the intricacies of the scoring system than to write a program to analyse these files. After completing the processing for a season, the appropriate statistics for the simulator can be written to a file. Likewise, complete pitching, batting and fielding statistics can be collected for all the players that appeared during the season. Have you ever tried to find batting data for pitchers? Have you ever tried to obtain base running statistics such as the fraction of the time a runner on second scores or stays at third after a single? The reward for writing the parser is the ability to compute exhaustively any statistical quantity that can be adequately defined.

While I will not attempt to describe the details of the scoring system and parser, some idea of the completeness of the record and the problems of interpreting it can be seen in the game description above. Each at bat is described by one or more "play" records. Each play record consists of the word "play", the inning, 0 for the visiting team or 1 for the home team, a coded version of the player's name, the count on the batter, an optional list of the pitches to the batter, and the detailed description of the play itself. For example:

 
play,6,0,lezcs001,00,,9/SF.3-H(UR);2-3
 


In the visitors' sixth inning Sixto Lezcano hits a sacrifice fly to the right fielder, scoring the runner on third, Gene Richards, with a run that is unearned (UR), and advancing the runner on second, Juan Bonilla, to third. Starting lineup and substitution information allows each player in a play to be identified. The detailed play description consists of up to three parts, but not all parts are given on every play. The first item is the type of play. A slash, "/", separates a modifier or additional information field from the play type. If runners advance on the play, this information is given following a period, ".".
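
The record layout just described can be sketched in a few lines of Python. This is only an illustration, not the parser discussed in this article: the names Play, parse_play and split_event are hypothetical, and the use of ";" to separate individual runner advances is read off the sample game rather than taken from a specification.

```python
import csv
from typing import NamedTuple


class Play(NamedTuple):
    inning: int
    home: bool      # False = visiting team batting, True = home team batting
    batter: str     # coded player name, e.g. "lezcs001"
    count: str      # count on the batter, e.g. "00"
    pitches: str    # optional pitch list (empty in the sample game)
    event: str      # detailed play description


def parse_play(line: str) -> Play:
    """Split one "play" record into its seven comma-separated fields."""
    fields = next(csv.reader([line]))
    if len(fields) != 7 or fields[0] != "play":
        raise ValueError(f"not a play record: {line!r}")
    _, inning, side, batter, count, pitches, event = fields
    return Play(int(inning), side == "1", batter, count, pitches, event)


def split_event(event: str):
    """Return (play type, modifiers, runner advances) for a play description."""
    # Everything after the first period describes runner advances.
    body, _, advance_part = event.partition(".")
    # Slashes in the leading part separate modifiers from the play type.
    play_type, *modifiers = body.split("/")
    # Assumption from the sample: advances are separated by semicolons.
    advances = advance_part.split(";") if advance_part else []
    return play_type, modifiers, advances


play = parse_play("play,6,0,lezcs001,00,,9/SF.3-H(UR);2-3")
print(play.inning, play.batter, split_event(play.event))
```

Splitting on the first period before looking for slashes matters for records such as SB2.1-3(E2/TH2), where a slash appears inside the advance information rather than after the play type.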

In addition to the play record there are also "id", "start", "sub", "info" and "data" records. The "id" record contains a unique game identifier based on the home team, date and which game of a doubleheader if needed. Some additional insight into this scoring system is provided in: "The Joy of Keeping Score", Paul Dickson, 1996, Walker, pp 44-45.
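
As a rough sketch of how the other record types might be handled, the Python below groups the records of one game by type, decodes the game identifier, and follows "start" and "sub" records to track who occupies each batting order slot. The function names are hypothetical. The id layout (three-letter home team, two-digit year, month, day, game number) and the meaning of the start/sub fields (player code, name, team, batting order, position) are inferred from the sample game above, not taken from a specification.

```python
import csv
from collections import defaultdict


def decode_game_id(game_id: str) -> dict:
    """Decode an id such as "ATL8304080" (layout inferred from the sample)."""
    return {
        "home_team": game_id[:3],
        "year": 1900 + int(game_id[3:5]),  # assumes a two-digit 19xx year
        "month": int(game_id[5:7]),
        "day": int(game_id[7:9]),
        "game_number": int(game_id[9]),    # 0 in the sample game
    }


def read_game(lines):
    """Group records by type and track the player in each batting slot."""
    records = defaultdict(list)
    lineup = {}  # (team, batting order) -> coded player name
    for fields in csv.reader(lines):
        if not fields:
            continue
        kind = fields[0]
        records[kind].append(fields[1:])
        if kind in ("start", "sub"):
            # A "sub" record replaces whoever held that batting slot.
            player, _name, team, order, _position = fields[1:6]
            lineup[(int(team), int(order))] = player
    return records, lineup


sample = [
    "id,ATL8304080",
    'start,campr001,"Rick Camp",1,9,1',
    "play,6,0,kennt001,00,,NP",
    'sub,falcp001,"Pete Falcone",1,9,1',
]
records, lineup = read_game(sample)
print(decode_game_id(records["id"][0][0]))
print(lineup[(1, 9)])
```

The csv module is used instead of a plain split on commas because player names are quoted in the "start" and "sub" records.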

Once the investment has been made in creating a program to process this data, creating new player level statistics and applying them to full major league seasons is straightforward. This is the method used to evaluate the apportioned wins and losses pitching evaluation statistic I presented at SABR97.

Following is a sample of the reports available, for the 1983 NL, giving team related statistics.

In the hitting summary, "or" is opponents' runs scored and "hbyp" is batters hit by a pitch. The other column headings are the usual abbreviations.

team hitting summary


team games  runs    or   lob    ab  hits     s     d     t    hr    bb  hbyp
 atl   162   746   640  1155  5472  1489  1096   218    45   130   582    17
 chn   162   701   719  1120  5512  1436   982   272    42   140   470    29
 cin   162   623   710  1090  5333  1274   896   236    35   107   588    19
 hou   162   643   646  1145  5502  1412  1016   239    60    97   517    19
 lan   163   654   609  1104  5440  1358   981   197    34   146   541    22
 mon   163   677   646  1213  5611  1482  1042   297    41   102   509    38
 nyn   162   575   680  1041  5444  1314  1004   172    26   112   436    31
 phi   163   696   635  1154  5426  1352   973   209    45   125   640    26
 pit   162   659   648  1142  5531  1460  1072   238    29   121   497    19
 sdn   163   653   653  1103  5527  1384  1050   207    34    93   482    20
 sfn   162   687   697  1104  5369  1324   946   206    30   142   619    28
 sln   162   679   710  1175  5550  1496  1088   262    63    83   543    24
----------------------------------------------------------------------------
 tot  1948  7993  7993 13546 65717 16781 12146  2753   484  1398  6424   292
input command ('?' for list): misc
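
One way to confirm the reading of the column headings is to check that singles, doubles, triples and home runs add up to hits. The short Python sketch below does this for the Atlanta row and the totals row copied from the table above. The helper name parse_row is hypothetical, and the batting average is simply hits divided by at bats.

```python
HEADER = "team games  runs    or   lob    ab  hits     s     d     t    hr    bb  hbyp"
ATL = " atl   162   746   640  1155  5472  1489  1096   218    45   130   582    17"
TOT = " tot  1948  7993  7993 13546 65717 16781 12146  2753   484  1398  6424   292"


def parse_row(header: str, row: str) -> dict:
    """Pair each whitespace-separated column heading with its value."""
    names = header.split()
    values = row.split()
    return {n: (v if n == "team" else int(v)) for n, v in zip(names, values)}


for line in (ATL, TOT):
    row = parse_row(HEADER, line)
    # s, d, t and hr are read as singles, doubles, triples and home runs.
    assert row["s"] + row["d"] + row["t"] + row["hr"] == row["hits"]
    average = row["hits"] / row["ab"]
    print(f'{row["team"]}: batting average {average:.3f}')
```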
 
 

The base stealing tabulation below also gives each team's double and triple play counts as well as counts of successes (sb) and failures (cs) for attempts on each base. Counts of pickoffs, both as the offensive and as the defensive team, are also given.

 
 
base stealing stats
       offensive                                      defensive
team  dpl tpl po1 po2 po3  sb2 cs2 sb3 cs3 sb4 cs4    dpl tpl po1 po2 po3
 atl  159   1  10   3   0  135  75  11   7   0   3    176   0  10   0   0
 chn  143   0   4   2   1   78  34   5   3   1   3    163   1  10   4   0
 cin  137   0   7   1   0  145  65   8   5   1   6    121   0   4   2   0
 hou  102   0  10   1   0  156  87   8   3   0   5    165   0   5   1   0
 lan  140   0   7   0   0  155  65  10   4   1   5    131   0   7   4   0
 mon  153   1   7   1   1  131  37   7   5   0   1    128   0   6   4   1
 nyn  139   0   7   3   0  129  60  12   2   0   2    172   0   6   1   0
 phi  159   0   3   4   1  133  67  10   3   0   4    116   0   7   1   0
 pit  160   1   5   3   0  116  70   8   5   0   2    165   0  12   0   1
 sdn  143   0   9   3   0  162  56  17   4   0   6    134   1   4   3   0
 sfn  167   0   4   3   0  139  63   0   7   1   7    108   0   4   2   1
 sln  150   0  10   4   0  194  76  10   3   3  10    173   1   8   6   0
--------------------------------------------------   --------------------
 tot 1752   3  83  28   3 1673 755 106  51   7  54   1752   3  83  28   3
 
 

The following tabulation is for the entire 1983 National League. Runner advances under various conditions are tabulated. The first line reports on where the lead runner ends up following a single. In the headings read 1-3 as the number of times with the lead runner on first that a single advances him to third. An x as a destination indicates the runner was out while an h indicates the runner scored. If the starting base and ending base are the same the runner didn't advance.

The second line tabulates the advances for a runner trailing a runner on third when a single has been hit.

The third line documents the progress of the lead runner following a double as well as runner advances on double plays.

 
 
lead runner advance on single
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x   3-3   3-h   3-x
  1470   699    32    34    26   509  1147    71    14  1177     1
 
runner on third, single, next runner advances
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x
   754   313    30    34     3   134   295    16
 
lead runner advance on double ------------,  on double play
   1-3   1-h   1-x   2-3   2-h   2-x   3-h   2-2   2-3   3-3   3-h
   255   245    14     5   450     0   252     6   126    10    59
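
The question of how often a runner on second scores or stops at third after a single can be answered directly from the first tabulation above. The Python below is a minimal worked example using the four 2-x counts from that table. The variable and function names are illustrative only.

```python
# Lead runner on second when a single is hit, 1983 NL (first tabulation).
from_second = {"2-2": 26, "2-3": 509, "2-h": 1147, "2-x": 71}


def fractions(counts: dict) -> dict:
    """Convert outcome counts into fractions of all opportunities."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


shares = fractions(from_second)
for outcome, share in shares.items():
    print(f"{outcome}: {share:.3f}")
```

With 1753 opportunities in total, the lead runner on second scored on roughly 65 percent of singles and stopped at third on roughly 29 percent.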

 



Copyright 1997, John F. Jarvis