CoCalc -- WAR data.ipynb

Project: Brian Holliday - Courses/F19/Fall 2019

Path: projects/mort_Proj4/WAR data.ipynb

Kernel: R (R-Project)

Abstract: WAR (Wins Above Replacement) is a baseball statistic that measures a player’s value against a replacement-level player, meaning a player who gives you the production of a minimum-salary baseball player. It is an important statistic because it measures a player’s value against their peers in terms of the number of extra wins that they will give you.

We set out to do a regression analysis of this statistic to see which measures are integral to WAR. Traditionally, players were evaluated on classic baseball statistics such as batting average, home runs, stolen bases, and runs batted in, but we know that these stats do not tell the whole story. For example, they do not tell you how often a player gets on base or how often he strikes out. We want to know which statistics are the best indicators of WAR.

Regression analysis is a good tool for this because we can account for multiple factors that help determine a WAR score. With a standard error and t-value for each factor, we can judge how statistically significant each variable is. In the regression analysis we try to limit the number of factors that do not help the fit, since they make our adjusted R-squared worse, so we will try to make our adjusted R-squared as high as possible.
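As a rough sketch of what the adjusted R-squared score rewards, the helper below recomputes it from the multiple R-squared, the number of players n, and the number of fitted predictors p. The function name `adj_r2` is made up for this illustration; the formula is the standard one that R's `summary()` reports for a model with an intercept.

```r
# Adjusted R-squared for a model with an intercept:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from two of the summaries in this notebook (26 players each)
adj_r2(0.9782, n = 26, p = 19)  # 19 predictors, close to the reported 0.909
adj_r2(0.8657, n = 26, p = 11)  # 11 predictors, close to the reported 0.7602
```

Every extra predictor shrinks n - p - 1, so a variable has to raise the multiple R-squared enough to pay for the degree of freedom it uses.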

In [2]:

mlb<-read.csv('war-data (1).csv')

In [3]:

#Create our y: this is the WAR data (first column of the CSV)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)

In [2]:

#Create our b matrix
b<-matrix(1, nrow = 26, ncol = 1)

In [7]:

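#Create and populate the A matrix: columns 1-25 hold the predictors, column 26 is a column of ones
#Note: lm() adds its own intercept, so the ones column is redundant and is reported as NA
A<-matrix(0,nrow = 26, ncol = 26)
A[,26]<-b
for (k in 1:25) {A[,k]<-mlb[,k+1]}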

In [9]:

#Regression Analysis
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
        1         2         3         4         5         6         7         8 
-0.122076  0.239640  0.229624  0.200709 -0.297195 -0.254432  0.206614  0.221623 
        9        10        11        12        13        14        15        16 
 0.171591  0.076752 -0.046736 -0.305538  0.420955 -0.370879  0.193177 -0.095547 
       17        18        19        20        21        22        23        24 
-0.189749 -0.056964 -0.227216 -0.182676 -0.088005  0.396479  0.114347 -0.191193 
       25        26 
-0.045370  0.002065 

Coefficients: (2 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.155e+02  1.621e+02  -1.330    0.411
A1          -5.434e-01  3.442e-01  -1.579    0.359
A2           1.698e-01  2.090e-01   0.812    0.566
A3           2.046e+00  3.994e+00   0.512    0.699
A4          -1.709e+00  3.830e+00  -0.446    0.733
A5           2.291e-02  8.318e-02   0.275    0.829
A6          -1.102e+00  7.245e-01  -1.521    0.370
A7           3.943e-01  4.625e-01   0.853    0.551
A8           1.119e+00  9.502e-01   1.178    0.448
A9           1.292e+00  1.346e+00   0.959    0.513
A10         -8.679e-02  8.840e-02  -0.982    0.506
A11          6.152e-03  5.996e-02   0.103    0.935
A12          6.918e-02  2.718e-01   0.255    0.841
A13         -2.834e+00  4.686e+00  -0.605    0.654
A14         -3.071e-02  2.999e-02  -1.024    0.492
A15          1.372e+02  2.720e+02   0.504    0.703
A16          8.579e+02  1.120e+03   0.766    0.584
A17         -1.961e+02  5.539e+02  -0.354    0.783
A18         -7.897e+01  6.069e+02  -0.130    0.918
A19          2.272e-01  1.823e-01   1.246    0.430
A20                 NA         NA      NA       NA
A21          9.112e-02  1.176e-01   0.775    0.580
A22         -3.664e+00  5.181e+00  -0.707    0.608
A23         -3.578e+00  4.797e+00  -0.746    0.592
A24         -1.353e+00  3.479e+00  -0.389    0.764
A25         -1.461e-01  1.862e-01  -0.785    0.576
A26                 NA         NA      NA       NA

Residual standard error: 1.115 on 1 degrees of freedom
Multiple R-squared:  0.9829,	Adjusted R-squared:  0.5737 
F-statistic: 2.402 on 24 and 1 DF,  p-value: 0.4751

We see that our fit could be better. Delete A20 (TB) and A18 (OPS).
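The `NA` rows in the coefficient table above are R's way of flagging columns that are exact linear combinations of others. One such column is the column of ones in `A`: `lm()` already adds an intercept, so the ones column is redundant. A minimal sketch on made-up data (the variable names and numbers are invented for illustration) shows the effect and that dropping the ones column leaves the fit unchanged. The second `NA` (A20, Total Bases) is plausibly the same issue, since total bases can be computed from hits, doubles, triples, and home runs, though that depends on which columns the CSV holds.

```r
set.seed(1)
n  <- 26
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - x2 + rnorm(n, sd = 0.1)

# Design matrix with an explicit ones column, as in this notebook
A <- cbind(x1, x2, ones = 1)
fit_with_ones <- lm(y ~ A)
coef(fit_with_ones)   # the coefficient for the ones column comes back as NA

# Same model without the redundant column
fit_plain <- lm(y ~ x1 + x2)
coef(fit_plain)       # intercept and slopes match the non-NA values above
```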

In [14]:

#import new version of csv file
mlb<-read.csv('war-data (1).csv')
A<-matrix(0,nrow = 26, ncol = 24)
A[,24]<-b
for (k in 1:23) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
        1         2         3         4         5         6         7         8 
-0.084647  0.240122  0.226949  0.200364 -0.291981 -0.271040  0.195880  0.193228 
        9        10        11        12        13        14        15        16 
 0.182238  0.023081 -0.033878 -0.338058  0.429703 -0.394952  0.263633 -0.140128 
       17        18        19        20        21        22        23        24 
-0.181327 -0.073715 -0.219848 -0.166407 -0.044515  0.380878  0.080721 -0.162335 
       25        26 
-0.004325 -0.009641 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.081e+02  1.082e+02  -1.924    0.194
A1          -5.352e-01  2.413e-01  -2.218    0.157
A2           1.566e-01  1.303e-01   1.202    0.353
A3           1.931e+00  2.777e+00   0.695    0.559
A4          -1.603e+00  2.668e+00  -0.601    0.609
A5           2.311e-02  5.930e-02   0.390    0.734
A6          -1.065e+00  4.760e-01  -2.238    0.155
A7           3.706e-01  3.033e-01   1.222    0.346
A8           1.072e+00  6.264e-01   1.711    0.229
A9           1.220e+00  8.769e-01   1.392    0.299
A10         -8.362e-02  6.060e-02  -1.380    0.302
A11          4.432e-03  4.171e-02   0.106    0.925
A12          6.269e-02  1.905e-01   0.329    0.773
A13         -2.684e+00  3.239e+00  -0.829    0.494
A14         -2.960e-02  2.049e-02  -1.444    0.285
A15          1.339e+02  1.931e+02   0.693    0.560
A16          7.464e+02  5.145e+02   1.451    0.284
A17         -2.604e+02  1.781e+02  -1.462    0.281
A18          2.172e-01  1.180e-01   1.841    0.207
A19          8.737e-02  8.129e-02   1.075    0.395
A20         -3.479e+00  3.552e+00  -0.979    0.431
A21         -3.419e+00  3.307e+00  -1.034    0.410
A22         -1.264e+00  2.432e+00  -0.520    0.655
A23         -1.366e-01  1.222e-01  -1.118    0.380
A24                 NA         NA      NA       NA

Residual standard error: 0.7951 on 2 degrees of freedom
Multiple R-squared:  0.9827,	Adjusted R-squared:  0.7832 
F-statistic: 4.927 on 23 and 2 DF,  p-value: 0.1822

We got a better fit for the model. Delete A11 (SB) for a better model.

In [15]:

#Delete A11 SB from model
mlb<-read.csv('war-data (1).csv')
A<-matrix(0,nrow = 26, ncol = 23)
A[,23]<-b
for (k in 1:22) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8 
-0.10399  0.25595  0.22201  0.19290 -0.29475 -0.23564  0.19854  0.19308 
       9       10       11       12       13       14       15       16 
 0.16947  0.02393 -0.03173 -0.35546  0.42493 -0.40001  0.28874 -0.11398 
      17       18       19       20       21       22       23       24 
-0.18465 -0.11794 -0.20766 -0.13801 -0.04774  0.38501  0.08151 -0.16661 
      25       26 
-0.02249 -0.01541 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -205.97687   87.04486  -2.366   0.0988 .
A1            -0.53170    0.19570  -2.717   0.0727 .
A2             0.15658    0.10670   1.467   0.2385  
A3             1.84699    2.18011   0.847   0.4591  
A4            -1.52216    2.09445  -0.727   0.5199  
A5             0.01910    0.03744   0.510   0.6452  
A6            -1.05633    0.38368  -2.753   0.0706 .
A7             0.37508    0.24591   1.525   0.2246  
A8             1.08841    0.49729   2.189   0.1164  
A9             1.24136    0.69937   1.775   0.1740  
A10           -0.08752    0.03948  -2.217   0.1133  
A11            0.07588    0.11835   0.641   0.5670  
A12           -2.59513    2.56195  -1.013   0.3857  
A13           -0.03012    0.01630  -1.848   0.1618  
A14          136.01867  157.25300   0.865   0.4507  
A15          741.31845  419.45733   1.767   0.1753  
A16         -263.04378  144.45329  -1.821   0.1662  
A17            0.22074    0.09276   2.380   0.0977 .
A18            0.08440    0.06251   1.350   0.2697  
A19           -3.39706    2.83937  -1.196   0.3175  
A20           -3.33523    2.63024  -1.268   0.2943  
A21           -1.17861    1.87988  -0.627   0.5752  
A22           -0.13774    0.09969  -1.382   0.2610  
A23                 NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.651 on 3 degrees of freedom
Multiple R-squared:  0.9826,	Adjusted R-squared:  0.8547 
F-statistic: 7.683 on 22 and 3 DF,  p-value: 0.05887
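Each step in this notebook reloads an edited copy of the CSV with a column removed. A possible alternative, sketched here on made-up data, is to keep one file and drop predictors by name before refitting. The data frame `stats`, its column values, and the helper `fit_without` are placeholders invented for this illustration, not the real data.

```r
set.seed(42)
# Made-up stand-in for the real CSV: 26 players, response in the first column
stats <- data.frame(
  WAR = rnorm(26, mean = 6, sd = 1.5),
  AGE = sample(22:35, 26, replace = TRUE),
  SB  = rpois(26, 10),
  R   = rpois(26, 90)
)

# Hypothetical helper: refit after dropping the named predictors
fit_without <- function(data, drop) {
  keep <- setdiff(names(data), drop)
  lm(WAR ~ ., data = data[, keep, drop = FALSE])
}

fit_all   <- fit_without(stats, character(0))
fit_no_sb <- fit_without(stats, "SB")

# Compare the adjusted R-squared of the two fits
c(all   = summary(fit_all)$adj.r.squared,
  no_SB = summary(fit_no_sb)$adj.r.squared)
```

Working with named columns also keeps the coefficient labels readable (for example `SB` instead of `A11`), so the labels do not shift each time a column is removed.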

In [16]:

#Our adjusted R-squared is getting closer to our Multiple R-Squared score
#Delete A5 R from model for better fit
mlb<-read.csv('war-data (1).csv')
A<-matrix(0,nrow = 26, ncol = 22)
A[,22]<-b
for (k in 1:21) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8 
-0.07782  0.26687  0.22671  0.20076 -0.39230 -0.14239  0.25484  0.21917 
       9       10       11       12       13       14       15       16 
 0.08002  0.06437 -0.06608 -0.37385  0.49161 -0.28870  0.19461 -0.23578 
      17       18       19       20       21       22       23       24 
-0.18600 -0.19424 -0.31515 -0.07264 -0.01837  0.37586  0.06592 -0.14553 
      25       26 
 0.08836 -0.02026 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -192.63769   74.95185  -2.570  0.06197 . 
A1            -0.51432    0.17398  -2.956  0.04171 * 
A2             0.15263    0.09608   1.589  0.18734   
A3             1.63618    1.93249   0.847  0.44486   
A4            -1.33176    1.86056  -0.716  0.51369   
A5            -0.99718    0.33017  -3.020  0.03916 * 
A6             0.41473    0.21061   1.969  0.12028   
A7             1.17839    0.41974   2.807  0.04844 * 
A8             1.37429    0.58590   2.346  0.07889 . 
A9            -0.10337    0.02197  -4.704  0.00928 **
A10            0.08007    0.10658   0.751  0.49430   
A11           -2.38230    2.28202  -1.044  0.35545   
A12           -0.03229    0.01420  -2.273  0.08542 . 
A13          122.93523  140.06452   0.878  0.42966   
A14          735.57394  378.54495   1.943  0.12393   
A15         -282.71425  125.67659  -2.250  0.08769 . 
A16            0.24096    0.07571   3.183  0.03344 * 
A17            0.08033    0.05597   1.435  0.22452   
A18           -3.21521    2.54306  -1.264  0.27477   
A19           -3.10255    2.33856  -1.327  0.25528   
A20           -0.96321    1.65375  -0.582  0.59152   
A21           -0.14078    0.08984  -1.567  0.19219   
A22                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5877 on 4 degrees of freedom
Multiple R-squared:  0.981,	Adjusted R-squared:  0.8816 
F-statistic:  9.86 on 21 and 4 DF,  p-value: 0.01923

In [3]:

#adjusted R-squared improvement
#Delete A20 SF from model for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 21)
A[,21]<-b
for (k in 1:20) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8 
-0.01446  0.31200  0.11643  0.23669 -0.54171 -0.18726  0.21455  0.29241 
       9       10       11       12       13       14       15       16 
 0.09595  0.11559  0.09627 -0.43568  0.40766 -0.27398  0.17451 -0.19335 
      17       18       19       20       21       22       23       24 
-0.12908 -0.14836 -0.36732 -0.04200  0.01498  0.38707  0.02307 -0.09059 
      25       26 
 0.01664 -0.08004 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.593e+02  4.512e+01  -3.531  0.01672 * 
A1          -4.365e-01  1.037e-01  -4.208  0.00843 **
A2           1.133e-01  6.364e-02   1.780  0.13518   
A3           5.146e-01  1.512e-01   3.403  0.01918 * 
A4          -2.505e-01  1.160e-01  -2.160  0.08316 . 
A5          -8.645e-01  2.226e-01  -3.883  0.01160 * 
A6           3.134e-01  1.107e-01   2.832  0.03660 * 
A7           9.852e-01  2.396e-01   4.113  0.00924 **
A8           1.096e+00  3.158e-01   3.470  0.01785 * 
A9          -1.012e-01  2.019e-02  -5.015  0.00405 **
A10          4.664e-02  8.367e-02   0.557  0.60124   
A11         -1.068e+00  3.123e-01  -3.418  0.01888 * 
A12         -2.563e-02  7.854e-03  -3.264  0.02235 * 
A13          1.655e+02  1.113e+02   1.487  0.19706   
A14          5.443e+02  1.755e+02   3.101  0.02681 * 
A15         -2.225e+02  6.663e+01  -3.340  0.02056 * 
A16          2.081e-01  4.693e-02   4.433  0.00681 **
A17          6.696e-02  4.755e-02   1.408  0.21810   
A18         -1.762e+00  4.559e-01  -3.864  0.01183 * 
A19         -1.759e+00  3.624e-01  -4.856  0.00465 **
A20         -9.603e-02  4.338e-02  -2.213  0.07776 . 
A21                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5475 on 5 degrees of freedom
Multiple R-squared:  0.9794,	Adjusted R-squared:  0.8972 
F-statistic: 11.91 on 20 and 5 DF,  p-value: 0.006024

In [5]:

#Better fit after removal of A20
#Remove A10 CS for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 20)
A[,20]<-b
for (k in 1:19) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52475 -0.14784 -0.00275  0.18046  0.34833 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.521e+02  4.064e+01  -3.741  0.00961 **
A1          -4.307e-01  9.710e-02  -4.436  0.00440 **
A2           9.775e-02  5.383e-02   1.816  0.11929   
A3           4.990e-01  1.398e-01   3.570  0.01179 * 
A4          -2.436e-01  1.085e-01  -2.246  0.06585 . 
A5          -8.164e-01  1.931e-01  -4.229  0.00551 **
A6           2.914e-01  9.726e-02   2.996  0.02413 * 
A7           9.481e-01  2.165e-01   4.379  0.00468 **
A8           1.034e+00  2.781e-01   3.718  0.00987 **
A9          -1.022e-01  1.892e-02  -5.402  0.00166 **
A10         -1.032e+00  2.877e-01  -3.588  0.01154 * 
A11         -2.589e-02  7.376e-03  -3.511  0.01266 * 
A12          1.468e+02  9.983e+01   1.471  0.19176   
A13          5.245e+02  1.617e+02   3.244  0.01761 * 
A14         -2.079e+02  5.765e+01  -3.607  0.01127 * 
A15          1.969e-01  3.992e-02   4.932  0.00263 **
A16          5.650e-02  4.110e-02   1.375  0.21840   
A17         -1.685e+00  4.088e-01  -4.121  0.00621 **
A18         -1.688e+00  3.189e-01  -5.294  0.00184 **
A19         -9.087e-02  3.988e-02  -2.279  0.06290 . 
A20                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5151 on 6 degrees of freedom
Multiple R-squared:  0.9782,	Adjusted R-squared:  0.909 
F-statistic: 14.15 on 19 and 6 DF,  p-value: 0.001731

In [6]:

#Better fit after removal of A10
#Remove A16 GDP for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 19)
A[,19]<-b
for (k in 1:18) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73661 -0.19591  0.04551  0.21701  0.46571 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.278e+02  3.886e+01  -3.288  0.01334 * 
A1          -4.621e-01  1.002e-01  -4.611  0.00245 **
A2           9.543e-02  5.712e-02   1.671  0.13871   
A3           4.034e-01  1.287e-01   3.134  0.01653 * 
A4          -1.832e-01  1.053e-01  -1.740  0.12545   
A5          -7.320e-01  1.943e-01  -3.767  0.00701 **
A6           2.611e-01  1.006e-01   2.596  0.03561 * 
A7           8.148e-01  2.055e-01   3.964  0.00543 **
A8           9.494e-01  2.879e-01   3.298  0.01316 * 
A9          -9.111e-02  1.816e-02  -5.016  0.00154 **
A10         -8.434e-01  2.684e-01  -3.142  0.01632 * 
A11         -2.136e-02  7.005e-03  -3.050  0.01859 * 
A12          1.841e+02  1.020e+02   1.805  0.11411   
A13          4.229e+02  1.527e+02   2.770  0.02770 * 
A14         -1.941e+02  6.026e+01  -3.221  0.01464 * 
A15          1.876e-01  4.177e-02   4.492  0.00283 **
A16         -1.429e+00  3.865e-01  -3.698  0.00768 **
A17         -1.539e+00  3.185e-01  -4.834  0.00189 **
A18         -8.550e-02  4.213e-02  -2.029  0.08198 . 
A19                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5468 on 7 degrees of freedom
Multiple R-squared:  0.9713,	Adjusted R-squared:  0.8975 
F-statistic: 13.15 on 18 and 7 DF,  p-value: 0.0009777

In [8]:

#Worse fit after removal of GDP
#Remove A16 HBP and A17 SH for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 17)
A[,17]<-b
for (k in 1:16) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24900 -0.32868 -0.03965  0.40735  1.20246 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -4.619e+01  4.594e+01  -1.005   0.3409  
A1          -2.531e-01  1.524e-01  -1.661   0.1311  
A2           1.524e-02  7.881e-02   0.193   0.8509  
A3          -3.440e-02  1.134e-01  -0.303   0.7686  
A4           1.411e-01  1.316e-01   1.073   0.3114  
A5          -4.117e-01  2.567e-01  -1.604   0.1433  
A6          -1.553e-02  9.267e-02  -0.168   0.8706  
A7           2.165e-01  1.834e-01   1.181   0.2680  
A8           3.848e-02  2.701e-01   0.142   0.8899  
A9          -4.055e-02  2.693e-02  -1.506   0.1664  
A10          1.776e-01  9.939e-02   1.787   0.1076  
A11         -2.329e-03  1.069e-02  -0.218   0.8324  
A12          3.433e+02  1.751e+02   1.960   0.0816 .
A13         -1.291e+02  7.999e+01  -1.614   0.1409  
A14         -3.395e+00  5.282e+01  -0.064   0.9502  
A15          5.790e-02  4.836e-02   1.197   0.2618  
A16         -6.614e-02  7.049e-02  -0.938   0.3726  
A17                 NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 9 degrees of freedom
Multiple R-squared:  0.8746,	Adjusted R-squared:  0.6517 
F-statistic: 3.924 on 16 and 9 DF,  p-value: 0.02149

In [12]:

#Worse fit after removal of HBP and SH
#Add OPS back to model
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 18)
A[,18]<-b
for (k in 1:17) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14967 -0.36682 -0.06602  0.46206  1.07372 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.282e+01  4.762e+01  -0.899    0.395
A1          -2.727e-01  1.597e-01  -1.707    0.126
A2          -9.049e-04  8.465e-02  -0.011    0.992
A3          -2.870e-02  1.172e-01  -0.245    0.813
A4           1.326e-01  1.363e-01   0.973    0.359
A5          -3.873e-01  2.671e-01  -1.450    0.185
A6          -1.765e-02  9.560e-02  -0.185    0.858
A7           2.000e-01  1.906e-01   1.049    0.325
A8           9.077e-03  2.818e-01   0.032    0.975
A9          -3.177e-02  3.060e-02  -1.038    0.329
A10          1.610e-01  1.053e-01   1.529    0.165
A11         -1.087e-03  1.118e-02  -0.097    0.925
A12          3.150e+02  1.853e+02   1.700    0.128
A13         -4.360e+02  4.574e+02  -0.953    0.368
A14         -3.214e+02  4.694e+02  -0.685    0.513
A15          3.209e+02  4.706e+02   0.682    0.515
A16          5.269e-02  5.044e-02   1.044    0.327
A17         -5.117e-02  7.593e-02  -0.674    0.519
A18                 NA         NA      NA       NA

Residual standard error: 1.039 on 8 degrees of freedom
Multiple R-squared:  0.8815,	Adjusted R-squared:  0.6297 
F-statistic: 3.501 on 17 and 8 DF,  p-value: 0.03839

In [15]:

#Worse fit but better indicators for bad fit
#Remove A2 G and A11 SO for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 16)
A[,16]<-b
for (k in 1:15) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15039 -0.37060 -0.05043  0.46164  1.06846 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -4.424e+01  3.878e+01  -1.141  0.28050   
A1          -2.805e-01  1.124e-01  -2.495  0.03174 * 
A2          -2.646e-02  1.027e-01  -0.258  0.80183   
A3           1.330e-01  1.219e-01   1.091  0.30088   
A4          -3.985e-01  1.851e-01  -2.153  0.05675 . 
A5          -1.856e-02  8.235e-02  -0.225  0.82621   
A6           1.943e-01  1.614e-01   1.204  0.25634   
A7           6.746e-03  2.235e-01   0.030  0.97651   
A8          -3.088e-02  2.310e-02  -1.337  0.21099   
A9           1.605e-01  9.177e-02   1.749  0.11084   
A10          3.230e+02  8.836e+01   3.656  0.00442 **
A11         -4.484e+02  3.830e+02  -1.171  0.26893   
A12         -3.325e+02  3.931e+02  -0.846  0.41735   
A13          3.318e+02  3.887e+02   0.854  0.41327   
A14          5.258e-02  4.471e-02   1.176  0.26680   
A15         -5.155e-02  5.815e-02  -0.886  0.39618   
A16                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.93 on 10 degrees of freedom
Multiple R-squared:  0.8814,	Adjusted R-squared:  0.7034 
F-statistic: 4.952 on 15 and 10 DF,  p-value: 0.007328

In [18]:

#Better fit
#Remove A7 HR for better fit
mlb<-read.csv('war-data (1).csv')
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 15)
A[,15]<-b
for (k in 1:14) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14835 -0.37245 -0.05783  0.45458  1.07269 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -45.06086   26.33293  -1.711  0.11506   
A1            -0.28036    0.10712  -2.617  0.02394 * 
A2            -0.02576    0.09538  -0.270  0.79206   
A3             0.13364    0.11441   1.168  0.26745   
A4            -0.40150    0.14920  -2.691  0.02099 * 
A5            -0.02067    0.04162  -0.497  0.62928   
A6             0.19074    0.10638   1.793  0.10049   
A7            -0.03063    0.02054  -1.491  0.16396   
A8             0.15978    0.08429   1.895  0.08460 . 
A9           323.33972   83.68635   3.864  0.00264 **
A10         -447.90609  364.95574  -1.227  0.24533   
A11         -331.14733  372.23794  -0.890  0.39271   
A12          331.73939  370.62263   0.895  0.38991   
A13            0.05190    0.03674   1.412  0.18550   
A14           -0.05186    0.05453  -0.951  0.36196   
A15                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8868 on 11 degrees of freedom
Multiple R-squared:  0.8813,	Adjusted R-squared:  0.7303 
F-statistic: 5.836 on 14 and 11 DF,  p-value: 0.002848

In [2]:

#Worse fit but better indicators for bad fit
#Remove A4 2B  for better fit
mlb<-read.csv('war-data (1).csv')
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 14)
A[,14]<-b
for (k in 1:13) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08776 -0.41730 -0.07005  0.48441  1.06377 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -44.12974   25.07769  -1.760  0.10390   
A1            -0.26623    0.08979  -2.965  0.01181 * 
A2             0.10537    0.04440   2.373  0.03520 * 
A3            -0.39553    0.14174  -2.791  0.01633 * 
A4            -0.02419    0.03797  -0.637  0.53595   
A5             0.18949    0.10209   1.856  0.08816 . 
A6            -0.03203    0.01908  -1.679  0.11901   
A7             0.14192    0.05024   2.825  0.01532 * 
A8           326.23363   79.72747   4.092  0.00149 **
A9          -461.45267  347.25002  -1.329  0.20860   
A10         -335.21359  357.27830  -0.938  0.36662   
A11          337.61742  355.40498   0.950  0.36088   
A12            0.04930    0.03407   1.447  0.17349   
A13           -0.05774    0.04803  -1.202  0.25241   
A14                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8518 on 12 degrees of freedom
Multiple R-squared:  0.8806,	Adjusted R-squared:  0.7512 
F-statistic: 6.805 on 13 and 12 DF,  p-value: 0.001062

In [3]:

#Worse fit but better indicators for bad fit
#Remove A4 2B  for better fit
mlb<-read.csv('war-data (1).csv')
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 13)
A[,13]<-b
for (k in 1:12) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0228 -0.3696 -0.0893  0.5124  1.1570 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -35.32939   20.44839  -1.728  0.10770   
A1            -0.27246    0.08719  -3.125  0.00805 **
A2             0.08930    0.03569   2.502  0.02650 * 
A3            -0.35435    0.12323  -2.876  0.01301 * 
A4             0.19962    0.09852   2.026  0.06376 . 
A5            -0.02990    0.01835  -1.629  0.12724   
A6             0.15006    0.04746   3.162  0.00750 **
A7           311.56956   74.56980   4.178  0.00108 **
A8          -503.73164  332.97464  -1.513  0.15425   
A9          -372.58943  344.28540  -1.082  0.29882   
A10          373.52498  342.79913   1.090  0.29566   
A11            0.04467    0.03252   1.374  0.19277   
A12           -0.04672    0.04377  -1.068  0.30517   
A13                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8321 on 13 degrees of freedom
Multiple R-squared:  0.8765,	Adjusted R-squared:  0.7625 
F-statistic:  7.69 on 12 and 13 DF,  p-value: 0.0004322

In [1]:

#Remove A12 IBB for better fit
#Worse fit by a little...I'll stick with this model. 
mlb<-read.csv('war-data (1).csv')
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 12)
A[,12]<-b
for (k in 1:11) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99243 -0.50108 -0.03323  0.50326  1.20661 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -32.35895   20.35891  -1.589  0.13429   
A1            -0.29572    0.08485  -3.485  0.00364 **
A2             0.08446    0.03558   2.374  0.03246 * 
A3            -0.33448    0.12242  -2.732  0.01620 * 
A4             0.17842    0.09697   1.840  0.08709 . 
A5            -0.02535    0.01794  -1.413  0.17937   
A6             0.13098    0.04419   2.964  0.01025 * 
A7           290.18433   72.18564   4.020  0.00127 **
A8          -554.49045  331.20070  -1.674  0.11628   
A9          -441.33878  339.88981  -1.298  0.21511   
A10          440.12180  338.75052   1.299  0.21485   
A11            0.03939    0.03230   1.220  0.24275   
A12                 NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8363 on 14 degrees of freedom
Multiple R-squared:  0.8657,	Adjusted R-squared:  0.7602 
F-statistic: 8.204 on 11 and 14 DF,  p-value: 0.0002335
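The manual elimination above could also be automated. The sketch below uses R's built-in `step()` on made-up data. Note that `step()` selects by AIC rather than by adjusted R-squared and t-values, so it is a different criterion from the one used in this notebook and would not necessarily arrive at the same model. The data frame `sim` and its columns are invented for illustration.

```r
set.seed(7)
n <- 26
# Made-up predictors: only x1 and x2 actually drive the response
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 5 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n, sd = 0.5)

# Start from the full model and let step() drop terms one at a time
full    <- lm(y ~ ., data = sim)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
summary(reduced)$adj.r.squared
```

An automated search does not know that a stat like Sacrifice Hits is a poor baseball indicator, so the judgment calls described in this notebook would still have to be made by hand.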

In [5]:

### Refit the final model by hand (normal equations) to get the coefficients used to predict the WAR of 2019 batters:
AGE<-c(25,26,25,24,25,26,32,24,30,25,28,27,28,27,25,29,26,30,28,32,34,28,33,25,28,27)
AB<-c(520,471,547,661,578,574,539,594,569,606,554,596,618,590,598,632,413,593,534,564,596,480,365,433,586,395)
H<-c(180,147,152,183,156,187,166,170,188,176,162,170,191,175,174,192,115,172,169,145,159,119,114,114,178,104)
THREE_B<-c(5,4,6,2,4,7,2,1,2,9,5,4,4,2,6,3,0,5,2,0,1,3,1,8,3,2)
RBI<-c(80,79,68,92,105,110,38,103,130,111,75,93,98,110,108,60,67,
83,61,81,99,79,52,47,63,79)
BB<-c(81,122,58,70,106,68,71,96,69,29,35,70,76,73,47,61,76,90,55,102,78,90,47,80,32,79)
BA<-c(0.346,0.312,0.278,0.277,0.270,0.326,0.308,0.286,0.330,0.290,0.292,0.285,0.309,0.297,0.291,0.304,0.278,0.290,0.316,0.257,0.267,0.248,0.312,0.263,0.304,0.263)
OBP<-c(0.438,0.460,0.356,0.352,0.387,0.402,0.395,0.394,0.402,0.326,0.337,0.366,0.388,0.374,0.348,0.367,0.392,0.389,0.386,0.374,0.353,0.366,0.406,0.404,0.341,0.391)
SLG<-c(0.640,0.628,0.508,0.519,0.552,0.598,0.417,0.532,0.629,0.554,0.417,0.493,0.505,0.561,0.567,0.438,0.528,0.533,0.451,0.523,0.448,0.467,0.518,0.483,0.415,0.582)
OPS<-c(1.078,1.088,0.864,0.871,0.939,1,0.813,0.926,1.031,0.881,0.754,0.859,0.892,0.935,0.914,0.806,0.919,0.922,0.837,0.897,0.801,0.833,0.924,0.886,0.755,0.973)
OPS_P<-c(186,199,136,131,150,164,119,156,173,126,109,139,140,133,127,121,145,139,133,143,120,123,151,150,112,161)
b<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
WAR<-c(10.8,10.2,8.2,7.9,7.9,7.6,6.9,6.9,6.4,6.3,6.2,6.1,6.1,5.6,5.6,5.5,5.5,5.4,5.2,4.9,4.8,4.7,4.5,4.4,4.3,4.2)
Y<-matrix(c(WAR))
A<-matrix(c(AGE,AB,H,THREE_B,RBI,BB,BA,OBP,SLG,OPS,OPS_P,b),nrow=26,ncol=12)
M<-t(A)%*%A
X<-solve(M)%*%t(A)%*%Y
X

      [,1]         
 [1,]   -0.29572013
 [2,]    0.08446067
 [3,]   -0.33447882
 [4,]    0.17841504
 [5,]   -0.02535493
 [6,]    0.13098287
 [7,]  290.18432666
 [8,] -554.49044710
 [9,] -441.33877812
[10,]  440.12179753
[11,]    0.03939196
[12,]  -32.35894696
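The hand computation above solves the normal equations, t(A) %*% A %*% X = t(A) %*% Y, which is why `X` matches the coefficients that `lm()` printed for the final model. A small check of that equivalence on made-up data (the variables `x1`, `x2` and their values are invented for illustration):

```r
set.seed(3)
n  <- 26
x1 <- rnorm(n)
x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.2)

# Design matrix with the ones column last, as in this notebook
A <- cbind(x1, x2, 1)

# Normal equations; solve(M, rhs) avoids forming the inverse explicitly
X <- solve(t(A) %*% A, t(A) %*% Y)

# Same model through lm(), reordered so the intercept comes last
fit <- lm(Y ~ x1 + x2)
cbind(normal_eq = as.numeric(X),
      lm        = as.numeric(coef(fit)[c("x1", "x2", "(Intercept)")]))
```

Here the ones column is needed, because nothing adds an intercept automatically when the equations are solved by hand.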

In [7]:

#Let's see how good this model is at predicting the WAR of Mike Trout
27*-0.29572013 + 470*0.08446067 + 137*-0.33447882 + 2*0.17841504 + 104*-0.02535493 + 110*0.13098287 + .291*290.18432666 + .438*-554.49044710 + .645*-441.33877812 + 1.083*440.12179753 + 185*0.03939196 + 1*-32.35894696

[1] 6.51029

In [8]:

8.3 - 6.51028981584996

[1] 1.78971

Model is off by about 1.8

In [9]:

#Let's see how good this model is at predicting the WAR of Cody Bellinger
23*-0.29572013 + 588*0.08446067 + 170*-0.33447882 + 3*0.17841504 + 115*-0.02535493 + 95*0.13098287 + .305*290.18432666 + .406*-554.49044710 + .629*-441.33877812 + 1.035*440.12179753 + 169*0.03939196 + 1*-32.35894696

[1] 11.66807

In [10]:

9 - 11.66807378477
#Model is off by -2.668

[1] -2.668074
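The two predictions above type out every coefficient by hand. A sketch of the same arithmetic as a dot product follows, using the coefficients printed in `X` and the stat lines already entered for Mike Trout and Cody Bellinger. The names `beta`, `trout`, `bellinger`, and `predict_war` exist only for this illustration.

```r
# Coefficients copied from X above, in the order of the columns of A
beta <- c(AGE = -0.29572013, AB = 0.08446067, H = -0.33447882,
          THREE_B = 0.17841504, RBI = -0.02535493, BB = 0.13098287,
          BA = 290.18432666, OBP = -554.49044710, SLG = -441.33877812,
          OPS = 440.12179753, OPS_P = 0.03939196, b = -32.35894696)

# Stat lines copied from the two prediction cells above
trout     <- c(27, 470, 137, 2, 104, 110, 0.291, 0.438, 0.645, 1.083, 185, 1)
bellinger <- c(23, 588, 170, 3, 115,  95, 0.305, 0.406, 0.629, 1.035, 169, 1)

# Predicted WAR is the sum of each stat times its coefficient
predict_war <- function(stats, coefs) sum(stats * coefs)

predicted <- c(Trout     = predict_war(trout, beta),
               Bellinger = predict_war(bellinger, beta))
actual    <- c(Trout = 8.3, Bellinger = 9)

# Actual minus predicted, matching the subtractions above
round(actual - predicted, 3)
```

Keeping the coefficients in one named vector makes it harder to pair a stat with the wrong coefficient when more players are added.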

For the regression analysis, I used R in a CoCalc Jupyter notebook. This made it easier to add and delete data while keeping the work clean. I started out with 25 different factors and was able to narrow them down to 11 by adding and deleting variables.

Over the course of the project, it became clear that finding a fit I was comfortable with was going to be both an art and a science. Some statistics that are not great indicators of how good a batter is were getting high t-values. Keeping them would have yielded a better fit, but I could not keep them in good faith.

The first few that were deleted were obvious bad fits, such as Total Bases, On-Base plus Slugging (which I later added back), and Stolen Bases, among others. Deleting these did help me get a better fit: the adjusted R-squared went from 0.5737 to 0.909. Although that is a great fit, as I mentioned before there were some stats that I felt needed to be removed from the model. I will talk about two examples of the ones that I deleted.

The first example is Hit by Pitch. This stat only tells us how many times a player gets unintentionally hit by a pitch. At the major league level, getting hit by a pitch is not a common occurrence. The top five players had only 8 to 10 in this category, and most players had over 600 plate appearances. Players should not expect to be hit by a pitch when they step into a Major League batter’s box. Yet Hit by Pitch had a t-value of -3.698, indicating that it is a statistically significant variable.

Another example is Sacrifice Hits. Out of the 26 players with the highest WAR scores, 18 had a zero in this category. Sacrifice Hits are situational and not great indicators of success per plate appearance. They happen when a player moves baserunners into scoring position by bunting. The problem is that we are looking at the top 26 players by WAR, who probably provide valuable offense to their teams. If they are among the best players, why would you not just let them try to get a hit? A player could also provide more value by getting a hit, because then both the batter and the baserunners could be on base. Despite this, Sacrifice Hits had a large t-value of -4.834.
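The t-values quoted for Hit by Pitch and Sacrifice Hits are each coefficient's estimate divided by its standard error, and the p-value comes from a t distribution with the model's residual degrees of freedom. The sketch below uses the rounded numbers printed for A16 and A17 in the 18-predictor summary (7 degrees of freedom), so the results only approximately match the printed -3.698 and -4.834.

```r
# Estimate and standard error as printed in the summary (rounded)
est <- c(HBP = -1.429, SH = -1.539)
se  <- c(HBP = 0.3865, SH = 0.3185)
df  <- 7  # residual degrees of freedom of that fit

t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df)  # two-sided p-value

round(t_val, 3)
signif(p_val, 3)
```

A large t-value only says the coefficient is unlikely to be zero within this sample of 26 players. It does not say the stat is a sensible measure of batting ability, which is the judgment made in the text.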

There were more stats like this that I removed, but I feel that I was able to settle on a model that included stats like Batting Average, Slugging Percentage, and On-Base plus Slugging. What I like about these stats is that they tell you what a player is doing on a per-at-bat basis. For example, On-Base plus Slugging takes into account many aspects of great hitting. Slugging Percentage tells us about a player’s ability to hit for average and power, using singles, doubles, triples, home runs, and at-bats in its calculation. On-Base Percentage takes into account the main ways that a player can get on base, including hits, walks, and hit by pitch. Add On-Base Percentage and Slugging Percentage and you get On-Base plus Slugging, a high-quality statistic that tells us a lot about a batter.
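Because On-Base plus Slugging is the sum of the other two stats, the three columns carry nearly the same information. The quick check below uses the first seven players' values from the `OBP`, `SLG`, and `OPS` vectors entered earlier in the notebook. This overlap may be part of why the OBP, SLG, and OPS coefficients in the final model are so large and nearly cancel each other, though that reading is an assumption rather than something tested here.

```r
# First seven players' values, copied from the vectors defined earlier
OBP <- c(0.438, 0.460, 0.356, 0.352, 0.387, 0.402, 0.395)
SLG <- c(0.640, 0.628, 0.508, 0.519, 0.552, 0.598, 0.417)
OPS <- c(1.078, 1.088, 0.864, 0.871, 0.939, 1.000, 0.813)

# Differences are at most about 0.001, which is rounding in the published stats
gap <- OPS - (OBP + SLG)
max(abs(gap))

# Near-perfect correlation between OPS and OBP + SLG
cor(OPS, OBP + SLG)
```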

I was able to get an adjusted R-squared of 0.7602, which is not as good as 0.909. When I tested the model on two of the best players in MLB in the 2019 season, Mike Trout and Cody Bellinger, I was off by about 1.8 and 2.7 WAR. Obviously, this model has room for improvement. Perhaps I should not have taken out as many stats that had high t-values, even though they might not be the greatest indicators of batting ability. Next time I do a project with regression analysis, I would like to build a model with a better balance between stats that make sense and stats that give me a better fit. Perhaps doing a regression analysis on a subject where I have no background knowledge would improve my model, because I would be looking at the numbers with no bias. This project has shown me that the applications of regression analysis are endless: it fits pretty much any situation in which a number of independent factors lead to a statistic.