Abstract
WAR (Wins Above Replacement) is a baseball statistic that measures a player’s value against a replacement-level player, i.e., a player who gives you the production of a minimum-salary player. This is an important statistic because it expresses a player’s value relative to his peers as the number of extra wins he provides.

We set out to perform a regression analysis of this statistic to see which measures are integral to WAR. Traditionally, players were evaluated on classic baseball statistics such as batting average, home runs, stolen bases, and runs batted in, but we know that these stats do not tell the whole story. For example, they do not tell you how often a player gets on base, or his strikeout rate. We want to know which statistics are the best indicators of WAR.

Regression analysis is a good tool for this because it can account for the many factors that go into a WAR score. With a standard error and t-value for each factor, we can determine how statistically significant each variable is. In the analysis we try to limit the number of weak predictors, since they drag the adjusted R-squared down; our goal is to make the adjusted R-squared as high as possible.
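As a minimal sketch of this workflow (using R's built-in mtcars data rather than our WAR data, purely for illustration):

```r
# Minimal sketch on R's built-in mtcars data (not our WAR data): lm() fits the
# regression, and summary() reports each coefficient's standard error and
# t-value, plus the adjusted R-squared for the whole fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                # coefficient table with t-values
summary(fit)$adj.r.squared  # single number summarizing fit quality
```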


In [2]:

mlb<-read.csv('war-data (1).csv')


In [3]:

#Create our y...this is the WAR data
y <- matrix(mlb[,1], nrow = 26, ncol = 1)


In [2]:

#Create our b matrix (column of ones for the intercept)
b <- matrix(1, nrow = 26, ncol = 1)


In [7]:

#Create and populate the A matrix: 25 predictor columns plus the column of ones
A <- matrix(0, nrow = 26, ncol = 26)
A[,26] <- b
for (k in 1:25) {A[,k] <- mlb[,k+1]}


In [9]:

#Regression analysis
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
1 2 3 4 5 6 7 8
-0.122076 0.239640 0.229624 0.200709 -0.297195 -0.254432 0.206614 0.221623
9 10 11 12 13 14 15 16
0.171591 0.076752 -0.046736 -0.305538 0.420955 -0.370879 0.193177 -0.095547
17 18 19 20 21 22 23 24
-0.189749 -0.056964 -0.227216 -0.182676 -0.088005 0.396479 0.114347 -0.191193
25 26
-0.045370 0.002065
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.155e+02 1.621e+02 -1.330 0.411
A1 -5.434e-01 3.442e-01 -1.579 0.359
A2 1.698e-01 2.090e-01 0.812 0.566
A3 2.046e+00 3.994e+00 0.512 0.699
A4 -1.709e+00 3.830e+00 -0.446 0.733
A5 2.291e-02 8.318e-02 0.275 0.829
A6 -1.102e+00 7.245e-01 -1.521 0.370
A7 3.943e-01 4.625e-01 0.853 0.551
A8 1.119e+00 9.502e-01 1.178 0.448
A9 1.292e+00 1.346e+00 0.959 0.513
A10 -8.679e-02 8.840e-02 -0.982 0.506
A11 6.152e-03 5.996e-02 0.103 0.935
A12 6.918e-02 2.718e-01 0.255 0.841
A13 -2.834e+00 4.686e+00 -0.605 0.654
A14 -3.071e-02 2.999e-02 -1.024 0.492
A15 1.372e+02 2.720e+02 0.504 0.703
A16 8.579e+02 1.120e+03 0.766 0.584
A17 -1.961e+02 5.539e+02 -0.354 0.783
A18 -7.897e+01 6.069e+02 -0.130 0.918
A19 2.272e-01 1.823e-01 1.246 0.430
A20 NA NA NA NA
A21 9.112e-02 1.176e-01 0.775 0.580
A22 -3.664e+00 5.181e+00 -0.707 0.608
A23 -3.578e+00 4.797e+00 -0.746 0.592
A24 -1.353e+00 3.479e+00 -0.389 0.764
A25 -1.461e-01 1.862e-01 -0.785 0.576
A26 NA NA NA NA
Residual standard error: 1.115 on 1 degrees of freedom
Multiple R-squared: 0.9829, Adjusted R-squared: 0.5737
F-statistic: 2.402 on 24 and 1 DF, p-value: 0.4751

We see that the fit could be better. Delete A20 (TB) and A18 (OPS).


In [14]:

#Import the new version of the csv file (TB and OPS columns removed)
mlb <- read.csv('war-data (1).csv')
A <- matrix(0, nrow = 26, ncol = 24)
A[,24] <- b
for (k in 1:23) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
1 2 3 4 5 6 7 8
-0.084647 0.240122 0.226949 0.200364 -0.291981 -0.271040 0.195880 0.193228
9 10 11 12 13 14 15 16
0.182238 0.023081 -0.033878 -0.338058 0.429703 -0.394952 0.263633 -0.140128
17 18 19 20 21 22 23 24
-0.181327 -0.073715 -0.219848 -0.166407 -0.044515 0.380878 0.080721 -0.162335
25 26
-0.004325 -0.009641
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.081e+02 1.082e+02 -1.924 0.194
A1 -5.352e-01 2.413e-01 -2.218 0.157
A2 1.566e-01 1.303e-01 1.202 0.353
A3 1.931e+00 2.777e+00 0.695 0.559
A4 -1.603e+00 2.668e+00 -0.601 0.609
A5 2.311e-02 5.930e-02 0.390 0.734
A6 -1.065e+00 4.760e-01 -2.238 0.155
A7 3.706e-01 3.033e-01 1.222 0.346
A8 1.072e+00 6.264e-01 1.711 0.229
A9 1.220e+00 8.769e-01 1.392 0.299
A10 -8.362e-02 6.060e-02 -1.380 0.302
A11 4.432e-03 4.171e-02 0.106 0.925
A12 6.269e-02 1.905e-01 0.329 0.773
A13 -2.684e+00 3.239e+00 -0.829 0.494
A14 -2.960e-02 2.049e-02 -1.444 0.285
A15 1.339e+02 1.931e+02 0.693 0.560
A16 7.464e+02 5.145e+02 1.451 0.284
A17 -2.604e+02 1.781e+02 -1.462 0.281
A18 2.172e-01 1.180e-01 1.841 0.207
A19 8.737e-02 8.129e-02 1.075 0.395
A20 -3.479e+00 3.552e+00 -0.979 0.431
A21 -3.419e+00 3.307e+00 -1.034 0.410
A22 -1.264e+00 2.432e+00 -0.520 0.655
A23 -1.366e-01 1.222e-01 -1.118 0.380
A24 NA NA NA NA
Residual standard error: 0.7951 on 2 degrees of freedom
Multiple R-squared: 0.9827, Adjusted R-squared: 0.7832
F-statistic: 4.927 on 23 and 2 DF, p-value: 0.1822

We got a better fit for the model. Delete A11 (SB) for a better model.


In [15]:

#Delete A11 (SB) from the model
mlb <- read.csv('war-data (1).csv')
A <- matrix(0, nrow = 26, ncol = 23)
A[,23] <- b
for (k in 1:22) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
1 2 3 4 5 6 7 8
-0.10399 0.25595 0.22201 0.19290 -0.29475 -0.23564 0.19854 0.19308
9 10 11 12 13 14 15 16
0.16947 0.02393 -0.03173 -0.35546 0.42493 -0.40001 0.28874 -0.11398
17 18 19 20 21 22 23 24
-0.18465 -0.11794 -0.20766 -0.13801 -0.04774 0.38501 0.08151 -0.16661
25 26
-0.02249 -0.01541
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -205.97687 87.04486 -2.366 0.0988 .
A1 -0.53170 0.19570 -2.717 0.0727 .
A2 0.15658 0.10670 1.467 0.2385
A3 1.84699 2.18011 0.847 0.4591
A4 -1.52216 2.09445 -0.727 0.5199
A5 0.01910 0.03744 0.510 0.6452
A6 -1.05633 0.38368 -2.753 0.0706 .
A7 0.37508 0.24591 1.525 0.2246
A8 1.08841 0.49729 2.189 0.1164
A9 1.24136 0.69937 1.775 0.1740
A10 -0.08752 0.03948 -2.217 0.1133
A11 0.07588 0.11835 0.641 0.5670
A12 -2.59513 2.56195 -1.013 0.3857
A13 -0.03012 0.01630 -1.848 0.1618
A14 136.01867 157.25300 0.865 0.4507
A15 741.31845 419.45733 1.767 0.1753
A16 -263.04378 144.45329 -1.821 0.1662
A17 0.22074 0.09276 2.380 0.0977 .
A18 0.08440 0.06251 1.350 0.2697
A19 -3.39706 2.83937 -1.196 0.3175
A20 -3.33523 2.63024 -1.268 0.2943
A21 -1.17861 1.87988 -0.627 0.5752
A22 -0.13774 0.09969 -1.382 0.2610
A23 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.651 on 3 degrees of freedom
Multiple R-squared: 0.9826, Adjusted R-squared: 0.8547
F-statistic: 7.683 on 22 and 3 DF, p-value: 0.05887

In [16]:

#Our adjusted R-squared is moving toward our multiple R-squared score
#Delete A5 (R) from the model for a better fit
mlb <- read.csv('war-data (1).csv')
A <- matrix(0, nrow = 26, ncol = 22)
A[,22] <- b
for (k in 1:21) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
1 2 3 4 5 6 7 8
-0.07782 0.26687 0.22671 0.20076 -0.39230 -0.14239 0.25484 0.21917
9 10 11 12 13 14 15 16
0.08002 0.06437 -0.06608 -0.37385 0.49161 -0.28870 0.19461 -0.23578
17 18 19 20 21 22 23 24
-0.18600 -0.19424 -0.31515 -0.07264 -0.01837 0.37586 0.06592 -0.14553
25 26
0.08836 -0.02026
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -192.63769 74.95185 -2.570 0.06197 .
A1 -0.51432 0.17398 -2.956 0.04171 *
A2 0.15263 0.09608 1.589 0.18734
A3 1.63618 1.93249 0.847 0.44486
A4 -1.33176 1.86056 -0.716 0.51369
A5 -0.99718 0.33017 -3.020 0.03916 *
A6 0.41473 0.21061 1.969 0.12028
A7 1.17839 0.41974 2.807 0.04844 *
A8 1.37429 0.58590 2.346 0.07889 .
A9 -0.10337 0.02197 -4.704 0.00928 **
A10 0.08007 0.10658 0.751 0.49430
A11 -2.38230 2.28202 -1.044 0.35545
A12 -0.03229 0.01420 -2.273 0.08542 .
A13 122.93523 140.06452 0.878 0.42966
A14 735.57394 378.54495 1.943 0.12393
A15 -282.71425 125.67659 -2.250 0.08769 .
A16 0.24096 0.07571 3.183 0.03344 *
A17 0.08033 0.05597 1.435 0.22452
A18 -3.21521 2.54306 -1.264 0.27477
A19 -3.10255 2.33856 -1.327 0.25528
A20 -0.96321 1.65375 -0.582 0.59152
A21 -0.14078 0.08984 -1.567 0.19219
A22 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5877 on 4 degrees of freedom
Multiple R-squared: 0.981, Adjusted R-squared: 0.8816
F-statistic: 9.86 on 21 and 4 DF, p-value: 0.01923

In [3]:

#Adjusted R-squared improved
#Delete A20 (SF) from the model for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 21)
A[,21] <- b
for (k in 1:20) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
1 2 3 4 5 6 7 8
-0.01446 0.31200 0.11643 0.23669 -0.54171 -0.18726 0.21455 0.29241
9 10 11 12 13 14 15 16
0.09595 0.11559 0.09627 -0.43568 0.40766 -0.27398 0.17451 -0.19335
17 18 19 20 21 22 23 24
-0.12908 -0.14836 -0.36732 -0.04200 0.01498 0.38707 0.02307 -0.09059
25 26
0.01664 -0.08004
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.593e+02 4.512e+01 -3.531 0.01672 *
A1 -4.365e-01 1.037e-01 -4.208 0.00843 **
A2 1.133e-01 6.364e-02 1.780 0.13518
A3 5.146e-01 1.512e-01 3.403 0.01918 *
A4 -2.505e-01 1.160e-01 -2.160 0.08316 .
A5 -8.645e-01 2.226e-01 -3.883 0.01160 *
A6 3.134e-01 1.107e-01 2.832 0.03660 *
A7 9.852e-01 2.396e-01 4.113 0.00924 **
A8 1.096e+00 3.158e-01 3.470 0.01785 *
A9 -1.012e-01 2.019e-02 -5.015 0.00405 **
A10 4.664e-02 8.367e-02 0.557 0.60124
A11 -1.068e+00 3.123e-01 -3.418 0.01888 *
A12 -2.563e-02 7.854e-03 -3.264 0.02235 *
A13 1.655e+02 1.113e+02 1.487 0.19706
A14 5.443e+02 1.755e+02 3.101 0.02681 *
A15 -2.225e+02 6.663e+01 -3.340 0.02056 *
A16 2.081e-01 4.693e-02 4.433 0.00681 **
A17 6.696e-02 4.755e-02 1.408 0.21810
A18 -1.762e+00 4.559e-01 -3.864 0.01183 *
A19 -1.759e+00 3.624e-01 -4.856 0.00465 **
A20 -9.603e-02 4.338e-02 -2.213 0.07776 .
A21 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5475 on 5 degrees of freedom
Multiple R-squared: 0.9794, Adjusted R-squared: 0.8972
F-statistic: 11.91 on 20 and 5 DF, p-value: 0.006024

In [5]:

#Better fit after removal of A20
#Remove A10 (CS) for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 20)
A[,20] <- b
for (k in 1:19) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-0.52475 -0.14784 -0.00275 0.18046 0.34833
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.521e+02 4.064e+01 -3.741 0.00961 **
A1 -4.307e-01 9.710e-02 -4.436 0.00440 **
A2 9.775e-02 5.383e-02 1.816 0.11929
A3 4.990e-01 1.398e-01 3.570 0.01179 *
A4 -2.436e-01 1.085e-01 -2.246 0.06585 .
A5 -8.164e-01 1.931e-01 -4.229 0.00551 **
A6 2.914e-01 9.726e-02 2.996 0.02413 *
A7 9.481e-01 2.165e-01 4.379 0.00468 **
A8 1.034e+00 2.781e-01 3.718 0.00987 **
A9 -1.022e-01 1.892e-02 -5.402 0.00166 **
A10 -1.032e+00 2.877e-01 -3.588 0.01154 *
A11 -2.589e-02 7.376e-03 -3.511 0.01266 *
A12 1.468e+02 9.983e+01 1.471 0.19176
A13 5.245e+02 1.617e+02 3.244 0.01761 *
A14 -2.079e+02 5.765e+01 -3.607 0.01127 *
A15 1.969e-01 3.992e-02 4.932 0.00263 **
A16 5.650e-02 4.110e-02 1.375 0.21840
A17 -1.685e+00 4.088e-01 -4.121 0.00621 **
A18 -1.688e+00 3.189e-01 -5.294 0.00184 **
A19 -9.087e-02 3.988e-02 -2.279 0.06290 .
A20 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5151 on 6 degrees of freedom
Multiple R-squared: 0.9782, Adjusted R-squared: 0.909
F-statistic: 14.15 on 19 and 6 DF, p-value: 0.001731

In [6]:

#Better fit after removal of A10
#Remove A16 (GDP) for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 19)
A[,19] <- b
for (k in 1:18) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-0.73661 -0.19591 0.04551 0.21701 0.46571
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.278e+02 3.886e+01 -3.288 0.01334 *
A1 -4.621e-01 1.002e-01 -4.611 0.00245 **
A2 9.543e-02 5.712e-02 1.671 0.13871
A3 4.034e-01 1.287e-01 3.134 0.01653 *
A4 -1.832e-01 1.053e-01 -1.740 0.12545
A5 -7.320e-01 1.943e-01 -3.767 0.00701 **
A6 2.611e-01 1.006e-01 2.596 0.03561 *
A7 8.148e-01 2.055e-01 3.964 0.00543 **
A8 9.494e-01 2.879e-01 3.298 0.01316 *
A9 -9.111e-02 1.816e-02 -5.016 0.00154 **
A10 -8.434e-01 2.684e-01 -3.142 0.01632 *
A11 -2.136e-02 7.005e-03 -3.050 0.01859 *
A12 1.841e+02 1.020e+02 1.805 0.11411
A13 4.229e+02 1.527e+02 2.770 0.02770 *
A14 -1.941e+02 6.026e+01 -3.221 0.01464 *
A15 1.876e-01 4.177e-02 4.492 0.00283 **
A16 -1.429e+00 3.865e-01 -3.698 0.00768 **
A17 -1.539e+00 3.185e-01 -4.834 0.00189 **
A18 -8.550e-02 4.213e-02 -2.029 0.08198 .
A19 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5468 on 7 degrees of freedom
Multiple R-squared: 0.9713, Adjusted R-squared: 0.8975
F-statistic: 13.15 on 18 and 7 DF, p-value: 0.0009777

In [8]:

#Worse fit after removal of GDP
#Remove A16 (HBP) and A17 (SH) for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 17)
A[,17] <- b
for (k in 1:16) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.24900 -0.32868 -0.03965 0.40735 1.20246
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.619e+01 4.594e+01 -1.005 0.3409
A1 -2.531e-01 1.524e-01 -1.661 0.1311
A2 1.524e-02 7.881e-02 0.193 0.8509
A3 -3.440e-02 1.134e-01 -0.303 0.7686
A4 1.411e-01 1.316e-01 1.073 0.3114
A5 -4.117e-01 2.567e-01 -1.604 0.1433
A6 -1.553e-02 9.267e-02 -0.168 0.8706
A7 2.165e-01 1.834e-01 1.181 0.2680
A8 3.848e-02 2.701e-01 0.142 0.8899
A9 -4.055e-02 2.693e-02 -1.506 0.1664
A10 1.776e-01 9.939e-02 1.787 0.1076
A11 -2.329e-03 1.069e-02 -0.218 0.8324
A12 3.433e+02 1.751e+02 1.960 0.0816 .
A13 -1.291e+02 7.999e+01 -1.614 0.1409
A14 -3.395e+00 5.282e+01 -0.064 0.9502
A15 5.790e-02 4.836e-02 1.197 0.2618
A16 -6.614e-02 7.049e-02 -0.938 0.3726
A17 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.008 on 9 degrees of freedom
Multiple R-squared: 0.8746, Adjusted R-squared: 0.6517
F-statistic: 3.924 on 16 and 9 DF, p-value: 0.02149

In [12]:

#Worse fit after removal of HBP and SH
#Add OPS back to the model
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 18)
A[,18] <- b
for (k in 1:17) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.14967 -0.36682 -0.06602 0.46206 1.07372
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.282e+01 4.762e+01 -0.899 0.395
A1 -2.727e-01 1.597e-01 -1.707 0.126
A2 -9.049e-04 8.465e-02 -0.011 0.992
A3 -2.870e-02 1.172e-01 -0.245 0.813
A4 1.326e-01 1.363e-01 0.973 0.359
A5 -3.873e-01 2.671e-01 -1.450 0.185
A6 -1.765e-02 9.560e-02 -0.185 0.858
A7 2.000e-01 1.906e-01 1.049 0.325
A8 9.077e-03 2.818e-01 0.032 0.975
A9 -3.177e-02 3.060e-02 -1.038 0.329
A10 1.610e-01 1.053e-01 1.529 0.165
A11 -1.087e-03 1.118e-02 -0.097 0.925
A12 3.150e+02 1.853e+02 1.700 0.128
A13 -4.360e+02 4.574e+02 -0.953 0.368
A14 -3.214e+02 4.694e+02 -0.685 0.513
A15 3.209e+02 4.706e+02 0.682 0.515
A16 5.269e-02 5.044e-02 1.044 0.327
A17 -5.117e-02 7.593e-02 -0.674 0.519
A18 NA NA NA NA
Residual standard error: 1.039 on 8 degrees of freedom
Multiple R-squared: 0.8815, Adjusted R-squared: 0.6297
F-statistic: 3.501 on 17 and 8 DF, p-value: 0.03839

In [15]:

#Worse fit, but better indicators of a bad fit
#Remove A2 (G) and A11 (SO) for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 16)
A[,16] <- b
for (k in 1:15) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.15039 -0.37060 -0.05043 0.46164 1.06846
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.424e+01 3.878e+01 -1.141 0.28050
A1 -2.805e-01 1.124e-01 -2.495 0.03174 *
A2 -2.646e-02 1.027e-01 -0.258 0.80183
A3 1.330e-01 1.219e-01 1.091 0.30088
A4 -3.985e-01 1.851e-01 -2.153 0.05675 .
A5 -1.856e-02 8.235e-02 -0.225 0.82621
A6 1.943e-01 1.614e-01 1.204 0.25634
A7 6.746e-03 2.235e-01 0.030 0.97651
A8 -3.088e-02 2.310e-02 -1.337 0.21099
A9 1.605e-01 9.177e-02 1.749 0.11084
A10 3.230e+02 8.836e+01 3.656 0.00442 **
A11 -4.484e+02 3.830e+02 -1.171 0.26893
A12 -3.325e+02 3.931e+02 -0.846 0.41735
A13 3.318e+02 3.887e+02 0.854 0.41327
A14 5.258e-02 4.471e-02 1.176 0.26680
A15 -5.155e-02 5.815e-02 -0.886 0.39618
A16 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.93 on 10 degrees of freedom
Multiple R-squared: 0.8814, Adjusted R-squared: 0.7034
F-statistic: 4.952 on 15 and 10 DF, p-value: 0.007328

In [18]:

#Better fit
#Remove A7 (HR) for a better fit
mlb <- read.csv('war-data (1).csv')
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 15)
A[,15] <- b
for (k in 1:14) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.14835 -0.37245 -0.05783 0.45458 1.07269
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -45.06086 26.33293 -1.711 0.11506
A1 -0.28036 0.10712 -2.617 0.02394 *
A2 -0.02576 0.09538 -0.270 0.79206
A3 0.13364 0.11441 1.168 0.26745
A4 -0.40150 0.14920 -2.691 0.02099 *
A5 -0.02067 0.04162 -0.497 0.62928
A6 0.19074 0.10638 1.793 0.10049
A7 -0.03063 0.02054 -1.491 0.16396
A8 0.15978 0.08429 1.895 0.08460 .
A9 323.33972 83.68635 3.864 0.00264 **
A10 -447.90609 364.95574 -1.227 0.24533
A11 -331.14733 372.23794 -0.890 0.39271
A12 331.73939 370.62263 0.895 0.38991
A13 0.05190 0.03674 1.412 0.18550
A14 -0.05186 0.05453 -0.951 0.36196
A15 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8868 on 11 degrees of freedom
Multiple R-squared: 0.8813, Adjusted R-squared: 0.7303
F-statistic: 5.836 on 14 and 11 DF, p-value: 0.002848

In [2]:

#Worse fit, but better indicators of a bad fit
#Remove A4 (2B) for a better fit
mlb <- read.csv('war-data (1).csv')
b <- matrix(1, nrow = 26, ncol = 1)
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 14)
A[,14] <- b
for (k in 1:13) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.08776 -0.41730 -0.07005 0.48441 1.06377
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -44.12974 25.07769 -1.760 0.10390
A1 -0.26623 0.08979 -2.965 0.01181 *
A2 0.10537 0.04440 2.373 0.03520 *
A3 -0.39553 0.14174 -2.791 0.01633 *
A4 -0.02419 0.03797 -0.637 0.53595
A5 0.18949 0.10209 1.856 0.08816 .
A6 -0.03203 0.01908 -1.679 0.11901
A7 0.14192 0.05024 2.825 0.01532 *
A8 326.23363 79.72747 4.092 0.00149 **
A9 -461.45267 347.25002 -1.329 0.20860
A10 -335.21359 357.27830 -0.938 0.36662
A11 337.61742 355.40498 0.950 0.36088
A12 0.04930 0.03407 1.447 0.17349
A13 -0.05774 0.04803 -1.202 0.25241
A14 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8518 on 12 degrees of freedom
Multiple R-squared: 0.8806, Adjusted R-squared: 0.7512
F-statistic: 6.805 on 13 and 12 DF, p-value: 0.001062

In [3]:

#Worse fit, but better indicators of a bad fit
#Remove A4 2B for a better fit
mlb <- read.csv('war-data (1).csv')
b <- matrix(1, nrow = 26, ncol = 1)
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 13)
A[,13] <- b
for (k in 1:12) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.0228 -0.3696 -0.0893 0.5124 1.1570
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.32939 20.44839 -1.728 0.10770
A1 -0.27246 0.08719 -3.125 0.00805 **
A2 0.08930 0.03569 2.502 0.02650 *
A3 -0.35435 0.12323 -2.876 0.01301 *
A4 0.19962 0.09852 2.026 0.06376 .
A5 -0.02990 0.01835 -1.629 0.12724
A6 0.15006 0.04746 3.162 0.00750 **
A7 311.56956 74.56980 4.178 0.00108 **
A8 -503.73164 332.97464 -1.513 0.15425
A9 -372.58943 344.28540 -1.082 0.29882
A10 373.52498 342.79913 1.090 0.29566
A11 0.04467 0.03252 1.374 0.19277
A12 -0.04672 0.04377 -1.068 0.30517
A13 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8321 on 13 degrees of freedom
Multiple R-squared: 0.8765, Adjusted R-squared: 0.7625
F-statistic: 7.69 on 12 and 13 DF, p-value: 0.0004322

In [1]:

#Remove A12 (IBB) for a better fit
#Worse fit by a little...I'll stick with this model.
mlb <- read.csv('war-data (1).csv')
b <- matrix(1, nrow = 26, ncol = 1)
y <- matrix(mlb[,1], nrow = 26, ncol = 1)
A <- matrix(0, nrow = 26, ncol = 12)
A[,12] <- b
for (k in 1:11) {A[,k] <- mlb[,k+1]}
L <- lm(y ~ A)
summary(L)


Call:
lm(formula = y ~ A)
Residuals:
Min 1Q Median 3Q Max
-0.99243 -0.50108 -0.03323 0.50326 1.20661
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.35895 20.35891 -1.589 0.13429
A1 -0.29572 0.08485 -3.485 0.00364 **
A2 0.08446 0.03558 2.374 0.03246 *
A3 -0.33448 0.12242 -2.732 0.01620 *
A4 0.17842 0.09697 1.840 0.08709 .
A5 -0.02535 0.01794 -1.413 0.17937
A6 0.13098 0.04419 2.964 0.01025 *
A7 290.18433 72.18564 4.020 0.00127 **
A8 -554.49045 331.20070 -1.674 0.11628
A9 -441.33878 339.88981 -1.298 0.21511
A10 440.12180 338.75052 1.299 0.21485
A11 0.03939 0.03230 1.220 0.24275
A12 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8363 on 14 degrees of freedom
Multiple R-squared: 0.8657, Adjusted R-squared: 0.7602
F-statistic: 8.204 on 11 and 14 DF, p-value: 0.0002335

In [5]:

### Let's see how this model predicts the WAR of 2019 batters:
AGE <- c(25,26,25,24,25,26,32,24,30,25,28,27,28,27,25,29,26,30,28,32,34,28,33,25,28,27)
AB <- c(520,471,547,661,578,574,539,594,569,606,554,596,618,590,598,632,413,593,534,564,596,480,365,433,586,395)
H <- c(180,147,152,183,156,187,166,170,188,176,162,170,191,175,174,192,115,172,169,145,159,119,114,114,178,104)
THREE_B <- c(5,4,6,2,4,7,2,1,2,9,5,4,4,2,6,3,0,5,2,0,1,3,1,8,3,2)
RBI <- c(80,79,68,92,105,110,38,103,130,111,75,93,98,110,108,60,67,83,61,81,99,79,52,47,63,79)
BB <- c(81,122,58,70,106,68,71,96,69,29,35,70,76,73,47,61,76,90,55,102,78,90,47,80,32,79)
BA <- c(0.346,0.312,0.278,0.277,0.270,0.326,0.308,0.286,0.330,0.290,0.292,0.285,0.309,0.297,0.291,0.304,0.278,0.290,0.316,0.257,0.267,0.248,0.312,0.263,0.304,0.263)
OBP <- c(0.438,0.460,0.356,0.352,0.387,0.402,0.395,0.394,0.402,0.326,0.337,0.366,0.388,0.374,0.348,0.367,0.392,0.389,0.386,0.374,0.353,0.366,0.406,0.404,0.341,0.391)
SLG <- c(0.640,0.628,0.508,0.519,0.552,0.598,0.417,0.532,0.629,0.554,0.417,0.493,0.505,0.561,0.567,0.438,0.528,0.533,0.451,0.523,0.448,0.467,0.518,0.483,0.415,0.582)
OPS <- c(1.078,1.088,0.864,0.871,0.939,1,0.813,0.926,1.031,0.881,0.754,0.859,0.892,0.935,0.914,0.806,0.919,0.922,0.837,0.897,0.801,0.833,0.924,0.886,0.755,0.973)
OPS_P <- c(186,199,136,131,150,164,119,156,173,126,109,139,140,133,127,121,145,139,133,143,120,123,151,150,112,161)
b <- rep(1, 26)
WAR <- c(10.8,10.2,8.2,7.9,7.9,7.6,6.9,6.9,6.4,6.3,6.2,6.1,6.1,5.6,5.6,5.5,5.5,5.4,5.2,4.9,4.8,4.7,4.5,4.4,4.3,4.2)
Y <- matrix(c(WAR))
A <- matrix(c(AGE,AB,H,THREE_B,RBI,BB,BA,OBP,SLG,OPS,OPS_P,b), nrow = 26, ncol = 12)
#Solve the normal equations: X = (A^T A)^(-1) A^T Y
M <- t(A) %*% A
X <- solve(M) %*% t(A) %*% Y
X


-0.29572013
0.08446067
-0.33447882
0.17841504
-0.02535493
0.13098287
290.18432666
-554.49044710
-441.33877812
440.12179753
0.03939196
-32.35894696

In [7]:

#Let's see how good this model is at predicting the WAR of Mike Trout
27*-0.29572013 + 470*0.08446067 + 137*-0.33447882 + 2*0.17841504 +
  104*-0.02535493 + 110*0.13098287 + .291*290.18432666 + .438*-554.49044710 +
  .645*-441.33877812 + 1.083*440.12179753 + 185*0.03939196 + 1*-32.35894696


6.51028981584996

In [8]:

8.3 - 6.51028981584996


1.78971018415004

The model is off by about 1.8 WAR.
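The hand-typed sum above is just the dot product of Trout's stat line with the fitted coefficients; as a sketch, the same calculation in R (same numbers as above):

```r
# Mike Trout's stat line, in the model's variable order (same numbers as the
# hand-typed sum above)
x <- c(27, 470, 137, 2, 104, 110, 0.291, 0.438, 0.645, 1.083, 185, 1)
# Fitted coefficients from the final model (last entry is the intercept)
beta <- c(-0.29572013, 0.08446067, -0.33447882, 0.17841504, -0.02535493,
          0.13098287, 290.18432666, -554.49044710, -441.33877812,
          440.12179753, 0.03939196, -32.35894696)
sum(x * beta)  # about 6.51, matching the result above
```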


In [9]:

#Let's see how good this model is at predicting the WAR of Cody Bellinger
23*-0.29572013 + 588*0.08446067 + 170*-0.33447882 + 3*0.17841504 +
  115*-0.02535493 + 95*0.13098287 + .305*290.18432666 + .406*-554.49044710 +
  .629*-441.33877812 + 1.035*440.12179753 + 169*0.03939196 + 1*-32.35894696


11.66807378477

In [10]:

9 - 11.66807378477 #Model is off by -2.668


-2.66807378477

For the regression analysis, I used R and ran it in a CoCalc Jupyter notebook, which made it easier to add and delete data while keeping the output clean. I started with 25 different factors and, by adding and deleting variables, narrowed the model down to 11.

Over the course of the project, it became clear that finding a fit I was comfortable with would be both an art and a science, because some statistics that are not great indicators of batting quality were nevertheless getting high t-values. Keeping them would have yielded a better fit, but I could not keep them in good faith.

The first few variables deleted were obviously bad fits: total bases, on-base plus slugging, stolen bases, etc. Deleting these helped me get a better fit; the adjusted R-squared went from 0.5737 to 0.909. Although that is a great fit, as I mentioned before, there were some stats that I felt needed to be removed from the model. I will discuss two examples of the ones I deleted.

The first example is hit-by-pitch. This stat only tells us how many times a player was unintentionally hit by a pitch. At the major league level, getting hit by a pitch is not a common occurrence: the top five players in this category had only 8 to 10, while most players had over 600 plate appearances. Players should not expect to be hit by a pitch when they step into a major league batter's box. Yet hit-by-pitch had a t-value of -3.698, which would mark it as a statistically significant variable.

Another example is sacrifice hits. Of the 26 players with the highest WAR scores, 18 had a zero in this category. Sacrifice hits happen when a player bunts to move baserunners into scoring position; they are situational and not a great indicator of success per plate appearance. The problem is that we are looking at the top 26 players by WAR, which means they probably provide valuable offense to their teams. If they are among the best hitters, why not just let them try to get a hit? A hit could also provide more value than a bunt, because then both the batter and the existing baserunners could be on base. Despite this, sacrifice hits had a large t-value of -4.834.

There were more stats like this that I removed, but I feel I landed on a model that includes stats like batting average, slugging percentage, and on-base plus slugging. What I like about these stats is that they tell you what a player is doing on a per-at-bat basis. For example, on-base plus slugging takes into account many aspects of great hitting. Slugging percentage tells us about a player's ability to hit for average and power, counting singles, doubles, triples, and home runs against at-bats. On-base percentage accounts for all the ways a player can get on base, including hits, walks, and hit-by-pitch. Add on-base percentage and slugging percentage together and you get on-base plus slugging (OPS), a high-quality statistic that tells us a lot about a batter.
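As a quick worked example of that last point, using the first player's slash numbers from the prediction data above:

```r
# OPS is simply on-base percentage plus slugging percentage; these numbers are
# the first entries of the OBP and SLG vectors in the prediction cell above.
obp <- 0.438
slg <- 0.640
obp + slg  # 1.078, matching that player's OPS entry
```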

I was able to get an adjusted R-squared of 0.7602, which is not as good as 0.909. When I tested the model on two of the best players of the 2019 MLB season, Mike Trout and Cody Bellinger, it was off by about 1.8 and 2.7 WAR, respectively. Obviously, this model has room for improvement. Perhaps I should not have taken out as many stats with high t-values, even though they might not be the greatest indicators of batting ability. Next time I do a regression analysis project, I would like to build a model with a better balance of stats that gives a better fit. Perhaps analyzing a domain in which I have no background knowledge would improve my model, because I would look at the numbers without bias. This project has shown me that the applications of regression analysis are endless: it fits almost any situation in which a set of independent factors leads to a statistic.
