WAR data.ipynb
Author: Brian Holliday

Abstract: WAR (Wins Above Replacement) is a baseball statistic that measures a player's value against a replacement-level player — that is, a player who gives you the production of a minimum-salary player. It is an important statistic because it measures a player's value against his peers in terms of the number of extra wins he provides.

We set out to do a regression analysis of this statistic to see which measures are integral to WAR. Traditionally, players were evaluated on classic baseball statistics such as batting average, home runs, stolen bases, and runs batted in, but we know that these stats do not tell the whole story. For example, they do not tell you how often a player gets on base, or his strikeout rate. We want to know which statistics are the best indicators of WAR.

Regression analysis is a good tool for figuring this out because it can account for multiple factors at once when determining the WAR score. Using the standard error and t-value reported for each factor, we can determine how statistically significant each variable is. Along the way we also try to limit the number of outliers, since outliers make the adjusted R-squared worse, so the goal is to make the adjusted R-squared as high as possible.
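As a warm-up, here is a minimal R sketch on toy data (not the WAR dataset) of where those quantities come from: `summary(lm(...))` reports the coefficient estimates, standard errors, t-values, and the adjusted R-squared used throughout the analysis below.

```r
# Toy data with known true coefficients (illustration only, not the WAR data)
set.seed(1)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- 2 * x1 - 0.5 * x2 + rnorm(30, sd = 0.3)

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)
s$coefficients    # columns: Estimate, Std. Error, t value, Pr(>|t|)
s$adj.r.squared   # adjusted R-squared penalizes weak extra predictors
```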

In [2]:
mlb<-read.csv('war-data (1).csv')

In [3]:
#Create our y...this is the war data
y<-matrix(mlb[,1], nrow = 26, ncol = 1)

In [2]:
#Create our b matrix
b<-matrix(1, nrow = 26, ncol = 1)

In [7]:
#Create and populate A matrix
A<-matrix(0,nrow = 26, ncol = 26)
A[,26]<-b
for (k in 1:25) {A[,k]<-mlb[,k+1]}

In [9]:
#Regression Analysis
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
        1         2         3         4         5         6         7         8
-0.122076  0.239640  0.229624  0.200709 -0.297195 -0.254432  0.206614  0.221623
        9        10        11        12        13        14        15        16
 0.171591  0.076752 -0.046736 -0.305538  0.420955 -0.370879  0.193177 -0.095547
       17        18        19        20        21        22        23        24
-0.189749 -0.056964 -0.227216 -0.182676 -0.088005  0.396479  0.114347 -0.191193
       25        26
-0.045370  0.002065

Coefficients: (2 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.155e+02  1.621e+02  -1.330    0.411
A1          -5.434e-01  3.442e-01  -1.579    0.359
A2           1.698e-01  2.090e-01   0.812    0.566
A3           2.046e+00  3.994e+00   0.512    0.699
A4          -1.709e+00  3.830e+00  -0.446    0.733
A5           2.291e-02  8.318e-02   0.275    0.829
A6          -1.102e+00  7.245e-01  -1.521    0.370
A7           3.943e-01  4.625e-01   0.853    0.551
A8           1.119e+00  9.502e-01   1.178    0.448
A9           1.292e+00  1.346e+00   0.959    0.513
A10         -8.679e-02  8.840e-02  -0.982    0.506
A11          6.152e-03  5.996e-02   0.103    0.935
A12          6.918e-02  2.718e-01   0.255    0.841
A13         -2.834e+00  4.686e+00  -0.605    0.654
A14         -3.071e-02  2.999e-02  -1.024    0.492
A15          1.372e+02  2.720e+02   0.504    0.703
A16          8.579e+02  1.120e+03   0.766    0.584
A17         -1.961e+02  5.539e+02  -0.354    0.783
A18         -7.897e+01  6.069e+02  -0.130    0.918
A19          2.272e-01  1.823e-01   1.246    0.430
A20                 NA         NA      NA       NA
A21          9.112e-02  1.176e-01   0.775    0.580
A22         -3.664e+00  5.181e+00  -0.707    0.608
A23         -3.578e+00  4.797e+00  -0.746    0.592
A24         -1.353e+00  3.479e+00  -0.389    0.764
A25         -1.461e-01  1.862e-01  -0.785    0.576
A26                 NA         NA      NA       NA

Residual standard error: 1.115 on 1 degrees of freedom
Multiple R-squared: 0.9829,  Adjusted R-squared: 0.5737
F-statistic: 2.402 on 24 and 1 DF,  p-value: 0.4751

We see that the fit could be better. Delete A20 (TB) and A18 (OPS).
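A side note on the "coefficients not defined because of singularities" warning and the NA rows (A20, A26) in the output above: `lm()` adds its own intercept, so the column of ones appended to A is perfectly collinear with it, and one of the pair comes back NA. A small sketch of the effect on toy data (not the project's matrices), with two equivalent fixes:

```r
yy <- c(1, 2, 3, 5)
AA <- cbind(c(0.1, 0.4, 0.7, 1.2), 1)  # one predictor plus a ones column

coef(lm(yy ~ AA))       # AA2 is NA: collinear with lm()'s own intercept
coef(lm(yy ~ AA + 0))   # fix 1: suppress the intercept; ones column plays its role
coef(lm(yy ~ AA[, 1]))  # fix 2: drop the ones column, keep lm()'s intercept
```

Either fix removes that particular NA; the fitted values are identical in all three cases.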

In [14]:
#Rebuild A from the new version of the csv file (TB and OPS removed)
A<-matrix(0,nrow = 26, ncol = 24)
A[,24]<-b
for (k in 1:23) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
        1         2         3         4         5         6         7         8
-0.084647  0.240122  0.226949  0.200364 -0.291981 -0.271040  0.195880  0.193228
        9        10        11        12        13        14        15        16
 0.182238  0.023081 -0.033878 -0.338058  0.429703 -0.394952  0.263633 -0.140128
       17        18        19        20        21        22        23        24
-0.181327 -0.073715 -0.219848 -0.166407 -0.044515  0.380878  0.080721 -0.162335
       25        26
-0.004325 -0.009641

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.081e+02  1.082e+02  -1.924    0.194
A1          -5.352e-01  2.413e-01  -2.218    0.157
A2           1.566e-01  1.303e-01   1.202    0.353
A3           1.931e+00  2.777e+00   0.695    0.559
A4          -1.603e+00  2.668e+00  -0.601    0.609
A5           2.311e-02  5.930e-02   0.390    0.734
A6          -1.065e+00  4.760e-01  -2.238    0.155
A7           3.706e-01  3.033e-01   1.222    0.346
A8           1.072e+00  6.264e-01   1.711    0.229
A9           1.220e+00  8.769e-01   1.392    0.299
A10         -8.362e-02  6.060e-02  -1.380    0.302
A11          4.432e-03  4.171e-02   0.106    0.925
A12          6.269e-02  1.905e-01   0.329    0.773
A13         -2.684e+00  3.239e+00  -0.829    0.494
A14         -2.960e-02  2.049e-02  -1.444    0.285
A15          1.339e+02  1.931e+02   0.693    0.560
A16          7.464e+02  5.145e+02   1.451    0.284
A17         -2.604e+02  1.781e+02  -1.462    0.281
A18          2.172e-01  1.180e-01   1.841    0.207
A19          8.737e-02  8.129e-02   1.075    0.395
A20         -3.479e+00  3.552e+00  -0.979    0.431
A21         -3.419e+00  3.307e+00  -1.034    0.410
A22         -1.264e+00  2.432e+00  -0.520    0.655
A23         -1.366e-01  1.222e-01  -1.118    0.380
A24                 NA         NA      NA       NA

Residual standard error: 0.7951 on 2 degrees of freedom
Multiple R-squared: 0.9827,  Adjusted R-squared: 0.7832
F-statistic: 4.927 on 23 and 2 DF,  p-value: 0.1822

We got a better fit for the model. Delete A11 (SB) for a better model.

In [15]:
#Delete A11 SB from model
A<-matrix(0,nrow = 26, ncol = 23)
A[,23]<-b
for (k in 1:22) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8
-0.10399  0.25595  0.22201  0.19290 -0.29475 -0.23564  0.19854  0.19308
       9       10       11       12       13       14       15       16
 0.16947  0.02393 -0.03173 -0.35546  0.42493 -0.40001  0.28874 -0.11398
      17       18       19       20       21       22       23       24
-0.18465 -0.11794 -0.20766 -0.13801 -0.04774  0.38501  0.08151 -0.16661
      25       26
-0.02249 -0.01541

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -205.97687   87.04486  -2.366   0.0988 .
A1            -0.53170    0.19570  -2.717   0.0727 .
A2             0.15658    0.10670   1.467   0.2385
A3             1.84699    2.18011   0.847   0.4591
A4            -1.52216    2.09445  -0.727   0.5199
A5             0.01910    0.03744   0.510   0.6452
A6            -1.05633    0.38368  -2.753   0.0706 .
A7             0.37508    0.24591   1.525   0.2246
A8             1.08841    0.49729   2.189   0.1164
A9             1.24136    0.69937   1.775   0.1740
A10           -0.08752    0.03948  -2.217   0.1133
A11            0.07588    0.11835   0.641   0.5670
A12           -2.59513    2.56195  -1.013   0.3857
A13           -0.03012    0.01630  -1.848   0.1618
A14          136.01867  157.25300   0.865   0.4507
A15          741.31845  419.45733   1.767   0.1753
A16         -263.04378  144.45329  -1.821   0.1662
A17            0.22074    0.09276   2.380   0.0977 .
A18            0.08440    0.06251   1.350   0.2697
A19           -3.39706    2.83937  -1.196   0.3175
A20           -3.33523    2.63024  -1.268   0.2943
A21           -1.17861    1.87988  -0.627   0.5752
A22           -0.13774    0.09969  -1.382   0.2610
A23                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.651 on 3 degrees of freedom
Multiple R-squared: 0.9826,  Adjusted R-squared: 0.8547
F-statistic: 7.683 on 22 and 3 DF,  p-value: 0.05887
In [16]:
#Multiple R-squared is almost unchanged from the previous model
#Delete A5 R from model for better fit
A<-matrix(0,nrow = 26, ncol = 22)
A[,22]<-b
for (k in 1:21) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8
-0.07782  0.26687  0.22671  0.20076 -0.39230 -0.14239  0.25484  0.21917
       9       10       11       12       13       14       15       16
 0.08002  0.06437 -0.06608 -0.37385  0.49161 -0.28870  0.19461 -0.23578
      17       18       19       20       21       22       23       24
-0.18600 -0.19424 -0.31515 -0.07264 -0.01837  0.37586  0.06592 -0.14553
      25       26
 0.08836 -0.02026

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -192.63769   74.95185  -2.570  0.06197 .
A1            -0.51432    0.17398  -2.956  0.04171 *
A2             0.15263    0.09608   1.589  0.18734
A3             1.63618    1.93249   0.847  0.44486
A4            -1.33176    1.86056  -0.716  0.51369
A5            -0.99718    0.33017  -3.020  0.03916 *
A6             0.41473    0.21061   1.969  0.12028
A7             1.17839    0.41974   2.807  0.04844 *
A8             1.37429    0.58590   2.346  0.07889 .
A9            -0.10337    0.02197  -4.704  0.00928 **
A10            0.08007    0.10658   0.751  0.49430
A11           -2.38230    2.28202  -1.044  0.35545
A12           -0.03229    0.01420  -2.273  0.08542 .
A13          122.93523  140.06452   0.878  0.42966
A14          735.57394  378.54495   1.943  0.12393
A15         -282.71425  125.67659  -2.250  0.08769 .
A16            0.24096    0.07571   3.183  0.03344 *
A17            0.08033    0.05597   1.435  0.22452
A18           -3.21521    2.54306  -1.264  0.27477
A19           -3.10255    2.33856  -1.327  0.25528
A20           -0.96321    1.65375  -0.582  0.59152
A21           -0.14078    0.08984  -1.567  0.19219
A22                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5877 on 4 degrees of freedom
Multiple R-squared: 0.981,  Adjusted R-squared: 0.8816
F-statistic: 9.86 on 21 and 4 DF,  p-value: 0.01923
In [3]:
#adjusted R-squared improvement
#Delete A20 SF from model for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 21)
A[,21]<-b
for (k in 1:20) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
       1        2        3        4        5        6        7        8
-0.01446  0.31200  0.11643  0.23669 -0.54171 -0.18726  0.21455  0.29241
       9       10       11       12       13       14       15       16
 0.09595  0.11559  0.09627 -0.43568  0.40766 -0.27398  0.17451 -0.19335
      17       18       19       20       21       22       23       24
-0.12908 -0.14836 -0.36732 -0.04200  0.01498  0.38707  0.02307 -0.09059
      25       26
 0.01664 -0.08004

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.593e+02  4.512e+01  -3.531  0.01672 *
A1          -4.365e-01  1.037e-01  -4.208  0.00843 **
A2           1.133e-01  6.364e-02   1.780  0.13518
A3           5.146e-01  1.512e-01   3.403  0.01918 *
A4          -2.505e-01  1.160e-01  -2.160  0.08316 .
A5          -8.645e-01  2.226e-01  -3.883  0.01160 *
A6           3.134e-01  1.107e-01   2.832  0.03660 *
A7           9.852e-01  2.396e-01   4.113  0.00924 **
A8           1.096e+00  3.158e-01   3.470  0.01785 *
A9          -1.012e-01  2.019e-02  -5.015  0.00405 **
A10          4.664e-02  8.367e-02   0.557  0.60124
A11         -1.068e+00  3.123e-01  -3.418  0.01888 *
A12         -2.563e-02  7.854e-03  -3.264  0.02235 *
A13          1.655e+02  1.113e+02   1.487  0.19706
A14          5.443e+02  1.755e+02   3.101  0.02681 *
A15         -2.225e+02  6.663e+01  -3.340  0.02056 *
A16          2.081e-01  4.693e-02   4.433  0.00681 **
A17          6.696e-02  4.755e-02   1.408  0.21810
A18         -1.762e+00  4.559e-01  -3.864  0.01183 *
A19         -1.759e+00  3.624e-01  -4.856  0.00465 **
A20         -9.603e-02  4.338e-02  -2.213  0.07776 .
A21                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5475 on 5 degrees of freedom
Multiple R-squared: 0.9794,  Adjusted R-squared: 0.8972
F-statistic: 11.91 on 20 and 5 DF,  p-value: 0.006024
In [5]:
#Better fit after removal of A20
#Remove A10 CS for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 20)
A[,20]<-b
for (k in 1:19) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52475 -0.14784 -0.00275  0.18046  0.34833

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.521e+02  4.064e+01  -3.741  0.00961 **
A1          -4.307e-01  9.710e-02  -4.436  0.00440 **
A2           9.775e-02  5.383e-02   1.816  0.11929
A3           4.990e-01  1.398e-01   3.570  0.01179 *
A4          -2.436e-01  1.085e-01  -2.246  0.06585 .
A5          -8.164e-01  1.931e-01  -4.229  0.00551 **
A6           2.914e-01  9.726e-02   2.996  0.02413 *
A7           9.481e-01  2.165e-01   4.379  0.00468 **
A8           1.034e+00  2.781e-01   3.718  0.00987 **
A9          -1.022e-01  1.892e-02  -5.402  0.00166 **
A10         -1.032e+00  2.877e-01  -3.588  0.01154 *
A11         -2.589e-02  7.376e-03  -3.511  0.01266 *
A12          1.468e+02  9.983e+01   1.471  0.19176
A13          5.245e+02  1.617e+02   3.244  0.01761 *
A14         -2.079e+02  5.765e+01  -3.607  0.01127 *
A15          1.969e-01  3.992e-02   4.932  0.00263 **
A16          5.650e-02  4.110e-02   1.375  0.21840
A17         -1.685e+00  4.088e-01  -4.121  0.00621 **
A18         -1.688e+00  3.189e-01  -5.294  0.00184 **
A19         -9.087e-02  3.988e-02  -2.279  0.06290 .
A20                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5151 on 6 degrees of freedom
Multiple R-squared: 0.9782,  Adjusted R-squared: 0.909
F-statistic: 14.15 on 19 and 6 DF,  p-value: 0.001731
In [6]:
#Better fit after removal of A10
#Remove A16 GDP for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 19)
A[,19]<-b
for (k in 1:18) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73661 -0.19591  0.04551  0.21701  0.46571

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.278e+02  3.886e+01  -3.288  0.01334 *
A1          -4.621e-01  1.002e-01  -4.611  0.00245 **
A2           9.543e-02  5.712e-02   1.671  0.13871
A3           4.034e-01  1.287e-01   3.134  0.01653 *
A4          -1.832e-01  1.053e-01  -1.740  0.12545
A5          -7.320e-01  1.943e-01  -3.767  0.00701 **
A6           2.611e-01  1.006e-01   2.596  0.03561 *
A7           8.148e-01  2.055e-01   3.964  0.00543 **
A8           9.494e-01  2.879e-01   3.298  0.01316 *
A9          -9.111e-02  1.816e-02  -5.016  0.00154 **
A10         -8.434e-01  2.684e-01  -3.142  0.01632 *
A11         -2.136e-02  7.005e-03  -3.050  0.01859 *
A12          1.841e+02  1.020e+02   1.805  0.11411
A13          4.229e+02  1.527e+02   2.770  0.02770 *
A14         -1.941e+02  6.026e+01  -3.221  0.01464 *
A15          1.876e-01  4.177e-02   4.492  0.00283 **
A16         -1.429e+00  3.865e-01  -3.698  0.00768 **
A17         -1.539e+00  3.185e-01  -4.834  0.00189 **
A18         -8.550e-02  4.213e-02  -2.029  0.08198 .
A19                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5468 on 7 degrees of freedom
Multiple R-squared: 0.9713,  Adjusted R-squared: 0.8975
F-statistic: 13.15 on 18 and 7 DF,  p-value: 0.0009777
In [8]:
#Worse fit after removal of GDP
#Remove A16 HBP and A17 SH for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 17)
A[,17]<-b
for (k in 1:16) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-1.24900 -0.32868 -0.03965  0.40735  1.20246

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.619e+01  4.594e+01  -1.005   0.3409
A1          -2.531e-01  1.524e-01  -1.661   0.1311
A2           1.524e-02  7.881e-02   0.193   0.8509
A3          -3.440e-02  1.134e-01  -0.303   0.7686
A4           1.411e-01  1.316e-01   1.073   0.3114
A5          -4.117e-01  2.567e-01  -1.604   0.1433
A6          -1.553e-02  9.267e-02  -0.168   0.8706
A7           2.165e-01  1.834e-01   1.181   0.2680
A8           3.848e-02  2.701e-01   0.142   0.8899
A9          -4.055e-02  2.693e-02  -1.506   0.1664
A10          1.776e-01  9.939e-02   1.787   0.1076
A11         -2.329e-03  1.069e-02  -0.218   0.8324
A12          3.433e+02  1.751e+02   1.960   0.0816 .
A13         -1.291e+02  7.999e+01  -1.614   0.1409
A14         -3.395e+00  5.282e+01  -0.064   0.9502
A15          5.790e-02  4.836e-02   1.197   0.2618
A16         -6.614e-02  7.049e-02  -0.938   0.3726
A17                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 9 degrees of freedom
Multiple R-squared: 0.8746,  Adjusted R-squared: 0.6517
F-statistic: 3.924 on 16 and 9 DF,  p-value: 0.02149
In [12]:
#Worse fit after removal of HBP and SH
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 18)
A[,18]<-b
for (k in 1:17) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-1.14967 -0.36682 -0.06602  0.46206  1.07372

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.282e+01  4.762e+01  -0.899    0.395
A1          -2.727e-01  1.597e-01  -1.707    0.126
A2          -9.049e-04  8.465e-02  -0.011    0.992
A3          -2.870e-02  1.172e-01  -0.245    0.813
A4           1.326e-01  1.363e-01   0.973    0.359
A5          -3.873e-01  2.671e-01  -1.450    0.185
A6          -1.765e-02  9.560e-02  -0.185    0.858
A7           2.000e-01  1.906e-01   1.049    0.325
A8           9.077e-03  2.818e-01   0.032    0.975
A9          -3.177e-02  3.060e-02  -1.038    0.329
A10          1.610e-01  1.053e-01   1.529    0.165
A11         -1.087e-03  1.118e-02  -0.097    0.925
A12          3.150e+02  1.853e+02   1.700    0.128
A13         -4.360e+02  4.574e+02  -0.953    0.368
A14         -3.214e+02  4.694e+02  -0.685    0.513
A15          3.209e+02  4.706e+02   0.682    0.515
A16          5.269e-02  5.044e-02   1.044    0.327
A17         -5.117e-02  7.593e-02  -0.674    0.519
A18                 NA         NA      NA       NA

Residual standard error: 1.039 on 8 degrees of freedom
Multiple R-squared: 0.8815,  Adjusted R-squared: 0.6297
F-statistic: 3.501 on 17 and 8 DF,  p-value: 0.03839
In [15]:
#Worse fit, but clearer indicators of which predictors fit poorly
#Remove A2 G and A11 SO for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 16)
A[,16]<-b
for (k in 1:15) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-1.15039 -0.37060 -0.05043  0.46164  1.06846

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.424e+01  3.878e+01  -1.141  0.28050
A1          -2.805e-01  1.124e-01  -2.495  0.03174 *
A2          -2.646e-02  1.027e-01  -0.258  0.80183
A3           1.330e-01  1.219e-01   1.091  0.30088
A4          -3.985e-01  1.851e-01  -2.153  0.05675 .
A5          -1.856e-02  8.235e-02  -0.225  0.82621
A6           1.943e-01  1.614e-01   1.204  0.25634
A7           6.746e-03  2.235e-01   0.030  0.97651
A8          -3.088e-02  2.310e-02  -1.337  0.21099
A9           1.605e-01  9.177e-02   1.749  0.11084
A10          3.230e+02  8.836e+01   3.656  0.00442 **
A11         -4.484e+02  3.830e+02  -1.171  0.26893
A12         -3.325e+02  3.931e+02  -0.846  0.41735
A13          3.318e+02  3.887e+02   0.854  0.41327
A14          5.258e-02  4.471e-02   1.176  0.26680
A15         -5.155e-02  5.815e-02  -0.886  0.39618
A16                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.93 on 10 degrees of freedom
Multiple R-squared: 0.8814,  Adjusted R-squared: 0.7034
F-statistic: 4.952 on 15 and 10 DF,  p-value: 0.007328
In [18]:
#Better fit
#Remove A7 HR for better fit
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 15)
A[,15]<-b
for (k in 1:14) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-1.14835 -0.37245 -0.05783  0.45458  1.07269

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -45.06086   26.33293  -1.711  0.11506
A1           -0.28036    0.10712  -2.617  0.02394 *
A2           -0.02576    0.09538  -0.270  0.79206
A3            0.13364    0.11441   1.168  0.26745
A4           -0.40150    0.14920  -2.691  0.02099 *
A5           -0.02067    0.04162  -0.497  0.62928
A6            0.19074    0.10638   1.793  0.10049
A7           -0.03063    0.02054  -1.491  0.16396
A8            0.15978    0.08429   1.895  0.08460 .
A9          323.33972   83.68635   3.864  0.00264 **
A10        -447.90609  364.95574  -1.227  0.24533
A11        -331.14733  372.23794  -0.890  0.39271
A12         331.73939  370.62263   0.895  0.38991
A13           0.05190    0.03674   1.412  0.18550
A14          -0.05186    0.05453  -0.951  0.36196
A15                NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8868 on 11 degrees of freedom
Multiple R-squared: 0.8813,  Adjusted R-squared: 0.7303
F-statistic: 5.836 on 14 and 11 DF,  p-value: 0.002848
In [2]:
#Worse fit, but clearer indicators of which predictors fit poorly
#Remove A4 2B  for better fit
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 14)
A[,14]<-b
for (k in 1:13) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-1.08776 -0.41730 -0.07005  0.48441  1.06377

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -44.12974   25.07769  -1.760  0.10390
A1           -0.26623    0.08979  -2.965  0.01181 *
A2            0.10537    0.04440   2.373  0.03520 *
A3           -0.39553    0.14174  -2.791  0.01633 *
A4           -0.02419    0.03797  -0.637  0.53595
A5            0.18949    0.10209   1.856  0.08816 .
A6           -0.03203    0.01908  -1.679  0.11901
A7            0.14192    0.05024   2.825  0.01532 *
A8          326.23363   79.72747   4.092  0.00149 **
A9         -461.45267  347.25002  -1.329  0.20860
A10        -335.21359  357.27830  -0.938  0.36662
A11         337.61742  355.40498   0.950  0.36088
A12           0.04930    0.03407   1.447  0.17349
A13          -0.05774    0.04803  -1.202  0.25241
A14                NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8518 on 12 degrees of freedom
Multiple R-squared: 0.8806,  Adjusted R-squared: 0.7512
F-statistic: 6.805 on 13 and 12 DF,  p-value: 0.001062
In [3]:
#Worse fit, but clearer indicators of which predictors fit poorly
#Remove A4 2B  for better fit
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 13)
A[,13]<-b
for (k in 1:12) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0228 -0.3696 -0.0893  0.5124  1.1570

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.32939   20.44839  -1.728  0.10770
A1           -0.27246    0.08719  -3.125  0.00805 **
A2            0.08930    0.03569   2.502  0.02650 *
A3           -0.35435    0.12323  -2.876  0.01301 *
A4            0.19962    0.09852   2.026  0.06376 .
A5           -0.02990    0.01835  -1.629  0.12724
A6            0.15006    0.04746   3.162  0.00750 **
A7          311.56956   74.56980   4.178  0.00108 **
A8         -503.73164  332.97464  -1.513  0.15425
A9         -372.58943  344.28540  -1.082  0.29882
A10         373.52498  342.79913   1.090  0.29566
A11           0.04467    0.03252   1.374  0.19277
A12          -0.04672    0.04377  -1.068  0.30517
A13                NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8321 on 13 degrees of freedom
Multiple R-squared: 0.8765,  Adjusted R-squared: 0.7625
F-statistic: 7.69 on 12 and 13 DF,  p-value: 0.0004322
In [1]:
#Remove A12 IBB for better fit
#Worse fit by a little...I'll stick with this model.
b<-matrix(1, nrow = 26, ncol = 1)
y<-matrix(mlb[,1], nrow = 26, ncol = 1)
A<-matrix(0,nrow = 26, ncol = 12)
A[,12]<-b
for (k in 1:11) {A[,k]<-mlb[,k+1]}
L<-lm(y~A)
summary(L)

Call:
lm(formula = y ~ A)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99243 -0.50108 -0.03323  0.50326  1.20661

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.35895   20.35891  -1.589  0.13429
A1           -0.29572    0.08485  -3.485  0.00364 **
A2            0.08446    0.03558   2.374  0.03246 *
A3           -0.33448    0.12242  -2.732  0.01620 *
A4            0.17842    0.09697   1.840  0.08709 .
A5           -0.02535    0.01794  -1.413  0.17937
A6            0.13098    0.04419   2.964  0.01025 *
A7          290.18433   72.18564   4.020  0.00127 **
A8         -554.49045  331.20070  -1.674  0.11628
A9         -441.33878  339.88981  -1.298  0.21511
A10         440.12180  338.75052   1.299  0.21485
A11           0.03939    0.03230   1.220  0.24275
A12                NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8363 on 14 degrees of freedom
Multiple R-squared: 0.8657,  Adjusted R-squared: 0.7602
F-statistic: 8.204 on 11 and 14 DF,  p-value: 0.0002335
In [5]:
### Let's see how this model predicts the WAR of 2019 batters:
AGE<-c(25,26,25,24,25,26,32,24,30,25,28,27,28,27,25,29,26,30,28,32,34,28,33,25,28,27)
AB<-c(520,471,547,661,578,574,539,594,569,606,554,596,618,590,598,632,413,593,534,564,596,480,365,433,586,395)
H<-c(180,147,152,183,156,187,166,170,188,176,162,170,191,175,174,192,115,172,169,145,159,119,114,114,178,104)
THREE_B<-c(5,4,6,2,4,7,2,1,2,9,5,4,4,2,6,3,0,5,2,0,1,3,1,8,3,2)
RBI<-c(80,79,68,92,105,110,38,103,130,111,75,93,98,110,108,60,67,83,61,81,99,79,52,47,63,79)
BB<-c(81,122,58,70,106,68,71,96,69,29,35,70,76,73,47,61,76,90,55,102,78,90,47,80,32,79)
BA<-c(0.346,0.312,0.278,0.277,0.270,0.326,0.308,0.286,0.330,0.290,0.292,0.285,0.309,0.297,0.291,0.304,0.278,0.290,0.316,0.257,0.267,0.248,0.312,0.263,0.304,0.263)
OBP<-c(0.438,0.460,0.356,0.352,0.387,0.402,0.395,0.394,0.402,0.326,0.337,0.366,0.388,0.374,0.348,0.367,0.392,0.389,0.386,0.374,0.353,0.366,0.406,0.404,0.341,0.391)
SLG<-c(0.640,0.628,0.508,0.519,0.552,0.598,0.417,0.532,0.629,0.554,0.417,0.493,0.505,0.561,0.567,0.438,0.528,0.533,0.451,0.523,0.448,0.467,0.518,0.483,0.415,0.582)
OPS<-c(1.078,1.088,0.864,0.871,0.939,1,0.813,0.926,1.031,0.881,0.754,0.859,0.892,0.935,0.914,0.806,0.919,0.922,0.837,0.897,0.801,0.833,0.924,0.886,0.755,0.973)
OPS_P<-c(186,199,136,131,150,164,119,156,173,126,109,139,140,133,127,121,145,139,133,143,120,123,151,150,112,161)
b<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
WAR<-c(10.8,10.2,8.2,7.9,7.9,7.6,6.9,6.9,6.4,6.3,6.2,6.1,6.1,5.6,5.6,5.5,5.5,5.4,5.2,4.9,4.8,4.7,4.5,4.4,4.3,4.2)
Y<-matrix(c(WAR))
A<-matrix(c(AGE,AB,H,THREE_B,RBI,BB,BA,OBP,SLG,OPS,OPS_P,b),nrow=26,ncol=12)
M<-t(A)%*%A
X<-solve(M)%*%t(A)%*%Y
X

            [,1]
 [1,]   -0.29572
 [2,]    0.0844607
 [3,]   -0.334479
 [4,]    0.178415
 [5,]   -0.0253549
 [6,]    0.130983
 [7,]  290.184
 [8,] -554.49
 [9,] -441.339
[10,]  440.122
[11,]    0.039392
[12,]  -32.3589
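The cell above solves the normal equations X = (AᵀA)⁻¹Aᵀy directly with matrix algebra. As a sanity check (a sketch on toy data, not the 2019 matrix), the same coefficients come out of `lm()`; note also that `solve(M, rhs)` is the numerically safer form of `solve(M) %*% rhs`.

```r
# Toy design matrix in the same shape as the cell above:
# predictor columns followed by a ones column for the intercept
set.seed(2)
Atoy <- cbind(rnorm(10), rnorm(10), 1)
ytoy <- as.numeric(Atoy %*% c(1.5, -2, 0.7) + rnorm(10, sd = 0.1))

Xne <- solve(t(Atoy) %*% Atoy, t(Atoy) %*% ytoy)     # normal equations
Xlm <- coef(lm(ytoy ~ Atoy[, 1] + Atoy[, 2]))        # lm() equivalent

# Xne is (b1, b2, intercept); lm() puts the intercept first
all.equal(as.numeric(Xne), as.numeric(Xlm)[c(2, 3, 1)], tolerance = 1e-6)
```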
In [7]:
#Let's see how good this model is at predicting the WAR of Mike Trout
27*-0.29572013 + 470*0.08446067 + 137*-0.33447882 + 2*0.17841504 +
  104*-0.02535493 + 110*0.13098287 + .291*290.18432666 + .438*-554.49044710 +
  .645*-441.33877812 + 1.083*440.12179753 + 185*0.03939196 + 1*-32.35894696

6.51028981584996
In [8]:
8.3 - 6.51028981584996

1.78971018415004

The model is off by about 1.8 wins.
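The hand-typed sum above is easy to get wrong; the same prediction falls out of a single dot product between the stat line (in the model's column order, with a final 1 for the intercept column) and the fitted coefficient vector. A sketch using the coefficients printed earlier and Trout's 2019 line from the cell above:

```r
# Fitted coefficients, in the model's column order
X <- c(-0.29572013, 0.08446067, -0.33447882, 0.17841504, -0.02535493,
       0.13098287, 290.18432666, -554.49044710, -441.33877812,
       440.12179753, 0.03939196, -32.35894696)

# Mike Trout's 2019 stat line in the same order (final 1 = intercept column)
trout <- c(27, 470, 137, 2, 104, 110, 0.291, 0.438, 0.645, 1.083, 185, 1)

sum(trout * X)   # about 6.51, matching the hand-typed calculation
```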

In [9]:
#Let's see how good this model is at predicting the WAR of Cody Bellinger
23*-0.29572013 + 588*0.08446067 + 170*-0.33447882 + 3*0.17841504 +
  115*-0.02535493 + 95*0.13098287 + .305*290.18432666 + .406*-554.49044710 +
  .629*-441.33877812 + 1.035*440.12179753 + 169*0.03939196 + 1*-32.35894696

11.66807378477
In [10]:
9 - 11.66807378477
#Model is off by -2.668

-2.66807378477

For the regression analysis, I used R in a CoCalc Jupyter notebook, which made it easy to add and delete data while keeping the work readable. I started with 25 different factors and was able to narrow them down to 11 by adding and deleting data points.

Over the course of the project, it became clear that finding a fit I was comfortable with would be as much an art as a science. Some statistics that are not great indicators of how good a batter is were nevertheless getting high t-values. Keeping them would have yielded a better fit, but I could not keep them in good conscience.

The first couple that I deleted were obviously bad fits: Total Bases, On-Base Percentage, Stolen Bases, and so on. Deleting these did give me a better fit, taking the adjusted R-squared from 0.5737 to 0.909, almost a one-to-one fit. Although that is a great fit, as I mentioned before, there were some stats that I felt still needed to be removed from the model. I will discuss two examples of the ones I deleted.
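This manual backward elimination can also be automated. R's built-in `step()` performs backward elimination using AIC rather than adjusted R-squared — a different criterion, but the same spirit of dropping weak predictors one at a time. A sketch on toy data (not the WAR dataset):

```r
# Toy data: x1 and x2 carry real signal, x3 is pure noise
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 3 * d$x1 - 2 * d$x2 + rnorm(100, sd = 0.5)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- step(full, direction = "backward", trace = 0)
names(coef(reduced))   # the strong predictors x1 and x2 always survive
```

Note that automation cannot make the judgment calls described above (dropping a statistically significant but baseball-meaningless stat); it only optimizes the numeric criterion.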

The first example is Hit By Pitch. This stat only tells us how many times a player was unintentionally hit by a pitch. At the major league level, getting hit by a pitch is not a common occurrence: the top five players had only 8-10 in this category, and most players had over 600 plate appearances. Players should not expect to be hit by a pitch when they step into a Major League batter's box. Yet Hit By Pitch had a t-score of -3.698, indicating that it is a statistically significant data point.

Another such statistic is Sacrifice Hits. Of the 26 players with the highest WAR scores, 18 had a zero in this category. Sacrifice Hits are situational and not a great indicator of success per plate appearance. A Sacrifice Hit happens when a player bunts to move baserunners into scoring position. The problem is that we are looking at the top 26 players by WAR, which means they probably provide valuable offense to their team. If they are among the best hitters, why not just let them try to get a hit? A hit could also provide more value, since both the batter and the baserunners could end up on base. Despite this, Sacrifice Hits had a high t-value of -4.834.

There were more stats like this that I removed, but I was able to land on a model that includes stats like Batting Average, Slugging Percentage, and On-Base Plus Slugging. What I like about these stats is that they tell you what a player is doing on a per-at-bat basis. For example, On-Base Plus Slugging takes into account many aspects of great hitting. Slugging Percentage tells us about a player's ability to hit for average and power, counting singles, doubles, triples, home runs, and at-bats in its calculation, while On-Base Percentage accounts for all the ways a player can get on base, including hits, walks, and hit-by-pitches. Add On-Base Percentage and Slugging Percentage together and you get On-Base Plus Slugging, a really high-quality statistic that tells us a lot about a batter.
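As a concrete check, OPS really is just the sum of the two components, using Mike Trout's 2019 line from the prediction cell above:

```r
obp <- 0.438   # Trout's 2019 on-base percentage
slg <- 0.645   # Trout's 2019 slugging percentage
ops <- obp + slg
ops   # 1.083, the OPS value used in the prediction
```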

In the end, I got an adjusted R-squared of 0.7602, which is not as good as 0.909. When I tested the model on two of the best players of the 2019 MLB season, Mike Trout and Cody Bellinger, it was off by 1.8 and 2.6 wins. Obviously, this model has room for improvement. Perhaps I should not have taken out as many of the stats with high t-scores, even though they might not be the greatest indicators of batting ability. The next time I do a project involving regression analysis, I would like to build a model with a better balance between stats I trust and a good fit. Perhaps analyzing a subject I have no background knowledge in would improve my model, because I would be looking at the numbers without bias. This project has shown me that the applications of regression analysis are nearly endless: it fits almost any setting where a set of independent factors leads to a statistic.