11  Linear and Generalized Linear Models

Francis Galton, a polymath with interests ranging from statistics and psychology to meteorology and eugenics, made significant contributions to the field of statistical analysis. One of the concepts he developed was the regression towards the mean. This concept describes the tendency of extreme values in a population to become closer to the average in subsequent generations. Galton also developed the correlation coefficient, a measure of the linear relationship between two variables. This tool has become a fundamental concept in statistics and is used in various fields to analyze relationships between variables. Galton’s contributions to statistical analysis were numerous and profound. He not only developed new methodologies and techniques but also promoted a data-driven approach to research that continues to shape statistical practice today. Furthermore, he introduced the concept of quantiles, which divide a population into equal-sized subpopulations based on a specific variable. This allowed for a more nuanced analysis of data compared to simply considering the mean and median. He also popularized the use of percentiles, which are specific quantiles used to express the proportion of a population below a certain value. One of the main tools used by Galton was arguably the simplest predictive model for regression, the linear model. It is a linear combination of the input variables \[ f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p, \] where \(\beta_0,\ldots,\beta_p\) are the model parameters and \(x_1,\ldots,x_p\) are the input variables. When the model parameters are estimated from the training data using the least squares loss function, the estimated values are called the ordinary least squares (OLS) estimates.

Galton used regression analysis to show that the offspring of sweet peas with exceptionally large or small seeds tended to be closer to the average size. He also used it for family studies and investigated the inheritance of traits such as intelligence and talent. He used regression analysis to assess the degree to which these traits are passed down from parents to offspring.

Galton, Karl Pearson, and F. Y. Edgeworth were pioneers in the development of multiple regression and correlation analysis. These techniques helped quantify the relationships between multiple variables, providing valuable insights in social science research. Additionally, Galton developed family study methods to analyze the inheritance of traits, laying the groundwork for the study of genetics.

Galton’s overall approach to statistics was highly influential. He emphasized the importance of quantitative analysis, data-driven decision-making, and empirical research, which paved the way for modern statistical methods and helped to establish statistics as a recognized scientific discipline.

Example 11.1 To demonstrate linear regression we develop a model that predicts the price of a house, given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living area measured in square feet and the price measured in thousands of dollars.

Second, we build a model that predicts price as a function of the observed characteristics. Now we need to decide which characteristics to use. There are many factors or variables that affect the price of a house, for example location. Some other basic ones include size, number of rooms, and parking.

We will use the SaratogaHouses dataset from the mosaicData package. This dataset contains information about house sales in Saratoga County, New York, between 2004 and 2007. The dataset contains 1,728 observations and 16 variables. We will use the price variable as the dependent variable and the livingArea variable as the independent variable. The price variable contains the sale price of the house in dollars, and the livingArea variable contains the living area of the house in square feet; in the code below we rescale both, so that price is measured in thousands of dollars and living area in thousands of square feet. \[ \mathrm{price} = \beta_0 + \beta_1 \mathrm{livingArea} \]

The value that we seek to predict, price, is called the dependent variable, and we denote it by \(y\). The variable that we use to construct our prediction, livingArea, is the predictor variable, and we denote it by \(x\). Typically, we use these notations to write down the formula \[ y = \beta_0 + \beta_1 x. \]

First, let’s see what the data looks like.

Code
d = read.csv("../../data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
plot(d$livingArea,d$price, 'p', pch=20, col="blue") 

We use the lm function to estimate the parameters of the line.

Code
l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=20, col="blue", xlab="Living Area", ylab="Price") 
abline(l, col="red", lwd=3)

This R code is fitting a linear regression model using the lm function. Let’s break down the code:

  1. lm: This is the function used for fitting linear models; it uses the least squares loss function.

  2. price: This is the dependent variable, the variable you are trying to predict. It is assumed to be in the dataset specified by the data argument.

  3. ~: In the context of the formula inside lm, the tilde (~) separates the dependent variable from the independent variable(s).

  4. livingArea: This is the independent variable or predictor. In this case, the model is trying to predict the variable price based on the values of livingArea.

  5. data = d: This specifies the dataset that contains the variables price and livingArea. The dataset is named d.

So, in plain English, this R code is fitting a linear regression model where the price is predicted based on the livingArea in the dataset d. The model is trying to find a coefficient for livingArea to minimize the difference between the predicted values and the actual values of price.

We can use the coef function to find out the slope and the intercept of this line.

Code
coef(l)
(Intercept)  livingArea 
         13         113 

The intercept value is in units of \(y\) ($1000). The slope is in units of \(y\) per unit of \(x\) ($1000 per 1000 sq. feet). In our housing example, we find that \(\beta_0 = 13.44\) and \(\beta_1 =113.12\). Thus, every additional 1000 sq. feet of living area increases the price of the house by about $113,120 (roughly $113 per square foot), and the predicted price of a lot of land without a house is about $13,440. The magnitude of \(\beta_1\) shows the economic significance of the house’s square footage.

We can now predict the price of a house when we only know its size: we simply read the value off the regression line. For example, given a house size of \(x = 2.2\) (2,200 sq. feet), the predicted price is \[ \hat{y} = 13.44 + 113.12 \times 2.2 = 262.3 \]

Code
plot(d$livingArea, d$price, 'p', pch=20, col="blue", xlab="Living Area", ylab="Price")
abline(l, col="red", lwd=3)
# Draw dashed lines from the prediction point to the axes
library(latex2exp)
x = 2.2
yhat = x*coef(l)[2]+coef(l)[1]
lines(c(x,x), c(0,yhat), col="green", lwd=3, lty=2)
lines(c(0,x), c(yhat,yhat), col="green", lwd=3, lty=2)
text(x-0.7,yhat+30, TeX("$\\hat{y} =  262.3$"), cex=1.5)
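
The same number can be obtained directly from the fitted model object with the predict function:

Code
predict(l, newdata = data.frame(livingArea = 2.2))   # about 262.3, matching the calculation above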

11.1 Prediction vs Interpretation

As we discussed at the beginning of this chapter, the predictive rule can be used for two purposes: prediction and interpretation. The goal of interpretation is to understand the relationship between the input and output variables. The two goals are not mutually exclusive, but they are often in conflict. For example, a model that is good at predicting the target variable might not be good at interpreting the relationship between the input and output variables. A nice feature of a linear model is that it can be used for both purposes, unlike more complex predictive rules with many parameters that can be difficult to interpret.

Typically the problem of interpretation requires a simpler model. We prioritize models that are easy to interpret and explain, even if they have slightly lower predictive accuracy. The evaluation metrics are also different: we typically use the coefficient of determination (R-squared) or p-values, which provide insights into the model’s fit and the significance of the estimated relationships.

The choice between using a model for prediction or interpretation depends on the specific task and desired outcome. If the primary goal is accurate predictions, a complex model with high predictive accuracy might be preferred, even if it is less interpretable. However, if understanding the underlying relationships and causal mechanisms is crucial, a simpler and more interpretable model might be chosen, even if it has slightly lower predictive accuracy. Typically interpretive models are used in scientific research, social sciences, and other fields where understanding the underlying causes and relationships is crucial.

In practice, it’s often beneficial to consider both prediction and interpretation when building and evaluating models. However, it is not unusual to build two different models, one for prediction and one for interpretation. This allows for a more nuanced analysis of the data and can lead to better insights than using a single model for both purposes.

Below we consider a linear model that allows us to understand the relations between the input and output variables.

Example 11.2 (Keynes Investment Performance) John Maynard Keynes was a good investor, achieving a long-term average annual return of 16% while managing the King’s College Cambridge endowment fund from 1921 to 1946. This performance significantly outperformed the overall market during that period. In the 1921-1929 period he experimented with investments in art and currency, as well as cyclical equity investing. However, from the 1930s onwards Keynes transitioned to a value investing strategy, focusing on undervalued stocks with strong fundamentals. This coincided with the Great Depression, when such investments offered significant opportunities. This shift led to consistent outperformance, with the King’s College endowment generating returns exceeding the market on average. In our analysis we consider his returns during the 1928-1945 period. Below is the plot of his returns.

Code
keynes = read.csv("../../data/keynes.csv",header=T)
head(keynes)
Year Keynes Market Rate
1928 -3.4 7.9 4.2
1929 0.8 6.6 5.3
1930 -32.4 -20.3 2.5
1931 -24.6 -25.0 3.6
1932 44.8 -5.8 1.5
1933 35.1 21.5 0.6
Code
attach(keynes)
# Plot the data
plot(Year,Keynes,pch=20,col=34, cex=3, type='l')

Code
mean(Keynes)
 13

Let’s compare his performance to the overall stock market. The Dow Jones Industrial Average grew at an average annual rate of 8.5% during the same period (1921-1946). Keynes was able to consistently generate alpha, exceeding the market’s overall returns.

The plot below shows the scatterplot of the market returns vs Keynes’ portfolio returns. The correlation coefficient is 0.74, which is high: when markets were doing well, Keynes also did well.

Code
plot(Market, Keynes, xlab="Market Return", ylab="Keynes Return",col=20, pch=16)

The correlation coefficient is

Code
cor(Keynes,Market)
 0.74

Now we fit a linear regression model \[ \text{Keynes} = \alpha + \beta \text{Market} \]

Code
plot(Market, Keynes, xlab="Market Return", ylab="Keynes Return",col=20, pch=16)
model = lm(Keynes~Market)
abline(model, lwd=3)

Code
# Extract the model coefficients
coef(model)
(Intercept)      Market 
       13.2         1.7 
Code
# summary(model)
model %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 13.2 4.75 2.8 0.01
Market 1.7 0.39 4.5 0.00

The intercept of the least squares line is \(\alpha = 13.2\), which is significantly higher than 0. This indicates that Keynes was able to generate higher returns than the market, even when the market was performing well. This is consistent with his value investing strategy, which allowed him to identify undervalued stocks and generate significant alpha.

Now we adjust for the risk-free rate and calculate excess returns.

Code
Keynes = Keynes - Rate
Market = Market - Rate
modelnew = lm(Keynes~Market)
modelnew %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 14.5 4.69 3.1 0.01
Market 1.8 0.37 4.8 0.00

We see that after the adjustment, the intercept \(\alpha=14.46\) is now even larger.

11.2 Linear Regression

Regression analysis is the most widely used statistical tool for understanding relationships among variables. It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest. This relationship is expressed in the form of an equation, which we call the model, connecting the response or dependent variable and one or more explanatory or predictor variables.

It is convenient to introduce a regression model using the language of probability and uncertainty. A regression model assumes that the mean of the output variable \(y\) depends linearly on the predictors \(x_1,\ldots,x_p\) \[ y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon,~ \text{where}~\epsilon \sim N(0, \sigma^2). \] Often, we use simpler dot-product notation \[ y = \beta^Tx + \epsilon, \] where \(\beta = (\beta_0,\beta_1,\ldots,\beta_p)\) is the vector of regression coefficients and \(x = (1,x_1,\ldots,x_p)\) is the vector of predictors, with 1 appended to the beginning.

The coefficients \(\beta_i\) have the same interpretation as in the deterministic approach. However, the additional term \(\epsilon\) is a random variable that captures the uncertainty in the relationship between \(y\) and \(x\); it is called the error term or the residual. The error term is assumed to be normally distributed with mean zero and variance \(\sigma^2\). Thus, the linear regression model has a new parameter \(\sigma^2\) that models the dispersion of \(y_i\) around the mean \(\beta^Tx\). Let’s see an example.
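
To make the role of \(\sigma^2\) concrete, here is a small simulation (with made-up parameter values, not the housing data) that generates observations from such a model and recovers the coefficients with lm:

Code
# Simulate y = 2 + 3x + eps with eps ~ N(0, 0.5^2), then fit the linear model
set.seed(1)
n = 200
x = runif(n)
y = 2 + 3*x + rnorm(n, mean = 0, sd = 0.5)
coef(lm(y ~ x))   # estimates should be close to (2, 3)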

Example 11.3 (House Prices) Let’s go back to the Saratoga Houses dataset

Code
d = read.csv("../../data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
l = lm(price ~ livingArea, data=d)
summary(l)

Call:
lm(formula = price ~ livingArea, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
-277.0  -39.4   -7.7   28.4  553.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    13.44       4.99    2.69   0.0072 ** 
livingArea    113.12       2.68   42.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 69 on 1726 degrees of freedom
Multiple R-squared:  0.507, Adjusted R-squared:  0.507 
F-statistic: 1.78e+03 on 1 and 1726 DF,  p-value: <2e-16

The output of the lm function has several components. Besides calculating the estimated values of the coefficients, given in the Estimate column, the method also calculates the standard error (Std. Error) and t-statistic (t value) for each coefficient. The t-statistic is the ratio of the estimated coefficient to its standard error. The p-value (Pr(>|t|)) is the probability of observing a value at least as extreme as the one observed, assuming the null hypothesis is true. The null hypothesis is that the coefficient is equal to zero. If the p-value is less than 0.05, we typically reject the null hypothesis and conclude that the coefficient is statistically significant. In this case, the p-value for the livingArea coefficient is less than 0.05, so we conclude that the coefficient is statistically significant. This means that the size of the house is a statistically significant predictor of the price. The residual standard error is the standard deviation of the residuals \(\hat y_i - y_i,~i=1,\ldots,n\).

The estimated values of the parameters were calculated using the least squares loss function discussed above. The residual standard error is also relatively easy to calculate from the model residuals \[ s_e = \sqrt{ \frac{1}{n-2} \sum_{i=1}^n ( \hat y_i - y_i )^2 }. \] Now the question is, how was the p-value for the estimates calculated? And why did we assume that \(\epsilon\) is normally distributed in the first place? The normality of \(\epsilon\), and as a consequence the conditional normality \(y \mid x \overset{iid}{\sim} N(\beta^Tx, \sigma^2)\), is easy to explain: it is simply due to mathematical convenience. Plus, this assumption happens to describe reality well in a wide range of applications. One of those conveniences is our ability to calculate the mean and variance of the distribution of \(\hat \beta_0\) and \(\hat \beta_1\).
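
As a quick check, we can compute \(s_e\) by hand from the residuals of the Saratoga fit l above and compare it with the value reported by lm:

Code
n = nrow(d)
s_e = sqrt(sum(residuals(l)^2) / (n - 2))
c(by_hand = s_e, from_lm = sigma(l))   # both are about 69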

To understand how the p-values are calculated, we first notice that there is uncertainty about the values of the parameters \(\beta_i\). To get a feeling for the amount of variability, imagine that we have two sample data sets. For example, we have housing data from two different realtor firms. Do you think the estimated value of price per square foot will be the same for both of those? The answer is no. Let’s demonstrate with an example: we resample 20 data sets from the original data and estimate 20 different linear models.

Code
set.seed(92) #Kuzy
d = read.csv("../../data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
plot(d$livingArea, d$price, pch=20, col="blue", xlab="Living Area", ylab="Price")
for (i in 1:20) {
  # Sample with replacement
  dnew = d[sample(1:nrow(d),replace=T),]
  # Fit a linear model
  l = lm(price ~ livingArea, data=dnew, subset = sample(1:nrow(d),100))
  abline(l, col="green", lwd=3)
}

Figure 11.1: Twenty different linear models estimated using randomly selected subsamples of the data.

Figure 11.1 shows the results of this simulation. We can see that the estimated coefficients \(\hat \beta_i\) are different for each of the 20 samples. This is due to the sampling error. The sampling error is the difference between the estimated value of a parameter and its true value. The value of \(\beta_1\) will differ from sample to sample. In other words, it will be a random variable. The sampling distribution of \(\beta_1\) describes how it varies over different samples with the \(x\) values fixed. The statistical view of linear regression allows us to calculate confidence and prediction intervals for estimated parameters. It turns out that when the least squares principle is used, the estimated \(\hat\beta_1\) is normally distributed: \(\hat\beta_1 \sim N( \beta_1 , s_1^2 )\). Let’s see how we can derive this result.

The extension of the central limit theorem, sometimes called the Lindeberg CLT, states that a linear combination of independent random variables that satisfy some mild conditions is approximately normally distributed. We can show that the estimates of \((\beta_0,\ldots,\beta_p)\) are linear combinations of the observed values of \(y\) and are therefore normally distributed. Indeed, write the linear regression model in matrix form \[ Y = X \beta + \epsilon, \] where \(Y\) is the vector of observed values of the dependent variable, \(X\) is the matrix of observed values of the independent variables, \(\beta\) is the vector of unknown parameters, and \(\epsilon\) is the vector of errors. Then, if we take the derivative of the least squares loss function and set it to zero, we get the following expression for the estimated parameters \[ \hat \beta = AY, \] where \(A = (X^TX)^{-1}X^T\). Due to the Lindeberg central limit theorem, \(\hat \beta\) is normally distributed. This is a useful property that allows us to calculate confidence intervals and p-values for the estimated parameters.

Now, we need to compute the mean and variance of \(\hat \beta\). Substituting \(Y = X\beta + \epsilon\) into \(\hat \beta = AY\) gives \[ \hat{\beta} = A(X\beta + \epsilon) = (X^TX)^{-1}X^TX\beta + A\epsilon = \beta + A\epsilon. \]

The expectation of \(\hat{\beta}\) is \[ E[\hat{\beta}] = E[\beta + A\epsilon] = \beta + A\,E[\epsilon] = \beta, \] since \(\beta\) is constant and \(E[\epsilon] = 0\).

The variance is given by \[ Var(\hat{\beta}) = Var(A\epsilon). \] If we assume that \(\epsilon\) is independent of \(X\) and that the elements of \(\epsilon\) are uncorrelated, we can write \[ Var(\hat{\beta}) = A\,Var(\epsilon)A^T. \] Given \(Var(\epsilon) = \sigma^2 I\), we have \[ Var(\hat{\beta}) = \sigma^2 (X^TX)^{-1}. \]

Putting together the expectation, the variance, and the normality implied by the central limit theorem, we get the following distribution for \(\hat{\beta}\): \[ \hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}). \]
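
We can verify these matrix formulas numerically on the Saratoga data, a short sketch that rebuilds the design matrix and compares the results with lm and vcov:

Code
X = cbind(1, d$livingArea)              # design matrix with an intercept column
y = d$price
bhat = solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^{-1} X'y
fit = lm(price ~ livingArea, data = d)
cbind(bhat, coef(fit))                  # the two estimates agree
sigma(fit)^2 * solve(t(X) %*% X)        # sigma^2 (X'X)^{-1} ...
vcov(fit)                               # ... matches the covariance reported by lm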

The statistical approach to linear regression is useful. We can think of the estimated coefficient \(\hat \beta_i\) as the average amount of change in \(y\) when \(x_i\) goes up by one unit. Since this average was calculated using sample data, it is subject to sampling error, and the sampling error is modeled by the normal distribution. Assuming that the residuals \(\epsilon\) are independently normally distributed with a variance that does not depend on \(x\) (homoscedasticity), we can calculate the mean and variance of the distribution of \(\hat \beta_i\). This is a useful property that allows us to calculate confidence intervals and p-values for the estimated parameters.

In summary, the statistical view of the linear regression model is useful for understanding the uncertainty associated with the estimated parameters. It also allows us to calculate confidence intervals and prediction intervals for the output variable.

  1. The average value of the output \(y\) is a linear function of the input \(x\) and lies on the regression line \(\hat y_i = \beta^Tx_i\).
  2. The values of \(y\) are statistically independent.
  3. The observed value \(y_i = \hat y_i + \epsilon_i\) is a random variable, and it is normally distributed around the mean with variance \(\sigma^2\). This variance is the same for all values of \(y\).
  4. The estimated values of the parameters \(\hat \beta_i\) are calculated from observed data, are subject to sampling error, and we are not certain about them. This uncertainty is modeled by a normal distribution centered at the true values \(\beta\). Given that the errors \(\epsilon_i\) are homoscedastic and independent, we have \(Var(\hat{\beta}) = \sigma^2 (X^TX)^{-1}\).

Again, consider the house example. Say in our data we have 10 houses with the same square footage, say 2000. The third point states that the prices of those houses should follow a normal distribution, and if we are to compare prices of 2000 sqft houses and 2500 sqft houses, they will have the same standard deviation. The second point means that the price of one house does not depend on the price of another house.

All of the assumptions of the regression model can be written compactly using probabilistic notation: \[ y \mid x \overset{iid}{\sim} N(\beta^Tx, \sigma^2). \]

In the case when we have only one predictor, the variance of the estimated slope \(\hat \beta_1\) is given by \[ Var(\hat \beta_1) = \frac{\sigma^2}{\sum_{i=1}^n ( x_i - \bar{x} )^2 } = \frac{ \sigma^2 }{ (n-1) s_x^2 }, \] where \(s_x^2\) is the sample variance of \(x\). Thus, there are three factors that impact the size of the standard error of \(\hat\beta_1\): the sample size (\(n\)), the error variance (\(\sigma^2\)), and the \(x\)-spread, \(s_x\).
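
We can check this formula against the standard error reported by lm for the Saratoga fit:

Code
fit = lm(price ~ livingArea, data = d)
se_formula = sigma(fit) / (sqrt(nrow(d) - 1) * sd(d$livingArea))
c(formula = se_formula, from_lm = summary(fit)$coefficients["livingArea", "Std. Error"])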

We can empirically demonstrate the sampling error by simulating several samples from the same distribution and estimating several linear models. We can see that the estimated coefficients \(\hat \beta_i\) are normally distributed around the true values \(\beta_i\). If we plot coefficients for 1000 different models, we can see that the empirical distribution resembles a normal distribution.

Code
set.seed(92) #Kuzy
# Read housing data
d = read.csv("../../data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
# Simulate 1000 samples
n = 1000
# Create a matrix to store the results
beta = matrix(0, nrow=n, ncol=2)
# Simulate 1000 samples
for (i in 1:n) {
  # Sample with replacement
  dnew = d[sample(1:nrow(d),replace=T),]
  # Fit a linear model
  l = lm(price ~ livingArea, data=dnew)
  # Store the coefficients
  beta[i,] = coef(l)
}
ind = 2
# Plot the results
plot(beta[,1], beta[,2], pch=20, col="blue", xlab="Intercept", ylab="Slope")
abline(h=coef(l)[2], lwd=3, col="red")
abline(v=coef(l)[1], lwd=3, col="red")

Code
X = cbind(1, d$livingArea)
# Calculate the variance of the coefficients, sigma^2 (X'X)^{-1}
var = sigma(l)^2 * solve(t(X) %*% X)
varb = var[ind,ind]/0.63
# Histogram of the bootstrapped slopes with a normal density overlaid
hist(beta[,ind], col="blue", xlab="Slope", main="",freq=F)
bt = seq(80,140,0.1)
lines(bt, dnorm(bt, mean=mean(beta[,ind]), sd=sqrt(varb)), col="red", lwd=3)

Accounting for the uncertainty in \(\hat \beta\), we can calculate confidence intervals for the predicted average \(\hat y\). When we additionally account for the variability of an individual observation around the regression line (the error term \(\epsilon\)), we can calculate prediction intervals.
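
For example, for a 2,200 sq. ft. house in the Saratoga data we can compute both intervals with the predict function:

Code
fit = lm(price ~ livingArea, data = d)
new = data.frame(livingArea = 2.2)
predict(fit, new, interval = "confidence")   # interval for the average price
predict(fit, new, interval = "prediction")   # wider interval for an individual house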

Another advantage of adopting a statistical view of the linear regression model is the ability to quantify information about potential outliers. Outliers are points that are extreme relative to our model predictions. Recall that the residual is \(e_i = y_i- \hat y_i\). Since the error term is normally distributed, the residual also follows a normal distribution. It is easy to see that \[e_i \sim N(0, s_e^2),\] where \(s_e^2\) is an empirical estimate of the error’s variance.

Consider the relation between the fitted values \(\hat y_i\) and residuals \(e_i\). Our predictions are given by the line. The residual \(e_i\) and predicted value \(\hat y_i\) for the \(i\)th observation are related via \[y_i = \hat{y}_i + ( y_i - \hat{y}_i ) = \hat{y}_i + e_i\].

Residuals allow us to define outliers: outliers are simply observations with large residuals. We re-scale the residuals by their standard errors, which lets us define the standardized residual \[r_i = \frac{ e_i }{ s_{ e} } =\frac{y_i - \hat{y}_i }{ s_{ e } }.\] Since residuals follow a normal distribution \(e \sim N(0,\sigma^2)\), about 95% of the time we expect the standardized residuals to satisfy \(- 2 < r_i < 2\). Any observation with \(|r_i| > 3\) is an extreme outlier; it is three sigmas away from the mean.

Another type of observation we are interested in is the influential point. These are observations that affect the magnitude of our estimates \(\hat{\beta}\). They are important to find as they typically have economic consequences. We will use Cook’s distance to assess the significance of an influential point. Cook’s distance associated with sample \(i\) measures the change in the estimated model parameters \(\hat \beta\) when sample \(i\) is removed from the training data set.
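
In R, standardized residuals and Cook’s distances are available via the rstandard and cooks.distance functions; a short sketch using the Saratoga fit:

Code
fit = lm(price ~ livingArea, data = d)
r = rstandard(fit)                                  # residuals scaled by their standard errors
sum(abs(r) > 2)                                     # observations flagged as potential outliers
head(sort(cooks.distance(fit), decreasing = TRUE))  # the most influential observations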

Intuitively, the regression model captures the regression-to-the-mean effect. This is one of the most interesting statistical effects you’ll see in daily life. In statistics, regression does not mean “going backwards”, but rather the tendency for a variable that is extremely high or low to move closer to the average upon subsequent measurement. For example, Francis Galton, who was a cousin of Charles Darwin, showed in his study of height that if your parents are taller than the average, you’ll regress back to the average. While people might expect the children of tall parents to be even taller and the children of short parents to be even shorter, Galton found that this wasn’t the case. Instead, he observed that the heights of the children tended to be closer to the average height for the population. Galton termed this phenomenon “regression towards mediocrity” (now more commonly known as “regression to the mean”). It meant that extreme characteristics (in this case, height) in parents were likely to be less extreme (closer to the average) in their children. It is a classic example that helped introduce and explain this statistical concept. Galton’s finding was one of the first insights into what is now a well-known statistical phenomenon. It doesn’t imply that all individual cases will follow this pattern; rather, it’s a trend observed across a population. It’s important to understand that regression to the mean doesn’t suggest that extreme traits diminish over generations but rather that an extreme measurement is partly due to random variation and is likely to be less extreme upon subsequent measurement.

Another example was documented by Daniel Kahneman in his book Thinking, Fast and Slow. He found that when a person performs a task, their performance is partly due to skill and partly due to luck. He observed that when a person performs a task and achieves an extreme result, their subsequent performance is likely to be less extreme. In particular, he studied the effect of criticism and praise used by Israeli Air Force flight instructors. After criticism, the low-scoring pilots were retested. Often, their scores improved. At first glance, this seems like a clear effect of feedback from the instructor. However, some of this improvement is likely a statistical artifact and demonstrates the regression to the mean effect.

Why? Those pilots who initially scored poorly were, statistically speaking, somewhat unlucky. Their low scores may have been due to a bad day, stress, or other factors. When retested, their scores are likely to be closer to their true skill level, which is closer to the average. This natural movement towards the average can give the illusion that the intervention (praise or criticism) was more effective than it actually was. Conversely, if the top performers were praised and retested, we might find their scores decrease slightly, not necessarily due to the inefficacy of the praise but due to their initial high scores being partly due to good luck or an exceptionally good day. In conclusion, in pilot training and other fields, it’s important to consider regression to the mean when evaluating the effectiveness of interventions. Without this consideration, one might draw incorrect conclusions about the impact of training or other changes.

Example 11.4 (Google vs S&P 500) We will demonstrate how we can use statistical properties of a linear regression model to understand the relationship between the returns of Google stock and the S&P 500 index. We will use the Capital Asset Pricing Model (CAPM) regression to estimate the expected return of an investment in Google stock and to price the risk. The CAPM model is

\[ \mathrm{GOOG} = \alpha + \beta \mathrm{SP500} + \epsilon \] On the left hand side, we have the return that investors expect to earn from investing into Google stock. In the CAPM model, this return is typically modeled as a dependent variable.

The input variable SP500 represents the average return of the entire US market. Beta measures the volatility or systematic risk of a security or a portfolio in comparison to the market as a whole. A beta greater than 1 indicates that the security is more volatile than the market, while a beta less than 1 indicates it is less volatile. Alpha is the intercept of the regression line, it measures the excess return of the security over the market. The error term \(\epsilon\) captures the uncertainty in the relationship between the returns of Google stock and the market.

In a CAPM regression analysis, the goal is to find out how well the model explains the returns of a security based on its beta. This involves regressing the security’s excess returns (returns over the risk-free rate) against the excess returns of the market. The slope of the regression line represents the beta, and the intercept should ideally be close to the risk-free rate, although in practice it often deviates. This model helps in understanding the relationship between the expected return and the systematic risk of an investment.

Based on the uncertainty associated with the estimates for alpha and beta, we can formulate some hypothesis tests, for example

  1. \(H_0 : \beta = 0\), i.e., is Google related to the market?
  2. \(H_0 : \alpha = 0\), i.e., does Google out-perform the market in a consistent fashion?
Code
getSymbols(Symbols = c("GOOG","SPY"),from='2017-01-03',to='2023-12-29')
 "GOOG" "SPY" 
gret = as.numeric(dailyReturn(GOOG))
spyret = as.numeric(dailyReturn(SPY))
l = lm(gret ~ spyret)
tidy(l) %>% knitr::kable(digits=4)
term estimate std.error statistic p.value
(Intercept) 0.0003 0.0003 1.1 0.27
spyret 1.1706 0.0240 48.8 0.00

Google vs S&P 500 returns between 2017-2023

Code
plot(spyret, gret, pch=20, col="blue", xlab="SPY Return", ylab="Google Return")
abline(l, lwd=3, col="red")

Here’s what we get after we fit the model using the lm function.

Our best estimates are \[ \hat \alpha = 0.0003 \; , \; \hat{\beta} = 1.17 \]

Now we can provide the results for the hypotheses we set at the beginning. Given that the p-value for \(H_0: \beta = 0\) is <2e-16, we can reject the null hypothesis and conclude that Google is related to the market. The p-value for \(H_0: \alpha = 0\) is 0.27, which is greater than 0.05, so we cannot reject the null hypothesis; we conclude that Google did not out-perform the market in a consistent fashion in the 2017-2023 period.

Further, we can answer some of the other important questions, such as how much will Google move if the market goes up \(10\)%?

Code
alpha = coef(l)[1]
beta = coef(l)[2]
# Calculate the expected return
alpha + beta*0.1
(Intercept) 
       0.12 

However, if we look at the earlier period between 2005-2016 (the earlier days of Google) the results will be different.

Code
getSymbols(Symbols = c("GOOG","SPY"),from='2005-01-03',to='2016-12-29');
 "GOOG" "SPY" 
gret = as.numeric(dailyReturn(GOOG))
spyret = as.numeric(dailyReturn(SPY))
l = lm(gret ~ spyret)
tidy(l) %>% knitr::kable(digits=4)
Table 11.1: Google vs S&P 500 returns between 2005-2016
term estimate std.error statistic p.value
(Intercept) 0.0006 0.0003 2.2 0.03
spyret 0.9231 0.0230 40.1 0.00
Code
plot(spyret, gret, pch=20, col="blue", xlab="SPY Return", ylab="Google Return")
abline(l, lwd=3, col="red")

Figure 11.2: Google vs S&P 500 returns between 2005-2016

In this period Google did consistently outperform the market: the p-value for \(H_0: \alpha = 0\) is 0.03, so we reject the null hypothesis at the 5% level.

11.3 Factor Regression and Feature Engineering

A linear model assumes that the output variable is proportional to the input variable plus an offset. However, this is not always the case. Often, we need to transform input variables by combining multiple inputs into a single predictor, for example by taking a ratio, or by putting inputs on a different scale, e.g. the log-scale. In machine learning, this process is called feature engineering.

One of the classic examples of feature engineering is the Fama-French three-factor model, which is used in asset pricing and portfolio management. The model states that asset returns depend on (1) market risk, (2) the outperformance of small versus big companies, and (3) the outperformance of high versus low book-to-market companies, mathematically \[r = R_f + \beta(R_m - R_f) + b_s\cdot SMB + b_v\cdot HML + \alpha\] Here \(R_f\) is the risk-free return, \(R_m\) is the return of the market, \(SMB\) stands for "Small market capitalization Minus Big" and \(HML\) for "High book-to-market ratio Minus Low"; they measure the historic excess returns of small caps over big caps and of value stocks over growth stocks. These factors are calculated with combinations of portfolios composed of ranked stocks (BtM ranking, Cap ranking) and available historical market data.
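
As a sketch of how such a factor regression is estimated, the code below simulates a small data set with hypothetical column names (mkt_rf, smb, hml, rf, ret) and made-up factor loadings, and then fits the three-factor model as an ordinary linear regression of the excess return on the factors:

Code
set.seed(42)
n  = 120                                   # ten years of monthly observations
ff = data.frame(mkt_rf = rnorm(n, 0.5, 4), smb = rnorm(n, 0.2, 2),
                hml = rnorm(n, 0.3, 2), rf = 0.2)
ff$ret = ff$rf + 1.1*ff$mkt_rf + 0.4*ff$smb - 0.2*ff$hml + rnorm(n, 0, 3)  # made-up loadings
ff3 = lm(I(ret - rf) ~ mkt_rf + smb + hml, data = ff)
coef(ff3)   # the intercept estimates alpha; the slopes estimate beta, b_s, and b_v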

11.3.1 Logarithmic and Power Transformations

Consider the growth of the Apple stock between 2000 and 2024. With the exception of the 2008 financial crisis period and the 2020 COVID-19-related declines, the stock price has been growing exponentially. Figure 11.3 shows the price of the Apple stock between 2000 and 2024. The price is closely related to the company’s growth.

Code
getSymbols(Symbols = "AAPL",from='2000-01-01',to='2023-12-31');
 "AAPL"
Code
plot(AAPL$AAPL.Adjusted, type='l', col="blue", xlab="Date", ylab="Price")

Figure 11.3: Apple stock price growth in the 2000-2024 period

The 2008 and 2020 declines are more related to extraneous factors than to the growth of the company. Thus, we can conclude that the overall growth of the company is exponential. Indeed, if we try to fit a linear model to the time-price data, we will see that the model does not fit the data well.

Code
time = index(AAPL) - start(AAPL)
plot(time,AAPL$AAPL.Adjusted, type='l', col="blue", xlab="Date", ylab="Price")
abline(lm(AAPL$AAPL.Adjusted ~ time), col="red", lwd=3)

Just by visual inspection we can see that a straight line will not be a good fit for this data. The growth of a successful company typically follows the rule of compounding. Compounding is a fundamental principle that describes how some quantity grows over time when this quantity increases by a fixed percentage periodically. This is a very common phenomenon in nature and business. For example, if two parents have 2.2 children on average, then the population increases by 10% every generation. Another example is the growth of an investment in a savings account.

If you invest \(1000\) in a savings account with a \(10\%\) annual interest rate and get paid once a year, then your account value will be \(1100\) by the end of the year. However, if you get paid \(n\) times a year and initially invest \(y_0\), the final value \(y\) of the account after \(t\) years will be \[ y = y_0 \times (1 + r/n)^{nt}, \] where \(r\) is the annual interest rate. When you get paid every month (\(n=12\)), a traditional payout schedule used by banks, then \[ y = 1000 \times (1 + 0.1/12)^{12} = 1105, \] a value slightly higher than the annual payout of 1100.

The effect of compounding is minimal in the short term. However, it is more pronounced when the growth rate is higher and the time period is longer. For example, at \(r=2\), \(n=365\) and a 4-year period \(t=4\), you get \[ y = 1000 \times (1 + 2/365)^{4\times 365} = 2,916,565. \] Your account is close to 3 million dollars! Compare this to the \(n=1\) scenario \[ y = 1000 \times (1 + 2)^{4} = 81,000, \] when you end up with merely 81 thousand. This is why compounding is often referred to as the “eighth wonder of the world” in investing contexts, emphasizing its power in growing wealth over time.
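
These numbers are easy to reproduce in R:

Code
1000 * (1 + 0.10/12)^12          # monthly compounding at 10% for one year: about 1105
1000 * (1 + 2/365)^(365*4)       # daily compounding at r=2 for 4 years: about 2.9 million
1000 * (1 + 2)^4                 # annual compounding at r=2 for 4 years: 81,000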

In general, as \(n\) goes up, the future value approaches a constant:

Code
n = 1:300
r = 1
plot(n, (1+r/n)^n, type='l', col="blue", xlab="n", ylab="Future Value")
abline(h=exp(r), col="red", lwd=3)

Figure 11.4: Growth of an investment in a savings account when n increases and return rate is 100% per year

Figure 11.4 shows the growth of an investment in a savings account when \(n\) increases and the return rate is \(100\%\) per year. We can see that the growth rate approaches the constant \(e \approx 2.72\) as \(n\) increases. \[ (1+r/n)^n \rightarrow e^r,~\text{as}~n \rightarrow \infty. \] This limit was first derived by Leonhard Euler and the number \(e\) is known as Euler’s number.

Coming back to the growth of the Apple company, we can think of it growing at a small constant rate every day. The relation between time and the size of Apple is multiplicative, meaning that when time increases by one day, the size of the company increases by a small constant percentage. In contrast, a linear relation is additive, meaning that when time increases by one day, the size of the company increases by a constant amount. The exponential growth model is given by the formula \[ y = y_0 \times e^{\beta^Tx}. \] There are many business and natural science examples where a multiplicative relation holds. If we apply the \(\log\) function to both sides of the equation, we get \[ \log y = \log y_0 + \beta^Tx. \] This is a linear relation between \(\log y\) and \(x\). Thus, we can use linear regression to estimate the parameters of the exponential growth model by putting the output variable \(y\) on the log-scale.
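
To illustrate, we can put the Apple price on the log scale and fit a straight line to log-price versus time, reusing the AAPL series and the time variable defined above (a rough sketch that only captures the broad trend):

Code
p = as.numeric(AAPL$AAPL.Adjusted)
t = as.numeric(time)
expfit = lm(log(p) ~ t)
plot(t, log(p), type='l', col="blue", xlab="Days since start of sample", ylab="log(Price)")
abline(expfit, col="red", lwd=3)
exp(coef(expfit)[2]) - 1     # implied average daily growth rate over the sample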

Another example of a nonlinear relation that can be analyzed using linear regression is when variables are related via a power law. This concept helps us model proportional relationships or ratios. In a multiplicative relationship, when one variable changes on a percent scale, the other changes in a directly proportional manner, as long as the multiplying factor remains constant. For example, the relation between the size of a city and the number of cars registered in the city is given by a power law. When the size of the city doubles, the number of cars registered in the city is also expected to double. The power law model is given by the formula \[ y = \beta_0 x^{\beta_1}. \] If we apply the \(\log\) function to both sides of the equation, we get \[ \log y = \log \beta_0 + \beta_1 \log x. \] This is a linear relation between \(\log y\) and \(\log x\). Thus, we can use linear regression to estimate the parameters of the power law model by putting both the output variable \(y\) and the input variable \(x\) on the log-scale.

However, there are several caveats when putting variables on the log-scale. We need to make sure that the variable is positive. It means that we cannot apply log transformations to dummy or count variables.

Example 11.5 (World’s Smartest Mammal) We will demonstrate the power relation using the data on the brain (measured in grams) and body (measured in kilograms) weights for 62 mammal species. The data was collected by Harry J. Jerison in 1973. The data set contains the following variables:

Code
mammals = read.csv("../../data/mammals.csv")
knitr::kable(head(mammals))
Mammal Brain Body
African_elephant 5712.0 6654.00
African_giant_pouched_rat 6.6 1.00
Arctic_Fox 44.5 3.38
Arctic_ground_squirrel 5.7 0.92
Asian_elephant 4603.0 2547.00
Baboon 179.5 10.55

Let’s build a linear model.

Code
attach(mammals)
model = lm(Brain~Body)
plot(Body,Brain, pch=21, bg="lightblue")
abline(model, col="red", lwd=3)

Brain vs Body weight for 62 mammal species

We see a few outliers, which suggests that the normality assumption is violated. We can check the residuals by plotting residuals against fitted values and by plotting fitted vs true values.

Code
plot(model$fitted.values, model$residuals, pch=21, bg="lightblue", xlab="Fitted y", ylab="Residuals")
abline(h=0, col="red", lwd=3)
plot(Brain, model$fitted.values, pch=21, bg="lightblue", xlab="True y", ylab="Fitted y")
abline(a=0,b=1, col="red", lwd=3)

Figure 11.5: Fitted y vs Residuals

Figure 11.6: Fitted y vs True y

Remember that residuals should roughly follow a normal distribution with mean zero and constant variance. We can see that the residuals are not normally distributed and the variance increases with the fitted values. This is a clear indication that we need to transform the data. We can try a log-log transformation.

Code
model = lm(log(Brain)~log(Body))
model %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 2.18 0.11 20 0
log(Body) 0.74 0.03 23 0
Code
plot(model$fitted.values, model$residuals, pch=21, bg="lightblue")
abline(h=0, col="red", lwd=3)
plot(log(Brain), model$fitted.values, pch=21, bg="lightblue")
abline(a=0,b=1, col="red", lwd=3)

Figure 11.7: Fitted y vs Residuals

Figure 11.8: Fitted y vs True y

That is much better! The residual variance is constant and the plot of fitted vs true values shows a linear relationship. The log-log model, given by the formula \[ \log \mathrm{Brain} = 2.18 + 0.74 \log \mathrm{Body}, \] seems to achieve two important goals, namely linearity and constant variance. The coefficients are highly significant.
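
For example, we can use the fitted log-log model to predict the brain weight of a hypothetical 10 kg mammal, exponentiating to convert the prediction back from the log scale:

Code
exp(predict(model, newdata = data.frame(Body = 10)))   # predicted brain weight in grams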

Although the log-log model fits the data rather well, there are a couple of outliers there. Let us print the observations with the largest residuals.

Code
# res = rstudent(model)
res = model$residuals/sd(model$residuals)
outliers = order(res,decreasing = T)[1:10]
cbind(mammals[outliers,],
      Std.Res = res[outliers], Residual=model$residuals[outliers],
      Fit = exp(model$fitted.values[outliers])) %>% knitr::kable(digits=2)
Mammal Brain Body Std.Res Residual Fit
11 Chinchilla 64 0.42 3.41 2.61 4.7
34 Man 1320 62.00 2.53 1.93 190.7
50 Rhesus_monkey 179 6.80 2.06 1.58 36.9
6 Baboon 180 10.55 1.64 1.26 51.1
42 Owl_monkey 16 0.48 1.44 1.10 5.1
10 Chimpanzee 440 52.16 1.26 0.96 167.7
27 Ground_squirrel 4 0.10 1.18 0.91 1.6
43 Patas_monkey 115 10.00 1.11 0.85 49.1
60 Vervet 58 4.19 1.06 0.81 25.7
3 Arctic_Fox 44 3.38 0.92 0.71 22.0

There are two outliers, the Chinchilla and the Human; both have disproportionately large brains!

In fact, the Chinchilla has the largest standardized residual of 3.41, meaning that the predicted value of 4.7 g is 3.41 standard deviations away from the recorded value of 64 g. This suggests that the Chinchilla is a master race of supreme intelligence! However, after checking more carefully we realized that there was a recording error and the actual weight of an average Chinchilla’s brain is 6.4 g. We mistyped the decimal separator! Thus the actual residual is 0.4.

Code
abs(model$fitted.values[11] - log(6.4))/sd(model$residuals)
 11 
0.4 

In reality, the Chinchilla’s brain is not far from that of an average mammal of this size!

11.4 Interactions

In many situations, \(X_1\) and \(X_2\) interact when predicting \(Y\). An interaction occurs when the effect of one independent variable on the dependent variable changes at different levels of another independent variable. For example, consider a study analyzing the effect of study hours \(X_1\) and a tutoring program \(X_2\) (a binary variable where 0 = no tutoring, 1 = tutoring) on test scores \(Y\). Without an interaction term, we assume the effect of study hours on test scores is the same regardless of tutoring. With an interaction term, we can explore whether the effect of study hours on test scores is different for those who receive tutoring compared to those who do not. Here are a few more examples where there is a potential interaction.

  • Does gender change the effect of education on wages?
  • Do patients recover faster when taking drug A?
  • How does advertisement affect price sensitivity?
  • Interactions are useful. Particularly with dummy variables.
  • We build a kitchen-sink model with all possible dummies (day of the week, gender,...)

If we think that the effect of \(X_1\) on \(Y\) depends on the value of \(X_2\), we model it using a linear relation \[ \beta_1 = \beta_{10} + \beta_{11} X_2 \] and the model without interaction \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2\) becomes \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon. \] The interaction term captures how the effect of \(X_1\) on \(Y\) changes with \(X_2\). The coefficient \(\beta_3\) is the difference in the effect of \(X_1\) on \(Y\) between \(X_2=1\) and \(X_2=0\). If \(\beta_3\) is significant, then there is an interaction effect. If \(\beta_3\) is not significant, then there is no interaction effect.

In R:

Code
model = lm(y ~ x1 * x2)

gives \(X_1+X_2+X_1X_2\), and

Code
model = lm(y ~ x1:x2)

gives only \(X_1 X_2\)

The coefficients \(\beta_1\) and \(\beta_2\) are marginal effects.

If \(\beta_3\) is significant there’s an interaction effect and we leave \(\beta_1\) and \(\beta_2\) in the model whether they are significant or not.

Interaction between \(X_1\) and a dummy variable \(D\)

  • \(X_2 = D\) is a dummy variable with values of zero or one.
  • Model: typically we run a regression of the form \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 \star D + \epsilon\]
  • The coefficient \(\beta_1 + \beta_2\) is the effect of \(X_1\) when \(D=1\). The coefficient \(\beta_1\) is the effect when \(D=0\). A small simulation after this list illustrates how these two slopes can be recovered from data.
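
Here is that simulation, using made-up coefficients (slope 2 when \(D=0\), slope 3.5 when \(D=1\)):

Code
set.seed(7)
n  = 500
x1 = runif(n)
D  = rbinom(n, 1, 0.5)
y  = 1 + 2*x1 + 1.5*x1*D + rnorm(n, sd = 0.3)   # slope is 2 when D=0 and 3.5 when D=1
coef(lm(y ~ x1 + x1:D))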

Example 11.6 (Orange Juice)  

Code
oj = read.csv("./../../data/oj.csv")
knitr::kable(oj[1:5,1:10], digits=2)
store brand week logmove feat price AGE60 EDUC ETHNIC INCOME
2 tropicana 40 9.0 0 3.9 0.23 0.25 0.11 11
2 tropicana 46 8.7 0 3.9 0.23 0.25 0.11 11
2 tropicana 47 8.2 0 3.9 0.23 0.25 0.11 11
2 tropicana 48 9.0 0 3.9 0.23 0.25 0.11 11
2 tropicana 50 9.1 0 3.9 0.23 0.25 0.11 11
Code
model = lm(logmove ~ log(price)*feat, data=oj)
model %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 9.66 0.02 588 0
log(price) -0.96 0.02 -51 0
feat 1.71 0.03 56 0
log(price):feat -0.98 0.04 -23 0
Code
model = lm(log(price)~ brand-1, data = oj)
model %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
branddominicks 0.53 0 254 0
brandminute.maid 0.79 0 382 0
brandtropicana 1.03 0 500 0
Code
brandcol <- c("green","red","gold")
oj$brand = factor(oj$brand)
boxplot(log(price) ~ brand, data=oj, col=brandcol)
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand], pch=20)

  • 83 Chicagoland Stores (Demographic info for each)
  • Price, sales (log units moved), and whether advertised (feat)

Orange Juice: Price vs Sales

Code
plot(exp(logmove) ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="move")

Orange Juice: Price vs log(Sales)

Code
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")

Orange Juice: Price vs log(Sales)

Code
l1 <- loess(logmove ~ price, data=oj, span=2)
smoothed1 <- predict(l1) 
ind = order(oj$price)
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
lines(smoothed1[ind], x=oj$price[ind], col="blue", lwd=2)

Orange Juice: log(Price) vs log(Sales)

Code
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
l2 <- lm(logmove ~ log(price), data=oj)
smoothed2 <- predict(l2) 
ind = order(oj$price)
lines(smoothed2[ind], x=log(oj$price[ind]), col="blue", lwd=2)

Why? Multiplicative (rather than additive) change.

Now we are interested in how advertisement affects price sensitivity.

Original model \[ \log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_2 \mathrm{feat}. \] If we feature the brand (in-store display promo or flyer ad), does it affect price sensitivity \(\beta_1\)? If we assume it does \[ \beta_1 = \beta_3 + \beta_4\mathrm{feat}. \] The new model is \[ \log(\mathrm{sales}) = \beta_0 + (\beta_3 + \beta_4\mathrm{feat})\log(\mathrm{price}) + \beta_2 \mathrm{feat}. \] After expanding \[ \log(\mathrm{sales}) = \beta_0 + \beta_3\log(\mathrm{price}) + \beta_4\mathrm{feat}*\log(\mathrm{price}) + \beta_2 \mathrm{feat}. \]

Code
## interaction between log(price) and feat
ojreg <- lm(logmove ~ log(price)*feat, data=oj)
ojreg %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 9.66 0.02 588 0
log(price) -0.96 0.02 -51 0
feat 1.71 0.03 56 0
log(price):feat -0.98 0.04 -23 0
Code
lm(logmove ~ log(price), data=oj) %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 10.4 0.02 679 0
log(price) -1.6 0.02 -87 0
Code
lm(logmove ~ log(price)+feat + brand, data=oj) %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 10.28 0.01 708 0
log(price) -2.53 0.02 -116 0
feat 0.89 0.01 85 0
brandminute.maid 0.68 0.01 58 0
brandtropicana 1.30 0.01 88 0
Code
lm(logmove ~ log(price)*feat, data=oj) %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 9.66 0.02 588 0
log(price) -0.96 0.02 -51 0
feat 1.71 0.03 56 0
log(price):feat -0.98 0.04 -23 0
Code
lm(logmove ~ brand-1, data=oj) %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
branddominicks 9.2 0.01 885 0
brandminute.maid 9.2 0.01 889 0
brandtropicana 9.1 0.01 879 0

Advertisement increases price sensitivity from -0.96 to -0.96 - 0.98 = -1.94!

Why?

One of the reasons is that the price was lowered during the Ad campaign!

Code
doj = oj %>% filter(brand=="dominicks")
par(mfrow=c(1,3), mar=c(4.2,4.6,2,1))
boxplot(price ~  feat, data = oj[oj$brand=="dominicks",], col=c(2,3), main="dominicks", ylab="Price ($)")
boxplot(price ~  feat, data = oj[oj$brand=="minute.maid",], col=c(2,3), main="minute.maid")
boxplot(price ~  feat, data = oj[oj$brand=="tropicana",], col=c(2,3), main="tropicana")

0 = not featured, 1 = featured

11.5 Dummies

We want to understand effect of the brand on the sales \[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \xcancel{\beta_2\mathrm{brand}}\]

But brand is not a number!

How can you use it in your regression equation?

We introduce dummy variables

Brand Intercept brandminute.maid brandtropicana
minute.maid 1 1 0
tropicana 1 0 1
dominicks 1 0 0

\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_{3}\mathrm{brandminute.maid} + \beta_{4}\mathrm{brandtropicana}\]
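
We can inspect this encoding directly with model.matrix, which shows the three unique rows of the design matrix for the brand variable:

Code
unique(model.matrix(~ brand, data = oj))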

When fitting the model, R constructs these dummy variables for you automatically

Code
print(lm(logmove ~ log(price)+brand, data=oj))

Call:
lm(formula = logmove ~ log(price) + brand, data = oj)

Coefficients:
     (Intercept)        log(price)  brandminute.maid    brandtropicana  
           10.83             -3.14              0.87              1.53  

\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_3\mathrm{brandminute.maid} + \beta_4\mathrm{brandtropicana}\]

\(\beta_3\) and \(\beta_4\) are “changes relative to the reference” (dominicks here).

How does brand affect price sensitivity?

Interactions: logmove ~ log(price) * brand

No Interactions: logmove ~ log(price) + brand

Parameter Interactions No Interactions
(Intercept) 10.95 10.8288
log(price) -3.37 -3.1387
brandminute.maid 0.89 0.8702
brandtropicana 0.96239 1.5299
log(price):brandminute.maid 0.057
log(price):brandtropicana 0.67

Example 11.7 (Golf Performance Data) Dave Pelz has written two best-selling books for golfers, Dave Pelz’s Short Game Bible, and Dave Pelz’s Putting Bible.

  • Dave Pelz was formerly a “rocket scientist” (literally). Data analytics helped him refine his analysis. It’s the short game that matters!

  • The optimal speed for a putt: the best chance to make the putt is a speed that would leave the ball \(17\) inches past the hole if it misses.

Golf Data

Year-end performance data on 195 players from the 2000 PGA Tour.

  1. nevents, the number of official PGA events included in the statistics

  2. money, the official dollar winnings of the player

  3. drivedist, the average number of yards driven on par 4 and par 5 holes

  4. gir, greens in regulation, measured as the percentage of time that the first (tee) shot on a par 3 hole ends up on the green, or the second shot on a par 4 hole ends up on the green, or the third shot on a par 5 hole ends up on the green

  5. avgputts, which is the average number of putts per round.

Analyze these data to see which of nevents, drivedist, gir, avgputts is most important for winning money.

Regression of Money on all explanatory variables:

Code
d00 = read_csv("../../data/pga-2000.csv")
d18 = read_csv("../../data/pga-2018.csv")
Code
model18 = lm(money ~ nevents + drivedist + gir + avgputts, data=d18)
model00 = lm(money ~ nevents + drivedist + gir + avgputts, data=d00)
model00 %>%  tidy() %>% knitr::kable(digits=2)
term estimate std.error statistic p.value
(Intercept) 1.5e+07 4206466 3.5 0.00
nevents -3.0e+04 11183 -2.7 0.01
drivedist 2.1e+04 6913 3.1 0.00
gir 1.2e+05 17429 6.9 0.00
avgputts -1.5e+07 2000905 -7.6 0.00

\(R^2 = 50\)%

Residuals

Code
hist(rstandard(model00), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 7.5,y0 = 20,x1 = 8.5,y1 = 2,length = 0.1)
text(x = 7,y = 22,labels = "Tiger Woods", cex=1.5)

Regression

Transform with log(Money) as it has much better residual diagnostic plots.

Code
m = lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, data = d00)
summary(m)

Call:
lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, 
    data = d00)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5438 -0.3655 -0.0058  0.3968  1.9262 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.14923    3.57763   10.10   <2e-16 ***
nevents      -0.00899    0.00951   -0.94    0.346    
drivedist     0.01409    0.00588    2.40    0.018 *  
gir           0.16567    0.01482   11.18   <2e-16 ***
avgputts    -21.12875    1.70178  -12.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.57 on 190 degrees of freedom
Multiple R-squared:  0.671, Adjusted R-squared:  0.664 
F-statistic: 96.9 on 4 and 190 DF,  p-value: <2e-16

\(R^2 = 67\)%. There’s still \(33\)% of variation to go

Residuals for log(Money)

Code
par(mar = c(4,4.5,0,0),mfrow=c(1,1))
hist(rstandard(m), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 3,y0 = 20,x1 = 3.2,y1 = 2,length = 0.1)
text(x = 3,y = 22,labels = "Tiger Woods", cex=1.5)

Regression: Variable selection

The \(t\)-statistic for nevents is less than \(1.5\) in absolute value, so we drop it from the model.

Code
lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)

Call:
lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)

Coefficients:
(Intercept)    drivedist          gir     avgputts  
    36.1737       0.0146       0.1658     -21.3684  

The fewer the putts the better golfer you are. Duh!

avgputts per round is hard to decrease by one!

Evaluating the Coefficients

  1. Greens in Regulation (GIR) has a \(\hat{\beta} = 0.17\). If I can increase my GIR by one, I’ll earn \(e^{0.17} = 1.18\) times as much, an extra \(18\)%.

  2. DriveDist has a \(\hat{\beta} = 0.014\). With a \(10\) yard improvement, I’ll earn \(e^{0.014 \times 10} =e^{0.14} = 1.15\) times as much, an extra \(15\)% (both multipliers are verified in the snippet below).

Caveat: Everyone has gotten better since 2000!

Main Findings

Tiger was \(9\) standard deviations better than the model.

  • Taking logs of Money helps the residuals!

  • An exponential model seems to fit well. The residual diagnostics look good

  • The \(t\)-ratios for nevents are under \(1.5\).

Over-Performers

Outliers: biggest over and under-performers in terms of money winnings, compared with the performance statistics.

Woods, Mickelson, and Els won major championships by playing well when big money prizes were available.

Over-Performers
name money Predicted Error
1 Tiger Woods 9188321 3584241 5604080
2 Phil Mickelson 4746457 2302171 2444286
3 Ernie Els 3469405 1633468 1835937
4 Hal Sutton 3061444 1445904 1615540
20 Notah Begay III 1819323 426061 1393262
182 Steve Hart 107949 -1186685 1294634

Under-Performers

Underperformers are given by large negative residuals. Glasson and Stankowski should win more money.

Under-Performers
name money Predicted Error
47 Fred Couples 990215 1978477 -988262
52 Kenny Perry 889381 1965740 -1076359
70 Paul Stankowski 669709 1808690 -1138981
85 Bill Glasson 552795 1711530 -1158735
142 Jim McGovern 266647 1397818 -1131171

Let’s look at the 2018 data; the highest earners are:

Highest earners 2018
name nevents money drivedist gir avgputts
Justin Thomas 23 8694821 312 69 1.7
Dustin Johnson 20 8457352 314 71 1.7
Justin Rose 18 8130678 304 70 1.7
Bryson DeChambeau 26 8094489 306 70 1.8
Brooks Koepka 17 7094047 313 68 1.8
Bubba Watson 24 5793748 313 68 1.8

Overperformers

Overperformers 2018
name money Predicted Error
1 Justin Thomas 8694821 5026220 3668601
2 Dustin Johnson 8457352 6126775 2330577
3 Justin Rose 8130678 4392812 3737866
4 Bryson DeChambeau 8094489 3250898 4843591
5 Brooks Koepka 7094047 4219781 2874266
6 Bubba Watson 5793748 3018004 2775744
9 Webb Simpson 5376417 2766988 2609429
11 Francesco Molinari 5065842 2634466 2431376
12 Patrick Reed 5006267 2038455 2967812
84 Satoshi Kodaira 1471462 -1141085 2612547

Underperformers

Underperformers 2018
name money Predicted Error
102 Trey Mullinax 1184245 3250089 -2065844
120 J.T. Poston 940661 3241369 -2300708
135 Tom Lovelady 700783 2755854 -2055071
148 Michael Thompson 563972 2512330 -1948358
150 Matt Jones 538681 2487139 -1948458
158 Hunter Mahan 457337 2855898 -2398561
168 Cameron Percy 387612 3021278 -2633666
173 Ricky Barnes 340591 3053262 -2712671
176 Brett Stegmaier 305607 2432494 -2126887

Findings

Here are a few interesting effects:

  • Tiger Woods is \(8\) standard deviations better!

  • Increasing driving distance by \(10\) yards makes you \(15\)% more money

  • Increasing GIR by one makes you \(18\)% more money.

  • Detect Under- and Over-Performers

Go Play!!