📈

Regression

Posted on Sun, Jun 20, 2021

Motivation of Linear Regression

"For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data" - ESL, Stanford

Properties of Correlation

Simple Linear Regression

Find the best fitting line to a scatterplot:

$\hat{f}(x) = \hat{B}_0 + \sum_j \hat{B}_j x_j$

where $\hat{B}_0$ is the intercept, $\hat{B}_j$ is one of the parameters, $x_j$ is one of the explanatory variables, and $\hat{f}(x)$ is the predicted value of the dependent variable.

Least Squares Estimation

In order to generate a regression line through the data, we must define a cost function to minimize. This cost function is the least squares criterion: it estimates the $\hat{B}$ parameters such that the sum of squared errors is minimized.

$SSE = \sum_{i=1}^N \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^N \Big(y_i - \hat{B}_0 - \sum_{j=1}^p \hat{B}_j x_{ij}\Big)^2$

This equation calculates, for every data pair $(x_i, y_i)$, the difference between the actual $y_i$ value and the estimated $\hat{y}_i$ value, and squares that difference to avoid any negative values. If there were negative values, some errors would cancel out others, and the end sum would not be representative of how far off the predictions truly were from the true values.
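For concreteness, here is a minimal sketch (Python with NumPy, on made-up synthetic data rather than anything from this post) of finding the $\hat{B}$ values that minimize the SSE via ordinary least squares.

```python
import numpy as np

# Synthetic data: y depends linearly on two explanatory variables plus noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept B_0
X = np.column_stack([np.ones(n), x1, x2])

# Least squares estimate: the B-hat vector that minimizes the sum of squared errors
B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ B_hat
sse = np.sum((y - y_hat) ** 2)
print("Estimated coefficients:", B_hat)
print("SSE at the minimum:", sse)
```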

💡

The difference $y_i - \hat{y}_i$ is called a residual.

Types of Errors in Regression

After creating a regression on your data, you're left with a prediction $\hat{y}$ for all X. These predictions will have some error associated with them, as the line generated by the regression will most likely not pass perfectly through every data point. There are a few sums of squares that help quantify the model's variation:

SSR: Sum of squares due to regression — Quantifies the variation of the predicted $\hat{y}$ values around the sample mean of y (the mean being what we would predict if there were no relationship between the x and y values).

$SSR = \sum_i (\hat{y}_i - \bar{y})^2$

SSE: Sum of squared errors — Quantifies the variation of the true y values around the regression line $\hat{y}$.

$SSE = \sum_i (y_i - \hat{y}_i)^2$

SST: Sum of squares total — Quantifies the variation of the observed y values around their mean $\bar{y}$.

$SST = SSR + SSE = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$

$SST = \sum_i (y_i - \bar{y})^2$
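As an illustration with toy numbers (not from this post), the three sums can be computed directly from a fitted line and checked against the decomposition $SST = SSR + SSE$, which holds when the model includes an intercept.

```python
import numpy as np

# Toy data and a least-squares line fit (np.polyfit with degree 1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

y_bar = y.mean()
ssr = np.sum((y_hat - y_bar) ** 2)   # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)       # variation of y around the fitted line
sst = np.sum((y - y_bar) ** 2)       # total variation of y around its mean

# With an intercept in the model, the decomposition SST = SSR + SSE holds
print(ssr, sse, sst)
print(np.isclose(sst, ssr + sse))    # True
```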

R-Squared

The proportion of the total variation in Y being explained by the variation in X.

$R^2 = \frac{SSR}{SST}$

If there is a large SSE, or sum of squared errors, then there is large variation of the true y values around the regression line $\hat{y}$, and thus not much of the variation in the data can be explained by the regression line. In this case SSR is small, so $R^2$ is also small, meaning the model explains little of the total variation in the data.

$\uparrow SSE \;\Rightarrow\; \downarrow SSR \;\Rightarrow\; \downarrow R^2 \qquad\qquad \downarrow SSE \;\Rightarrow\; \uparrow SSR \;\Rightarrow\; \uparrow R^2$

If there is a low SSE, or sum of squared errors, then there is little variation of the true y values around the regression line $\hat{y}$, and thus much of the variation in the data can be explained by the regression line. In this case SSR is high, so $R^2$ is also high, meaning the model explains most of the total variation in the data.

The value of $R^2$ ranges from 0 to 1, where 0 means the variation in X explains none of the variation in Y and 1 means it explains all of it.
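A short sketch, again on hypothetical toy data: $R^2$ can be computed as SSR/SST, or equivalently as $1 - SSE/SST$ when the model has an intercept.

```python
import numpy as np

# Same toy fit as above; R^2 is the share of total variation explained by the line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)

print(ssr / sst)       # R^2 as SSR / SST
print(1 - sse / sst)   # equivalent form: 1 - SSE / SST
```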

💡

Note: If you are comparing models but have changed the Y variable in any way (e.g. predicting log(Y) instead of Y), comparing R-squared values is no longer possible as the distributions are different.

Population Regression Function and ε

When performing a regression on data, the objective is always to estimate the "true" relationship between a dependent variable y and explanatory variables $X = (x_1, x_2, \dots, x_p)$. Because we are performing the regression on a sample of data, the regression line $\hat{y}$ is a linear model for the sample, not for the population, so it is safe to assume there will be a difference between the prediction we've made, $\hat{y}$, and the true relationship between y and X.

This theoretical "true" relationship line between y and X is called the Population Regression Function, denoted as:

$Y = \beta_0 + \beta_1 X + \epsilon$

where the $\beta$s are the true coefficients of the relationship between Y and X, and $\epsilon$ is the error term denoting the difference between the data and the population regression function. This error term is only theoretical, since we can never truly know the population regression function and therefore can never calculate the values of $\epsilon$.

Degrees of Freedom

When fitting a regression to data, a particular question might arise: if I have k explanatory variables and I'm trying to predict Y, what is the minimum number of observations I need to perform a regression?

Case 1: One explanatory variable, regression: $y = \beta_0 + \beta_1 x$

Case 2: Two explanatory variables, regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

This brings us to degrees of freedom, which quantify the relationship between the number of observations, the number of explanatory variables, and the statistical power a regression line has to model a given dataset. The equation for degrees of freedom is:

$df = n - k - 1$

where df = degrees of freedom, n = number of observations, and k = number of explanatory variables (X).

So in Case 1 above, with n = 3 observations and k = 1 explanatory variable, $df = 3 - 1 - 1 = 1$, meaning that with three observations the regression has one degree of freedom. With n = 2 observations, the model has 0 degrees of freedom, and a regression cannot be performed. With n = 1 observation, the degrees of freedom go negative, so there is definitely no regression line that can be fit in that case.

In Case 2 above, the same logic can be applied to figure out how many observations are necessary for k = 2 explanatory variables. We need at least df = 1 degree of freedom to perform a regression, thus the number of observations required is $n = df + k + 1 = 1 + 2 + 1 = 4$, which aligns with the explanation above.
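A tiny sketch of that bookkeeping (the helper function here is hypothetical, it simply restates the formula):

```python
def degrees_of_freedom(n_observations: int, n_explanatory: int) -> int:
    """df = n - k - 1 for a linear regression with an intercept."""
    return n_observations - n_explanatory - 1

# Case 1: one explanatory variable
print(degrees_of_freedom(3, 1))   # 1 -> a regression can be fit
print(degrees_of_freedom(2, 1))   # 0 -> no degrees of freedom left

# Case 2: two explanatory variables need at least 4 observations for df >= 1
print(degrees_of_freedom(4, 2))   # 1
```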

Degrees of freedom are closely related to $R^2$:

💡

As more explanatory (X) variables are added to a model, df decreases, and as a result $R^2$ can ONLY increase (or stay the same).

This can make $R^2$ deceiving: you could be adding lots of junk X variables that add no explanatory power to your model, yet $R^2$ still increases, giving the impression that the added variables are improving your model when in reality you are just using up degrees of freedom, which mechanically increases $R^2$.

Adjusted R-Squared

$\bar{R}^2 = 1 - \left(\frac{SSE}{SST}\right)\frac{n-1}{n-k-1}$

The adjusted $R^2$ takes into account the degrees of freedom of the model and adjusts the score provided by $R^2$ to give a closer estimate of how much of the variation in Y is really explained by the variation in X.

As k increases, the adjusted $R^2$ will tend to decrease, reflecting the reduced power of the model. If the new explanatory variables help explain the data very well, the adjusted $R^2$ will still increase, but if the extra variables don't add much you'll see a decrease in the adjusted score.
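A rough sketch with made-up SSE/SST values shows how the $\frac{n-1}{n-k-1}$ term penalizes extra explanatory variables:

```python
def adjusted_r_squared(sse: float, sst: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (SSE/SST) * (n - 1) / (n - k - 1)."""
    return 1 - (sse / sst) * (n - 1) / (n - k - 1)

# Hypothetical numbers: same SSE and SST, but more explanatory variables
print(adjusted_r_squared(sse=20.0, sst=100.0, n=30, k=2))   # ~0.785
print(adjusted_r_squared(sse=20.0, sst=100.0, n=30, k=10))  # ~0.695
```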

Regression Outputs

F-Statistic

The F-Statistic is used to test the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$, which would imply that none of the explanatory variables have any relationship with the dependent variable Y.

If the p-value of the F-Statistic is less than the level of significance, we can reject the null hypothesis and conclude that at least one of the beta values is significantly different from zero.
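If the model is fit with statsmodels (a sketch on synthetic data, not the original analysis), the F-Statistic and its p-value are available directly on the results object:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(size=n)   # x2 has no real effect here

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# F-statistic tests H0: all slope coefficients are zero
print(results.fvalue, results.f_pvalue)
print(results.rsquared, results.rsquared_adj)
```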

R-Squared

The proportion of the total variation in Y being explained by the variation in X.

Adjusted R-Squared

The proportion of the total variation in Y being explained by the variation in X, adjusted for degrees of freedom in the model

Variables Section

Functional Form & Transformations

Non-Linear Relationships and Logarithms

When performing a regression, we are attempting to linearly fit each x variable to the y variable, which may not always be easy to do depending on the distribution of the data. When looking at the numerical x variables you'll be using to calculate your regression, look at the scatter plots of y against each x.

Parabolic relationships

In the case that an x variable appears to have a parabolic relationship with the y variable, try squaring the input variable x and adding that to the regression model alongside the original x variable.
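A minimal sketch of this, assuming synthetic data with a true quadratic relationship: the squared term is entered as just another column of the design matrix.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=150)   # parabolic relationship

# Include both x and x^2 as explanatory variables
X = sm.add_constant(np.column_stack([x, x**2]))
results = sm.OLS(y, X).fit()
print(results.params)   # estimates for the intercept, x, and x^2 terms
```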

Hyperbolic / Logarithmic relationships

When performing a regression, it is beneficial for the explanatory (x) variables to follow some sort of normal distribution. Sometimes, however, they do not, as they may be skewed to the right or to the left.

A solution for a skewed variable $x_s$ is to take its natural logarithm $\ln(x_s)$ and enter that as a factor in the regression. This way, the variable's data will have been rescaled so as to present a more normal distribution.
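An illustrative sketch of the same idea, using a synthetic right-skewed variable and statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # right-skewed variable
y = 3.0 + 2.0 * np.log(x_skewed) + rng.normal(size=200)

# Enter ln(x) into the regression instead of the raw skewed variable
X = sm.add_constant(np.log(x_skewed))
results = sm.OLS(y, X).fit()
print(results.params)
```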

💡

Note: The overall regression model is still linear in nature, even if you square one variable or take the natural log of another. Since you input the transformed variables like all other variables, the model has no way of knowing what relationship you've presented. It only sees the new values that it fits linearly, and thus the overall model remains a linear regression model.

Interpreting logarithmic coefficients

In the case where you've entered $\ln(x_i)$ as an explanatory variable for your model, you'll have a coefficient $\beta_i$. In the case that your output variable is a continuous, non-logarithmic variable, the interpretation of $\beta_i$ would be: a 1% increase in $x_i$ is associated with a change of about $\beta_i / 100$ units in y, on average, all else held constant.

If the output variable is a logarithmic variable as well, the interpretation becomes: a 1% increase in $x_i$ is associated with approximately a $\beta_i$% change in y, on average, all else held constant.
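As a worked example with a made-up coefficient of $\hat{\beta}_i = 5$ on $\ln(x_i)$:

$\text{level-log: } \Delta y \approx \frac{\hat{\beta}_i}{100}\cdot(\%\Delta x_i) = \frac{5}{100}\cdot 1 = 0.05 \text{ units of } y \text{ per 1\% increase in } x_i$

$\text{log-log: } \%\Delta y \approx \hat{\beta}_i\cdot(\%\Delta x_i) = 5\cdot 1\% = 5\% \text{ change in } y \text{ per 1\% increase in } x_i$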

Categorical X variables and Interaction Terms

Multi-Level Categorical Variables

Interaction Terms

💡

Misconception: "An interaction term is required when x1 and x2 are correlated". Not required!

Categorical Dependent Variables - Logit Models

$P_i = \beta_0 + \beta_1(\text{Price}_i) + \epsilon_i$

The issue with running a regression on the function above, where the dependent variable $P_i$ is a probability, is that for certain x values the prediction might fall outside the range 0 to 1, leading to estimated probabilities higher than 100% or lower than 0%, which don't make sense. To expand the range of the dependent variable, we can turn it into an odds value:

$P_i \;\rightarrow\; \frac{P_i}{1-P_i} = \beta_0 + \beta_1(\text{Price}_i) + \epsilon_i$

However, this distribution is still heavily skewed, since the midpoint is at 1 and a long right tail continues after it. To combat this skewness, take the natural log of the odds to get "log odds":

$\frac{P_i}{1-P_i} \;\rightarrow\; \ln\!\left(\frac{P_i}{1-P_i}\right) = \beta_0 + \beta_1(\text{Price}_i) + \epsilon_i$

💡

The regression above is called a Binomial Logistic Regression. Binomial because the y-output is binary in its choices, and logistic because we have a log-odds output.
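A minimal sketch of fitting such a model with statsmodels on synthetic data (the variable names `price` and `purchased` are illustrative, not from this post):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 500
price = rng.uniform(1, 20, size=n)
log_odds = 2.0 - 0.3 * price                 # true underlying log-odds
p = 1 / (1 + np.exp(-log_odds))              # convert log-odds to probabilities
purchased = rng.binomial(1, p)               # binary outcome

X = sm.add_constant(price)
logit_results = sm.Logit(purchased, X).fit()
print(logit_results.params)                  # coefficients on the log-odds scale
```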

Maximum Likelihood Estimation (MLE)

In the case of categorical dependent variables, the usual least squares estimation (LSE) can't be used, since the log-odds Y variable has infinite range. Instead, maximum likelihood estimation is used to fit the regression to the data, and the regression output will include some new information:

OR (Odds Ratio)

Denotes the multiplicative effect of a 1-unit increase in an independent variable on the odds of the dependent variable. For example, if $x_1$ has an OR of 0.84, then a one-unit increase in $x_1$ multiplies the odds of y occurring by 0.84 on average, all other variables held equal.

Interpreting the OR can be done as follows: for a 1-unit increase in $x_1$, the odds of y occurring decrease by 16% on average, all else held constant.
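A short sketch of where a value like 0.84 comes from: the OR is the exponential of the fitted log-odds coefficient (the coefficient value below is hypothetical).

```python
import numpy as np

# Hypothetical fitted log-odds coefficient for x1 (e.g. from a Logit fit like the one above)
beta_1 = -0.174
odds_ratio = np.exp(beta_1)
print(odds_ratio)   # ~0.84

# Interpretation: each 1-unit increase in x1 multiplies the odds of y by ~0.84,
# i.e. the odds decrease by about 16%, all else held constant.
```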

💡

Chance = Probability ≠ Odds. Odds = $P_i / (1 - P_i)$

Regression Assumptions

Linearity (Correct functional form)

The regression must be linear in the betas (parameters), and the equation must be additive. Sometimes the X and Y variables you want to regress hold a non-linear relationship, such as a quadratic one. If you do not represent that quadratic relationship in your regression model, you are violating the correct functional form of your regression.

Consequences of incorrect functional form

Detection

Remedies

Constant Error Variance (Homoscedasticity)

The regression assumes the variance in Y with regard to X is constant for all X. In cases where the variance increases or decreases with the values of X, the data are considered heteroscedastic and violate the constant error variance assumption of regression.
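For illustration, a detection sketch on synthetic heteroscedastic data: a residuals-vs-fitted plot plus the Breusch-Pagan test from statsmodels.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=300)   # noise grows with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Visual check: residuals fanning out against fitted values suggest heteroscedasticity
plt.scatter(results.fittedvalues, results.resid, s=10)
plt.axhline(0, color="black")
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()

# Breusch-Pagan test: a small p-value is evidence against constant error variance
lm_stat, lm_pvalue, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(lm_pvalue)
```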

Consequences of heteroscedasticity

Detection

Remedies

Independent Error Terms

Regression assumes there is no autocorrelation in the error terms. This is a problem that only occurs with time-series data, since each point in a time series is related to the points that came before and after it. When there is autocorrelation, you can make a pretty good guess about the error term of a given point if you know the error term of a neighboring point, which is problematic.
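An illustrative detection sketch using simulated AR(1) errors and the Durbin-Watson statistic from statsmodels:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
n = 200
t = np.arange(n)

# AR(1) errors: each error depends on the previous one (positive autocorrelation)
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal()
y = 5.0 + 0.1 * t + e

results = sm.OLS(y, sm.add_constant(t)).fit()

# Durbin-Watson statistic: values near 2 suggest no autocorrelation,
# values well below 2 suggest positive autocorrelation
print(durbin_watson(results.resid))
```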

Consequences of Autocorrelated Error Terms

Detection

Remedies

Normal Errors

When looking at the residuals of your regression, there is a (weak) assumption that the errors should follow a roughly normal distribution around zero. If the residuals are not normally distributed around zero, the assumption of normal errors is violated.
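An illustrative detection sketch: a Q-Q plot of the residuals, here on synthetic data with deliberately heavy-tailed (non-normal) errors.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=200)   # heavy-tailed errors

results = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot of residuals: points far off the 45-degree line suggest non-normal errors
sm.qqplot(results.resid, line="45", fit=True)
plt.show()
```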

Consequences of violating normality

If normality is violated and n is small, the standard errors in the regression output are affected.

Detection

Remedies

No multi-collinearity

Multi-collinearity occurs when the X variables are themselves related. When regressing on explanatory variables, the objective is to isolate the individual effect of each x variable on the output variable y, holding all else constant. However, if $x_1$ and $x_2$ are correlated, it's hard to interpret the effect of $x_1$ on y holding all else constant, since $x_2$ is very likely to change along with $x_1$. Therefore, multi-collinearity is problematic for regression as it undermines the explanatory power of the model.
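For the detection side, a rough sketch using variance inflation factors (VIF) from statsmodels on synthetic, deliberately collinear data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is nearly a copy of x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Variance inflation factors: values above roughly 5-10 are a common warning sign
for i in range(1, X.shape[1]):            # skip the constant column
    print(variance_inflation_factor(X, i))
```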

Consequences of multi-collinearity

Detection

Remedies

💡

Note: Adding an interaction term will NOT fix the problem.

Exogeneity

In the case that there exists a variable that affects both the X and Y variables but is not included in the regression model, it can cause omitted variable bias. Exogeneity assumes that every explanatory variable in the model is determined outside the model, and not by anything else within it. When there is an omitted variable that affects the other variables in the model, it is technically absorbed into the error term $\epsilon_i$. Therefore, the X variables are no longer wholly exogenous, as they can be explained in part by the error term.

Consequences of Endogeneity

Detection

Remedy