Motivation of Linear Regression
- Linear regression aims to model the linear relationship between two or more variables
- To check whether linear regression will be useful for modeling the data, the correlation coefficient "r" can be computed. The value of "r" ranges from -1 to 1, where the two extremes indicate a perfect linear association and 0 indicates no linear relationship between the variables
- If two variables don't have a linear correlation, they could still be related through a model with more degrees of freedom (e.g. y = cos(x), y = x^2)
"For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data" - ESL, Stanford
Properties of Correlation
- cor(A, B) = cor(B, A) ← correlation is not affected by the order of A and B
- cor(100*A, B) = cor(A, B) ← correlation is not affected by the units used in A and B
- cor(A + 10, B) = cor(A, B) ← correlation is not affected by a translation in A or B
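These properties can be sanity-checked directly; a minimal sketch with numpy (the arrays A and B are arbitrary made-up examples):

```python
import numpy as np

# Arbitrary example data
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def cor(x, y):
    """Pearson correlation coefficient r."""
    return np.corrcoef(x, y)[0, 1]

print(np.isclose(cor(A, B), cor(B, A)))         # order does not matter
print(np.isclose(cor(100 * A, B), cor(A, B)))   # rescaling does not matter
print(np.isclose(cor(A + 10, B), cor(A, B)))    # translation does not matter
```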
Simple Linear Regression
Find the best fitting line to a scatterplot:
ŷ = b_0 + b_1*x
where b_0 is the intercept, b_1 is the slope parameter, x is the explanatory variable, and y is the dependent variable
Least Squares Estimation
In order to fit a regression line through the data, we must define a cost function that we can minimize. This cost function is used in least squares estimation, which calculates the parameters such that the sum of squared errors is minimized:
SSE = Σ_i (y_i - ŷ_i)^2
For every data pair (x_i, y_i), this equation takes the difference between the actual value y_i and the estimated value ŷ_i, and squares that difference to avoid any negative values. If there were negative values, some errors would cancel out others, and thus the end sum would not be representative of how truly "off" the predictions were from the true values.
The difference y_i - ŷ_i is called a residual.
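A minimal sketch of least squares estimation for the simple one-variable case, using the closed-form solution (the data is invented for illustration):

```python
import numpy as np

# Invented example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least squares estimates for y_hat = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
residuals = y - y_hat
sse = np.sum(residuals ** 2)   # the quantity least squares minimizes
print(b0, b1, sse)
```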
Types of Errors in Regression
After creating a regression on your data, you're left with a prediction ŷ for every x. These predictions will have some error associated with them, as the line generated by the regression most likely won't pass perfectly through every data point. There are a few sums of squares that help quantify the model's variation:
SSR: Sum of squares due to regression — Quantifies the variation of the predicted y values around the sample mean of y: SSR = Σ_i (ŷ_i - ȳ)^2. The mean ȳ is what we would predict if there were no relationship between the x and y values.
SSE: Sum of squared errors — Quantifies the variation of the true y values around the regression line: SSE = Σ_i (y_i - ŷ_i)^2.
SST: Sum of squares total — Quantifies the variation of the data y around their mean ȳ: SST = Σ_i (y_i - ȳ)^2. Note that SST = SSR + SSE.
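To make the decomposition concrete, the sketch below fits a line to invented data and checks that SST = SSR + SSE:

```python
import numpy as np

# Invented example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Fit a simple linear regression (np.polyfit returns [slope, intercept])
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)          # variation left unexplained (residuals)
sst = np.sum((y - y.mean()) ** 2)       # total variation in y

print(np.isclose(sst, ssr + sse))       # SST = SSR + SSE
print("R-squared:", ssr / sst)
```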
R-Squared
The proportion of the total variation in Y explained by the variation in X:
R^2 = SSR / SST = 1 - SSE / SST
If there is a large SSE (sum of squared errors), then there is a large variation of the true y values around the regression line, and thus not a lot of the variation in the data can be explained by the regression line. In this case, SSR would be small, and thus R^2 will also be small, meaning the model explains little of the total variation in the data.
If there is a low SSE, then there is little variation of the true y values around the regression line, and thus a lot of the variation in the data can be explained by the regression line. In this case, SSR would be high, and thus R^2 will also be high, meaning the model explains much of the total variation in the data.
The value of R^2 ranges from 0 to 1, where
- R^2 = 0 means there is no relationship at all between the variation in X and the variation in Y
- R^2 = 1 means there is a perfect relationship between the variation in X and the variation in Y
Note: If you are comparing models but have changed the Y variable in any way (e.g. predicting log(Y) instead of Y), comparing R-squared values is no longer possible as the distributions are different.
Population Regression Function and ε
When performing a regression on data, the objective is always to estimate the "true" relationship between a dependent variable y and explanatory variables X = (x_1, x_2, ... , x_p). Because we are performing a regression on a sample of data, the regression line is a linear model for the sample, and not for the population, so it is safe to assume there will be a difference between the prediction we've made and the true relationship line between y and X.
This theoretical "true" relationship line between y and X is called the Population Regression Function, denoted as:
where the s are the true coefficients of the relationship between Y and X, and is the error term denoting the difference between the data and the population regression function Y. This error term is only theoretical, since we can never truly know the population regression function and therefore can never calculate the values for .
Degrees of Freedom
When fitting a regression to data, a particular question might arise: if I have k explanatory variables and I'm trying to predict Y, what is the minimum number of observations I need to perform a regression?
Case 1: One explanatory variable, regression ŷ = b_0 + b_1*x:
- One observation: With one observation, no line can be fitted, since there are infinitely many lines that cross a single point. Therefore, one observation is too few to fit a regression ❌
- Two observations: With two observations, a line can be drawn between them. However, this regression line will always have an R^2 of 1, since the line passes exactly through both points. Therefore, there are still too few observations to fit a regression ❌
- Three observations: With three observations, a line can now be regressed that will most likely have an R^2 below 1, meaning there is now some significance to the line of best fit. In this case, three observations are sufficient to perform a regression ✔️
Case 2: Two explanatory variables, regression ŷ = b_0 + b_1*x_1 + b_2*x_2:
- One to three observations: Much like Case 1, a regression cannot be performed with one or two points, because there are not enough points to define a plane through them. At three observations, a plane can be drawn through them, but this plane will have an R^2 of 1, since it fits perfectly through the three points. Therefore, up to three observations cannot fit a regression ❌
- Four observations: With four observations, a plane can now be regressed that will most likely have an R^2 below 1, meaning there is now some significance to the plane of best fit. In this case, four observations are sufficient to perform a regression ✔️
This brings us to degrees of freedom, which quantify the relationship between the number of observations, the number of explanatory variables, and the statistical power a regression line has to model a given dataset. The equation for degrees of freedom is:
df = n - k - 1
where df = degrees of freedom, n = number of observations, and k = number of explanatory variables (X).
So in Case 1 above, with n = 3 observations and k = 1 explanatory variable, df comes out to be 3 - 1 - 1 = 1, meaning that with three observations the regression had one degree of freedom. With n = 2 observations, the model had 0 degrees of freedom, and thus a regression could not be performed. With n = 1 observation, the degrees of freedom go negative, so there is definitely no regression line that can be fitted in that case.
In Case 2 above, the same logic can be applied to figure out how many observations are necessary for k = 2 explanatory variables. We need at least df = 1 degree of freedom to perform a regression, thus the number of observations required is n = k + df + 1 = 2 + 1 + 1 = 4 observations, which aligns with the explanation above.
Degrees of freedom are closely related to R^2:
As more explanatory (X) variables are added to a model, df decreases, and as a result R^2 will ONLY increase.
This can make R^2 deceiving, because you could be adding lots of junk X variables that aren't adding any explanatory power to your model, yet your R^2 still increases, giving the impression that your model is being better trained by the variables added to it, when the reality is that you are just removing degrees of freedom, which increases R^2.
Adjusted R-Squared
The adjusted R^2 takes into account the degrees of freedom of the model and adjusts the score provided by R^2 to reflect a closer estimate of how much variation in Y is really explained by the variation in X:
Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
As k increases, adjusted R^2 will tend to decrease, reflecting the reduced power in the model. If you've added new explanatory variables that help explain the data very well, the adjusted R^2 will also increase, but if the extra variables don't do much you'll see a decrease in the adjusted R^2 score.
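To illustrate the difference, the sketch below (statsmodels, randomly generated data, made-up variable names) adds a pure-noise variable to a model and compares R^2 with adjusted R^2:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                     # unrelated "junk" variable
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

base = sm.OLS(y, sm.add_constant(x1)).fit()
bigger = sm.OLS(y, sm.add_constant(np.column_stack([x1, junk]))).fit()

# R-squared never decreases when a variable is added;
# adjusted R-squared penalises the lost degree of freedom.
print(base.rsquared, base.rsquared_adj)
print(bigger.rsquared, bigger.rsquared_adj)
```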
Regression Outputs
F-Statistic
The F-statistic is used to test the null hypothesis H_0: β_1 = β_2 = ... = β_k = 0, which would imply that none of the explanatory variables have any relationship with the dependent variable Y.
If the p-value of the F-statistic is less than the level of significance, then we can reject the null hypothesis and conclude that at least one of the beta values is in fact significant in the regression.
R-Squared
The proportion of the total variation in Y being explained by the variation in X.
Adjusted R-Squared
The proportion of the total variation in Y being explained by the variation in X, adjusted for degrees of freedom in the model
Variables Section
- Coef. = Coefficients of all explanatory variables
- Interpretation of a coefficient — For every increase of 1 unit of x_i, the output y increases by β_i on average, all other variables held constant.
- Std. Err. = Standard Error
- The typical amount by which the coefficient estimate varies from sample to sample
- t-statistic
- Coefficient divided by its standard error
- A larger t-statistic (in absolute value) indicates a more statistically significant variable, while a t-statistic near or below 1 in absolute value indicates a less significant variable
- p-value (two-tailed)
- describes how extreme a given coefficient is under the null hypothesis that the coefficient = 0
- This p-value is found on a t-distribution, as the probability of getting a t-statistic at least as extreme as the calculated value, assuming the null hypothesis that the coefficient = 0
- confidence interval
- A 95% confidence interval creates bounds for the coefficient's value, estimating with 95% confidence that the coefficient is within the bounds
- If the confidence interval for a variable includes 0, then you need to be wary that the given variable is not statistically significant for your model
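All of these outputs appear together in a typical regression summary; a minimal sketch with statsmodels on randomly generated data (the variables are placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()

# The summary table shows the F-statistic and its p-value, R-squared,
# adjusted R-squared, and for each variable the coefficient, standard error,
# t-statistic, p-value and 95% confidence interval.
print(result.summary())
```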
Functional Form & Transformations
Non-Linear Relationships and Logarithms
When performing a regression, we are attempting to linearly fit each x variable to the y variable, which may not always be easy to do depending on the distribution of the data. When looking at the numerical x variables you'll be using to calculate your regression, look at the scatter plots of y against each x.
Parabolic relationships
In the case that an x variable is correlated on a seemingly parabolic relationship with the y variable, try squaring the input variable x and inputting that into the regression model with the original x variable.
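A sketch of how this might look in practice (statsmodels, invented data with a roughly parabolic relationship):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=100)   # parabolic relationship

# Include both x and x^2 as explanatory variables
X = sm.add_constant(np.column_stack([x, x**2]))
result = sm.OLS(y, X).fit()
print(result.params)   # estimates for the intercept, x and x^2 terms
```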
Hyperbolic / Logarithmic relationships
When performing a regression, it is beneficial for the explanatory (x) variables to follow some sort of normal distribution. Sometimes, however, they do not: they may be skewed to the right or to the left.
A solution to a skewed variable is to take its natural logarithm and input that as a factor in the regression. This way, the variable is rescaled so that it presents a more normal distribution.
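A sketch of the log transform (statsmodels, with an invented right-skewed variable called income):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
income = rng.lognormal(mean=10, sigma=1, size=200)   # right-skewed example variable
y = 5.0 + 2.0 * np.log(income) + rng.normal(size=200)

# Regress on log(income) instead of the raw, skewed variable
X = sm.add_constant(np.log(income))
result = sm.OLS(y, X).fit()
print(result.params)
```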
Note: The overall regression model is still linear in nature, even if you square a variable or take the natural log of another. Since you input the transformed variables like all other variables, the model has no way of knowing what relationship you've presented. It only sees the new values that it fits linearly, and thus the overall model remains a linear regression model.
Interpreting logarithmic coefficients
In the case where you've input log(x_i) as an explanatory variable for your model, you'll have a coefficient β_i. If your output variable y is a continuous, non-logarithmic variable, the interpretation of β_i would be:
- For every 1% increase in x_i, the value of y changes by β_i / 100 on average, holding all other variables constant.
If the output variable is logarithmic as well (i.e. you regress log(y)), the interpretation becomes:
- For every 1% increase in x_i, the value of y changes by β_i % on average, holding all other variables constant.
Categorical X variables and Interaction Terms
Multi-Level Categorical Variables
- Dummy Variable Trap
- When separating a categorical variable into its own dummy columns, you can't include a column for every category
- This is because one dummy column will be perfectly explained by the values of the others, and thus that column adds no information to the regression
- To avoid this, you must remove one category's dummy column and leave the rest. The removed category acts as a "baseline" for the rest, and the dummy variable trap is avoided (see the pandas sketch below)
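A sketch with pandas (the column and categories are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

# drop_first=True removes one category to act as the baseline,
# which avoids the dummy variable trap
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)
```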
Interaction Terms
- When one X variable affects the relationship between Y and another X variable, the model would benefit from an interaction term (see the sketch after the note below)
Misconception: "An interaction term is required when x1 and x2 are correlated". Not required!
Categorical Dependent Variables - Logit Models
- If Y is categorical
- e.g. probability of selling a car — range: [0,1], midpoint = 0.5
The issue with running a linear regression directly on this probability is that for certain x values the prediction of y might fall outside the range 0 to 1, leading to estimated probabilities higher than 100% or lower than 0%, which don't make sense. To increase the range of the dependent variable, we can turn it into an odds value: odds = p / (1 - p)
- Range: [0, +inf), Midpoint: 1
However, this distribution is still heavily skewed, since the midpoint is at 1 and a long right tail continues after it. To combat this skewness, you take the natural log of the odds to get "Log Odds": ln(p / (1 - p)) = β_0 + β_1*x_1 + ... + β_k*x_k
- Range: (-inf, +inf), Midpoint: 0
The regression above is called a Binomial Logistic Regression. Binomial because the y-output is binary in its choices, and logistic because we have a log-odds output.
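A small sketch of the probability → odds → log-odds chain for a few example probabilities:

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])   # probabilities: range [0, 1], midpoint 0.5
odds = p / (1 - p)              # odds: range [0, inf), midpoint 1
log_odds = np.log(odds)         # log odds: range (-inf, inf), midpoint 0

print(odds)       # [0.111..., 1.0, 9.0]
print(log_odds)   # [-2.197..., 0.0, 2.197...]
```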
Maximum Likelihood Estimation (MLE)
In the case of categorical dependent variables, the usual Least Squares Estimation (LSE) can't be used, because the observed outcomes are 0 or 1 and their log-odds are -inf and +inf, so the log-odds Y variable has infinite range. Instead, maximum likelihood estimation is used to fit the regression to the data, and the regression output will have some new information:
OR (Odds Ratio)
Denotes the multiplicative effect of a 1-unit increase in an independent variable on the odds of the dependent variable. For example, if x_i has an OR of 0.84, then the addition of one unit of x_i will lead to the new odds of Y happening being 0.84 × the old odds on average, all other variables held constant.
Interpreting the OR can be done as follows: for a 1-unit increase in x_i, the odds of Y happening decrease by 16% on average, all else held constant.
Chance = Probability ≠ Odds. Odds = p / (1 - p)
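A sketch of fitting a binomial logistic regression and reading off odds ratios (statsmodels, randomly generated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 - 1.0 * x)))   # true log-odds: 0.5 - 1.0*x
y = rng.binomial(1, p)

X = sm.add_constant(x)
result = sm.Logit(y, X).fit()            # fitted by maximum likelihood

print(result.params)                     # coefficients on the log-odds scale
print(np.exp(result.params))             # odds ratios (OR) per 1-unit increase
```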
Regression Assumptions
Linearity (Correct functional form)
The regression must be linear in the betas or parameters. The equation must be additive, and linear. Sometimes, the X and Y variables you want to regress on hold a non-linear relationship, such as a quadratic relationship. In this case, if you do not represent the quadratic relationship in your regression model, you are violating the functional form of your regression.
Consequences of incorrect functional form
- If the functional form is incorrect, both the coefficients and standard errors in your output are unreliable
Detection
- Residual plots — Any patterns expressed in the plot of residuals based on your regression can be a sign of incorrect functional form
- Likelihood Ratio (LR) Test
Remedies
- Get the specification correct through trial and error
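A sketch of the residual-plot check (statsmodels and matplotlib, invented data where the true relationship is quadratic but only a linear term is fitted):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=200)   # true relationship is quadratic

# Misspecified model: only the linear term is included
result = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(result.fittedvalues, result.resid)   # residuals vs fitted values
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()   # a clear U-shaped pattern points to incorrect functional form
```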
Constant Error Variance (Homoscedasticity)
The regression assumes the variance in Y with regards to X is constant for all X. In the cases where variance increases or decreases according to the values of X, the data is considered heteroscedastic and violates the constant error variance assumption of regression.
Consequences of heteroscedasticity
- If heteroscedasticity is present, standard errors in output cannot be relied upon
Detection
- Residual plots
- Goldfeld-Quandt test
- Breusch-Pagan test
Remedies
- White's standard errors (heteroscedasticity-robust errors)
- Weighted least squares
- Log variables
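A sketch of detection with the Breusch-Pagan test and a robust-standard-error remedy (statsmodels, invented heteroscedastic data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=300)
y = 2.0 + 1.0 * x + rng.normal(scale=0.5 * x, size=300)   # error variance grows with x

X = sm.add_constant(x)
result = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, X)
print(lm_pvalue)

# Remedy: heteroscedasticity-robust (White) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC0")
print(robust.bse)   # robust standard errors
```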
Independent Error Terms
Regression assumes there is no autocorrelation in the error terms. This is a problem that mainly occurs with time-series data, since in time-series data each point is related to the points that came before and after it. When there is autocorrelation, you can make a pretty good guess about the error term of a given point if you know the error term of the point next to it, which is problematic.
Consequences of Autocorrelated Error Terms
- Under autocorrelation, standard errors in output cannot be relied upon
Detection
- Durbin-Watson test
- Breusch-Godfrey test
Remedies
- Investigate omitted variables
- Generalised difference equation (Cochrane-Orcutt or AR(1) methods)
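A sketch of the Durbin-Watson check (statsmodels, simulated AR(1) errors; values near 2 indicate no autocorrelation, values near 0 or 4 indicate strong positive or negative autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
n = 200
x = np.arange(n, dtype=float)

# Build AR(1) errors so consecutive error terms are correlated
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1.0 + 0.05 * x + e

result = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(result.resid))   # well below 2 here, signalling positive autocorrelation
```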
Normal Errors
When taking a look at the residuals of your regression, there is a (weak) assumption that the errors should follow a roughly normal distribution around the 0 mark. In the case where the errors are not normally distributed around 0, the assumption of normal errors is violated.
Consequences of violating normality
- If normality is violated and n is small, standard errors in output are affected
Detection
- Histogram of residuals or Q-Q plot
- Shapiro-Wilk test
- Kolmogorov-Smirnov test
- Anderson-Darling test
Remedies
- Change functional form (~log)
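A sketch of the normality checks with scipy and statsmodels, using residuals from a toy OLS fit:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(size=150)
y = 1.0 + 2.0 * x + rng.normal(size=150)
result = sm.OLS(y, sm.add_constant(x)).fit()

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
stat, pvalue = stats.shapiro(result.resid)
print(pvalue)

# Q-Q plot: points should lie close to the line if the residuals are normal
sm.qqplot(result.resid, line="45", fit=True)
plt.show()
```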
No multi-collinearity
Multi-collinearity occurs when the X variables are themselves related. When regressing on explanatory variables, the objective is to differentiate the individual effects of each x variable on the output variable y, holding all else constant. However, if x_1 and x_2 are correlated, it's hard to interpret the effect of x_1 on y holding all else constant, since x_2 is very likely to change along with x_1. Therefore, multi-collinearity is problematic for regression as it removes the explanatory power of the individual variables in the model.
Consequences of multi-collinearity
- Coefficients and standard errors of affected variables are unreliable
Detection
- Look at the correlation (r) between X variables
- Look at Variance Inflation Factors (VIF)
Remedies
- Remove one of the variables
Note: Adding an interaction term will NOT fix the problem.
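A sketch of the VIF check (statsmodels, simulated data where two X variables are nearly identical; a common rule of thumb treats VIF above roughly 5-10 as a warning sign):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is almost a copy of x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
# Skip index 0 (the constant) and compute a VIF for each X variable
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # x1 and x2 show very large VIFs; x3 stays near 1
```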
Exogeneity
If there exists a variable that affects both the X and Y variables but is not included in the regression model, it can cause omitted variable bias. Exogeneity assumes that each explanatory variable is determined outside the model and is not related to the error term. When there is an omitted variable that affects the other variables in the model, technically it is affecting the error term of the model. Therefore, the X variables are no longer wholly exogenous, as they can be explained in part by the error term.
Consequences of Endogeneity
- The model can only be used for predictive purposes (cannot infer causation)
Detection
- Intuition
Remedy
- Using instrumental variables