Least-Squares Linear Regression

(Chapter 17 in Zar, 2010)

Thus far, the analyses we have employed have been comparisons of means between and among groups delineated as categorical variables. Quite frequently, however, we will collect data where both the independent and dependent variables are continuous in nature. For such data sets, our observations will consist of paired values, which can be referred to as bivariate data. In these instances, there are two main analyses at our disposal: regression analysis and correlation analysis. Regression analysis involves constructing a model that attributes variation in the dependent variable to changes in the independent variable, i.e., there is a direct presumption of cause and effect. Correlation analysis assumes no such causal link between the two variables (for this reason, the terms "independent" and "dependent" technically do not apply), and simply quantifies the direction (positive or negative) and degree to which the variables covary (rise or fall together). This is why you so often hear the phrase "correlation is not evidence of causation"...especially from ~~tobacco~~ fossil fuel companies.

That difference between the two analyses should be the guiding principle in terms of deciding which analysis to apply to a particular set of data. If the hypothesis is based on a mechanistic relationship between the two variables, then regression would be the appropriate choice. If, however, covariation between two variables could be the result of some other underlying mechanism, then one should employ correlation analysis to examine the relationship.

Consider the relationship between the metabolic rate (measured as oxygen consumption) of a poikilotherm and the ambient temperature. The rates of the chemical reactions are directly determined by the temperature of the organism (which is, more or less, directly determined by the ambient temperature). Increasing the temperature will increase the metabolic rate, and decreasing it will have the opposite effect. This is clearly a cause-and-effect relationship, and would be an excellent candidate for regression analysis.

It should not be surprising that there is a strong relationship between arm length and leg length in humans. What would be surprising would be to see a person's legs get shorter after their arms had been cut off. The two variables (arm length and leg length) are related, but not causally, which means that the appropriate analysis would be correlation analysis. An oft-cited example of how correlation can lead one astray, if one forgets the basic tenet that it is covariation and not causation that is being examined, is the well-documented, statistically significant relationship between the number of Catholic priests in a city and the number of alcoholics in that city. Rather than jump to the spurious conclusion that Catholicism leads to alcoholism, or vice versa, it should be obvious that both circumstances occur at a relatively constant proportion in the population as a whole, and that larger cities would simply be expected to have a larger number of both as the simple result of having a larger population.

We will return to correlation analysis at a later time. This week we will focus on regression analysis; specifically, we will address least-squares linear regression. The "linear" part should suggest to you that we are going to specifically examine relationships between variables that can be fit with a straight line. The slope of that line is what we will test for significance: the null hypothesis is that the slope is equal to zero, i.e., there is no relationship between the two variables. While it might sound as though this analysis would have limited utility, many non-linear relationships are intrinsically linear, meaning that a simple transformation of the independent variable (log X, 1/X, etc.) will result in a linear relationship. Transformation for linear regression analysis greatly expands the utility of this analysis, and we will explore this in depth next week.

The basic procedure for regression analysis is to produce a line of best fit to the data, and then (as mentioned previously) test the null hypothesis that the slope of that line is equal to zero (no relationship between the variables). The best fit line estimates a value of Y for every value of X using estimates of the slope (b) and the Y-intercept (a):
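
Assuming the usual convention of writing the predicted value of Y as "Y-hat", and using the subscript i to index the paired observations (the subscript is a notational assumption), the line described here takes the form:

```latex
\hat{Y}_i = a + bX_i
```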

Hopefully the differences thus far in our notation and the book notation will have trained you to focus on concepts and not letters, such that this deviation from the traditional "Y = mX + b" won't throw you for a loop. The "least-squares" part of least-squares linear regression describes the criterion that we will use to establish the "best" fit. In our past experiences using a ruler to draw a best fit line, we were taught that the goal was to be as close to all of the points as possible. In other words, we were trying to minimize the point to line distance. The vertical distances from each observation to a line (the line represents the predicted value of Y, denoted as "Y-hat" in the preceding equation) can be determined as:
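
Using d for a residual (the symbol used for residuals later in this exercise) and the subscript i to index the paired observations (an assumption of notation), each vertical distance can be written as:

```latex
d_i = Y_i - \hat{Y}_i
```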

These distances, depicted below as the red lines in the graph of the bird wing data from Example 17.1, are referred to as deviations, or more frequently as residuals because they represent "left-over" variation. We will refer to these distances as residuals to avoid confusion with deviations from the mean.

It should have occurred to you that, for a best fit line, the sum of these residuals should be equal to zero, just as the deviations from the mean sum to zero. Applying the same solution, we square the residuals:
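
With d_i standing for the residual of the i-th observation (the observed Y minus its predicted value; the subscripts are a notational assumption), the quantity of interest becomes the sum of the squared residuals:

```latex
\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
```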

Which brings us back to our criterion for "best fit". The line that minimizes the sum of the squared residuals, i.e., the least-squares solution, is what we will take as the line of best fit. Unfortunately, this doesn't help us much, as drawing an infinite number of lines and calculating the sum of squared residuals could possibly take several lifetimes. We will rely on Sir Isaac Newton to assist us with that process, but first, we can narrow things down a bit, just in case you ever are stuck with only a ruler and a pencil.

Given that the sum of the deviates from a sample mean is zero, it should not surprise you that the mean of Y is the horizontal line that minimizes the sum of the squared residuals. This is demonstrated on the following graph, where the residuals were calculated from the data in Example 17.1 (mean wing length = 3.42 cm):

As you can see, the horizontal line that intercepts the Y-axis at the value of the sample mean produces the minimum value for the squared residuals.
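
If you would rather convince yourself numerically, the following is a minimal sketch of the same idea in Python. The wing lengths are hypothetical values made up for illustration (they are not the Example 17.1 data), and the function name and the grid of candidate lines are likewise assumptions of this sketch:

```python
# Hypothetical wing lengths (cm); made up for illustration, not the Example 17.1 data.
wing_length = [1.1, 1.9, 3.2, 3.8, 5.0]


def sum_squared_residuals(values, line_height):
    """Sum of squared vertical distances from each value to a horizontal line."""
    return sum((v - line_height) ** 2 for v in values)


mean_y = sum(wing_length) / len(wing_length)

# Candidate horizontal lines from 2.0 to 4.0 in steps of 0.1.
candidates = [round(2.0 + 0.1 * i, 1) for i in range(21)]

# Keep the candidate with the smallest sum of squared residuals.
best = min(candidates, key=lambda h: sum_squared_residuals(wing_length, h))

print("mean of Y:", mean_y)
print("best horizontal line:", best)
```

Among the candidates tried, the line at the sample mean should come out on top; moving the line up or down from there only increases the sum of squared residuals.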

The same is true for the vertical line defined by the mean of the independent variable (mean age = 10 days):

What this tells us is that there is a single point, defined by the mean of X and the mean of Y, that the best fit (by the least squares criterion) line must pass through. This would have been an excellent starting point for the pencil and ruler days, but it still leaves us with a lot of work if we are going to do it by trial and error. Remember me mentioning Sir Isaac Newton? Most of you, presumably, are imagining a quaint man in a white wig sitting under an apple tree, but Sir Isaac also developed calculus (although Leibniz might have something to say about that claim). This clearly is a problem where we are searching for a minimum. The value that we are after in order to provide us with that minimum sum of squared residuals is the slope, i.e., the angle at which you should hold your ruler as you rotate it around the point defined by the mean of X and the mean of Y. Fear not...we are going to apply the calculus, not derive it. The slope for the line that meets the least squares criterion is calculated as:
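
Following what we take to be the book's convention of using lowercase x and y for deviations from the means of X and Y (an assumption about notation), the slope can be written as:

```latex
b = \frac{\sum xy}{\sum x^2}
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
```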

This sample slope, b, is an estimate of the population slope β. Or, more formally:

Hopefully you recognize the denominator of the slope calculation as the sum of squares of X (the independent variable). If the symbol in the numerator looks unfamiliar to you, it should. This is the sum of the products of the deviations, or sum of cross-products. As the name implies, the raw deviations of X and Y are multiplied together for each of the paired observations, and these products are summed together.

Download this week's Excel workbook HERE. The first worksheet (Example 17.1) contains the data from...well...I'm sure that by now you can figure it out. The calculations are there as well for your viewing pleasure. As you can see, the means of both the independent (age) and dependent (wing length) variables are calculated in row 16. The deviates for age are calculated in column C and the deviates for wing length are calculated in column D. No shortcut of squaring the deviations or getting the sum of squares from variance this time, because we need the raw deviates in order to calculate their product. The deviates are squared in columns E and F, and the cross-products (products of the deviates) are calculated in column G. The slope is calculated in cell G18 as the sum of column G divided by the sum of column E.
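
For those who prefer code to spreadsheets, the same column-by-column calculation can be sketched in Python. The data here are hypothetical values made up for illustration (not the Example 17.1 data), and the variable names are assumptions of this sketch; the comments point to the worksheet columns described above:

```python
# Hypothetical paired observations (made up; not the Example 17.1 data).
age = [2, 4, 6, 8, 10]                   # X, in days
wing_length = [1.1, 1.9, 3.2, 3.8, 5.0]  # Y, in cm

n = len(age)
mean_x = sum(age) / n
mean_y = sum(wing_length) / n

# Raw deviates from the means (columns C and D in the worksheet).
dev_x = [x - mean_x for x in age]
dev_y = [y - mean_y for y in wing_length]

# Squared deviates of X (column E) and cross-products of the deviates (column G).
sq_dev_x = [dx ** 2 for dx in dev_x]
cross_products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

sum_sq_x = sum(sq_dev_x)          # sum of squares of X
sum_cross = sum(cross_products)   # sum of cross-products

# Slope: sum of cross-products divided by the sum of squares of X.
slope = sum_cross / sum_sq_x

print("sum of squares of X:", sum_sq_x)
print("sum of cross-products:", sum_cross)
print("slope b:", slope)
```

Note that the raw deviates are kept as lists rather than squared on the spot, for the same reason given above: the cross-products need them.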

Once we have our estimate of the slope, calculating the Y-intercept (a) is only a matter of simple algebra, provided that we have a known coordinate (X,Y) for our line.

Question 1: Describe how to calculate the Y-intercept for the least-squares linear regression line, and explain why the calculation works.

Put the book down! Giving me the formula won't help if you don't demonstrate an understanding of the concept by explaining why the calculation works. Read the sentence preceding the question again...if you are stuck, you might get a hint from cell G19, but you still have to explain why it works...

The sample Y-intercept (a) is an estimate of the population Y-intercept, α. Now that we have estimates of β and α, we can apply the equation for a straight line to our data:

Column H shows the regression model generated for the data in Example 17.1. That model is plotted along with the data in Chart 1. The line is referred to as a model because some of the variation in Y (wing length) is explained as a function of X (age). The residuals, d, are an indication of the variation in Y that is not explained by X. Calculating the sum of the squared residuals (column I) gives us an estimate of the unexplained variation (Σd²):

If we had an estimate of the total variation in Y, we would then be able to determine the amount of variation explained by our regression model by simple subtraction. Let's take another look at the relationship between wing length and age:

The red bracket indicates the total variation in wing length (Y), which, like any other sample, can be estimated as the sum of squares:
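
Writing lowercase y for a deviation of Y from its mean (a notational assumption, in keeping with what we take to be the book's convention), the total sum of squares is:

```latex
\sum y^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
```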

Thus, we can estimate the variation explained by the model as:
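
Writing Σy² for the total sum of squares of Y (a notational assumption) and Σd² for the sum of the squared residuals, the subtraction is:

```latex
\text{explained variation} = \sum y^2 - \sum d^2
```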

Of course, without the context of the amount of total variation, this value is not very interpretable, and so the variation explained is presented as a ratio of explained to total variation, a value known as the coefficient of determination, and denoted as r²:
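
With Σy² as the total sum of squares of Y (a notational assumption) and Σd² as the sum of the squared residuals, that ratio can be written as:

```latex
r^2 = \frac{\sum y^2 - \sum d^2}{\sum y^2} = 1 - \frac{\sum d^2}{\sum y^2}
```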

The value of r² for the bird wing data (cell G20) is 0.973, indicating that age in the regression model explains more than 97% of the variation in wing length. While a high value for r² might sound impressive, it is not a test of significance, and should not be treated as such! All too often, people report r² for a regression analysis and fail to report the test of significance. Don't be that guy...
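
The partitioning of variation described above can be sketched in Python as follows. The data are hypothetical values made up for illustration (not the Example 17.1 data, so the r² here will not be 0.973), and the variable names are assumptions of this sketch. The predicted values are written in point-slope form, relying only on the fact that the line passes through the point defined by the two means:

```python
# Hypothetical paired observations (made up; not the Example 17.1 data).
age = [2, 4, 6, 8, 10]                   # X, in days
wing_length = [1.1, 1.9, 3.2, 3.8, 5.0]  # Y, in cm

n = len(age)
mean_x = sum(age) / n
mean_y = sum(wing_length) / n

# Slope: sum of cross-products divided by the sum of squares of X.
sum_sq_x = sum((x - mean_x) ** 2 for x in age)
sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, wing_length))
slope = sum_cross / sum_sq_x

# Predicted Y for each X; the line passes through (mean of X, mean of Y).
predicted = [mean_y + slope * (x - mean_x) for x in age]

# Residuals: observed Y minus predicted Y.
residuals = [y - y_hat for y, y_hat in zip(wing_length, predicted)]

unexplained = sum(d ** 2 for d in residuals)         # sum of squared residuals
total = sum((y - mean_y) ** 2 for y in wing_length)  # total sum of squares of Y
explained = total - unexplained                      # variation explained by the model

r_squared = explained / total

print("sum of residuals:", sum(residuals))
print("unexplained variation:", unexplained)
print("total variation:", total)
print("r squared:", r_squared)
```

The sum of the raw residuals should print as zero (give or take floating-point rounding), which is the reason for squaring them in the first place.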

So what is our test of significance? Remember that we are testing for a relationship between two variables by applying a linear model to the data. The null expectation would be "no relationship", meaning that the null hypothesis that we will be testing is that the slope is equal to zero: β = 0. The test of significance is a comparison of our sample slope, b, to a population slope (β) of 0. The good news is that we already have a tool in our kit for such an occasion: the single-sample t-test. Dividing b - β by the standard error of the sample slope (s_b) will produce a t_s value for n - 2 degrees of freedom (where n is the number of paired observations):
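
Using the symbols defined in the paragraph above, the test statistic is as follows, with the right-hand form holding when the hypothesized β is zero:

```latex
t_s = \frac{b - \beta}{s_b} = \frac{b}{s_b}
```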

Note that the equation on the right only applies when testing H₀: β = 0. The calculation for s_b could be considered cumbersome had we not already taken a shortcut to get one of the values:
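
Writing Σx² for the sum of squares of X (lowercase x for deviations from the mean is a notational assumption) and s²_Y⋅X for the residual mean square, the standard error of the slope is:

```latex
s_b = \sqrt{\frac{s^2_{Y \cdot X}}{\sum x^2}}
```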

The denominator within the surd is the sum of squares of X, which we already have calculated. The value s²_Y⋅X is called the residual mean square, and is calculated by dividing the sum of the squared residuals by n - 2:
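
With Σd² as the sum of the squared residuals and n as the number of paired observations, that is:

```latex
s^2_{Y \cdot X} = \frac{\sum d^2}{n - 2}
```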

The residual mean square (s²_Y⋅X) is calculated in cell G22, s_b is calculated in cell G23, and t_s is calculated in cell G24. The degrees of freedom associated with that t-value (n - 2) are calculated in cell G25, and the critical value is in cell G26. I used the one-tailed value based on the assumption that the investigators would not have expected a negative relationship, but your book applies a two-tailed critical value for reasons that I cannot fathom.
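
The chain of calculations from residual mean square to t_s can be sketched in Python as well. As before, the data are hypothetical values made up for illustration (not the Example 17.1 data), the variable names are assumptions of this sketch, and the predicted values use point-slope form through the two means:

```python
import math

# Hypothetical paired observations (made up; not the Example 17.1 data).
age = [2, 4, 6, 8, 10]                   # X, in days
wing_length = [1.1, 1.9, 3.2, 3.8, 5.0]  # Y, in cm

n = len(age)
mean_x = sum(age) / n
mean_y = sum(wing_length) / n

# Slope: sum of cross-products divided by the sum of squares of X.
sum_sq_x = sum((x - mean_x) ** 2 for x in age)
sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, wing_length))
slope = sum_cross / sum_sq_x

# Sum of squared residuals around the fitted line.
predicted = [mean_y + slope * (x - mean_x) for x in age]
sum_sq_residuals = sum((y - y_hat) ** 2 for y, y_hat in zip(wing_length, predicted))

df = n - 2                                              # degrees of freedom
residual_mean_square = sum_sq_residuals / df            # s^2 of Y on X
se_slope = math.sqrt(residual_mean_square / sum_sq_x)   # s_b
t_s = slope / se_slope                                  # tests H0: beta = 0

print("residual mean square:", residual_mean_square)
print("standard error of the slope:", se_slope)
print("t_s:", t_s, "with", df, "degrees of freedom")
```

The critical value is not computed here; it would come from a t table (or a statistics package) for the chosen significance level, number of tails, and n - 2 degrees of freedom.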

As we can see, there is a significant positive relationship (t = 20.03; df = 11; p < 0.05) between age and wing length. Note that in reporting this, I included the direction of the relationship (positive), as well as the relevant parameters from the analysis. Make certain that you do the same when you report the results of your analyses.

If you run regression analysis on a software package, you likely will be presented with an ANOVA table in addition to the estimates of the slope (b) and Y-intercept (a), so...

Let us examine the application of variance ratios to linear regression...

Send comments, suggestions, and corrections to: Derek Zelmer