Pearson's Product-Moment Correlation

(Chapter 19 in Zar, 2010)

In our discussion of regression analysis, we addressed the circumstances in which correlation analysis would be preferable to regression analysis for dealing with continuous, bivariate data. The main criterion is whether or not there is an expectation of causality between the two variables, such that there is a clear independent variable (X) whose manipulation should result in a corresponding change in the dependent variable (Y). For such mechanistic relationships, regression analysis is the appropriate analysis. In cases where the question is simply an examination of the extent to which two variables covary, e.g., arm length and leg length, such that there is no objective means to assign one variable as the independent variable, and no clear mechanistic relationship between the two variables, then correlation analysis is the appropriate choice. If this seems a little vague to you, go back and review the first part of the discussion for week 10, which goes into more detail on the distinction between the two analyses, and when you should apply them. Seriously...you should look it over. Most people apply the wrong analysis on the final assignment.

You may recall that our discussion of regression focused on model I regression, where the measurement of the independent variable was assumed to have been made without error. This means that the distribution of the observations is 2-dimensional, with variation only along the axis of the dependent variable:

Note the positions of the X- and Y-axes on the above graph. The Z-axis (vertical) depicts the distribution of the observations. There is variation in the Y-values for each value of X, but no variation in X at each of the X-values, i.e., they are fixed.

For correlation analysis (and for model II regression) there is assumed to be error in the measurement of both variables, such that the distribution of error terms becomes 3-dimensional:

With variation in both X and Y at each value of X, the combined variation looks like a pimple. This is one of the assumptions of the correlation analysis that we will be covering (Pearson's product-moment correlation); that the error terms follow a bivariate normal distribution (which looks like a pimple). There does not have to be similarity in the amount of variation for both variables, i.e., the pimple can be narrower in one dimension. Like a squeezed pimple...

The axes for the preceding graphs were labelled as X and Y to help orient you with a familiar situation (linear regression). For correlation analysis, the axes should be labelled as "Y₁" and "Y₂", because with correlation analysis, one cannot assign an independent or dependent variable (did you review the discussion for week 10 yet?). Thus, (unlike your textbook) we will adopt the convention of not designating any of our variables as X, because it does not matter upon which axes the variables are to be plotted.

The measure that we will use for the strength of the covariation between two variables is Pearson's product-moment correlation coefficient, r_P. It is sometimes loosely referred to as Pearson's ρ (rho), although ρ properly denotes the population parameter that r_P estimates. The calculation for r_P is as follows:
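Written out from the verbal description in the next paragraph (sum of cross products over the square root of the product of the two sums of squares), the calculation can be sketched as follows. The index i and the summation limits are notation assumed here, not taken from the original.

```latex
r_P = \frac{\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)(Y_{2i} - \bar{Y}_2)}
           {\sqrt{\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)^2 \; \sum_{i=1}^{n} (Y_{2i} - \bar{Y}_2)^2}}
```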

Remember that we are using Y₁ and Y₂ instead of X and Y, and so the numerator is the sum of cross products (the sum of the products of the deviations), and the denominator is the square root of the product of the sums of squares for the 2 variables. In the same way that the sum of cross products determined the sign for the slope in least-squares linear regression, the sum of cross products determines the sign for our correlation coefficient, r_P. The sign tells us the direction (positive or negative) of the covariation. The population parameter estimated by r_P is ρ (rho), thus:

The maximum value of r_P is one. To convince yourself of this, imagine examining the covariation between a variable and itself, which will be perfect covariation:
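Assuming r_P is the sum of cross products divided by the square root of the product of the two sums of squares, replacing Y₂ with Y₁ gives the expression below. The index i is notation assumed here, and the simplification is left for Question 1.

```latex
r_P = \frac{\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)(Y_{1i} - \bar{Y}_1)}
           {\sqrt{\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)^2 \; \sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)^2}}
```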

Question 1: Use the preceding equation to explain why the maximum value for r_P will be equal to 1.

In our section on least squares linear regression, we calculated the explained sum of squares, SS_explained, as:

And also as:

There is yet another way to calculate SS_explained, which is as follows:

Why am I telling you this? Well, if you square r_P, not only do you get r_P² (duh), but you also get:

Do you see where this is going? No? Pretend for a moment that Y₁ is X and Y₂ is Y, and apply that to the version of the equation on the far right. That would make the equation the squared sum of cross products divided by the sum of squares of X, multiplied by the inverse of the sum of squares of Y. In other words, that would be SS_explained multiplied by 1 / SS_total, which is the same as SS_explained / SS_total...

...which is the same as the coefficient of determination for regression, r².
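The chain of reasoning can be sketched in symbols. Here lowercase y₁ and y₂ stand for the deviations from the means (Y₁ minus its mean, Y₂ minus its mean); that shorthand is an assumption of this sketch rather than notation from the original equations.

```latex
r_P^2
  = \frac{\left(\sum y_1 y_2\right)^2}{\sum y_1^2 \, \sum y_2^2}
  = \frac{\left(\sum y_1 y_2\right)^2}{\sum y_1^2} \times \frac{1}{\sum y_2^2}
% Treating Y_1 as X and Y_2 as Y:
  = SS_{explained} \times \frac{1}{SS_{total}}
  = \frac{SS_{explained}}{SS_{total}}
  = r^2
```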

This tells us two things...well, three actually. The first is that Pearson's product-moment correlation coefficient, r_P, is based upon the explained variation for a linear relationship. The second is that r_P must lie between -1 and +1. The coefficient of determination cannot exceed 1, because a model cannot explain more than 100% of the total variation; if r_P² can be no larger than 1, then r_P can be no smaller than -1 and no larger than +1. And the third thing? Well...now we know why the symbol for the coefficient of determination is r².

Download this week's Excel workbook HERE. The first worksheet (Example 19.1a) contains the wing length and tail length data from Example 19.1a in your book (page 382 in the 5th edition). I shouldn't have to say this, but DO NOT use the calculations in the book! The first step is to calculate the means of both variables so that you can calculate the deviates for both variables. As with linear regression, you will need to calculate the raw deviates, so that you can calculate their product. Also calculate the squared deviates, and you should be ready to calculate Pearson's correlation coefficient as:
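For anyone who wants to check their spreadsheet against an independent calculation, the same steps (means, raw deviates, products of deviates, squared deviates) can be sketched in Python. The function name `pearson_r` and the sample numbers are hypothetical; the numbers are NOT the data from Example 19.1a.

```python
import math

def pearson_r(y1, y2):
    """Pearson's r_P computed from raw deviates, mirroring the spreadsheet steps."""
    if len(y1) != len(y2) or len(y1) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(y1)
    mean1 = sum(y1) / n
    mean2 = sum(y2) / n
    # Raw deviates of each observation from its variable's mean
    dev1 = [v - mean1 for v in y1]
    dev2 = [v - mean2 for v in y2]
    # Sum of cross products (numerator) and the two sums of squares
    sum_cross = sum(a * b for a, b in zip(dev1, dev2))
    ss1 = sum(a * a for a in dev1)
    ss2 = sum(b * b for b in dev2)
    if ss1 == 0 or ss2 == 0:
        raise ValueError("r_P is undefined when a variable has no variation")
    return sum_cross / math.sqrt(ss1 * ss2)

# Hypothetical measurements for illustration only (not the textbook data)
wing = [10.2, 10.8, 11.1, 11.9, 12.4]
tail = [7.1, 7.3, 7.4, 7.9, 8.0]
print(round(pearson_r(wing, tail), 4))
```

The sign of the result comes entirely from `sum_cross`, since the denominator is always positive.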

All that remains is to conduct the test of significance. Any ideas? Not surprisingly, if there is no covariation between two variables, r_P should be equal to zero. Thus, as we did for regression, we can use a single-sample t-test to test the null hypothesis:

As I'm sure you recall, the single-sample t-test subtracts the hypothesized population parameter from the sample statistic, and divides that difference by the standard error of the sample statistic. The standard error for r_P is calculated as:

Making the calculation of t_s:
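With the null hypothesis H₀: ρ = 0, and assuming the standard error takes the form usually given for a correlation coefficient tested at n - 2 degrees of freedom, the two calculations can be sketched as follows. The symbol s with subscript r_P for the standard error is notation assumed here.

```latex
s_{r_P} = \sqrt{\frac{1 - r_P^2}{n - 2}}
\qquad
t_s = \frac{r_P - 0}{s_{r_P}} = \frac{r_P}{s_{r_P}}
```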

Which is compared to the critical value of t (Table B.3) at n - 2 degrees of freedom. The use of one- or two-tailed probabilities will depend on whether the direction of the relationship (positive or negative) is specified as the alternate hypothesis, or whether the question is simply one of covariation without regard to direction.
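A minimal Python sketch of the test statistic, assuming the standard error is the square root of (1 - r_P²) / (n - 2), which is the usual form for this test but is an assumption here. The function name `t_for_r` and the example values are hypothetical. The sketch stops at t and the degrees of freedom, leaving the comparison against the critical value to the table.

```python
import math

def t_for_r(r, n):
    """Return (t_s, df, standard_error) for testing H0: rho = 0."""
    if n < 3:
        raise ValueError("need n >= 3 so that n - 2 degrees of freedom is positive")
    if abs(r) >= 1:
        raise ValueError("standard error is zero when |r| = 1; t is undefined")
    df = n - 2
    # Assumed standard error of r: sqrt((1 - r^2) / (n - 2))
    se = math.sqrt((1 - r ** 2) / df)
    # Single-sample t: (sample statistic - hypothesized parameter) / standard error
    t_s = (r - 0) / se
    return t_s, df, se

# Hypothetical values for illustration only
t_s, df, se = t_for_r(0.6, 18)
print(f"t = {t_s:.3f}, df = {df}, SE = {se:.3f}")
```

A negative r gives a negative t, so for a one-tailed test the sign should be checked against the direction stated in the alternate hypothesis before looking at the magnitude.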

Question 2: Is there significant positive covariation between wing length and tail length for the data from Example 19.1a? Be sure to include the relevant results (r_P, t, df, and p) from the analysis for this and the subsequent questions.

The second worksheet ("parasitoid") contains the same data set that you examined in week 10, looking at density-dependence in terms of parasitoid larva size.

Question 3: Is there significant positive covariation between clutch size and head width for the parasitoid larvae?

The third worksheet ("keelback") also should look familiar, as it is the set of data examining the relationship between body size (SVL) and probability of recapture (as a proxy for survivorship) for keelback snakes.

Question 4: Is there a significant positive relationship between snout-vent length and survivorship for keelback snakes?

As mentioned earlier, Pearson's product-moment correlation analysis assumes a bivariate normal distribution of the observations. Violation of that assumption does not affect the correlation coefficient itself, but does affect the statistical test of H₀: ρ = 0. Without a large sample, it can be a difficult assumption to test, and so one must have some understanding of the likelihood of the data meeting the assumption before deciding whether or not to apply the parametric test.

One assumption that is easier to evaluate should have occurred to you already, when we demonstrated that r_P is based upon the explained variation for a linear relationship. That's right, Pearson's product-moment correlation analysis shares the linearity assumption of least-squares linear regression. I know how much you enjoyed last week's exercise, where you got to transform data to meet the linearity assumption. I am sure that you are pleased to realize that those techniques can be applied here as well. If those techniques fail to provide you with a linear relationship, or if you have reason to believe (or evidence) that your data do not meet the normality and/or homogeneity of variance assumptions, you should apply a nonparametric equivalent. As always, the nonparametric tests are less powerful than the parametric analyses, but only when the assumptions for the parametric analyses are met.

The non-parametric equivalent of Pearson's product-moment correlation is Spearman's rank correlation, which (as the name implies) uses the ranks of the observations. Similar to what we applied for nonparametric single-factor ANOVA, we can conduct the nonparametric test by performing the parametric test on the ranked data. In order to keep track of the observations, i.e., keep the data pairs together, it would be wise to first number the observations before sorting and ranking them. Each variable (Y₁ and Y₂) is ranked separately, and then those rankings are used in the analysis.
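The ranking procedure can be sketched in Python. Ranking by index keeps each data pair together, which serves the same purpose as numbering the observations before sorting. Giving tied values the mean of the ranks they would otherwise occupy is a common convention and is an assumption here; the function names and sample scores are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(y1, y2):
    """Sum of cross products over sqrt of the product of the sums of squares."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    cross = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    ss1 = sum((a - m1) ** 2 for a in y1)
    ss2 = sum((b - m2) ** 2 for b in y2)
    return cross / math.sqrt(ss1 * ss2)

def spearman_r(y1, y2):
    """r_S: rank each variable separately, then apply the r_P formula to the ranks."""
    return pearson_r(average_ranks(y1), average_ranks(y2))

# Hypothetical scores for illustration only (not the textbook data)
math_scores = [57, 45, 72, 78, 53]
bio_scores = [83, 37, 41, 84, 56]
print(round(spearman_r(math_scores, bio_scores), 4))
```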

Unlike the parametric test, however, the value of Spearman's rho, r_S, is not subjected to a t-test, but is compared directly to the critical values in Table B.20 (p. 773 in the 5th edition) to evaluate the significance of the covariation.

The last (I figured that it was time to go a little easy on you) worksheet in this week's Excel workbook ("Example 19.12") contains the data from Example 20.3 in your textbook...

Just kidding. It contains the data from Example 19.12 (p. 399 in the 5th edition), which looks at the possibility of a relationship between math exam scores and biology exam scores. The ranks are also included (you are welcome). The example in the textbook shows an alternative means of calculating r_S, but we will stick with feeding the ranks into the equation that we used to calculate r_P. Just remember that you are comparing the value of r_S directly to the critical value for r_S in Table B.20.

Question 5: Is there a significant, positive relationship between math and biology scores? Does the value of r_S you arrived at differ from that calculated in the example using the alternative method?

Save your Word document and Excel spreadsheet as yourlastnameex12, and submit them via Blackboard.

Send comments, suggestions, and corrections to: Derek Zelmer