The option of data transformation to meet assumptions has been mentioned several times as a possible (and more powerful) alternative to nonparametric approaches. Transformation of data for least-squares linear regression greatly expands the utility of the analysis by allowing its application to nonlinear relationships. Before we get too far along, please realize that data transformation is not cheating! Transformation merely changes the scale at which the observations are analyzed and/or reported.

Least squares linear regression has 4 main assumptions, 2 of which we already have touched upon: a causal relationship, and a linear relationship, between the independent (*X*) and dependent (*Y*) variables. In addition
to those assumptions, we assume that
the values of *Y* are normally distributed around each value of *X* (we are only going to consider model I regression, which
assumes that *X* is measured without error), and we assume that those distributions are all homoscedastic, i.e., they have equal
variances.

I have tried to depict these assumptions below. Just imagine that the graph has been tipped away from you so that the
frequency distributions of the observations (*Y*) around each value of *X* can be observed. The line of best fit would pass through the base of the distributions:

As you can (hopefully) see, for each of the 4 values of *X*, there is a symmetrical distribution of the values of *Y*,
and the width
of the 4 distributions is comparable. When dealing with data that have this structure (numerous observations for each
value of *X*), we can apply our conventional tests for the assumptions, using the Shapiro-Wilk test for the
observations associated with each value of *X*, and Bartlett's test (or the *F _{max}* test) to examine the
variances across the values of *X*.

We will address transformation to meet the homogeneity of variance assumption momentarily (even though it is best to conduct the variance transformation first, as it might cure a non-normality issue as well), but if the normality assumption is violated, transformation will essentially be a process of trial and error: transform the observations for the dependent variable, test for normality of the transformed observations (which we will designate as *Y'*), and repeat. Some common transformations are log transformation (*Y'* = log(*Y*)), square root transformation (*Y'* = sqrt(*Y*)), and reciprocal square root transformation (*Y'* = 1/sqrt(*Y*)).
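If you would like to see this trial-and-error process as code rather than as spreadsheet formulas, here is a minimal sketch in Python (numpy and scipy are assumed to be available; the observations are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical, strongly right-skewed observations of Y
y = np.array([1.2, 1.5, 1.9, 2.1, 2.4, 2.8, 3.3, 4.1, 5.0, 6.4,
              8.1, 10.5, 13.9, 18.2, 24.7, 33.6])

candidates = {
    "none":            y,
    "log":             np.log10(y),       # Y' = log(Y)
    "square root":     np.sqrt(y),        # Y' = sqrt(Y)
    "recip. sq. root": 1.0 / np.sqrt(y),  # Y' = 1/sqrt(Y)
}

for name, y_prime in candidates.items():
    w, p = stats.shapiro(y_prime)         # Shapiro-Wilk test on Y'
    print(f"{name:15s}  W = {w:.3f}  p = {p:.4f}")
```

The transformation with the Shapiro-Wilk *W* statistic closest to 1 (and the largest *p*-value) is the best candidate.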

There is one instance where you will almost certainly need to apply a known transformation to the dependent
variable, and that is when you are working with proportions. Because proportions are bounded by 0 and 1, the
distribution tends to be narrower than a normal distribution when the central tendency is near 0.5, and becomes
skewed as central tendencies approach 0 or 1. Arcsin transformation will generally fix this, where
*Y'* = asin(sqrt(*Y*))-0.2853982. The units of *Y'* are radians. The 0.2853982 that is subtracted is the difference between
asin(sqrt(0.5)) and 0.5, so subtracting that difference serves to keep 0.5 as the center of the distribution, while extending
the tails from 0 and 1 to -0.2853982 and 1.285398.

In Excel, the formula for the transformation would be: =ASIN(SQRT(*Y*))-0.2853982.
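For those working outside of Excel, the same transformation can be sketched in Python (standard library only):

```python
import math

def arcsine_transform(p):
    """Arcsine transformation of a proportion p (0 <= p <= 1); result in radians."""
    return math.asin(math.sqrt(p)) - 0.2853982

# The transformation keeps 0.5 at the center and stretches the tails:
print(arcsine_transform(0.0))   # -0.2853982
print(arcsine_transform(0.5))   # 0.5 (to within rounding of the constant)
print(arcsine_transform(1.0))   # pi/2 - 0.2853982, about 1.285398
```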

Transformation to meet the normality assumption follows the same process as described above
for the analyses that we already have
covered that share the assumption of normally distributed error terms (single-sample *t*-test,
2-sample *t*-test, and ANOVA).

In a large number of instances (including the data sets for this week's assignment) you will not have a sufficient
number of observations for each value of *X* to conduct statistical tests of the assumptions. Fear not! Although the normality issue will
have to remain a large question mark, we can still do our due diligence by examining **residual plots**.

Below is a hypothetical, linear relationship between *X* and *Y*, showing the regression line passing through the
observations:

Remember that we calculate residuals as the observed value of *Y* minus the value of *Y* estimated by the regression (residual = *Y* - *Y*-hat):

If we plot the residuals against *X* (or the value of *Y* estimated by the regression), we produce what is referred to
as a residual plot:

The horizontal line running through zero on the *Y*-axis represents our regression line, allowing us to visualize
the distribution of the observations around the line. You could have done this with the preceding figure by tilting
your head to the left (which is the technique that I employ), but the residual plot gives you a clearer picture. The
plot above is exactly what we hope to see, with the distribution of the points around the line remaining constant
across all values of *X*. This suggests (strongly) that the homogeneity of variance assumption has been met. We also can
(to a much lesser degree) evaluate the spread of the observations as a (really poor) way of examining the normality
assumption. There should be more observations closer to zero (on the *Y*-axis) than further from zero.
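As a sketch of where the numbers in a residual plot come from, here is the calculation in Python (numpy assumed; the data are hypothetical):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

b, a = np.polyfit(x, y, deg=1)   # slope (b) and Y-intercept (a)
y_hat = b * x + a                # values of Y estimated by the regression
residuals = y - y_hat            # observed minus estimated

# For the actual plot, scatter the residuals against x (or against y_hat)
# and draw a horizontal line at zero, e.g. with matplotlib.
print(residuals)
```

Note that least-squares residuals always sum to (essentially) zero; it is their *spread* across the values of *X* that we are interested in.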

It is worth mentioning that there are extensions of the linear model approach that allow you to specify the distribution of the residuals around the regression line when the residuals are demonstrated (or expected) to deviate from normality. These approaches are referred to as generalized linear models (GLMs). The application of GLMs is beyond the scope of this course, but it is worth remembering that they exist because of their utility in dealing with non-normally distributed data.

For this week's exercise, we will not concern ourselves with the normality assumption (apart from remembering the importance of the arcsine transformation for proportions). We will focus only on the assumption of linearity, and the assumption of homogeneity of variance (i.e., homoscedasticity).

The most common violation of the assumption of homoscedasticity (homogeneity of variance) is where the variance
is proportional to the mean
value of *Y*, such that the spread of the observations gets wider as the value of *X* (and, therefore the estimated value
of *Y*) increases. An example of such a violation can be seen below:

This heteroscedasticity (heterogeneity of variance) will look like a wedge on the residual plot:

In this instance, the variance in *Y* is proportional to the estimated value of *Y*, and the best option is
square root transformation of the observations (*Y'*=sqrt(*Y*)). Running regression analysis on the transformed data (*Y'*)
produces a residual plot that looks much better behaved:

Other transformations that can stabilize the variance when the residuals produce a wedge pattern are the log
transformation (*Y'* = log(*Y*) or *Y'* = ln(*Y*)), which works well when the variance is proportional to
the square of the
estimate of *Y*, the reciprocal square root transformation (*Y'* = 1/sqrt(*Y*)), which works well when the variance is
proportional to the cube of the estimate of *Y*, and reciprocal transformation (*Y'* = 1/*Y*), which works well when the variance is
proportional to *Y*^{4}. For the most part, this will be a trial and error process, with the end result being
improved precision of your estimates of α and β. The good news is that a wedge pattern such as that
shown above, will not bias the estimates of α and β (the *Y*-intercept and slope, respectively).
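A minimal numerical sketch of the first case (hypothetical data, numpy assumed): when the variance of *Y* is proportional to its mean, the spread of *Y* grows as the square root of the mean, and the square root transformation roughly equalizes it:

```python
import numpy as np

x = np.array([4.0, 25.0, 100.0])           # three values of X
mean_y = 3.0 * x                            # true means of Y at each X
spread = np.sqrt(mean_y)                    # sd proportional to sqrt(mean)

# A pair of hypothetical observations straddling each mean
y_low, y_high = mean_y - spread, mean_y + spread

raw_spread = y_high - y_low                         # widens with X: the wedge
transformed_spread = np.sqrt(y_high) - np.sqrt(y_low)  # roughly constant

print(raw_spread)          # the wedge pattern
print(transformed_spread)  # nearly equal widths after Y' = sqrt(Y)
```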

(FUN FACT: the transformations mentioned above also can be applied to categorical data
for a *t*-test or ANOVA to try and correct a violation of the assumption of homoscedasticity.)

And now we come back to the beginning. What if your residual plot looks like this?

A residual plot like this means that you weren't paying any attention at all to the data before conducting the
analysis. **The first step to ANY analysis should always be to plot and examine the data.** The pattern of residuals seen above
is the result of trying
to fit a straight line to a curvilinear relationship. It might seem like this would be a difficult mistake to make,
but I have seen it done more than once. The data (a relationship between sunfish mass and sunfish
length) are plotted below, with the ill-advised regression line from the above residual plot:

Clearly, the assumption of a linear relationship is violated in this example. Unlike transformations that seek to
stabilize the variance or improve normality, when transforming data to make a relationship linear it is generally the independent variable (*X*) that
is transformed. *This is an important point*. I have seen a lot of cases where transformations were applied for no
particular reason, or because they were common transformations. Transformation of the data should be done only to correct a
known issue with the data. For regression, it is the independent variable (*X*) that is first transformed
to try and meet the linearity assumption. If this fails, transformation of the dependent variable may be attempted (double
log transformation, i.e., log transformation of both variables, can make a straight line out of almost anything!). If
transformation succeeds in producing a linear relationship, then problems with normality or homoscedasticity are addressed by
transformation
of the dependent variable (*Y*).
We dealt
with issues of normality and homogeneity first in this lesson, because they apply to *all* of the analyses we have done thus far, but
hopefully it is an obvious point that one first has to determine whether the relationship can be transformed to be
linear before addressing other issues with the data set. Relationships that are not linear, but can be transformed to become linear
are referred to as **intrinsically linear**.

My approach to addressing transformations for linearity is to transform the independent variable in several ways, and simply plot the data to see which relationship appears to be the most linear. For this data set the most common transformations for the independent variable failed to linearize the relationship, and so a double log plot was employed:

The double log transformation did make the relationship linear. If you tilt your head to the left, it
may look as though there are variance issues, but the apparent wedge pattern in this case is simply the result of
having very few observations of larger fish (it's a demography thing...). Regression analysis on the transformed data demonstrated a significant
positive relationship (*t _{s}* = 91.51).

The equation takes this form because we did the regression analysis on the logarithm of *Y* and the logarithm of
*X* instead of on the actual values of *Y* and *X*.
Make certain that you understand this before going further! The slope and *Y*-intercept that were derived are for the
relationship between the log of *X* and the log of *Y*.

In order to calculate an estimate of *Y* (mass) for each value of *X* (length), we just need
to use a little algebra (make each side of the equation an exponent of 10) to solve for *Y*:

This will allow us to plot the data at the original scale, and give us a curved regression line that will make us look oh so clever:
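The whole double-log workflow can be sketched in Python (numpy assumed; the data are hypothetical, constructed to follow an exact power function so that the fit recovers it cleanly):

```python
import numpy as np

length = np.array([50., 80., 120., 160., 200.])
mass = 0.02 * length ** 3.0               # exact hypothetical power relationship

# Regression on the transformed variables: log(mass) = b*log(length) + a
b, a = np.polyfit(np.log10(length), np.log10(mass), deg=1)

# Back-transform by making each side of the equation an exponent of 10:
#   mass = 10**(b*log10(length) + a) = (10**a) * length**b
mass_hat = (10.0 ** a) * length ** b

print(round(b, 3))   # the log-log slope; 3.0 for these constructed data
print(mass_hat)      # estimates at the original scale, for the curved line
```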

In the preceding example, transformation of both the dependent variable and independent variable was required to achieve linearity. As mentioned previously, one should first attempt to make the relationship linear by transformation of the independent variable. The figures below show examples of curvilinear relationships that can be made linear by transformation of the independent variable, so that you can get an idea of what transformations to try for specific patterns. The first example shows an intrinsically linear function that can be made linear through square root transformation of the independent variable:

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

The following figures depict a relationship that can be made linear by inverse transformation (*X'* = 1 / *X*) of the
independent variable:

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

The following relationship also can be made linear by inverse transformation (*X'* = 1 / *X*) of the
independent variable:

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

Do not be concerned that the direction of the relationship changes with inverse transformation. Once you have completed the
regression analysis and the back transformation (by solving for *Y*), your curved line should fit the data nicely.

One of the most common transformations is the log transformation. It is so popular that it often is applied without
any real reason for doing so! The following relationships are examples of the types of curvilinear relationships that can
be made linear by log transformation (*X'* = log *X*) of the independent variable, for a positive relationship:

Or a negative relationship:

We have covered a lot of possible transformations, and so now would be a good time to summarize the steps involved in transformation
for least-squares linear regression. **The first step is always to graph the data**. Violations of the linearity and
homoscedasticity assumptions should be detectable to the trained eye. For the untrained eye (and a good practice for
even the most trained of eyes), run the regression analysis
on the data, and plot the residuals so that you can examine the pattern.

If the relationship looks linear, examine the spread of the residuals to see if there are any issues with homoscedasticity. Remember that the spread of the points around the line should remain constant along the line. If you do see a wedge or double-wedge pattern, look carefully to make sure that it isn't just the result of having a small number of observations where the wedge is "thin". If there are clear homoscedasticity issues, address those by evaluating different transformations of the dependent variable (*Y*).

If it looks like there is a linearity issue, things become a little more difficult. As mentioned earlier, the first step should be to try and correct any variance issues, but that can be harder to do when the relationship is not linear, because you can't plot a reference line. There are statistical tools that can be used to directly address the assumption, but we are just going to apply the same visual principles that we would if the relationship were linear. Envision the curvilinear trend that would fit the data, and look for wedge patterns along that line. If you examine the figure below, it should give you an idea of what a homoscedastic curvilinear relationship looks like (on the left), and what that relationship would look like (on the right) with heteroscedasticity (specifically with the variance proportional to the mean):

Once you find a transformation that makes the variance a little better behaved, then you can try some
transformations of the independent variable to see if the resulting relationship becomes more linear. This is something
that can be done with a basic graph, but "it looked good to me" is probably not a justification you will want to, or will
be able to successfully defend for a presentation or publication. An objective criterion for distinguishing among
transformations is the *r ^{2}*
value. The best transformation should have the highest *r ^{2}* value.
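A sketch of this comparison in Python (numpy assumed; hypothetical data that happen to follow a logarithmic trend):

```python
import numpy as np

x = np.array([1., 2., 4., 8., 16., 32., 64.])
# Hypothetical Y: logarithmic in X, plus a little scatter
y = 2.0 * np.log10(x) + 1.0 + np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.0])

candidates = {
    "none":    x,
    "sqrt":    np.sqrt(x),       # X' = sqrt(X)
    "inverse": 1.0 / x,          # X' = 1/X
    "log":     np.log10(x),      # X' = log(X)
}

# r^2 for the regression of Y on each transformed X'
r_squared = {name: np.corrcoef(x_prime, y)[0, 1] ** 2
             for name, x_prime in candidates.items()}
best = max(r_squared, key=r_squared.get)
print(best)  # the log transformation wins for this logarithmic relationship
```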

In some cases, transforming to make the relationship more linear may counteract
your previous transformation for homoscedasticity, so you may have to experiment with other combinations. Remember,
when all else fails, try the double-log plot. **Leave the graphs that you use to evaluate the transformations in your
spreadsheet so that I can see the evidence for your conclusions as to which transformations to apply.**

Once the transformations are complete, and the final regression
analysis completed, the resulting relationship must be back-transformed if you transformed the dependent variable, so that
it can be graphed as a line along with
the raw observations displayed as points. The back-transformation is just making sure that your function for the line
is expressed in terms of *Y*. If you only transformed the independent variable, just include that transformation as part of the final
function. For example, if you did an inverse transformation for *X*, your equation would be *Y*=*b*(1/*X*)+*a*.
If you transformed the
dependent variable, you need to simplify the entire equation to express it in terms of *Y*. For example, if you did a square root
transformation of *Y* in addition to the inverse transformation of *X*, your regression equation estimates the square root of *Y*,
and so you would have to square both sides to back-transform, making the equation: *Y*=(*b*(1/*X*)+*a*)^{2}.

Alternatively, once you have the column of estimated *Y*s from the regression on the transformed variables (*Y*-hat), you can
just do the back transformation on the *Y*-hat values. So, for the above example, squaring the *Y*-hat values is the equivalent of
using the equation: *Y*=(*b*(1/*X*)+*a*)^{2}.
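Both back-transformation routes can be sketched in Python (numpy assumed; hypothetical data constructed to follow *Y*=(*b*(1/*X*)+*a*)^{2} exactly):

```python
import numpy as np

a_true, b_true = 1.5, 4.0
x = np.array([1., 2., 4., 5., 8., 10.])
y = (b_true * (1.0 / x) + a_true) ** 2     # exact curvilinear relationship

# Regress sqrt(Y) on 1/X: the relationship is linear on the transformed scale
b, a = np.polyfit(1.0 / x, np.sqrt(y), deg=1)

# Route 1: simplify the whole equation in terms of Y
y_from_equation = (b * (1.0 / x) + a) ** 2

# Route 2: back-transform the Y-hat column by squaring it
y_hat_transformed = b * (1.0 / x) + a
y_from_yhat = y_hat_transformed ** 2

print(np.allclose(y_from_equation, y_from_yhat))  # True: the two routes agree
print(np.allclose(y_from_equation, y))            # True: exact data recovered
```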

Download this week's Excel workbook HERE. The first worksheet, "abs", contains data generated for a standard curve. This is an application of regression that most people will be familiar with. In this instance, known concentrations of a protein were measured for their absorbance at 450 nanometers in a spectrophotometer. The function that describes this relationship, which is generated by regression analysis, will then allow the concentration of protein in unknown samples to be determined from their absorbance at 450 nm.

You may compare transformations visually, i.e., graph the transformations and see whether they produce a straight line. This will allow you to quickly eliminate the poor candidates.

**Remember: you will only need to compare *r ^{2}* values if you cannot clearly distinguish between valid transformations.**

**Question 1: This relationship clearly violates the linearity assumption. Use transformation to produce a linear
relationship, find the function that relates the 2 variables, report the complete results (*t*-test results,
*r ^{2}*, and equation), and graph the relationship.**

The second worksheet, "copper", contains data pertaining to the relationship between the size of individuals
in a litter of snakes (SVL), and litter size (number of individuals in the litter) for the Australian highlands
copperhead,
*Austrelaps ramsayi*. This
demonstrates one interesting application of regression: the examination of residual variation. To remove the effect of
the mother's body size on litter size and juvenile SVL, regression analysis was conducted on the relationship between
the mass of the mother and litter size, and between mass of the mother and offspring SVL. The data presented are the
residuals from those relationships. The question is: how much of the residual variation in offspring SVL can be explained
by the residual variation in litter size? I'll let that sink in for a minute...

The issue that this creates is that there are negative values for the residuals, making certain transformations mathematically impossible. The specific problem with this data set is a violation of the homogeneity of variance assumption, and so in order to facilitate transformation of the dependent variable, there is an additional column ("adj SVL resid") that simply adds 0.198 to each SVL residual. This means that you will have to account for that in terms of your estimate of the Y-intercept as part of back-transformation of the data. I'll let that sink in for a minute...

Once again, you may make visual comparisons to narrow down your choices, and only compare *r ^{2}* values
for the closest candidate transformations.

**Question 2: Use the transformation that best stabilizes the variance (provide graphical evidence of this), and use
regression analysis to examine the relationship between litter size residuals and offspring SVL residuals for a significant
negative relationship. Be sure to include the graph of the results.**

The final data set is from mark-recapture data on keelback snakes (*Tropidonophis mairi*), looking at the number of
recaptures of an individual (as a proxy for survivorship) as a function of SVL. This particular data set appears to have both
linearity and homoscedasticity issues...

**Question 3: Find a transformation to linearize the data, and a transformation to stabilize the variance (if
necessary), and use regression analysis to examine the relationship between SVL and the number of recaptures for
the keelback snakes. Be sure to include a graph of the results.**

As always, save your Excel and Word files as *yourlastname*ex11 and submit them via Blackboard.

Send comments, suggestions, and corrections to:
**Derek Zelmer**