Hypothesis Testing

(Chapter 6, section 6.3 in Zar, 2010)

Last week we were introduced to the concept of a null expectation, i.e., the expectation of no difference between two samples because both are drawn from the same statistical population. In inferential statistics, this null expectation is more commonly referred to as a null hypothesis, for which we will use the notation H₀. The purpose of inferential statistics is to derive the probability of obtaining results like ours if a particular null hypothesis (H₀) is true.

We were also introduced to the concept of a decision rule, which is the level of probability below which we deem our results too improbable to have been produced by the null expectation. Conventionally, and for reasons that we will examine momentarily, we use a decision rule of p < 0.05, which means that if the probability (p) of obtaining results at least as extreme as ours when the null hypothesis is true is estimated to be less than 0.05, we will conclude that the null hypothesis is not true. If, however, p ≥ 0.05, we will accept the null hypothesis. Technically speaking, we can only either reject the null hypothesis, or fail to reject the null hypothesis, but I am willing to ignore the semantic arguments in order for us to adopt the terminology of accepting the null hypothesis when p ≥ 0.05, and rejecting the null hypothesis when p < 0.05.

In the application of inferential statistics, the decision rule is referred to as the type I error rate, and is denoted with the symbol α. A type I error occurs when the null hypothesis is, in fact, true, but is rejected because the probability (as determined from our samples) of obtaining our results under the null hypothesis is less than 0.05. Because that probability is not zero, we will make such an error in 5% of the tests in which the null hypothesis is true. In other words, the rate at which we will reject true null hypotheses is 0.05, i.e., one out of every 20 such tests.
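The 5% figure can be checked with a small simulation. The following is only a sketch under stated assumptions: the population (normal, with a mean of 50 and a standard deviation of 10), the sample size of 20, and the function names are all invented for illustration, and the standardized deviate and its critical value of 2.093 for 19 degrees of freedom are the ones developed later in this section. Every sample is drawn from the population named in the null hypothesis, so every rejection is a type I error.

```python
import math
import random
import statistics

# Two-tailed critical value for alpha = 0.05 with 19 degrees of freedom
# (the value quoted later in this section).
CRITICAL_T = 2.093


def standardized_deviate(sample, mu):
    """(sample mean - mu) divided by the standard error of the mean."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - mu) / se


def type_i_error_rate(n_tests=5000, n=20, mu=50.0, sigma=10.0, seed=1):
    """Proportion of tests that reject H0 when H0 is true by construction."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Assumed population: normal with mean mu, so H0 is always true here.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(standardized_deviate(sample, mu)) > CRITICAL_T:
            rejections += 1
    return rejections / n_tests


rate = type_i_error_rate()
print(f"Proportion of true null hypotheses rejected: {rate:.3f}")
```

The proportion printed should land near 0.05, and it will wander a little from run to run if the seed is changed.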

In order to better understand type I errors (and also type II errors), we first need to visualize the probability distributions of some specific null hypotheses so that we can work through an example.

We have established that the distribution of sample means around the population mean, μ, can be described by the t-distribution with degrees of freedom equal to n - 1, where n is the size of the sample from which the sample means were calculated. As we have seen, this is useful in terms of descriptive statistics, but when our goal is to compare samples, we need to base our null expectations on the distribution of differences between means. The distribution of

X̄ − μ

provides us with the null expectation for the difference between sample means (our estimates) and the actual mean of the population from which the sample was drawn. This distribution would allow us to test the null hypothesis (H₀) that our sample mean was estimating a specific population mean, i.e., that the sample belongs to a statistical population with that population mean.

If the null hypothesis was true, the distribution of differences should center on 0. Deviates that fall outside of the 95% boundaries of this distribution would lead us to conclude that it is too improbable (p < 0.05) for our sample to have been drawn from a population with a mean of μ.

The distribution of

X̄₁ − X̄₂

where sample means 1 and 2 were drawn from the same population, gives us the null expectation for the differences between 2 sample means drawn from the same population. Therefore, that distribution will allow us to test the null hypothesis:

H₀: μ₁ = μ₂

In other words, the two sample means cannot be considered different because they are both estimates of the same population mean. Again, the null distribution should be centered around 0, and differences between sample means that fall outside of the 95% boundaries of this distribution (p < 0.05) will lead us to conclude that the 2 samples are estimating 2 different population means, and are, therefore, different.

For the distribution pertaining to the first null hypothesis (sample mean - population mean), it should not be surprising that the distribution of the sample means, and the distribution of the deviates of the sample means from the population mean are the same, because subtracting the population mean from the sample means merely shifts the distribution to the left (it is centered on 0 because the sample mean estimates the population mean):

The distributions of sample means and (sample means - μ) were each produced from 5000 sample means for 100 random draws of values ranging from 1 to 10 using THIS R program, as were the data for the following animation.

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

This is because the value of μ is fixed. On the other hand, we would expect more variation in the distribution of

X̄₁ − X̄₂

when both samples are drawn from the same statistical population, because both sample means used to calculate the difference can vary:

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

If, however, we standardize the distributions, i.e., scale them to each other, by dividing the differences by the standard deviation (making them standardized deviates), we can see that the distributions are, in fact, equivalent:

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

Because the standardized sample means, the standardized difference between sample means and population means, and the standardized difference between two sample means drawn from the same statistical population all have the same distribution, they all can be fit by the t-distribution. Thus, we can make use of the t-distribution to test the null hypothesis that a sample was drawn from a specific statistical population, and also to test the null hypothesis that two sample means were drawn from the same statistical population, i.e., that the sample means do not differ.
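The three claims above (subtracting μ only shifts the distribution, differences between two sample means vary more, and standardizing makes the distributions equivalent) can be sketched in a few lines. This is not the R program referred to above. It assumes that each mean is based on 100 continuous uniform draws between 1 and 10, and the cutoff of 1.96 (the normal-distribution value) is used only as a common yardstick for comparing the two standardized distributions.

```python
import random
import statistics

rng = random.Random(42)
N_MEANS = 5000          # number of simulated means, as in the text
N = 100                 # assumed number of draws behind each mean
LOW, HIGH = 1.0, 10.0   # assumed continuous uniform population
MU = (LOW + HIGH) / 2   # its population mean, 5.5


def sample_mean():
    """Mean of N random draws from the assumed population."""
    return statistics.mean([rng.uniform(LOW, HIGH) for _ in range(N)])


means = [sample_mean() for _ in range(N_MEANS)]
deviates = [m - MU for m in means]  # sample mean - mu
differences = [sample_mean() - sample_mean() for _ in range(N_MEANS)]

sd_means = statistics.stdev(means)
sd_deviates = statistics.stdev(deviates)
sd_differences = statistics.stdev(differences)

# Standardize each distribution by its own standard deviation.
std_deviates = [d / sd_deviates for d in deviates]
std_differences = [d / sd_differences for d in differences]


def share_within(values, limit=1.96):
    """Proportion of values no farther than `limit` from zero."""
    return sum(abs(v) <= limit for v in values) / len(values)


print(f"SD of sample means:        {sd_means:.4f}")
print(f"SD of (sample mean - mu):  {sd_deviates:.4f}")
print(f"SD of (mean 1 - mean 2):   {sd_differences:.4f}")
print(f"Share within 1.96, deviates:    {share_within(std_deviates):.3f}")
print(f"Share within 1.96, differences: {share_within(std_differences):.3f}")
```

The first two standard deviations should agree, because a shift does not change spread. The third should be larger by a factor close to the square root of 2, and the two standardized shares should be close to each other.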

Let's start with the former. A trematode is found in the esophagus of the frog Rana vaillanti in the dry forest of Costa Rica. It very closely resembles Halipegus eschi, a magnificent and beautiful worm, which is distinguished from other species of Halipegus by having relatively small testes. The average testis diameter for individuals of H. eschi that have been recorded is 320 μm (the μ in this case is the prefix "micro" in micrometer (μm), which is 1 thousandth of a millimeter). A sample of 20 of the recently collected worms is examined, resulting in a mean testis diameter of 419.89 μm, and a standard deviation of 21.78 μm for the sample.

If we take the literature value of 320 μm as the population mean, we can test the null hypothesis that the mean testis diameter of this sample estimates that population mean:

H₀: μ = 320 μm

Another way of viewing this is that we are testing to see (based on testis diameter) whether the worms in this sample differ from those identified as H. eschi. We know that the standardized difference between sample means and μ should be distributed as t, and so we can take our difference:

X̄ − μ = 419.89 μm − 320 μm = 99.89 μm

and then standardize that difference by dividing by the standard deviation. Remember that the standard deviation that we are interested in is that for the sample means, not that of the observations, and so we standardize by dividing by SE:

t = (X̄ − μ) / SE = 99.89 / 4.87 = 20.51

Question 1: Explain how the denominator value of 4.87 was obtained from the data provided above.

This gives us the location, in units of standard deviation (again, SE is the standard deviation of sample means around the population mean), of our difference between the sample mean and the population mean on the probability distribution of the same differences derived from a single statistical population (which is our null hypothesis). The question is, where does that value fall relative to the boundaries containing 95% of that distribution? As you may have guessed, the answer to that question is contained in Table B.3 (p. 678 in the 5th edition).

The size of the sample used to estimate mean testis diameter was 20 individuals (that's 40 testes in total, but let's clarify and say that only anterior testis diameter is being compared). That would make the degrees of freedom (n - 1) for this comparison = 19. The critical value from the table corresponding to v = 19, with α(2) = 0.05 (along the top row of probabilities), is 2.093. These are referred to as "critical values" because 95% of the differences between a sample mean and its population mean should lie within 2.093 standard errors of the population mean for a sample size of 20 when the null hypothesis is true. Placing those boundaries on the actual t-distribution for v = 19 divides the graph into 3 regions:

For 19 degrees of freedom, standardized deviates that fall within 2.093 standard deviations of the mean will lead to accepting the null hypothesis (the sample mean is an estimate of the population mean of 320 μm), and standardized deviates falling outside of those critical values will lead to rejecting the null hypothesis, and concluding (with 95% confidence) that the sample mean estimates some other population mean, i.e., these worms are a species other than H. eschi.

The value of our standardized deviate (20.51) is well outside these boundaries, and so the null hypothesis is rejected (p < 0.05). In fact, following the values on Table B.3 to the right, we can see the highest value on the table is 3.883, at α(2) = 0.001, and so the p-value (the probability of obtaining a standardized deviate at least as extreme as 20.51 if the null hypothesis is true) is clearly less than 0.001. As you can see, although we really are only interested in whether the probability associated with our null hypothesis, i.e., the p-value, is less than or greater than 0.05, by bracketing our calculated standardized deviates within those at the appropriate degrees of freedom, we can get a reasonable estimate of the actual p-value. For example, if our standardized deviate had a value of 1.92, we could determine from the table that the p-value (α(2)) was between 0.05 and 0.10, which we would report as: 0.05 < p < 0.10.
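The decision and the bracketing can be written out as a short sketch. The sample mean, population mean, and standard error are the values given in the text. The row of critical values is the standard two-tailed row for 19 degrees of freedom as it appears in most t tables, and should be checked against Table B.3 before being relied upon. The function names are made up for this illustration.

```python
# Values reported in the text (all in micrometers).
SAMPLE_MEAN = 419.89
POPULATION_MEAN = 320.0
STANDARD_ERROR = 4.87

# (alpha(2), critical t) pairs for 19 degrees of freedom, ordered from the
# largest probability to the smallest. Verify against Table B.3.
CRITICAL_T_DF19 = [
    (0.50, 0.688), (0.20, 1.328), (0.10, 1.729), (0.05, 2.093),
    (0.02, 2.539), (0.01, 2.861), (0.005, 3.174), (0.002, 3.579),
    (0.001, 3.883),
]


def standardized_deviate(sample_mean, mu, se):
    """Difference between sample mean and mu, in units of standard error."""
    return (sample_mean - mu) / se


def bracket_p(t, row):
    """Return (lower, upper) such that lower < p <= upper for a two-tailed test."""
    t = abs(t)
    lower, upper = 0.0, 1.0
    for alpha, critical in row:
        if t >= critical:
            upper = alpha   # deviate reaches this column, so p is at most alpha
        else:
            lower = alpha   # deviate falls short of this column, so p exceeds alpha
            break
    return lower, upper


t = standardized_deviate(SAMPLE_MEAN, POPULATION_MEAN, STANDARD_ERROR)
decision = "reject H0" if abs(t) > 2.093 else "accept H0"
print(f"t = {t:.2f}: {decision}, p-value bracket {bracket_p(t, CRITICAL_T_DF19)}")
print(f"t = 1.92: p-value bracket {bracket_p(1.92, CRITICAL_T_DF19)}")
```

A lower bound of 0.0 means that the deviate is beyond the last column of the row, so all that can be reported is that p is less than the smallest tabled probability.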

Question 2: Use the critical values in table B.3 to determine or bracket the 2-tailed (α(2)) probabilities associated with the following standardized deviates:

1.473 at 23 degrees of freedom

2.94 at 84 degrees of freedom

27.6 at 2 degrees of freedom

When using tables such as B.3 to evaluate probabilities, you will encounter cases when your specific degrees of freedom will not be represented on the table. In such cases, it is important that you use the next lower value that is listed. Using a higher value would be inflating your sample size, and given the importance of sample size in determining confidence, this would be crossing over into a very dark grey region in an ethical sense.
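The "next lower value" rule is easy to state as a lookup. In this sketch the list of tabled degrees of freedom is hypothetical (it is not copied from Table B.3), and the function name is invented. The only point is that the choice always rounds down to a listed value, never up.

```python
import bisect

# Hypothetical set of degrees of freedom printed in a t table.
TABLED_DF = list(range(1, 31)) + [35, 40, 45, 50, 60, 70, 80, 90, 100, 120]


def tabled_df_to_use(df, tabled=TABLED_DF):
    """Largest tabled degrees of freedom that does not exceed df."""
    position = bisect.bisect_right(tabled, df)
    if position == 0:
        raise ValueError("degrees of freedom fall below the first row of the table")
    return tabled[position - 1]


for df in (19, 33, 47):
    print(f"actual df = {df}, row to use = {tabled_df_to_use(df)}")
```

With this hypothetical list, 19 is used as is because it is listed, whereas 33 and 47 fall back to 30 and 45.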

Establishing a decision rule prior to conducting the analysis is what makes our conclusions objective. The value that we obtain to compare to the appropriate probability distribution (in this case, we compared a standardized deviate to the t-distribution), in concert with the critical values determined by our decision rule, determines what conclusion we will reach, whether we like it or not.

Looking at the rejection region of the preceding graph might give you the impression that we could tighten things up by decreasing our type I error rate (α). Why settle for 95% confidence when you could have 99% confidence? Well, they wouldn't have called the type I error "type I error" if there wasn't a type II error, and the convention of α = 0.05 strikes a balance between these two types of error. Type I error, i.e., rejecting a true H₀, is the error that can be made in the rejection zone of the distribution, but there is also the possibility of making an error (apart from the semantics) when accepting a null hypothesis. A type II error is accepting a false null hypothesis, and the type II error rate (β) is the probability of doing so.

While the type I error rate (α) can be evaluated directly from the probability distribution for the null hypothesis, the type II error rate (β) is dependent upon some alternate distribution, which is unknown. To demonstrate this, let's consider a comparison between two sample means that is testing the null hypothesis:

H₀: μ₁ = μ₂

Because the alternate is unknown in experimental work (if it were known, there would be no need for the analysis), we will use a model to draw observations from known distributions. The R program used for this demonstration can be viewed HERE. To approximate our null hypothesis, 2 samples of 25 numbers between 1 and 10 are drawn at random (making the underlying distribution of the statistical population uniform), the sample means calculated for each, and the difference between the means is determined. This is repeated 5000 times. Each of these 5000 differences is then standardized by dividing the difference by the standard deviation of the 5000 differences. The distribution of these standardized differences is represented by the black bars in the graphs below. The red line represents the expected frequencies based on the t-distribution for 48 degrees of freedom (v = 48). Because the standardized differences are based on two samples, the degrees of freedom are v = (n₁ - 1) + (n₂ - 1), where the subscripts 1 and 2 indicate the 2 samples used to generate the difference. As you can see, our earlier contention that these differences should conform to the t-distribution at the appropriate number of degrees of freedom is supported. The vertical black lines on the graph correspond to the critical values for α(2) = 0.05 and v = 48, delineating the acceptance region for our null hypothesis.
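The null distribution described above can be approximated with a few lines of code. This is a sketch in Python rather than the R program linked above, and it assumes that "numbers between 1 and 10" means continuous uniform draws. The critical value of 2.011 is the two-tailed value for α(2) = 0.05 and v = 48 to three decimals.

```python
import random
import statistics

rng = random.Random(7)
N_REPS = 5000            # number of differences, as in the text
N_PER_SAMPLE = 25        # size of each sample, as in the text
CRITICAL_T_DF48 = 2.011  # alpha(2) = 0.05, v = 48


def mean_of_draws(low, high, n=N_PER_SAMPLE):
    """Mean of n continuous uniform draws between low and high (an assumption)."""
    return statistics.mean([rng.uniform(low, high) for _ in range(n)])


# Both samples come from the same population, so H0 is true by construction.
differences = [mean_of_draws(1, 10) - mean_of_draws(1, 10) for _ in range(N_REPS)]

# Standardize by the standard deviation of the 5000 differences.
sd_of_differences = statistics.stdev(differences)
standardized = [d / sd_of_differences for d in differences]

share_accepted = sum(abs(z) <= CRITICAL_T_DF48 for z in standardized) / N_REPS
print(f"Mean standardized difference: {statistics.mean(standardized):.3f}")
print(f"Share inside the acceptance region: {share_accepted:.3f}")
```

The standardized differences should center on 0, and roughly 95% of them should fall inside the acceptance region.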

Question 3: Explain why the distribution described above, and indicated below by the black bars, can be considered to represent a null expectation for differences between 2 sample means where n₁ = 25 and n₂ = 25 (ignore the fact that the distribution fits the t-distribution, and focus on what values were used to make the distribution).

With our null distribution in place (the black bars), we will now generate two (arbitrary) alternate distributions, one where the second distribution sampled (25 random numbers ranging from 3 to 12) differs in population mean from the first (μ₁ = 5.5) by a value of 2 (μ₂ = 7.5), and another where the second distribution sampled (25 random numbers ranging from 5 to 14) differs in population mean from the first by a value of 4 (μ₂ = 9.5). Thus, the open bars on the following graphs represent 5000 standardized differences between means of samples drawn from the distribution with μ = 7.5 and the distribution with μ = 5.5 (the title reads "1 - β = 0.7642"), or standardized differences between means of samples drawn from the distribution with μ = 9.5 and the distribution with μ = 5.5 (the title reads "1 - β = 0.9998").

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

The bars in red indicate the portion of the alternate distribution (H₀ is known to be false) that overlaps with the acceptance region for the null hypothesis, i.e., the type II error rate (β). Hopefully it is obvious that we can only make a type II error in the acceptance region of the distribution. We can see from the animation that, although α remains unaffected (it should be equally obvious that we can only make a type I error in the rejection region of the distribution), a larger difference between population means when the sample means do estimate different population means decreases β (represented by the area of the red bars). Having no knowledge of what the alternate distribution might be (these were arbitrarily chosen), we have to be concerned with both error rates.
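The two alternate distributions can be sketched in the same way. As before, this is a Python approximation rather than the R program used for the graphs, and it assumes continuous uniform draws. It also assumes that the alternate differences are standardized by the standard deviation of the null differences, which the text does not state explicitly.

```python
import random
import statistics

rng = random.Random(11)
N_REPS = 5000
N_PER_SAMPLE = 25
CRITICAL_T_DF48 = 2.011  # alpha(2) = 0.05, v = 48


def mean_of_draws(low, high, n=N_PER_SAMPLE):
    """Mean of n continuous uniform draws between low and high (an assumption)."""
    return statistics.mean([rng.uniform(low, high) for _ in range(n)])


def simulate_differences(low2, high2):
    """Sample 2 mean minus sample 1 mean; sample 1 is always drawn from 1 to 10."""
    return [mean_of_draws(low2, high2) - mean_of_draws(1, 10) for _ in range(N_REPS)]


# Standard deviation of the differences when H0 is true.
null_sd = statistics.stdev(simulate_differences(1, 10))


def power(low2, high2):
    """Share of standardized differences outside the acceptance region (1 - beta)."""
    diffs = simulate_differences(low2, high2)
    rejected = sum(abs(d / null_sd) > CRITICAL_T_DF48 for d in diffs)
    return rejected / N_REPS


power_shift_2 = power(3, 12)   # population means 5.5 and 7.5
power_shift_4 = power(5, 14)   # population means 5.5 and 9.5
print(f"Difference of 2: 1 - beta = {power_shift_2:.4f}, beta = {1 - power_shift_2:.4f}")
print(f"Difference of 4: 1 - beta = {power_shift_4:.4f}, beta = {1 - power_shift_4:.4f}")
```

The values will not match the graph titles exactly because the random draws differ, but they should be of similar size, with β shrinking as the difference between the population means grows.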

Question 4: Explain why α = 0.05 strikes a good balance between α and β (in other words, what would be the cost of reducing α?).

Remembering our brief lesson on probabilities, it should be obvious that 1 - β is the proportion of the alternate distribution that does not overlap the acceptance region. Not surprisingly, this value (1 - β) is a measure of statistical power; the farther the alternate distribution is from the null distribution, the more confidence we will have in concluding that two sample means estimate different population means. Of course, it is not possible to change the difference between the statistical populations that you are examining (if one indeed exists) in order to increase the power of your comparison, but there are other ways to improve the power of a test. Increased sample sizes produce better estimates of the population mean, resulting in less variation among the sample means in the distribution (a narrower distribution will have less overlap) and a resultant smaller number in the denominator of the standardized difference, which pushes the distribution to the right. The following animation was generated in the same way as the first alternate distribution in the preceding example (the R program is HERE), with the means being generated by using 3 different sample sizes (n = 25, n = 50, and n = 100):

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE.

Note that the mean of the alternate population is not changing; it is the average standardized difference between the means that is increasing. Obviously, μ and σ do not change for the statistical population(s) being sampled, but increasing the sample size (n) results in a more precise characterization of the statistical population(s) that will improve the statistical power of whatever test is applied.
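The effect of sample size on power can be sketched as follows. Several things here are assumptions made for the illustration: the draws are continuous uniform, only 2000 differences are simulated per sample size to keep the run short, and the acceptance region is taken directly from the simulated null distribution (the central 95% of the null differences) instead of from Table B.3. Because the null and alternate differences would be divided by the same standard deviation, the comparison is made on the unstandardized differences, which gives the same result.

```python
import random
import statistics

rng = random.Random(23)
N_REPS = 2000   # fewer than the 5000 used in the text, to keep the run short
SHIFT = 2.0     # difference between the two population means


def mean_of_draws(low, high, n):
    """Mean of n continuous uniform draws between low and high (an assumption)."""
    return statistics.mean([rng.uniform(low, high) for _ in range(n)])


def power_for_sample_size(n):
    """Estimate 1 - beta for two samples of size n whose populations differ by SHIFT."""
    # Null differences: both samples drawn from the same population.
    null_abs = sorted(
        abs(mean_of_draws(1, 10, n) - mean_of_draws(1, 10, n)) for _ in range(N_REPS)
    )
    # Empirical boundary of the acceptance region: 95% of null differences
    # are no larger than this in absolute value.
    cutoff = null_abs[int(0.95 * N_REPS) - 1]
    # Alternate differences: the second population is shifted upward by SHIFT.
    alternate = [
        mean_of_draws(1 + SHIFT, 10 + SHIFT, n) - mean_of_draws(1, 10, n)
        for _ in range(N_REPS)
    ]
    return sum(abs(d) > cutoff for d in alternate) / N_REPS


power_by_n = {n: power_for_sample_size(n) for n in (25, 50, 100)}
for n, estimate in power_by_n.items():
    print(f"n = {n:3d}: 1 - beta = {estimate:.4f}")
```

The estimated power should rise with n even though the two populations, and therefore the true difference between their means, never change.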

One final word about null hypotheses before we begin our tale of the tails: there is no need to report a null hypothesis in a paper or a presentation. The hypothesis that has value for your investigation is your working hypothesis. The null hypothesis is simply an analytical tool that we apply in order to objectively evaluate our results. Unless you are using unconventional analyses, or testing an unconventional statistical hypothesis, all that is required is for you to indicate in your methods section the analysis that you employed.

Now, as promised, we will move on to distinguishing between one-tailed and 2-tailed hypotheses...

Send comments, suggestions, and corrections to: Derek Zelmer