Goodness-of-Fit

(Chapter 22 in Zar, 2010)

When frequency (or count) data are collected for classes at a nominal scale, these data can be compared to a null or theoretical distribution using goodness-of-fit tests. As a general guide as to what type of data should be subjected to such analyses (if "frequency data collected at a nominal scale" is less than meaningful), any data that you might express as a percentage (or, preferably, as a proportion) would be appropriate for such a comparison.

One example that you should be familiar with is the application of the Chi-square (χ²) goodness-of-fit (shouldn't that be "wellness-of-fit"?) to genetic crosses. Consider the cross of the F1 generation from a monohybrid cross of PP x pp:

Pp x Pp

Where P represents the allele for pigmentation that allows the production of chlorophyll, and p represents the allele for albinism. Because P is expressed dominantly, we would expect a 3:1 phenotypic ratio for green seedlings relative to white ("albino") seedlings, based upon a genotypic ratio of 1:2:1 for PP, Pp and pp respectively.

Assume that seeds from the cross were planted in trays for 5 different freshman biology lab sections, producing the following observed frequencies for green/albino: 2/2, 6/1, 16/4, 9/1, and 7/3. A very common mistake is for people to calculate the proportion from each sample, and then average them. Although in this case we get very close to the expected proportion of green plants (0.751), it should be clear to you by this point in the semester that taking the average gives equal weight to all of the samples, even though the sample sizes range from 4 to 20. Clearly we are getting better information about the actual proportion from the sample of 20. One solution is to use a weighted average, where you multiply the proportion calculated for each sample by the number of observations from that sample, and then divide by the total number of observations made:
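The equation that belongs here did not survive in this copy. A reconstruction from the description above, assuming p_i is the proportion of green seedlings in sample i and n_i is the number of seedlings in that sample (symbols chosen here for illustration), would be:

```latex
\bar{p}_w = \frac{\sum_{i=1}^{5} n_i \, p_i}{\sum_{i=1}^{5} n_i}
```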

This gives an estimated proportion of 0.784 for the green phenotype. Look back at the math. You are multiplying the proportion by the denominator used to calculate that proportion, which gives you back the original frequency. In other words, you are summing all of the original frequencies, and dividing by the total number of observations. This is no different from pooling the samples together, and calculating a single proportion:
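In symbols, the pooled proportion is simply the total number of green seedlings divided by the total number of seedlings, 40/51. The short Python sketch below (variable names are illustrative, not part of the course materials) computes the unweighted mean, the weighted mean, and the pooled proportion from the five samples, so that the three can be compared directly:

```python
# Observed (green, albino) counts for the five lab sections, from the text.
samples = [(2, 2), (6, 1), (16, 4), (9, 1), (7, 3)]

sizes = [green + albino for green, albino in samples]
props = [green / n for (green, _), n in zip(samples, sizes)]

# Simple average: every sample counts equally, regardless of its size.
unweighted_mean = sum(props) / len(props)

# Weighted average: each proportion is multiplied by its sample size.
weighted_mean = sum(p * n for p, n in zip(props, sizes)) / sum(sizes)

# Pooling: total green seedlings over total seedlings.
pooled = sum(green for green, _ in samples) / sum(sizes)

print(f"unweighted mean: {unweighted_mean:.3f}")
print(f"weighted mean:   {weighted_mean:.3f}")
print(f"pooled:          {pooled:.3f}")
```

The weighted mean and the pooled proportion agree to within floating-point rounding, because multiplying each proportion by its own sample size recovers the original count.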

We can see that the 2 approaches produce equivalent estimates, but there is an additional problem with averaging the proportions, even when we use a weighted average. One of our cardinal rules has been: never report an estimate without some measure of the error. As we discussed in week 5, the appropriate measures of error for estimates of proportions are the 95% confidence intervals (a copy of the spreadsheet I gave you to calculate those can be downloaded HERE). Calculating 95% confidence intervals for the weighted mean proportion (using weighted deviates to produce a weighted sum of squares) gives us information about how the samples vary relative to one another, which is not that informative in terms of the error of the estimate of the proportion. The following graph shows the estimates and 95% confidence intervals associated with the 2 approaches:

One important difference is that the 95% confidence intervals for the pooled proportion (which are Wilson-Score intervals) are asymmetrical. This should be expected, as proportions are bounded by 0 and 1. Generating 95% confidence intervals for a weighted mean proportion produces symmetrical confidence intervals that can be less than 0 and/or exceed 1. This alone should be enough to convince you that only the pooling approach produces meaningful measures of the error of the estimate of the proportion. If not, you can download and read THIS paper (and the references within) to convince yourself that when reporting a proportion estimated from multiple samples, the only appropriate approach is to pool the samples to estimate the proportion, and calculate the 95% confidence intervals from the pooled sample. While integration of the Bayesian posterior is practically an exact method (and sounds cool), it is computationally intensive, and so my recommendation (the basis for which can be found in the paper in the preceding link) is for you to use the Wilson score intervals.
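For anyone who wants to check the spreadsheet by hand, here is a minimal sketch of the standard Wilson score interval, applied to the pooled sample of 40 green seedlings out of 51. The function name and the use of z = 1.96 for an approximate 95% interval are assumptions of this sketch; the course spreadsheet may be laid out differently.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    # The interval is centred on a value pulled toward 0.5, not on p_hat itself.
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

green, total = 40, 51
p_hat = green / total
lower, upper = wilson_interval(green, total)
print(f"estimate {p_hat:.3f}, interval ({lower:.3f}, {upper:.3f})")
```

Because the interval is not centred on the estimate, the distance from the estimate to the lower limit is larger than the distance to the upper limit, which is the asymmetry described above.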

Now that we understand how to report proportions, we need to get back to the methods by which we analyze them. Analysis of frequencies is called "analysis of frequencies" because the analyses are conducted on the frequencies themselves and not on the proportions. As mentioned previously, analysis of frequencies generally employs a goodness-of-fit test, the most familiar of which is likely the χ² goodness of fit test (pronounced: "Chi-square"). The χ² statistic is calculated as:
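The equation is missing from this copy. Reconstructed from the definitions in the next paragraph (observed frequency f, expected frequency f-hat, k classes), it would read:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}
```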

Where f is the observed frequency, i.e., the actual count, and the f with the hat (or, more correctly, the caret) represents the expected frequency. This value is compared to the distribution of the critical values of χ² (Table B.1) at k - 1 degrees of freedom, where k is the number of classes being examined (another way of looking at it is that k is the number of individual χ² values that go into your sum).

The expected values can have a theoretical basis, such as Mendel's expectations for a monohybrid cross, or can be based upon a null expectation of independence, as we will see momentarily. For the example above, we expect the green phenotype to occur with a probability of 0.75, and the (short-lived) albino phenotype to occur with a probability of 0.25, producing the expected 3:1 ratio. Of a total of 51 plants, 40 presented the green phenotype, and 11 presented the albino phenotype. The expected values are obtained simply by multiplying the expected probability by the total number of plants, and so the expected frequency of the green phenotype would be 0.75(51) = 38.25, and the expected frequency for the albino phenotype would be 0.25(51) = 12.75.
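The same arithmetic can be sketched in a few lines of Python. The dictionary names are illustrative only; the counts and probabilities are the ones given above.

```python
# Observed counts and Mendelian probabilities from the text.
observed = {"green": 40, "albino": 11}
expected_prob = {"green": 0.75, "albino": 0.25}

total = sum(observed.values())

# Expected frequency = expected probability x total number of plants.
expected = {phenotype: p * total for phenotype, p in expected_prob.items()}

# Degrees of freedom: number of classes minus one.
df = len(observed) - 1

print(expected)
print("degrees of freedom:", df)
```

Note that the expected frequencies are not whole numbers, and they should not be rounded before they are used in the test.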

Question 1: Does the observed distribution of phenotypes differ significantly from Mendel's expectation for a monohybrid cross? (Make sure that you do the calculations in your Excel workbook, or show your work on the Word document!)

Not so fast. If you calculated your degrees of freedom correctly, you will notice that there is but a single degree of freedom. When this occurs, the Yates correction, which subtracts 0.5 from the absolute value of each deviation of f from f-hat, must be applied to the χ² value, such that χ²_s = χ²_C, calculated as follows:
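The equation is missing from this copy. The usual form of the continuity-corrected statistic, which matches the verbal description above, is:

```latex
\chi^2_c = \sum_{i=1}^{k} \frac{\left( \left| f_i - \hat{f}_i \right| - 0.5 \right)^2}{\hat{f}_i}
```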

It seems a little cruel to make you recalculate the χ² value with the correction, so I'll just trust that you will remember to apply it when the circumstances warrant its application.
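If you would rather let a script do the recalculation, the following is a minimal sketch that computes both the uncorrected and the continuity-corrected statistic for the seedling data. The function names are illustrative, and the critical value still has to be looked up in Table B.1.

```python
def chi_square(observed, expected):
    """Uncorrected chi-square: sum of (f - f_hat)^2 / f_hat."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

def chi_square_yates(observed, expected):
    """Continuity-corrected chi-square, intended for a single degree of freedom."""
    return sum((abs(f - e) - 0.5) ** 2 / e for f, e in zip(observed, expected))

observed = [40, 11]        # green, albino
expected = [38.25, 12.75]  # 0.75 * 51 and 0.25 * 51

uncorrected = chi_square(observed, expected)
corrected = chi_square_yates(observed, expected)

print(f"uncorrected: {uncorrected:.4f}")
print(f"corrected:   {corrected:.4f}")
```

The correction always shrinks the statistic, which makes the test more conservative.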

Let's take a quick look back at what the χ²-values represent. We calculate χ² as:

And we learned in week 5 that the χ² distribution is derived from a normal distribution as:

Our application of the χ² distribution to frequency data forces the following equivalence:
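The three equations in this passage did not survive in this copy. A reconstruction, consistent with the earlier definition of χ² and with the conclusion drawn in the next paragraph, is given here in the order the equations are introduced. The symbols Z, X, μ and σ² (a standard normal deviate, a normally distributed observation, and its mean and variance) are assumed notation.

```latex
% 1. The goodness-of-fit statistic as applied to frequencies
\chi^2 = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}

% 2. Chi-square as a sum of squared standard normal deviates
\chi^2 = \sum_{i=1}^{k} Z_i^2 = \sum_{i=1}^{k} \frac{(X_i - \mu)^2}{\sigma^2}

% 3. The equivalence forced by applying (2) to frequency data
\frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} \equiv \frac{(X_i - \mu)^2}{\sigma^2}
```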

Look carefully. To satisfy this equivalence, f-hat must be equivalent to both the mean and the variance, which means that the variance and the mean must be...anyone? Bueller? Anyone? Equal.

The variance must equal the mean.

This is the mathematical definition of a random distribution. Thus, when we apply the χ² distribution, our null hypothesis is that f is an estimate of f-hat with random variation, i.e., it is equally as likely to overestimate as it is to underestimate. In other words, the null hypothesis is that the observed distribution that we have is a sample from our expected distribution. Think about that for a minute. Goodness-of-fit tests (seriously...it should be "wellness-of-fit") that compare the data to a theoretically derived distribution represent one of the few instances where our null hypothesis and our working hypothesis will be the same.

Contingency Tables

(Chapter 23 in Zar, 2010)

Goodness-of-fit tests can also be applied to compare frequencies across groups to test the null hypothesis that the distributions are independent of the group, which is another way of saying that there is no difference among groups. In these cases one should apply contingency table analysis, which is a simple method of generating the expected frequencies given the null hypothesis of independence. It is so easy to do that I always do it using the procedure that I will outline here, rather than enter it into a software package.

The expected frequencies for a null hypothesis of distributions being independent of the groups (i.e., no difference among the groups) essentially pool all of the data into one large distribution, and use those proportions to determine the expected observations within a group. That makes it sound more difficult than it is. Download this week's workbook HERE. The first worksheet (snails) contains data from sampling snails (Physa sp.) from 4 different sites in West Pond at Brick Pond Park in North Augusta, and examining them for shed cercariae (the free-swimming larval stage in trematodes). The habitats at each of the 4 sites were different, and the question was whether the trematode parasites using these snails as intermediate hosts would differ among the 4 sites, given that the different habitats would be suitable for different vertebrate definitive hosts. The null hypothesis then would be that the distribution of the parasites in the snails would be independent of the site at which the snail was collected.

The first table on the spreadsheet contains the number of infected snails found for each species at each site. You can see that row sums (R_i) and column sums (C_j) have been calculated, as has the overall total (T). The table immediately below that contains the expected frequencies for each cell, calculated as:
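The equation is missing from this copy. Using the symbols defined above (row sum R_i, column sum C_j, overall total T), the expected frequency for the cell in row i and column j would be:

```latex
\hat{f}_{ij} = \frac{R_i \, C_j}{T}
```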

Simply multiply the appropriate row sum by the appropriate column sum, and divide the product by the total. Another way of looking at this is that you are determining the proportion of the pooled data that fall into that column (column sum divided by the total) and multiplying it by the number of individuals in a group of interest (the row total) to find the expected number of individuals from that group (row) that should occur in that class (column) if the null hypothesis were true.

So, for example, the expected frequency for the cell in row 2 (site 2), and column 3 (Glypthelmins sp.) would be 146 x 239 / 313, which equals 111.48 snails. Examine the formula in cell C11, making specific note of the positions of the anchors. Copying this one formula to all the cells will produce the expected frequencies for all of the cells. Make certain that you understand how it works, using "F2" to highlight the cells being used, because that specific formula will only work for this table, but the concept can be applied to any contingency table.
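The same calculation can be sketched outside of Excel. The function below is illustrative, and the small table used to exercise it is hypothetical (it is NOT the snail data, which live in the workbook). The single-cell check uses the row sum, column sum and total quoted above.

```python
def expected_frequencies(table):
    """Expected counts under independence: row sum x column sum / total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# Single cell from the text: row sum 146, column sum 239, total 313.
cell = 146 * 239 / 313
print(f"{cell:.2f}")

# Hypothetical 2 x 3 table of counts, for illustration only.
table = [[12, 5, 3],
         [8, 9, 13]]
expected = expected_frequencies(table)
for row in expected:
    print([round(e, 2) for e in row])
```

A useful check on any table of expected frequencies is that its row sums and column sums match those of the observed table.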

The third table contains the calculations for the χ² values for each cell. The χ²_s value for the analysis is the sum of all of the χ² values from all of the cells, calculated in cell I23. By setting up the table in this manner, we can see where the largest contributions to the overall χ² value occur. In terms of parasite species, Posthodiplostomum minimum deviates the most from the null expectation because of its high frequency at site 4 (P. minimum infects piscivorous birds, like kingfishers and cormorants, and site 4 had numerous perching trees). For the sites, sites 2 (31.05) and 4 (51.32) contributed a large amount to the overall χ² value.

To evaluate whether or not we reject the null hypothesis, χ²_s is compared to the critical value in Table B.1 using (r - 1) x (c - 1) degrees of freedom, where r is the number of rows, and c is the number of columns. This is a one-tailed test, because we are looking at squared deviations from the expectation (i.e., the deviation can only be positive), and so the 0.05 probability should be used.
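As a sketch of the full procedure, the code below computes the per-cell χ² contributions, their sum, and the degrees of freedom for a hypothetical 2 x 3 table (again, not the snail data). The critical value of 5.991 is the standard tabled value for 2 degrees of freedom at α = 0.05 and is hard-coded here rather than computed.

```python
def chi_square_cells(table):
    """Per-cell chi-square contributions for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    cells = []
    for row, r in zip(table, row_sums):
        # Expected count for each cell is r * c / total.
        cells.append([(f - r * c / total) ** 2 / (r * c / total)
                      for f, c in zip(row, col_sums)])
    return cells

# Hypothetical counts, for illustration only.
table = [[12, 5, 3],
         [8, 9, 13]]

cells = chi_square_cells(table)
chi_sq = sum(sum(row) for row in cells)
df = (len(table) - 1) * (len(table[0]) - 1)

critical_value = 5.991  # tabled value for df = 2, alpha = 0.05
reject_null = chi_sq > critical_value

print(f"chi-square = {chi_sq:.3f}, df = {df}, reject null: {reject_null}")
```

Keeping the per-cell values, rather than only their sum, is what lets you see which rows and columns contribute most to the overall statistic.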

The second worksheet ("Example 23.1") contains the data from Example 23.1 (page 490 in the 5th edition).

Question 2: Test the null hypothesis that hair color is independent of sex. If the null hypothesis is rejected, describe what factors contribute the most to the significant deviation.

One useful attribute of contingency table analysis is that you can subdivide the table and redo the analysis without negatively influencing your type I error rate. For the previous example, if you think the deviation is mostly due to one specific hair color, you can examine that possibility by removing that color from the data, generating new expected frequencies, and testing the null hypothesis that hair color is independent of sex for the remaining colors.
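Subdividing can be sketched as follows: find the column with the largest χ² contribution, drop it, and recompute everything (including the expected frequencies) from the reduced table. The 2 x 4 table here is hypothetical and is not the hair-color data from Example 23.1.

```python
def chi_square_cells(table):
    """Per-cell chi-square contributions for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[(f - r * c / total) ** 2 / (r * c / total)
             for f, c in zip(row, col_sums)]
            for row, r in zip(table, row_sums)]

# Hypothetical counts: 2 groups (rows) by 4 classes (columns).
table = [[30, 10, 10, 10],
         [10, 20, 15, 15]]

cells = chi_square_cells(table)
full_chi_sq = sum(sum(row) for row in cells)

# Column whose cells contribute the most to the overall statistic.
col_contrib = [sum(col) for col in zip(*cells)]
worst = col_contrib.index(max(col_contrib))

# Remove that column and redo the analysis on what remains.
reduced = [[f for j, f in enumerate(row) if j != worst] for row in table]
reduced_chi_sq = sum(sum(row) for row in chi_square_cells(reduced))
reduced_df = (len(reduced) - 1) * (len(reduced[0]) - 1)

print(f"full table:    chi-square = {full_chi_sq:.3f}")
print(f"reduced table: chi-square = {reduced_chi_sq:.3f}, df = {reduced_df}")
```

The expected frequencies must be regenerated from the reduced table; reusing the values from the full table would be incorrect because the row sums and the total have changed.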

Question 3: Remove the hair color that contributed the most to the overall χ² value from the data set, and redo the analysis (you will have to build a new table and calculate new expected values). Is the null hypothesis accepted?

The third worksheet ("seal") presents data from an investigation of diving behavior in harbor seals. It has been shown that U-shaped dives represent foraging behavior, because of the time spent at depth, and that V-shaped dives represent non-foraging behaviors. The question being asked was whether there were developmental differences in diving/foraging behavior in harbor seals.

Question 4: Test the null hypothesis that diving behavior (the distribution of dive shapes) is independent of age for harbor seals. If the result is significant, subdivide the table to examine whether the deviation (if present) is related to reliance on foraging (suckling pups do not need to forage much, if at all). Present the data as a graph with proportions (plotting the frequencies will emphasize the differences in sample sizes and not behaviors), complete with 95% confidence intervals (use the spreadsheet that I linked earlier to calculate the Wilson-score intervals).

As discussed earlier for other goodness-of-fit testing, our confidence in accepting or rejecting a null hypothesis falls apart when we have a single degree of freedom. For contingency tables, this would be a 2 x 2 table. For this reason, the approach that we just covered must not be applied to 2 x 2 contingency tables. The recommended approach is to use the Fisher exact test, which is covered in section 24.16 of the 5th edition of your text. Read through Example 24.20 to convince yourself that some analyses are better left to computer software. The important thing is that you remember that this is the test that you will need to apply for 2 x 2 contingency tables.
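To give a sense of what the software is doing, here is a minimal sketch of a two-tailed Fisher exact test for a 2 x 2 table, using only the Python standard library. It sums the hypergeometric probabilities of all tables (with the same margins) that are no more probable than the observed one, which is one common convention for the two-tailed probability and may differ in detail from the procedure in your text. The function name and the example counts are hypothetical.

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact P for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts, for illustration only.
p_value = fisher_exact_two_tailed(3, 1, 1, 3)
print(f"P = {p_value:.4f}")
```

With larger counts the number of possible tables grows quickly, which is why this test is better left to a computer.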

Contingency tables can be expanded into multiple dimensions to allow additional comparisons, in a manner similar to the way that n-factor ANOVA expands single-factor ANOVA. For example, we could expand the comparison we examined above to determine whether diving behavior varied between two populations of seals in different habitats with different prey sources. In addition to subdividing tables, one can also employ pooling with multi-dimensional tables. For example, if we found that the distribution of dive shapes were independent of which population of seal was examined, the data from the two seal populations could be pooled together for a more powerful examination of the question of temporal changes in diving behavior. Although an in-depth description of this approach is beyond the scope of this course, it is a useful tool, and so you need to be aware of it. Multidimensional contingency tables are covered in section 23.8 (starting on page 510) of the 5th edition of your text.

One final topic that needs to be addressed is the log-likelihood ratio as an alternative to χ² as a goodness-of-fit test. The procedure for applying this test is covered in your textbook, and I will leave it there, as it is not a test that I employ. In certain instances, it is a more powerful test than the χ² test, specifically when:

For any part of the comparison. One of the main reasons that I have never used it is that the data that I collect often have frequency counts of zero. The log-likelihood ratio (G) is calculated as:
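The equation is missing from this copy. A reconstruction consistent with the discussion that follows (where 2G, not G, is compared with the χ² distribution) would be the following; be aware that some texts fold the factor of 2 into the definition of G itself.

```latex
G = \sum_{i=1}^{k} f_i \ln\!\left( \frac{f_i}{\hat{f}_i} \right)
```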

This produces undefined values for the log-likelihood ratio when there are frequencies of zero. I have always countered the recommendation of some to change the zeros to a very small number with the observation that zero is not "a very small number". Another issue is that the distribution of G is unknown, although 2G can be approximated by the χ² distribution. What this means is that although, under certain circumstances, the log-likelihood ratio might be a more powerful test (a higher probability of rejecting a false null hypothesis), the tradeoff is that you will have less certainty that your actual type I error rate is close to α. In other words you are giving up confidence for the possibility of improved power. One final strike (in my view) against the log-likelihood ratio is that its application to contingency tables precludes examining the contribution of separate cells, rows, and columns to the overall χ² value. The choice of goodness-of-fit test has been argued extensively in the literature, the citations for which can be found in your book, so you do not have to follow my lead, and can forge your own path. There are 2 important things that you should know before proceeding. First, it is only rarely that the 2 tests disagree, and second, if you develop a preference for the log-likelihood ratio, you will have to go back and apply it to all aspects of this exercise...
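To make two of these points concrete, the sketch below computes 2G for the seedling data alongside the uncorrected χ², and then shows what happens when a literal implementation of the sum of f ln(f / f-hat) meets a count of zero. The function name and the zero-count example are hypothetical, and G is taken without the factor of 2, following the statement above that 2G is the quantity approximated by χ².

```python
import math

def g_statistic(observed, expected):
    """Sum of f * ln(f / f_hat); math.log raises ValueError when f is 0."""
    return sum(f * math.log(f / e) for f, e in zip(observed, expected))

observed = [40, 11]        # green, albino
expected = [38.25, 12.75]  # 0.75 * 51 and 0.25 * 51

two_g = 2 * g_statistic(observed, expected)
chi_sq = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
print(f"2G = {two_g:.4f}, chi-square = {chi_sq:.4f}")

# Hypothetical counts that include a zero: the logarithm is undefined.
try:
    g_statistic([5, 0], [2.5, 2.5])
    zero_count_ok = True
except ValueError:
    zero_count_ok = False
print("zero count handled:", zero_count_ok)
```

For these data the two statistics are close to one another, in line with the observation that the 2 tests only rarely disagree.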

Save your Word document and Excel spreadsheet as yourlastnameex13, and submit them via Blackboard.

Send comments, suggestions, and corrections to: Derek Zelmer