Null Models

By this point in the semester we should be quite familiar with the concept of a null expectation. The analyses that we have covered thus far have all tested a null hypothesis. Null models are an extension of this, where a specific null expectation is developed that isn't covered by one of the theoretical distributions that we have been using or better suits a particular experimental setup. Null models attempt to establish a pattern that represents a baseline for a particular system. Another way of looking at a null model is that it represents a circumstance where all of the variables considered to be of interest have no effect. Still confused? Some examples might help...

The exponential growth model for population ecology represents a null model of population growth. There are a number of factors that we can think of that might influence per capita birth (b) and death (d) rates, but the exponential growth model "removes" the effects of those factors by holding the per capita rates constant. Because b and d do not change, the difference between them also is a constant, producing the characteristic J-shaped curve as abundance increases.

Another null model that you should be familiar with is the Hardy-Weinberg equilibrium model. We know that allele frequencies can change within a population by the actions of selection, genetic drift, gene flow, and mutation. If none of these factors is operating, the result is a large, panmictic, Mendelian population, where the probability of acquiring a specific allele is determined only by its proportional representation in the parent population. At a diallelic locus, where the proportion of one allele in the parent population is represented by p, and the proportion of the alternate allele is represented by q, such that p + q = 1, the distribution of genotypes can be determined as:

Because offspring are essentially drawing 2 alleles at random from the parent population.

Many null models rely on random sampling to produce the null patterns, a process known as Monte Carlo simulation. If we look at a locus with the alleles A and a, where the proportion of alleles in the parent population is 0.57 A and 0.43 a, i.e., p = 0.57 and q = 0.43, we could take a Monte Carlo approach and generate 10,000 offspring drawing alleles at random based on those proportions. Of course, using the binomial expansion (you did recognize that Hardy-Weinberg equilibrium is a binomial expansion, right?) is a much more practical approach in this case, but that will give us something to compare to and help us better understand the approach...right?

The following R program (R is a powerful language for dealing with data, has packages that will do pretty much any analysis that you will ever need, and is free) does exactly what we described:

Don't worry, you will not need to learn R programming (and if you do know R programming, you will see that I am not proficient at it), but I will explain the program to give you an idea of how the simulation works. Lines 2 and 3 are just housekeeping things that clear the memory, and tell R where to look for or save files. Line 6 sets the number of iterations at 10,000, representing our 10,000 individuals. lines 8 and 9 establish empty arrays. Just think of them as 2 columns with 10,000 empty spaces. Line 11 tells R to repeat lines 12 and 14 "x" times (remember that x = 10,000).

Line 12 generates one random number between 0 and 1, and if it is greater than 0.43, R will assign a value of 1 to the first row of the "mama" column. If the random number is less than 0.43, R will assign a value of zero to that row. Line 14 does the same thing for the "dada" column, and this is then repeated 9,999 times, filling the "mama" and "dada" columns with ones and zeros, with the ones corresponding to a draw of the A allele, and the zeros corresponding to a draw of the a allele.

Line 16 creates a new column (called "geno") with rows that are the sum of those of "mama" and "dada". That sum should be 2 for individuals that are AA, 1 for individuals that are Aa, and 0 for individuals that are aa. Lines 18-20 tally up the zeros, ones, and twos, naming the sums according to the genotype they represent, and lines 22-24 print those values.

The results from one run give us 3230 AA individuals, 4930 Aa individuals, and 1840 aa individuals, which is pretty close to the expectation we can derive from the binomial expansion of 3249 AA individuals, 4902 Aa individuals, and 1849 aa individuals. Hold on...put your χ² table away. That's not the point. The reason that we see the deviation is because 10,000 offspring is still a relatively small number, so we are getting sample error (which, incidentally, is exactly what genetic drift is). Because we used a randomization procedure, we will get a different result every time. A second run of the same model produced 3273 AA, 4932 Aa, and 1795 aa.

While this variation in the outcomes of the Monte Carlo simulation might seem like a failure of the approach, it is actually one of the strengths. The binomial expansion might provide an exact expectation, but the Monte Carlo approach can provide useful information about expected variation. For example, one could vary the iterations (population size) from 100 to 100,000 and examine the variation as a function of sample size. In this way, we can get an error term for our estimate. Doesn't that sound like an exciting way to spend the evening?

Let's apply a Monte Carlo approach to the coliform dataset that we examined with a single-sample t-test in week 7, to determine whether the coliform count of the sample deviated from the government standard of 150 CFU per 100 ml. In this instance, the approach that I took was to draw samples of 25 (the same sample size as the data set) from a normally distributed population with a mean of 150 CFU per 100 ml, and a standard deviation that matched that of the sample:

The above program is annotated (unlike the "Twitterverse", R ignores text following a "#" sign), so the explanation of what the program is doing can be read from the program itself. The p-value that we got for the single-sample t-test was 0.0004. The proportion of the Monte Carlo samples that met or exceeded the mean value of our sample was 0.00004 in one run (remember, the outcome will vary from run to run). There is an order of magnitude difference between the 2 probabilities...so, which one is right? They both are. Although it might not seem obvious, the 2 approaches are asking different questions. The single-sample t-test estimates the distribution of sample means (remember standard error?) and estimates the probability of drawing a sample mean that large or larger, whereas our Monte Carlo approach estimates observations and uses those to generate sample means. A subtle difference, but an important one.

Let us also examine the shorebird data from week 7, just to give you a better feel for a variety of approaches. This was a comparison of tissue mercury levels that we analyzed using a 2-sample t-test. For this data set I combined the two samples into a single set of data:

R will read data from a comma-delimited (csv) file, but I went ahead and entered these manually. Our first approach will be to draw samples of 25 and 18 from the pooled data set (remember that the null hypothesis is that the samples came from the same distribution), and compare the differences between the two sample means to the difference we observed between the 2 samples (0.8585). The first part of the program does this without replacement, meaning that each observation can only be used once. Because the total number of observations is the sum of 25 and 18, this means that each observation will be used exactly once:

From 100,000 draws, there were 0 differences equal to or larger than that observed. That means that the probability of randomly subdividing this data set and generating a difference in means that is greater than or equal to what we observed is less than 0.00001. Because of how we set this null model up, the conclusions only apply to our sample. Making broader conclusions would require the assumption that the observations in our sample are the only observations available, and occur in exactly those proportions in the statistical population.

We can move closer to a more general conclusion by sampling with replacement, which means that observations can be drawn by the Monte Carlo procedure more than once, in which case, some observations might not be drawn at all. This increases the possible variation among iterations, and also moves us closer to the assumption that our sample is representative, as opposed to all-inclusive. The second part of the above program does this, but again, there are no differences generated from 100,000 iterations that exceed that which we observed. This certainly can be taken as a good sign that the samples are different, but the probability deviates substantially from that generated by the 2-sample t-test, which was 0.0002. Once again, the difference between the probabilities is not an indication of a flaw in either approach, but rather results from the fact that the 2 approaches ask different questions.

An alternative approach would be to estimate the mean and standard deviation of a single population that contained both sets of observations by calculating the mean (1.117) and standard deviation (0.467) of the pooled observations, and finding the probability of drawing 2 samples (with sample sizes of 18 and 25) from a normal population of observations with that mean and standard deviation that have a difference in their means greater than or equal to the one that we observed from our samples. The program adopting this approach is:

Once again, we fail to draw 2 samples with a difference in means exceeding the difference that we observed in 100,000 attempts, so the probability of that occurrence is less than 0.00001.

The point of using examples that can be addressed by conventional analyses is not to suggest alternatives to these analyses (they should be employed when applicable) but, rather, to use something that you are familair with to introduce you to some possible approaches that might be useful when you have a null expectation that might not be covered by more conventional approaches.

To receive credit for this week's assignment, send me evidence (screenshot, e-mail, etc) that you have completed the teaching evaluation for this course. Do not send me any part of the evaluation itself, just the confirmation that you receive upon completion. Please send it directly to my e-mail address, and do not submit it via Blackboard.

Send comments, suggestions, and corrections to: Derek Zelmer