Data, Samples, and Frequency Distributions

The nature of data, regardless of whether they were collected in a field or a laboratory, will determine the type of analyses that will be applied. In general, data can be classified as continuous, discrete, or nominal. Continuous data can be measured in fractions, and include things like temperature and mass. Discrete data can only be measured as integers, and include variables such as number of offspring, or number of flowers produced in a growing season. Nominal data, as the name suggests, are observations that can’t be quantified and are, therefore, given names, such as sex (male or female), or color. There is another type of data, which we will not make use of in this course, called ordinal data, which is similar to nominal data, but the categories are organized as ranks, such as those found in the Likert scale (e.g., strongly disagree, disagree somewhat, neither agree nor disagree, agree somewhat, strongly agree). People often treat ordinal data as though they are quantitative, but the reality is that the distances among the categories are unknown, and almost certainly unequal, and so the descriptive statistics that we typically apply (e.g., mean and standard deviation) will produce meaningless numbers.

It is important to recognize that it is how the data are collected that determines how you will classify your data. For example, time is most certainly a continuous variable, but can be measured discretely, e.g., in days, months, or years. Color was given above as an example of a nominal variable but, with a great deal of expense and/or effort, color can be recorded as a continuous variable if you measure the actual wavelengths. In fact, this is in essence what one does when using spectrophotometry, or some derivative of it like enzyme-linked immunosorbent assays (ELISA).

Question 1: Provide 2 examples each (other than those mentioned above) of continuous, discrete, and nominal variables.

Regardless of whether you are establishing patterns, or testing the tentative mechanisms capable of producing those patterns, i.e., hypotheses, there will be one (or more for multivariate analyses) variable that can be considered the dependent variable, and at least one variable that can be considered as the independent variable. In the case of experiments involving direct cause and effect, the variable that is being manipulated in order to determine whether it exerts an effect is the independent variable. The variable from which the possible effect, i.e., the response to the manipulation, is measured, is the dependent variable. Examining the developmental rate of the fruit fly, Drosophila melanogaster, at different temperatures would be an example of such an experiment, where the temperature is the independent variable, and the developmental rate is the dependent variable.

Some experiments, however, are mensurative, in that there is no direct manipulation of one of the variables. Data are collected from different predetermined groups in order to determine whether a pattern can be established. The data collected are the dependent variable(s), and the locations or groups from which those data were collected constitute the independent variable(s). In these instances the manipulation is implicit in that it is presumed to have occurred. For example, one could examine the number of zooplankton species in 5 ponds of varying surface areas that are located close to one another. Pond size itself was not manipulated directly by the investigator, but it is a difference that occurs among the ponds just the same.

It may have occurred to you that you will have a lot more confidence in ascribing changes in developmental rate to temperature differences in the first example than you would attributing differences in species richness to pond size in the second. If the former experiment were properly controlled (which we will discuss momentarily), then one could reasonably ascribe any changes in the dependent variable to the changes in temperature. In the case of the second experiment, however, there are a number of different things that could influence zooplankton richness apart from the surface area, or that are only indirectly related to the surface area of the pond. In such a case, the best we can do would be to determine, objectively, whether there were differences among the ponds. In other words, the independent variable would more correctly be viewed as “pond”, rather than surface area.

Not surprisingly, these 2 types of independent variables should be treated differently from the standpoint of data analysis. Independent variables like temperature in the first example, where cause and effect are being examined directly, are referred to as “fixed variables”, whereas independent variables that represent categories encompassing a number of possible effects, or effects whose levels are not determined by the experimenter, are referred to as “random variables”. Another way of looking at the distinction between fixed variables (or effects) and random variables (or effects) is the scale at which the data can be interpreted. For fixed effects, like temperature, the values were chosen specifically, and the interpretation of the results is constrained to the range of values that were selected. For random effects, like "pond", the effects can be considered to be random samples of a larger probability distribution, e.g., all of the ponds in the region, allowing inferences to be drawn about the broader distribution, not just the specific ponds chosen.

Question 2: For the following experiments, identify the dependent and independent variables, and explain (which requires some justification on your part) whether the independent variable is a fixed variable or a random variable. It is the justification that is important in your determination because there may be more than one correct answer.

Seasonal changes in lipids, diet, and body composition of free-ranging black-tailed prairie dogs.

The effects of impoundment on the community structure of freshwater mussels in the Neosho River, Kansas.

Effect of various doses of chlorpromazine on formation of 3-methoxytyramine and normetanephrine in mouse brain.

The effect of vegetation changes following deforestation on water yield and evapotranspiration.

Direct proinflammatory effect of C-reactive protein on expression of adhesion molecules in human endothelial cells.

It is important to recognize that scientific research can simply be descriptive, such as examining the life-cycle of a parasite, surveying the avian fauna of a Pacific island, sequencing the genome of a particular organism, or characterizing the courtship ritual of an insect. In many cases, however, this still involves experimentation at some level of the investigation.

Data Collection and Experimental Design

In most investigations conducted in the biological sciences, inferences are made about statistical populations from samples. A statistical population consists of all of the observations that possibly could be made, whereas the sample (as the name implies) is the subset of that statistical population that we actually work with. It might seem like a trivial point, but understanding the boundaries of your statistical population is a critical component of designing good experiments, conducting appropriate analyses, and making meaningful inferences. Good experimental design also requires very careful consideration and application of the three “R”s: Replication, Randomization, and contRol.

Replication

Experimentation involves manipulation. As has already been discussed, it is the independent variable that is manipulated, and potential responses are measured in the dependent variable. The smallest unit that can be manipulated, or treated, is called an experimental unit. Replication is simply making observations on multiple experimental units. The reason that this is necessary is that biological systems, and the parameters that one can measure within them, are quite variable. This is fortunate, because without variation, there would be no adaptation, and without adaptation, there would be no biology. The downside is that evaluating the results of experiments often requires large sample sizes (the number of experimental units) in order to determine whether a particular manipulation has had an effect.

The reason that replication is necessary is that it allows us to quantify the variation present in a particular variable. It is knowledge of this variation that gives our observations meaning. We all recognize that $1/gallon is a ridiculously low price for gasoline, because we understand the variation in gas prices, and have established that understanding through repeated observation. We understand that differences of a penny or 2 in the price of a gallon of fuel are trivial, while differences of a dime or more can be substantial.

In other contexts, differences of the same magnitude can be considered trivial, because the amount of variation differs. For example, we would consider differences of a dollar or more in the price of a textbook to be trivial. Certainly, the magnitude of the values in question comes into consideration, but it is the variation that is important in evaluating differences. Differences among the resting heart rates of individuals are to be expected, i.e., there is a fair amount of variation in this parameter, but differences in internal body temperature are not expected, unless some of the individuals are experiencing serious physiological problems.

Consider a hypothetical experiment, where 500 individuals of Tetrahymena pyriformis (a ciliophoran closely resembling Paramecium spp.) are placed in pond water, and an additional 500 individuals are placed in pond water with a small amount of glucose added. The independent variable in this case is the presence or absence of glucose, and the dependent variable is the number of contractions of the contractile vacuole (an organelle used to expel water from the cell) over the course of a minute. Two different possible outcomes for the experiment are presented as frequency distributions in the following graphs, with the black bars representing the observations where no glucose was present, and the white bars representing the observations made in the presence of glucose.

Note: These data were actually derived by random draws from distributions with defined parameters. The program used to generate these data in the software package R can be viewed HERE for anyone interested in such things.
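For readers who want to see what "random draws from distributions with defined parameters" can look like, here is a minimal sketch in Python. It is not the R program mentioned in the note. The choice of a normal distribution, the means, the standard deviations, and the seeds are all assumptions made purely for illustration.

```python
import random

def simulate_contraction_rates(n, mean, sd, seed):
    """Draw n contraction rates (per minute) from a normal distribution."""
    rng = random.Random(seed)  # seeded so the same draw can be repeated
    return [rng.gauss(mean, sd) for _ in range(n)]

# Hypothetical parameters, chosen only for illustration
# (they are not the values behind the graphs in this exercise).
no_glucose = simulate_contraction_rates(n=500, mean=15.0, sd=2.0, seed=1)
glucose = simulate_contraction_rates(n=500, mean=12.0, sd=2.0, seed=2)

print(len(no_glucose), len(glucose))  # 500 500
```

Changing only the sd argument while leaving the means alone would produce two sets of results that differ in their variation but not in their central tendency.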

This type of graph is referred to as a frequency distribution. All 500 observations (both graphs are based upon the same number of observations) are displayed as counts, i.e., the sum of all the values indicated by the bars representing one treatment, for either graph, would equal 500. The observed rates of vacuole contraction are located on the x-axis, and the number of individuals that exhibited those rates of contraction, i.e., the number of times those rates were observed, is on the y-axis. It is called a frequency distribution because the frequency with which each value of the dependent variable was observed is displayed. For example, for the graph on the left, 46 individuals in the glucose-free treatment were observed to have vacuole contraction rates in the range of 14-15 contractions per minute.
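The counting behind a frequency distribution can be sketched in a few lines of Python. The observations below are made-up values, and the bin width of 1 contraction per minute is an assumption; the function name is hypothetical.

```python
import math
from collections import Counter

BIN_WIDTH = 1.0  # assumed bin width, in contractions per minute

def frequency_distribution(observations, bin_width=BIN_WIDTH):
    """Count the observations falling in each bin [lower, lower + bin_width)."""
    counts = Counter(math.floor(x / bin_width) * bin_width for x in observations)
    return dict(sorted(counts.items()))

# Made-up contraction rates for eight individuals
rates = [14.2, 14.8, 15.1, 13.9, 14.5, 16.0, 15.7, 14.1]
table = frequency_distribution(rates)

for lower, count in table.items():
    print(f"{lower:.0f}-{lower + BIN_WIDTH:.0f}: {count}")
```

Each printed count corresponds to the height of one bar, and the counts add up to the number of observations, just as the bars for one treatment add up to 500 in the graphs.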

It is critical that you be able to interpret frequency distributions, because they are the foundation for developing an understanding of inferential statistical analysis. Spend as much time as is necessary in order to make certain that you understand what is being displayed in the preceding graphs.

Make note of the fact that the scale of the y-axis differs between the two graphs. The sample size is 500 in both cases. The only substantive difference between these two sets of hypothetical results is the amount of variation present in the data.

Question 3: Which of these two sets of results (the left or the right) would give you more confidence in concluding that the addition of glucose had an effect on the rate of contraction of the contractile vacuole? The point of this exercise is to get you to think about what criteria you would apply to conclude whether 2 sample distributions represent 2 different distributions (i.e., there is a treatment effect), or 2 samples from the same distribution (i.e., no treatment effect).

Question 4: Explain how and why you might modify the sample sizes (number of experimental units) used in this experiment if you had prior knowledge (from a preliminary trial - always a good idea) that the distributions on the right were the correct ones.

Remember that the actual sample size is the number of experimental units, which are the smallest units that can be manipulated (treated), not the sum of all possible observations. One must be careful not to treat multiple measures on a single experimental unit as though they were replicates. Doing so is referred to as "pseudoreplication". Consider an experiment designed to examine the effects of grazing by cattle on the diversity of forbs. A single pasture is divided into two equal plots by a fence, and cattle are released onto one of the sides (chosen at random). After one year of grazing, 20 measurements of forb diversity are taken on each side of the pasture. Comparing grazed to ungrazed as though the sample sizes were 20 for each treatment would be an example of pseudoreplication. In reality, the pasture itself is the experimental unit, because it is the smallest unit that can be grazed, and so this would constitute an unreplicated experiment, i.e., the treatment and the control both have a sample size of 1. Multiple measurements within a single pasture will increase the precision (how close measurements are to each other) of our estimates of the forb diversity in that pasture, but we need measurements from multiple pastures to improve the accuracy (how close estimates are to the actual values) of our estimate of the effect of grazing.
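One way to keep experimental units and repeated measurements separate is to collapse the measurements to a single value per unit before counting. The sketch below does this for the pasture example; the diversity values are invented, only 5 of the 20 measurements per side are shown, and the names are hypothetical.

```python
from statistics import mean

# Invented forb-diversity measurements: treatment -> experimental unit -> values
measurements = {
    "grazed": {"pasture 1 (grazed side)": [4.1, 3.8, 4.4, 4.0, 3.9]},
    "ungrazed": {"pasture 1 (ungrazed side)": [5.2, 5.0, 5.5, 5.1, 5.3]},
}

def unit_means(units):
    """Collapse repeated measurements to one mean per experimental unit."""
    return {unit: mean(values) for unit, values in units.items()}

for treatment, units in measurements.items():
    per_unit = unit_means(units)
    n_measurements = sum(len(values) for values in units.values())
    # The sample size is the number of units, not the number of measurements.
    print(treatment, "measurements:", n_measurements, "sample size:", len(per_unit))
```

Adding more values to a list sharpens the estimate for that side of the pasture, but the sample size only grows when more pastures are added.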

Question 5: For each of the following examples, indicate the actual sample size, i.e., number of experimental units, for each treatment (just counting the potential observations will give you the wrong answer!):

The effect of sun exposure on the surface area of privet leaves: the surface areas of 20 leaves from each of 3 privet plants are measured in a sun-exposed area, and also in a shaded area.

The effect of fertilizer on the growth rate of Brassica rapa: 10 pots are planted with 5 plants per pot, and fertilizer is added to 5 of the pots. The growth rates of each plant are measured over the course of 2 weeks.

The effect of the presence of predators on the rate of metamorphosis of tadpoles of Rana catesbeiana: 20 aquaria with 15 bullfrog tadpoles in each are filled either with artificial pond water (10 aquaria) or with artificial pond water taken from aquaria that housed largemouth bass, Micropterus salmoides (10 aquaria), and the amount of time required for the individual tadpoles to develop into frogs is recorded.

Randomization

Because we generally are dealing with samples, one of the most important components of data collection and experimental design is making sure that those samples are representative of the statistical population. This is difficult to accomplish if one has not given much thought to the delineation of that statistical population! Simply put, we want the frequency distribution of our sample to mimic the shape (variation) and position on the X-axis (central tendency) of the frequency distribution of the statistical population, i.e., the distribution of all possible observations. We will refer to the steps taken to ensure samples are representative collectively as “randomization”, even though we will discuss options other than random sampling.

Random sampling essentially means that all remaining experimental units are equally likely to become part of the treatment group, and also equally likely to become part of the control group, regardless of how many units have already been assigned to a given group. The only legitimate way to do this would be to assign all experimental units a number, and draw numbers to determine group membership using a random number generator. The "RAND" function in Excel will produce a random number between 0 and 1 by typing in:

=rand()

Copying the formula will produce a new draw for each cell to which the formula is copied, and each of those cells will draw a new number every time the worksheet recalculates, e.g., whenever you enter or edit the contents of a cell. We will discover later on in this exercise that Excel does not actually do a good job of drawing random numbers (it tends to avoid the extremes), but for our current purposes, we will take the numbers generated by the RAND() function as random.
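The idea of numbering every experimental unit and letting a random number generator decide group membership can also be sketched outside of Excel. The Python example below shuffles the unit numbers and splits them into two groups of fixed size; the function name, the 20 units, and the seed are assumptions for illustration only.

```python
import random

def randomly_assign(unit_ids, n_treatment, seed=None):
    """Shuffle the numbered units, then split them into treatment and control."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)  # every ordering of the units is equally likely
    return sorted(shuffled[:n_treatment]), sorted(shuffled[n_treatment:])

# 20 hypothetical experimental units, numbered 1 to 20
treatment, control = randomly_assign(range(1, 21), n_treatment=10, seed=42)
print("treatment:", treatment)
print("control:  ", control)
```

Leaving out the seed gives a different assignment on every run, which is what one would want in a real experiment; the seed is included here only so that the example can be repeated.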

To draw random integers within a specified range where min is the smallest number in the range and max is the highest number in the range, you can use the formula:

=int(rand()*(max-min)+.5)+min

The "INT" function rounds the value of what is contained in the adjacent parentheses down to the nearest integer, e.g., 4.678 would be rounded down to 4. Compensating for this is the purpose of the ".5" in the preceding equation. Adding that value will increase the value of fractions of .5 or greater such that the overall value will be rounded to the next highest integer. I'm sure that some of you will realize that this would change the rounding rule of thumb from alternating rounding values of 0.5 up and down to always rounding up, but we have to realize that Excel generates these fractions to 9 decimal places, and so this will lead us astray only half of the times that 0.500000000 is the fraction drawn. For our purposes (we won't be launching any space probes based on our calculations), we can consider this amount of error as being trivial, especially when compared to the issues with Excel's ability to draw random numbers.

Using the above formula we could draw random numbers between 0 and 10 using:

=int(rand()*10+.5)

To draw random numbers from 1 to 10, we could use:

=int(rand()*9+.5)+1

Or, to draw random numbers between 5 and 10, we could use:

=int(rand()*5+.5)+5
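One property of the rounding formulas above that is worth checking is whether every integer in the range is equally likely. Because only values within .5 of min round to min, and only values within .5 of max round to max, the two endpoints should turn up about half as often as the integers between them. The sketch below, in Python rather than Excel, simulates the formula for the range 0 to 10. For comparison, it also simulates a commonly used alternative, =int(rand()*(max-min+1))+min, which is not part of the original exercise. The function names, the number of draws, and the seed are assumptions for illustration.

```python
import random
from collections import Counter

def rounded_draw(rng, lo, hi):
    """Mimic =int(rand()*(max-min)+.5)+min."""
    return int(rng.random() * (hi - lo) + 0.5) + lo

def floor_draw(rng, lo, hi):
    """Mimic the alternative =int(rand()*(max-min+1))+min."""
    return int(rng.random() * (hi - lo + 1)) + lo

rng = random.Random(0)  # seeded so the simulation can be repeated
n = 100_000
rounded = Counter(rounded_draw(rng, 0, 10) for _ in range(n))
floored = Counter(floor_draw(rng, 0, 10) for _ in range(n))

print("value  rounded  floored")
for value in range(11):
    print(f"{value:5d}  {rounded[value]:7d}  {floored[value]:7d}")
```

If the counts for 0 and 10 in the first column are roughly half of the others, the endpoints are being under-sampled, which matters whenever the smallest and largest numbers need the same chance of being drawn as the rest.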

While the notion of random sampling is an excellent one in terms of removing bias, and also from the standpoint of independence (which is an assumption of the vast majority of the analyses that we will learn), there are practical issues with its application. Looking at our Tetrahymena pyriformis example, one would have to assume that the individuals used in the experiment were drawn from a much larger culture. Although there are methods that could be used to generate random subsamples from cultures (one good example of the considerations of such techniques can be found in Wrona et al., 1982), there is no practical way to number each individual for a random draw from the culture (which we should consider as our statistical population), or for assigning the 1000 individuals drawn, randomly or otherwise, to a particular treatment.

One would also assume that individuals would be drawn from the culture prior to each experiment, as opposed to holding 1000 individuals for the duration of the experiment. In that event, the fact that time is elapsing between the samples must be taken into consideration. If there is anything non-random about the method used to isolate individuals from the culture (let's say larger individuals are more likely to be selected), then treating the first 500 individuals one way, and the next 500 a different way will ensure that neither of the 2 samples will be representative of the statistical population.

One way around these difficulties is to use systematic sampling, where fixed intervals determine the fate of each experimental unit. For this example, this would entail alternating treatments for individuals drawn from the culture, such that each individual received a different treatment from the one selected before it. If the first individual was exposed to pond water without glucose, the second would be exposed to pond water with glucose, the third to pond water without glucose, and so on. If there were a temporal bias to the sampling of individuals from the culture, this would spread the effects of that bias through both groups, making both samples similar in terms of how representative they are of the statistical population.
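The alternating scheme just described can be written as a short Python sketch. The function name and the treatment labels are hypothetical.

```python
def alternate_treatments(n_units, treatments=("no glucose", "glucose")):
    """Assign treatments in a fixed rotation, in the order the units are drawn."""
    return [treatments[i % len(treatments)] for i in range(n_units)]

# The first six individuals drawn from the culture
assignments = alternate_treatments(6)
print(assignments)
# ['no glucose', 'glucose', 'no glucose', 'glucose', 'no glucose', 'glucose']
```

Note that nothing in the function is random: once the first treatment is fixed, every later assignment follows from it.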

The problem with a systematic approach is that the assumption of independence is violated. Essentially, "independence" means that the observations are independent of one another, such that every experimental unit in a statistical population has the same probability of being in the treatment group every time an experimental unit is selected for treatment. With a systematic approach, once you have determined what treatment the first individual is to receive, even if this is determined randomly, you will have established the treatment that all of the remaining individuals will receive. Although the assumption of independence is a critical one for the analyses we will employ, in practice it has been found that the effects of systematic sampling on the outcome of the analysis tend to be minimal, and in some cases are positive because systematic samples tend to be more representative than random samples.

Other examples of systematic sampling would include arranging organisms by size, or some other variable that might influence the dependent variable, and assigning every other one to the treatment group, or sampling every 5 meters along a transect, or every 30 meters around the edge of a pond.

Randomized block designs represent the best of both worlds in that independence is maintained, but the samples for the various treatment groups are spread across the same ranges of potential bias. From our Tetrahymena example, we could sample 2 individuals at a time from the culture, and then determine randomly (by a coin toss perhaps) which of the two would receive the glucose treatment. In plot experiments, common in plant ecology, where the distances among plots could introduce variation, a randomized block design would consist of each plot being divided up into as many subplots as needed to accommodate the control and treatment groups, and then the treatments being assigned randomly within the subplots. In this way, whatever variation exists as the result of plot location is spread equally through all of the treatments.
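The paired version of the Tetrahymena example can be sketched as follows, with each pair of individuals acting as a block and a shuffle standing in for the coin toss. The function name, the labels, and the seed are assumptions for illustration.

```python
import random

def randomized_pairs(n_pairs, seed=None):
    """For each pair drawn from the culture, randomly decide which member gets glucose."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_pairs):
        pair = ["glucose", "no glucose"]
        rng.shuffle(pair)  # the "coin toss" within the block
        assignments.append(tuple(pair))  # (first individual, second individual)
    return assignments

blocks = randomized_pairs(500, seed=7)
print(blocks[:3])
```

Every block contributes exactly one individual to each treatment, so both groups are spread evenly over the time course of the sampling, while the order within each block is left to chance.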

By far, the worst approach would be attempts to be random, such as throwing a lawn dart to determine where a sample is taken, or selecting individuals for treatment with a blindfold on. Such attempts are properly referred to as "haphazard sampling". Human beings are incredibly poor at recognizing randomness, and are utterly inept at producing it intentionally ("The Drunkard's Walk" is an excellent book on the subject). The end result is non-random (and therefore lacking independence), because we generally wind up trying to be systematic, but none of the benefits of systematic sampling will be realized. In other words, haphazard sampling will give you the worst of both of the alternatives.

One final sampling concept that we will address is that of stratified samples. For example, one may sample only from a particular size or age class of individual, to narrow the focus of the experiment, or prevent wasted effort (such as measuring fecundity in juveniles). If the sampling were random within these groups, this would be characterized as a "stratified random sample". Another example of a stratified random approach might be drawing random co-ordinates that determine where to collect samples within a lake (I would actually prefer a systematic approach here for practical reasons), but only sampling those that occur in water less than one meter in depth.
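A stratified random sample can be thought of as two steps: restrict the candidates to the stratum of interest, then sample at random within it. The Python sketch below uses an invented population of 30 individuals labeled by age class; the function name, the labels, and the seed are hypothetical.

```python
import random

def stratified_random_sample(individuals, in_stratum, k, seed=None):
    """Keep only individuals in the stratum, then draw k of them at random."""
    rng = random.Random(seed)
    eligible = [ind for ind in individuals if in_stratum(ind)]
    return rng.sample(eligible, k)  # sampling without replacement

# Invented population of (id, age class); every third individual is a juvenile
population = [(i, "adult" if i % 3 else "juvenile") for i in range(1, 31)]

adults = stratified_random_sample(
    population, in_stratum=lambda ind: ind[1] == "adult", k=5, seed=3
)
print(adults)
```

The same two steps apply to the lake example: draw random co-ordinates, and keep only those that fall in water less than one meter deep.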

Control

Although it can be one of the most difficult aspects of experimental design to get right, the concept behind controlled experiments is a simple one: only manipulate a single variable. If only one thing has been changed, then any resultant changes in the dependent variable(s) can reliably be ascribed to that manipulation.

This does not mean that experiments have to be simple in order to be controlled, but it does mean that adding variables will increase the complexity of the design. Satisfying the concept of a controlled experiment when there are multiple independent variables requires what is referred to as a "fully-crossed" design. This means that there are treatment groups with all possible combinations of variables, such that different levels of a single independent variable can be compared across groups where other independent variables are held constant.

Let's take the tadpole experiment above, and add the effect of temperature on the rate of tadpole development to the mix, by examining rates of development at 10, 20, and 30 degrees Celsius. For a fully-crossed design with the same sample sizes, this would require 60 aquaria in total, i.e., 40 more than the original 20: 10 for each treatment (regular or bass-conditioned water) at each temperature. This way, the effects of the bass-conditioned water can be compared within each of the 3 temperatures, and the effect of temperature can be examined in untreated artificial pond water, and also in bass-conditioned artificial pond water.
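The bookkeeping for this fully-crossed design can be checked with a short Python sketch that lists every combination of water type and temperature. The variable names and labels are hypothetical; the numbers come from the tadpole example.

```python
from itertools import product

water_types = ["artificial pond water", "bass-conditioned water"]
temperatures_c = [10, 20, 30]
replicates = 10  # aquaria per treatment group

# Every combination of water type and temperature is one treatment group.
treatment_groups = list(product(water_types, temperatures_c))
for water, temperature in treatment_groups:
    print(f"{water} at {temperature} C: {replicates} aquaria")

total_aquaria = len(treatment_groups) * replicates
original_aquaria = len(water_types) * replicates  # the design without temperature
print(len(treatment_groups), total_aquaria, total_aquaria - original_aquaria)  # 6 60 40
```

Adding a third independent variable would multiply the number of treatment groups again, which is why the complexity of a fully-crossed design grows so quickly.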

Fully-crossed designs allow us to examine how variables interact. For example, the effect of temperature might be different in the different types of water, as opposed to just adding the temperature effect to the effect of the type of water. In general, the interactions among variables are more interesting than the separate effects of individual variables. We will examine some methods of examining such interactions when we address n-factor analysis of variance (ANOVA) in a few weeks.

But for now, let's move on to building frequency distributions in Excel...

Literature Cited

Wrona, F.J., J.M. Culp, and R.W. Davies. 1982. Macroinvertebrate subsampling: a simplified apparatus and approach. Canadian Journal of Fisheries and Aquatic Sciences 39: 1051-1054.

Send comments, suggestions, and corrections to: Derek Zelmer