Descriptive Statistics

As mentioned previously, a frequency distribution contains all of the observations for a particular sample, which we refer to as the raw data. Only on rare occasions do we present the raw data. The purpose of descriptive statistics is to provide a means of summarizing the information contained within a frequency distribution. The two most important pieces of information that need to be provided for any distribution are the central tendency of the distribution, and the dispersion of the distribution. Measures of central tendency essentially describe the position of the distribution on the X-axis (the value of the variable being measured), whereas measures of dispersion describe how spread out the observations are along the X-axis. In a number of cases the shape of the distribution, specifically the degree of symmetry, will also be important to describe.

Central Tendency

(Chapter 3 in Zar, 2010)

Where a particular distribution of data is located on the X-axis (which represents the values of the variable being measured) is summarized by reference to some value associated with the approximate center of the distribution. The 3 standard measures of central tendency are mean, median, and mode.

The mean simply is the arithmetic average of the observations, and is a summary statistic that we all are familiar with. For this reason, we will use the formula for calculation of the sample mean to indicate some of the notation that we will be using throughout the semester:

Ȳ = ∑Y / n

We will use Y to denote the value of an observation. The total number of observations, i.e., the sample size, will be denoted as n. When we wish to identify a specific observation, we will use a subscript. For example, Y₄ would indicate the 4th observation. ∑Y is the notation that we will use for the sum of all the observations (Y₁+Y₂+Y₃+...+Y_n). In this case a subscript for the Y would indicate a particular group of observations, e.g., ∑Y_control would indicate the sum of all the observations for the control group. The Y with the bar over it (Ȳ) is the sample mean, and it is an estimate, based upon our sample, of the actual mean of the statistical population. We can express this formally as:

Ȳ = μ̂

The population mean is represented by the symbol μ (lower case Mu in the Greek alphabet), and the caret (^) over the top of it indicates that it is an estimate. Thus, the preceding formula can be read as "the sample mean is an estimate of the population mean".
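For those who like to see notation as arithmetic, here is a minimal sketch in Python (used only for illustration; the five observations are made up and are not course data). It shows n, ∑Y, and the sample mean computed exactly as the notation above describes:

```python
# Hypothetical observations (Y); not data from the course.
Y = [4.0, 7.0, 5.0, 9.0, 5.0]

n = len(Y)           # sample size
sum_Y = sum(Y)       # the sum of all the observations
Y_bar = sum_Y / n    # sample mean, our estimate of the population mean

print(n, sum_Y, Y_bar)   # 5 30.0 6.0
```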

When making estimates, we are far more concerned with accuracy (proximity to the actual value) than we are with precision (proximity of the estimates to one another). Of critical importance to obtaining accuracy in our estimates is the use of estimators that are unbiased. An unbiased estimator produces estimates that, averaged over many samples, match the actual value, so that its overestimates and underestimates balance out, whereas a biased estimator will tend to consistently overestimate, or consistently underestimate. Write that down...you are about to be asked to evaluate bias.

We can use our newfound skills at reading frequency distributions (if you are feeling less than skillful, review last week's material) to examine the behavior of the sample mean as an estimate of the population mean. The following graph was produced by drawing (at random) 1000 samples of 50 observations each from a statistical population where μ=10. This population mean of 10 was subtracted from each of the sample means (such that a value of 0 would indicate that the sample mean and population mean were identical) calculated from the 1000 samples to produce the following distribution:

Note: These data were produced as the "smean" object in this R program
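If you would like to generate a similar set of values yourself, the following Python sketch repeats the procedure described above. It is not the R program referred to in the note, and it rests on two assumptions that the text does not state: that the population is normally distributed, and that its standard deviation is 2. The name smean_diff is made up for this sketch.

```python
import random
import statistics

random.seed(1)   # fixed seed so the sketch is repeatable

MU = 10.0        # population mean, as in the text
SIGMA = 2.0      # assumption: population standard deviation
N_SAMPLES = 1000
N_OBS = 50

# Difference between each sample mean and the population mean.
smean_diff = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N_OBS)]
    smean_diff.append(statistics.mean(sample) - MU)

above = sum(d > 0 for d in smean_diff)
print("average difference:", round(statistics.mean(smean_diff), 3))
print("sample means above mu:", above, "of", N_SAMPLES)
```

Plotting smean_diff as a frequency distribution should give a picture comparable to the graph above, although the exact values will depend on the seed and on the assumed population.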

Question 1: From these data, does the sample mean appear to be an unbiased estimate of the population mean? Justify your answer.

The other 2 measures of central tendency, the median and the mode, will return values similar to the mean for distributions that are symmetrical, like the one above, but can convey different, and sometimes important, information when applied to asymmetrical distributions. The median is the middle observation when the observations are aligned in ascending (or descending) order by the magnitude of their values. This can be a useful measure, because 50% of your observations are above that value, and 50% of your observations fall below that value. The mode is the value that occurs the most frequently among the observations, i.e., the peak of the frequency distribution.
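A small sketch of the median and the mode, using Python's standard statistics module and made-up observations, may help make these definitions concrete:

```python
import statistics

# Hypothetical observations, already sorted in ascending order.
odd = [2, 3, 3, 5, 8]        # 5 observations: the middle one is the median
even = [2, 3, 3, 5, 8, 9]    # 6 observations: no single middle observation

print(statistics.median(odd))    # 3
print(statistics.median(even))   # 4.0 (halfway between 3 and 5)
print(statistics.mode(odd))      # 3
```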

For distributions that are symmetrical, the mean, median, and mode should converge on the same value. The distribution that follows displays observations of feeding rates of fruit fly (Drosophila melanogaster) larvae, measured by counting the number of times the feeding apparatus (cephalopharyngeal sclerites) contracted over the period of a minute.

The distribution of feeding rates is (more or less) symmetrical, resulting in the sample mean, sample median, and sample mode all being approximately 85 contractions per minute. For symmetrical distributions, all 3 measures convey basically the same information. When distributions are asymmetrical, you have to carefully consider what information you wish to convey when choosing a measure of central tendency. The following distribution was created by examining the distribution of Vicia sp. (vetch), a twining legume growing in the lawn outside of the Pacer Commons dorm on campus, by counting the number of individuals present in a series of 0.5 m² quadrats:

As you can see, this distribution exhibits a positive (right) skew, resulting in different values for the sample mean, sample median, and sample mode. Reporting the sample mean will give a value that does not occur frequently as an observation, and so you would have to weigh whether frequency is a more important piece of information than the position of the distribution for the question that you are addressing.
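To see how a right skew pulls the three measures apart, consider this sketch with made-up quadrat counts (these are not the vetch data):

```python
import statistics

# Hypothetical counts of plants per quadrat: many low counts and a
# few high ones, which produces a positive (right) skew.
counts = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 12]

print(statistics.mean(counts))     # 2.25
print(statistics.median(counts))   # 1.0
print(statistics.mode(counts))     # 0
```

In this made-up sample the mean is not a value that any quadrat actually produced, which is the trade-off described above.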

In some instances, a distribution may be suggestive of more than one coherent group of observations, such as the distribution of exam grades shown below:

In such cases, the sample mean and median are poor indications of the pattern, and one should report both modes (this type of distribution is referred to as "bimodal" because...well...I'm sure that you can sort out why).
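As a rough illustration, the multimode function in Python's statistics module (available in Python 3.8 and later) reports every value tied for most frequent. The grades below are invented for the sketch:

```python
import statistics

# Hypothetical exam grades with two clusters (not the course data).
grades = [55, 60, 60, 60, 65, 70, 75, 85, 90, 90, 90, 95]

print(statistics.multimode(grades))   # [60, 90]
print(statistics.mean(grades))        # falls between the two clusters
```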

While it is important to recognize the existence and potential uses for other measures of central tendency, it will be a rare occasion when a measure of central tendency other than the sample mean is reported.

Dispersion

(Chapter 4 in Zar, 2010)

While the position of a distribution on the X-axis is a critical piece of information to convey, the relevance of that measure depends on how wide that distribution is, i.e., the amount of variation in that variable, especially when making comparisons between or among distributions. Measures of dispersion are indices of how spread out the observations are along the X-axis.

The simplest measure of dispersion is the range, which involves reporting the lowest and highest observation, or the difference between them. This measure is very sensitive to outliers, which are values that are unusually high or low relative to the other observations. While it is not difficult to find recommendations for excluding outliers from a set of data, unless it is clear that the observation is impossible, e.g., a (living) human body temperature of 183 degrees C, or it is known that an error in measurement occurred, one should always be hesitant to remove such observations (see section 2.5 in chapter 2 of your text).
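The sensitivity of the range to a single unusual value can be sketched as follows. The observations and the helper name value_range are made up for illustration:

```python
# Hypothetical observations; the second list adds one unusually high value.
obs = [36.4, 36.6, 36.8, 37.0, 37.1]
obs_with_outlier = obs + [41.5]

def value_range(values):
    """Return the difference between the highest and lowest observation."""
    return max(values) - min(values)

print(round(value_range(obs), 1))                # 0.7
print(round(value_range(obs_with_outlier), 1))   # 5.1
```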

The reason that range is sensitive to outliers is that it relies on only 2 of your observations. Clearly a measure of dispersion that relied upon all of your observations would be of more value, and better justify all of the hard work that went into collecting those observations. Our newfound, and in-depth, understanding of central tendency suggests one possible measure: the average distance of the observations from the center of the distribution.

The distance of an observation from the sample mean can be calculated by subtracting the sample mean from the observation as follows:

y = Y − Ȳ

This value, indicated by a lowercase y, is called a deviate. Intuitively then, the average distance would be the sum of the deviates, ∑y, divided by the number of observations, n. The problem with this approach can be illustrated by examining the following table of quiz scores from 2 separate sections of a biology class:

Because the sample mean is the mathematical center of the observations, the sum of the deviates will always (within rounding error) be equal to zero. The two distributions of quiz scores are clearly different, but the average deviations will provide no information about these differences.
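The same point can be checked with a short sketch. The scores below are invented (they are not the quiz scores from the exercise), and were chosen so that the two sections share a mean but differ in spread:

```python
# Hypothetical quiz scores for two sections.
section_1 = [6, 7, 8, 9, 10]
section_2 = [4, 6, 8, 10, 12]

def deviates(Y):
    """Return each observation minus the sample mean."""
    Y_bar = sum(Y) / len(Y)
    return [obs - Y_bar for obs in Y]

print(sum(deviates(section_1)))   # 0.0
print(sum(deviates(section_2)))   # 0.0
```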

The solution that we will apply is to square the deviates, making all of the differences positive. The notation that we will use for a squared deviate will be y², such that ∑y² will indicate the sum of the squared deviates. The sum of the squared deviates is generally referred to as the sum of squares, and is a value that will figure prominently in virtually all of the analyses that we will address, so make sure that you are familiar with how to calculate it, and what it represents.

Applying this to the quiz score data, we can see that the sum of squares (∑y²) better reflects the differences between the two distributions:
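Here is a minimal sketch of the sum of squares calculation, again with invented scores rather than the quiz data from the exercise. The function name sum_of_squares is hypothetical:

```python
# Hypothetical quiz scores: same mean, different spread.
section_1 = [6, 7, 8, 9, 10]
section_2 = [4, 6, 8, 10, 12]

def sum_of_squares(Y):
    """Return the sum of the squared deviates from the sample mean."""
    Y_bar = sum(Y) / len(Y)
    return sum((obs - Y_bar) ** 2 for obs in Y)

print(sum_of_squares(section_1))   # 10.0
print(sum_of_squares(section_2))   # 40.0
```

Unlike the sum of the deviates, the sum of squares differs between the two made-up sections.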

Dividing the sum of the squared deviates by the number of observations (∑y²/n) will give us the average squared distance of the observations from the mean of the observations. While it should be intuitive that this is a good measure of the spread of the observations (apart from using squared distances, which we will address shortly), we cannot lose sight of the fact that the purpose of deriving this value from a sample is to estimate the same parameter for the statistical population. Thus, it is important to establish whether calculating this value as described will introduce a bias in the estimation of the same population parameter.

The average squared distance of the observations from the mean, calculated for a statistical population (i.e., using every observation that exists), is a parameter that we call the population variance, and denote using the symbol σ². Unfortunately, using the same calculation for sample data produces a biased estimate of σ². The following distribution was produced by taking 1000 random samples from a statistical population with μ=10, and σ²=4, and calculating the average squared distance of the observations from the mean of the observations for each sample. For each sample, the population variance (σ²) was subtracted from the average squared distance of the observations from the sample mean ((∑y²/n)-σ²) to produce the values shown below, such that an estimate matching the population variance would result in a value of 0:

Note: These data were produced as the "pvd" object in this R program
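The procedure just described can be sketched in Python as follows. This is not the R program mentioned in the note. It assumes a normally distributed population and a sample size of 10, neither of which is stated in the text; the name pvd is borrowed from the note only as a label.

```python
import random

random.seed(2)   # fixed seed so the sketch is repeatable

MU, SIGMA2 = 10.0, 4.0   # population mean and variance, as in the text
N_SAMPLES = 1000
N_OBS = 10               # assumption: the text gives no sample size

pvd = []   # (sum of squares / n) - sigma^2 for each sample
for _ in range(N_SAMPLES):
    Y = [random.gauss(MU, SIGMA2 ** 0.5) for _ in range(N_OBS)]
    Y_bar = sum(Y) / N_OBS
    ss = sum((obs - Y_bar) ** 2 for obs in Y)
    pvd.append(ss / N_OBS - SIGMA2)

print("average difference:", round(sum(pvd) / N_SAMPLES, 3))
```

Run it and compare the sign of the average difference with what you read from the graph.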

Question 2: In what direction (tends to underestimate or tends to overestimate) is the bias demonstrated for the average squared distance of the observations from the sample mean as an estimate of σ²?

The distribution above suggests that a different calculation must be used to produce an unbiased estimate of σ² from sample data. In this instance the correction is a simple one, involving the use of n-1 in the denominator instead of n. The resulting formula calculates a statistic we call sample variance, denoted as s²:

s² = ∑y² / (n − 1)

In the following graph, the sample variance (s²) calculated from the same series of 1000 random draws has been plotted as a second series (SS/(n-1)):

Note: The additional series was produced as the "svd" object in this R program

From this distribution, we can see that the correction for sample variance removes the bias from the estimate. Thus, we will use sample variance (s²) as our best estimate of population variance (σ²):

s² = σ̂²
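A comparison of the two divisors on the same random draws can be sketched with Python's statistics module, where pvariance divides by n and variance divides by n-1. As before, the normal population and the sample size of 10 are assumptions, and this is not the R program from the note:

```python
import random
import statistics

random.seed(3)   # fixed seed so the sketch is repeatable

MU, SIGMA2 = 10.0, 4.0          # population mean and variance, as in the text
N_SAMPLES, N_OBS = 1000, 10     # N_OBS is an assumption

pvd, svd = [], []
for _ in range(N_SAMPLES):
    Y = [random.gauss(MU, SIGMA2 ** 0.5) for _ in range(N_OBS)]
    pvd.append(statistics.pvariance(Y) - SIGMA2)   # divisor n
    svd.append(statistics.variance(Y) - SIGMA2)    # divisor n - 1

print("SS/n     average difference:", round(statistics.mean(pvd), 3))
print("SS/(n-1) average difference:", round(statistics.mean(svd), 3))
```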

The only issue one may take with variance as an indication of the spread of the data is that the units are squared relative to the values of the observations and, therefore, the mean. The solution to this, as you might imagine, is a simple one: take the square root of the variance. This produces a value referred to as the standard deviation, which, for a sample, we denote as s, and for a population, we denote as σ. Obviously (at least I hope that it is obvious), the square root of a sample variance (calculated with n-1 as the denominator) will produce a sample standard deviation (s), and the square root of a population variance (calculated using n as the denominator) will produce a population standard deviation (σ). Given that we will almost always be working with samples, we will use sample standard deviation as our estimate of population standard deviation:

s = σ̂
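The relationship between sample variance and sample standard deviation can be sketched as follows, with made-up observations and Python's standard library:

```python
import math
import statistics

Y = [6, 7, 8, 9, 10]   # hypothetical observations

s2 = statistics.variance(Y)   # sample variance, divisor n - 1
s = math.sqrt(s2)             # sample standard deviation

print(s2)                              # 2.5
print(round(s, 3))                     # 1.581
print(round(statistics.stdev(Y), 3))   # 1.581, the same result in one call
```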

Now let's practice calculating some descriptive statistics for some actual data. Download the Excel workbook for this week's exercise HERE.

Bird Data

The first worksheet (birds) contains the data from Example 3.3 in your textbook (p. 25). This will allow you to double-check your calculations, and the ones Excel does for you.

In cell F15, type the formula to calculate the sample mean for species B as:

=SUM(F3:F12)/COUNT(F3:F12)

Type "mean" in the cell immediately adjacent to the cell containing the sample mean (G15), so that you don't become confused later (and so that I am not confused when I review your spreadsheet). Excel has a function to calculate the median that we will use in cell F16:

=MEDIAN(F3:F12)

Add a label for the median in the adjacent cell as you did for the mean. Note that the value for the median does not occur among the list of observations. The reason for this is that when there is an even number of observations, there is no single middle observation, so we interpolate between the 2 middle observations (i.e., take the value halfway between them) to get the median value.

Now highlight the 2 cells containing the formulas for mean and median, use "Ctrl+C" to copy the cells, click on cell A15, and use "Ctrl+V" to paste. That feeling of anxiety that you are experiencing is the result of your conscious (or subconscious) recognition that the sample sizes for the 2 groups of observations differ. Pasting the formulas results in calculations for species A that include a blank cell. Use "F2" to verify this.

Remember the words inscribed in friendly letters upon each copy of The Hitchhiker's Guide to the Galaxy: "Don't Panic" (if you have yet to read any of the 5 books in this trilogy, please correct this alarming oversight at your earliest convenience). For now, let's take an objective and analytical approach to examining the consequences of our actions.

Because there are an odd number of observations for the life span of species A, and because these observations have been sorted in order of ascending value, we can see at a glance that there is, in fact, a middle observation, and that the value of that observation matches the value of the median as calculated by Excel. It would appear that the "MEDIAN" function ignores blank cells. We can verify that the same is true for both the "SUM" and "COUNT" functions by recalculating the mean using the "AVERAGE" function. Type the following into cell A17:

=AVERAGE(A3:A11)

Not only have we verified that several important functions ignore blank cells, which makes life a little easier (because we can paste formulas) when dealing with unequal sample sizes, but we also have verified that the "AVERAGE" function follows the formula that we learned (or more likely were reminded of) for the sample mean. Feel the tension draining away?

We now are going to work on calculating the variance for both samples. In cell G3, type the formula to calculate the deviate as:

=F3-F$15

Having the anchor ($) for the row number allows you to copy the formula down for the remaining observations while referencing the same cell for the mean. Anchoring the column is not necessary when the formula is only being copied down, and leaving the column unanchored will allow you to copy the column in its entirety to calculate the deviates for the observations for species A, because the reference will then match the location of that sample mean. You will be doing yourself a favor if you take the time to verify this...

In the next column, type the formula to square the deviate as:

=G3^2

Copy the formula down the column. We could have eliminated a step by using a single formula (=(F3-F$15)^2) in column G, but this is a good reminder of the steps that we discussed (and besides, I made you put labels where we would need to calculate sums).

It's time to take the training wheels off. Let's remind ourselves of the formula for sample variance:

s² = ∑y² / (n − 1)

You should be able to calculate the sample variance using the "SUM" function, and the "COUNT" function. Presumably you can count the observations on your own, but this will be good practice for when we use larger sample sizes. Just make sure that you use parentheses in your formula to get the correct order of operations when subtracting 1 from the count, or you will be subtracting 1 from the population variance! You also should be able to repeat these calculations for Species A by copying and pasting if you have been careful with your cell references.
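The order-of-operations trap is not specific to Excel. Here is a small Python sketch with made-up squared deviates (the same precedence rules apply: division happens before subtraction unless parentheses say otherwise):

```python
squared_deviates = [4.0, 1.0, 0.0, 1.0, 4.0]   # hypothetical values
n = len(squared_deviates)
ss = sum(squared_deviates)   # sum of squares

wrong = ss / n - 1      # divides by n first, then subtracts 1
right = ss / (n - 1)    # sample variance

print(wrong)   # 1.0
print(right)   # 2.5
```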

Lastly, calculate the sample standard deviation for the two samples. To find the square root of a value in Excel, the "SQRT" function is used as:

=SQRT(value)

The value can be an actual number, or the cell location for a value. For example, if your sample variance was located in cell C15, the sample standard deviation could be calculated as:

=SQRT(C15)

Make sure to label both sample variance and sample standard deviation clearly on your worksheet, and remember to save your work!

Sunfish Data

The second worksheet (fish) contains mass and standard length measurements for bluegill sunfish (Lepomis macrochirus) and hybrids of bluegill sunfish and green sunfish (Lepomis cyanellus), collected from a constructed pond in Sedgwick County, Kansas. These measurements have been used to calculate a "condition factor" (K), which is a ratio of the mass to the cube of the length (in cm). Fish with a larger value for K will have more mass for a given length. Because green sunfish have a larger gape, and tend to be more aggressive, there was some question as to whether the introduction of the hybrids might have a negative effect on the condition of the bluegill sunfishes as the result of interspecific competition.
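As a rough sketch of the condition factor as it is described here (mass divided by the cube of length), consider the following. The mass unit of grams, the function name, and the two fish are assumptions for illustration; the worksheet may apply a scaling constant that this sketch does not.

```python
def condition_factor(mass_g, length_cm):
    """Ratio of mass to the cube of length, as described in the text."""
    return mass_g / length_cm ** 3

# Two hypothetical fish of the same length but different mass.
print(round(condition_factor(30.0, 10.0), 3))   # 0.03
print(round(condition_factor(36.0, 10.0), 3))   # 0.036
```

The heavier of the two made-up fish has the larger K, which is the sense in which K reflects condition.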

The following graph shows the frequency distributions for the condition factor for both groups of fish:

It should be immediately evident that the distributions are similar in terms of their central tendencies, but differ in the degree of dispersion of the data.

Question 3: Calculate the mean, variance, and standard deviation of both sets of condition factor data and determine whether these summary statistics reflect the similarities and differences that can be observed between the two distributions.

Let's move on to examining symmetry and standard error...

Send comments, suggestions, and corrections to: Derek Zelmer