Cumulative Probabilities

Promise that you won't get mad? The individual probabilities that we have just covered are of limited utility in the application of inferential statistics. We will not be very interested in the probability of a single event occurring, but rather in the probability of an event, or one more extreme, occurring. This is more similar to the second part of question 5, where you determined the probability of getting a result other than 2 red and 6 blue marbles from a sample of 8 marbles. Hopefully you used 1 - P{2 red and 6 blue} instead of summing all of the remaining probabilities...

To establish cumulative probabilities from a frequency distribution, we simply look at the proportion of the observations that are less than or equal to a specific value. For example, the graph below depicts a frequency distribution for the mass of pigeons collected on the campus of the University of Texas at Austin (by Dr. David Hall):

If we plot the cumulative proportion, i.e., the proportion of the total number of birds that fall at or below a particular value, we can produce the following plot:

The probability of encountering a pigeon at or below a specific mass, let's say 221 grams, can be estimated from the shaded part of the curve below:

In this case, that probability would be estimated as 0.14. The probability of encountering a larger pigeon would be (1 - 0.14), or 0.86. Another way to look at this is that the first probability (less than or equal to 221 grams) is the proportion of the total area under the curve of the distribution to the left of 221 grams on the X-axis.
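As a minimal sketch of the same idea outside of a graph, the Python snippet below estimates a cumulative probability directly from a list of observations. The masses are hypothetical values invented for illustration (they are not Dr. Hall's pigeon data), and the function name cumulative_proportion is likewise an assumption.

```python
def cumulative_proportion(values, threshold):
    """Proportion of observations less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return at_or_below / len(values)

# Hypothetical pigeon masses in grams (NOT the data plotted above).
masses = [205, 214, 221, 230, 238, 243, 251, 260, 268, 275]

p_at_or_below = cumulative_proportion(masses, 221)  # estimate of P(mass <= 221)
p_above = 1 - p_at_or_below                         # estimate of P(mass > 221)
print(p_at_or_below, p_above)
```

With these made-up masses, 3 of the 10 birds are at or below 221 grams, so the two estimates are 0.3 and 0.7.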

Download this week's Excel workbook from here.

This workbook contains 2 sets of data: the first contains the results (provided by Dr. DeLaurier) of bone alkaline phosphatase (BAP) assays from cat serum, and the second consists of seed masses for the large seed of barbed goatgrass (Aegilops triuncialis) from a single population in California (provided by Dr. Dyer). You will be estimating cumulative probabilities from these data sets, and applying these probabilities.

The first step is to produce frequency distributions from the data. Because we are building these distributions for analytical, rather than graphical, purposes, our class sizes are going to be as small as possible. Thus, your step sizes will be 0.1 for both data sets, and the ranges can be determined by using the MAX() and MIN() functions. Remember that the FREQUENCY() function counts the values that are less than or equal to each bin number (and greater than the previous bin number), so start your classes one step below the minimum value of your observations (this is critical), and end them one step above the maximum observation, just for good measure.
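For readers who want to check their Excel bins independently, here is a rough Python sketch of the same binning rule. It assumes each class is labelled by its upper bound and that a value lands in the first class whose bound is greater than or equal to it; the function name frequency_table and the sample values are hypothetical.

```python
import math

def frequency_table(values, step=0.1):
    """Return (class upper bound, count) pairs from one step below the
    minimum to one step above the maximum observation."""
    scale = round(1 / step)  # 10 for a step of 0.1
    # Work in whole steps to sidestep floating-point error; the small
    # tolerance keeps a value such as 8.9 in the 8.9 class.
    ticks = [math.ceil(v * scale - 1e-9) for v in values]
    low = min(ticks) - 1     # one step below the minimum
    high = max(ticks) + 1    # one step above the maximum
    counts = {t: 0 for t in range(low, high + 1)}
    for t in ticks:
        counts[t] += 1
    return [(t / scale, n) for t, n in sorted(counts.items())]

# Hypothetical assay values, not the workbook data.
table = frequency_table([8.9, 9.1, 9.1, 9.3])
for upper_bound, count in table:
    print(upper_bound, count)
```

The empty classes at both ends are deliberate: they guarantee that the cumulative proportions start at 0 and finish at 1.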

Once you have the frequency distribution built, you can work out the cumulative proportions. To do this in 2 steps, you could first generate the cumulative sum for each value (remember the lesson on anchors in week 1?). Using the BAP data as an example, as shown below:

In the first cell next to the frequency of the first class (cell G2 in the picture above), I have typed:

=sum(f$2:f2)

While it may seem a little ridiculous to calculate the sum of a single cell, the anchor ($) in the formula will ensure that all preceding cells are included in the sum as the formula is copied downward...which you should do. We can turn these cumulative sums into cumulative proportions in the next column, by typing the following into cell H2 (in my example):

=g2/sum(f$2:f$293)

The anchors ensure that as the formula is copied, the sum for the denominator stays constant. Copying the formula down will produce the cumulative probabilities. In calculating the cumulative sums, you might have noticed that the final sum was 76. Because this is the total number of observations, this could have been substituted into the denominator, making the formula:

=g2/76
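The two-step calculation above (a running sum, then division by the total) can also be sketched in Python. The class counts below are hypothetical, and the variable names are assumptions chosen to mirror the spreadsheet columns.

```python
from itertools import accumulate

# Hypothetical class frequencies (column F in the example).
frequencies = [0, 1, 0, 2, 0, 1, 0]

# Running totals, like =sum(f$2:f2) copied down column G.
cumulative_sums = list(accumulate(frequencies))

# The last running total is the number of observations.
total = cumulative_sums[-1]

# Cumulative proportions, like dividing column G by the total in column H.
cumulative_props = [s / total for s in cumulative_sums]

print(cumulative_sums)
print(cumulative_props)
```

As in the spreadsheet, the final running total equals the number of observations, so the last cumulative proportion is always 1.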

It probably has occurred to you that these 2 steps could have been combined into a single formula:

=sum(f$2:f2)/sum(f$2:f$293)

As you can see from the following screen shot, this formula (applied in column I) produces the same result:

Just for reference (you do not have to graph the probabilities), a plot of the estimated cumulative probabilities looks like this:

These values will allow you to calculate the proportion of the sample below (and including) a certain value, which we can take as our estimate of the probability of encountering a value of that magnitude or less. To determine the probability of encountering a higher value, we need to determine the same relationship in the opposite direction. One could do this by rearranging the formulas that we already have applied, but the simplest way is to subtract the cumulative probabilities that we have calculated from 1 in the next column. Note that 1 minus the cumulative probability for a class is the probability of a value higher than that class; for the probability of a value that is the same as or higher than a class, use the result from the class one step below it. Again, for my example, this would require typing the following into cell J2:

=1-I2
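The sketch below illustrates the complement in Python, using hypothetical classes and cumulative proportions. It separates "higher than" from "the same as or higher than" a class value, under the assumption that the observations are recorded to the same 0.1 step as the classes.

```python
# Hypothetical classes (upper bounds) and their cumulative proportions.
classes = [8.8, 8.9, 9.0, 9.1]
cumulative = [0.0, 0.25, 0.25, 0.75]

# P(value > class): 1 minus the cumulative proportion, like =1-I2.
above = [1 - p for p in cumulative]

# P(value >= class): 1 minus the cumulative proportion of the class
# one step below; nothing lies below the first class, so it gets 1.0.
at_or_above = [1.0] + above[:-1]

for c, gt, ge in zip(classes, above, at_or_above):
    print(c, gt, ge)
```

The two columns differ only when a class contains observations, which is why it matters whether a question is worded as "higher than" or "or higher".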

Performing the same steps for the goatgrass seed data should allow you to easily answer the following:

Question 7: Determine the following probabilities:

Finding a BAP assay value higher than 22.4.

Finding a BAP assay value less than or equal to 8.9.

Finding a large seed from barbed goatgrass weighing 10.6 mg or higher.

So that you can see the value in this week's exercise, I will make one last point about probabilities that is critical to our understanding of statistical analyses. Remember our bag of red and blue marbles, where the probability of drawing a red marble at random was 0.25? One could easily calculate the probability of drawing 100 red marbles in a row (assuming the bag was large enough that losing 100 marbles did not affect the probabilities) as 0.25¹⁰⁰. This is a vanishingly small number (a 6 with 60 zeros between it and the decimal point), but it is not zero. This means that it is not impossible to draw 100 red marbles in a row, just that it is very (very, very, very, very, very) improbable.
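The size of that number is easy to verify with the product rule. This short Python check uses illustrative variable names and, like the text, assumes the probability of red stays at 0.25 on every draw.

```python
p_red = 0.25    # probability of drawing a red marble
n_draws = 100   # consecutive red draws

# Product rule for independent draws: multiply 0.25 by itself 100 times.
p_all_red = p_red ** n_draws

print(f"{p_all_red:.2e}")  # 6.22e-61
print(p_all_red > 0)       # True: improbable, but not impossible
```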

Statistical analysis is never about certainty. We base our conclusions on how probable or improbable our outcomes are given a certain set of assumptions. For example, if we are comparing two sample means using a t-test, or ANOVA, we are in fact estimating the probability that the 2 sample means are estimating the same population mean. That probability will not be zero, and so we must determine when the probability is low enough that we can conclude that our assumption that both samples estimate the same population mean is too improbable. For such analyses, we typically define "improbable" as less than 0.05. Clearly, this is nowhere near the order of improbability of drawing 100 red marbles in a row, and it should suggest to you that mistakes will be made. That is, in fact, the case. When the samples really do estimate the same population mean, we expect to make an error in interpreting our results one time in every 20 analyses. We will discuss this decision rule, and the possibility it leaves for erroneous conclusions, 2 weeks hence. For now, please save your Word documents as yourlastnameex4.doc and your Excel workbook as yourlastnameex4.xlsx and submit them via Blackboard.

Week 4 Objectives

Understand the use of the product rule and the sum rule in estimating probabilities

Understand the concept of binomial probabilities, and how to calculate binomial probabilities

Understand the concept of cumulative probability and how those probabilities can be derived from a frequency distribution

Send comments, suggestions, and corrections to: Derek Zelmer