Probability

(Chapter 5 in Zar, 2010)

A probability is the likelihood of an event occurring, and is expressed as a number between 0 (the event will never occur) and 1 (the event will always occur). Most of us seem to prefer thinking in percentages (the probability multiplied by 100, to get the expected frequency of occurrences within 100 trials), but it will be to your advantage to think of probabilities as fractions, because that is how we will be dealing with them. Moreover, mathematical application of probabilities to establish expected frequencies requires working with fractions, so converting to a percentage just adds an unnecessary step.

Many simple probabilities can be established from first principles. For example, a coin has 2 sides, and so we would expect that when tossed, there are only two possible outcomes as to which face of the coin will face upward. Moreover, the shape of the coin suggests that the 2 sides should land facing upward with equal frequency. Thus, we could predict that the probability of tossing a coin and having it come up "heads" would be equal to the probability of the coin coming up "tails", and that the 2 probabilities should sum to 1 because they are the only possible outcomes (if you are "that guy" that feels that we need to discuss the edge landing as a possibility, I will gladly provide you the opportunity to demonstrate its relevance).

In other words, if we denote the probability of the result "heads" as p, and the probability of the result "tails" as q, we know that p + q = 1, and we predict that p = q, such that:

p = q = 0.5

Implicit in our above estimates are the assumptions of randomness and independence. We assume that either outcome is equally likely, and that for repeated tosses, the result of one toss of the coin has no influence on the outcome of subsequent tosses. Hopefully you have learned by now that the old adage "don't assume" has no application in scientific endeavors. We are always making assumptions, and characterizing those assumptions and their effects is a critical part of the scientific process. Our interpretations are only valid under a well-defined set of circumstances that are characterized by our assumptions. The same is true for the statistical analyses that we will learn. If the assumptions of the analysis are not met, then the answers that we get from the analysis are potentially meaningless (Fun Fact: more than 80% of the students who have taken this class forgot to test at least one assumption on the final assignment...).

When the assumptions of randomness and independence are the basis for our estimates of probability, it is important to understand that the useful application of those probabilities is heavily influenced by the number of trials being examined. To illustrate this point, the following graph shows the result of a model of 500 coin tosses employing the probabilities that we established above. The cumulative proportion of the number of tosses that came up "heads" was recalculated each time a new toss of the coin was completed:

The random and independent nature of the process creates the possibility for sampling error, i.e., combinations of results that do not reflect the underlying probabilities. This is due to "runs" (repeated occurrences) of "heads" or "tails". As we can see, given enough iterations (in this example, 300 or more), these runs will average out to the expected probabilities, but for smaller sample sizes, we can expect our established probabilities to be poor predictors of a specific outcome.
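A model like the one described above can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the original graph: the function name cumulative_heads, the seed value, and the checkpoints printed at the end are all assumptions made for the example.

```python
import random


def cumulative_heads(n_tosses=500, p_heads=0.5, seed=1):
    """Return the running proportion of "heads" after each toss."""
    rng = random.Random(seed)  # arbitrary seed, used only for repeatability
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        # Each toss is independent of all earlier tosses.
        if rng.random() < p_heads:
            heads += 1
        proportions.append(heads / toss)
    return proportions


proportions = cumulative_heads()

# Compare the running proportion at small and large numbers of tosses.
for n in (10, 50, 100, 300, 500):
    print(n, round(proportions[n - 1], 3))
```

Changing the seed produces a different sequence of runs, but the proportion for the later tosses should still settle near 0.5.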

Question 1: Estimate the probabilities (as decimals) of the following events occurring, under the assumptions of randomness and independence (make sure to describe your process):

Producing a roll of 3 from a 6-sided die.

Having your name drawn from a hat with 142 other names in the hat.

Hitting a randomly placed battleship (4 squares in length, and 1 square in width) on a 10 by 10 grid by selecting a single square.

Realistically, we tend to be more interested in probabilities that are a little more complicated than single events. In these cases, individual probabilities can be combined according to certain rules to estimate the probability of a more complex outcome. One of these rules is the product rule, which states that if specific individual independent events must occur in a particular sequence, the probability of that outcome is estimated by multiplying the individual probabilities together. For example, to get "heads" twice in a row from a series of 2 coin tosses, both the first and second toss must come up "heads". Thus, we can estimate the probability of that outcome as the product of those individual probabilities, i.e., 0.5 x 0.5 = 0.25. Using our notation above, with p denoting the probability of having a coin come up heads from a single toss, we could express this as p².
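As a quick check on the coin example, the product rule can be compared against a direct count of every equally likely sequence of 2 tosses. The sketch below assumes a fair coin; the variable names are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 2)  # probability of "heads" on a single toss (fair coin assumed)

# Product rule: both tosses must come up "heads".
by_product_rule = p * p

# Direct count: list every equally likely sequence of 2 tosses.
outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
by_counting = Fraction(outcomes.count(("H", "H")), len(outcomes))

print(by_product_rule, by_counting)  # 1/4 1/4
```

Working with Fraction rather than decimals keeps the arithmetic exact, which makes it easier to see that the two approaches agree.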

Question 2: Estimate the probabilities of the following events occurring, under the assumptions of randomness and independence (make sure to describe your process):

Rolling 2, 6-sided dice and having both come up 3.

Flipping a coin 3 times and getting "tails" all 3 times.

Having your name and then your spouse's name drawn from a hat with 141 other names in the hat (remember that the number of names is reduced by each draw).

In some cases, the specific order of events is not as important for a particular outcome, because there is more than one way to produce that outcome. In such cases, the probability of that outcome is estimated by applying the sum rule, where the probabilities of each means by which that outcome can be satisfied are added together. For example, in order to get one "heads" result and one "tails" result from two coin tosses, one could either get "heads" as the first result, and "tails" as the second, or "tails" as the first result, and "heads" as the second. Following our earlier notation, the probability of the first possibility would be pq, and the probability of the second possibility would be qp. Thus, the probability of getting one each of "heads" and "tails" from 2 tosses could be estimated as pq + qp. Naturally, this could be more efficiently written as 2pq.
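The two-toss example can be written out step by step. This is a minimal sketch assuming a fair coin, with variable names chosen only for the illustration.

```python
from fractions import Fraction

p = Fraction(1, 2)  # probability of "heads" (fair coin assumed)
q = 1 - p           # probability of "tails"

# Product rule for each specific sequence.
heads_then_tails = p * q
tails_then_heads = q * p

# Sum rule: either sequence satisfies "one of each".
one_of_each = heads_then_tails + tails_then_heads

print(one_of_each)           # 1/2
print(one_of_each == 2 * p * q)  # True
```

Note that the remaining outcomes (two "heads", p², and two "tails", q²) account for the rest of the probability, so p² + 2pq + q² = 1.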

The probabilities that you are asked to calculate below require application of both the product rule and the sum rule. This means that there are two approaches to deriving the answer: apply the product rule first, and the sum rule second, or apply the sum rule first and the product rule second. When we apply the sum rule first we often do it without being aware that we are doing it…which means that it is often not described explicitly in the answer by students (and therefore doesn’t receive credit).

Let’s consider the probability of drawing two face cards in a row from a deck of cards. Most people start with the fact that 12 of 52 cards are face cards, making the probability of drawing a single face card 12/52. What is missing from this consideration is that this is an implicit use of the sum rule. Each individual face card is drawn with a 1/52 probability, and drawing any of the 12 in the deck satisfies the condition of drawing a face card, so we sum the probabilities (1/52) across the number of outcomes that satisfy our condition (12), making the probability of drawing a face card 12(1/52) (=12/52).

To get the probability of drawing 2 face cards in a row, we multiply the previous probability by the probability of drawing a second face card. With one face card gone, we now have 11 face cards, each with a probability of 1/51 of being drawn, so our second probability is 11(1/51). Multiplying (11/51) by (12/52) gives us the overall probability of about 0.0498.

Taking the opposite approach, we can apply the product rule first. We know that the probability of drawing a specific face card first (e.g., the queen of spades) and a specific remaining face card (e.g., the jack of hearts) second can be determined as the product of 1/52 and 1/51, which is 1/2652. This probability applies to every possible ordered pair of face cards. We can determine the total number of pairs that satisfy our outcome by writing them all down (preferably in a matrix form so as not to lose track), or by applying the counting rule, where the number of outcomes with 2 face cards is the product of the number of face cards available for the first draw, and the number of face cards available for the second draw, i.e., the product of 12 and 11, which is 132. And so, if we sum our individual probability of 1/2652 across the number of outcomes that satisfy our condition of drawing two face cards (132), we can calculate the probability of drawing two face cards as 132(1/2652), which is 132/2652, which is roughly 0.0498, the same probability we arrived at using the “sum rule first” approach. For the following questions, you may apply either, provided that your application of the sum rule is made explicit. Of course, applying both approaches would be a good way to check that your math is correct…
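Both routes through the face card example can be checked against each other, and against a brute-force listing of every ordered pair of cards. The sketch below is an illustration only; treating cards 0 through 11 as the face cards is an arbitrary labeling assumed for the example.

```python
from fractions import Fraction
from itertools import permutations

DECK = 52
FACE_CARDS = 12

# Sum rule first: 12(1/52) for the first draw, 11(1/51) for the second.
sum_first = Fraction(FACE_CARDS, DECK) * Fraction(FACE_CARDS - 1, DECK - 1)

# Product rule first: one specific ordered pair, then count the pairs.
single_pair = Fraction(1, DECK) * Fraction(1, DECK - 1)  # 1/2652
n_pairs = FACE_CARDS * (FACE_CARDS - 1)                  # counting rule: 132
product_first = n_pairs * single_pair

# Brute force: label cards 0..51 and call 0..11 the face cards (assumption).
all_pairs = list(permutations(range(DECK), 2))
face_pairs = [pair for pair in all_pairs if pair[0] < 12 and pair[1] < 12]
brute_force = Fraction(len(face_pairs), len(all_pairs))

print(sum_first, product_first, brute_force)  # 11/221 11/221 11/221
print(round(float(sum_first), 4))             # 0.0498
```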

Question 3: APPLY THE SUM RULE to estimate the probabilities of the following events occurring, under the assumptions of randomness and independence (make sure to describe your process):

Rolling 2, 6-sided dice and having both come up as even numbers.

Drawing 2 cards from a deck of 52 and having both cards be queens.

Flipping a coin 4 times, and having "heads" come up half of the time. What does this probability tell you about sampling error with small sample sizes?

Let's apply these concepts to a problem that we should have some familiarity with: predicting genotype frequencies. Let's take a simplified view of the genetic basis for eye color, and assume that there is a single eye color locus with 2 possible alleles. The allele for brown eyes (B) is dominant, and the allele for blue eyes (b) is recessive, such that BB and Bb both result in the brown-eyed phenotype, and only the bb genotype will result in the blue-eyed phenotype. Consider the potential offspring of a blue-eyed parent (bb) and a brown-eyed, heterozygous parent (Bb). Assuming that Mendel was correct about segregation, all of the gametes of the blue-eyed parent should contain the blue-eyed allele, and half of the gametes of the brown-eyed parent should also contain the blue-eyed allele. One approach to determining the probabilities of the offspring phenotypes is to label the b alleles of the blue-eyed parent as b₁ and b₂. To help keep things straight, we will apply the subscript "h" to the alleles of the heterozygous parent: B_h and b_h.

In this case, the possible combinations that will result in a blue-eyed offspring are:

b_hb₁ or b_hb₂

The combinations resulting in a brown-eyed offspring are:

B_hb₁ or B_hb₂

We can denote P{event} as the probability of an event occurring. Applying the product rule, P{B_hb₁} = P{B_h} x P{b₁}, gives us 0.5 x 0.5 = 0.25. Again, applying the product rule, P{B_hb₂} = P{B_h} x P{b₂}, also gives us 0.5 x 0.5 = 0.25. We then apply the sum rule to the separate probabilities for producing a brown-eyed offspring:

P{B_hb₁ or B_hb₂} = P{B_hb₁} + P{B_hb₂}

= 0.25 + 0.25 = 0.5

Theoretically, one can then determine the probability of producing blue eyed offspring as 1 - P{brown eyes}, because those are the only 2 possibilities, but in practice it is better to calculate that probability directly, and make sure that the two probabilities sum to 1 as a check on the calculations.
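The same bookkeeping for the Bb x bb cross can be sketched in code, calculating both phenotype probabilities directly and then confirming that they sum to 1. The allele labels follow the text; the string names and the use of the first letter to detect the dominant allele are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

het_alleles = ["B_h", "b_h"]   # alleles of the heterozygous parent
blue_alleles = ["b_1", "b_2"]  # labeled alleles of the blue-eyed parent

p_brown = Fraction(0)
p_blue = Fraction(0)

for het, blue in product(het_alleles, blue_alleles):
    # Product rule: each allele is passed on with probability 1/2.
    p_combo = Fraction(1, len(het_alleles)) * Fraction(1, len(blue_alleles))
    # Sum rule: add up the combinations that give each phenotype.
    if het.startswith("B"):  # dominant allele present, so brown eyes
        p_brown += p_combo
    else:
        p_blue += p_combo

print(p_brown, p_blue)        # 1/2 1/2
print(p_brown + p_blue == 1)  # True
```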

Once the intermediate steps are understood, the probabilities of the Bb x bb cross could be more easily determined by recognizing that the blue-eyed parent would always donate the b allele, and so the outcome only depends on what the zygote receives from the heterozygous parent. Thus, half of the offspring should be Bb, and half should be bb. There is nothing wrong with this reasoning, but in some instances, taking shortcuts can lead you astray. Perhaps the most famous example is that of the so-called "Monty Hall problem".

Monty Hall was the host of a game show called "Let's Make a Deal". One of the segments of the show had contestants choose between 3 doors. Behind one of the doors was a new car, and behind the other 2 were lesser prizes; let's say a package of Ramen noodles, and a goat. After the contestant had made their choice, Monty would open one of the doors to show the contestant the Ramen noodles, or the goat, and give the contestant the opportunity to change their choice.

The question is: should the contestant stick with their first choice or switch?

In considering your answer, assume the following:

Monty knows where the car is, and will never open that door.

Monty will never open the door that was first chosen.

For all intents and purposes, the first door chosen is a random choice.

Many people will take the shortcut, and assume that because there are now 2 possibilities, the probability of the car being behind the chosen door is 0.5, which means that switching won't improve the odds. In reality, opening one of the doors without the prize does not change the initial probability of having selected the right door, which is 1/3. What opening the door does accomplish is to pool the 2/3 probability that the car is behind a door other than the one chosen at first into a single door. In other words, your chances of winning the car by switching choices are 2/3, and your chances of winning by sticking with your initial choice are 1/3. This means that the best strategy is to switch doors. The 0.5 probability only applies if you ignore your first choice, and randomly select one of the 2 remaining doors. This will result in a success rate of 0.5, but that still is less than the 2/3 rate that would be achieved by switching.

What is most interesting about this problem is that it can be impossible to move someone from the wrong answer (each door gives a 0.5 chance of finding the car) once they have settled upon it. A good description of the phenomenon can be found here. I have encountered this on several occasions. The most convincing (but not always successful) explanation that I have come up with is based upon thinking about how to model the problem. Let's say that as a test, you will always adopt the strategy of not switching, and I will always adopt the strategy of switching. For each trial, we both start with the same randomly chosen door. The choice to switch involves only two doors, and so every time that you lose, I will win, and every time that you win, I will lose. Because staying with your initial choice means that you will lose at a rate of 2 out of every 3 trials, that means that I would win 2 out of every 3 trials.
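The head-to-head test described above is easy to model. The following is a minimal sketch under the three assumptions listed earlier (Monty never opens the car door, never opens the first-chosen door, and the first choice is random); the function names, the number of trials, and the seed are arbitrary choices for the illustration.

```python
import random


def play(rng):
    """Play one game; return (stay_wins, switch_wins) as booleans."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    first = rng.choice(doors)  # the shared, randomly chosen first door
    # Monty opens a door that is neither the first choice nor the car.
    opened = rng.choice([d for d in doors if d != first and d != car])
    # Switching means taking the one door that is neither chosen nor opened.
    switched = next(d for d in doors if d != first and d != opened)
    return first == car, switched == car


def simulate(n_trials=10_000, seed=1):
    rng = random.Random(seed)  # arbitrary seed, used only for repeatability
    stay_wins = 0
    switch_wins = 0
    for _ in range(n_trials):
        stay, switch = play(rng)
        stay_wins += stay
        switch_wins += switch
    return stay_wins, switch_wins


stay_wins, switch_wins = simulate()
print("stay:  ", stay_wins / 10_000)
print("switch:", switch_wins / 10_000)
```

In every trial exactly one of the two strategies wins, so the two counts always add up to the number of trials. The proportions should land close to 1/3 and 2/3, not 0.5 each.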

With that cautionary tale behind us (unless you remain unconvinced, in which case you will obsess over it for a few more hours), let us return to Mendelian genetics, and apply our cautious approach to the following problem:

Question 4: Determine the probabilities for all possible genotypes from a cross of 2 heterozygotes (Bb x Bb) following the methodology described above. Describe your steps in detail (if I suspect that you are using a Punnett square, I will be grading the rest of the assignment angry!).

These types of probabilities are representative of a class of probabilities known as binomial probabilities. The term "binomial" refers to the fact that only 2 outcomes are possible, which means that the 2 separate probabilities must sum to 1. We have already formalized this as:

p + q = 1

To determine the probabilities of various outcomes where a set of binomial probabilities is sampled more than once, we can apply the binomial expansion:

(p + q)^k

Where k is the number of samples, or "draws". You have seen an application of this in the Hardy-Weinberg model of population genetics, where the proportion of 1 allele (let's use B) in a population is represented as p, and the proportion of the alternative allele (let's use b) is represented as q. Each of the offspring is considered to draw 2 alleles from this distribution at random, and so k = 2. The resulting expansion is (p + q)² = p² + 2pq + q². The expected probability of an individual being BB (and therefore the expected proportion of the offspring that will be BB) is p². The expected probability of an individual being Bb is 2pq, and the expected probability of an individual being bb is q². Because (p + q) = 1, (p + q)^k also is equal to 1.
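The terms of the expansion can be generated for any k using binomial coefficients. The sketch below is an illustration only: the function name binomial_terms is made up for the example, and the allele frequency p = 0.7 is a hypothetical value, not one taken from the text.

```python
from fractions import Fraction
from math import comb


def binomial_terms(p, k):
    """Return the k + 1 terms of (p + q)^k, where q = 1 - p."""
    q = 1 - p
    # Term i has (k - i) draws of the first outcome and i of the second.
    return [comb(k, i) * p ** (k - i) * q ** i for i in range(k + 1)]


p = Fraction(7, 10)  # hypothetical frequency of allele B
freq_BB, freq_Bb, freq_bb = binomial_terms(p, 2)

print(freq_BB, freq_Bb, freq_bb)        # 49/100 21/50 9/100
print(freq_BB + freq_Bb + freq_bb == 1)  # True
```

For k = 2 the coefficients are 1, 2, 1, which reproduces p² + 2pq + q², and the terms sum to 1 for any value of k.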

Let's look a little more in depth at binomial probabilities...

Send comments, suggestions, and corrections to: Derek Zelmer