8.1: Introduction to Hypothesis Testing
Learning Objectives
- Explain the logic behind and the process of hypothesis testing. In particular, explain what the p-value is and how it is used to draw conclusions.
The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.
General Idea and Logic of Hypothesis Testing
To start our discussion about the idea behind statistical hypothesis testing, consider the following example:
Example
A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.
There are two opposing claims in this case:
- The student’s claim: I did not cheat on the exam.
- The instructor’s claim: The student did cheat on the exam.
Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student used in his solution numbers that were given in the other version of the exam.
The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.
What does this example have to do with statistics?
While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.
Statistical hypothesis testing is defined as:
Assessing evidence provided by the data in favor of or against some claim about the population.
Here is how the process of statistical hypothesis testing works:
- We have two claims about what is going on in the population. Let’s call them for now claim 1 and claim 2. Much like the story above, where the student’s claim is challenged by the instructor’s claim, claim 1 is challenged by claim 2.
(Comment: as you’ll see in the examples that follow, these claims are usually about the value of population parameter(s) or about the existence or nonexistence of a relationship between two variables in the population).
- We choose a sample, collect relevant data and summarize them (this is similar to the instructor collecting evidence from the student’s exam).
- We figure out how likely it is to observe data like the data we got, had claim 1 been true. (Note that the wording “how likely …” implies that this step requires some kind of probability calculation.) In the story, the committee members assessed how likely it would be to observe evidence like that provided by the instructor, had the student’s claim of not cheating been true.
- Based on what we found in the previous step, we make our decision:
- If we find that if claim 1 were true it would be extremely unlikely to observe the data that we observed, then we have strong evidence against claim 1, and we reject it in favor of claim 2.
- If we find that if claim 1 were true observing the data that we observed is not very unlikely, then we do not have enough evidence against claim 1, and therefore we cannot reject it in favor of claim 2.
In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)
Hopefully this example helped you understand the logic behind hypothesis testing. To strengthen your understanding of the process of hypothesis testing and the logic behind it, let’s look at three statistical examples.
Example 1
A recent study estimated that 20% of all college students in the United States smoke. The head of Health Services at Goodheart University suspects that the proportion of smokers may be lower there. In hopes of confirming her claim, the head of Health Services chooses a random sample of 400 Goodheart students, and finds that 70 of them are smokers.
Let’s analyze this example using the 4 steps outlined above:
- Stating the claims:
There are two claims here:
- claim 1: The proportion of smokers at Goodheart is .20.
- claim 2: The proportion of smokers at Goodheart is less than .20.
Claim 1 basically says “nothing special goes on in Goodheart University; the proportion of smokers there is no different from the proportion in the entire country.” This claim is challenged by the head of Health Services, who suspects that the proportion of smokers at Goodheart is lower.
- Choosing a sample and collecting data:
A sample of n = 400 was chosen, and summarizing the data revealed that the sample proportion of smokers is $\hat{p} = 70/400 = .175$.
While it is true that .175 is less than .20, it is not clear whether this is strong enough evidence against claim 1.
- Assessment of evidence:
In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: How surprising is it to get a sample proportion as low as $\hat{p} = .175$ (or lower), assuming claim 1 is true?
In other words, we need to find how likely it is that in a random sample of size n = 400 taken from a population where the proportion of smokers is p = .20 we’ll get a sample proportion as low as $\hat{p} = .175$ (or lower).
It turns out that the probability that we’ll get a sample proportion as low as $\hat{p} = .175$ (or lower) in such a sample is roughly .106 (do not worry about how this was calculated at this point).
- Conclusion:
Well, we found that if claim 1 were true there is a probability of .106 of observing data like those observed.
Now you have to decide …
Do you think that a probability of .106 makes our data rare enough (surprising enough) under claim 1 so that the fact that we did observe it is enough evidence to reject claim 1?
Or do you feel that a probability of .106 means that data like those we observed are not very likely when claim 1 is true, but not unlikely enough for such data to count as sufficient evidence to reject claim 1?
Basically, this is your decision. However, it would be nice to have some kind of guideline about what is generally considered surprising enough.
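For readers curious where a figure like .106 can come from, here is a minimal sketch. It assumes the probability was obtained with a normal approximation to the sampling distribution of the sample proportion, which the text does not state; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from Example 1.
p_0 = 0.20      # proportion of smokers under claim 1
n = 400         # sample size
smokers = 70    # smokers observed in the sample

p_hat = smokers / n                         # sample proportion, .175
standard_error = sqrt(p_0 * (1 - p_0) / n)  # spread of p_hat if claim 1 is true
z = (p_hat - p_0) / standard_error          # standardized distance from .20

# Probability of a sample proportion this low or lower, assuming claim 1
# is true (normal approximation; an assumption, not the text's stated method).
p_value = NormalDist().cdf(z)

print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, probability = {p_value:.3f}")
```

Under this assumption the result rounds to .106, which agrees with the value quoted in the example. An exact binomial calculation would give a slightly different number.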
Example 2
A certain prescription allergy medicine is supposed to contain an average of 245 parts per million (ppm) of a certain chemical. If the concentration is higher than 245 ppm, the drug will likely cause unpleasant side effects, and if the concentration is below 245 ppm, the drug may be ineffective. The manufacturer wants to check whether the mean concentration in a large shipment is the required 245 ppm or not. To this end, a random sample of 64 portions from the large shipment is tested, and it is found that the sample mean concentration is 250 ppm with a sample standard deviation of 12 ppm. Let’s analyze this example according to the four steps of hypothesis testing we outlined above:
- Stating the claims:
- Claim 1: The mean concentration in the shipment is the required 245 ppm.
- Claim 2: The mean concentration in the shipment is not the required 245 ppm.
Note that again, claim 1 basically says: “There is nothing unusual about this shipment, the mean concentration is the required 245 ppm.” This claim is challenged by the manufacturer, who wants to check whether that is, indeed, the case or not.
- Choosing a sample and collecting data:
A sample of n = 64 portions is chosen and after summarizing the data it is found that the sample mean concentration is $\bar{x} = 250$ and the sample standard deviation is s = 12.
Is the fact that $\bar{x} = 250$ is different from 245 strong enough evidence to reject claim 1 and conclude that the mean concentration in the whole shipment is not the required 245? In other words, do the data provide strong enough evidence to reject claim 1?
- Assessing the evidence:
In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves the following question: If the mean concentration in the whole shipment were really the required 245 ppm (i.e., if claim 1 were true), how surprising would it be to observe a sample of 64 portions where the sample mean concentration is off by 5 ppm or more (as we did)? It turns out that it would be extremely unlikely to get such a result if the mean concentration were really the required 245. There is only a probability of .0007 (i.e., 7 in 10,000) of that happening. (Do not worry about how this was calculated at this point.)
- Making conclusions:
Here, it is pretty clear that a sample like the one we observed is extremely rare (or extremely unlikely) if the mean concentration in the shipment were really the required 245 ppm. The fact that we did observe such a sample therefore provides strong evidence against claim 1, so we reject it and conclude with very little doubt that the mean concentration in the shipment is not the required 245 ppm.
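The text does not say how the probability in this example was obtained. As a rough check, the sketch below assumes a two-sided normal (z) approximation, with the sample standard deviation standing in for the unknown population value; both the method and the variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from Example 2.
mu_0 = 245.0   # required mean concentration under claim 1 (ppm)
x_bar = 250.0  # observed sample mean (ppm)
s = 12.0       # sample standard deviation (ppm)
n = 64         # number of portions tested

standard_error = s / sqrt(n)         # 12 / 8 = 1.5 ppm
z = (x_bar - mu_0) / standard_error  # how many standard errors from 245

# "Off by 5 ppm or more" covers both directions (too high or too low),
# so the upper-tail probability is doubled.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided probability = {p_value:.4f}")
```

Under these assumptions the probability comes out below .001, the same order of magnitude as the figure quoted above. The exact value depends on the method and rounding used, so a small difference from the quoted number is to be expected.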
Do you think that you’re getting it? Let’s make sure, and look at another example.
Example 3
Is there a relationship between gender and combined scores (Math + Verbal) on the SAT exam?
Following a report on the College Board website, which showed that in 2003, males scored generally higher than females on the SAT exam (http://www.collegeboard.com/prod_downloads/about/news_info/cbsenior/yr2003/pdf/2003CBSVM.pdf), an educational researcher wanted to check whether this was also the case in her school district. The researcher chose random samples of 150 males and 150 females from her school district, collected data on their SAT performance and found the following:
Group | n | Sample mean SAT score
---|---|---
Males | 150 | 1,025
Females | 150 | 1,010
Again, let’s see how the process of hypothesis testing works for this example:
- Stating the claims:
- Claim 1: Performance on the SAT is not related to gender (males and females score the same).
- Claim 2: Performance on the SAT is related to gender – males score higher.
Note that again, claim 1 basically says: “There is nothing going on between the variables SAT and gender.” Claim 2 represents what the researcher wants to check, or suspects might actually be the case.
- Choosing a sample and collecting data:
Data were collected and summarized as given above.
Is the fact that the sample mean score of males (1,025) is higher than the sample mean score of females (1,010) by 15 points strong enough evidence to reject claim 1 and conclude that in this researcher’s school district, males score higher on the SAT than females?
- Assessment of evidence:
In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: If SAT scores are in fact not related to gender (claim 1 is true), how likely is it to get data like the data we observed, in which the difference between the males’ average and females’ average score is as high as 15 points or higher? It turns out that the probability of observing such a sample result if SAT score is not related to gender is approximately .29. (Again, do not worry about how this was calculated at this point.)
- Conclusion:
Here, we have an example where observing a sample like the one we observed is definitely not surprising (roughly 30% chance) if claim 1 were true (i.e., if indeed there is no difference in SAT scores between males and females). We therefore conclude that our data do not provide enough evidence for rejecting claim 1.
Comment
Go back and read the conclusion sections of the three examples, and pay attention to the wording. Note that there are two types of conclusions:
- “The data provide enough evidence to reject claim 1 and accept claim 2”; or
- “The data do not provide enough evidence to reject claim 1.”
In particular, note that in the second type of conclusion we did not say: “I accept claim 1,” but only “I don’t have enough evidence to reject claim 1.” We will come back to this issue later, but this is a good place to make you aware of this subtle difference.
Hopefully by now, you understand the logic behind the statistical hypothesis testing process.
More Details and Terminology
Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.
Hypothesis testing step 1: Stating the claims.
In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted “H0”), and Claim 2 plays the role of the alternative hypothesis (denoted “Ha”). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.
Let’s go back to our three examples and apply the new notation:
In example 1:
- H0: The proportion of smokers at Goodheart is .20.
- Ha: The proportion of smokers at Goodheart is less than .20.
In example 2:
- H0: The mean concentration in the shipment is the required 245 ppm.
- Ha: The mean concentration in the shipment is not the required 245 ppm.
In example 3:
- H0: Performance on the SAT is not related to gender (males and females score the same).
- Ha: Performance on the SAT is related to gender – males score higher.
Hypothesis testing step 2: Choosing a sample and collecting data.
This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject H0.
There is, however, one detail that we would like to add here. In this step we collect data and summarize them. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion ($\hat{p}$), the sample mean ($\bar{x}$) and the sample standard deviation (s).
In practice, you go a step further and use these sample statistics to summarize the data with what’s called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.
Hypothesis testing step 3: Assessing the evidence.
As we saw, this is the step where we calculate how likely it is to get data like those observed when H0 is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability. If this probability is very small (see example 2), then it would be very surprising to get data like those observed if H0 were true. The fact that we did observe such data is therefore evidence against H0, and we should reject it. On the other hand, if this probability is not very small (see example 3), then observing data like those observed is not very surprising if H0 were true, so the fact that we observed such data does not provide evidence against H0. This crucial probability, therefore, has a special name. It is called the p-value of the test.
In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived):
- Example 1: p-value = .106
- Example 2: p-value = .0007
- Example 3: p-value = .29
Obviously, the smaller the p-value, the more surprising it is to get data like ours when H0 is true, and therefore, the stronger the evidence the data provide against H0. Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against H0.
Comments:
Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting data like those observed when H0 is true, it would make sense that the calculation of the p-value will be based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.
It should be noted that in the past, before statistical software was such an integral part of intro stats courses, it was common to use critical values (rather than p-values) in order to assess the evidence provided by the data. While this course focuses on p-values, we will provide some details about the critical values approach later in this module for those students who are interested in learning more about it.
Hypothesis testing step 4: Making conclusions.
Since our conclusion is based on how small the p-value is, or in other words, how surprising our data are when H0 is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when H0 is true, for us to conclude that we have enough evidence to reject H0.
This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α. The most commonly used significance level is α = .05 (or 5%). This means that:
- if the p-value < α (usually .05), then the data we got are considered to be “rare (or surprising) enough” when H0 is true, and we say that the data provide significant evidence against H0, so we reject H0 and accept Ha.
- if the p-value > α (usually .05), then our data are not considered to be “surprising enough” when H0 is true, and we say that our data do not provide enough evidence to reject H0 (or, equivalently, that the data do not provide enough evidence to accept Ha).
Important comment about wording.
Another common wording (mostly in scientific journals) is:
“The results are statistically significant” – when the p-value < α.
“The results are not statistically significant” – when the p-value > α.
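The decision rule can be sketched in a few lines of code. This is only an illustration: the function name `conclusion` and the default of α = .05 are choices made here, and the three p-values are the ones quoted for the earlier examples.

```python
def conclusion(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0 (statistically significant)"
    return "do not reject H0 (not statistically significant)"


# p-values quoted in the three examples above.
p_values = {"Example 1": 0.106, "Example 2": 0.0007, "Example 3": 0.29}

for name, p in p_values.items():
    print(f"{name}: p-value = {p} -> {conclusion(p)}")
```

At the .05 level, only Example 2 leads to rejecting H0. Note that the second outcome is worded as "do not reject H0" rather than "accept H0".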
Comments
- Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is .052? You might want to stick to the rules and say “.052 > .05 and therefore I don’t have enough evidence to reject H0”, but you might decide that .052 is small enough for you to believe that H0 should be rejected.
It should be noted that scientific journals do consider .05 to be the cutoff point for which any p-value below the cutoff indicates enough evidence against H0, and any p-value above it, or even equal to it, indicates there is not enough evidence against H0.
- It is important to draw your conclusions in context. It is never enough to say: “p-value = …, and therefore I have enough evidence to reject H0 at the .05 significance level.” You should always add: “… and conclude that … (what it means in the context of the problem).”
- Let’s go back to the issue of the nature of the two types of conclusions that I can make.
Either I reject H0 and accept Ha (when the p-value is smaller than the significance level) or I cannot reject H0 (when the p-value is larger than the significance level).
As we mentioned earlier, note that the second conclusion does not imply that I accept H0, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject H0 so I accept it” suggests that the data provide evidence that H0 is true, which is not necessarily the case. Consider the following slightly artificial yet effective example:
Example
An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following two hypotheses:
- H0: The proportion of male managers hired is .5
- Ha: The proportion of male managers hired is more than .5
Data: You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.
Assessing Evidence: If the proportion of male managers hired is really .5 (H0 is true), then the probability that the random selection of three managers will yield three males is .5 * .5 * .5 = .125. This is the p-value.
Conclusion: Using .05 as the significance level, you conclude that since the p-value = .125 > .05, the fact that the three randomly selected managers were all males is not enough evidence to reject H0. In other words, you do not have enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy.
However, the data (all three selected are males) definitely do not provide evidence to accept the employer’s claim (H0).
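The arithmetic in this example can be checked with a short sketch. It assumes the number of men among the three selected managers follows a binomial distribution when H0 is true; the helper name `upper_tail_binomial` is hypothetical and introduced only for illustration.

```python
from math import comb


def upper_tail_binomial(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


n_sampled = 3       # managers selected at random
men_observed = 3    # all three were men
p_under_h0 = 0.5    # proportion of male hires if H0 is true
alpha = 0.05

# Probability of seeing 3 or more men out of 3 if H0 is true.
p_value = upper_tail_binomial(men_observed, n_sampled, p_under_h0)

print(f"p-value = {p_value}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

Because three men out of three is already the most extreme result such a sample can produce, no sample of this size could give a p-value below .125 under this setup. Failing to reject H0 here reflects how little data there is, not support for the employer's claim.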
Let’s summarize
We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap: