8.1: Introduction to Hypothesis Testing

Colorado Online

Learning Objectives

Explain the logic behind and the process of hypothesis testing. In particular, explain what the p-value is and how it is used to draw conclusions.

The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.

General Idea and Logic of Hypothesis Testing

To start our discussion about the idea behind statistical hypothesis testing, consider the following example:

Example

A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.

There are two opposing claims in this case:

The student’s claim: I did not cheat on the exam.
The instructor’s claim: The student did cheat on the exam.

Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student used in his solution numbers that were given in the other version of the exam.

The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.

What does this example have to do with statistics?

While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.

Statistical hypothesis testing is defined as:

Assessing evidence provided by the data in favor of or against some claim about the population.

Here is how the process of statistical hypothesis testing works:

1. We have two claims about what is going on in the population. Let’s call them for now claim 1 and claim 2. Much like the story above, where the student’s claim is challenged by the instructor’s claim, claim 1 is challenged by claim 2. (Comment: As you’ll see in the examples that follow, these claims are usually about the value of population parameter(s) or about the existence or nonexistence of a relationship between two variables in the population.)
2. We choose a sample, collect relevant data and summarize them (this is similar to the instructor collecting evidence from the student’s exam).
3. We figure out how likely it is to observe data like the data we got, had claim 1 been true. (Note that the wording “how likely …” implies that this step requires some kind of probability calculation.) In the story, the committee members assessed how likely it is to observe evidence like that which the instructor provided, had the student’s claim of not cheating been true.
4. Based on what we found in the previous step, we make our decision:
- If we find that, if claim 1 were true, it would be extremely unlikely to observe the data that we observed, then we have strong evidence against claim 1, and we reject it in favor of claim 2.
- If we find that, if claim 1 were true, observing the data that we observed is not very unlikely, then we do not have enough evidence against claim 1, and therefore we cannot reject it in favor of claim 2.

In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)

Hopefully this example helped you understand the logic behind hypothesis testing. To strengthen your understanding of the process of hypothesis testing and the logic behind it, let’s look at three statistical examples.

Example 1

A recent study estimated that 20% of all college students in the United States smoke. The head of Health Services at Goodheart University suspects that the proportion of smokers may be lower there. In hopes of confirming her claim, the head of Health Services chooses a random sample of 400 Goodheart students, and finds that 70 of them are smokers.

Let’s analyze this example using the 4 steps outlined above:

Stating the claims:

There are two claims here:
- claim 1: The proportion of smokers at Goodheart is .20.
- claim 2: The proportion of smokers at Goodheart is less than .20.
Claim 1 basically says “nothing special goes on in Goodheart University; the proportion of smokers there is no different from the proportion in the entire country.” This claim is challenged by the head of Health Services, who suspects that the proportion of smokers at Goodheart is lower.
Choosing a sample and collecting data:

A sample of n = 400 was chosen, and summarizing the data revealed that the sample proportion of smokers is $\hat{p} = \frac{70}{400} = .175$.

While it is true that .175 is less than .20, it is not clear whether this is strong enough evidence against claim 1.
Assessment of evidence:

In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: How surprising is it to get a sample proportion as low as $\hat{p} = .175$ (or lower), assuming claim 1 is true?

In other words, we need to find how likely it is that in a random sample of size n = 400 taken from a population where the proportion of smokers is p = .20 we’ll get a sample proportion as low as $\hat{p} = .175$ (or lower).

It turns out that the probability that we’ll get a sample proportion as low as $\hat{p} = .175$ (or lower) in such a sample is roughly .106 (do not worry about how this was calculated at this point).
Conclusion:

Well, we found that if claim 1 were true there is a probability of .106 of observing data like that observed.

Now you have to decide …

Do you think that a probability of .106 makes our data rare enough (surprising enough) under claim 1 so that the fact that we did observe it is enough evidence to reject claim 1?

Or do you feel that a probability of .106 means that data like ours are not very likely when claim 1 is true, but not unlikely enough to count as sufficient evidence to reject claim 1?

Basically, this is your decision. However, it would be nice to have some kind of guideline about what is generally considered surprising enough.
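
For readers curious about where the .106 in Example 1 comes from, here is a minimal sketch in Python. It assumes that the sample proportion is approximately normally distributed when claim 1 is true (a standard large-sample approximation that this section has not introduced yet), and the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

p_claim1 = 0.20      # proportion of smokers if claim 1 is true
n = 400              # sample size
p_hat = 70 / n       # observed sample proportion (.175)

# Spread of the sample proportion across random samples, assuming claim 1.
standard_error = sqrt(p_claim1 * (1 - p_claim1) / n)

# How many standard errors the observed proportion lies from .20.
z = (p_hat - p_claim1) / standard_error

# Probability of a sample proportion this low or lower under claim 1
# (normal approximation; an assumption of this sketch).
probability = NormalDist().cdf(z)

print(round(z, 2), round(probability, 3))
```

Under these assumptions the observed proportion sits 1.25 standard errors below .20, and the resulting probability rounds to .106, in line with the figure quoted above.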

Example 2

A certain prescription allergy medicine is supposed to contain an average of 245 parts per million (ppm) of a certain chemical. If the concentration is higher than 245 ppm, the drug will likely cause unpleasant side effects, and if the concentration is below 245 ppm, the drug may be ineffective. The manufacturer wants to check whether the mean concentration in a large shipment is the required 245 ppm or not. To this end, a random sample of 64 portions from the large shipment is tested, and it is found that the sample mean concentration is 250 ppm with a sample standard deviation of 12 ppm. Let’s analyze this example according to the four steps of hypothesis testing outlined above:

Stating the claims:
- Claim 1: The mean concentration in the shipment is the required 245 ppm.
- Claim 2: The mean concentration in the shipment is not the required 245 ppm.
Note that again, claim 1 basically says: “There is nothing unusual about this shipment, the mean concentration is the required 245 ppm.” This claim is challenged by the manufacturer, who wants to check whether that is, indeed, the case or not.
Choosing a sample and collecting data:

A sample of n = 64 portions is chosen and after summarizing the data it is found that the sample mean concentration is $\bar{x} = 250$ and the sample standard deviation is s = 12.

Is the fact that $\bar{x} = 250$ is different from 245 strong enough evidence to reject claim 1 and conclude that the mean concentration in the whole shipment is not the required 245? In other words, do the data provide strong enough evidence to reject claim 1?
Assessing the evidence:
In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves the following question: If the mean concentration in the whole shipment were really the required 245 ppm (i.e., if claim 1 were true), how surprising would it be to observe a sample of 64 portions where the sample mean concentration is off by 5 ppm or more (as we did)? It turns out that it would be extremely unlikely to get such a result if the mean concentration were really the required 245. There is only a probability of .0007 (i.e., 7 in 10,000) of that happening. (Do not worry about how this was calculated at this point.)
Making conclusions:

Here, it is pretty clear that a sample like the one we observed is extremely rare (or extremely unlikely) if the mean concentration in the shipment were really the required 245 ppm. The fact that we did observe such a sample therefore provides strong evidence against claim 1, so we reject it and conclude with very little doubt that the mean concentration in the shipment is not the required 245 ppm.

Do you think that you’re getting it? Let’s make sure, and look at another example.

Example 3

Is there a relationship between gender and combined scores (Math + Verbal) on the SAT exam?

Following a report on the College Board website, which showed that in 2003, males scored generally higher than females on the SAT exam (http://www.collegeboard.com/prod_downloads/about/news_info/cbsenior/yr2003/pdf/2003CBSVM.pdf), an educational researcher wanted to check whether this was also the case in her school district. The researcher chose random samples of 150 males and 150 females from her school district, collected data on their SAT performance and found the following:

| Group | n | mean | standard deviation |
| --- | --- | --- | --- |
| Males | 150 | 1025 | 212 |
| Females | 150 | 1010 | 206 |

Again, let’s see how the process of hypothesis testing works for this example:

Stating the claims:
- Claim 1: Performance on the SAT is not related to gender (males and females score the same).
- Claim 2: Performance on the SAT is related to gender – males score higher.
Note that again, claim 1 basically says: “There is nothing going on between the variables SAT and gender.” Claim 2 represents what the researcher wants to check, or suspects might actually be the case.
Choosing a sample and collecting data:

Data were collected and summarized as given above.

Is the fact that the sample mean score of males (1,025) is higher than the sample mean score of females (1,010) by 15 points strong enough information to reject claim 1 and conclude that in this researcher’s school district, males score higher on the SAT than females?
Assessment of evidence:

In order to assess whether the data provide strong enough evidence against claim 1, we need to ask ourselves: If SAT scores are in fact not related to gender (claim 1 is true), how likely is it to get data like the data we observed, in which the difference between the males’ average and females’ average score is as high as 15 points or higher? It turns out that the probability of observing such a sample result if SAT score is not related to gender is approximately .29 (Again, do not worry about how this was calculated at this point).
Conclusion:

Here, we have an example where observing a sample like the one we observed is definitely not surprising (roughly 30% chance) if claim 1 were true (i.e., if indeed there is no difference in SAT scores between males and females). We therefore conclude that our data do not provide enough evidence for rejecting claim 1.

Comment

Go back and read the conclusion sections of the three examples, and pay attention to the wording. Note that there are two types of conclusions:

“The data provide enough evidence to reject claim 1 and accept claim 2”; or
“The data do not provide enough evidence to reject claim 1.”

In particular, note that in the second type of conclusion we did not say: “I accept claim 1,” but only “I don’t have enough evidence to reject claim 1.” We will come back to this issue later, but this is a good place to make you aware of this subtle difference.

Hopefully by now, you understand the logic behind the statistical hypothesis testing process. Here is a summary:

A flow chart describing the process. First, we state Claim 1 and Claim 2. Claim 1 says "nothing special is going on" and is challenged by claim 2. Second, we collect relevant data and summarize it. Third, we assess how surprising it would be to observe data like that observed if Claim 1 is true. Fourth, we draw conclusions in context.

Learn by Doing

For many years “working full-time” has meant 40 hours per week. Nowadays it seems that corporate employers expect their employees to work more than this amount. A researcher decides to investigate this hypothesis.

Claim 1: The average time full-time corporate employees work per week is 40 hours.
Claim 2: The average time full-time corporate employees work per week is more than 40 hours.

To substantiate his claim, the researcher randomly selects 250 corporate employees and finds that they work an average of 47 hours per week with a standard deviation of 3.2 hours.

According to the Centers for Disease Control and Prevention (CDC), roughly 21.5% of all high-school seniors in the United States have used marijuana. (Comments: The data were collected in 2002. The figure represents those who smoked during the month prior to the survey, so the actual figure might be higher.) A sociologist suspects that the rate among African-American high school seniors is lower, and wants to check that. In this case, then,

Claim 1: The rate of African-American high-school seniors who have used marijuana is 21.5% (same as the overall rate of seniors).
Claim 2: The rate of African-American high-school seniors who have used marijuana is lower than 21.5%.

To check his claim, the sociologist chooses a random sample of 375 African-American high school seniors, and finds that 16.5% of them have used marijuana.

Did I get this?

The most commonly accepted tradition is that college students will study 2 hours outside of class for every hour in class. This means 30 hours/week for a full-time student taking 15 units (hours of class). An educator suspects that this figure is different now than in the past.

Claim 1: The average time full-time college students study outside of class per week is 30 hours.
Claim 2: The average time full-time college students study outside of class per week is not 30 hours.

To substantiate her claim, the educator randomly selects 1,500 college students and finds that they study an average of 27 hours per week with a standard deviation of 1.7 hours.

More Details and Terminology

Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.

Hypothesis testing step 1: Stating the claims.

In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted “H₀”), and Claim 2 plays the role of the alternative hypothesis (denoted “H_a”). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, H_a, usually represents what we want to check or what we suspect is really going on.

Let’s go back to our three examples and apply the new notation:

In example 1:

H₀: The proportion of smokers at Goodheart is .20.
H_a: The proportion of smokers at Goodheart is less than .20.

In example 2:

H₀: The mean concentration in the shipment is the required 245 ppm.
H_a: The mean concentration in the shipment is not the required 245 ppm.

In example 3:

H₀: Performance on the SAT is not related to gender (males and females score the same).
H_a: Performance on the SAT is related to gender – males score higher.

Learn by Doing

According to the Centers for Disease Control and Prevention, the proportion of U.S. adults age 25 or older who smoke is .22. A researcher suspects that the rate is lower among U.S. adults 25 or older who have a bachelor’s degree or higher education level.

A study investigated whether there are differences between the mean IQ level of people who were reared by their biological parents and those who were reared by someone else.

Did I get this?

Data were collected in order to determine whether there is a relationship between a person’s level of education and whether or not the person is a smoker.

Hypothesis testing step 2: Choosing a sample and collecting data.

This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject H₀.

There is, however, one detail that we would like to add here. In this step we collect data and summarize it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion ($\hat{p}$), sample mean ($\bar{x}$) and the sample standard deviation (s).

In practice, you go a step further and use these sample statistics to summarize the data with what’s called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.
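
As a preview only, the sketch below shows one common form a test statistic can take: the distance between the sample statistic and the value claimed by H₀, measured in standard errors. The z-type formulas and the function names are assumptions of this sketch; the specific tests discussed later define the statistics actually used.

```python
from math import sqrt

def z_for_proportion(p_hat, p_null, n):
    """Standardized distance of a sample proportion from the null value."""
    return (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)

def z_for_mean(x_bar, mu_null, s, n):
    """Standardized distance of a sample mean from the null value."""
    return (x_bar - mu_null) / (s / sqrt(n))

# Example 1: 70 smokers out of 400, null proportion .20
z1 = z_for_proportion(70 / 400, 0.20, 400)
# Example 2: sample mean 250 ppm, s = 12 ppm, n = 64, null mean 245 ppm
z2 = z_for_mean(250, 245, 12, 64)

print(round(z1, 2), round(z2, 2))
```

Under these assumptions, the sample proportion in example 1 lies about 1.25 standard errors below .20, while the sample mean in example 2 lies more than three standard errors above 245. A single number of this kind summarizes how far the data fall from what H₀ claims.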

Hypothesis testing step 3: Assessing the evidence.

As we saw, this is the step where we calculate how likely it is to get data like those observed when H₀ is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability. If this probability is very small (see example 2), then it would be very surprising to get data like those observed if H₀ were true. The fact that we did observe such data is therefore evidence against H₀, and we should reject it. On the other hand, if this probability is not very small (see example 3), then observing data like those observed is not very surprising if H₀ were true, so the fact that we observed such data does not provide evidence against H₀. This crucial probability, therefore, has a special name. It is called the p-value of the test.

In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived):

Example 1: p-value = .106
Example 2: p-value = .0007
Example 3: p-value = .29

Obviously, the smaller the p-value, the more surprising it is to get data like ours when H₀ is true, and therefore, the stronger the evidence the data provide against H₀. Looking at the p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against H₀.

Comments:

Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting data like those observed when H₀ is true, it would make sense that the calculation of the p-value will be based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.

It should be noted that in the past, before statistical software was such an integral part of intro stats courses, it was common to use critical values (rather than p-values) in order to assess the evidence provided by the data. While this course focuses on p-values, we will provide some details about the critical values approach later in this module for those students who are interested in learning more about it.

Hypothesis testing step 4: Making conclusions.

Since our conclusion is based on how small the p-value is, or in other words, how surprising our data are when H₀ is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when H₀ is true, for us to conclude that we have enough evidence to reject H₀.

This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α. The most commonly used significance level is α = .05 (or 5%). This means that:

if the p-value < α (usually .05), then the data we got are considered to be “rare (or surprising) enough” when H₀ is true, and we say that the data provide significant evidence against H₀, so we reject H₀ and accept H_a.
if the p-value > α (usually .05), then our data are not considered to be “surprising enough” when H₀ is true, and we say that our data do not provide enough evidence to reject H₀ (or, equivalently, that the data do not provide enough evidence to accept H_a).
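
The rule above can be sketched as a small Python function and applied to the p-values of the three examples. The function name `conclude` is hypothetical, and treating a p-value exactly equal to α as "cannot reject" is a choice made for this sketch.

```python
def conclude(p_value, alpha=0.05):
    """Apply the cutoff rule: reject H0 only when p-value < alpha."""
    if p_value < alpha:
        return "reject H0"
    return "cannot reject H0"

# p-values quoted for the three examples in this section
p_values = {"Example 1": 0.106, "Example 2": 0.0007, "Example 3": 0.29}

for name, p in p_values.items():
    print(name, p, conclude(p))
```

At the .05 level, only example 2 leads to rejecting the null hypothesis; examples 1 and 3 do not provide enough evidence to reject it.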

Important comment about wording.

Another common wording (mostly in scientific journals) is:

“The results are statistically significant” – when the p-value < α.

“The results are not statistically significant” – when the p-value > α.

Comments

Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is .052? You might want to stick to the rules and say “.052 > .05 and therefore I don’t have enough evidence to reject H₀,” but you might decide that .052 is small enough for you to believe that H₀ should be rejected.

It should be noted that scientific journals do treat .05 as the cutoff point: any p-value below the cutoff indicates enough evidence against H₀, and any p-value above it, or even equal to it, indicates there is not enough evidence against H₀.
It is important to draw your conclusions in context. It is never enough to say: “p-value = …, and therefore I have enough evidence to reject H₀ at the .05 significance level.” You should always add: “… and conclude that … (what it means in the context of the problem).”
Let’s go back to the issue of the nature of the two types of conclusions that I can make.

Either I reject H₀ and accept H_a (when the p-value is smaller than the significance level) or I cannot reject H₀ (when the p-value is larger than the significance level).

As we mentioned earlier, note that the second conclusion does not imply that I accept H₀, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject H₀ so I accept it” indicates that the data provide evidence that H₀ is true, which is not necessarily the case. Consider the following slightly artificial yet effective example:

Example

An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following two hypotheses:

H₀: The proportion of male managers hired is .5
H_a: The proportion of male managers hired is more than .5

Data: You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.

Assessing Evidence: If the proportion of male managers hired is really .5 (H₀ is true), then the probability that a random selection of three managers will yield three males is .5 * .5 * .5 = .125. This is the p-value.

Conclusion: Using .05 as the significance level, you conclude that since the p-value = .125 > .05, the fact that the three randomly selected managers were all males is not enough evidence to reject H₀. In other words, you do not have enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy.

However, the data (all three selected are males) definitely do not provide evidence to accept the employer’s claim (H₀).
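
To see why failing to reject says so little here, the sketch below lists the p-value for every possible outcome of a sample of three managers. It assumes the three selections are independent and that each is male with probability .5 when the null hypothesis is true, which is the same assumption behind the .125 above. The helper name is illustrative.

```python
from math import comb

def p_value_at_least(k, n=3, p=0.5):
    """P(k or more males among n hires) when each is male with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# p-value for each possible number of males in a sample of three
for males in range(4):
    print(males, p_value_at_least(males))
```

Even the most extreme outcome, three males out of three, gives a p-value of .125, so with only three managers no result could fall below the .05 cutoff. Not rejecting the null hypothesis therefore reflects how little data there are, not support for the employer's claim.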

Learn by Doing

The following two hypotheses are tested:

H₀: The proportion of U.S. adults who oppose gay marriage is roughly 50%.
H_a: The proportion of U.S. adults who oppose gay marriage is above 50% (i.e., the majority oppose).

Suppose a survey was conducted in which a random sample of 1,100 U.S. adults was asked about their opinions about gay marriage, and based on the data, the p-value was found to be .002.

Comment: Throughout this activity use a .05 (5%) significance level (cutoff).

Did I get this?

The following two hypotheses are tested:

H₀: The average number of miles driven per year is 12,000.
H_a: The average number of miles driven per year is less than 12,000.

In a survey, 1,600 randomly selected drivers were asked the number of miles they drive yearly. Based upon the results, the p-value = .068.

Comment: Throughout this activity use a .05 (5%) significance level.

Let’s summarize

We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:

Did I get this?

Background: Based on the National Center of Health Statistics, the proportion of babies born at low birth weight (below 2,500 grams) in the United States is roughly .078, or 7.8% (based on all the births in the United States in the year 2002). A study was done in order to check whether smoking by pregnant women increases the risk of low birth weight. In other words, the researchers wanted to check whether the proportion of babies born at low birth weight among women who smoked during their pregnancy is higher than the proportion in the general population. The researchers followed a sample of 400 women who had smoked during their pregnancy and recorded the birth weight of the newborns. Based on the data, the p-value was found to be .016.

Did I get this?

The same researchers also wanted to examine whether second-hand smoking (exposure to another person smoking) by pregnant women increases the risk of low birth weight (i.e., the proportion of babies born at a low birth weight among women who were second-hand smokers during their pregnancy is higher than the proportion in the general population). The researchers obtained a sample of 175 pregnant women who were second-hand smokers, followed them during their pregnancies, and found that 10.2% of the newborns had low birth weight. Based on these data, the p-value was found to be .119.

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
