10.3: Matched Pairs

Colorado Online

Learning Objectives

In a given context, carry out the inferential method for comparing groups and draw the appropriate conclusions.
Specify the null and alternative hypotheses for comparing groups.

Comparing Two Means—Matched Pairs (Paired t-Test)

Overview

We are still in Case C→Q of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative. As we mentioned in the introduction, we introduce three inferential procedures in this case.

So far we have introduced the first procedure—the two-sample t-test that is used when we are comparing two means and the samples are independent. We now move on to the second procedure, where we also compare two means, but the samples are paired or matched. Every observation in one sample is linked with an observation in the other sample. In this case, the samples are dependent.

The X variable is a two-valued categorical explanatory variable. Using its categories, we split the population into Population 1 and Population 2, each with its own Y mean, μ_1 and μ_2. From each population we draw a matched-pair SRS of size n.

One of the most common cases where dependent samples occur is when both samples have the same subjects and they are “paired by subject.” In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.

Example

SAT Prep Class

Suppose you want to assess the effectiveness of an SAT prep class. It would make sense to use the matched pairs design and record each sampled student’s SAT score before and after the SAT prep classes are attended:

Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects. In other words, both populations really have the same students. However, each population has a different value of the explanatory variable. Those values are: no prep class, prep class.

This, however, is not the only case where the paired design is used. Other cases are when the pairs are “natural pairs,” such as siblings, twins, or couples. We will present two examples in this part. The first one will be of the type where each subject is measured twice, and the second one will be a study involving twins.

This section on matched pairs design will be organized very much like the previous section on two independent samples. We will first introduce our leading example, and then present the paired t-test illustrating each step using our example. We will then look at another example, and finally talk about estimation using a confidence interval. As usual, you’ll be able to check your understanding along the way, and will learn how to use software to carry out this test.

Example

Drunk Drivers

Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking “I only had 1-2 drinks … I am OK to drive.” A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study:

Comments

Note that the categorical explanatory variable here is “drinking 2 beers (Yes/No)”, and the quantitative response variable is the reaction time.
Note that by using the matched pairs design in this study (i.e., by measuring each driver twice), the researchers isolated the effect of the two beers on the drivers and eliminated any other confounding factors that might influence the reaction times (such as the driver’s experience, age, etc.).
For each driver, the two measurements are the total reaction time before drinking two beers, and after. You can see the data by following these instructions:

To open Excel with the data in the worksheet, right click to download the beers file to your computer. Then find the downloaded file and double-click it to open it in Excel. When Excel opens, you may have to enable editing.

So far, we have discussed and illustrated cases in which the matched pairs design comes up, and we are now ready to discuss how to carry out the test in this case. We will first present the idea behind the paired t-test, and then go through the four steps in the testing process.

The Paired t-test

Idea

The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module. We will first illustrate this idea using our example, and then more generally.

In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from one where we are comparing two means (i.e., doing inference on [latex]\mu _{1}-\mu _{2}[/latex]):

to a problem where we are making an inference about a single mean — the mean of the differences:

The population of drivers is represented by a large circle. We are interested in μ for this population, which represents the mean of the difference in total reaction time (before 2 beers - after 2 beers). We generate a sample of size n = 20, and get 20 differences.

In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs:

A set of matched pairs, numbered 1 through n. The first element in each pair is sample 1 and the second element in each pair is sample 2. The data is presented in a table which has 3 rows, labeled "Pairs," "Sample 1," and "Sample 2."

We reduce the two samples to only one by calculating for each pair the difference between the two observations (in the figure we used d₁, d₂, d₃, …, d_n to denote the differences).

Each pair is reduced to a difference, by calculating sample1 - sample2. This is shown on the table by adding an extra row labeled " differences" and for each column, adding a value in the differences row describing the pair represented by the column.

The paired t-test is based on this one sample of n differences,

We can now ignore the sample 1 and sample 2 data in each pair and instead just focus on the differences.

and it uses those differences as data for a simple t-test on a single mean — the mean of the differences.

This is the general idea behind the paired t-test; it is nothing more than a regular one-sample t-test for the mean of the differences. We will now go through the 4-step process of the paired t-test.
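Before doing so, here is a minimal Python sketch of the reduction just described. The data are made up purely for illustration (they are not the course data), and the function name `paired_t_statistic` is our own choice, not something from the course software.

```python
import math
import statistics

def paired_t_statistic(sample1, sample2):
    """Reduce n pairs to n differences and return (differences, t)."""
    if len(sample1) != len(sample2):
        raise ValueError("matched pairs need two samples of equal length")
    # Step 1: one difference per pair (sample 1 - sample 2).
    diffs = [a - b for a, b in zip(sample1, sample2)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    # Step 2: ordinary one-sample t statistic with null value 0.
    t = (mean_d - 0) / (sd_d / math.sqrt(n))
    return diffs, t

# Hypothetical data: 5 matched pairs.
before = [5.0, 6.0, 7.0, 8.0, 9.0]
after = [6.0, 6.5, 9.0, 8.5, 11.0]
diffs, t = paired_t_statistic(before, after)
print(diffs)
print(round(t, 2))
```

Once the differences are formed, the original two samples play no further role in the calculation.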

Step 1: Stating the hypotheses.

Recall that in the t-test for a single mean our null hypothesis was [latex]H_{0}:\mu = \mu _{0}[/latex] and the alternative was one of [latex]H_{a}:\mu <\mu _{0}[/latex], [latex]H_{a}:\mu >\mu _{0}[/latex], or [latex]H_{a}:\mu \neq \mu _{0}[/latex]. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:

Instead of simply μ we use the notation [latex]\mu _{d}[/latex] to denote that the parameter of interest is the mean of the differences.
In this course our null value [latex]\mu _{0}[/latex] is always 0 (although technically, it does not have to be).

Therefore, in the paired t-test:

The null hypothesis is always:

[latex]H_{0}:\mu _{d}=0[/latex]

and the alternative is one of :

H_a: μ_d < 0 (one-sided), H_a: μ_d > 0 (one-sided), H_a: μ_d ≠ 0 (two-sided)

depending on the context.

Let’s go back to our example to see how this works and why it makes sense.

Example

Drunk Driving

Recall that in our “Are drivers impaired after drinking two beers?” example, our data was reduced to one sample of differences (one for each driver),

A table with the rows "Driver," "Sample 1 (before)," "Sample 2 (after)," and "Differences (before - after)." We only care about the Driver and Differences row.

so our problem was reduced to inference about the mean of the differences $μ_{d}$ .

For the population of all drivers, we are trying to find μ_d, which represents the mean of the difference in total reaction time (before 2 beers - after 2 beers). To do this, we generate a sample from the population. The sample consists of 20 differences.

As we mentioned, the null hypothesis is:

[latex]H_{0}:\mu _{d}=0[/latex].

The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers.

In order to decide which of the alternatives is appropriate here we have to think about the context of the problem. Recall that we want to check whether drivers are impaired after drinking two beers. Thus, we want to know whether their reaction times are longer after the two beers. Since the differences were calculated before-after, longer reaction times after the beers would translate into negative differences. These differences are: 6.25 – 6.85, 2.96 – 4.78, etc.

Therefore, the appropriate alternative here is:

H_a: μ_d < 0

indicating that the differences are centered at a negative number.
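As a small illustration of why the differences come out negative, the sketch below uses only the two (before, after) pairs quoted above. The variable names are ours.

```python
# The two (before, after) pairs quoted in the text, in seconds.
pairs = [(6.25, 6.85), (2.96, 4.78)]

# Differences are computed as before - after, as in the text.
diffs = [round(before - after, 2) for before, after in pairs]
print(diffs)  # [-0.6, -1.82]

# A slower reaction after the beers gives a negative difference,
# which is why the alternative hypothesis is mu_d < 0.
all_negative = all(d < 0 for d in diffs)
print(all_negative)  # True
```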

Comment

Recall that originally, the following figure represented our problem:

Later, we reduced the problem to inference about a single mean, the mean of the differences:

For the population of all drivers, we are trying to find μ_d, which represents the mean of the difference in total reaction time (before 2 beers - after 2 beers). To do this, we generate a sample from the population. The sample consists of 20 differences.

Some students find it helpful to know that [latex]\mu _{d}=\mu _{1}-\mu _{2}[/latex]. In other words, the difference between the means, [latex]\mu _{1}-\mu _{2}[/latex], in the first representation is the same as the mean of the differences, [latex]\mu _{d}[/latex], in the second one. Some students find it easier to first think about the hypotheses in terms of [latex]\mu _{1}-\mu _{2}[/latex] (as we did in the two-sample case) and then represent it in terms of [latex]\mu _{d}[/latex].

In our example, since we want to test whether the reaction times in population 1 are shorter, we are testing

$H_{o} : μ_{1} - μ_{2} = 0$ vs. $H_{a} : μ_{1} - μ_{2} < 0$, which in the matched pairs design notation is translated to $H_{o} : μ_{d} = 0$ vs. $H_{a} : μ_{d} < 0$.
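For readers who want to see why the mean of the differences equals the difference of the means, here is a short justification. It relies only on the fact that the mean of a difference is the difference of the means. The symbols Y_1 and Y_2 (the two responses within a pair) and x_{1i}, x_{2i} (the observed values in pair i) are our own notation, not the course's.

```latex
\mu_{d} = E[Y_{1} - Y_{2}] = E[Y_{1}] - E[Y_{2}] = \mu_{1} - \mu_{2}
\qquad\text{and}\qquad
\overline{x_{d}} = \frac{1}{n}\sum_{i=1}^{n}\left(x_{1i} - x_{2i}\right) = \overline{x_{1}} - \overline{x_{2}}
```

The two approaches share the same center. What differs between the paired and the two-sample analyses is the standard error, not the estimated mean.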

Here is another example:

Example

Suppose the effectiveness of a low-carb diet is studied with a matched pairs design, recording each participant’s weight before and after dieting. What would be the appropriate hypotheses in this case?

As before, $μ_{d}$ is the mean of the differences (weight before diet) - (weight after diet). In this case, if the diet is effective and participants’ weight after the diet was indeed lower, we would expect the differences to be positive, and therefore the appropriate hypotheses in this case are: $H_{o} : μ_{d} = 0$ vs. $H_{a} : μ_{d} > 0$.

Step 2: Checking Conditions and Calculating the Test Statistic

The paired t-test, as a special case of a one-sample t-test, can be safely used as long as:

The sample of differences is random (or at least can be considered so in context).
We are in one of the three situations marked with a green check mark in the following table

In other words, in order to use the paired t-test safely, the differences should vary normally unless the sample size is large, in which case it is safe to use the paired t-test regardless of whether the differences vary normally or not. As we indicated in the figure above (and have seen many times already), in practice, normality is checked by looking at the histogram of differences and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, normality is assumed. Assuming that we can safely use the paired t-test, the data are summarized by a test statistic:

[latex]t=\frac{\overline{x_{d}}-0}{\frac{s_{d}}{\sqrt{n}}}[/latex]

where [latex]\overline{x_{d}}[/latex] is the sample mean of the differences, and s_d is the standard deviation of the differences. This is the test statistic we’ve developed for the one sample t-test (with μ₀ = 0), and has the same conceptual interpretation; it measures (in standard errors) how far our data are (represented by the average of the differences) from the null hypothesis (represented by the null value, 0).

Example

Let’s first check whether we can safely proceed with the paired t-test, by checking the two conditions.

The sample of drivers was chosen at random.
The sample size is not large enough (n = 20), so in order to proceed, we need to look at the histogram of the differences and make sure there is no evidence that the normality assumption is not met. Here is the histogram:

A distribution histogram titled "Histogram of Differences." The vertical axis is labeled "Frequency," and the horizontal axis is labeled "Differences." The histogram is roughly normal shape. The data, given in "Difference: Frequency" format: -2.0: 2 -1.5: 2 -1.0: 4 -0.5: 5 0.0: 3 0.5: 2 1.0: 2

There is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal).

Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis.
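The histogram description above can be tabulated in a few lines of Python. This is only a rough sketch. It treats every difference as if it sat exactly at its bin center, which is an approximation, so the count of "negative" drivers here only covers bins centered below 0.

```python
# Bin centers and frequencies read off the histogram description above.
histogram = {-2.0: 2, -1.5: 2, -1.0: 4, -0.5: 5, 0.0: 3, 0.5: 2, 1.0: 2}

n = sum(histogram.values())
# Approximate mean: treat every difference as sitting at its bin center.
approx_mean = sum(center * freq for center, freq in histogram.items()) / n
# Number of drivers in bins centered below 0.
negative_bins = sum(freq for center, freq in histogram.items() if center < 0)

print(n)              # 20
print(approx_mean)    # -0.525
print(negative_bins)  # 13

# Crude text rendering of the histogram, one '#' per driver.
for center, freq in histogram.items():
    print(f"{center:5.1f} | {'#' * freq}")
```

The approximate mean of about -0.5 is negative, in line with the observation that the differences sit mostly below 0.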

The question (which the p-value will answer) is whether these data provide strong enough evidence or not. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us).

Here is the output of the paired t-test for our example:

n = 20, mean difference = -0.501500, stdev difference = 0.868600, t-value = -2.58

According to the output, the test statistic is -2.58, indicating that the data (represented by the sample mean of the differences) are 2.58 standard errors below the null hypothesis (represented by the null value, 0). Note in the output, that beyond the test statistic itself, we also highlighted the part of the output that provides the ingredients needed in order to calculate it: [latex]n=20, \overline{x_{d}}=-0.5015, s_{d}=0.8686[/latex]. Indeed [latex]\frac{-0.5015}{\frac{0.8686}{\sqrt{20}}}=-2.58[/latex]
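The arithmetic above can be checked directly from the three summary values in the output. The sketch below uses only Python's standard library, and the variable names are ours.

```python
import math

# Summary values taken from the software output above.
n = 20
mean_d = -0.5015   # sample mean of the differences
sd_d = 0.8686      # sample standard deviation of the differences

standard_error = sd_d / math.sqrt(n)
t = (mean_d - 0) / standard_error  # null value is 0

print(round(standard_error, 4))  # 0.1942
print(round(t, 2))               # -2.58
```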

Step 3: Finding the p-value

As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated. We will let the software find the p-value for us, and in this case, Excel gives us a p-value of 0.009.

The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 1% chance (.009=.9%) of obtaining a test statistic of -2.58 (or lower), assuming that 2 beers have no impact on reaction times.
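For readers who would like to see where such a p-value comes from without Excel, here is a rough sketch that integrates the t density numerically (Simpson's rule) using only Python's standard library. The helper names `t_pdf` and `lower_tail_p` are our own, and a statistics package would normally do this calculation for you.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def lower_tail_p(t, df, steps=2000):
    """P(T <= t) for t <= 0, via Simpson's rule on the interval [t, 0]."""
    if t > 0:
        raise ValueError("this sketch only handles t <= 0")
    h = (0 - t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(t, df) + t_pdf(0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    area = total * h / 3  # P(t <= T <= 0)
    return 0.5 - area     # the t distribution is symmetric around 0

# Drunk driving example: t = -2.58 with n - 1 = 19 degrees of freedom.
p = lower_tail_p(-2.58, df=19)
print(round(p, 3))  # 0.009
```

Because the alternative here is H_a: μ_d < 0, the p-value is the area to the left of the test statistic.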

Step 4: Conclusion in Context.

As usual, we draw our conclusion based on the p-value. If the p-value is small, there is a significant difference between what was observed in the sample and what was claimed in H_o, so we reject H_o and conclude that the categorical explanatory variable does affect the quantitative response variable as specified in H_a. If the p-value is not small, we do not have enough statistical evidence to reject H_o. In particular, if a cutoff probability, α (significance level), is specified, we reject H_o if the p-value is less than α. Otherwise, we do not reject H_o.

In our example, the p-value is .009, indicating that the data provide enough evidence to reject H_o and conclude that drinking two beers does slow the reaction times of drivers, and thus that drivers are impaired after drinking two beers.

Comment

It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider our example. If we had not “caught” that this is a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.057.

Note that using this (wrong) method to analyze the data, and a significance level of .05, we would conclude that the data do not provide enough evidence for us to conclude that drivers are impaired after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.
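To see why the two methods can disagree so sharply, consider the sketch below. The reaction times are hypothetical (they are not the course data), chosen so that drivers differ a lot from one another while each driver slows down by a similar amount. The two-sample statistic shown uses the unpooled standard error, which is one common form of the two-sample t statistic.

```python
import math
import statistics

# Hypothetical reaction times (seconds) for 6 drivers; NOT the course data.
before = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
after = [3.4, 4.6, 5.3, 6.7, 7.4, 8.6]
n = len(before)

# Paired analysis: one-sample t statistic on the differences (before - after).
diffs = [b - a for b, a in zip(before, after)]
paired_t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Two-sample analysis (wrong for this design): treats the samples as independent.
se_two_sample = math.sqrt(
    statistics.variance(before) / n + statistics.variance(after) / n
)
two_sample_t = (statistics.mean(before) - statistics.mean(after)) / se_two_sample

print(round(paired_t, 2), round(two_sample_t, 2))
```

Both statistics have the same numerator, a mean difference of -0.5. The paired standard error reflects only how much the differences vary, while the two-sample standard error also absorbs the large driver-to-driver variability, which makes the two-sample statistic much closer to 0.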

The “driving after having 2 beers” example is a case in which observations are paired by subject. In other words, both samples have the same subject, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention. Our next example is another typical type of study where the matched pairs design is used—it is a study involving twins.

Example

Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by “nurture” as opposed to “nature”: that is, are people’s IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait? A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: “Are there significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?”

In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor.

We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization—randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a “twin study.”

Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence). Identical twins raised apart were studied by Susan Farber, who published her studies in the book “Identical Twins Reared Apart” (1981, Basic Books). In this problem, we are going to use the data that appear in Farber’s book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart.

Here is a figure that will help you understand this study:

Here are the important things to note in the figure:

We are essentially comparing the mean IQ scores in two populations that are defined by our (two-valued categorical) explanatory variable — upbringing (X), whose two values are: raised by birth parents, raised by someone else.
This is a matched pairs design (as opposed to a two independent samples design), since each observation in one sample is linked (matched) with an observation in the second sample. The observations are paired by twins.

To look at the data set, follow these instructions:

To open Excel with the data in the worksheet, right click to download the twins file to your computer. Then find the downloaded file and double-click it to open it in Excel. When Excel opens you may have to enable editing.

Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let’s carry out the analysis.

Stating the hypotheses.

Recall that in matched pairs, we reduce the data from two samples to one sample of differences:

and we state our hypotheses in terms of the mean of the differences, $μ_{d}$ .

Since we would like to test whether there are differences in IQ scores between people who were raised by their birth parents and those who weren’t, we are carrying out the two-sided test:

$H_{0} : μ_{d} = 0$ vs. $H_{a} : μ_{d} \neq 0$

Comment:

Again, some students find it easier to first think about the hypotheses in terms of μ₁ and μ₂, and then write them in terms of $μ_{d}$. In this case, since we are testing for differences between the two populations, the hypotheses will be:

$H_{0} : μ_{1} - μ_{2} = 0$ vs. $H_{a} : μ_{1} - μ_{2} \neq 0$

and since $μ_{d} = μ_{1} - μ_{2}$ we get back to the hypotheses above.

Checking conditions and summarizing the data with a test statistic.

Is it safe to use the paired t-test in this case?
1. Clearly, the samples of twins are not random samples from the two populations. However, in this context, they can be considered as random, assuming that there is nothing special about the IQ of a person just because he/she has an identical twin.
2. The sample size here is n = 32. By the n > 30 rule of thumb our sample can be considered large, but it is a borderline case, so to be on the safe side we should look at the histogram of the differences to make sure that we do not see anything extreme. (Comment: Looking at the histogram of differences is useful even if the sample is very large, in order to get a sense of the data. Recall: “Always look at the data.”)
The data don’t reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ.

From this point we rely on statistical software, and find that:
- t-value = -1.85
- p-value = 0.074
Our test statistic is -1.85. Our data (represented by the average of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).

Finding the p-value.

The p-value is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that H_o is true (i.e., assuming that there are no significant differences in IQ scores between people who were raised by their birth parents and those who weren’t).

Making conclusions.

Using the conventional significance level (cut-off probability) of .05, our p-value is not small enough, and we therefore cannot reject H_o. In other words, our data do not provide enough evidence to conclude that whether a person was raised by his/her birth parents has an impact on the person’s intelligence (as measured by IQ scores).

Learn by Doing

Comment:

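Had we carried out the one-sided test with $H_{a} : μ_{d} < 0$ instead, the p-value would have been half of the two-sided p-value, .074/2 = .037, because the test statistic (-1.85) is negative and therefore lies in the direction of this alternative. This means that if, based on prior knowledge, prior research, or just a hunch, we had wanted to test the hypothesis that the IQ level of people raised by their birth parents is lower, on average, than the IQ level of people who were raised by someone else, we would have rejected H_o and accepted that hypothesis (at the .05 significance level, since .037 < .05).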

It should be stressed, though, that one should set the hypotheses before looking at the data. It would be ethically wrong to look at the histogram of differences, note that most of the differences are negative, and then decide to carry out the one-sided test that the data seem to support. This is known as “data snooping,” and is considered to be a very bad statistical practice.

Confidence Interval for μ_d (Paired t Confidence Interval)

So far we’ve discussed the paired t-test, which checks whether there is enough evidence stored in the data to reject the claim that $μ_{d} = 0$ in favor of one of the three possible alternatives.

If we would like to estimate $μ_{d}$ , the mean of the differences (response 1 – response 2), we can use the natural point estimate, [latex]\overline{x_{d}}[/latex], the sample mean of the differences, or preferably, use a 95% confidence interval, which will provide us with a set of plausible values for $μ_{d}$ .

In particular, if the test has rejected $H_{0} : μ_{d} = 0$ , a confidence interval for $μ_{d}$ can be insightful, since it quantifies the effect that the categorical explanatory variable has on the response variable.

Comment: We will not go into the formula and calculation of the confidence interval, but rather ask our statistical software to do it for us, and focus on interpretation.

Example

Recall our leading example about whether drivers are impaired after having two beers:

which is reduced to inference about a single mean, the mean of the differences (before – after):

For the population of all drivers, we are trying to find μ_d, which represents the mean of the difference in total reaction time (before 2 beers - after 2 beers). To do this, we generate a sample from the population. The sample consists of 20 differences.

The p-value of our test, $H_{0} : μ_{d} = 0$ vs. $H_{a} : μ_{d} < 0$, was .009, and we therefore rejected H_o and concluded that the mean difference in total reaction time (before beer – after beer) was negative, or in other words, that drivers are impaired after having two beers. As a follow-up to this conclusion, it would be interesting to quantify the effect that two beers have on the driver, using the 95% confidence interval for $μ_{d}$.

Using statistical software, we find that the 95% confidence interval for $μ_{d}$ , the mean of the differences (before – after), is roughly (-.9, -.1).

We can therefore say with 95% confidence that drinking two beers increases the total reaction time of the driver by between .1 and .9 of a second.
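Although the course leaves this calculation to software, readers curious about where such an interval comes from may find the following sketch useful. It assumes the standard one-sample t interval (sample mean plus or minus a critical value times the standard error) and takes the critical value 2.093 for 19 degrees of freedom from a standard t table. Neither the formula nor the critical value appears in the text above.

```python
import math

# Summary values from the paired t-test output for the drivers.
n = 20
mean_d = -0.5015
sd_d = 0.8686

# Assumption: 95% critical value of the t distribution with n - 1 = 19
# degrees of freedom, taken from a standard t table.
t_star = 2.093

margin = t_star * sd_d / math.sqrt(n)
lower, upper = mean_d - margin, mean_d + margin
print(round(lower, 3), round(upper, 3))  # -0.908 -0.095
```

Rounded to one decimal place, this agrees with the interval of roughly (-.9, -.1) reported above.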

Comment

As we’ve seen in previous tests, as well as in the matched pairs case, the 95% confidence interval for $μ_{d}$ can be used for testing in the two-sided case ( $H_{0} : μ_{d} = 0$ vs. $H_{a} : μ_{d} \neq 0$ ):

If the null value, 0, falls outside the confidence interval, H_o is rejected.

If the null value, 0, falls inside the confidence interval, H_o is not rejected.

Example

Let’s go back to our twin study example, where we found a 95% confidence interval for $μ_{d}$ of (-6.11322, 0.30072) and a p-value of 0.074.

We used the fact that the p-value is .074 to conclude that H_o cannot be rejected (at the .05 significance level), and that whether or not a person was raised by his or her birth parents doesn’t necessarily have an effect on intelligence (as measured by IQ scores). The last comment tells us that we can also use the confidence interval to reach the same conclusion, since 0 falls inside the confidence interval for $μ_{d}$. In other words, since 0 is a plausible value for $μ_{d}$ we cannot reject H_o, which claims that $μ_{d} = 0$.
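The decision rule from the comment above is simple enough to write as a tiny function. This is only an illustrative sketch. The function name is ours, and it applies to the two-sided test at the significance level that matches the interval (.05 for a 95% interval).

```python
def rejects_null_two_sided(ci, null_value=0.0):
    """Return True if the null value falls outside the confidence interval."""
    lower, upper = ci
    return not (lower <= null_value <= upper)

# 95% confidence interval for mu_d from the twin study.
twins_ci = (-6.11322, 0.30072)
print(rejects_null_two_sided(twins_ci))  # False: 0 is inside, so H_o is not rejected
```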

Learn by Doing

A publishing company wanted to test whether typing speed differs when using word processor A or word processor B. A random sample of 25 typists was selected and the typing speeds (in words per minute) were recorded for each typist when using word processor A and then when using word processor B. (Which word processor was used first was determined for each typist by a coin flip.)

Based on the collected data, a 95% confidence interval for μ_d, the mean difference (word processor A - word processor B) was found to be (2.5, 7.8).

The appropriate hypotheses for testing whether the typing speeds differ when using word processor A or word processor B are those of the two-sided test:

H_0: μ_d = 0, H_a: μ_d ≠ 0

Let’s summarize

The paired t-test is used to compare two population means when the two samples (drawn from the two populations) are dependent in the sense that every observation in one sample can be linked to an observation in the other sample. Such a design is called “matched pairs.”
The most common case in which the matched pairs design is used is when the same subjects are measured twice, usually before and then after some kind of treatment and/or intervention. Another classic case is a study involving twins.
As in the “two independent samples” case, in the background, we have a two-valued categorical explanatory variable whose categories define the two populations we are comparing and whose effect on the response variable we are trying to assess.
The idea behind the paired t-test is to reduce the data from two samples to just one sample of the differences, and use these observed differences as data for inference about a single mean — the mean of the differences, $μ_{d}$ .
The paired t-test is therefore simply a one-sample t-test for the mean of the differences $μ_{d}$ , where the null value is 0.
Once we verify that we can safely proceed with the paired t-test, we use software output to carry it out.
A 95% confidence interval for $μ_{d}$ can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
