10.3: Matched Pairs
Learning Objectives
- In a given context, carry out the inferential method for comparing groups and draw the appropriate conclusions.
- Specify the null and alternative hypotheses for comparing groups.
Comparing Two Means—Matched Pairs (Paired t-Test)
Overview
We are still in Case C→Q of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative. As we mentioned in the introduction, we introduce three inferential procedures in this case.
So far we have introduced the first procedure—the two-sample t-test that is used when we are comparing two means and the samples are independent. We now move on to the second procedure, where we also compare two means, but the samples are paired or matched. Every observation in one sample is linked with an observation in the other sample. In this case, the samples are dependent.
One of the most common cases where dependent samples occur is when both samples have the same subjects and they are “paired by subject.” In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.
Example
SAT Prep Class
Suppose you want to assess the effectiveness of an SAT prep class. It would make sense to use the matched pairs design and record each sampled student’s SAT score before and after the SAT prep classes are attended:
Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects. In other words, both populations really have the same students. However, each population has a different value of the explanatory variable. Those values are: no prep class, prep class.
This, however, is not the only case where the paired design is used. Other cases are when the pairs are “natural pairs,” such as siblings, twins, or couples. We will present two examples in this part. The first one will be of the type where each subject is measured twice, and the second one will be a study involving twins.
This section on matched pairs design will be organized very much like the previous section on two independent samples. We will first introduce our leading example, and then present the paired t-test illustrating each step using our example. We will then look at another example, and finally talk about estimation using a confidence interval. As usual, you’ll be able to check your understanding along the way, and will learn how to use software to carry out this test.
Example
Drunk Drivers
Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking “I only had 1-2 drinks … I am OK to drive.” A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study:
Comments
-
Note that the categorical explanatory variable here is “drinking 2 beers (Yes/No)”, and the quantitative response variable is the reaction time.
-
Note that by using the matched pairs design in this study (i.e., by measuring each driver twice), the researchers isolated the effect of the two beers on the drivers and eliminated any other confounding factors that might influence the reaction times (such as the driver’s experience, age, etc.).
-
For each driver, the two measurements are the total reaction time before drinking two beers, and after. You can see the data by following these instructions:
To open Excel with the data in the worksheet, click here to download the file to your computer. Then find the downloaded file and double-click it to open it in Excel. When Excel opens, you may have to enable editing.
So far, we have discussed and illustrated cases in which the matched pairs design comes up, and we are now ready to discuss how to carry out the test in this case. We will first present the idea behind the paired t-test, and then go through the four steps in the testing process.
The Paired t-test
Idea
The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module. We will first illustrate this idea using our example, and then more generally.
In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from a problem where we’re comparing two means (i.e., doing inference on μ1−μ2):
to a problem where we are making an inference about a single mean — the mean of the differences:
In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs:
We reduce the two samples to only one by calculating for each pair the difference between the two observations (in the figure we used d1,d2,d3,...,dn to denote the differences).
The paired t-test is based on this one sample of n differences,
and it uses those differences as data for a simple t-test on a single mean — the mean of the differences.
This is the general idea behind the paired t-test; it is nothing more than a regular one-sample t-test for the mean of the differences. We will now go through the 4-step process of the paired t-test.
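The reduction from two paired samples to one sample of differences can be sketched in a few lines of Python. This is only an illustration: the function name `paired_differences` and the reaction times below are invented for the example and are not the study's data.

```python
from statistics import mean, stdev

def paired_differences(first, second):
    """Return one difference per pair: d_i = first_i - second_i."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    return [a - b for a, b in zip(first, second)]

# Hypothetical reaction times in seconds (invented for illustration only)
before = [5.0, 6.0, 7.0, 8.0]
after = [6.0, 6.5, 8.5, 9.0]

diffs = paired_differences(before, after)
n = len(diffs)
mean_d = mean(diffs)   # sample mean of the differences
sd_d = stdev(diffs)    # sample standard deviation of the differences

print(diffs, n, mean_d, sd_d)
```

From this point on, the two original samples are no longer needed: the list `diffs` is the single sample on which the one-sample t-test is carried out.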
Step 1: Stating the hypotheses.
Recall that in the t-test for a single mean our null hypothesis was Ho: μ = μ0 and the alternative was one of Ha: μ < μ0, Ha: μ > μ0, or Ha: μ ≠ μ0. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:
-
Instead of simply μ we use the notation μd to denote that the parameter of interest is the mean of the differences.
-
In this course our null value μ0 is always 0 (although technically, it does not have to be).
Therefore, in the paired t-test:
The null hypothesis is always:
Ho:μd=0
and the alternative is one of :
Ha: μd < 0 (one-sided), Ha: μd > 0 (one-sided), Ha: μd ≠ 0 (two-sided)
depending on the context.
Let’s go back to our example to see how this works and why it makes sense.
Example
Drunk Driving
Recall that in our “Are drivers impaired after drinking two beers?” example, our data was reduced to one sample of differences (one for each driver),
so our problem was reduced to inference about the mean of the differences μd .
As we mentioned, the null hypothesis is:
Ho:μd=0 .
The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers.
In order to decide which of the alternatives is appropriate here we have to think about the context of the problem. Recall that we want to check whether drivers are impaired after drinking two beers. Thus, we want to know whether their reaction times are longer after the two beers. Since the differences were calculated before-after, longer reaction times after the beers would translate into negative differences. These differences are: 6.25 – 6.85, 2.96 – 4.78, etc.
Therefore, the appropriate alternative here is:
Ha:μd<0
indicating that the differences are centered at a negative number.
Comment
Recall that originally, the following figure represented our problem:
Later, we reduced the problem to inference about a single mean, the mean of the differences:
Some students find it helpful to know that μd = μ1 − μ2. In other words, the difference between the means, μ1 − μ2, in the first representation is the same as the mean of the differences, μd, in the second one. It can therefore be easier to first think about the hypotheses in terms of μ1 − μ2 (as we did in the two-sample case) and then represent them in terms of μd.
In our example, since we want to test whether the reaction times in population 1 are shorter, we are testing Ho:μ1−μ2=0 vs. Ha:μ1−μ2<0, which in the matched pairs design notation is translated to Ho:μd=0 vs. Ha:μd<0 .
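The sample version of the identity μd = μ1 − μ2 is easy to check numerically: the mean of the differences equals the difference between the two sample means. The short sketch below uses made-up paired values (not data from the study) purely to illustrate this.

```python
from statistics import mean

# Hypothetical paired measurements (invented for illustration only)
sample_1 = [5.0, 6.0, 7.0, 8.0]   # e.g., before
sample_2 = [6.0, 6.5, 8.5, 9.0]   # e.g., after

diffs = [x1 - x2 for x1, x2 in zip(sample_1, sample_2)]

mean_of_diffs = mean(diffs)                      # sample analogue of mu_d
diff_of_means = mean(sample_1) - mean(sample_2)  # sample analogue of mu_1 - mu_2

print(mean_of_diffs, diff_of_means)
```

Both quantities come out the same, which is why hypotheses written in terms of μ1 − μ2 translate directly into hypotheses about μd.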
Here is another example:
Example
Suppose the effectiveness of a low-carb diet is studied with a matched pairs design, recording each participant’s weight before and after dieting. What would be the appropriate hypotheses in this case?
As before, μd is the mean of the differences (weight before diet)-(weight after diet). In this case, if the diet is effective and participants’ weight after the diet was indeed lower, we would expect the differences to be positive, and therefore the appropriate hypotheses in this case are: Ho:μd=0 vs. Ha:μd>0 .
Step 2: Checking Conditions and Calculating the Test Statistic
The paired t-test, as a special case of a one-sample t-test, can be safely used as long as:
-
The sample of differences is random (or at least can be considered so in context).
-
We are in one of the three situations marked with a green check mark in the following table
In other words, in order to use the paired t-test safely, the differences should vary normally unless the sample size is large, in which case it is safe to use the paired t-test regardless of whether the differences vary normally or not. As we indicated in the figure above (and have seen many times already), in practice, normality is checked by looking at the histogram of differences, and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, normality is assumed. Assuming that we can safely use the paired t-test, the data are summarized by a test statistic:
t = (x̄d − 0) / (sd / √n)
where x̄d is the sample mean of the differences, and sd is the sample standard deviation of the differences. This is the test statistic we’ve developed for the one-sample t-test (with μ0 = 0), and it has the same conceptual interpretation: it measures (in standard errors) how far our data (represented by the average of the differences) are from the null hypothesis (represented by the null value, 0).
Example
Let’s first check whether we can safely proceed with the paired t-test, by checking the two conditions.
-
The sample of drivers was chosen at random.
-
The sample size is not large enough (n = 20), so in order to proceed, we need to look at the histogram of the differences and make sure there is no evidence that the normality assumption is not met. Here is the histogram:
There is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal).
Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis.
The question (which the p-value will answer) is whether these data provide strong enough evidence or not. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us).
Here is the output of the paired t-test for our example:
According to the output, the test statistic is -2.58, indicating that the data (represented by the sample mean of the differences) are 2.58 standard errors below the null hypothesis (represented by the null value, 0). Note in the output that, beyond the test statistic itself, we also highlighted the part of the output that provides the ingredients needed in order to calculate it: n = 20, x̄d = −0.5015, sd = 0.8686. Indeed, t = −0.5015 / (0.8686 / √20) = −2.58.
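The arithmetic behind the test statistic can be reproduced from the three summary values reported in the output (n, the sample mean of the differences, and their standard deviation). The variable names below are our own; only the numbers come from the example.

```python
from math import sqrt

# Summary values reported in the output for the drivers example
n = 20
mean_d = -0.5015   # sample mean of the differences (before - after)
sd_d = 0.8686      # sample standard deviation of the differences

standard_error = sd_d / sqrt(n)
t = (mean_d - 0) / standard_error   # the null value is 0

print(round(t, 2))
```

Rounded to two decimals, the result agrees with the test statistic shown in the software output.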
Step 3: Finding the p-value
As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated. We will let the software find the p-value for us, and in this case, Excel gives us a p-value of 0.009.
The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 1% chance (.009=.9%) of obtaining a test statistic of -2.58 (or lower), assuming that 2 beers have no impact on reaction times.
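For readers who would like to see where such a p-value comes from, here is a rough sketch that approximates the lower-tail area under a t distribution with n – 1 = 19 degrees of freedom by numerical integration (Simpson's rule). This is not how Excel or other packages compute it; the helper names `t_pdf` and `t_lower_tail` are hypothetical, and in practice we would simply rely on the software.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """Approximate P(T <= t) using symmetry and Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # approximates P(0 <= T <= |t|)
    return 0.5 - central if t < 0 else 0.5 + central

# One-sided test Ha: mu_d < 0, so the p-value is the area below t = -2.58
p_value = t_lower_tail(-2.58, df=19)
print(round(p_value, 3))
```

The approximation lands at roughly 0.009, in line with the p-value reported by the software.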
Step 4: Conclusion in Context.
As usual, we draw our conclusion based on the p-value. If the p-value is small, there is a significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho and conclude that the categorical explanatory variable does affect the quantitative response variable as specified in Ha. If the p-value is not small, we do not have enough statistical evidence to reject Ho. In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we do not reject Ho.
In our example, the p-value is .009, indicating that the data provide enough evidence to reject Ho and conclude that drinking two beers does slow the reaction times of drivers, and thus that drivers are impaired after drinking two beers.
Comment
It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider our example. If we had not “caught” that this is a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.057.
Note that using this (wrong) method to analyze the data, and a significance level of .05, we would conclude that the data do not provide enough evidence for us to conclude that drivers are impaired after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.
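To get a feel for why the two methods can disagree, the sketch below computes both test statistics on a small made-up data set (not the drivers' data). The two-sample statistic here uses an unpooled standard error, which is an assumption on our part about the form of the two-sample test.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical reaction times in seconds (invented for illustration only)
before = [5.0, 6.0, 7.0, 8.0]
after = [6.0, 6.5, 8.5, 9.0]
n = len(before)

# Correct analysis: one-sample t statistic on the differences
diffs = [b - a for b, a in zip(before, after)]
t_paired = mean(diffs) / sqrt(variance(diffs) / n)

# Wrong analysis: treat the two samples as independent (unpooled standard error)
se_two_sample = sqrt(variance(before) / n + variance(after) / len(after))
t_two_sample = (mean(before) - mean(after)) / se_two_sample

print(round(t_paired, 2), round(t_two_sample, 2))
```

In this toy data set the subjects differ a lot from one another, but each subject's change is fairly consistent. The paired analysis removes the subject-to-subject variation, so its test statistic is much farther from 0 than the two-sample one.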
The “driving after having 2 beers” example is a case in which observations are paired by subject. In other words, both samples have the same subject, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention. Our next example is another typical type of study where the matched pairs design is used—it is a study involving twins.
Example
Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by “nurture” as opposed to “nature”: that is, are people’s IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait? A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: “Are there significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?”
In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor.
We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization—randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a “twin study.”
Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence). Identical twins raised apart were studied by Susan Farber, who published her studies in the book “Identical Twins Reared Apart” (1981, Basic Books). In this problem, we are going to use the data that appear in Farber’s book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart.
Here is a figure that will help you understand this study:
Here are the important things to note in the figure:
-
We are essentially comparing the mean IQ scores in two populations that are defined by our (two-valued categorical) explanatory variable — upbringing (X), whose two values are: raised by birth parents, raised by someone else.
-
This is a matched pairs design (as opposed to a two independent samples design), since each observation in one sample is linked (matched) with an observation in the second sample. The observations are paired by twins.
To look at the data set, follow these instructions:
To open Excel with the data in the worksheet, click here to download the file to your computer. Then find the downloaded file and double-click it to open it in Excel. When Excel opens you may have to enable editing.
Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let’s carry out the analysis.
-
Stating the hypotheses.
Recall that in matched pairs, we reduce the data from two samples to one sample of differences:
and we state our hypotheses in terms of the mean of the differences, μd.
Since we would like to test whether there are differences in IQ scores between people who were raised by their birth parents and those who weren’t, we are carrying out the two-sided test: Ho: μd = 0 vs. Ha: μd ≠ 0.
Comment:
Again, some students find it easier to first think about the hypotheses in terms of μ1 and μ2, and then write them in terms of μd. In this case, since we are testing for differences between the two populations, the hypotheses will be: Ho: μ1 − μ2 = 0 vs. Ha: μ1 − μ2 ≠ 0,
and since μd=μ1−μ2 we get back to the hypotheses above.
-
Checking conditions and summarizing the data with a test statistic.
Is it safe to use the paired t-test in this case?
-
Clearly, the samples of twins are not random samples from the two populations. However, in this context, they can be considered as random, assuming that there is nothing special about the IQ of a person just because he/she has an identical twin.
-
The sample size here is n = 32. By the n > 30 rule of thumb our sample can be considered large, but it is a borderline case, so to be on the safe side we should look at the histogram of the differences to make sure that we do not see anything extreme. (Comment: Looking at the histogram of differences in every case is useful even if the sample is very large, just in order to get a sense of the data. Recall: “Always look at the data.”)
The data don’t reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ.
From this point we rely on statistical software, and find that:
-
t-value = -1.85
-
p-value = 0.074
Our test statistic is -1.85. Our data (represented by the average of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).
-
Finding the p-value.
The p-value is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that Ho is true (i.e., assuming that there are no significant differences in IQ scores between people who were raised by their natural parents and those who weren’t).
-
Making conclusions.
Using the conventional significance level (cut-off probability) of .05, our p-value is not small enough, and we therefore cannot reject Ho. In other words, our data do not provide enough evidence to conclude that whether a person was raised by his/her natural parents has an impact on the person’s intelligence (as measured by IQ scores).
Comment:
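Note that the p-value of 0.074 is for the two-sided test. Because the test statistic is negative, the p-value for the one-sided alternative Ha: μd < 0 would be half of that: 0.074 / 2 = 0.037. This means that if, based on prior knowledge, prior research, or just a hunch, we had wanted to test the hypothesis that the IQ level of people raised by their birth parents is lower, on average, than the IQ level of people who were raised by someone else, we would have rejected Ho and accepted that hypothesis (at the .05 significance level, since .037 < .05).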
It should be stressed, though, that one should set the hypotheses before looking at the data. It would be ethically wrong to look at the histogram of differences, note that most of the differences are negative, and then decide to carry out the one-sided test that the data seem to support. This is known as “data snooping,” and is considered to be a very bad statistical practice.
Confidence Interval for μd (Paired t Confidence Interval)
So far we’ve discussed the paired t-test, which checks whether there is enough evidence stored in the data to reject the claim that μd=0 in favor of one of the three possible alternatives.
If we would like to estimate μd, the mean of the differences (response 1 – response 2), we can use the natural point estimate, x̄d, the sample mean of the differences, or preferably, use a 95% confidence interval, which will provide us with a set of plausible values for μd.
In particular, if the test has rejected H0:μd=0, a confidence interval for μd can be insightful, since it quantifies the effect that the categorical explanatory variable has on the response variable.
Comment: We will not go into the formula and calculation of the confidence interval, but rather ask our statistical software to do it for us, and focus on interpretation.
Example
Recall our leading example about whether drivers are impaired after having two beers:
which is reduced to inference about a single mean, the mean of the differences (before – after):
The p-value of our test, Ho: μd = 0 vs. Ha: μd < 0, was .009, and we therefore rejected Ho and concluded that the mean difference in total reaction time (before beer – after beer) was negative, or in other words, that drivers are impaired after having two beers. As a follow-up to this conclusion, it would be interesting to quantify the effect that two beers have on the driver, using the 95% confidence interval for μd.
Using statistical software, we find that the 95% confidence interval for μd, the mean of the differences (before – after), is roughly (-.9, -.1).
We can therefore say with 95% confidence that drinking two beers increases the total reaction time of the driver by between .1 and .9 of a second.
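Although we leave the calculation to software, curious readers can check the interval against the standard one-sample t interval, x̄d ± t* · sd / √n, using the summary values from the test output. The critical value t* = 2.093 is taken from a t table for 95% confidence with 19 degrees of freedom; it is not part of the output shown in this section, so treat it as an assumption of this sketch.

```python
from math import sqrt

# Summary values from the paired t-test output for the drivers example
n = 20
mean_d = -0.5015
sd_d = 0.8686

# Assumed t-table critical value: 95% confidence, n - 1 = 19 degrees of freedom
t_star = 2.093

margin = t_star * sd_d / sqrt(n)
lower, upper = mean_d - margin, mean_d + margin

print(round(lower, 1), round(upper, 1))
```

Rounded to one decimal, the endpoints agree with the interval of roughly (-.9, -.1) reported by the software.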
Comment
As we’ve seen in previous tests, as well as in the matched pairs case, the 95% confidence interval for μd can be used for testing in the two-sided case (H0:μd=0 vs. Ha:μd≠0):
If the null value, 0, falls outside the confidence interval, Ho is rejected.
If the null value, 0, falls inside the confidence interval, Ho is not rejected.
Let’s go back to our twin study example, where we found a 95% confidence interval for μd of (-6.11322, 0.30072) and a p-value of 0.074.
We used the fact that the p-value is .074 to conclude that Ho cannot be rejected (at the .05 significance level), and that whether or not a person was raised by his or her birth parents doesn’t necessarily have an effect on intelligence (as measured by IQ scores). The last comment tells us that we can also use the confidence interval to reach the same conclusion, since 0 falls inside the confidence interval for μd. In other words, since 0 is a plausible value for μd, we cannot reject Ho, which claims that μd = 0.
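The link between the interval and the test can also be seen numerically. The sketch below checks whether 0 lies inside the reported interval and then works backwards from the interval to the test statistic. It assumes the interval has the usual form x̄d ± t* · (standard error), with t* = 2.0395 taken from a t table for 95% confidence and 31 degrees of freedom; neither that form nor that value is stated in the output, so this is a consistency check rather than the software's method.

```python
# Reported 95% confidence interval for mu_d in the twin study
ci_lower, ci_upper = -6.11322, 0.30072

# Assumed t-table critical value: 95% confidence, n - 1 = 31 degrees of freedom
t_star = 2.0395

# Two-sided decision at the .05 level: reject Ho only if 0 is outside the interval
zero_inside = ci_lower <= 0 <= ci_upper
reject_h0 = not zero_inside

# Work backwards: the interval is centered at the sample mean of the differences
mean_d = (ci_lower + ci_upper) / 2
standard_error = (ci_upper - ci_lower) / 2 / t_star
t = mean_d / standard_error

print(zero_inside, reject_h0, round(t, 2))
```

Since 0 is inside the interval, Ho is not rejected, and the recovered test statistic is consistent with the t-value of -1.85 reported earlier.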
Let’s summarize
-
The paired t-test is used to compare two population means when the two samples (drawn from the two populations) are dependent in the sense that every observation in one sample can be linked to an observation in the other sample. Such a design is called “matched pairs.”
- The most common case in which the matched pairs design is used is when the same subjects are measured twice, usually before and then after some kind of treatment and/or intervention. Another classic case is a study involving twins.
-
As in the “two independent samples” case, in the background, we have a two-valued categorical explanatory variable whose categories define the two populations we are comparing and whose effect on the response variable we are trying to assess.
-
The idea behind the paired t-test is to reduce the data from two samples to just one sample of the differences, and use these observed differences as data for inference about a single mean — the mean of the differences, μd.
-
The paired t-test is therefore simply a one-sample t-test for the mean of the differences μd, where the null value is 0.
- Once we verify that we can safely proceed with the paired t-test, we use software output to carry it out.
-
A 95% confidence interval for μd can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.