10.2: Inference for Two Independent Means
Learning Objectives
- In a given context, carry out the inferential method for comparing groups and draw the appropriate conclusions.
- Specify the null and alternative hypotheses for comparing groups.
Comparing Two Means—Two Independent Samples (The Two-Sample t-Test)
Overview
As we mentioned in the summary of the introduction to Case C→Q, the first case that we will deal with is comparing two means when the two samples are independent:
Recall that here we are interested in the effect of a two-valued (k = 2) categorical variable (X) on a quantitative response (Y). Samples are drawn independently from the two sub-populations (defined by the two categories of X), and we need to evaluate whether or not the data provide enough evidence for us to believe that the two sub-population means are different.
In other words, our goal is to test whether the means μ1 and μ2 (the means of the variable of interest in the two sub-populations) are equal. To do that we have two samples, one from each sub-population, chosen independently of each other. The test that we will learn here is commonly known as the two-sample t-test. As the name suggests, this is a t-test, which means that the p-values for this test are calculated under some t distribution.
We first introduce our leading example, and then go in detail through the four steps of the two-sample t-test, illustrating each step using our example.
Note: Up until now, we have been dividing our population into sub-populations, then sampling from these sub-populations.
From now on, instead of calling them sub-populations, we will usually call the groups we wish to compare population 1, population 2, and so on. These two descriptions of the groups we are comparing can be used interchangeably.
Example
What is more important to you — personality or looks?
This question was asked of a random sample of 239 college students, who were to answer on a scale of 1 to 25. An answer of 1 means personality has maximum importance and looks no importance at all, whereas an answer of 25 means looks have maximum importance and personality no importance at all. The purpose of this survey was to examine whether males and females differ with respect to the importance of looks vs. personality.
Note that the data have the following format:
The format of the data reminds us that we are essentially examining the relationship between the two-valued categorical variable, gender, and the quantitative response, score. The two values of the categorical explanatory variable define the two populations that we are comparing — males and females. The comparison is with respect to the response variable score. Here is a figure that summarizes the example:
Comments:
1. Note that this figure emphasizes that, because our explanatory variable is a two-valued categorical variable, in practice we are comparing two populations (defined by these two values) with respect to our response Y.
2. Note that even though the problem description just says that we had 239 students, the figure tells us that there were 85 males in the sample, and 150 females.
3. Following up on comment 2, note that 85 + 150 = 235 and not 239. In these data (which are real) there are four “missing observations”: 4 students for whom we do not have the value of the response variable, “importance.” This could be due to a number of reasons, such as recording error or nonresponse. The bottom line is that even though data were collected from 239 students, effectively we have data from only 235. (Recommended: Go through the data file and note that there are 4 cases of missing observations: students 34, 138, 179, and 183).
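To make the handling of missing observations concrete, here is a minimal Python sketch. The rows below are hypothetical toy values, not the survey data, and the function name `summarize` is our own. It skips students with no recorded score and reports the size, mean, and standard deviation of each group, which are the quantities the two-sample t-test needs.

```python
from statistics import mean, stdev

# Hypothetical toy rows (NOT the survey data): (gender, score); None marks a missing score.
rows = [
    ("female", 9), ("male", 14), ("female", None),
    ("male", 12), ("female", 11), ("male", None), ("female", 10),
]

def summarize(rows):
    """Group the non-missing scores by gender and return n, mean, and SD per group."""
    groups = {}
    for gender, score in rows:
        if score is None:  # skip missing observations
            continue
        groups.setdefault(gender, []).append(score)
    return {g: {"n": len(v), "mean": mean(v), "sd": stdev(v)} for g, v in groups.items()}

summary = summarize(rows)
n_missing = sum(1 for _, score in rows if score is None)
print(summary)
print("missing observations:", n_missing)
```

In this toy example, 7 rows were collected but only 5 contribute to the group summaries, just as 239 students effectively become 235 in the real data.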
The Two-Sample t-Test
Here again is the general situation which requires us to use the two-sample t-test:
Our goal is to compare the means μ1 and μ2 based on the two independent samples.
Step 1: Stating the Hypotheses
The hypotheses represent our goal, comparing the means μ1 and μ2.
- The null hypothesis has the form:
  - Ho: μ1−μ2=0 (which is the same as Ho: μ1=μ2)
- The alternative hypothesis takes one of the following three forms (depending on the context):
  - Ha: μ1−μ2<0 (which is the same as Ha: μ1<μ2) (one-sided)
  - Ha: μ1−μ2>0 (which is the same as Ha: μ1>μ2) (one-sided)
  - Ha: μ1−μ2≠0 (which is the same as Ha: μ1≠μ2) (two-sided)
Note that the null hypothesis claims that there is no difference between the means, which can be represented either as the difference being 0 (no difference) or as its (algebraically and conceptually) equivalent, μ1=μ2 (the means are equal). Either way, conceptually, Ho claims that there is no relationship between the two relevant variables.
The first way of writing the hypotheses (using a difference between the means) will be easier to use when (in the future) we look for a difference that is not 0.
Each one of the three alternatives claims that there is a difference between the means. The two one-sided alternatives specify the nature of the difference; either negative, indicating that μ1 is smaller than μ2, or positive, indicating that μ1 is larger than μ2. The two-sided alternative, as usual, is more general and simply claims that a difference exists. As before, it should be clear from the context of the problem which of the three alternatives is appropriate.
Comment
Note that our parameter of interest in this case (the parameter about which we are making an inference) is the difference between the means, μ1−μ2, and that the null value is 0.
Example
Recall that the purpose of this survey was to examine whether the opinions of females and males differ with respect to the importance of looks vs. personality. The hypotheses in this case are therefore:
Ho: μ1−μ2=0 vs. Ha: μ1−μ2≠0
where μ1 represents the mean importance for females and μ2 represents the mean importance for males.
It is important to understand that conceptually, the two hypotheses claim:
Ho: Score (of looks vs. personality) is not related to gender
Ha: Score (of looks vs. personality) is related to gender
Exercise
In order to check the claim that the pregnancy length of women who smoke during pregnancy is shorter, on average, than the pregnancy length of women who do not smoke, a random sample of 35 pregnant women who smoke and a random sample of 35 pregnant women who do not smoke were chosen and their pregnancy lengths were recorded. Here is a figure of this example:
The null hypothesis in this case is ___ (choose one: mu1 − mu2 = 0; mu1 − mu2 < 0; mu1 − mu2 > 0; mu1 − mu2 not equal to 0), which claims that pregnancy length ___ (is / is not) related to (or affected by) whether or not the woman smokes during pregnancy.
The alternative hypothesis in this case is ___ (choose one: mu1 − mu2 = 0; mu1 − mu2 < 0; mu1 − mu2 > 0; mu1 − mu2 not equal to 0), which claims that pregnancy length ___ (is / is not) related to (or affected by) whether or not the woman smokes during pregnancy.
Note that “mu” stands for the Greek letter μ, the population mean—mu1 stands for the mean of population 1 and mu2 stands for the mean of population 2.
Step 2: Check Conditions, and Summarize the Data Using a Test Statistic
The two-sample t-test can be safely used as long as the following conditions are met:
- The two samples are indeed independent.
- We are in one of the following two scenarios:
  - Both populations are normal, or more specifically, the distribution of the response Y in both populations is normal, and both samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
  - The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that > 30 is considered large enough).
Assuming that we can safely use the two-sample t-test, we need to summarize the data, and in particular, calculate our data summary—the test statistic.
The two-sample t-test statistic is:
$$t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Where:
$\bar{y}_1$, $\bar{y}_2$ are the sample means of the samples from population 1 and population 2 respectively.
$s_1$, $s_2$ are the sample standard deviations of the samples from population 1 and population 2 respectively.
$n_1$, $n_2$ are the sample sizes of the two samples.
Comment
Let’s see why this test statistic makes sense, bearing in mind that our inference is about μ1−μ2.
- $\bar{y}_1$ estimates μ1 and $\bar{y}_2$ estimates μ2, and therefore $\bar{y}_1 - \bar{y}_2$ is what the data tell me about (or, how the data estimate) μ1−μ2.
- 0 is the “null value”: what the null hypothesis, Ho, claims that μ1−μ2 is.
- The denominator $\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ is the standard error of $\bar{y}_1 - \bar{y}_2$. (We will not go into the details of why this is true.)
We therefore see that our test statistic, like the previous test statistics we encountered, has the structure:
$$\frac{\text{sample estimate} - \text{null value}}{\text{standard error}}$$
and therefore, like the previous test statistics, measures (in standard errors) the difference between what the data tell us about the parameter of interest μ1−μ2 (sample estimate) and what the null hypothesis claims the value of the parameter is (null value).
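For readers who want a brief justification of the standard error in the denominator, here is a short sketch rather than a full derivation. The symbols σ1 and σ2 (the population standard deviations) are our notation and do not appear elsewhere in this section. Because the two samples are independent, the variances of the two sample means add, and replacing the unknown σ's with the sample standard deviations gives the standard error:

```latex
\operatorname{Var}(\bar{y}_1 - \bar{y}_2)
  = \operatorname{Var}(\bar{y}_1) + \operatorname{Var}(\bar{y}_2)
  = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
\quad\Longrightarrow\quad
\mathrm{SE}(\bar{y}_1 - \bar{y}_2)
  = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
```

Note that the variances add even though the means are subtracted, which is why independence of the two samples is the first condition to check.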
Example
Let’s first check whether the conditions that allow us to safely use the two-sample t-test are met.
- Here, 239 students were chosen and were naturally divided into a sample of females and a sample of males. Since the students were chosen at random, the sample of females is independent of the sample of males.
- Here we are in the second scenario: the sample sizes (150 and 85) are definitely large enough, and so we can proceed regardless of whether the populations are normal or not.
In order to avoid tedious calculations, we will lift the test statistic from the output. The StatCrunch output (edited) is shown below:
As you can see we highlighted the “ingredients” needed to calculate the test statistic, as well as the test statistic itself. Just for this first example, let’s make sure that we understand what these ingredients are and how to use them to find the test statistic.
And when we put it all together we get that indeed,
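$$t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{10.73 - 13.33}{\sqrt{\dfrac{4.25^2}{150} + \dfrac{4.02^2}{85}}} = -4.66$$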
The test statistic tells us what the data tell us about μ1−μ2. In this case that difference (10.73 – 13.33) is 4.66 standard errors below what the null hypothesis claims this difference to be (0). 4.66 standard errors is quite a lot and probably indicates that the data provide evidence against Ho.
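The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `two_sample_t` is our own, and the inputs are the rounded summary statistics quoted in the text, so the result may differ from the software output in the second decimal place.

```python
from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2, null_value=0.0):
    """Return (t, standard_error) for the two-sample t statistic defined above."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)       # standard error of the difference
    t = ((mean1 - mean2) - null_value) / se    # (estimate - null value) / SE
    return t, se

# Rounded summary statistics from the text: females are group 1, males are group 2.
t_stat, se = two_sample_t(10.73, 4.25, 150, 13.33, 4.02, 85)
print(f"standard error = {se:.3f}, t = {t_stat:.2f}")
```

The standard error comes out at about 0.557, and the t value agrees with the −4.66 reported by the software up to rounding of the inputs.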
We have completed step 2 and are ready to proceed to step 3, finding the p-value of the test.
Step 3: Finding the p-value of the test
Since our test is called the two-sample t-test, we know that the p-values are calculated under a t distribution. Indeed, it turns out that the null distribution of our test statistic is approximately t. Figuring out which one of the t distributions applies (in other words, how many degrees of freedom this t distribution has) is quite involved and will not be discussed here. Instead, we use a statistics package to find the p-value.
Example
Here, again is the relevant output for our example:
According to the output the p-value of this test is less than 0.0001. How do we interpret this?
A p-value which is practically 0 means that it would be almost impossible to get data like that observed (or even more extreme) had the null hypothesis been true.
More specifically to our example, if there were no differences between females and males with respect to whether they value looks vs. personality, it would be almost impossible (probability approximately 0) to get data where the difference between the sample means of females and males is -2.596 (that difference is 10.733 – 13.329 = -2.596) or even farther from 0.
Comment: Note that the output tells us that $\bar{y}_1 - \bar{y}_2$ is approximately -2.6. But more importantly, we want to know if this difference is significant. To answer this, we use the fact that this difference is 4.66 standard errors below the null value.
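For readers curious about what the software does behind the scenes, here is a hedged Python sketch. It assumes the degrees of freedom come from the Welch-Satterthwaite approximation, which is one common choice for this form of the test statistic; the text does not say which formula the package uses, so treat that as our assumption. The tail area of the t distribution is computed by simple numerical integration, and all function names are our own.

```python
from math import exp, lgamma, log, pi, sqrt

def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (an assumed choice)."""
    a, b = sd1**2 / n1, sd2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(x, df, width=60.0, steps=20000):
    """Approximate P(T > x) with Simpson's rule on [x, x + width]."""
    h = width / steps
    total = t_pdf(x, df) + t_pdf(x + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(x + i * h, df)
    return total * h / 3

# Summary statistics quoted in the text: females are group 1, males are group 2.
se = sqrt(4.25**2 / 150 + 4.02**2 / 85)
t_stat = (10.733 - 13.329) / se
df = welch_df(4.25, 150, 4.02, 85)
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)  # two tails for the two-sided Ha
print(f"df = {df:.1f}, t = {t_stat:.2f}, two-sided p = {p_two_sided:.1e}")
```

Under these assumptions the p-value is far below 0.0001, consistent with the output. In practice a statistics package does all of this in one call.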
Step 4: Conclusion in context
As usual, a small p-value provides evidence against Ho. In our case our p-value is practically 0 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it and conclude that the mean importance score (of looks vs. personality) of males differs from that of females. In other words, males and females differ with respect to how they value looks vs. personality.
Comments
You might ask yourself: “Where do we use the test statistic?”
It is true that for all practical purposes all we have to do is check that the conditions which allow us to use the two-sample t-test are met, lift the p-value from the output, and draw our conclusions accordingly.
However, we feel that it is important to mention the test statistic for two reasons:
- The test statistic is what’s behind the scenes; based on its null distribution and its value, the p-value is calculated.
- Apart from being the key for calculating the p-value, the test statistic is also itself a measure of the evidence stored in the data against Ho. As we mentioned, it measures (in standard errors) how different our data are from what is claimed in the null hypothesis.
Let’s look at another example, and then you’ll do one yourself.
In the National Health and Nutrition Examination Survey (NHANES), sponsored by the U.S. government, a random sample of 712 males between 20 and 29 years of age and a random sample of 1,001 males over the age of 75 were chosen, and the weight of each of the males was recorded (in kg). Here is a summary of the results (source: http://www.cdc.gov/nchs/data/ad/ad347.pdf):
Do the data provide evidence that the younger male population weighs more (on average) than the older male population? (Note that here the data are given in a summarized form, unlike the previous problem, where the raw data were given.)
Here is a figure that summarizes this example:
Note that we defined the younger age group and the older age group as population 1 and population 2, respectively, and μ1 and μ2 as the mean weight of population 1 and population 2, respectively.
Step 1:
Since we want to test whether the older age group (population 2) weighs less on average than the younger age group (population 1), we are testing:
Ho: μ1−μ2=0 vs. Ha: μ1−μ2>0
or equivalently,
Ho: μ1=μ2 vs. Ha: μ1>μ2
Step 2:
We can safely use the two-sample t-test in this case since:
- The samples are independent, since each of the samples was chosen at random.
- Both sample sizes are very large (712 and 1,001), and therefore we can proceed regardless of whether the populations are normal or not.
It is possible from these data to calculate the t-statistic of 5.31 and the p-value of 0.000. The t-value is quite large, and the p-value correspondingly small, indicating that our data are very different from what is claimed in the null hypothesis.
Step 3:
The p-value is essentially 0, indicating that it would be nearly impossible to observe a difference between the sample mean weights of 4.9 (or more) if the mean weights in the age group populations were the same (i.e., if Ho were true).
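To see why the reported p-value is 0.000, here is a rough Python check. It is a sketch under one assumption: with samples this large, the t distribution is very close to the standard normal, so we approximate the one-sided p-value by the normal tail area to the right of t = 5.31. The function name `normal_upper_tail` is our own.

```python
from math import erfc, sqrt

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

# t = 5.31 is the value quoted in the text; Ha is one-sided (greater than).
p_one_sided = normal_upper_tail(5.31)
print(f"approximate one-sided p-value: {p_one_sided:.1e}")
print(f"rounded to three decimals: {p_one_sided:.3f}")
```

The approximate p-value is positive but extremely small, so it displays as 0.000 when rounded to three decimal places.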
Step 4:
A p-value of 0 (or very close to it) indicates that the data provide strong evidence against Ho, so we reject it and conclude that the mean weight of males 20-29 years old is higher than the mean weight of males 75 years old and older. In other words, males in the younger age group weigh more, on average, than males in the older age group.
Confidence Interval for μ1−μ2 (Two-Sample t Confidence Interval)
So far we’ve discussed the two-sample t-test, which checks whether there is enough evidence stored in the data to reject the claim that μ1−μ2=0 (or equivalently, that μ1=μ2) in favor of one of the three possible alternatives.
If we would like to estimate μ1−μ2 we can use the natural point estimate, $\bar{y}_1 - \bar{y}_2$, or preferably, a 95% confidence interval which will provide us with a set of plausible values for the difference between the population means μ1−μ2.
In particular, if the test has rejected Ho: μ1−μ2=0, a confidence interval for μ1−μ2 can be insightful since it quantifies the effect that the categorical explanatory variable has on the response.
Comment
We will not go into the formula and calculation of the confidence interval, but rather ask our software to do it for us, and focus on interpretation.
Example
Recall our leading example about the looks vs. personality score of females and males:
Here again is the output:
Recall that we rejected the null hypothesis in favor of the two-sided alternative and concluded that the mean score of females is different from the mean score of males. It would be interesting to supplement this conclusion with more details about this difference between the means, and the 95% confidence interval for μ1−μ2 does exactly that.
According to the output, the 95% confidence interval for μ1−μ2 is roughly (-3.7, -1.5). First, note that the confidence interval is entirely negative, suggesting that μ1 is lower than μ2. Furthermore, the confidence interval tells us that we are 95% confident that the mean “looks vs. personality score” of females (μ1) is between 1.5 and 3.7 points lower than the mean looks vs. personality score of males (μ2). The confidence interval therefore quantifies the effect that the explanatory variable (gender) has on the response (looks vs. personality score).
Comment
As in the previous tests we’ve seen, in the two-sample case the 95% confidence interval for μ1−μ2 can be used for testing in the two-sided case (Ho: μ1−μ2=0 vs. Ha: μ1−μ2≠0) at the 0.05 significance level:
If the null value, 0, falls outside the confidence interval, Ho is rejected.
If the null value, 0, falls inside the confidence interval, Ho is not rejected.
Example
Let’s go back to our leading example of the looks vs. personality score where we had a two-sided test.
We used the fact that the p-value is so small to conclude that Ho can be rejected. We can also use the confidence interval to reach the same conclusion since 0 falls outside the confidence interval. In other words, since 0 is not a plausible value for μ1−μ2 we can reject Ho, which claims that μ1−μ2=0.
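Although we rely on software for the confidence interval, a rough version is easy to sketch in Python. The assumption here is ours: we use the usual form (estimate ± critical value × standard error) with a normal critical value (about 1.96) as a large-sample stand-in for the t critical value that software would use, so the endpoints are approximate. The function name `approx_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Large-sample confidence interval for mu1 - mu2 (normal critical value)."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

# Summary statistics quoted in the text: females are group 1, males are group 2.
low, high = approx_ci(10.733, 4.25, 150, 13.329, 4.02, 85)
reject_h0 = not (low <= 0 <= high)              # two-sided test via the interval
print(f"approximate 95% CI: ({low:.1f}, {high:.1f}); reject Ho: {reject_h0}")
```

The interval agrees with the (-3.7, -1.5) reported by the software after rounding, and since 0 lies outside it, the interval-based rule also rejects Ho.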
Let’s Summarize
We have completed our discussion of the two-sample t-test for comparing two populations’ means when the samples are independent. Let’s summarize what we have learned.
- The two-sample t-test is used for comparing the means of a quantitative variable (Y) in two populations (which we initially called sub-populations).
- Our goal is comparing μ1 and μ2 (which in practice is done by making inference on the difference μ1 – μ2). The null hypothesis is
- Ho: μ1 – μ2 = 0
and the alternative hypothesis is one of the following (depending on the context of the problem):
- Ha: μ1 – μ2 < 0
- Ha: μ1 – μ2 > 0
- Ha: μ1 – μ2 ≠ 0
- The two-sample t-test can be safely used when the samples are independent and at least one of the following two conditions holds:
- The variable Y is known to have a normal distribution in both populations
- The two sample sizes are large.
When the sample sizes are not large (and we therefore need to check the normality of Y in both populations), what we do in practice is look at the histograms of the two samples and make sure that there are no signs of non-normality such as extreme skewness and/or outliers.
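- The test statistic is as follows and has an approximate t distribution when the null hypothesis is true:
$$t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$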
- P-values are obtained from the output, and conclusions are drawn as usual, comparing the p-value to the significance level alpha.
- If Ho is rejected, a 95% confidence interval for μ1 – μ2 can be very insightful and can also be used for the two-sided test.