9.3: Confidence Intervals for Proportions
Learning Objectives
- Explain what a confidence interval represents and determine how changes in sample size and confidence level affect the precision of the confidence interval.
- Find confidence intervals for the population mean and the population proportion (when certain conditions are met), and perform sample size calculations.
Overview
As we mentioned in the introduction to this module, when the variable that we’re interested in studying in the population is categorical, the parameter we are trying to make inferences about is the population proportion (p) associated with that variable. We also learned that the point estimator for the population proportion p is the sample proportion p̂.
To refresh your memory, here is a picture that summarizes an example we looked at.
We are now moving on to interval estimation of p. In other words, we would like to develop a set of intervals that, with different levels of confidence, will capture the value of p. We’ve actually done all the groundwork and discussed all the big ideas of interval estimation when we talked about interval estimation for μ, so we’ll be able to go through it much faster. Let’s begin.
Recall that the general form of any confidence interval for an unknown parameter is:
estimate ± margin of error
Since the unknown parameter here is the population proportion p, the point estimator (as I reminded you above) is the sample proportion p̂. The confidence interval for p, therefore, has the form:
[latex]\hat{\mathcal{p}}\pm\mathcal{m}[/latex]
(Recall that m is the notation for the margin of error.) The margin of error (m) tells us, with a certain level of confidence, the maximum estimation error we are making; in other words, p̂ differs from p (the parameter it estimates) by no more than m units.
From our previous discussion on confidence intervals, we also know that the margin of error is the product of two components:
m=confidence multiplier ⋅ SD of the estimator
To figure out what these two components are, we need to go back to a result we obtained in the Sampling Distributions module of the Probability unit about the sampling distribution of p̂. We found that under certain conditions (which we’ll come back to later), p̂ has a normal distribution with mean p, and standard deviation [latex]\sqrt{\frac{\mathcal{p}\left(1-\mathcal{p}\right)}{\mathcal{n}}}[/latex]. This result makes things very simple for us, because it reveals what the two components are that the margin of error is made of:
* Since, like the sampling distribution of X̄, the sampling distribution of p̂ is normal, the confidence multipliers that we’ll use in the confidence interval for p will be the same z* multipliers we use for the confidence interval for μ when σ is known (using exactly the same reasoning and the same probability results). The multipliers we’ll use, then, are: 1.645, 2, and 2.576 at the 90%, 95% and 99% confidence levels, respectively.
* The standard deviation of our estimator p̂ is [latex]\sqrt{\frac{\mathcal{p}\left(1-\mathcal{p}\right)}{\mathcal{n}}}[/latex]
Putting it all together, we find that the confidence interval for p should be: [latex]\hat{\mathcal{p}}\pm\mathcal{z}*\ \bullet\sqrt{\frac{\mathcal{p}\left(1-\mathcal{p}\right)}{\mathcal{n}}}[/latex]. We just have to solve one practical problem and we’re done. We’re trying to estimate the unknown population proportion p, so having it appear in the confidence interval doesn’t make any sense. To overcome this problem, we’ll do the obvious thing…
We’ll replace p with its sample counterpart, p̂, and work with the standard error of p̂, [latex]\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex]
Now we’re done. The confidence interval for the population proportion p is:
[latex]\hat{\mathcal{p}}\pm\mathcal{z}*\ \bullet\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex]
As you’ll see from the examples we’ll present in this unit, estimating the population proportion comes up a lot in the context of polls.
Example
The drug Viagra became available in the U.S. in May, 1998, in the wake of an advertising campaign that was unprecedented in scope and intensity. A Gallup poll found that by the end of the first week in May, 643 out of a random sample of 1,005 adults were aware that Viagra was an impotency medication (based on “Viagra A Popular Hit,” a Gallup poll analysis by Lydia Saad, May 1998).
Let’s estimate the proportion p of all adults in the U.S. who by the end of the first week of May 1998 were already aware of Viagra and its purpose by setting up a 95% confidence interval for p.
We first need to calculate the sample proportion p̂. Out of 1,005 sampled adults, 643 knew what Viagra is used for, so [latex]\hat{\mathcal{p}}=\frac{643}{1005}=.64[/latex]
Therefore,
A 95% confidence interval for p is [latex]\hat{\mathcal{p}}\pm2\ \bullet\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}=.64\pm2\sqrt{\frac{.64\left(1-.64\right)}{1005}}=.64\pm.03=\left(.61,\ .67\right)[/latex]
We can be 95% sure that the proportion of all U.S. adults who were already familiar with Viagra by that time was between .61 and .67 (or 61% and 67%).
The fact that the margin of error equals .03 says we can be 95% confident that the unknown population proportion p is within .03 (3%) of the observed sample proportion .64 (64%). In other words, we are 95% confident that 64% is “off” by no more than 3%.
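The arithmetic in this example can be checked with a short script. The following is a minimal sketch, not part of the original poll analysis; the function name `proportion_ci` and its parameters are illustrative assumptions. It uses the unrounded sample proportion 643/1005 rather than .64, which gives the same interval to two decimal places.

```python
import math

def proportion_ci(successes, n, z_star=2.0):
    """Return (p_hat, margin, lower, upper) for p_hat ± z* · sqrt(p_hat(1 − p_hat)/n).

    z_star defaults to 2, the 95% multiplier used in this section.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * standard_error
    return p_hat, margin, p_hat - margin, p_hat + margin

# Viagra example: 643 of 1,005 sampled adults were aware of the drug's purpose.
p_hat, margin, lower, upper = proportion_ci(643, 1005)
print(f"p_hat = {p_hat:.2f}, m = {margin:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```

Passing `z_star=1.645` or `z_star=2.576` gives the 90% and 99% intervals for the same data.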
Comment
We would like to share with you the methodology part of the poll release of the Viagra example, and show you that you now have the tools to understand how poll results are analyzed:
“The results are based on telephone interviews with a randomly selected national sample of 1,005 adults, 18 years and older, conducted May 8-10, 1998. For results based on samples of this size, one can say with 95 percent confidence that the error attributable to sampling and other random effects could be plus or minus 3 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.”
Two important results that we discussed at length when we talked about the confidence interval for μ also apply here:
- There is a trade-off between level of confidence and the width (or precision) of the confidence interval. The more precision you would like the confidence interval for p to have, the more you have to pay by having a lower level of confidence.
- Since n appears in the denominator of the margin of error of the confidence interval for p, for a fixed level of confidence, the larger the sample, the narrower (more precise) the interval is. This brings us naturally to our next point.
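Both results can be seen numerically. The sketch below is an illustration only: it reuses the sample proportion .64 from the Viagra example, while the sample sizes 250 and 4,000 are hypothetical values chosen for comparison. It prints the margin of error at each of the three confidence levels.

```python
import math

def margin_of_error(p_hat, n, z_star):
    """Margin of error z* · sqrt(p_hat(1 − p_hat)/n) for a proportion."""
    return z_star * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.64  # sample proportion from the Viagra example
multipliers = {"90%": 1.645, "95%": 2.0, "99%": 2.576}

# 1005 is the actual sample size; 250 and 4000 are hypothetical comparisons.
for n in (250, 1005, 4000):
    row = ", ".join(
        f"{level}: {margin_of_error(p_hat, n, z):.3f}"
        for level, z in multipliers.items()
    )
    print(f"n = {n:>4} -> {row}")
```

Reading across a row, the margin grows as the confidence level rises; reading down a column, it shrinks as n grows.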
Determining Sample Size for a Given Margin of Error in Estimating Proportions
Just as we did for means, when we have some level of flexibility in determining the sample size, we can set a desired margin of error for estimating the population proportion and find the sample size that will achieve that.
For example, a final poll on the day before an election would want the margin of error to be quite small (with a high level of confidence) in order to be able to predict the election results with the most precision. This is particularly relevant when it is a close race between the candidates. The polling company needs to figure out how many eligible voters it needs to include in its sample in order to achieve that.
Let’s see how we do that.
(Comment: For our discussion here we will focus on a 95% confidence level (z* = 2), since this is the most commonly used level of confidence.)
The 95% confidence interval for p is
[latex]\hat{\mathcal{p}}\pm\mathcal{z}*\ \bullet\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex]
The margin of error, then, is
[latex]\mathcal{m}=2\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex]
Now we isolate n (i.e., express it as a function of m).
[latex]\mathcal{n}=\frac{4\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{m}^2}[/latex]
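For readers who want the intermediate algebra, one way to get there is to square both sides of the margin-of-error equation and then solve for n:
[latex]\mathcal{m}^2=\frac{4\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}\quad\Rightarrow\quad\mathcal{n}=\frac{4\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{m}^2}[/latex]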
There is a practical problem with this expression that we need to overcome.
Practically, you first determine the sample size, then you choose a random sample of that size, and then use the collected data to find p̂.
So the fact that the expression above for determining the sample size depends on p̂ is problematic.
The way to overcome this problem is to take the conservative approach by setting p̂ = 1/2.
Why do we call this approach conservative?
It is conservative because the expression that appears in the numerator, 4p̂(1 − p̂), is maximized when p̂ = 1/2.
That way, the n we get will work in giving us the desired margin of error regardless of what the value of p̂ is. This is a “worst case scenario” approach. So when we do that we get:
[latex]\mathcal{n}=\frac{\left(4\right)\frac{1}{2}\left(1-\frac{1}{2}\right)}{\mathcal{m}^2}=\frac{1}{\mathcal{m}^2}[/latex]
Example
It seems like media polls usually use a sample size of 1,000 to 1,200. This could be puzzling.
How could the results obtained from, say, 1,100 U.S. adults give us information about the entire population of U.S. adults? 1,100 is such a tiny fraction of the actual population. Here is the answer:
What sample size n is needed if a margin of error m = .03 is desired? [latex]\mathcal{n}=\frac{1}{{.03}^2}=1111.11\rightarrow\mathcal{n}=1112[/latex] (remember, always round up). In fact, .03 is a very commonly used margin of error, especially for media polls. For this reason, most media polls work with a sample of around 1,100 people.
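The rounding-up step is easy to get wrong in floating-point arithmetic, so here is a minimal sketch of the conservative sample-size calculation. The function name `conservative_sample_size` is an assumption made for illustration. It applies only at the 95% level with z* = 2, and it converts the margin to an exact fraction before rounding up.

```python
import math
from fractions import Fraction

def conservative_sample_size(margin):
    """Conservative sample size n = 1/m^2, rounded up (95% level, z* = 2)."""
    # Convert the decimal margin to an exact fraction so that rounding up
    # is not thrown off by floating-point error.
    m = Fraction(str(margin))
    return math.ceil(1 / m**2)

print(conservative_sample_size(0.03))  # 1112
```

Because n grows like 1/m², cutting the margin of error by a factor of 3 (say, from .03 to .01) requires roughly 9 times as many people.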
Example
A few days before an election, a media outlet would like to estimate p, the proportion of eligible voters who support the Democratic candidate. The media outlet would like the estimate to be within 1% (that is, .01) of the true proportion. What is the sample size needed to achieve this in a poll? Using the conservative formula with m = .01:
[latex]\mathcal{n}=\frac{1}{\mathcal{m}^2}=\frac{1}{{.01}^2}=10000[/latex]
Note that if I take the same conservative approach for the margin of error, [latex]\mathcal{m}=2\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex], and use [latex]\hat{\mathcal{p}}=\frac{1}{2}[/latex], I get
[latex]\mathcal{m}=\sqrt{\frac{\left(4\right)\frac{1}{2}\left(1-\frac{1}{2}\right)}{\mathcal{n}}}=\frac{1}{\sqrt{\mathcal{n}}}[/latex],
a conservative estimate for the margin of error, which is useful when we want to get a rough idea of its size without taking the trouble to make detailed calculations.
Also, typically, there are several questions in polls, each yielding a different p̂. Rather than reporting the separate margin of error for each question using [latex]\mathcal{m}=2\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex], polls report just one, the conservative margin of error [latex]\mathcal{m}=\frac{1}{\sqrt{\mathcal{n}}}[/latex], as the margin of error of the poll, which is guaranteed to work for all the questions regardless of what the value of p̂ ends up being.
Example
A random sample of 2,500 U.S. adults was chosen to participate in a public opinion survey about different issues related to crime. What is the margin of error of this survey?
We’ll simply use [latex]\mathcal{m}=\frac{1}{\sqrt{\mathcal{n}}}=\frac{1}{\sqrt{2500}}=.02[/latex]. The survey has a margin of error of 2%. This means that for each of the questions asked, we can be 95% confident that the obtained sample proportion is within 2% of the proportion among all U.S. adults.
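To see why a single conservative margin can stand in for every question in a poll, the sketch below compares it with the question-specific margin for a few sample proportions. The proportions .1, .3, .5 and .8 are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import math

def conservative_margin(n):
    """Conservative 95% margin of error, 1/sqrt(n)."""
    return 1 / math.sqrt(n)

def exact_margin(p_hat, n, z_star=2.0):
    """Question-specific margin of error, z* · sqrt(p_hat(1 − p_hat)/n)."""
    return z_star * math.sqrt(p_hat * (1 - p_hat) / n)

n = 2500  # sample size from the crime survey example
print(f"conservative margin: {conservative_margin(n):.3f}")

# Hypothetical sample proportions for different questions in the same poll.
for p_hat in (0.1, 0.3, 0.5, 0.8):
    print(f"p_hat = {p_hat:.1f} -> exact margin {exact_margin(p_hat, n):.3f}")
```

The question-specific margin never exceeds the conservative one, and the two coincide when the sample proportion is 1/2.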
When Is It Safe to Use These Methods?
As we mentioned before, one of the most important things to learn with any inference method is the conditions under which it is safe to use it.
As we did for the mean, the assumption we made in order to develop the methods in this unit was that the sampling distribution of the sample proportion, p̂, is roughly normal. Recall from module 4 of the Probability unit that the conditions under which this happens are that n⋅p≥10 and n⋅(1−p)≥10. Since p is unknown, we will replace it with its estimate, the sample proportion, and set
n⋅p̂≥10 and n⋅(1−p̂)≥10
to be the conditions under which it is safe to use the methods we developed in this section.
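These conditions are simple to check before computing an interval. Since n⋅p̂ is just the number of sampled individuals in the category of interest and n⋅(1−p̂) is the number outside it, the check reduces to comparing two counts with 10. The following is a minimal sketch; the function name `normal_approximation_ok` and the small sample of 20 are hypothetical.

```python
def normal_approximation_ok(successes, n, threshold=10):
    """Check n·p_hat >= 10 and n·(1 − p_hat) >= 10 using the raw counts."""
    failures = n - successes  # equals n·(1 − p_hat)
    return successes >= threshold and failures >= threshold

print(normal_approximation_ok(643, 1005))  # Viagra example: both counts are large
print(normal_approximation_ok(3, 20))      # hypothetical small sample: only 3 successes
```

The first call returns `True` and the second returns `False`, so the methods of this section would not be safe to use for the small sample.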
Let’s Summarize
In general, a confidence interval for the unknown population proportion (p) is [latex]\hat{\mathcal{p}}\pm\mathcal{z}*\ \bullet\sqrt{\frac{\hat{\mathcal{p}}\left(1-\hat{\mathcal{p}}\right)}{\mathcal{n}}}[/latex], where z* is 1.645 for 90% confidence, 2 for 95% confidence, and 2.576 for 99% confidence.
To obtain a desired margin of error (m) in a confidence interval for an unknown population proportion, a conservative sample size is [latex]\mathcal{n}=\frac{1}{\mathcal{m}^2}[/latex].
The margin of error of a poll is determined (conservatively) by [latex]\frac{1}{\sqrt\mathcal{n}}[/latex].
The methods developed in this unit are safe to use as long as n⋅p̂≥10 and n⋅(1−p̂)≥10.