7.3: Sampling Distribution of the Sample Proportion
Learning Objectives
- Apply the sampling distribution of the sample proportion (when appropriate). In particular, be able to identify unusual samples from a given population.
The first step to drawing conclusions about parameters based on the accompanying statistics is to understand how sample statistics behave relative to the parameter that summarizes the entire population. We begin with the behavior of sample proportion relative to population proportion (when the variable of interest is categorical). After that, we will explore the behavior of sample mean relative to population mean (when the variable of interest is quantitative).
Behavior of Sample Proportion p̂
Example
Approximately 60% of all part-time college students in the United States are female. (In other words, the population proportion of females among part-time college students is p = .6.) What would you expect to see in terms of the behavior of the sample proportion of females (p̂) if random samples of size 100 were taken from the population of all part-time college students?
As we saw before, due to sampling variability, sample proportion in random samples of size 100 will take numerical values which vary according to the laws of chance: in other words, sample proportion is a random variable. To summarize the behavior of any random variable, we focus on three features of its distribution: the center, the spread, and the shape.
Based only on our intuition, we would expect the following:
Center: Some sample proportions will be on the low side (say, .55 or .58), while others will be on the high side (say, .61 or .66). It is reasonable to expect all the sample proportions in repeated random samples to average out to the underlying population proportion, .6. In other words, the mean of the distribution of p̂ should be p.
Spread: For samples of 100, we would expect sample proportions of females not to stray too far from the population proportion .6. Sample proportions lower than .5 or higher than .7 would be rather surprising. On the other hand, if we were only taking samples of size 10, we would not be at all surprised by a sample proportion of females even as low as 4/10 = .4, or as high as 8/10 = .8. Thus, sample size plays a role in the spread of the distribution of sample proportion: there should be less spread for larger samples, more spread for smaller samples.
Shape: Sample proportions closest to .6 would be most common, and sample proportions far from .6 in either direction would be progressively less likely. In other words, the shape of the distribution of sample proportion should bulge in the middle and taper at the ends: it should be somewhat normal.
Comment
The distribution of the values of the sample proportions (p̂) in repeated samples is called the sampling distribution of p̂.
The purpose of the next activity is to check, via simulations, whether our intuition about the center, spread, and shape of the sampling distribution of p̂ was right.
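A simulation of this kind can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's own activity: the function name simulate_sample_proportions, the choice of 5,000 repeated samples, and the fixed seed are all assumptions made here for the example.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n from a population in which
    the proportion of interest is p, and return each sample's proportion."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    proportions = []
    for _ in range(num_samples):
        # Each individual falls in the category of interest with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

# Compare small samples (n = 10) with larger samples (n = 100) for p = 0.6.
for n in (10, 100):
    props = simulate_sample_proportions(p=0.6, n=n, num_samples=5000)
    print(
        f"n = {n}: mean of sample proportions = {statistics.mean(props):.3f}, "
        f"standard deviation = {statistics.stdev(props):.3f}"
    )
```

In a run of this sketch, the mean of the simulated sample proportions should land near .6 for both sample sizes, while the standard deviation should be noticeably smaller for n = 100 than for n = 10. A histogram of props would show the shape.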
Again, the simulations on the previous page reinforced what makes sense to our intuition. Larger random samples will better approximate the population proportion. When the sample size is large, sample proportions will be closer to p. In other words, the sampling distribution for large samples has less variability. Advanced probability theory confirms our observations and gives a more precise way to describe the standard deviation of the sample proportions. This is described next.
The Sampling Distribution of the Sample Proportion
If repeated random samples of a given size n are taken from a population of values for a categorical variable, where the proportion in the category of interest is p, then the mean of all sample proportions (p̂) is the population proportion (p). As for the spread of all sample proportions, theory dictates the behavior much more precisely than saying that there is less spread for larger samples. In fact, the standard deviation of all sample proportions (p̂) is exactly $\sqrt{\frac{p(1-p)}{n}}$.
Since sample size n appears in the denominator inside the square root, the standard deviation does decrease as sample size increases. Finally, the shape of the distribution of p̂ will be approximately normal as long as the sample size n is large enough. The convention is to require both np and n(1 − p) to be at least 10.
We can summarize all of the above by the following:
p̂ has an approximately normal distribution with mean $\mu_{\hat{p}} = p$ and standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$, as long as np and n(1 − p) are both at least 10.
Let’s apply this result to our example and see how it compares with our simulation.
In our example, n = 25 (sample size) and p = 0.6. Note that np = 15 ≥ 10 and n(1 − p) = 10 ≥ 10. Therefore we can conclude that p̂ has approximately a normal distribution with mean p = 0.6 and standard deviation $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{25}} \approx 0.098$ (which is very close to what we saw in our simulation).
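The calculation above can be packaged as a small helper. The following is a minimal sketch; the function name sample_proportion_distribution and the small tolerance used in the size check are assumptions made for this illustration, not part of the text.

```python
import math

def sample_proportion_distribution(p, n):
    """Return (mean, standard deviation, normal_ok) for the sampling
    distribution of the sample proportion.

    normal_ok is True when both np and n(1 - p) are at least 10, the
    convention stated in the text for using a normal approximation.
    """
    mean = p
    sd = math.sqrt(p * (1 - p) / n)
    # A tiny tolerance (an assumption of this sketch) guards against
    # floating-point noise in borderline cases such as n(1 - p) = 10.
    tol = 1e-9
    normal_ok = (n * p >= 10 - tol) and (n * (1 - p) >= 10 - tol)
    return mean, sd, normal_ok

mean, sd, normal_ok = sample_proportion_distribution(p=0.6, n=25)
print(f"mean = {mean}, standard deviation = {sd:.3f}, normal shape OK: {normal_ok}")
```

For p = 0.6 and n = 25 the helper reports a standard deviation of about 0.098 and confirms that both conditions are met. With n = 10 the same check fails, since np = 6 and n(1 − p) = 4.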
If a sampling distribution is normally shaped, then we can apply the Standard Deviation Rule and use z-scores to determine probabilities. Let’s look at some examples.
Example
A random sample of 100 students is taken from the population of all part-time students in the United States, for which the overall proportion of females is .6.
(a) There is a 95% chance that the sample proportion (p̂) falls between what two values? First note that the distribution of p̂ has mean p = .6, standard deviation $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{100}} \approx 0.05$, and a shape that is close to normal, since np = 100(.6) = 60 and n(1 − p) = 100(.4) = 40 are both greater than 10. The Standard Deviation Rule applies: the probability is approximately .95 that p̂ falls within 2 standard deviations of the mean, that is, between 0.6 − 2(.05) and 0.6 + 2(.05). There is roughly a 95% chance that p̂ falls in the interval (.5, .7).
(b) What is the probability that the sample proportion p̂ is less than or equal to .56?
To find $P(\hat{p} \le 0.56)$, we standardize .56 to z = (.56 − .60) / .05 = −.80:
$P(\hat{p} \le 0.56) = P(Z \le -0.80) = 0.2119$
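Both parts of this example can be checked numerically. The sketch below uses Python's standard-library statistics.NormalDist; the variable names are illustrative assumptions. It rounds the standard deviation to 0.05 to mirror the hand calculation, then repeats part (b) without rounding for comparison.

```python
import math
from statistics import NormalDist

p, n = 0.6, 100

# Standard deviation of the sample proportion, rounded to 0.05 as in the text.
sd_exact = math.sqrt(p * (1 - p) / n)
sd = round(sd_exact, 2)

# (a) Standard Deviation Rule: about 95% of sample proportions fall
# within 2 standard deviations of the mean.
low, high = p - 2 * sd, p + 2 * sd

# (b) Standardize 0.56 and look up the standard normal probability.
z = (0.56 - p) / sd
prob = NormalDist().cdf(z)

# The same probability without rounding the standard deviation first.
prob_unrounded = NormalDist(mu=p, sigma=sd_exact).cdf(0.56)

print(f"interval: ({low:.2f}, {high:.2f})")
print(f"z = {z:.2f}, P(p-hat <= 0.56) = {prob:.4f}")
print(f"without rounding the standard deviation: {prob_unrounded:.4f}")
```

The first two lines of output reproduce the interval (.5, .7) and the probability 0.2119 from the worked example. The unrounded version comes out slightly smaller, because the unrounded standard deviation is a little less than 0.05.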
To see the impact of the sample size on these probability calculations, consider the following variation of our example.
Example
A random sample of 2,500 students is taken from the population of all part-time students in the United States, for which the overall proportion of females is .6.
(a) There is a 95% chance that the sample proportion (p̂) falls between what two values? First note that the distribution of p̂ has mean p = .6, standard deviation $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{2500}} \approx 0.01$, and a shape that is close to normal, since np = 2500(.6) = 1500 and n(1 − p) = 2500(.4) = 1000 are both greater than 10. The Standard Deviation Rule applies: the probability is approximately .95 that p̂ falls within 2 standard deviations of the mean, that is, between 0.6 − 2(.01) and 0.6 + 2(.01). There is roughly a 95% chance that p̂ falls in the interval (.58, .62).
(b) What is the probability that the sample proportion p̂ is less than or equal to .56?
To find $P(\hat{p} \le 0.56)$, we standardize .56 to z = (.56 − .60) / .01 = −4.00:
$P(\hat{p} \le 0.56) = P(Z \le -4.00) \approx 0$.
Comment
As long as the sample is truly random, the distribution of p̂ is centered at p, no matter what size sample has been taken. Larger samples have less spread. Specifically, when we multiplied the sample size by 25, increasing it from 100 to 2,500, the standard deviation was reduced to 1/5 of the original standard deviation. Sample proportion strays less from population proportion .6 when the sample is larger: it tends to fall anywhere between .5 and .7 for samples of size 100, whereas it tends to fall between .58 and .62 for samples of size 2,500. It is not so improbable to take a value as low as .56 for samples of 100 (probability is more than 20%), but it is almost impossible to take a value as low as .56 for samples of 2,500 (probability is virtually zero).
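The comparison between the two sample sizes can be reproduced with a short sketch. The helper name summarize is an assumption made for this illustration, and the standard deviations are left unrounded, so the probabilities differ slightly from the hand-rounded values in the examples.

```python
import math
from statistics import NormalDist

def summarize(p, n, cutoff):
    """Return the standard deviation of the sample proportion and the
    normal-approximation probability that it is at most cutoff."""
    sd = math.sqrt(p * (1 - p) / n)
    prob_at_most = NormalDist(mu=p, sigma=sd).cdf(cutoff)
    return sd, prob_at_most

sd_100, prob_100 = summarize(p=0.6, n=100, cutoff=0.56)
sd_2500, prob_2500 = summarize(p=0.6, n=2500, cutoff=0.56)

print(f"n = 100:  sd = {sd_100:.4f}, P(p-hat <= 0.56) = {prob_100:.4f}")
print(f"n = 2500: sd = {sd_2500:.4f}, P(p-hat <= 0.56) = {prob_2500:.6f}")
# Multiplying n by 25 divides the standard deviation by sqrt(25) = 5.
print(f"ratio of standard deviations: {sd_2500 / sd_100:.2f}")
```

The ratio of the two standard deviations is 1/5, because the standard deviation is proportional to one over the square root of n.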
Theoretical Comment (Optional)
The above results for the distribution of sample proportion p̂ are directly related to the results already obtained for the distribution of sample count X in a binomial experiment. Remember that X had mean np, standard deviation $\sqrt{np(1-p)}$, and a shape that allowed for normal approximations as long as both np and n(1 − p) were at least 10. Since sample proportion is $\hat{p} = \frac{X}{n}$, we could derive the mean and standard deviation of p̂ by applying the Rules for Means and Variances: $\mu_{\hat{p}} = \mu_{X/n} = \frac{1}{n}\mu_X = \frac{1}{n}(np) = p$ and $\sigma^2_{\hat{p}} = \sigma^2_{X/n} = \frac{1}{n^2}\sigma^2_X = \frac{1}{n^2}(np)(1-p) = \frac{1}{n}p(1-p)$, so $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$. The requirements that np and n(1 − p) be at least 10 are the same, whether we are focusing on the distribution of sample count or the distribution of sample proportion. After all, the shape of p̂ is the same as the shape of X: the scale of the horizontal axis is just uniformly divided by n.
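The link between the count X and the proportion p̂ can be illustrated numerically. The sketch below is an illustration under assumptions, with variable names chosen here. It computes P(X ≤ 56) for n = 100 and p = 0.6 three ways: exactly from the binomial distribution, with a normal approximation on the count scale, and with a normal approximation on the proportion scale. No continuity correction is applied, so the approximations are not expected to match the exact value closely.

```python
import math
from statistics import NormalDist

n, p = 100, 0.6
k = 56  # a count of 56 corresponds to a sample proportion of 56/100 = 0.56

# Exact binomial probability P(X <= k).
exact = sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

# Normal approximation on the count scale: mean np, sd sqrt(np(1 - p)).
count_scale = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p))).cdf(k)

# Normal approximation on the proportion scale: mean p, sd sqrt(p(1 - p)/n).
proportion_scale = NormalDist(mu=p, sigma=math.sqrt(p * (1 - p) / n)).cdf(k / n)

print(f"exact binomial P(X <= {k}):           {exact:.4f}")
print(f"normal approx, count scale:          {count_scale:.4f}")
print(f"normal approx, proportion scale:     {proportion_scale:.4f}")
```

The two normal approximations agree with each other, because dividing X by n rescales the horizontal axis without changing the shape.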