An Introduction to Sampling Distributions
A few words about sampling
The following are some important terms we need to use and understand
accurately in order to do inferential statistics. To make things
concrete, let's consider two examples:
- We want to know what percentage of Michigan residents who
  will vote in an upcoming election favor candidate A.
- We want to know the mean birth weight of robins born in Michigan this
  year.
- Population: The entire collection of individuals about which we
  desire information. [Examples: Michigan residents who will vote in
  the upcoming election; all robins born in Michigan this year.]
  Note that there may be practical difficulties in identifying exactly
  which individuals are in the population.
- Parameter: A numerical summary (usually unknown but desired)
  based on the entire population.
  [Examples: percent of Michigan residents who will vote for candidate A
  in the upcoming election;
  mean birth weight of all robins in Michigan.]
- Sample: A subset of the population from which data is actually
  collected. [Examples: 1000 Michigan residents who claim they will
  vote and answered a telephone survey; 250 robins that are found and
  weighed within 2 hours of birth.]
  Note: The sample may fail to be a true subset of the population.
  Although we want it to be one, it may be difficult to determine
  exactly who is in the population (think about the voting example),
  so we may sample some individuals who aren't actually in the
  population (because they say they will vote but then don't, for
  example).
- Statistic: A numerical summary based on data collected from the sample.
  [Examples: percent of Michigan residents in a telephone survey who
  say they favor candidate A; mean weight of a sample of newborn robins.]
- Population Distribution (of a variable): The value of a variable
  over a population can be thought of as a random variable because
  the value of the variable depends on which individual is selected.
  The probability distribution of this random variable is called the
  population distribution.
- Sampling Distribution (of a statistic): A statistic computed from a
  random sample (or in a randomized experiment) is a random variable
  because the outcome depends on which individuals are included in the
  sample. The probability distribution of the sample statistic is
  called the sampling distribution.
A couple of things to keep in mind:
- The main big idea that we need to make precise and quantify
  is that the results of sampling vary from sample to sample, but that
  the nature of this variability (the sampling distribution) can,
  in many situations, be determined (or at least approximated).
  This allows us to make statements about a population based on results
  from a sample.
- Of course, we can only find out information about a population
  based on a sample if the sample is properly selected. The
  mathematics always assumes a simple random sample (SRS), where each
  individual in the population has an equal chance of being selected for
  the sample. In practice, we often have to use more complicated sampling
  techniques (because we don't have a list of the entire population to
  sample from).
Sampling Distributions for Counts and Proportions
Let's consider our first example above: sampling to determine
public opinion (election outcome). This is really just one example
of a common situation where we want to estimate the percentage of
a population with a certain trait (i.e., a certain value of
some categorical variable).
Here's the short version of what happens in this situation:
- The distribution of sample counts is approximately binomial
  [B(n,p)], where
  - n is the sample size;
  - p is the proportion of the population with the trait of interest;
  - this approximation is good enough when the population size
    is at least 10 times larger than the sample size.
- Binomial distributions are approximately normal, with
  - mean = np
  - variance = n(p)(1-p) (standard deviation is the square root of this);
  - this approximation is good enough when np and n(1-p) are
    both at least 10. (That is, we should "expect" at least 10
    successes and 10 failures.)
- So sample counts are approximately normal with the mean and standard
  deviation given above.
- The distribution of sample proportions is obtained by dividing by n,
  so it is also approximately normal
  (under the two good-enough conditions above), with
  - mean = np/n = p
  - variance = n(p)(1-p) / (n*n) = p(1-p) / n
    (standard deviation is the square root of this).
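To see these approximations in action, here is a minimal simulation sketch (the values p = 0.40 and n = 500 are made-up illustrations, not from the notes): it draws many samples, computes each sample proportion, and compares the results to the normal approximation above.

```python
import math
import random

random.seed(1)

# Hypothetical numbers (not from the notes): p = 0.40, n = 500.
p, n = 0.40, 500

# What the normal approximation predicts for sample proportions:
mean = p                           # np/n = p
sd = math.sqrt(p * (1 - p) / n)    # sqrt(p(1-p)/n)

# Draw 2000 samples of size n; record each sample proportion.
props = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(2000)]

sim_mean = sum(props) / len(props)
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in props) / len(props))

print(f"theory:     mean={mean:.3f}, sd={sd:.4f}")
print(f"simulation: mean={sim_mean:.3f}, sd={sim_sd:.4f}")
```

The simulated mean and standard deviation should land very close to p and sqrt(p(1-p)/n).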
Details follow:
Binomial Distributions
A binomial setting is one with the following four characteristics:
- There is a fixed, finite number of trials or observations (n);
- Each of the n trials is independent;
- Each trial has two possible outcomes (generically called success and failure);
- The probability of success in each trial is the same (p).
Free-throw Freddy (from test 1) and coin flipping are examples of
random events with binomial distributions if we count the number
of made free-throws or the number of heads flipped.
Sample counts are almost binomial. If we don't allow individuals to
be selected multiple times, then the value of p goes up and
down depending on the values obtained from previously sampled
individuals. This variability is very small if the population is much
larger than the sample. (One person in 200 million does not affect
the percentage very much, but one person in 25 does.)
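A small simulation can make this concrete (the population sizes 25 and 200,000 here are hypothetical choices): sampling without replacement from a huge population produces counts with essentially the binomial variance np(1-p), while sampling most of a population of 25 does not.

```python
import random

random.seed(2)

def count_variance(pop_size, p, n, trials=4000):
    """Sample n individuals without replacement from a 0/1 population
    and return the observed variance of the success counts."""
    successes = int(pop_size * p)
    population = [1] * successes + [0] * (pop_size - successes)
    counts = [sum(random.sample(population, n)) for _ in range(trials)]
    mean = sum(counts) / trials
    return sum((c - mean) ** 2 for c in counts) / trials

n, p = 20, 0.5
binom_var = n * p * (1 - p)              # B(20, 0.5) predicts variance 5.0

small = count_variance(25, p, n)         # population barely larger than sample
large = count_variance(200_000, p, n)    # population >> sample

print(f"binomial variance: {binom_var}")
print(f"population 25:     {small:.2f}   (much smaller -- p shifts as we draw)")
print(f"population 200000: {large:.2f}   (close to binomial)")
```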
Means and Standard Deviations
Consider first a single trial (or B(1,p) if you like).
Let X count the number of successes (so X is either 0 or 1 in this case).
The distribution of X is

    value of X:    0      1
    probability:  1-p     p

From this we can easily compute a mean of 0(1-p) + 1(p) = p and a
variance of (0-p)^2(1-p) + (1-p)^2(p) = p(1-p).
If we have n trials instead, we can use our rules for combining
means and variances (means and variances both add because
the trials are independent). So
- mean = np
- variance = n(p)(1-p) (standard deviation is the square root of this)
Computing Binomial probabilities
We can compute probabilities for binomial distributions in several
ways: directly from the formula P(X = k) = C(n,k) p^k (1-p)^(n-k),
from a table, with software, or (when n is large) via the normal
approximation.
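For instance, here is a sketch in Python (the numbers n = 100, p = 0.4 are made up for illustration) that computes an exact binomial probability from the formula and compares it to the normal approximation with a continuity correction.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k) for B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k), summing the exact formula."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 100, 0.4                          # hypothetical example values
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binom_cdf(45, n, p)
approx = normal_cdf(45.5, mu, sigma)     # 45.5: continuity correction
print(f"P(X <= 45): exact={exact:.4f}, normal approx={approx:.4f}")
```

Since np = 40 and n(1-p) = 60 are both well over 10, the two answers agree closely.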
The Punch Line
The punch line is this: in the situations that interest us most (large
populations, reasonable sample sizes), the distribution of sample
proportions is approximately normal, and we already know how to get
a lot of information from a normal distribution.
Sampling Distributions for Sample Means
Now consider sampling values of a quantitative variable
from n individuals.
Mean and Standard Deviation of the Sample Mean
- If n is 1, then we just have the population distribution.
  So the mean and standard deviation of the sample mean are the same
  as the mean and standard deviation of the population.
- For larger n we can again combine using our rules for means
  and variances to get
  - mean = mean of the population
  - variance = variance of the population / n (standard deviation is
    the square root of this).
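As a sanity check, here is a short simulation sketch (the population mean 6.0 and standard deviation 1.5 are invented numbers, loosely in the spirit of the robin example): the sample means of samples of size 25 should have mean 6.0 and standard deviation 1.5/sqrt(25) = 0.3.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical population (not data from the notes): a normal population
# with mean 6.0 and standard deviation 1.5; samples of size n = 25.
mu, sigma, n = 6.0, 1.5, 25

# Draw 4000 samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(4000)]

print(f"theory:     mean={mu}, sd={sigma / math.sqrt(n):.3f}")
print(f"simulation: mean={statistics.fmean(means):.3f}, "
      f"sd={statistics.pstdev(means):.3f}")
```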
Central Limit Theorem
Of course, knowing the mean and standard deviation is not enough; we
need to know what the distribution is. Here's the good news: if n
is "large enough", then the sampling distribution is approximately
normal (and we already know the mean and standard deviation!).
How large "large enough" is depends on the population distribution.
- If the population distribution is normal, then the sampling distribution
  is also exactly normal, for any n.
- If the population distribution is approximately normal,
  then the sampling distribution is also approximately normal, even for
  small n.
- The less normal the population distribution is,
  the larger n must be, but for most distributions, 30 or 40 is
  plenty large.
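Here is a quick illustration of the Central Limit Theorem (the choice of an exponential population and n = 40 is ours, for illustration): even though the population is strongly right-skewed, sample means of size 40 land within one standard deviation of the mean about 68% of the time, just as a normal distribution predicts.

```python
import math
import random
import statistics

random.seed(4)

n, trials = 40, 4000

# Population: exponential with mean 1 -- strongly right-skewed, far
# from normal. Draw many samples of size n and record the sample means.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# The CLT predicts approximately N(mu, sigma/sqrt(n)); for this
# population mu = 1 and sigma = 1.
mu, sd = 1.0, 1.0 / math.sqrt(n)
within_1sd = sum(mu - sd < m < mu + sd for m in means) / trials
print(f"fraction within 1 sd of the mean: {within_1sd:.3f}  (normal: 0.683)")
```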
The same punch line
Notice that the punch line is the same: regardless of the population
distributions, the sampling distribution will be very nearly normal
provided the sample size is large enough.
What next?
We already know how to work with normal distributions, so what remains to
be done?
- We will apply the results above in a large number of situations. In
  particular, we will
  - use sampling to estimate population parameters and give some
    indication of the quality of the estimates;
  - use sampling to do hypothesis testing to answer questions about
    population parameters;
  - combine some of the information here to work with more than
    one population at a time. (This will let us compare two populations,
    for example.)
- Notice that in our analysis above, we assumed that we know a lot about
  the population (p in the sample proportion setting and both the
  mean and standard deviation in the sample mean setting). But we usually
  don't know this information -- that's why we are sampling! We'll need
  to deal with this little problem.