An Introduction to Sampling Distributions
A few words about sampling
The following are some important terms we need to use and understand
accurately in order to do inferential statistics. To make things
concrete, let's consider two examples:
- We want to know what percentage of Michigan residents who
  will vote in an upcoming election favor candidate A.
- We want to know the mean birth weight of robins born in Michigan this
  year.
- Population: The entire collection of individuals about which we
  desire information. [Examples: Michigan residents who will vote in
  the upcoming election; all robins born in Michigan this year.]
  Note that there may be practical difficulties in identifying exactly
  which individuals are in the population.
- Parameter: A numerical summary (usually unknown but desired)
  based on the entire population.
  [Examples: percent of Michigan residents who will vote for candidate A
  in the upcoming election;
  mean birth weight of all robins in Michigan.]
- Sample: A subset of the population from which data is actually
  collected. [Examples: 1000 Michigan residents who claim they will
  vote and answered a telephone survey; 250 robins that are found and
  weighed within 2 hours of birth.]
  Note: The sample may fail to be a true subset of the population.
  Although we want it to be one, it may be difficult to determine
  exactly who is in the population (think about the voting example),
  so we may sample some individuals who aren't actually in the
  population (because they say they will vote but then don't, for
  example).
- Statistic: A numerical summary based on data collected from the sample.
  [Examples: percent of Michigan residents in a telephone survey who
  say they favor candidate A; mean weight of a sample of newborn robins.]
- Population Distribution (of a variable): The value of a variable
  over a population can be thought of as a random variable because
  the value of the variable depends on which individual is selected.
  The probability distribution of this random variable is called the
  population distribution.
- Sampling Distribution (of a statistic): A statistic computed from a
  random sample (or in a randomized experiment) is a random variable
  because the outcome depends on which individuals are included in the
  sample. The probability distribution of the sample statistic is
  called the sampling distribution.
A couple of things to keep in mind:
- The main big idea that we need to make precise and quantify
  is that the results of sampling vary from sample to sample, but that
  the nature of this variability (the sampling distribution) can,
  in many situations, be determined (or at least approximated).
  This allows us to make statements about a population based on results
  from a sample.
- Of course, we can only find out information about a population
  based on a sample if the sample is properly selected. The
  mathematics always assumes a simple random sample (SRS), where each
  individual in the population has an equal chance of being selected for
  the sample. In practice, we often have to use more complicated sampling
  techniques (because we don't have a list of the entire population to
  sample from).
Sampling Distributions for Counts and Proportions
Let's consider our first example above: sampling to determine
public opinion (election outcome). This is really just one example
of a common situation where we want to estimate the percentage of
a population with a certain trait (i.e., a certain value of
some categorical variable).
Here's the short version of what happens in this situation:
- The distribution of sample counts is approximately binomial
  [B(n,p)], where
  - n is the sample size;
  - p is the proportion of the population with the trait of interest;
  - this approximation is good enough when the population size
    is at least 10 times larger than the sample size.
- Binomial distributions are approximately normal, with
  - mean = np
  - variance = n(p)(1-p) (standard deviation is the square root of this);
  - this approximation is good enough when np and n(1-p) are
    both at least 10. (That is, we should "expect" at least 10
    successes and 10 failures.)
- So sample counts are approximately normal with the mean and standard
  deviation given above.
- The distribution of sample proportions is obtained by dividing by n,
  so it is also approximately normal
  (under the two good-enough conditions above), with
  - mean = np/n = p
  - variance = n(p)(1-p) / (n*n) = p(1-p) / n
    (standard deviation is the square root of this).
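To see these approximations in action, here is a minimal simulation sketch (the values p = 0.40 and n = 500 are made-up illustrations, not from the notes): it draws many samples, computes each sample proportion, and compares the results to the normal approximation above.

```python
import math
import random

random.seed(1)

# Hypothetical numbers (not from the notes): p = 0.40, n = 500.
p, n = 0.40, 500

# What the normal approximation predicts for sample proportions:
mean = p                           # np/n = p
sd = math.sqrt(p * (1 - p) / n)    # sqrt(p(1-p)/n)

# Draw 2000 samples of size n; record each sample proportion.
props = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(2000)]

sim_mean = sum(props) / len(props)
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in props) / len(props))

print(f"theory:     mean={mean:.3f}, sd={sd:.4f}")
print(f"simulation: mean={sim_mean:.3f}, sd={sim_sd:.4f}")
```

The simulated mean and standard deviation should land very close to p and sqrt(p(1-p)/n).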
Details follow:
Binomial Distributions
A binomial setting is one with the following four characteristics:
- There is a fixed, finite number of trials or observations (n);
- Each of the n trials is independent;
- Each trial has two possible outcomes (generically called success and failure);
- The probability of success in each trial is the same (p).
Free-throw Freddy (from test 1) and coin flipping are examples of
random events with binomial distributions if we count the number
of made free-throws or the number of heads flipped.
Sample counts are almost binomial. If we don't allow individuals to
be selected multiple times, then the value of p goes up and
down depending on the values obtained from previously sampled
individuals. This variability is very small if the population is much
larger than the sample. (One person in 200 million does not affect
the percentage very much, but one person in 25 does.)
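A small simulation can make this concrete (the population sizes 25 and 200,000 here are hypothetical choices): sampling without replacement from a huge population produces counts with essentially the binomial variance np(1-p), while sampling most of a population of 25 does not.

```python
import random

random.seed(2)

def count_variance(pop_size, p, n, trials=4000):
    """Sample n individuals without replacement from a 0/1 population
    and return the observed variance of the success counts."""
    successes = int(pop_size * p)
    population = [1] * successes + [0] * (pop_size - successes)
    counts = [sum(random.sample(population, n)) for _ in range(trials)]
    mean = sum(counts) / trials
    return sum((c - mean) ** 2 for c in counts) / trials

n, p = 20, 0.5
binom_var = n * p * (1 - p)              # B(20, 0.5) predicts variance 5.0

small = count_variance(25, p, n)         # population barely larger than sample
large = count_variance(200_000, p, n)    # population >> sample

print(f"binomial variance: {binom_var}")
print(f"population 25:     {small:.2f}   (much smaller -- p shifts as we draw)")
print(f"population 200000: {large:.2f}   (close to binomial)")
```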
Means and Standard Deviations
Consider first a single trial (or B(1,p) if you like).
Let X count the number of successes (so X is either 0 or 1 in this case).
The distribution of X is

    value of X:    0      1
    probability:  1-p     p

From this we can easily compute a mean of 0(1-p) + 1(p) = p and a
variance of (0-p)^2(1-p) + (1-p)^2(p) = p(1-p).
If we have n trials instead, we can use our rules for combining
means and variances (means and variances both add because
the trials are independent). So
- mean = np
- variance = n(p)(1-p) (standard deviation is the square root of this)
Computing Binomial probabilities
We can compute probabilities for binomial distributions in several
ways: directly from the formula P(X = k) = C(n,k) p^k (1-p)^(n-k),
from a table, with software, or (when n is large) via the normal
approximation.
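For instance, here is a sketch in Python (the numbers n = 100, p = 0.4 are made up for illustration) that computes an exact binomial probability from the formula and compares it to the normal approximation with a continuity correction.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k) for B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k), summing the exact formula."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 100, 0.4                          # hypothetical example values
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binom_cdf(45, n, p)
approx = normal_cdf(45.5, mu, sigma)     # 45.5: continuity correction
print(f"P(X <= 45): exact={exact:.4f}, normal approx={approx:.4f}")
```

Since np = 40 and n(1-p) = 60 are both well over 10, the two answers agree closely.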
The Punch Line
The punch line is this: in the situations that interest us most (large
populations, reasonable sample sizes), the distribution of sample
proportions is approximately normal, and we already know how to get
a lot of information from a normal distribution.
Sampling Distributions for Sample Means
Now consider sampling values of a quantitative variable
from n individuals.
Mean and Standard Deviation of the Sample Mean
- If n is 1, then we just have the population distribution.
  So the mean and standard deviation of the sample mean are the same
  as the mean and standard deviation of the population.
- For larger n we can again combine using our rules for means
  and variances to get
  - mean = mean of the population
  - variance = variance of the population / n (standard deviation is
    the square root of this).
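As a sanity check, here is a short simulation sketch (the population mean 6.0 and standard deviation 1.5 are invented numbers, loosely in the spirit of the robin example): the sample means of samples of size 25 should have mean 6.0 and standard deviation 1.5/sqrt(25) = 0.3.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical population (not data from the notes): a normal population
# with mean 6.0 and standard deviation 1.5; samples of size n = 25.
mu, sigma, n = 6.0, 1.5, 25

# Draw 4000 samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(4000)]

print(f"theory:     mean={mu}, sd={sigma / math.sqrt(n):.3f}")
print(f"simulation: mean={statistics.fmean(means):.3f}, "
      f"sd={statistics.pstdev(means):.3f}")
```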
Central Limit Theorem
Of course, knowing the mean and standard deviation is not enough; we
need to know what the distribution is. Here's the good news: if n
is "large enough", then the sampling distribution is approximately
normal (and we already know the mean and standard deviation!).
How large "large enough" is depends on the population distribution.
- If the population distribution is normal, then the sampling distribution
  is also exactly normal, for any n.
- If the population distribution is approximately normal,
  then the sampling distribution is also approximately normal, even for
  small n.
- The less normal the population distribution is,
  the larger n must be, but for most distributions, 30 or 40 is
  plenty large.
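Here is a quick illustration of the Central Limit Theorem (the choice of an exponential population and n = 40 is ours, for illustration): even though the population is strongly right-skewed, sample means of size 40 land within one standard deviation of the mean about 68% of the time, just as a normal distribution predicts.

```python
import math
import random
import statistics

random.seed(4)

n, trials = 40, 4000

# Population: exponential with mean 1 -- strongly right-skewed, far
# from normal. Draw many samples of size n and record the sample means.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# The CLT predicts approximately N(mu, sigma/sqrt(n)); for this
# population mu = 1 and sigma = 1.
mu, sd = 1.0, 1.0 / math.sqrt(n)
within_1sd = sum(mu - sd < m < mu + sd for m in means) / trials
print(f"fraction within 1 sd of the mean: {within_1sd:.3f}  (normal: 0.683)")
```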
The same punch line
Notice that the punch line is the same: regardless of the population
distributions, the sampling distribution will be very nearly normal
provided the sample size is large enough.
What next?
We already know how to work with normal distributions, so what remains to
be done?
- We will apply the results above in a large number of situations. In
  particular, we will
  - use sampling to estimate population parameters and give some
    indication of the quality of the estimates;
  - use sampling to do hypothesis testing to answer questions about
    population parameters;
  - combine some of the information here to work with more than
    one population at a time. (This will let us compare two populations,
    for example.)
- Notice that in our analysis above, we assumed that we know a lot about
  the population (p in the sample proportion setting and both the
  mean and standard deviation in the sample mean setting). But we usually
  don't know this information -- that's why we are sampling! We'll need
  to deal with this little problem.