Review: Exam 3 covering 2.1-2.4, 2.6-2.7, Chs. 9, 10, and 12
Two quantitative variables
- Scatterplots
- Used to discover associations between variables
- How to read them and construct them from data
- Each point corresponds to one case (individual)
- Explanatory variable (if there is one) should be
on horizontal axis, response variable on vertical
- Using different symbols to indicate categorical
variables (see Figure 2.2, p. 109)
- What to look for
- overall pattern (indicating an association) and
deviations from it
- form (are there clusters?), direction (positive
association? negative? neither?; Note: not
all positive/negative associations are linear,
and not all associations are positive/negative)
and strength (how closely data points adhere
to pattern) of relationship
- outliers (points that are influential in regression,
points with large residuals, or points that lie far off the pattern)
- Normal quantile (or probability) plots
- A particular type of scatterplot, most easily
produced by software, and only requiring
one quantitative variable
- Purpose: to determine if the data values for
that quantitative variable appear to be
normally distributed (determined by
whether or not the plot appears linear)
- Linear associations
- Correlation r
- Formula for r as described in classroom handout
- A quantitative measure of how strongly linear
the relationship is
- Properties of r (see pp. 128-29)
- Role r plays in determining the regression line
(see formula for slope, p. 141)
- What r^2 tells you about the fraction
of variation in data explained by fit (Recall:
DATA = FIT + RESIDUAL)
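The classroom handout's formula for r is not reproduced on this page, so the sketch below assumes the standard textbook form: the sum of products of standardized x and y values, divided by n - 1. The data are made up purely for illustration, and the function names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    # r = (1 / (n - 1)) * sum of (standardized x) * (standardized y)
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

# Made-up illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_squared = r ** 2  # fraction of variation in y explained by the linear fit
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

For these data r is about 0.77 and r^2 is 0.6, so the linear fit accounts for roughly 60% of the variation in y.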
- Simple linear regression
- Assumptions that ensure its validity
- Each subpopulation is normally distributed with
means that lie along some line β0 + β1x
- There is one value σ
that is the spread of each subpopulation
- Residuals εi are normally
distributed as N(0, σ)
- Check by looking at histogram/normal quantile
plot of residuals
- Residuals are independent between subpopulations
- Check by looking at plot of residuals vs.
fits (i.e., a residual plot) and
residuals vs. time/observation number
(look for "no discernible pattern or
trend")
- Patterns in these plots may reveal presence
of lurking variables
- Regression as a predictive tool
- Main purpose of regression
- Interpretation of slope
- Interpolation vs. extrapolation
- Finding the least-squares regression line for
a given sample
- idea behind it: pick line that gives minimum
sum of squares of residuals
- using equations on p. 141
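A minimal sketch of finding the least-squares line, assuming the equations on p. 141 are the usual ones: slope b1 = r * s_y / s_x and intercept b0 = y-bar - b1 * x-bar. The data are made up for illustration and the function name is hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(x, y):
    # Returns (b0, b1) for the fitted line y-hat = b0 + b1 * x.
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x        # slope
    b0 = y_bar - b1 * x_bar   # intercept: line passes through (x-bar, y-bar)
    return b0, b1

# Made-up illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimizes
print(f"y-hat = {b0:.2f} + {b1:.2f} x, sum of squared residuals = {sse:.2f}")
```

The residuals from a least-squares line always sum to zero (up to rounding), which is a quick check on hand calculations.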
- Inference on regression
- Inference on β1
- Interpreting β1 = 0 as no linear association
- Determining level C confidence interval
- Critical value is t* with
df = n - 2
- SE_b1 provided by software
- Test of significance
- null and alternative hypotheses
- use t(n - 2) distribution
- interpretation of P-value
- Given appropriate Minitab output, identify
and interpret
- CIs for subpopulation mean at x = x*
- prediction intervals for individuals with
explanatory value x = x*
- Be able to interpret ANOVA table output from regression
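The slope inference above can be sketched as follows. On the exam SE_b1 comes from software output; here it is computed with the standard formula s / sqrt(sum of (x - x-bar)^2), where s = sqrt(SSE / (n - 2)), so that the example is self-contained. The data are made up, and the critical value t* = 3.182 is the 95% value for df = 3 as read from a t table.

```python
from math import sqrt
from statistics import mean

# Made-up illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx            # least-squares slope
b0 = y_bar - b1 * x_bar   # least-squares intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))   # estimate of sigma, with df = n - 2
se_b1 = s / sqrt(sxx)     # standard error of the slope

t_star = 3.182            # 95% critical value for df = n - 2 = 3 (t table)
lower = b1 - t_star * se_b1
upper = b1 + t_star * se_b1

t_stat = b1 / se_b1       # test statistic for H0: slope = 0, uses t(n - 2)
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f}), t = {t_stat:.3f}")
```

Here the interval contains 0 and |t| is smaller than t*, so these few points do not give significant evidence of a linear association at the 5% level, even though the sample slope is positive.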
Terms to know:
explanatory and response variables, association, strength, correlation,
interpolation, extrapolation, residual, slope, y-intercept,
influential observations
Two categorical variables
- Two-way tables
- Construction
- Values of explanatory variable along columns, response
variable along rows
- Cells contain counts from sample/census
- Marginal distributions (Totals row and
column) provided
- Conditional distributions
- Calculating column (also row) percents
- Plotting as bar graphs (compare Figure 9.2, p. 628, with
data Table of Example 9.3, p. 625)
- Can suggest ways in which populations are different
once a difference has been established (say,
by a chi-square test)
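A small sketch of computing column percents (the conditional distribution of the response within each value of the explanatory variable). The counts and labels are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts: rows are response values, columns are explanatory values.
table = {
    "yes": {"group A": 30, "group B": 10},
    "no":  {"group A": 20, "group B": 40},
}

def column_percents(table):
    # Divide each cell by its column total and express as a percent.
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        response: {c: 100 * row[c] / col_totals[c] for c in columns}
        for response, row in table.items()
    }

percents = column_percents(table)
for response, row in percents.items():
    print(response, {c: f"{p:.1f}%" for c, p in row.items()})
```

Each column of percents sums to 100, so the columns can be compared directly even when the groups have different sizes.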
- Determining differences between populations (inference)
- Blanket approach: chi-square test
- Hypotheses
- H0: no difference in distributions
of various populations represented
- Ha: there is some difference
between population distributions
- always a 2-sided alternative
- if accepted, leaves open many
possibilities for
how the populations differ - calls
for closer investigation
via conditional distributions,
pairwise 2-proportion methods
- Carrying out the test
- finding expected counts assuming
H0 is true
- calculating the X^2 statistic
- determining the number of degrees of freedom
- finding and interpreting the associated
P-value from Table F
- Conditions under which test is valid
- when the explanatory and response variables each
have just two values: each expected count
should be at least 5
- when two-way table is larger than 2-by-2: each
expected count should be at least 1 and
the average of all expected counts at
least 5
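The steps of the chi-square test and the validity conditions above can be sketched as follows, using hypothetical counts. Expected counts use (row total * column total) / n, the statistic is the sum of (observed - expected)^2 / expected, and df = (rows - 1)(columns - 1). The P-value is left to Table F, as in the outline.

```python
# Hypothetical two-way table of counts (rows = response, columns = explanatory).
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts assuming H0 (no difference in distributions) is true.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Validity conditions as stated in the outline.
cells = [e for row in expected for e in row]
if len(observed) == 2 and len(observed[0]) == 2:
    conditions_ok = all(e >= 5 for e in cells)
else:
    conditions_ok = all(e >= 1 for e in cells) and sum(cells) / len(cells) >= 5

print(f"X^2 = {x2:.3f}, df = {df}, conditions met: {conditions_ok}")
```

With these counts X^2 is about 16.67 on 1 degree of freedom; the next step would be to bracket the P-value using the df = 1 row of the chi-square table.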
- Comparisons approach: 2-proportion methods
- Scenarios/types of questions for which such
an approach is possible (response variable
is reduced to two values: yes or no, success
or failure)
- Allows for more detailed comparisons including
CIs, tests of significance with one-sided
Ha
- See Exam 2 review for specifics (though you will
not be asked to carry out such procedures
for Exam 3, you should be able to interpret
their results)
Terms to know:
marginal and conditional distributions, bar graphs, expected cell
counts
One categorical variable, one quantitative
- Visual displays and comparisons
- Side-by-side boxplots
- Back-to-back stemplots
- Review how to read and construct
- Only possible to do pairwise comparisons
- Histograms
- Determining differences between groups/(sub)populations (inference)
- Blanket approach: one-way ANOVA
- Notation
- Sample statistics for groups and full collection
- Parameters:
σ = spread of each group,
μi = population mean for group i
- SS (SSG, SSE, SST) and MS (MSG, MSE, MST)
- how these represent variations/variances
between groups, within groups and
overall
- which of these falls into the
DATA = FIT + RESIDUAL scheme
- which of these serves as the pooled
estimate s_p for σ
- Hypotheses
- H0: means of all groups are equal
(i.e., μ1 = μ2 = ... = μI)
- Ha: means are not all the same
- always a 2-sided alternative
- if accepted, leaves open many possibilities
for how the populations differ - calls
for closer investigation via conditional
distributions, pairwise 2-sample t
methods
- One-way ANOVA test (what you should know and be able
to do):
- Describe the assumptions that make the test valid
and how to check these assumptions for a
given data set
- Relationship between degrees of freedom and
nos. of groups/units
- Fill in a skeleton ANOVA table;
this includes giving bounds on the P-value
as determined by Table E
- Predict which will be larger, the family or
individual error rates for pairwise comparisons
(see slides 29 and 30 from the ANOVA Powerpoint
presentation)
- Read and interpret Minitab output, including
- specific aspects of ANOVA table
- confidence intervals for group means
- pairwise comparisons (see below)
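To practice filling in a skeleton ANOVA table, it can help to see every entry computed from raw data. The sketch below uses three small hypothetical groups; the variable names are illustrative only. It shows that SSG + SSE = SST, that the degrees of freedom are I - 1, N - I and N - 1, and that the pooled estimate s_p is the square root of MSE. Bounds on the P-value would still come from Table E.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: I = 3 groups with 3 observations each.
groups = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [8, 9, 10],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)
N = len(all_values)   # total number of observations
I = len(groups)       # number of groups

# Sums of squares: between groups (FIT), within groups (RESIDUAL), total (DATA).
ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
sst = sum((v - grand_mean) ** 2 for v in all_values)

# Degrees of freedom.
dfg, dfe, dft = I - 1, N - I, N - 1

# Mean squares, F statistic, and pooled standard deviation.
msg = ssg / dfg
mse = sse / dfe
f_stat = msg / mse
s_p = sqrt(mse)

print("Source  DF  SS     MS     F")
print(f"Groups  {dfg}   {ssg:<6.2f} {msg:<6.2f} {f_stat:.2f}")
print(f"Error   {dfe}   {sse:<6.2f} {mse:<6.2f}")
print(f"Total   {dft}   {sst:<6.2f}")
```

Because SS and DF each add across the Groups and Error rows, any one missing entry in a skeleton table can be recovered from the others.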
- Pairwise comparisons approach: 2-sample t methods
While you may want to review 2-sample t procedures
(in thinking about preparing for the Final Exam),
you will not be required to carry any out for Exam 3.
What you must be able to do is interpret the results
of such procedures, when they are included as part
of Minitab's ANOVA output.
- What does it mean when a CI for the difference
of means includes both negative and positive
numbers?
- How do you determine whether the confidence
interval (-3.685, 0.435) on Slide
29 is for (μA - μB) or (μB - μA)?
Terms to know:
One- and two-way ANOVA, variance, between/within groups, sums of squares,
multiple (or pairwise) comparisons
Associations and Causation
Possible explanations for associations (see Figure 2.14, p. 208)
- Causation
- Generally, a controlled, randomized experiment must
be carried out to conclusively establish that
an observed association is actually one of
cause-and-effect
- Absent such an experiment, establishing causation
through purely observational studies requires
meeting several criteria; see the minimal list on pp. 211-212.
- Common response
- Lurking variable is responsible for both explanatory
and response values
- Still may be possible to predict response values from
explanatory ones
- See Examples 2.34, p. 208 and 2.36, p. 209
- Confounding
- Association can be entirely due to lurking variable
(see Berkeley graduate
programs example)
- Explanatory and lurking variables may work together
to affect the response (see Examples 2.34, p. 208
and 2.37, p. 210)
- Lurking variable may work in opposite direction of
explanatory variable, making the association
appear in the opposite direction of the
actual cause-and-effect relationship between
explanatory and response variables (see Exercise
2.94, p. 206, an example of Simpson's Paradox)
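The reversal described in the last point can be seen with a few lines of arithmetic. The counts below are invented for illustration (they are not the Berkeley data or the data of Exercise 2.94): option X has the higher success rate within each level of a lurking variable, yet the lower rate once the levels are aggregated.

```python
# Invented counts: (successes, total) for each option within each level
# of a lurking variable ("easy" vs. "hard" cases).
counts = {
    "X": {"easy": (18, 20), "hard": (30, 80)},
    "Y": {"easy": (64, 80), "hard": (5, 20)},
}

def rate(successes, total):
    return successes / total

def aggregate_rate(strata):
    # Success rate after lumping all levels of the lurking variable together.
    successes = sum(s for s, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return successes / total

for level in ("easy", "hard"):
    rx = rate(*counts["X"][level])
    ry = rate(*counts["Y"][level])
    print(f"{level}: X = {rx:.1%}, Y = {ry:.1%}")

agg_x = aggregate_rate(counts["X"])
agg_y = aggregate_rate(counts["Y"])
print(f"aggregated: X = {agg_x:.1%}, Y = {agg_y:.1%}")
```

X wins in both the easy and hard cases but loses in the aggregate, because most of X's cases are hard ones while most of Y's are easy ones.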
Terms to know:
Simpson's paradox, aggregate data, lurking variables, common response,
causation (cause-and-effect), confounding
Reading Discussion Questions for the sections to be tested
- Chapter 2
2.1
2.2
2.3
2.4
2.6
2.7
- Chapter 9
9.1
9.2
This page maintained by:
Thomas L. Scofield
Department of Mathematics and Statistics
Calvin College
Last Modified:
Monday, 26-Jul-2004 13:10:07 EDT