Review: Exam 1
Variables (univariate data)
- Differences between
- quantitative and categorical variables
- variables and values (ex. gender is a variable,
female/male are values)
- continuous and discrete (quantitative) variables
- Data sets (single-variable)
- Visual displays of data (specific)
- Bar graphs: provides distribution of categorical data
- Pie charts: display categorical data; unusable if
categories displayed do not account for 100% of cases
- Stemplots
- provides distribution of quantitative data
- all values must be rounded to same power of ten (same digit)
- leaf should be last digit only
- may have to subdivide according to second-to-last digit (end
of stem) to get useful depiction (see Figure 1.3, p. 12)
- Histograms
- provides distribution of quantitative data
- importance of uniform bin size
- effect of changes in bin size
- frequency vs. relative frequency
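The histogram points above can be tried on a small example. This is a minimal sketch, assuming a made-up data set and hypothetical helper names (`histogram_counts`, `relative_frequencies`); it counts values in equal-width bins, shows how the picture changes with bin width, and converts frequencies to relative frequencies.

```python
def histogram_counts(data, start, width):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    indices = [int((x - start) // width) for x in data]
    counts = {}
    for k in range(min(indices), max(indices) + 1):  # keep empty bins too
        counts[start + k * width] = indices.count(k)
    return counts

def relative_frequencies(counts):
    """Divide each bin count by the total number of observations."""
    total = sum(counts.values())
    return {left_edge: c / total for left_edge, c in counts.items()}

# Hypothetical data, made up for illustration only
data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 12]

wide = histogram_counts(data, start=0, width=5)    # few, wide bins
narrow = histogram_counts(data, start=0, width=2)  # many, narrow bins
wide_rel = relative_frequencies(wide)

print(wide)      # keys are the left edges of the bins
print(narrow)
print(wide_rel)
```

With the wide bins the data look like one lump with a small tail; the narrow bins show two separate peaks and a gap. The relative frequencies always sum to 1, whatever the bin width.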
- Boxplots
- quantitative data
- goes hand in hand with 5-number summary
- Displaying outliers via 1.5*IQR criterion
- Timeplots: time is horizontal axis (unlike those
graphs which display distributions); reveals trends
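As a check on hand computations for boxplots, here is a minimal sketch of the 5-number summary and the 1.5*IQR criterion. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the overall median when the count is odd); other quartile conventions give slightly different answers. The data and function names are made up for illustration.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations below the median position
    upper = xs[half + (n % 2):]   # observations above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def suspected_outliers(data):
    """Flag values more than 1.5*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical sample, made up for illustration only
sample = [2, 4, 5, 7, 8, 9, 11, 12, 30]
summary = five_number_summary(sample)
outliers = suspected_outliers(sample)
print(summary)   # (2, 4.5, 8, 11.5, 30)
print(outliers)  # [30]
```

Here the IQR is 11.5 - 4.5 = 7, so the fences sit at -6 and 22, and only the value 30 would be drawn as a separate point on a modified boxplot.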
- Visual displays of data (general)
- Be able to produce your own by hand from the
list above, given a data set
- Overall patterns
- shape: unimodal, bimodal; skewed (to the
right or left) vs. symmetric
- outliers
- What can and cannot be learned from the various types
of displays listed above? Under what circumstances
is it better to use one over another?
- Center and spread: which summaries work only for
quantitative data? Which for categorical? When is
one summary more appropriate than another?
- Matching a graph to a statistical summary or another
plot of the same information
- Statistical summaries for a sample
- Center/spread (or variability): know meaning and computation of
- mode: only measure of center
for categorical data (no corresponding
measure of spread)
- median/IQR (quantitative data)
- mean/standard deviation (quantitative data)
- Resistant measures (What does resistance mean? Which
measures are/are not?)
- How to locate (roughly) given plot of distribution
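One way to see what resistance means is to replace a single observation with an extreme value and watch which summaries move. The sketch below uses made-up numbers and Python's standard `statistics` module; it is an illustration, not a prescribed method.

```python
from statistics import mean, median, stdev

# Hypothetical data, made up for illustration only
base = [10, 12, 13, 15, 16]
with_outlier = [10, 12, 13, 15, 160]  # largest value replaced by an extreme one

mean_shift = mean(with_outlier) - mean(base)        # the mean is pulled toward the outlier
median_shift = median(with_outlier) - median(base)  # the median does not move
sd_ratio = stdev(with_outlier) / stdev(base)        # the standard deviation inflates

print(mean_shift, median_shift, sd_ratio)
```

The median is unchanged while the mean jumps by almost 29 and the standard deviation grows more than twentyfold, which is why the median and IQR are preferred for skewed distributions or data with outliers.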
- Normal distributions N(μ, σ)
- Standardizing values (converting a value of X to a
standardized value Z) and the reverse process
- Using Table A to go back and forth between probabilities
and standardized scores
- Interpreting a probability P(a < X < b) as area
under a normal curve
- Interpretation of the standard deviation σ
- Distance from center to point where inflection
occurs
- The 68-95-99.7 rule (remember, these numbers are not
exact; you should be able to tell what they are
exactly using Table A)
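The standardizing steps and the 68-95-99.7 rule can be checked numerically. In this sketch `NormalDist` from Python's standard library stands in for Table A; the mean of 100 and standard deviation of 15 are assumed values chosen only for illustration, and the helper names are hypothetical.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Convert a value of X to a standardized value Z."""
    return (x - mu) / sigma

def unstandardize(z, mu, sigma):
    """Reverse process: convert a standardized value Z back to X."""
    return mu + z * sigma

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def prob_between(a, b, mu, sigma):
    """P(a < X < b) as the area under the normal curve between a and b."""
    za, zb = standardize(a, mu, sigma), standardize(b, mu, sigma)
    return std_normal.cdf(zb) - std_normal.cdf(za)

# Assumed parameters, for illustration only
mu, sigma = 100, 15

# Area within 1, 2, and 3 standard deviations of the mean
within = {k: prob_between(mu - k * sigma, mu + k * sigma, mu, sigma)
          for k in (1, 2, 3)}
print(within)

# Going from a probability back to a standardized score, then to X
z_975 = std_normal.inv_cdf(0.975)
x_975 = unstandardize(z_975, mu, sigma)
print(z_975, x_975)
```

The three areas come out to roughly 0.6827, 0.9545, and 0.9973, which is the sense in which 68, 95, and 99.7 are rounded figures.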
- Terms to know:
- individuals, cases, distribution, outlier, center, spread, mode,
skewed to the left/right, frequency, relative frequency, trend,
seasonal variation, resistant measure, first/third quartile,
pth percentile, mean, standard deviation, variance, median,
IQR, 5-number summary, density curve
Bivariate data
- Two quantitative variables
- Scatterplots
- Used to discover associations between variables
- How to read them and construct them from data
- Each point corresponds to one case (individual)
- Explanatory variable (if there is one) should be
on horizontal axis, response variable on vertical
- Using different symbols to indicate categorical
variables (see Figure 2.2, p. 109)
- What to look for
- overall pattern (indicating an association) and
deviations from it
- form (are there clusters?), direction (positive
association? negative? neither?; Note: not
all positive/negative associations are linear,
and not all associations are positive/negative)
and strength (how closely data points adhere
to pattern) of relationship
- outliers (ones that are influential in regression,
ones that have large residuals or lie far off the pattern)
- Linear associations
- Correlation r
- Formula for r
- A quantitative measure of how strongly linear
the relationship is
- Properties of r (see pp. 128-29)
- Role r plays in determining the regression line
(see formula for slope, p. 141)
- What r^2 tells you about the amount of
variation in data explained by fit (DATA = FIT + RESIDUAL)
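For practice with the correlation, here is a minimal sketch that computes r as the sum of products of standardized x and y values divided by n - 1. This is the usual textbook definition and is assumed to match the formula referred to above. The data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_rescaled = correlation([10 * x for x in xs], ys)  # change the units of x
r_swapped = correlation(ys, xs)                     # swap the roles of x and y

print(r, r ** 2)
```

For these numbers r is about 0.775 and r^2 is 0.6. Rescaling x or swapping the two variables leaves r unchanged, since r has no units and does not distinguish explanatory from response variables.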
- Simple linear regression
- Assumptions that ensure its validity
- Residuals are independent between subpopulations
- Patterns in residual plots may reveal presence
of lurking variables
- Regression as a predictive tool
- Main purpose of regression
- Interpretation of slope
- Interpolation vs. extrapolation
- Finding the least-squares regression line for
a given sample
- idea behind it: pick line that gives minimum
sum of squares of residuals
- using equations on p. 141
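The least-squares computation can be sketched as follows, assuming the equations referred to are the usual ones: slope b1 = r * s_y / s_x and intercept b0 = y-bar - b1 * x-bar. The data and function names are made up for illustration. The sketch also checks numerically that r^2 equals the fraction of variation in y explained by the fit.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x                # assumed form of the slope formula
    intercept = y_bar - slope * x_bar    # the line passes through (x_bar, y_bar)
    return intercept, slope, r

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squares of residuals (observed y minus predicted y)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1, r = least_squares_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum_squared_residuals(xs, ys, b0, b1)
sst = sum((y - mean(ys)) ** 2 for y in ys)   # total variation in y
explained = 1 - sse / sst                    # fraction explained by the fit

# Any other line should have a larger sum of squared residuals
sse_other = sum_squared_residuals(xs, ys, b0, b1 + 0.1)

print(b0, b1, explained, r ** 2)
```

For these numbers the line is roughly y-hat = 2.2 + 0.6x, the residuals sum to zero (up to rounding), and the explained fraction matches r^2 = 0.6. Predictions for x between 1 and 5 are interpolation; using the line well outside that range is extrapolation.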
- Terms to know:
- explanatory and response variables, association, strength, correlation,
interpolation, extrapolation, residual, slope, y-intercept,
influential observations
- Two categorical variables
- Two-way tables
- Construction
- Cells contain counts from sample/census
- Marginal distributions (use Totals row and column)
- Conditional distributions
- Calculating column/row percents
- Can suggest ways in which populations are different
- Bar charts for depicting conditional distributions
Compare Figure 9.2, p. 628, with data table of Example 9.3, p. 625;
also Figure 9.4, p. 637 with table on p. 636
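Marginal and conditional distributions can be practiced on a small two-way table. The counts below are made up for illustration (they are not taken from the textbook examples cited above), and the row and column labels are hypothetical.

```python
# Hypothetical counts, made up for illustration: rows = group, columns = outcome
table = {
    "group A": {"yes": 30, "no": 70},
    "group B": {"yes": 45, "no": 55},
}

# Totals row and column
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Marginal distribution of the column variable: column totals / grand total
marginal_outcome = {col: total / grand_total for col, total in col_totals.items()}

# Conditional distribution of outcome within each row: cell count / row total
conditional_given_row = {
    row: {col: count / row_totals[row] for col, count in cells.items()}
    for row, cells in table.items()
}

print(marginal_outcome)
print(conditional_given_row)
```

Overall 37.5% of cases are "yes", but the row percents are 30% for group A and 45% for group B. Comparing conditional distributions like these is what suggests that the two groups differ, and the row percents are the heights one would plot in side-by-side bar charts.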
- Terms to know:
- marginal and conditional distributions, bar graphs, expected cell
counts
- One categorical variable, one quantitative
- Visual displays and comparisons
- Side-by-side boxplots
- Back-to-back stemplots
- How to read and construct
- Only possible to do pairwise comparisons
- Histograms, one for each value of the categorical var.,
are also a possibility
- Associations and Causation
- Possible explanations for associations (see Figure 2.14, p. 208)
- Causation
- Generally, a controlled, randomized experiment must
be carried out to conclusively establish that
an observed association is actually one of
cause-and-effect
- Absent such an experiment, establishing causation
through purely observational studies requires meeting
several criteria (see a minimal list on pp. 211-212)
- Common response
- Lurking variable is responsible for both explanatory
and response values
- Still may be possible to predict response values from
explanatory ones
- See Examples 2.34, p. 208 and 2.36, p. 209
- Confounding
- Association can be entirely due to lurking variable
(see Berkeley graduate
programs example)
- Explanatory and lurking variables may work together
to affect the response (see Examples 2.34, p. 208
and 2.37, p. 210)
- Lurking variable may work in opposite direction of
explanatory variable, making the association
appear in the opposite direction of the
actual cause-and-effect relationship between
explanatory and response variables (see Exercise
2.94, p. 206, an example of Simpson's Paradox)
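The reversal described in the last item can be reproduced with a small table. The counts below are invented to show the effect; they are not the Berkeley data or the data from the cited exercise. Here the department plays the role of the lurking variable.

```python
# Hypothetical (admitted, applicants) counts, invented for illustration only
counts = {
    "dept 1": {"men": (80, 100), "women": (18, 20)},
    "dept 2": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applicants):
    """Proportion admitted."""
    return admitted / applicants

# Admission rates within each department
within = {dept: {group: rate(*cell) for group, cell in groups.items()}
          for dept, groups in counts.items()}

# Admission rates after aggregating over departments
aggregate = {}
for group in ("men", "women"):
    admitted = sum(counts[dept][group][0] for dept in counts)
    applicants = sum(counts[dept][group][1] for dept in counts)
    aggregate[group] = rate(admitted, applicants)

print(within)     # women have the higher rate in each department
print(aggregate)  # men have the higher rate in the aggregated data
```

Women are admitted at a higher rate in both departments (90% vs. 80%, and 30% vs. 20%), yet the aggregated rates are 40% for women and 70% for men, because most women in this made-up example applied to the department that admits few applicants.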
- Terms to know:
- Simpson's paradox, aggregate data, lurking variables, common response,
causation (cause-and-effect), confounding
Reading Discussion Questions for the sections to be tested
- Chapter 1
1.Intro
1.1
1.2
1.3
- Chapter 2
2.1
2.2
2.3
2.4
2.6
2.7
This page maintained by:
Thomas L. Scofield
Department of Mathematics and Statistics
Calvin College
Last Modified:
Monday, 26-Jul-2004 13:10:07 EDT