Math 143C
Introduction to Probability and Statistics
Fall, 2008

Review: Exam 1

* Variables (univariate data)

  1. Differences between
    1. quantitative and categorical variables
    2. variables and values (ex. gender is a variable, female/male are values)
    3. continuous and discrete (quantitative) variables
  2. Data sets (single-variable)
    1. Visual displays of data (specific)
      1. Bar graphs: provide the distribution of categorical data
      2. Pie charts: categorical data — unusable if categories displayed do not reflect 100% of cases
      3. Stemplots
        1. provide the distribution of quantitative data
        2. all values must be rounded to same power of ten (same digit)
        3. leaf should be last digit only
        4. may have to subdivide according to second-to-last digit (end of stem) to get useful depiction (see Figure 1.3, p. 12)
      4. Histograms
        1. provide the distribution of quantitative data
        2. importance of uniform bin size
        3. effect of changes in bin size
        4. frequency vs. relative frequency
      5. Boxplots
        1. quantitative data
        2. goes hand in hand with 5-number summary
        3. Displaying outliers via 1.5*IQR criterion
      6. Timeplots: time is on the horizontal axis (unlike the graphs above, which display distributions); reveal trends
    2. Visual displays of data (general)
      1. Be able to produce your own by hand from the list above, given a data set
      2. Overall patterns
        1. shape: unimodal, bimodal; skewed (to the right or left) vs. symmetric
        2. outliers
      3. What can and cannot be learned from the various types of displays listed above? Under what circumstances is it better to use one over another?
      4. Center and spread: which summaries work only for quantitative data? Which for categorical? When is one summary more appropriate than another?
      5. Matching a graph to a statistical summary or another plot of the same information
    3. Statistical summaries for a sample
      1. Center/spread (or variability): know meaning and computation of
        1. mode: only measure of “center” for categorical data (no corresponding measure of spread)
        2. median/IQR (quantitative data)
        3. mean/standard deviation (quantitative data)
      2. Resistant measures (What does resistance mean? Which measures are/are not?)
      3. How to locate (roughly) given plot of distribution
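
As a rough illustration of the summaries above, the sketch below computes a 5-number summary, applies the 1.5*IQR criterion, and compares resistant and non-resistant measures of center. It assumes quartiles are the medians of the lower and upper halves of the sorted data (the overall median is left out when the count is odd); other conventions exist and can give slightly different quartiles. The function names and the sample values are made up for illustration.

```python
from statistics import mean, median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data, excluding the overall median when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def suspected_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Made-up sample, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 50]

summary = five_number_summary(sample)    # (1, 2.5, 5, 7.5, 50)
outliers = suspected_outliers(sample)    # [50]

# Resistance: dropping the outlier moves the mean far more than the median.
trimmed = [x for x in sample if x not in outliers]
with_outlier = (mean(sample), median(sample))       # mean is about 9.56, median is 5
without_outlier = (mean(trimmed), median(trimmed))  # both are 4.5
```

Here the single large value pulls the mean from 4.5 up to about 9.56 while the median only moves from 4.5 to 5, which is what it means for the median to be a resistant measure.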
  3. Normal distributions N(μ, σ)
    1. Standardizing values (converting a value of X to a standardized value Z) and the reverse process
    2. Using Table A to go back and forth between probabilities and standardized scores
    3. Interpreting a probability P(a < X < b) as area under a normal curve
    4. Interpretation of the standard deviation σ
      1. Distance from center to point where inflection occurs
      2. The 68-95-99.7 rule (remember, these numbers are not exact; you should be able to tell what they are exactly using Table A)
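
The Normal calculations above can be sketched in a few lines. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for Table A; the helper names and the example mean of 100 and standard deviation of 15 are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def standardize(x, mu, sigma):
    """Convert a value x of X to its standardized value z."""
    return (x - mu) / sigma

def unstandardize(z, mu, sigma):
    """Reverse process: convert a standardized value z back to x."""
    return mu + z * sigma

def normal_probability(a, b, mu, sigma):
    """P(a < X < b): area under the Normal curve between a and b."""
    z_a = standardize(a, mu, sigma)
    z_b = standardize(b, mu, sigma)
    return STANDARD_NORMAL.cdf(z_b) - STANDARD_NORMAL.cdf(z_a)

# Hypothetical example: X is Normal with mean 100 and standard deviation 15.
z = standardize(130, 100, 15)          # 2.0
x = unstandardize(z, 100, 15)          # 130.0

# Going from a probability back to a standardized score (Table A in reverse).
z_975 = STANDARD_NORMAL.inv_cdf(0.975)  # about 1.96

# The 68-95-99.7 rule, computed instead of rounded:
# about 0.6827, 0.9545 and 0.9973 for 1, 2 and 3 standard deviations.
within = {k: normal_probability(-k, k, 0, 1) for k in (1, 2, 3)}
```

The values in `within` show how the rounded 68-95-99.7 figures compare with the areas a table lookup would give.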
    Terms to know:
    individuals, cases, distribution, outlier, center, spread, mode, skewed to the left/right, frequency, relative frequency, trend, seasonal variation, resistant measure, first/third quartile, pth percentile, mean, standard deviation, variance, median, IQR, 5-number summary, density curve
* Bivariate data
  1. Two quantitative variables
    1. Scatterplots
      1. Used to discover associations between variables
      2. How to read them and construct them from data
        1. Each point corresponds to one case (individual)
        2. Explanatory variable (if there is one) should be on horizontal axis, response variable on vertical
        3. Using different symbols to indicate categorical variables (see Figure 2.2, p. 109)
      3. What to look for
        1. overall pattern (indicating an association) and deviations from it
        2. form (are there clusters?), direction (positive association? negative? neither? Note: not all positive/negative associations are linear, and not all associations are positive or negative), and strength (how closely the data points adhere to the pattern) of the relationship
        3. outliers (ones that are influential in regression, ones that have large residuals, or ones that lie far off the pattern)
    2. Linear associations
      1. Correlation r
        1. Formula for r
        2. A quantitative measure of how strongly linear the relationship is
        3. Properties of r (see pp. 128-29)
        4. Role r plays in determining the regression line (see formula for slope, p. 141)
        5. What r² tells you about the fraction of variation in the response explained by the fit (DATA = FIT + RESIDUAL)
      2. Simple linear regression
        1. Assumptions that ensure its validity
          1. Residuals are independent between subpopulations
              Patterns in residual plots may reveal the presence of lurking variables
        2. Regression as a predictive tool
          1. Main purpose of regression
          2. Interpretation of slope
          3. Interpolation vs. extrapolation
        3. Finding the least-squares regression line for a given sample
          1. idea behind it: pick line that gives minimum sum of squares of residuals
          2. using equations on p. 141
    Terms to know:
    explanatory and response variables, association, strength, correlation, interpolation, extrapolation, residual, slope, y-intercept, influential observations
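
The correlation and least-squares ideas above can be checked numerically. The sketch below assumes the standard formulas: r as the average product of standardized x and y values (dividing by n - 1), slope equal to r times s_y / s_x, and intercept equal to y-bar minus slope times x-bar. Whether these match the textbook equations cited above term for term is an assumption; the function names and the data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    slope = correlation(xs, ys) * stdev(ys) / stdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]   # explanatory variable
ys = [2, 4, 5, 4, 5]   # response variable

r = correlation(xs, ys)                        # about 0.775
intercept, slope = least_squares_line(xs, ys)  # about 2.2 and 0.6

# DATA = FIT + RESIDUAL: residual = observed y minus predicted y.
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# r squared as the fraction of variation in y explained by the fit.
total_variation = sum((y - mean(ys)) ** 2 for y in ys)
unexplained = sum(e ** 2 for e in residuals)
fraction_explained = 1 - unexplained / total_variation  # equals r ** 2
```

Predicting at x = 3 (inside the observed range) is interpolation; predicting at x = 50 with the same line would be extrapolation, and the code gives no warning that the line may not hold there.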
  2. Two categorical variables
    1. Two-way tables
      1. Construction
        1. Cells contain counts from sample/census
        2. Marginal distributions (use “Totals” row and column)
      2. Conditional distributions
        1. Calculating column/row percents
        2. Can suggest ways in which populations are different
    2. Bar charts for depicting conditional distributions
      1. Compare Figure 9.2, p. 628, with data table of Example 9.3, p. 625; also Figure 9.4, p. 637 with table on p. 636
    Terms to know:
    marginal and conditional distributions, bar graphs, expected cell counts
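
A two-way table is small enough to work by hand, but the bookkeeping can be sketched as follows. The counts, the row and column labels, and the function names are all hypothetical; the table is stored as a dictionary of rows, each mapping a column label to a count.

```python
# Hypothetical two-way table of counts: rows are one categorical
# variable, columns are the other.
table = {
    "group A": {"yes": 30, "no": 20},
    "group B": {"yes": 10, "no": 40},
}

def row_totals(table):
    """Entries of the 'Totals' column: one total per row."""
    return {row: sum(cells.values()) for row, cells in table.items()}

def column_totals(table):
    """Entries of the 'Totals' row: one total per column."""
    totals = {}
    for cells in table.values():
        for col, count in cells.items():
            totals[col] = totals.get(col, 0) + count
    return totals

def proportions(counts):
    """Convert a dictionary of counts into proportions of their sum."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Marginal distributions come from the totals.
row_marginal = proportions(row_totals(table))        # A: 0.5, B: 0.5
column_marginal = proportions(column_totals(table))  # yes: 0.4, no: 0.6

# Conditional distributions of the column variable given each row
# (row percents, expressed here as proportions).
conditional_given_row = {row: proportions(cells) for row, cells in table.items()}
# group A: yes 0.6, no 0.4; group B: yes 0.2, no 0.8
```

The two conditional distributions differ noticeably (0.6 versus 0.2 answering "yes"), which is the kind of comparison that suggests the two groups are different.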
  3. One categorical variable, one quantitative
      Visual displays and comparisons
      1. Side-by-side boxplots
      2. Back-to-back stemplots
        1. How to read and construct
        2. Only possible to do pairwise comparisons
      3. Histograms, one for each value of the categorical var., are also a possibility
  4. Associations and Causation
      Possible explanations for associations (see Figure 2.14, p. 208)
      1. Causation
        1. Generally, a controlled, randomized experiment must be carried out to conclusively establish that an observed association is actually one of cause-and-effect
        2. Absent such an experiment, establishing causation through purely observational studies requires meeting several criteria; see the minimal list on pp. 211-212.
      2. Common response
        1. Lurking variable is responsible for both explanatory and response values
        2. Still may be possible to predict response values from explanatory ones
        3. See Examples 2.34, p. 208 and 2.36, p. 209
      3. Confounding
        1. Association can be entirely due to lurking variable (see Berkeley graduate programs example)
        2. Explanatory and lurking variables may work together to affect the response (see Examples 2.34, p. 208 and 2.37, p. 210)
        3. Lurking variable may work in opposite direction of explanatory variable, making the association appear in the opposite direction of the actual cause-and-effect relationship between explanatory and response variables (see Exercise 2.94, p. 206, an example of Simpson's Paradox)
    Terms to know:
    Simpson's paradox, aggregate data, lurking variables, common response, causation (cause-and-effect), confounding
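
Simpson's paradox is easiest to see with numbers. The counts below are invented for illustration (they are not the Berkeley data or the textbook exercise): women are admitted at a higher rate than men within each program, yet at a lower rate once the programs are aggregated, because most women applied to the program that is harder to get into. The program labels and function names are hypothetical.

```python
# Invented counts: program -> group -> (admitted, applied).
applications = {
    "easy program": {"men": (60, 80), "women": (16, 20)},
    "hard program": {"men": (4, 20), "women": (20, 80)},
}

def admission_rate(admitted, applied):
    return admitted / applied

def aggregate(data, group):
    """Add up admitted and applied counts for one group over all programs."""
    admitted = sum(by_group[group][0] for by_group in data.values())
    applied = sum(by_group[group][1] for by_group in data.values())
    return admitted, applied

# Rates within each program (the lurking variable is the program applied to).
within_program = {
    program: {group: admission_rate(*counts) for group, counts in by_group.items()}
    for program, by_group in applications.items()
}
# easy program: men 0.75, women 0.80; hard program: men 0.20, women 0.25

# Rates in the aggregate data, ignoring the program.
overall = {group: admission_rate(*aggregate(applications, group))
           for group in ("men", "women")}
# men 0.64, women 0.36: the direction of the association reverses
```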
* Reading Discussion Questions for the sections to be tested
  1. Chapter 1
    1. 1.Intro
    2. 1.1
    3. 1.2
    4. 1.3
  2. Chapter 2
    1. 2.1
    2. 2.2
    3. 2.3
    4. 2.4
    5. 2.6
    6. 2.7

This page maintained by:
Thomas L. Scofield
Department of Mathematics and Statistics
Calvin College

Last Modified: Monday, 26-Jul-2004 13:10:07 EDT