Math 143C
Introduction to Probability and Statistics
Fall, 2008

Review: Exam 3 covering 2.1-2.4, 2.6-2.7, Chs. 9, 10, and 12

* Two quantitative variables

  1. Scatterplots
    1. Used to discover associations between variables
    2. How to read them and construct them from data
      1. Each point corresponds to one case (individual)
      2. Explanatory variable (if there is one) should be on horizontal axis, response variable on vertical
      3. Using different symbols to indicate categorical variables (see Figure 2.2, p. 109)
    3. What to look for
      1. overall pattern (indicating an association) and deviations from it
      2. form (are there clusters?), direction (positive association, negative, or neither; note that not all positive/negative associations are linear, and not all associations are positive or negative), and strength (how closely the data points adhere to the pattern) of the relationship
      3. outliers (points that are influential in the regression, have large residuals, or lie far off the pattern)
    4. Normal quantile (or probability) plots
      1. A particular type of scatterplot, most easily produced by software, and only requiring one quantitative variable
      2. Purpose: to determine if the data values for that quantitative variable appear to be normally distributed (determined by whether or not the plot appears linear)
  2. Linear associations
    1. Correlation r
      1. Formula for r as described in classroom handout
      2. A quantitative measure of how strongly linear the relationship is
      3. Properties of r (see pp. 128-29)
      4. Role r plays in determining the regression line (see formula for slope, p. 141)
      5. What r² tells you about the amount of variation in data explained by fit (Recall: DATA = FIT + RESIDUAL)
    2. Simple linear regression
      1. Assumptions that ensure its validity
        1. Each subpopulation is normally distributed, with means that lie along some line β_0 + β_1 x
        2. There is one value σ that is the spread (standard deviation) of each subpopulation
        3. Residuals e_i are normally distributed as N(0, σ)
          1. Check by looking at histogram/normal quantile plot of residuals
        4. Residuals are independent between subpopulations
          1. Check by looking at plot of residuals vs. fits (i.e., a residual plot) and residuals vs. time/observation number (look for "no discernible pattern or trend")
          2. Patterns in these plots may reveal presence of lurking variables
      2. Regression as a predictive tool
        1. Main purpose of regression
        2. Interpretation of slope
        3. Interpolation vs. extrapolation
      3. Finding the least-squares regression line for a given sample
        1. idea behind it: pick line that gives minimum sum of squares of residuals
        2. using equations on p. 141
  3. Inference on regression
    1. Inference on β_1
      1. Interpreting β_1 = 0 as "no linear association"
      2. Determining level C confidence interval
        1. Critical value is t* with df = n - 2
        2. SE_b1 (the standard error of the sample slope b1) provided by software
      3. Test of significance
        1. null and alternative hypotheses
        2. use t(n - 2) distribution
        3. interpretation of P-value
    2. Given appropriate Minitab output, identify and interpret
      1. CIs for subpopulation mean at x = x*
      2. prediction intervals for individuals with explanatory value x = x*
    3. Be able to interpret ANOVA table output from regression
Terms to know:
explanatory and response variables, association, strength, correlation, interpolation, extrapolation, residual, slope, y-intercept, influential observations
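
For practice with the regression items above, here is a minimal Python sketch (standard library only) of the correlation, the least-squares line, and the t statistic for the slope. It assumes the standard textbook formulas (r as the average product of standardized values, slope = r·s_y/s_x, SE of the slope = s/sqrt(Σ(x − x̄)²)); the handout and p. 141 may write them differently. The function names and the sample data are made up for illustration, and this is not Minitab's implementation.

```python
import math
from statistics import mean, stdev


def least_squares(xs, ys):
    """Return (b0, b1, r) for the fitted line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # r: sum of products of standardized x and y values, divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x        # slope is determined by r and the two spreads
    b0 = y_bar - b1 * x_bar   # the line passes through (x-bar, y-bar)
    return b0, b1, r


def slope_t_statistic(xs, ys):
    """Return (t, df) for testing H0: beta_1 = 0, where df = n - 2."""
    n = len(xs)
    b0, b1, _ = least_squares(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # s estimates sigma, the common spread about the line
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    x_bar = mean(xs)
    se_b1 = s / math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    return b1 / se_b1, n - 2


# Hypothetical sample, invented only to exercise the functions
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r = least_squares(xs, ys)
t, df = slope_t_statistic(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f}x, r^2 = {r ** 2:.2f}, t = {t:.2f}, df = {df}")
```

The t statistic would then be compared against the t(n - 2) distribution, and a level C confidence interval for the slope would be b1 ± t*·SE_b1.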
* Two categorical variables
  1. Two-way tables
    1. Construction
      1. Values of explanatory variable along columns, response variable along rows
      2. Cells contain counts from sample/census
      3. Marginal distributions (“Totals” row and column) provided
    2. Conditional distributions
      1. Calculating column (also row) percents
      2. Plotting as bar graphs (compare Figure 9.2, p. 628, with data Table of Example 9.3, p. 625)
      3. Can suggest ways in which populations are different once a difference has been established (say, by a chi-square test)
  2. Determining differences between populations (inference)
    1. Blanket approach: chi-square test
      1. Hypotheses
        1. H0: “no difference in distributions of various populations represented”
        2. Ha: “there is some difference between population distributions”
          1. always a 2-sided alternative
          2. if H0 is rejected in favor of Ha, many possibilities remain open for how the populations differ; this calls for closer investigation via conditional distributions and pairwise 2-proportion methods
      2. Carrying out the test
        1. finding expected counts assuming H0 is true
        2. calculating the X² statistic
        3. determining the number of degrees of freedom
        4. finding and interpreting the associated P-value from Table F
      3. Conditions under which test is valid
        1. when the explanatory and response variables each have just two values (a 2-by-2 table): each expected count should be at least 5
        2. when the two-way table is larger than 2-by-2: each expected count should be at least 1, and the average of all expected counts should be at least 5
    2. Comparisons approach: 2-proportion methods
      1. Scenarios/types of questions for which such an approach is possible (response variable is reduced to two values: yes or no, success or failure)
      2. Allows for more detailed comparisons including CIs, tests of significance with one-sided Ha
      3. See Exam 2 review for specifics (though you will not be asked to carry out such procedures for Exam 3, you should be able to interpret their results)
Terms to know:
marginal and conditional distributions, bar graphs, expected cell counts
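
The steps of the chi-square test listed above can be sketched in Python as follows. This is an illustration under the usual definitions (expected count = row total × column total / table total; X² = sum of (observed − expected)² / expected; df = (rows − 1)(columns − 1)), not a replacement for Minitab. The function names and the 2-by-2 counts are hypothetical, and the P-value lookup in Table F is left to the reader.

```python
def chi_square(table):
    """table: list of rows of observed counts. Returns (x2, df, expected)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts assuming H0 (no difference in distributions) is true
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    x2 = sum((obs - exp) ** 2 / exp
             for obs_row, exp_row in zip(table, expected)
             for obs, exp in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


def conditions_met(expected):
    """Check the expected-count guidelines stated in the outline."""
    cells = [e for row in expected for e in row]
    if len(expected) == 2 and len(expected[0]) == 2:
        return min(cells) >= 5                      # 2-by-2 table
    return min(cells) >= 1 and sum(cells) / len(cells) >= 5


# Hypothetical counts, invented for illustration only
observed = [[10, 20],
            [20, 10]]
x2, df, expected = chi_square(observed)
print(f"X^2 = {x2:.3f}, df = {df}, conditions met: {conditions_met(expected)}")
```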
* One categorical variable, one quantitative
  1. Visual displays and comparisons
    1. Side-by-side boxplots
    2. Back-to-back stemplots
      1. Review how to read and construct
      2. Only possible to do pairwise comparisons
    3. Histograms
  2. Determining differences between groups/(sub)populations (inference)
    1. Blanket approach: one-way ANOVA
      1. Notation
        1. Sample statistics for groups and full collection
        2. Parameters: σ = spread of each group, μ_i = population mean for group i
        3. SS (SSG, SSE, SST) and MS (MSG, MSE, MST)
          1. how these represent variations/variances between groups, within groups and overall
          2. which of these falls into the DATA = FIT + RESIDUAL scheme
          3. which of these serves as the pooled estimate s_p for σ
      2. Hypotheses
        1. H0: means of all groups are equal (i.e., μ_1 = μ_2 = ... = μ_I)
        2. Ha: means are not all the same
          1. always a 2-sided alternative
          2. if H0 is rejected in favor of Ha, many possibilities remain open for how the populations differ; this calls for closer investigation via conditional distributions and pairwise 2-sample t methods
      3. One-way ANOVA test (what you should know and be able to do):
        1. Describe the assumptions that make the test valid and how to check these assumptions for a given data set
        2. Relationship between degrees of freedom and the numbers of groups and units
        3. Fill in a skeleton ANOVA table
          1. this includes giving bounds on the P-value as determined by Table E
        4. Predict which will be larger, the family error rate or the individual error rate, for pairwise comparisons (see slides 29 and 30 of the ANOVA PowerPoint presentation)
        5. Read and interpret Minitab output, including
          1. specific aspects of ANOVA table
          2. confidence intervals for group means
          3. pairwise comparisons (see below)
    2. Pairwise comparisons approach: 2-sample t methods
      1. While you may want to review 2-sample t procedures (in thinking about preparing for the Final Exam), you will not be required to carry any out for Exam 3. What you must be able to do is interpret the results of such procedures, when they are included as part of Minitab's ANOVA output.
        1. What does it mean when a CI for the difference of means includes both negative and positive numbers?
        2. How do you determine whether the confidence interval (-3.685, 0.435) on Slide 29 is for (μ_A - μ_B) or (μ_B - μ_A)?
Terms to know:
One- and two-way ANOVA, variance, between/within groups, sums of squares, multiple (or pairwise) comparisons
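
To practice filling in a skeleton ANOVA table, the following Python sketch computes the entries from raw group data. It assumes the standard one-way ANOVA definitions (SSG between groups, SSE within groups, SST = SSG + SSE, F = MSG/MSE, pooled estimate s_p = sqrt(MSE)); the function name, dictionary keys, and data are hypothetical, and bounds on the P-value would still come from Table E.

```python
import math
from statistics import mean


def one_way_anova(groups):
    """groups: list of lists of observations. Returns the ANOVA table entries."""
    n_total = sum(len(g) for g in groups)
    num_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups variation (FIT) and within-groups variation (RESIDUAL)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    dfg = num_groups - 1          # groups minus 1
    dfe = n_total - num_groups    # observations minus groups
    msg, mse = ssg / dfg, sse / dfe
    return {
        "DFG": dfg, "DFE": dfe, "DFT": n_total - 1,
        "SSG": ssg, "SSE": sse, "SST": ssg + sse,
        "MSG": msg, "MSE": mse,
        "F": msg / mse,
        "s_p": math.sqrt(mse),    # pooled estimate of sigma
    }


# Hypothetical data for three groups, invented for illustration only
table = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
for name, value in table.items():
    print(f"{name}: {value}")
```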
* Associations and Causation
    Possible explanations for associations (see Figure 2.14, p. 208)
    1. Causation
      1. Generally, a controlled, randomized experiment must be carried out to conclusively establish that an observed association is actually one of cause-and-effect
      2. Absent such an experiment, establishing causation through purely observational studies requires meeting several criteria (see the minimal list on pp. 211-212)
    2. Common response
      1. Lurking variable is responsible for both explanatory and response values
      2. Still may be possible to predict response values from explanatory ones
      3. See Examples 2.34, p. 208 and 2.36, p. 209
    3. Confounding
      1. Association can be entirely due to lurking variable (see Berkeley graduate programs example)
      2. Explanatory and lurking variables may work together to affect the response (see Examples 2.34, p. 208 and 2.37, p. 210)
      3. Lurking variable may work in opposite direction of explanatory variable, making the association appear in the opposite direction of the actual cause-and-effect relationship between explanatory and response variables (see Exercise 2.94, p. 206, an example of Simpson's Paradox)
Terms to know:
Simpson's paradox, aggregate data, lurking variables, common response, causation (cause-and-effect), confounding
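
Simpson's paradox is easiest to see with numbers. The Python sketch below uses entirely made-up admission counts (these are not the Berkeley data or the data of Exercise 2.94) in which one group has the higher rate within every department yet the lower rate in the aggregate, because the lurking variable (department) is associated with both group and outcome.

```python
# Hypothetical counts: department -> group -> (admitted, applicants)
counts = {
    "Dept 1": {"men": (80, 100), "women": (18, 20)},
    "Dept 2": {"men": (4, 20), "women": (30, 100)},
}


def rate_within(counts, dept, group):
    """Admission rate for one group inside a single department."""
    admitted, applicants = counts[dept][group]
    return admitted / applicants


def rate_aggregate(counts, group):
    """Admission rate for one group after combining all departments."""
    admitted = sum(c[group][0] for c in counts.values())
    applicants = sum(c[group][1] for c in counts.values())
    return admitted / applicants


for dept in counts:
    print(dept,
          "men:", rate_within(counts, dept, "men"),
          "women:", rate_within(counts, dept, "women"))
print("Aggregate",
      "men:", rate_aggregate(counts, "men"),
      "women:", rate_aggregate(counts, "women"))
```

In these invented counts, most women apply to the department that admits few applicants, which is what reverses the direction of the association once the data are aggregated.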
* Reading Discussion Questions for the sections to be tested
  1. Chapter 2
    1. 2.1
      2.2
      2.3
      2.4
      2.6
      2.7
  2. Chapter 9
    1. 9.1
      9.2


This page maintained by:
Thomas L. Scofield
Department of Mathematics and Statistics
Calvin College

Last Modified: Monday, 26-Jul-2004 13:10:07 EDT