Top Ten Lessons of Probability and Statistics

  1. You can often learn the most about a data set through an appropriate graph/picture/summary.
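    For example, a minimal sketch (in Python, using made-up exam scores) of the kind of numerical summary that often accompanies a good graph:

      import statistics

      # Hypothetical sample: exam scores for 12 students (made-up data)
      scores = sorted([55, 61, 67, 70, 72, 74, 75, 78, 81, 84, 90, 97])

      print("n        :", len(scores))
      print("min, max :", scores[0], scores[-1])
      print("median   :", statistics.median(scores))
      print("quartiles:", statistics.quantiles(scores, n=4))   # Q1, Q2, Q3
      print("mean     :", round(statistics.mean(scores), 2))
      print("std dev  :", round(statistics.stdev(scores), 2))  # sample s.d.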

  2. Data collected poorly cannot be repaired using good statistical analysis.

    One must make every attempt to get a representative sample, keeping an eye open for sources of bias and trying to eliminate them (acknowledging any that could not be eliminated). Our inferential procedures have been predicated on the assumption (among others) that the samples involved are SRSs. While it is possible to get a representative sample using other methods (multistage and/or stratified sampling, for example), it is beyond the scope of this course to discuss inference in such a setting; in practice, a statistician should be consulted.
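    As a small illustration (Python, with a hypothetical sampling frame), drawing a simple random sample so that every individual has the same chance of selection:

      import random

      # Hypothetical sampling frame: 5000 numbered individuals
      frame = list(range(1, 5001))

      random.seed(1)                    # fixed seed for a reproducible illustration
      srs = random.sample(frame, k=25)  # a simple random sample (SRS) of size 25
      print(sorted(srs))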

  3. Random behavior may not be predictable in the short term, but it is in the long run.

    Probability is simply the long-term relative frequency of an event. Many people make the mistake of expecting it to hold as a short-term relative frequency.
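    A minimal simulation (Python) of this idea: the relative frequency of heads in fair-coin flips is erratic after a few flips but settles near 0.5 in the long run.

      import random

      random.seed(0)  # fixed seed for a reproducible illustration

      heads = 0
      for flip in range(1, 10001):
          heads += random.random() < 0.5   # one fair-coin flip (True counts as 1)
          if flip in (10, 100, 1000, 10000):
              print(f"after {flip:5d} flips, relative frequency of heads = {heads / flip:.3f}")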

  4. A frequent goal of statistics is to draw conclusions about a population parameter based on a statistic.

    The key to this type of inference is being able to identify an appropriate probability model for the sampling distribution of the sample statistic.
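    One way to see what a sampling distribution is: repeatedly draw samples from a known (here hypothetical) population and record the statistic each time. A sketch in Python:

      import random
      import statistics

      random.seed(2)

      # Hypothetical skewed population with mean about 10
      population = [random.expovariate(1 / 10) for _ in range(100_000)]

      # Draw many samples of size n = 40 and record each sample mean
      sample_means = [statistics.mean(random.sample(population, k=40))
                      for _ in range(2000)]

      # These 2000 sample means approximate the sampling distribution of x-bar;
      # its center and spread are what our inference procedures rely on.
      print("mean of sample means:", round(statistics.mean(sample_means), 2))
      print("s.d. of sample means:", round(statistics.stdev(sample_means), 2))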

  5. Probabilities for the common distributions that arise in various settings (in particular, the probability models for sampling distributions) come from finding areas under various density curves: the normal, binomial, Student t, F, and chi-square distributions.

    The central limit theorem establishes the special importance of the normal distribution, saying that it can frequently be used to approximate the sampling distribution of a sample mean. (Note: We have also used it to approximate the sampling distribution of a sample proportion, because a proportion is itself a type of mean: a mean of n individual counts.) While we learned how to calculate binomial probabilities by hand, we saw that this gets tedious as the sample size n becomes large (say, when n > 6). Probabilities for other distributions can only be calculated using sophisticated mathematical methods (calculus or higher), and we are content to rely on the work of others as summarized in tables.
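    A sketch (Python, with made-up numbers) comparing an exact binomial calculation with the normal approximation suggested by the central limit theorem:

      from math import comb, erf, sqrt

      n, p = 50, 0.30   # hypothetical: 50 trials, success probability 0.30

      # Exact P(X <= 12), summing the binomial formula term by term
      exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(13))

      # Normal approximation with mean np and s.d. sqrt(np(1-p)),
      # using a continuity correction at 12.5
      mu, sigma = n * p, sqrt(n * p * (1 - p))
      z = (12.5 - mu) / sigma
      approx = 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF evaluated at z

      print(f"exact binomial P(X <= 12) = {exact:.4f}")
      print(f"normal approximation      = {approx:.4f}")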

  6. Many of the tests we use arise from certain approximating assumptions about the data.

    It is possible to carry out the procedures even if these assumptions aren't met, but results may not be trustworthy. The more robust the procedure, the more lenient we can be about assumptions. (Note: The t procedures, including ANOVA, are quite robust. Thus, in ANOVA we can require only that the largest S.D. be less than twice the smallest when, in fact, what is assumed is that all S.D.s are equal.) Often it is the presence of outliers, extreme skewness, sample sizes that are too small, or a combination of these that will invalidate a test.
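    A tiny sketch (Python, with hypothetical group data) of the rule of thumb mentioned above for ANOVA:

      import statistics

      # Hypothetical response measurements from three treatment groups
      groups = {
          "A": [12.1, 13.4, 11.8, 12.9, 13.0],
          "B": [14.2, 15.1, 13.8, 14.9, 15.5],
          "C": [11.5, 12.8, 13.9, 12.2, 14.0],
      }

      sds = {name: statistics.stdev(data) for name, data in groups.items()}
      print("sample s.d.s:", {name: round(s, 2) for name, s in sds.items()})

      # Rule of thumb: treat the equal-s.d. assumption as acceptable
      # if the largest sample s.d. is less than twice the smallest.
      if max(sds.values()) < 2 * min(sds.values()):
          print("rule of thumb satisfied: ANOVA's equal-s.d. assumption looks reasonable")
      else:
          print("rule of thumb violated: be cautious about relying on ANOVA here")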

  7. Exploring data is a good way to discover relationships, but the practice should not be abused.

    Few samples are truly representative of their populations in every respect, and most sample data sets, if looked at from enough angles, will have interesting features. Interesting relationships are often discovered in this fashion, but one should not use the same sample both to guide the formulation of questions and then to answer them by doing inference on that sample. If one sample leads to the question, another sample should be taken to try to answer it. In formulating an alternative hypothesis for a test of significance, once again the data should not play a role in deciding whether we use a two-sided or a one-sided alternative. Have the question in mind beforehand, or else use the two-sided alternative.
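    A small simulation (Python) of why the same sample should not both suggest a question and answer it: scan enough pure-noise variables and some will look "significant" purely by chance.

      import random
      from math import sqrt

      random.seed(3)
      n, num_variables = 30, 40
      flagged = []

      # Each "variable" is pure noise: n standard-normal measurements whose
      # true population mean is exactly 0, so any "finding" is a false alarm.
      for v in range(1, num_variables + 1):
          data = [random.gauss(0, 1) for _ in range(n)]
          z = (sum(data) / n) / (1 / sqrt(n))   # z statistic for H0: mu = 0
          if abs(z) > 1.96:                     # "significant" at the 5% level
              flagged.append(v)

      print(f"{len(flagged)} of {num_variables} noise variables looked significant:", flagged)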

  8. Another frequent goal (see also point 4 above) of statistics is to try to decide if there is a significant association between variables based on the evidence of an association observed in a sample.

    Our first exposure to inference procedures (CIs and tests of significance) came while investigating just one variable (see Chapters 5 and 6, Sections 7.1 and 8.1). While one can answer interesting questions using such methods, this first exposure can be thought of as preparatory to the more-often-used methods for comparing two or more populations: paired-t (late in Section 7.1), 2-sample t (Section 7.2), 2-proportion (Section 8.2), multiple (pairwise) comparison, regression, chi-square, and ANOVA procedures.
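    A short sketch of one of these procedures, a 2-sample t test, using SciPy (the data and treatment labels are hypothetical):

      from scipy import stats

      # Hypothetical yields (bushels/acre) under two fertilizer treatments
      treatment = [51.2, 48.9, 53.4, 50.1, 52.8, 49.7, 54.0, 51.5]
      control   = [47.3, 49.1, 46.8, 48.4, 45.9, 47.7, 48.8, 46.2]

      # Welch's two-sample t test (does not assume equal variances)
      result = stats.ttest_ind(treatment, control, equal_var=False)
      print(f"t = {result.statistic:.2f}, two-sided P-value = {result.pvalue:.4f}")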

  9. Association should not be confused with cause-and-effect.

    Discovering a strong association can be very useful for prediction, even if the explanatory variable doesn't cause the responses measured. Where the average consumer of statistics runs amok is in not understanding the difference between association and causation, not understanding what is required to establish the latter, and not having the ability to identify whether a study described in the news is observational or an experiment.

  10. Statistics, those who generate them, and those who understand what is involved in their analysis have a role to play in society.

    Love it or hate it, statistical evidence is used to make important decisions every day in business, government, medicine, etc. Those who understand the use of statistics can do great good or harm to us all (and it needn't always be in accordance with their motives). People often do not ask questions, at least not the right questions, when presented with statistical evidence for adopting some policy. We who say we seek social justice and the renewal of creation, we who have at least some knowledge of statistical methods and the questions one should ask, need to put feet to our faith when love for God and God's Creation calls for it.