Math 143 C/E, Spring 2001

IPS Reading Questions
Chapter 9, Section 1 (up to "Beyond the basics", p. 632)

How is this material on inference for 2-way tables related to, and an extension of, the 2-proportion inference procedures we learned in Section 8.2?
In Section 8.2, we were considering two categorical variables: one of these variables was in the forefront and took on two values (which we generically classified as "Successes" and "Failures"); the other categorical variable was "Population" which, because we were considering two populations (women and men, children trained for 6 months on the piano and children who weren't, etc.), took on just two values as well. Thus, a 2-way table of information for problems of Section 8.2 would have been 2-by-2 (2 rows and 2 columns, excluding the ones for "Totals"). See Reading Discussion Question number 1 from Section 8.2 for an example of a two-way table for a problem from that section.
In fact, the inference procedure of this section (the chi-square test) is an alternate way of testing the null hypothesis that the two populations are the same, which was precisely the null hypothesis used in the significance tests of Section 8.2. The chi-square test is not quite as flexible as the 2-sample z (2-proportion) test we learned, since the alternative hypothesis is always 2-sided. Nevertheless, when the alternative is 2-sided, both tests should yield the same answer (the same P-value).
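The agreement between the two tests can be checked numerically. What follows is a minimal sketch, not a procedure from the text: the counts (30 successes out of 100 in one sample, 45 out of 120 in the other) are made up for illustration, the function names are my own, and the chi-square statistic is taken to be the usual sum of (observed - expected)^2 / expected over the four cells. For a 2-by-2 table the chi-square statistic has 1 degree of freedom, so its P-value can be computed from the normal distribution using only Python's standard library.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled 2-proportion z statistic with its 2-sided P-value."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # 2-sided P-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
    return z, math.erfc(abs(z) / math.sqrt(2))

def chi_square_2x2(x1, n1, x2, n2):
    """Chi-square statistic (1 df) and P-value for the same data as a 2-by-2 table."""
    observed = [[x1, x2], [n1 - x1, n2 - x2]]   # rows: successes, failures
    row_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    col_totals = [n1, n2]                        # columns: the two samples
    n = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 df, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts, chosen only for illustration
z, p_z = two_proportion_z(30, 100, 45, 120)
x2, p_x2 = chi_square_2x2(30, 100, 45, 120)
print(f"z^2 = {z**2:.6f}, X^2 = {x2:.6f}")
print(f"P (z test) = {p_z:.6f}, P (chi-square) = {p_x2:.6f}")
```

The two printed statistics should match (the chi-square statistic is the square of z), and so should the two P-values, up to floating-point rounding.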
Now, in Section 9.1, we're opening up the possibility that our two categorical variables may take on more than two values. This added flexibility will come at the cost of no longer being able to do any sort of confidence interval. The chi-square test, however, provides us the tools to perform a test of significance. If we follow the convention of placing the values of our explanatory ("population"-like) variable at the heads of columns, then the columns will give conditional distributions that correspond to samples from the populations being considered. Our null hypothesis will always be that there is no real difference in the distributions of these populations. Differences may, indeed, appear in our (column) conditional distributions, but that isn't too surprising, since the data come from a random sample (or random samples). The main question of the test of significance for 2-way tables is whether the lack of sameness between our sampled distributions is strong enough evidence to make us doubt the sameness of the populations they represent.

Just as when we used the normal approximation to a binomial distribution, the chi-square distribution is only approximately accurate and should be used with attention paid to whether it will give a close approximation in a given setting. What does the text say are "safe" situations (ones where the chi-square test should be relatively accurate)?
It will be pretty accurate for 2-by-2 tables when the expected count of each cell is at least 5. For larger tables, we should be safe if no cell has an expected count less than 1 and the average of the expected counts (over all cells) is at least 5.
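These guidelines are easy to check mechanically. Here is a minimal sketch, assuming the table is given as a list of rows of observed counts (without the totals); the function names and the example tables are hypothetical, and the expected counts use the (row total)(column total)/n formula.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_is_safe(observed):
    """Apply the guidelines above to a table of observed counts."""
    cells = [e for row in expected_counts(observed) for e in row]
    if len(observed) == 2 and len(observed[0]) == 2:
        # 2-by-2 table: every expected count must be at least 5
        return min(cells) >= 5
    # Larger table: no expected count below 1, and average at least 5
    return min(cells) >= 1 and sum(cells) / len(cells) >= 5

# Hypothetical tables, for illustration only
print(chi_square_is_safe([[20, 30], [25, 25]]))          # 2-by-2, large counts
print(chi_square_is_safe([[1, 9], [2, 8]]))              # 2-by-2, small counts
print(chi_square_is_safe([[10, 12], [8, 9], [1, 2]]))    # 3-by-2 table
```

Since the expected counts always add up to n, the average expected count is simply n divided by the number of cells.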

In what sense should a certain count in a cell be expected? To date, I have provided no justification for the term "expected count", let alone a reason behind the formula. Try this exercise, forgetting any formula that you know. Below is a 2-way table where all of the individual cells have been left blank. The only information you have is the marginal totals. Operating under the assumption that the two populations (columns) are no different (that is, that they have the same conditional distributions), try to fill in the counts for each cell. (Note: the totals should work out for both columns and rows.)

	Gender
Smoking Status	Female	Male	Total
Non-smoker			261
Smoker			37
Total	175	123

First, the answers:
	Gender
Smoking Status	Female	Male	Total
Non-smoker	153.27	107.73	261
Smoker	21.73	15.27	37
Total	175	123	298
Distributions being equal doesn't mean that counts will be, since more women (175) were included in the sample than men (123). That's the first lesson. It does mean, however, that proportions will be equal. Overall, the proportion of non-smokers to the total is 261/298, or approximately 87.58%. If the two population distributions are the same, we would expect this percentage of females and this percentage of males to be non-smokers, which means that the expected count of female non-smokers would be 87.58% of the 175 females questioned, or (261/298) · 175, which is about 153.27. A similar type of reasoning would yield the counts of all cells. Notice that the calculation above works out to be precisely the formula you learned for expected counts:
(row total)(column total) / n.
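The filled-in table can be reproduced from the marginal totals alone. The short sketch below applies the formula to every cell of the smoking example; the variable names are my own, and the totals are the ones given in the exercise.

```python
# Marginal totals from the smoking-status exercise
row_totals = {"Non-smoker": 261, "Smoker": 37}
col_totals = {"Female": 175, "Male": 123}
n = sum(row_totals.values())  # grand total; equals the sum of the column totals

# Expected count for each cell: (row total)(column total) / n
expected = {
    (row, col): r_total * c_total / n
    for row, r_total in row_totals.items()
    for col, c_total in col_totals.items()
}

for (row, col), count in expected.items():
    print(f"{row:<10} {col:<6} {count:7.2f}")
```

Each row of expected counts adds up to its row total and each column to its column total, which is the check the exercise asks for.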
