- What is the main use of a regression line (in
particular, of its formula y = b0 + b1 x)?
We mainly want a formula such as this to use
as a prediction tool - that is, to find the
y value we would expect to go along with
a given x value. Such predictions are
usually more reliable for interpolations (i.e.,
predictions for x values that are between
the lowest and highest x-values in the
data set) than for extrapolations (i.e.,
predictions for x values somewhat outside
of our data set).
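To make this concrete, here is a minimal sketch in Python. The coefficients b0 and b1 and the data range are made up for illustration; the point is only that the same formula is more trustworthy inside the observed x-range than outside it.

```python
# Hypothetical fitted line: predicted y = b0 + b1 * x
# (these coefficient values are invented for illustration).
b0, b1 = 2.0, 3.0

def predict(x):
    """Predicted y for a given x using the regression formula."""
    return b0 + b1 * x

# Interpolation: x = 4 lies inside a supposed data range of [1, 10].
print(predict(4))   # 14.0
# Extrapolation: x = 50 lies far outside that range, so this
# prediction is computed the same way but is far less reliable.
print(predict(50))  # 152.0
```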
-
If correlation determines how strongly linear
a relationship between two quantitative variables
is, what does regression do? Does the data have
to look linear in its form in order to be able
to perform regression?
Regression determines the actual line that best
fits the data. It is a process that one does not
perform very often on data that does not appear
linear, but there is actually nothing stopping
one from doing so except the knowledge that the
resulting "best-fit line" is not likely to be very
useful as a prediction tool.
-
Suppose you had a data set of n individuals
which included two quantitative variables. Let us
suppose that these two variables have a linear
relationship - that is, that the first one, let's
call it x, does a decent job of explaining
the values of the other variable y. If you
wanted to find the equation of the regression line by hand,
how would you go about it? Mention specifically
which formulas you would use, the sequence you would
use them in, and on which page(s) of your textbook
the formulas are found.
I will employ the often-effective mathematical
technique of working backwards.
The regression line has an equation of the form
y = b0 + b1 x,
where, using the formulas found on p. 141 (and
notice here that, where I am using the symbols
b0 and b1, your authors are using a
and b) we have
b1 = r (sy / sx)   and   b0 = ȳ - b1 x̄.
Notice that you would have to calculate b1 before
b0 (Can you see why?), and to calculate either
one requires that you first calculate the values of
r, sx, sy, x̄ and ȳ. We can
calculate r using the formula at the bottom of
p. 127. The quantities x̄ and ȳ are
just sample means (i.e., averages), which are
calculated in the usual way (add up the value of
y for each individual and divide by the number
n of individuals; do similarly for the x values;
the formula is on p. 41). Once we know x̄
and ȳ, we can calculate the sample standard
deviations sx and sy after the manner of the
formula on p. 51. Summarizing, we would use the
formulas mentioned to calculate x̄ and ȳ,
then sx and sy, then r (although r could
be found at any point in the process so far), then
b1 and finally b0.
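The sequence of steps above can be sketched in Python. The small data set here is invented purely to have something to compute with; the formulas (sample means, sample standard deviations with an n - 1 divisor, correlation as the average product of z-scores, then b1 = r sy/sx and b0 = ȳ - b1 x̄) follow the order described.

```python
import math

# Hypothetical data set of n individuals (x explains y reasonably well).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

# Step 1: sample means x-bar and y-bar.
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Step 2: sample standard deviations sx and sy (n - 1 in the divisor).
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Step 3: correlation r, the average product of the paired z-scores.
r = sum((x - x_bar) / sx * (y - y_bar) / sy
        for x, y in zip(xs, ys)) / (n - 1)

# Step 4: slope b1 = r * sy / sx (must come before the intercept).
b1 = r * sy / sx

# Step 5: intercept b0 = y-bar - b1 * x-bar (needs b1, hence the order).
b0 = y_bar - b1 * x_bar

print(b1, b0)
```

Note that the code mirrors the "working backwards" logic: everything b0 needs is computed before it, and b1 in turn waits on r, sx, and sy.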
-
Suppose you knew two variables x and y
were in a linear relationship and that the predicted
value of y was 13.2 when the value of x
was 5. If you wanted to know the predicted
y-value when x = 5.5, then which would be
necessary to know: the y-intercept b0?
the slope b1? both? neither? Explain.
All that you really would need is the slope b1.
Since b1 represents the
rise/run along the predicting
line, you could determine the "rise" from 13.2 to
the unknown y by multiplying the "run" from
5 to 5.5 by the slope. In symbols, this says
y - 13.2 = b1 (5.5 - 5) = (0.5) b1,
and it would be easy enough to get the value of
y if you knew the value of b1.
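As a quick numerical sketch of this rise-over-run reasoning (the slope value below is assumed for illustration, since the problem does not give one):

```python
# Known from the problem: predicted y is 13.2 when x = 5.
x0, y0 = 5.0, 13.2

# Assumed slope, purely for illustration.
b1 = 2.0

# rise = slope * run, so the new prediction is the old one plus b1 * 0.5.
y_new = y0 + b1 * (5.5 - x0)
print(y_new)  # 14.2
```

The intercept b0 never appears in the calculation, which is the point of the answer.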