- What is the main use of a regression line (in
particular, of its formula y = b0 + b1 x)?
We mainly want a formula such as this to use
as a prediction tool - that is, to find the
y value we would expect to go along with
a given x value. Such predictions are
usually more reliable for interpolations (i.e.,
predictions for x values that are between
the lowest and highest x-values in the
data set) than for extrapolations (i.e.,
predictions for x values somewhat outside
of our data set).
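To make this concrete, here is a minimal sketch in Python. The coefficients b0 and b1 and the data range are made up for illustration; the point is only that the same formula is more trustworthy inside the observed x-range than outside it.

```python
# Hypothetical fitted line: predicted y = b0 + b1 * x
# (these coefficient values are invented for illustration).
b0, b1 = 2.0, 3.0

def predict(x):
    """Predicted y for a given x using the regression formula."""
    return b0 + b1 * x

# Interpolation: x = 4 lies inside a supposed data range of [1, 10].
print(predict(4))   # 14.0
# Extrapolation: x = 50 lies far outside that range, so this
# prediction is computed the same way but is far less reliable.
print(predict(50))  # 152.0
```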
-
If correlation determines how strongly linear
a relationship between two quantitative variables
is, what does regression do? Does the data have
to look linear in its form in order to be able
to perform regression?
Regression determines the actual line that best
fits the data. It is a process that one does not
perform very often on data that does not appear
linear, but there is actually nothing stopping
one from doing so except the knowledge that the
resulting "best-fit line" is not likely to be very
useful as a prediction tool.
-
Suppose you had a data set of n individuals
which included two quantitative variables. Let us
suppose that these two variables have a linear
relationship - that is, that the first one, let's
call it x, does a decent job of explaining
the values of the other variable y. If you
wanted to find the equation of the regression line by hand,
how would you go about it? Mention specifically
which formulas you would use, the sequence you would
use them in, and on which page(s) of your textbook
the formulas are found.
I will employ the often-effective mathematical
technique of working backwards.
The regression line has an equation of the form
y = b0 + b1 x,
where, using the formulas found on p. 141 (and
notice here that, where I am using the symbols
b0 and b1, your authors are using a
and b) we have
b1 = r (sy / sx)   and   b0 = ȳ - b1 x̄.
Notice that you would have to calculate b1 before
b0 (Can you see why?), and to calculate either
one requires that you first calculate the values of
r, sx, sy, x̄ and ȳ. We can
calculate r using the formula at the bottom of
p. 127. The quantities x̄ and ȳ are
just sample means (i.e., averages), which are
calculated in the usual way (add up the value of
y for each individual and divide by the number
n of individuals; do similarly for the x values;
the formula is on p. 41). Once we know x̄
and ȳ, we can calculate the sample standard
deviations sx and sy after the manner of the
formula on p. 51. Summarizing, we would use the
formulas mentioned to calculate x̄ and ȳ,
then sx and sy, then r (although r could
be found at any point in the process so far), then
b1 and finally b0.
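The sequence of steps above can be sketched in Python. The small data set here is invented purely to have something to compute with; the formulas (sample means, sample standard deviations with an n - 1 divisor, correlation as the average product of z-scores, then b1 = r sy/sx and b0 = ȳ - b1 x̄) follow the order described.

```python
import math

# Hypothetical data set of n individuals (x explains y reasonably well).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

# Step 1: sample means x-bar and y-bar.
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Step 2: sample standard deviations sx and sy (n - 1 in the divisor).
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Step 3: correlation r, the average product of the paired z-scores.
r = sum((x - x_bar) / sx * (y - y_bar) / sy
        for x, y in zip(xs, ys)) / (n - 1)

# Step 4: slope b1 = r * sy / sx (must come before the intercept).
b1 = r * sy / sx

# Step 5: intercept b0 = y-bar - b1 * x-bar (needs b1, hence the order).
b0 = y_bar - b1 * x_bar

print(b1, b0)
```

Note that the code mirrors the "working backwards" logic: everything b0 needs is computed before it, and b1 in turn waits on r, sx, and sy.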
-
Suppose you knew two variables x and y
were in a linear relationship and that the predicted
value of y was 13.2 when the value of x
was 5. If you wanted to know the predicted
y-value when x = 5.5, then which would be
necessary to know: the y-intercept b0?
the slope b1? both? neither? Explain.
All that you really would need is the slope b1.
Since b1 represents the
rise/run along the predicting
line, you could determine the "rise" from 13.2 to
the unknown y by multiplying the "run" from
5 to 5.5 by the slope. In symbols, this says
y - 13.2 = b1 (5.5 - 5) = (0.5) b1,
and it would be easy enough to get the value of
y if you knew the value of b1.
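As a quick numerical sketch of this rise-over-run reasoning (the slope value below is assumed for illustration, since the problem does not give one):

```python
# Known from the problem: predicted y is 13.2 when x = 5.
x0, y0 = 5.0, 13.2

# Assumed slope, purely for illustration.
b1 = 2.0

# rise = slope * run, so the new prediction is the old one plus b1 * 0.5.
y_new = y0 + b1 * (5.5 - x0)
print(y_new)  # 14.2
```

The intercept b0 never appears in the calculation, which is the point of the answer.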