Regression dilution

Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the abscissa (x-axis). The steeper slope is obtained when the independent variable is on the ordinate (y-axis). By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.

Regression dilution, also known as regression attenuation, is the biasing of the linear regression slope towards zero (the underestimation of its absolute value), caused by errors in the independent variable.

Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the slope of the line. Statistical variability, measurement error or random noise in the y variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope. However, variability, measurement error or random noise in the x variable causes bias in the estimated slope (as well as imprecision). The greater the variance in the x measurement, the closer the estimated slope must approach zero instead of the true value.

Suppose the green and blue data points capture the same data, but with errors (either +1 or -1 on x-axis) for the green points. Minimizing error on the y-axis leads to a smaller slope for the green points, even if they are just a noisy version of the same data.

It may seem counter-intuitive that noise in the predictor variable x induces a bias, but noise in the outcome variable y does not. Recall that linear regression is not symmetric: the line of best fit for predicting y from x (the usual linear regression) is not the same as the line of best fit for predicting x from y.[1]

  1. ^ Draper, N.R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. p. 19. ISBN 0-471-17082-8.