DFFITS

In statistics, DFFIT and DFFITS ("difference in fit(s)") are diagnostics, first proposed in 1980,^[1] that show how influential a point is in a linear regression.

DFFIT is the change in the predicted value for a point, obtained when that point is left out of the regression:

$${\text{DFFIT}}={\widehat {y}}_{i}-{\widehat {y}}_{i(i)}$$

where ${\widehat {y}}_{i}$ and ${\widehat {y}}_{i(i)}$ are the predictions for point i with and without point i included in the regression.
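The definition can be illustrated by refitting the regression with one point removed. The following is a minimal sketch, assuming a simple one-predictor regression with an intercept; the helper names `fit_line` and `dffit` and the data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def dffit(xs, ys, i):
    """Prediction at point i from the full fit minus the prediction
    at point i from the fit that leaves point i out."""
    a_full, b_full = fit_line(xs, ys)
    a_loo, b_loo = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return (a_full + b_full * xs[i]) - (a_loo + b_loo * xs[i])


# Made-up data: the last point lies far from the trend of the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
dffit_values = [dffit(xs, ys, i) for i in range(len(xs))]
```

In this made-up data set the last point pulls its own prediction down by about 5 units, far more than any other point changes its own prediction.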

DFFITS is the Studentized DFFIT, where Studentization is achieved by dividing by the estimated standard deviation of the fit at that point:

$${\text{DFFITS}}={\frac {\text{DFFIT}}{s_{(i)}{\sqrt {h_{ii}}}}}$$

where $s_{(i)}$ is the standard error estimated without the point in question, and $h_{ii}$ is the leverage for the point.
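As a rough sketch of this Studentization, the code below computes DFFITS directly from the definition for a simple one-predictor regression with an intercept (so p = 2 is an assumption of the example). The function names and data are hypothetical; the leverage uses the standard simple-regression expression 1/n + (x_i − x̄)²/S_xx.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def dffits(xs, ys, i):
    """DFFITS for point i: DFFIT divided by s_(i) * sqrt(h_ii)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    xs_loo = xs[:i] + xs[i + 1:]
    ys_loo = ys[:i] + ys[i + 1:]
    a_loo, b_loo = fit_line(xs_loo, ys_loo)
    dffit = (a + b * xs[i]) - (a_loo + b_loo * xs[i])

    # s_(i): residual standard deviation of the leave-one-out fit,
    # with (n - 1) - 2 degrees of freedom (two fitted parameters).
    sse_loo = sum((y - (a_loo + b_loo * x)) ** 2
                  for x, y in zip(xs_loo, ys_loo))
    s_loo = math.sqrt(sse_loo / (n - 1 - 2))

    # h_ii: leverage of point i in the full simple regression.
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    h_ii = 1.0 / n + (xs[i] - x_bar) ** 2 / sxx

    return dffit / (s_loo * math.sqrt(h_ii))


# Made-up data: the last point lies far from the trend of the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
dffits_values = [dffits(xs, ys, i) for i in range(len(xs))]
```

Because the remaining five points lie almost exactly on a line, $s_{(i)}$ is very small when the last point is removed, which makes its DFFITS very large in magnitude.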

DFFITS also equals the product of the externally Studentized residual ( $t_{i(i)}$ ) and the leverage factor ( ${\sqrt {h_{ii}/(1-h_{ii})}}$ ):^[2]

$${\text{DFFITS}}=t_{i(i)}{\sqrt {\frac {h_{ii}}{1-h_{ii}}}}$$
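The equivalence of the two expressions can be sketched in a few lines, assuming an ordinary least-squares fit and writing $e_{i}=y_{i}-{\widehat {y}}_{i}$ for the ordinary residual (a symbol introduced here for the derivation only). The standard leave-one-out identity $y_{i}-{\widehat {y}}_{i(i)}=e_{i}/(1-h_{ii})$ gives

$${\text{DFFIT}}={\widehat {y}}_{i}-{\widehat {y}}_{i(i)}={\frac {e_{i}}{1-h_{ii}}}-e_{i}={\frac {h_{ii}\,e_{i}}{1-h_{ii}}}$$

Dividing by $s_{(i)}{\sqrt {h_{ii}}}$ and using the usual definition of the externally Studentized residual, $t_{i(i)}=e_{i}/\left(s_{(i)}{\sqrt {1-h_{ii}}}\right)$, then yields

$${\text{DFFITS}}={\frac {h_{ii}\,e_{i}}{(1-h_{ii})\,s_{(i)}{\sqrt {h_{ii}}}}}={\frac {e_{i}}{s_{(i)}{\sqrt {1-h_{ii}}}}}{\sqrt {\frac {h_{ii}}{1-h_{ii}}}}=t_{i(i)}{\sqrt {\frac {h_{ii}}{1-h_{ii}}}}$$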

Thus, for low-leverage points DFFITS is expected to be small, whereas as the leverage approaches 1 the distribution of DFFITS widens without bound.

For a perfectly balanced experimental design (such as a factorial design or a balanced partial factorial design), the leverage of every point is p/n, the number of parameters divided by the number of points. In the Gaussian case the DFFITS values are then distributed as ${\sqrt {p \over n-p}}\approx {\sqrt {p \over n}}$ times a t variate, where the approximation holds when n is much larger than p. On this basis, the authors who proposed the diagnostic suggest investigating points whose DFFITS exceeds $2{\sqrt {p \over n}}$ in absolute value.
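The suggested cutoff is straightforward to apply once DFFITS values are available. The sketch below is illustrative only: the function name `flag_influential` and the DFFITS values are hypothetical, and p is taken to be the number of fitted parameters, including the intercept.

```python
import math


def flag_influential(dffits_values, p):
    """Return the indices of points whose |DFFITS| exceeds 2*sqrt(p/n)."""
    n = len(dffits_values)
    cutoff = 2.0 * math.sqrt(p / n)
    return [i for i, d in enumerate(dffits_values) if abs(d) > cutoff]


# Hypothetical DFFITS values for a fit with n = 6 points and p = 2 parameters.
example_dffits = [-0.31, 0.12, 0.45, -0.08, 1.40, -1.25]
flagged = flag_influential(example_dffits, p=2)
```

Here the cutoff is 2·√(2/6) ≈ 1.155, so the points at indices 4 and 5 are flagged. The comparison uses the absolute value because a large negative DFFITS is as noteworthy as a large positive one.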

Although their raw values differ, Cook's distance and DFFITS are conceptually identical, and a closed-form formula converts one value into the other.^[3]
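One commonly quoted form of this conversion is $D_{i}={\text{DFFITS}}_{i}^{2}\,s_{(i)}^{2}/(p\,s^{2})$, where $D_{i}$ is Cook's distance and $s^{2}$ is the residual variance of the full fit; these symbols and the formula are stated here as an assumption rather than taken from the cited source. The sketch below compares the converted value with Cook's distance computed from its definition, using made-up data, hypothetical helper names, and a simple one-predictor regression (p = 2).

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def sse(xs, ys, a, b):
    """Sum of squared residuals of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
n, p, i = len(xs), 2, 5  # examine the last point

a, b = fit_line(xs, ys)
xs_loo, ys_loo = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
a_loo, b_loo = fit_line(xs_loo, ys_loo)

s2_full = sse(xs, ys, a, b) / (n - p)                   # s^2
s2_loo = sse(xs_loo, ys_loo, a_loo, b_loo) / (n - 1 - p)  # s_(i)^2

x_bar = sum(xs) / n
h_ii = 1.0 / n + (xs[i] - x_bar) ** 2 / sum((x - x_bar) ** 2 for x in xs)

dffit_i = (a + b * xs[i]) - (a_loo + b_loo * xs[i])
dffits_i = dffit_i / math.sqrt(s2_loo * h_ii)

# Cook's distance from its definition: squared change in all fitted
# values when point i is removed, scaled by p * s^2.
cooks_direct = sum(((a + b * x) - (a_loo + b_loo * x)) ** 2
                   for x in xs) / (p * s2_full)

# Cook's distance obtained by converting DFFITS.
cooks_from_dffits = dffits_i ** 2 * s2_loo / (p * s2_full)
```

The two quantities agree up to floating-point rounding, which reflects that the diagnostics differ only in how the same change in fit is scaled.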

[1] Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. pp. 11–16. ISBN 0-471-05856-4.
[2] Montgomery, Douglas C.; Peck, Elizabeth A.; Vining, G. Geoffrey (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley. p. 218. ISBN 978-0-470-54281-1. Retrieved 22 February 2013. "Thus, DFFITS_i is the value of R-student multiplied by the leverage of the ith observation [h_ii/(1 − h_ii)]^1/2."
[3] Cohen, Jacob; Cohen, Patricia; West, Stephen G.; Aiken, Leona S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. ISBN 0-8058-2223-2.
