In statistics, multicollinearity or collinearity is a situation where the predictors in a regression model are linearly dependent.
Perfect multicollinearity refers to a situation where the predictive variables have an exact linear relationship. When there is perfect collinearity, the design matrix has less than full rank, and therefore the moment matrix cannot be inverted. In this situation, the parameter estimates of the regression are not well-defined, as the system of equations has infinitely many solutions.
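A minimal numerical sketch of this situation, using NumPy and synthetic data (the variable names, sample size, and coefficients are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical data: the third predictor is an exact linear combination
# of the first two, so the design matrix is rank-deficient.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                           # perfect collinearity
X = np.column_stack([np.ones(n), x1, x2, x3])

print(np.linalg.matrix_rank(X))        # 3, less than the 4 columns

# Two different coefficient vectors give identical fitted values,
# so the least-squares problem has infinitely many solutions.
beta_a = np.array([1.0, 2.0, 3.0, 0.0])
beta_b = np.array([1.0, 1.0, 2.0, 1.0])   # shift weight onto x3
print(np.allclose(X @ beta_a, X @ beta_b))  # True
```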
Imperfect multicollinearity refers to a situation where the predictive variables have a nearly exact linear relationship.
Contrary to popular belief, neither the Gauss–Markov theorem nor the more common maximum-likelihood justification for ordinary least squares relies on any assumption about the correlation structure among the predictors[1][2][3] (although perfect collinearity can cause problems with some software).
There is no justification for the practice of removing collinear variables as part of regression analysis,[1][4][5][6][7] and doing so may constitute scientific misconduct. Including collinear variables does not reduce the predictive power or reliability of the model as a whole,[6] and does not reduce the accuracy of coefficient estimates.[1]
High collinearity indicates that it is exceptionally important to include all collinear variables, as excluding any of them will cause worse coefficient estimates, strong confounding (omitted-variable bias), and downward-biased estimates of standard errors.[2]
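The following simulation sketches this point with synthetic data (all names, coefficients, and sample sizes are illustrative): fitting the full model with two highly correlated predictors still recovers the true coefficients, while dropping one of them biases the estimate of the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Two strongly correlated predictors, both with a true effect on y.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # corr(x1, x2) is close to 1
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full model: both collinear predictors included.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print(beta_full)        # roughly [0, 2, 3] -- the true values are recovered

# Omitting x2: the estimate for x1 absorbs x2's effect (omitted-variable bias).
X_short = np.column_stack([np.ones(n), x1])
beta_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(beta_short)       # slope is roughly 2 + 3*0.95, not 2
```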
To diagnose the degree of collinearity in a dataset, the variance inflation factor (VIF) can be computed for each predictor variable.
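As an illustrative sketch (a manual NumPy computation rather than a specific library routine), the VIF of each predictor can be obtained by regressing it on the remaining predictors and taking 1/(1 − R²):

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (plus an intercept).
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: x3 is nearly a linear combination of x1 and x2,
# so the VIFs of all three predictors are large.
rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.1 * rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])))
```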