In this section, we explain the definition of multi-collinearity in multivariate analysis and how to avoid it. The issue appeared in the articles titled "Time to first cigarette and hypertension in Korean male smokers" and "Barrier factors to the completion of diabetes education in Korean diabetic adult patients: Korea National Health and Nutrition Examination Surveys 2007-2012", published in September 2015 by Lee et al.1) and by Kim et al.,2) respectively.
WHAT IS MULTI-COLLINEARITY AND HOW CAN IT BE CHECKED?
Multi-collinearity means that the independent (explanatory) variables in a multiple (logistic) regression analysis are not mutually independent but are linearly correlated with one another. Some degree of association among the independent variables is inevitable in multivariate analysis. However, when the association between independent variables is extremely high, some coefficients or their standard errors cannot be estimated correctly; that is, either no coefficients can be obtained or the results show extremely large standard errors. In these cases, we say, "We could not obtain proper estimates from the multivariate model due to the multi-collinearity (near-linear dependency)."3)
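The effect on standard errors can be illustrated with a small simulation. The following is a minimal sketch, not taken from the cited articles: the data are simulated, and the helper `ols_standard_errors` and all variable names are hypothetical. It fits the same linear model twice, once with two unrelated predictors and once with a second predictor that is almost a copy of the first, and compares the standard errors of the estimated coefficients.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Fit ordinary least squares with an intercept.

    Returns the coefficient estimates and their standard errors.
    (Hypothetical helper written for this illustration.)
    """
    design = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)  # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

# Simulated data (assumption: purely illustrative values).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)             # unrelated to x1
x2_coll = x1 + 0.01 * rng.normal(size=n)  # nearly identical to x1
noise = rng.normal(size=n)

y_indep = 1.0 + 2.0 * x1 + 3.0 * x2_indep + noise
y_coll = 1.0 + 2.0 * x1 + 3.0 * x2_coll + noise

_, se_indep = ols_standard_errors(np.column_stack([x1, x2_indep]), y_indep)
_, se_coll = ols_standard_errors(np.column_stack([x1, x2_coll]), y_coll)

print("SE of x1 coefficient, unrelated predictors:", se_indep[1])
print("SE of x1 coefficient, near-collinear predictors:", se_coll[1])
```

In the near-collinear fit the standard error of the coefficient of `x1` is far larger, although the sample size and the noise are the same in both fits.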
The most popular measure used to check for multi-collinearity is the variance inflation factor (VIF). The VIF of an independent variable $x_j$ is defined as follows: $\mathrm{VIF}_j = (1 - R_j^2)^{-1}$, where $R_j^2$ is the coefficient of determination obtained when $x_j$ is regressed on all the remaining independent variables. If $x_j$ is nearly orthogonal to the remaining independent variables, $R_j^2$ is small and $\mathrm{VIF}_j$ is close to unity, while if $x_j$ is nearly dependent on some subset of the remaining independent variables, $R_j^2$ is near unity and $\mathrm{VIF}_j$ is large. Practical experience indicates that if any of the VIFs exceeds 5 or 10, it is a sure sign that the associated regression coefficients are poorly estimated because of multi-collinearity.4)
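The VIF definition above can be computed directly by regressing each independent variable on the others. The following is a minimal sketch under stated assumptions: the function `variance_inflation_factors` is a hypothetical helper written for this illustration (it is not the routine of any particular statistical package), and the data are simulated so that the first and third variables are nearly linearly dependent.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    R_j^2 comes from regressing column j on all remaining columns
    (with an intercept). Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Simulated data (assumption: illustrative only).
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)             # unrelated to x1
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
for name, vif in zip(["x1", "x2", "x3"], vifs):
    flag = "check for multi-collinearity" if vif > 10 else "ok"
    print(f"{name}: VIF = {vif:.2f} ({flag})")
```

Here `x1` and `x3` should be flagged by the rule of thumb of 10, while `x2`, which is unrelated to the other two, should have a VIF close to 1.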
HOW TO AVOID MULTI-COLLINEARITY?
The simplest and most intuitive way to avoid multi-collinearity is to use only independent variables that have low correlation with each other. First, these independent variables can be chosen subjectively. For instance, when a researcher has to choose between body mass index (BMI) and body weight, and the original research question focused on BMI, BMI should be the independent variable, regardless of which of the two has the higher correlation with the dependent variable.
Second, from a statistical point of view, a researcher can select the variable that has the highest correlation with the dependent variable. The simplest way is to compare the correlations of the competing independent variables with the dependent variable. The easiest method for selecting variables without multi-collinearity is to apply the stepwise variable selection offered by multivariate statistical analysis programs. However, if a researcher relies only on statistical methods for variable selection, the final result may be far from the researcher's original intention or may be clinically unexplainable. Alternatively, ridge regression can handle independent variables with multi-collinearity. However, ridge regression is generally not used because its estimated coefficients are biased and the method is not easy to understand.
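To make the ridge regression remark concrete, the following is a minimal sketch under stated assumptions. It uses the standard closed-form ridge estimate $(X^{\top}X + \lambda I)^{-1}X^{\top}y$ on standardized predictors. The helpers `standardize` and `ridge_fit`, the penalty value `lam=1.0`, and the simulated data are all hypothetical choices made for this illustration, not a recommendation from the cited articles.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^(-1) X'y.

    lam = 0 gives ordinary least squares. Assumes X is standardized
    and y is centered, so no intercept is needed.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (assumption: illustrative only); x2 is nearly a copy of x1.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

Xs = standardize(np.column_stack([x1, x2]))
yc = y - y.mean()

beta_ols = ridge_fit(Xs, yc, lam=0.0)    # unpenalized, unstable under collinearity
beta_ridge = ridge_fit(Xs, yc, lam=1.0)  # penalized, shrunk toward zero

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The penalty shrinks the coefficient vector, which stabilizes the estimates for the two near-duplicate predictors at the cost of bias. This is the trade-off described in the text above.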