INTRODUCTION
Missing values are endemic across health and social studies.1) Missing data reduce statistical power and the representativeness of the sample and can bias the results, leading to misinterpretation.2) Socioeconomic status (SES) variables such as education, income, marital status, and occupation are often left unanswered in large public data sets. In most cases, the missingness in SES variables is related to certain characteristics of the individuals surveyed.3) However, this issue is rarely acknowledged as a limitation of studies and has seldom been discussed specifically in the academic literature.
The Korea National Health and Nutrition Examination Survey (KNHANES) provides primary care physicians with a rich source of data for studying the relationships between health and SES. In this report, the use of SES variables (i.e., education, household income, marital status, and occupation) in Korean Journal of Family Medicine (KJFM) original articles based on KNHANES data was reviewed. Rates of missing SES variables in the 4th KNHANES were estimated for each variable used alone and for the variables used in combination. Finally, other SES characteristics related to the omission of household income and occupation were assessed.
METHODS
This study was composed of two main parts. The first part was a detailed hand search to select KJFM articles published from 2007 to 2011 that used KNHANES as their primary source of data. The methods and results sections of each relevant article were carefully reviewed, and the SES variables used in the univariate and multivariate analyses were checked.
In the second part of the study, rates of missing SES variables (i.e., education, household income, marital status, and occupation) were estimated from the Health Interview Survey (HIS) of the 4th KNHANES among all men and women aged more than 19 years. The HIS consisted of four components: the household core, the sample adult core, the sample adolescent core, and the sample child core. The household core component included the household income and marital status of individuals, and the sample adult core component included their educational and occupational classification. Detailed descriptions of the plan and operation of the survey are available on the KNHANES website (http://knhanes.cdc.go.kr/).
Educational levels were categorized as less than elementary school graduate, middle school graduate, high school graduate, and college graduate. To calculate the household income level, the mean monthly household income was divided by the square root of the number of household members and then classified into quartiles. Marital status (married and living with a partner, divorced or separated, widowed, or unmarried) was also included. Occupation was classified using the KNHANES system, a modified version of the Korean Standard Classification of Occupation, 6th revision (2007), supplemented by an indicator reflecting unemployment status.
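The income adjustment described above can be illustrated with a minimal Python sketch. It is not the authors' Stata code: the function names and the household values are hypothetical, and the quartile cut points are computed without sampling weights, so they would not necessarily match the cut points used in KNHANES.

```python
import math
import statistics


def equivalized_income(monthly_income, household_size):
    """Monthly household income divided by the square root of household size."""
    if household_size < 1:
        raise ValueError("household_size must be at least 1")
    return monthly_income / math.sqrt(household_size)


def income_quartile(value, cut_points):
    """Return 1 (lowest) to 4 (highest) given three ascending cut points."""
    return 1 + sum(value > c for c in cut_points)


# Hypothetical households: (mean monthly income, number of members).
households = [(100, 1), (200, 4), (300, 1), (400, 4),
              (900, 9), (800, 4), (500, 1), (1200, 4)]

equivalized = [equivalized_income(income, size) for income, size in households]

# Unweighted cut points (an assumption of this sketch, not of the survey).
cut_points = statistics.quantiles(equivalized, n=4)
quartiles = [income_quartile(v, cut_points) for v in equivalized]
```

A household with a missing income value cannot be assigned a quartile at all, which is how missingness in the raw income item carries over to the derived income level.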
Analyses were performed with Stata ver. 10 (Stata Co., College Station, TX, USA), incorporating sampling weights. Indicator variables for missing values of the SES variables were created. Chi-square tests were used to assess the relationship between the missingness of household income and of occupational classification, respectively, and the other SES variables.
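The two steps in this paragraph (building a missing-value indicator, then testing it against another SES variable) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis reported here: the data are invented, missing values are assumed to be coded as `None`, and the statistic is the plain unweighted Pearson chi-square, whereas the reported analysis incorporated sampling weights.

```python
from collections import Counter


def missing_indicator(values):
    """Return 1 where a value is missing (None), otherwise 0."""
    return [1 if v is None else 0 for v in values]


def pearson_chi_square(rows, cols):
    """Unweighted Pearson chi-square statistic and degrees of freedom
    for two equal-length categorical sequences."""
    if len(rows) != len(cols):
        raise ValueError("sequences must have the same length")
    n = len(rows)
    observed = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical respondents: household income (None = missing) and education.
income = [None, None, None, None, 210, 150, 320, 400, 180, 95]
education = ["low", "low", "low", "high", "low",
             "low", "high", "high", "high", "high"]

income_missing = missing_indicator(income)
statistic, dof = pearson_chi_square(income_missing, education)
```

With complex survey data, the unweighted statistic above would generally be replaced by a design-based test, so the sketch shows only the logic of the comparison.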
RESULTS
Of the reviewed literature, one article in 2007, three articles in 2008, four articles in 2009, three articles in 2010, and five articles in 2011 used the KNHANES data as their primary data source, totaling 16 articles (5.4%) among 296 original articles during the same period. Eleven articles presented SES data to describe the participants' characteristics with univariate analysis, and 9 articles used them as covariates in multivariate analysis. The most frequently used SES variables in the multivariate analysis were education (9 articles), household income (8 articles), marital status (3 articles), and occupation (4 articles). None of these 11 articles took into account the omissions within the analysis (data not shown).
The estimated rates of missing data on education, household income, marital status, and occupation were 0.3% (standard error [SE], 0.05), 2.7% (SE, 0.2), 0.5% (SE, 0.1), and 9.4% (SE, 0.9), respectively. Occupation had the highest rate of omission. When all four variables were used simultaneously, the rate of missing data increased to 11.8% (SE, 0.9) (Table 1).
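The distinction between per-variable and combined missing rates can be made concrete with a short Python sketch. The records, weights, and function name below are hypothetical, missing values are assumed to be coded as `None`, and standard errors are not computed because they depend on survey design information that is not reproduced here.

```python
def weighted_missing_rate(records, weights, variables):
    """Weighted percentage of records missing at least one of `variables`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    missing_weight = sum(
        w for rec, w in zip(records, weights)
        if any(rec.get(v) is None for v in variables)
    )
    return 100.0 * missing_weight / total


SES_VARIABLES = ["education", "income", "marital", "occupation"]

# Hypothetical respondents with hypothetical sampling weights.
records = [
    {"education": "high", "income": 3, "marital": "married", "occupation": "clerk"},
    {"education": "low", "income": 1, "marital": "widowed", "occupation": None},
    {"education": "low", "income": None, "marital": "widowed", "occupation": None},
    {"education": "college", "income": 4, "marital": "married", "occupation": "manager"},
    {"education": None, "income": 2, "marital": "unmarried", "occupation": "sales"},
]
weights = [2, 1, 1, 4, 2]

per_variable = {v: weighted_missing_rate(records, weights, [v]) for v in SES_VARIABLES}
combined = weighted_missing_rate(records, weights, SES_VARIABLES)
```

In this toy example the combined rate is smaller than the sum of the per-variable rates because one respondent is missing two variables and is counted only once.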
Table 2 presents SES characteristics according to the missingness of household income and occupation data. Respondents with missing household income tended to be older (P < 0.001), less educated (P < 0.001), and more likely to be unemployed (P < 0.001) and widowed (P < 0.001). A similar pattern was observed for respondents with missing occupational classification.
DISCUSSION
The rates of missing data for the categories of household income and occupation within KNHANES were not low, i.e., 2.7% and 9.4%, respectively, and the missingness was not randomly dispersed throughout the data. However, no articles in KJFM clarified the process by which missing values or attrition reduce the sample size, nor did they explicate how these problems introduce potential bias to their findings.
Traditional approaches to working with missing values are case deletion, pairwise deletion, mean substitution, and the inclusion of indicator variables. KJFM articles using SES variables from KNHANES used the case deletion method, presumably because it is the default in standard statistical packages. Case deletion with KNHANES SES data could result in the loss of up to 12% of the data, and this figure will increase when researchers use multiple components of KNHANES together. Therefore, these approaches can result in serious biases in either a positive or a negative direction and, by reducing the sample size, can increase type II errors.4,5)
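As a rough check on the size of this loss, the combined rate can be bounded using only the per-variable rates reported in the Results. The notation is introduced here for illustration: $p_k$ is the weighted proportion missing variable $k$, and $p_{\text{any}}$ is the proportion missing at least one of the four variables, assuming all rates refer to the same weighted sample.

```latex
\max_k p_k \;\le\; p_{\text{any}} \;\le\; \sum_{k=1}^{4} p_k
\qquad\Longrightarrow\qquad
9.4\% \;\le\; 11.8\% \;\le\; 0.3\% + 2.7\% + 0.5\% + 9.4\% = 12.9\%
```

The reported combined rate of 11.8% lies within these bounds, and its being below the 12.9% upper bound is consistent with some respondents missing more than one SES variable.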
This study showed that low educational achievement, unemployment, and lower household income were all associated with omissions in SES data. Similar results have been reported in studies from other countries. A postpartum survey in California3) and the National Health Interview Survey6) showed that respondents with missing income information were, in general, more likely to be socioeconomically disadvantaged.
Missing values cannot be avoided entirely, and the best solution is naturally to minimize them at the point of collection. However, this may not be possible in most cases. Hence, researchers should, first and foremost, carefully examine the profiles of respondents with missing information prior to analysis.3) Secondly, researchers should keep in mind the possible bias that missing values can introduce. Finally, modern alternative techniques for working with missing values, such as single or multiple imputation or full information maximum likelihood approaches, should be incorporated into the analysis.7)