Missing Values in Cluster Analysis and Latent Class Analysis
The causes of missing values in cluster and latent class analysis
There are three common causes of missing values in cluster analysis and latent class analysis:
- Questions that were not asked of some respondents because of their answers to other questions. For example, a person may not have been asked about the ages of their children if it was known from a previous question that they did not have any children.
- Don't know responses.
- Randomizations, whereby respondents were randomly assigned to a subset of the questions.
Questions that were not asked of all respondents need to be excluded from the analysis. The whole point of techniques like latent class analysis and cluster analysis is to identify respondents with similar data, and using variables asked of only some of the respondents is at odds with this. The other two types of missing data can, in theory, be addressed by both cluster analysis and latent class analysis. In practice, only latent class analysis programs can reliably be used to form segments in data containing missing values.
Ways of addressing missing values in cluster analysis
All of the well-known cluster analysis algorithms assume that there are no missing values. As this is often not the case, a variety of solutions have been developed for data that contains missing values.
Imputation
Imputation is surprisingly often used to replace missing values. This is surprising because it is extremely dangerous. The whole point of cluster analysis is to find respondents that are similar, and the assumptions of imputation create artificial similarities. When missing values are replaced with averages, respondents with higher numbers of missing values are guaranteed to be more similar to each other (i.e., because they are assigned the same values for the missing values). Where predictive models and many standard imputation techniques are used, the same problem occurs, albeit to a lesser extent, due to regression to the mean. When stochastic models are used for imputation there is the reverse problem: random noise is added to the data, and more of it is added for people with more missing data, making them less likely to be grouped together.
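The effect of replacing missing values with averages can be seen in a small sketch. The values below are hypothetical and chosen only for illustration; with real data the true values behind the missing entries would not be known. Respondents 0 and 1 give opposite answers on every variable, but after mean imputation they differ only on the one variable that they both answered.

```python
import numpy as np

# Hypothetical true answers of four respondents on four variables.
true_values = np.array([
    [1.0, 5.0, 1.0, 5.0],
    [5.0, 1.0, 5.0, 1.0],
    [2.0, 4.0, 2.0, 4.0],
    [4.0, 2.0, 4.0, 2.0],
])

# Assumption for this example: respondents 0 and 1 only answered
# the first variable, so their other three values are missing.
observed = true_values.copy()
observed[:2, 1:] = np.nan

# Mean imputation: replace each missing value with the average of
# the values that were observed on the same variable.
column_means = np.nanmean(observed, axis=0)
imputed = np.where(np.isnan(observed), column_means, observed)

# Euclidean distance between respondents 0 and 1, before and after.
true_distance = np.linalg.norm(true_values[0] - true_values[1])
imputed_distance = np.linalg.norm(imputed[0] - imputed[1])

print(true_distance, imputed_distance)
```

In this example the distance between the two respondents falls from 8 to 4, purely because both were given the same imputed values on three of the four variables.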
Nearest neighbor assignment
Better practice than imputation is to assign observations to the most similar cluster based on the non-missing data. This is the approach that is built into most cluster analysis algorithms that purport to deal with missing values (e.g., SPSS). Although preferable to imputation, this approach implicitly makes the assumption that the data is Missing Completely At Random when forming the segments and then makes a different assumption, that of Missing At Random, when assigning respondents to the clusters. The problems that this leads to are best appreciated by examining the data in the table below, in which a . indicates that the data is missing.
Careful examination of this data suggests that there are three clusters. The first consists of observations 1 and 2, with means on the four variables of 1, 2, 3 and 4. The second consists of observations 3 and 4, with means of 4, 3, 2 and 1. The third consists of observations 5, 6, 7 and 8, with means of 1, 2, 2 and 1.
However, this is not what the nearest neighbor assignment approach will uncover. For example, when SPSS is used to do the cluster analysis it is only able to find the first two clusters, and it ends up assigning observations 5 and 7 to the first cluster and observations 6 and 8 to the second cluster. This occurs because the cluster analysis forms the clusters using only the observations with complete data, and these include no observations from the third cluster. Consequently, the observations that should be in the third cluster can only be assigned to one of the first two clusters.
Observation | Variable A | Variable B | Variable C | Variable D |
---|---|---|---|---|
1 | 1 | 2 | 3 | 4 |
2 | 1 | 2 | 3 | 4 |
3 | 4 | 3 | 2 | 1 |
4 | 4 | 3 | 2 | 1 |
5 | 1 | 2 | . | 1 |
6 | . | 2 | 2 | 1 |
7 | 1 | 2 | 2 | . |
8 | 1 | . | 2 | 1 |
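The assignments described above can be reproduced with a minimal sketch of nearest neighbor assignment applied to the table. This is not the SPSS implementation. It assumes squared Euclidean distance computed over the non-missing variables only, and it takes a shortcut when forming the clusters: because the complete cases are two pairs of identical rows, the distinct rows are used directly as the two cluster centroids. The function name `nearest_centroid` is made up for this example.

```python
import numpy as np

nan = np.nan
# The eight observations from the table; nan marks a missing value.
data = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [4, 3, 2, 1],
    [1, 2, nan, 1],
    [nan, 2, 2, 1],
    [1, 2, 2, nan],
    [1, nan, 2, 1],
], dtype=float)

# Step 1: form the clusters from the complete cases only.
# Shortcut (assumption): the complete cases are two pairs of identical
# rows, so the distinct rows are the centroids of any two-cluster solution.
complete = ~np.isnan(data).any(axis=1)
centroids = np.unique(data[complete], axis=0)  # rows sorted: [1,2,3,4], [4,3,2,1]


def nearest_centroid(row, centroids):
    """Index of the closest centroid, using only the non-missing variables."""
    observed = ~np.isnan(row)
    squared_distance = ((centroids[:, observed] - row[observed]) ** 2).sum(axis=1)
    return int(np.argmin(squared_distance))


# Step 2: assign each incomplete observation to its nearest cluster.
# Observation and cluster numbers are 1-based to match the text.
assignments = {
    int(i) + 1: nearest_centroid(data[i], centroids) + 1
    for i in np.flatnonzero(~complete)
}
print(assignments)  # {5: 1, 6: 2, 7: 1, 8: 2}
```

Observations 5 to 8 are split between the first two clusters because no third centroid exists for them to be assigned to.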
Predictive modeling
This approach involves forming the clusters using the observations with complete data and then using a predictive model, such as Linear Discriminant Analysis, to predict the segments for observations that have some missing values. In terms of the assumptions regarding missing data, this approach is identical to nearest neighbor assignment. Nevertheless, it is inferior to nearest neighbor assignment because predictive models generally make different assumptions from cluster analysis, and this leads to a compounding of errors.
Latent Class Analysis
Whereas cluster analysis is technically only valid in the presence of data that is Missing Completely At Random, latent class models can, in principle if not in practice, be applied to any type of missing data.
Due to certain features of the underlying mathematics of latent class analysis, it is standard practice to program software to make the Missing At Random assumption. The consequence of this is that latent class analysis will generally do a substantially better job of addressing missing values than can be achieved by cluster analysis. For example, when the data set used above to illustrate the problems of nearest neighbor assignment is analyzed using latent class software (e.g., Q), the correct three segments will generally be identified.
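One way to see how the Missing At Random assumption enters the calculations is to look at how class membership probabilities are computed for observations with missing values. The sketch below is not how Q or any other package is implemented, and it is not a full latent class analysis. It shows only the membership step, with the class means fixed by hand at the three cluster means described earlier, and it assumes normally distributed variables with unit variance and equally sized classes. A real program would also estimate the class parameters from the data.

```python
import numpy as np

nan = np.nan
# The eight observations from the table; nan marks a missing value.
data = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [4, 3, 2, 1],
    [1, 2, nan, 1],
    [nan, 2, 2, 1],
    [1, 2, 2, nan],
    [1, nan, 2, 1],
], dtype=float)

# Assumption: class means fixed at the three cluster means from the text.
class_means = np.array([
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [1, 2, 2, 1],
], dtype=float)

# Squared difference between every observation and every class mean.
# Shape is (observations, classes, variables); missing values stay nan.
squared_diff = (data[:, None, :] - class_means[None, :, :]) ** 2

# Each observation's log-likelihood is summed over its observed
# variables only; nansum skips the missing ones. Constant terms are
# the same for every class and cancel in the normalization below.
log_likelihood = -0.5 * np.nansum(squared_diff, axis=2)

# Posterior class probabilities with equal class sizes.
log_likelihood -= log_likelihood.max(axis=1, keepdims=True)
posterior = np.exp(log_likelihood)
posterior /= posterior.sum(axis=1, keepdims=True)

most_likely_class = posterior.argmax(axis=1) + 1  # 1-based class numbers
print(most_likely_class.tolist())  # [1, 1, 2, 2, 3, 3, 3, 3]
```

Because every observation contributes whatever variables it has, observations 5 to 8 inform the third class rather than being forced into one of the first two.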
Latent class models can even be used in some situations when the missing values are Nonignorable. This is done by treating the variables containing missing values as being categorical variables and treating the missing values as being another category. (This approach is not always effective, particularly where the variables are truly numeric.)
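The recoding described above can be sketched as follows, using Variable A from the earlier table. The label `"missing"` and the indicator coding are choices made for this example, not a requirement of any particular latent class program.

```python
import numpy as np

nan = np.nan
# Variable A from the table; nan marks the missing value.
variable_a = np.array([1, 1, 4, 4, 1, nan, 1, 1], dtype=float)

# Treat the variable as categorical, with missing as its own category.
labels = ["missing" if np.isnan(v) else str(int(v)) for v in variable_a]
categories = sorted(set(labels))  # ['1', '4', 'missing']

# One indicator column per category, one row per observation.
indicators = np.array(
    [[label == category for category in categories] for label in labels],
    dtype=int,
)
print(categories)
print(indicators)
```

Observation 6 now has a valid value, the `missing` category, so it no longer needs to be dropped or imputed. The cost is that the ordering of the numeric values is ignored, which is one reason the approach is less effective for truly numeric variables.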