Jaroslav Horníček, Hana Řezanková

Missing Data Imputation for Categorical Variables

Issue: 3/2022
Journal: Statistika
DOI: 10.54694/stat.2022.3

Keywords: IMIC algorithm, missing value imputation, categorical variables


Abstract: Dealing with missing data is a crucial part of everyday data analysis. The IMIC algorithm is a missing data imputation method that can handle mixed numerical and categorical datasets. This work, however, focuses on categorical data. The paper proposes a new improvement of the IMIC algorithm in the form of two modifications, both of which take into account the number of categories in each categorical variable. From this information, a factor that modifies the original measure is computed. The factor equation is inspired by the Eskin similarity measure, which is known from hierarchical clustering of categorical data. The results show that as the missing value ratio in the dataset grows, the second modification achieves better results. The paper also briefly analyzes the advantages and disadvantages of using the IMIC algorithm.
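
The abstract does not state the factor equation itself, so the following is only a minimal sketch of the standard Eskin similarity measure that it names as the inspiration, not the modification proposed in the paper. For a categorical variable with n categories, the Eskin measure scores a match as 1 and a mismatch as n^2 / (n^2 + 2), so mismatches on variables with more categories are penalized less. The function name `eskin_similarity` and the sample records are hypothetical and chosen for illustration.

```python
from typing import Hashable, Sequence


def eskin_similarity(
    x: Sequence[Hashable],
    y: Sequence[Hashable],
    n_categories: Sequence[int],
) -> float:
    """Mean per-variable Eskin similarity between two categorical records.

    n_categories[k] is the number of distinct categories of variable k.
    This is the textbook Eskin measure, not the factor from the paper.
    """
    if not (len(x) == len(y) == len(n_categories)):
        raise ValueError("x, y and n_categories must have the same length")
    if len(x) == 0:
        raise ValueError("records must contain at least one variable")

    total = 0.0
    for xi, yi, n in zip(x, y, n_categories):
        if xi == yi:
            total += 1.0                      # matching categories
        else:
            total += n * n / (n * n + 2)      # mismatch, weighted by n
    return total / len(x)


# Hypothetical records with three categorical variables.
a = ("red", "small", "yes")
b = ("red", "large", "no")
n_cats = (3, 4, 2)  # assumed number of categories per variable

similarity = eskin_similarity(a, b, n_cats)
print(round(similarity, 4))
```

In this sketch the mismatch on the four-category variable contributes 16/18, while the mismatch on the two-category variable contributes only 4/6, which illustrates how the number of categories enters the measure.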