Michal Kolárik, Lucia Gojdičová, Ján Paralič

Principles of Synthesizing Medical Datasets

Issue: 4/2022
Journal: Acta Electrotechnica et Informatica
DOI: 10.2478/aei-2022-0019

Keywords: data synthetization, GAN, CTGAN

Abstract: Data in many application domains provide a valuable source for analysis and data-driven decision support. On the other hand, legislative restrictions apply, especially to personal data and patients' data in the medical domain. To maximize the use of data for decision purposes while complying with legislation, sensitive data needs to be properly anonymized or synthesized. This article contributes to the area of medical records synthesis. We first introduce the topic and present it in a broader context, as well as in terms of the methods used and the metrics for their evaluation. Based on an analysis of related work, we selected the CTGAN neural network model for data synthesis and experimentally validated it on three different medical datasets. The results were evaluated both quantitatively, by means of selected metrics, and qualitatively, by means of suitable visualization techniques. The results showed that in most cases the synthesized dataset is a very good approximation of the original one, with similar prediction performance.
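
The abstract does not name the metrics used, so the following is only an illustrative sketch of one kind of quantitative comparison between an original and a synthesized column. It assumes a categorical column and uses the total variation distance between category frequencies; the function name `total_variation_distance` and the toy values are hypothetical and are not taken from the article.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """Total variation distance between two categorical samples.

    Returns 0.0 when the category frequencies match exactly and 1.0 when
    the two samples share no category. This metric is an assumption made
    for illustration; the article's own metrics are not listed in the abstract.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    # Compare relative frequencies over every category seen in either sample.
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


# Hypothetical toy column, not data from the article.
real_column = ["yes", "no", "no", "yes", "no"]
synthetic_column = ["yes", "no", "no", "no", "no"]
tvd = total_variation_distance(real_column, synthetic_column)
```

A value near 0 would suggest that the synthesized column reproduces the original category proportions closely. It says nothing about relationships between columns or about privacy.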