Comparison of Missing Data Imputation
Methods for Improving Detection of
Obstructive Sleep Apnea
Marit Iren Rognli Tokle
Thesis submitted for the degree of
Master in Informatics: Programming and Networks
60 credits
Department of Informatics
Faculty of mathematics and natural sciences
UNIVERSITY OF OSLO
Autumn 2017
© 2017 Marit Iren Rognli Tokle
http://www.duo.uio.no/
Printed: Reprosentralen, University of Oslo
Abstract
Sleep apnea is a common sleep disorder in which breathing pauses or is reduced during
sleep, forcing awakenings because of reduced oxygen in the blood. We employ the four
data mining methods K-Nearest Neighbor, Support Vector Machine, Artificial Neural
Network, and Decision Tree to analyze datasets containing the four non-invasive sensor
signals chest respiration, abdominal respiration, nasal respiration, and oxygen saturation.
Previous work has already reported good results for sleep apnea analysis using these
signals as input data for the data mining methods.
We examine how using the European Data Format Plus (EDF+) affects the data
mining results, because it is a standardised data format used for storing sleep data, and
is used by the sensors and Resmed tool NOX which we use for data acquisition. We also
examine how pre-processing input data with imputation methods to handle missing data
affects the data mining results, as we want to support usage of sensors of all qualities, for
which we have to assume that missing data will occur.
There are two tasks in this thesis. First, we check how well the data mining algorithms
work with our signals in the EDF+ data format. We conclude that EDF+ is as good
as the most common data format used in PhysioNet, as we could store and read data
without any problems, and it might be even better since the information is stored in a single
file instead of several files. By converting data to EDF+, we confirmed that signals and
annotations may be stored in the same file. We confirm that our data mining algorithms
and all signal combinations, except the sole use of respiration from the chest, may be
used for off-line classification of sleep apnea.
In the second task, we examine how imputation methods work and how pre-processing
our signal data with imputation methods affects our data mining methods. For the missing
data challenge, we discovered that the only imputation method that may be used for all
percentages of missing values, 5%, 10%, 20%, 30% and 50%, and the four data mining
methods, is Self-Organising Maps. This is the overall best method, and the only method
that should be used for datasets containing 30% or more missing data, because the others
do not maintain the data structure of the dataset. Imputing with the mean and median
of each class of normal or disrupted breathing should not be applied as an imputation
method, and we assume that separating between classes when imputing is bad practice.
Multiple Linear Regression and K-Nearest Neighbor are better at maintaining the data
structure than mean and median imputation, but both have a deviation of about 8%
compared to the results of the complete dataset. Self-Organising Maps deviates by at
most 1.25% from the classification of the complete dataset. Mean and median
imputation may be used if the imputation time is important, as they are better than
replacing the missing values with zeros and are the fastest methods, taking
only a few milliseconds to impute.
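As a rough illustration of the simplest strategies mentioned above, the following sketch removes a chosen fraction of values from a small table and fills the gaps with the column mean, the column median, or zeros. It is not the implementation used in this thesis: the function names `make_missing` and `impute`, the synthetic four-column data, and the choice to impute per column without regard to class labels are all assumptions made for the example.

```python
import random
import statistics


def make_missing(rows, fraction, seed=0):
    """Return a copy of rows with a given fraction of cells replaced by None.

    The cells are picked uniformly at random (an assumption for this sketch).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    holes = set(rng.sample(cells, round(fraction * len(cells))))
    return [[None if (i, j) in holes else value for j, value in enumerate(row)]
            for i, row in enumerate(rows)]


def impute(rows, strategy):
    """Fill None cells column by column with the 'mean', 'median', or 'zero'."""
    if strategy not in ("mean", "median", "zero"):
        raise ValueError(f"unknown strategy: {strategy}")
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if strategy == "zero" or not observed:
            fills.append(0.0)  # also the fallback for a fully missing column
        elif strategy == "mean":
            fills.append(statistics.mean(observed))
        else:
            fills.append(statistics.median(observed))
    return [[fills[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]


# Synthetic stand-in for four signal columns (hypothetical values).
data = [[float(i), float(2 * i), float(i % 3), 95.0 + (i % 5)] for i in range(20)]

for fraction in (0.05, 0.10, 0.20, 0.30, 0.50):
    incomplete = make_missing(data, fraction, seed=1)
    for strategy in ("mean", "median", "zero"):
        completed = impute(incomplete, strategy)
        # Mean absolute difference between the imputed and the original table.
        error = statistics.mean(abs(a - b)
                                for row_a, row_b in zip(completed, data)
                                for a, b in zip(row_a, row_b))
        print(f"{fraction:.0%} missing, {strategy}: mean abs. error {error:.3f}")
```

The printed error only measures how far the filled table is from the original synthetic values; it says nothing about classification performance, which is what the thesis evaluates.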
Acknowledgements
It has been an intensive and educational period of time. Now it is time to reflect on
and thank the people who helped me and supported me throughout this
period.
First, I would like to thank my supervisor, Professor Dr. Vera Goebel, for her guidance
throughout the work on this thesis. I am exceptionally happy to have such a dedicated
supervisor helping me through any unclear situations and always giving good answers to
my questions.
I would also like to thank my partner, Christian, and parents, Kate and Atle, for
their support, encouragement and patience. Finally, I would like to thank my family and
friends for their encouragement.
Contents
1 Introduction
1.1 Background and Motivation
1.2 Problem Description
1.3 Outline
2 Sleep apnea
2.1 Obstructive Sleep Apnea
2.2 Central Sleep Apnea
2.3 Mixed or Complex Sleep Apnea
2.4 Physiological Signals
2.5 Sleep Apnea Diagnosis Tools
3 Data mining
3.1 Data types
3.2 Data Challenges
3.3 Data mining tasks
3.4 Classification methods used in this thesis
3.4.1 Artificial Neural Network
3.4.2 Support Vector Machine
3.4.3 K-Nearest Neighbor
3.4.4 Decision Trees
3.5 Performance Evaluation
3.5.1 Metrics
3.5.2 Holdout Method
3.5.3 Cross-Validation
3.6 Conclusions
4 European Data Format
4.1 EDF Specifications
4.1.1 Header Record
4.1.2 Data Record
4.1.3 Annotations
4.1.4 Labels for Sleep Apnea Detection
4.2 WFDB Software Package
4.2.1 mit2edf
5 Missing Data
5.1 Types of Missing Data
5.2 Handling Missing Data
5.3 Imputation Methods
5.3.1 Mean, median and mode
5.3.2 Hot-Deck
5.3.3 Regression
5.3.4 Multiple Imputation
5.3.5 Maximum Likelihood
5.3.6 K-Nearest Neighbor
5.3.7 Self-Organising Maps
6 Requirement analysis
6.1 Input Data
6.2 PhysioNet EDF Databases
6.2.1 CAP Sleep Database
6.2.2 Sleep-EDF Database
6.2.3 Non-Invasive Fetal Electrocardiogram Database
6.2.4 SHHS Polysomnography Database
6.2.5 St. Vincent’s University Hospital Database
6.2.6 Apnea-ECG Database
6.2.7 MIT-BIH Polysomnography Database
6.2.8 Analysis of Database Suitability
6.3 Conversion of Databases
6.4 Missing Data
6.5 Data Mining
6.6 Related Work
7 Design and Implementation
7.1 System Environment
7.2 Input Data
7.2.1 Apnea-ECG Database
7.2.2 MIT-BIH Polysomnography Database
7.2.3 St. Vincent’s Hospital Database
7.3 Class Imbalance and Data Distribution
7.4 Performance Evaluation
7.5 Conversion of Databases
7.5.1 Header Record Design
7.5.2 Data Record Design
7.5.3 Implementation
7.6 Missing Data
7.6.1 Generating Missing Datasets
7.6.2 Mean and Median Imputation
7.6.3 Multiple Linear Regression
7.6.4 K-Nearest Neighbor
7.6.5 Self-Organising Maps
Description: required if separation of obstructive and central sleep apnea is desirable. In previous work, the signal from the abdomen scored highest of the abdomen and chest, with an accuracy of 92.9%. The performance of the data mining methods varied little in previous work. The K-Nearest Neighbor performed