DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018
Deep Neural Networks for
Inverse De-Identification of
Medical Case Narratives in
Reports of Suspected
Adverse Drug Reactions
EVA-LISA MELDAU
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Computer Science
Date: February 24, 2018
Supervisor at KTH: Joel Brynielsson
Supervisor at the Uppsala Monitoring Centre: Niklas Norén
Examiner: Olle Bälter
Swedish title: Djupa neuronnät för omvänd avidentifiering av
medicinska fallbeskrivningar i biverkningsrapporter
Abstract
Medical research requires detailed and accurate information on
individual patients. This is especially true in pharmacovigilance,
which, among other things, seeks to identify previously unknown
adverse drug reactions. Here, the clinical stories are often the
starting point for assessing whether there is a causal relationship
between the drug and the suspected adverse reaction. Reliable automatic
de-identification of medical case narratives could make it possible to
share this patient data without compromising the patients' privacy.
Current research on de-identification has focused on labelling the
tokens in a narrative with the class of sensitive information they
belong to.
In this Master's thesis project, we explore an inverse approach to
the task of de-identification. De-identification of medical case
narratives is instead understood as identifying the tokens which do
not need to be removed from the text in order to ensure patient
confidentiality. Our results show that this approach can lead to a more
reliable method in terms of higher recall. We achieve a recall of
sensitive information of 99.1% while the precision is kept above 51%
on the 2014-i2b2 benchmark data set. The model was also fine-tuned
on case narratives from reports of suspected adverse drug reactions,
where a recall of sensitive information of more than 99% was achieved.
Although the precision was only 55%, which is lower than in comparable
systems, an expert could still identify information useful for
causality assessment in pharmacovigilance in most of the case
narratives de-identified with our method. In more than 50% of the case
narratives, no information useful for causality assessment was missing
at all.
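
The inverse formulation and the two reported measures can be
illustrated with a minimal sketch. It assumes a token-level evaluation
in which every token not explicitly marked as safe is masked. The
function names, the placeholder string, and the example sentence are
hypothetical and are not taken from the thesis. In the thesis the keep
decisions come from the trained model, whereas here they are given by
hand.

```python
from typing import List, Tuple

PLACEHOLDER = "[REMOVED]"  # hypothetical mask token, not from the thesis


def inverse_deidentify(tokens: List[str], keep: List[bool]) -> List[str]:
    """Keep only tokens flagged as safe; every other token is masked."""
    return [tok if k else PLACEHOLDER for tok, k in zip(tokens, keep)]


def phi_recall_precision(is_phi: List[bool], keep: List[bool]) -> Tuple[float, float]:
    """Token-level recall and precision of the removed tokens w.r.t. PHI."""
    removed = [not k for k in keep]
    true_pos = sum(p and r for p, r in zip(is_phi, removed))
    n_phi = sum(is_phi)
    n_removed = sum(removed)
    # Recall: share of sensitive tokens that were removed.
    recall = true_pos / n_phi if n_phi else 1.0
    # Precision: share of removed tokens that were actually sensitive.
    precision = true_pos / n_removed if n_removed else 1.0
    return recall, precision


# Invented example narrative; "Smith" and "Uppsala" are the sensitive tokens.
tokens = ["Mr", "Smith", "developed", "a", "rash", "after",
          "taking", "amoxicillin", "in", "Uppsala"]
is_phi = [False, True, False, False, False, False,
          False, False, False, True]
# Hand-written keep decisions; a cautious system also drops "Mr" and "in".
keep = [False, False, True, True, True, True,
        True, True, False, False]

masked = inverse_deidentify(tokens, keep)
recall, precision = phi_recall_precision(is_phi, keep)
print(" ".join(masked))
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case all sensitive tokens are removed, at the cost of also
removing two harmless ones. This is the trade-off described above:
recall is favoured over precision because tokens are removed unless
they are judged safe to keep.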
Sammanfattning (Summary)
Access to detailed clinical data is a prerequisite for conducting
medical research and, by extension, for helping patients. Reliable
de-identification of medical case narratives can make it possible to
share such information without jeopardising the protection of patients'
personal data. Previous research in the area has sought to tackle the
problem by labelling the words in a text with the type of sensitive
information they convey. In this degree project, we explore the
possibility of tackling the problem inversely, by identifying the words
that do not need to be removed in order to ensure the protection of
sensitive patient information. Our results show that this can
de-identify a larger share of the sensitive information: 99.1% of all
sensitive information is de-identified with our method, while 51% of
all excluded words actually convey sensitive information, as examined
on the 2014-i2b2 benchmark data set. The algorithm was also adapted to
case narratives from adverse drug reaction reports, and in this case
99.1% of all sensitive information was de-identified, while 55% of all
excluded words convey sensitive information. Although this latter share
is lower than for comparable systems, an expert could find information
useful for causality assessment in the majority of the de-identified
reports; in more than half of the de-identified case narratives, no
information of value for causality assessment was missing.
Contents
1 Introduction
  1.1 Purpose and Problem Statement
2 Background
  2.1 Pharmacovigilance
    2.1.1 Causality Assessment
    2.1.2 World Health Organization (WHO) International Drug Monitoring Programme
    2.1.3 VigiBase
  2.2 Protected Health Information
    2.2.1 Health Insurance Portability and Accountability Act
    2.2.2 European Union
    2.2.3 Comparison Between Countries
  2.3 Related Work: De-Identification Systems
    2.3.1 Systems Using Hand-Engineered Features
    2.3.2 Feature Learning Neural Network Systems
    2.3.3 Inverse Approach Systems
3 Theory
  3.1 Artificial Neural Networks
  3.2 Deep Learning
    3.2.1 Feature Learning
    3.2.2 Pre-Training and Fine-Tuning
  3.3 Deep Feed Forward Neural Networks
    3.3.1 Training
  3.4 Recurrent Neural Networks
    3.4.1 Structure
    3.4.2 Training
    3.4.3 Bidirectional Recurrent Neural Network
    3.4.4 Long Short-Term Memory
  3.5 Word Vectors
  3.6 Linear-Chain Conditional Random Field
  3.7 Evaluation Measures
4 Methodology
  4.1 Data Sets
    4.1.1 2014-i2b2 Data Set
    4.1.2 VigiBase Data Set
  4.2 Dictionaries
    4.2.1 WHODrug
    4.2.2 Medical Dictionary for Regulatory Activities
  4.3 De-Identification Methods
    4.3.1 Overview
    4.3.2 Annotating and Pre-Processing
    4.3.3 Rule-Based Approach Using Dictionary Look-ups
    4.3.4 Deep Learning Approach Using Long Short-Term Memory
    4.3.5 Combination Strategy
    4.3.6 Model Selection
  4.4 Evaluation
    4.4.1 Recall and Precision for Protected Health Information
    4.4.2 Retainment of Valuable Information
5 Results
  5.1 2014-i2b2 Data Set
    5.1.1 Evaluation of the Hybrid De-Identifier
    5.1.2 Evaluation of the Deep De-Identifier
    5.1.3 Comparisons
    5.1.4 Results Per Category
    5.1.5 Example Outputs
  5.2 VigiBase Data Set
    5.2.1 General Results
    5.2.2 Results Per Category
    5.2.3 Examples of Leaked Protected Health Information
    5.2.4 Valuable Information for Causality Assessment
6 Discussion
7 Conclusion
Bibliography