Scott Mongeau
Andrzej Hajdasinski
Cybersecurity Data Science
Best Practices in an Emerging Profession
Foreword by Timothy Shimeall
Scott Mongeau, Nyenrode Business Universiteit, Breukelen, Netherlands
Andrzej Hajdasinski, Nyenrode Business Universiteit, Breukelen, Netherlands
ISBN 978-3-030-74895-1 ISBN 978-3-030-74896-8 (eBook)
https://doi.org/10.1007/978-3-030-74896-8
© Springer Nature Switzerland AG 2021
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Frontispiece design: Andreas Kallipolitis, iamtraum.com
“Ars longa, vita brevis, occasio praeceps,
experimentum periculosum, iudicium
difficile.”
“Life is short, the art long, opportunity
fleeting, experiment treacherous, judgment
difficult.”
—Hippocrates
Dedicated to Marloes, family, and friends
Foreword
While data science has been emerging as a profession since 2005, the
professionalization of its application to cybersecurity is less mature. One reason for
this relative immaturity is that both data science and cybersecurity have been
undergoing extensive change and accepted practices are still evolving. Another reason
is that, unlike many fields of data analysis, cybersecurity has intelligent opposition
to its methods, specifically attackers who wish to intrude on computer systems and
networks. To date, cybersecurity has been in a race against that opposition, and the
state of data science for cybersecurity reflects that race. Under those conditions,
accepted practices are rapidly challenged and modified.
Despite these challenges, the importance of cybersecurity data science is
increasing due to a number of pressures. The velocity of cybersecurity data is high
and increasing. A single, moderately busy server or firewall generates gigabytes of
log entries every day. A network traffic log for a large network generates tens of
billions of entries per day. Security event analysis systems only deal with some of
the more immediate and easily recognized issues. Data science approaches that can
efficiently categorize and focus attention on the most impactful streams within this
fire hose of data are urgently needed. At the same time, the activities of the attackers
are increasingly diverse in subtlety, impact, and targeting. While some are easily
recognized and of immediate effect on a recognizable target within the perception
of the defenders, others mimic desirable traffic, lie latent within the target until
desired by the attacker, or hit outside of the defender’s perception, in unmonitored
portions of their infrastructure or in the infrastructure of suppliers or vendors. By
employing explicit feature engineering and sensitivity analysis, cybersecurity data
science may focus on those features most revealing of even subtle activities and also
provide the chance to secure systems on a community-wide basis. Federating and sharing
data within even a tightly related community is often difficult due to the lack of
common methods for data analysis and interpretation. Cybersecurity data science,
with its explicit consideration of the characteristics of data and of analysis methods,
offers an opportunity to bridge the federation and sharing difficulties.
This book thoroughly, if not exhaustively, documents the lack of maturity in data
science applied to cybersecurity. More than identifying this lack of maturity, it uses
mixed-mode data collection, both qualitative and quantitative, to point to how the
gaps in cybersecurity data science can be filled as it emerges as a full profession.
Using a multifaceted mix of detailed literature review, survey of experts, and modeling,
Dr. Mongeau has carefully delineated both where this data science profession
is currently lacking and how those gaps could be addressed in future work. A wide
range of factors are included, and clear recommendations are provided.
The reader who comes to this volume from an interest in cybersecurity will gain
much in understanding how data science methods apply in this space. The book
refers to various methods of analysis, and how those methods lend insight into
cybersecurity objectives. This book serves as a broad and useful introduction to how
data science contributes to cybersecurity, as that science is practiced by modern
professionals.
The reader who comes to this volume from an interest in data science will find
this book summarizes the state of data science as a profession (and the path forward)
but then focuses directly on the specific needs of cybersecurity and how the
profession would help to protect data in the modern world. The analysis problems
(such as the chronic lack of ground truth) and methods to remediate those problems
are covered thoroughly, and from a perspective that speaks to the data scientist.
The reader who comes to this volume from a managerial perspective, or one who
seeks to understand the emergence of this field and the current capability of those
practicing data science for cybersecurity, will find the clear description of the state
of the field useful. This book offers a solid overview of the state of the field and
supports both a realistic appraisal of what can be gained from practitioners and an
awareness of what to look for as emerging capabilities in the near future. The
interviews and quantitative analysis give an in-depth understanding of what is, and
what is coming.
Taken in total, this book offers an extremely useful clarification of the emergence
of cybersecurity data science as a specific profession, borrowing from both
cybersecurity and general data science. Dr. Mongeau has provided a useful degree of
clarity to these rapidly developing fields. From this basis, a variety of useful work
will spring in the days to come.
Timothy Shimeall, Ph.D.
Senior Member of the Technical Staff
CERT Situational Awareness Team
Software Engineering Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania
May 2021
Preface
During the concluding stages of this research effort in early 2020, the global
Coronavirus (COVID-19) pandemic emerged and rapidly evolved. It would be
remiss not to comment on this world-changing event regarding its relationship to
and impact upon the central research topic.
It became increasingly evident that dramatic societal effects were occurring as a
result of the coronavirus outbreak. For one, reactions to the pandemic brought about
a rapid acceleration of the use of the Internet for communication and collaboration.
There was a sudden mass shift towards people working from home in order to limit
the spread of the contagion. Information workers, students, families, and friends
moved quickly online to communicate and collaborate.
Quite suddenly, a large proportion of global work and social life has moved
online. Although telecommuting was already a growing trend, the rapid movement
to adopt remote working as the de facto approach has introduced a radical shift.
Many popular press commentators suggest that this movement towards remote and
virtual work will persist for many months and might possibly lead to more permanent
shifts (Burr and Endicott 2020; Lichfield 2020).
The cybersecurity implications of this shift mirror those which were already
developing before the crisis, albeit at a much larger scale. Defenders are suddenly
confronted with safeguarding radically expanded, distributed, and diffuse networks
and devices. It is clear the future of corporate networks will be increasingly
decentralized as a result. The “castle-and-moat” approach to defending organizational
networks will be increasingly outmoded.
In the wake of this shift, the interruption of Internet carriage and digital devices
will have far greater impacts on social, political, and economic life. Adversaries are
presented with a range of enticing new targets to exploit and schemes to perpetrate
(Gallagher and Bloomberg 2020; Jowitt 2020).
One need not look far to see that criminal and malicious actors have been quick
to adapt and exploit current events. Phishing emails have adopted subject matter
related to the coronavirus in an attempt to deliver malware and other malicious payloads
(Reiner 2020). As well, state-level actors quickly adapted to the coronavirus
crisis as an opportunity to sow disinformation (Beaumont et al. 2020) and to stage
attacks to raid medical research and interrupt healthcare systems (Cohen and
Marquardt 2020). These trends are disheartening and demand concerted responses.
This research profiles a range of mounting challenges afflicting the cybersecurity
profession. A central challenge concerns the fact that rules-based methods to detect
security events are increasingly ineffective due to the ever-increasing complexity
and scale of digital infrastructure. To the degree that information professionals are
increasingly working remotely en masse, network defenders will be
challenged to develop novel approaches.
An effort has been made herein to detail how cybersecurity data science (CSDS)
can address a broad range of challenges to improve security assurance. The coronavirus
pandemic has suddenly highlighted the importance of pursuing and enabling
such improvements to cybersecurity practice. The stance here is that this dramatic
event has increased the importance and visibility of CSDS.
CSDS provides refined and focused approaches to gain situational awareness.
This includes the ability to refine understandings of normal versus abnormal behavior
and to categorize and detect complex incidents. CSDS methods support complex
statistical understandings of user behavior and an ability to monitor distributed
events through dynamic pattern analysis. This highlights the rising importance of
machine learning-driven approaches, such as applying unsupervised machine learning
to extrapolate patterned understandings of user and device behaviors.
The advent of physically distributed and diffuse networks, not to mention the
increase in cloud-based collaboration, conferencing, etc., demands the methods and
treatments made feasible by CSDS. Growing dispersion and complexity will necessitate
data analytics-based monitoring methods to assure and defend increasingly
distributed organizational networks. In short, the coronavirus crisis has effectively
made prescriptive security assurance approaches, of the type outlined in this work,
more a pressing necessity than an aspirational work-in-progress.
A circumstantial similarity between the corona crisis and the subject matter
herein concerns the broader challenge of undertaking scientific inquiry in complex
environments beset by incomplete or unclear data. Efforts to
address and contain the outbreak revealed confusion, frustration, and conflict resulting
from misunderstandings and disagreements surrounding the proper application
of statistical and scientific methods. The crisis revealed how environments of
uncertainty and complexity lead to confusion and challenges in marshaling focused and
coordinated responses.
While it is not appropriate to compare like-to-like in terms of impact, cybersecurity
faces a similar environment of complexity, uncertainty, confusion, and conflict
related to data, methods, and responses. Confronting this topic, the aim of this work
is to offer a potentially generalizable set of approaches to programmatically address
complexity and uncertainty through the orchestration of data management, scientific
inquiry, and organizational collaboration. The organizational implementation
of such efforts is emphasized as requiring information sharing, communication,
collaboration, and aligned incentives…
Moving beyond recent events, it is worth commenting briefly on this research
effort’s broader origins and intentions. It should be mentioned that this research