Table Of ContentPasswords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
Design and Evaluation of a Data-Driven Password Meter
BlaseUr*,FeliciaAlfieri,MaungAung,LujoBauer,NicolasChristin,
JessicaColnago,LorrieFaithCranor,HenryDixon,PardisEmamiNaeini,
HanaHabib,NoahJohnson,WilliamMelicher
*UniversityofChicago,CarnegieMellonUniversity
[email protected]
{fla,mza,lbauer,nicolasc,jcolnago,lorrie,hdixon,pardis,hana007,noah,billy}@cmu.edu
ABSTRACT thestrengthofapasswordthanotheravailablemetersandpro-
Despitetheirubiquity,manypasswordmetersprovideinac- videsmoreuseful,actionablefeedbacktousers.Whereasmost
curatestrengthestimates. Furthermore,theydonotexplain previousmetersscoredpasswordsusingverybasicheuristics
to users what is wrong with their password or how to im- [10,42,52],weusethecomplementarytechniquesofsimulat-
prove it. We describe the development and evaluation of a ingadversarialguessingusingartificialneuralnetworks[32]
data-driven password meter that provides accurate strength andemploying21heuristicstoratepasswordstrength. Our
measurementandactionable,detailedfeedbacktousers. This meteralsogivesusersactionable,data-drivenfeedbackabout
metercombinesneuralnetworksandnumerouscarefullycom- howtoimprovetheirspecificcandidatepassword. Weprovide
binedheuristicstoscorepasswordsandgeneratedata-driven userswithuptothreewaysinwhichtheycouldimprovetheir
text feedback about the user’s password. We describe the passwordbasedonthecharacteristicsoftheirspecificpass-
meter’siterativedevelopmentandfinaldesign. Wedetailthe word.Furthermore,weautomaticallyproposemodificationsto
securityandusabilityimpactofthemeter’sdesigndimensions, theuser’spasswordthroughjudiciousinsertions,substitutions,
examinedthrougha4,509-participantonlinestudy. Underthe rearrangements,andcasechanges.
more common password-composition policy we tested, we
Inthispaper,wedescribeourmeterandtheresultsofa4,509-
foundthatthedata-drivenmeterwithdetailedfeedbackled
participant online study of how different design decisions
userstocreatemoresecure,andnolessmemorable,passwords
impactedthesecurityandusabilityofpasswordsparticipants
thanameterwithonlyabarasastrengthindicator.
created. Wetestedtwopassword-compositionpolicies,three
scoringstringencies,andsixdifferentlevelsoffeedback,rang-
ACMClassificationKeywords
ingfromnofeedbackwhatsoevertoourfull-featuredmeter.
K.6.5 Security and Protection: Authentication; H.5.2 User
Interfaces: Evaluation/methodology Under the more common password-composition policy we
tested,wefoundthatourdata-drivenmeterwithdetailedfeed-
AuthorKeywords backleduserstocreatemoresecurepasswordsthanameter
Passwords;usablesecurity;data-driven;meter;feedback withonlyabarasastrengthindicatorornothavinganyme-
ter,withoutasignificantimpactonanyofourmemorability
INTRODUCTION metrics. Mostparticipantsreportedthatthetextfeedbackwas
Passwordmetersareusedwidelytohelpuserscreatebetter informativeandhelpedthemcreatestrongerpasswords.
passwords [42], yet they often provide ratings of password
strength that are, at best, only weakly correlated to actual
RELATEDWORK
passwordstrength[10]. Furthermore,currentmetersprovide
Userssometimesmakepredictablepasswords[22,30,48]even
minimalfeedbacktousers. Theymaytellauserthathisor
forimportantaccounts[13,31]. Manyusersbasepasswords
herpasswordis“weak”or“fair”[10,42,52],buttheydonot
around words and phrases [5,23,29,45,46]. When pass-
explainwhattheuserisdoingwronginmakingapassword,
words contain uppercase letters, digits, and symbols, they
nordotheyguidetheusertowardsabetterpassword.
areofteninpredictablelocations[4]. Keyboardpatternslike
Inthispaper,wedescribeourdevelopmentandevaluationof “1qaz2wsx” [46] and dates [47] are common in passwords.
anopen-sourcepasswordmeterthatismoreaccurateatrating Passwordssometimescontaincharactersubstitutions,suchas
replacing “e” with “3” [26]. Furthermore, users frequently
reusepasswords[9,14,25,38,48],givingthecompromiseof
evenasingleaccountpotentiallyfar-reachingrepercussions.
Permissiontomakedigitalorhardcopiesofpartorallofthisworkforpersonalor In designing our meter, we strove to help users understand
classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed
whentheirpasswordexhibitedthesecommontendencies.
forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation
onthefirstpage.Copyrightsforthird-partycomponentsofthisworkmustbehonored.
Threetypesofinterventionsattempttoguideuserstowards
Forallotheruses,contacttheOwner/Author.
Copyrightisheldbytheowner/author(s). strongpasswords. First,password-compositionpoliciesdic-
CHI2017,May06–11,2017,Denver,CO,USA tatecharacteristicsapasswordmustinclude,suchasparticular
ACM978-1-4503-4655-9/17/05.
http://dx.doi.org/10.1145/3025453.3026050
3775
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
character classes. While these policies can be effective for sioncomparingthesescorestothepassword’sguessability,as
security,usersoftenfindcomplexcompositionpoliciesunus- modeledbyCMU’sPasswordGuessabilityService[7]. This
able[2,20,24,28,37,50].Second,proactivepasswordchecking servicemodelsfourtypesofguessingattacksandhasbeen
aimstomodelapassword’ssecurityandonlypermitusersto foundtobeaconservativeproxyforanexpertattacker[44].
select a password the model deems sufficiently strong. For
Although these carefully combined heuristics also provide
instance,researchershaveproposedusingserver-sideMarkov
relatively accurate password-strength estimates, at least for
modelstogaugepasswordstrength[8]. Thisapproachisinap-
resistance to online attacks [52], we use them primarily to
propriatewhenthepasswordshouldneverbesenttoaserver
identify characteristics of the password that are associated
(e.g., encrypting a hard drive). It also requires non-trivial
withguessability. Ifacandidatepasswordscoreshighona
configurationandcanenableside-channelattacks[35].
particularheuristic,indicatingthepresenceofacommonpat-
Third,servicescommonlyprovidepasswordmeterstoshow tern,wegeneratetextfeedbackthatexplainswhatiswrong
estimatedpasswordstrength. Toenablethesemeterstorun withthataspectofthepasswordandhowtoimproveit. Wede-
client-sideandthusavoidthesecuritypitfallsofaserver-side velopedthewordingsforthisfeedbackiterativelyandthrough
solution,mostmetersusebasicheuristics,suchasestimating aformativelabstudy.1
apassword’sstrengthbasedonitslengthandthenumberof
character classes used [42]. These sorts of meters can suc- VISUALDESIGNANDUSEREXPERIENCE
cessfullyencourageuserstocreatestrongerpasswords[42], In this section, we describe the visual design of our meter.
thoughperhapsonlyforhigher-valueaccounts[12]. Unfor- Atahighlevel, themetercomprisesthreedifferentscreens.
tunately, these basic heuristics frequently do not reflect the Themainscreenusesthevisualmetaphorofabartodisplay
actualstrengthofapassword[1,10]. Amongpriorpassword the strength of the password, and it also provides detailed,
metersbasedon heuristics, onlyzxcvbn [51,52]usesmore data-drivenfeedbackabouthowtheusercanimprovehisor
advancedheuristics[10,32]. Akeydifferencefromzxcvbn her password. The main screen also contains links to the
in the design of both our meter and our experiment is that twootherscreens. Userswhoclick“(Why?)” linksnextto
generatingandtestingtheimpactoffeedbacktotheuseris the feedback about their specific password are taken to the
primaryforus. specific-feedbackmodal,whichgivesmoredetailedfeedback.
Users who click the “How to make strong passwords” link
Researchershavetriednumerousvisualdisplaysofpassword
or“(Why?)” linksadjacenttofeedbackaboutpasswordreuse
strength. Using a bar is most common [42]. However, re-
aretakentothegeneric-advicemodal,whichisastaticlistof
searchershavestudiedusinglarge-scaletrainingdatatoshow
abstractstrategiesforcreatingstrongpasswords.
userspredictionsofwhattheywilltypenext[27]. Othershave
investigatedapeer-pressuremeterthatcomparesthestrength
of a user’s password with those of other users [36]. These TranslatingScorestoaVisualBar
alternativevisualizationshaveyettobewidelyadopted. Password-strengthmeasurementsarenormallydisplayedto
usersnotasanumericalscore,butusingacoloredbar[10,42].
Increatingourmeter,weneededtomapapassword’sscores
MEASURINGPASSWORDSTRENGTHINOURMETER
fromboththeneuralnetworkandcombinedheuristicstothe
We move beyond measuring password strength using inac-
amount of the bar that should be filled. We conservatively
curatebasicheuristicsbycombiningtwocomplementaryap-
calculated log of the lower of the two estimates for the
10
proaches: modeling a principled password-guessing attack
numberofguessesthepasswordwouldwithstand.
usingneuralnetworksandcarefullycombining21heuristics.
Prior work has shown that most users consider a password
Ourfirstapproachreliesonrecentworkthatproposedneural
sufficientlystrongifonlypartofthebarisfull[42]. Therefore,
networksformodelingapassword-guessingattack[32]. This
we mapped scores to the bar such that one-third of the bar
approachusesarecurrentneuralnetworktoassignprobabil-
beingfilledroughlyequatestoacandidatepasswordresisting
ities to future characters in a candidate guess based on the
an online attack, while two-thirds means that the candidate
previouscharacters. Initstrainingphase,thenetworklearns
passwordwouldlikelyresistanofflineattack,assumingthata
varioushigher-orderpasswordfeatures. UsingMonteCarlo
hashfunctiondesignedforpasswordstorageisused.Wetested
methods[11],suchanapproachcanmodel1030 ormoread-
threedifferentprecisemappings,whichwetermstringencies.
versarialguessesinrealtimeentirelyontheclientsideafter
transferringlessthanamegabyteofdatatotheclient[32].
MainScreen
However,neuralnetworksareeffectivelyablackboxandpro- On the main screen a bar below the password field fills up
videnoexplanationtotheuserfortheirscoring. Inspiredby and changes color to indicate increasing password strength.
thezxcvbnpasswordmeter[52],weimplementedandcare- Differentfrompreviousmeters,ourmeteralsodisplaysdata-
fully combined 21 heuristics for password scoring. These driventextfeedbackaboutwhataspectsoftheuser’sspecific
heuristicssearchforcharacteristicssuchastheinclusionof passwordcouldbeimproved.
common words and phrases, the use of common character
The meter initially displays text listing the requirements a
substitutions, the placement of digits and uppercase letters
passwordmustmeet.Oncetheuserbeginstoenterapassword,
in common locations, and the inclusion of keyboard pat-
terns. We scored 30,000 passwords from well-studied data 1SeeCh.7ofUr’sdissertation[40]fordetailsonthedevelopmentof
breaches [18,45] by these heuristics, and we ran a regres- thiswording,theresultsofthelabstudy,andtheheuristics.
3776
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
Figure2: Asuggestedimprovementfor“Mypassword123,”
with changes in magenta. The suggested improvement and
sensitivefeedbackappearwhenusersshowtheirpassword.
Figure1:Thetool’smainscreenwhenthepasswordishidden.
correlated most strongly with password guessability in our
regressioncombiningtheheuristics.
thetoolindicatesthataparticularrequirementhasbeenmet
by coloring that requirement’s text green and displaying a
SuggestedImprovement
checkmark. Itdenotesunmetrequirementsbycoloringthose
To make it easier for users to create a strong password, we
requirementsredanddisplayingemptycheckboxes. Untilthe
augmentedourtextfeedbackwithasuggestedimprovement,
passwordmeetsrequirements,thebarisgray.
orconcretemodificationofthecandidatepassword. Asshown
inFigure2,asuggestedimprovementfor“Mypassword123”
ColoredBar
mightbe“My123passwoRzd.”
Oncethepasswordmeetscompositionrequirements,wedis-
playthebarincolor. Withincreasingpassword-strengthrat- We generate this suggested improvement as follows. First,
ings,thebarprogressesfromredtoorangetoyellowtogreen. wetaketheuser’scandidatepasswordandmakeoneofthe
Whenitisone-thirdfull,thebarisdarkorange. Attwo-thirds following modifications: toggle the case of a random letter
full,itisyellow,soontobegreen. (lowercasetouppercase,orviceversa),insertarandomchar-
acterinarandomposition,orreplaceacharacteratarandom
TextFeedback
positionwitharandomlychosencharacter. Inaddition,ifall
Whereasacoloredbaristypicalofpasswordmeters[42],our
of the password’s digits or symbols are in a common loca-
meterisamongthefirsttoprovidedetailed,data-drivenfeed-
tion(e.g.,attheend),wemovethemasagrouptoarandom
backonhowtheusercanimprovehisorherspecificcandidate
location within the password. We choose from among this
password. Wedesignedthetextfeedbacktobedirectlyaction-
largesetofmodifications,ratherthanjustmakingmodifica-
able,inadditiontobeingspecifictothepassword. Examples
tions corresponding to the specific text feedback the meter
ofthisfeedbackinclude,asappropriate,suggestionstoavoid
displays, togreatlyincreasethespaceofpossiblemodifica-
dictionarywordsandkeyboardpatterns,tomoveuppercase
tions. Wethenverifythatthismodificationstillcomplieswith
lettersawayfromthefrontofthepasswordanddigitsaway
thecompositionrequirements.
fromtheend,andtoincludedigitsandsymbols.
Werequirethatsuggestedimprovementswouldfillatleasttwo-
Mostfeedbackcommentsonspecificpartsofthepassword.
thirdsofthebarandalsobeatleast1.5ordersofmagnitude
Becauseuserslikelydonotexpecttheirpasswordtobeshown
hardertoguessthantheoriginalcandidatepassword. When
on screen, we designed public and sensitive variants for all
we have generated such a suggested improvement and the
feedback. Publicversionsmentiononlythegeneralclassof
useriscurrentlyshowinghisorherpassword,wedisplaythe
characteristic(e.g.,“avoidusingkeyboardpatterns”),whereas
suggestedmodificationonscreenwithchangesinmagenta,as
sensitiveversionsalsodisplaytheproblematicportionofthe
inFigure2. Iftheuserisnotshowinghisorherpassword,we
password (e.g., “avoid using keyboard patterns like adgjl”).
insteadshowabluebuttonstating“seeyourpasswordwith
Wedisplaythepublicversionsoffeedback(Figure1)when
ourimprovements.” Priorworkhasinvestigatedautomatically
usersarenotshowingtheirpasswordonscreen,whichisthe
modifyingpasswordstoincreasesecurity,findinguserswould
defaultbehavior. Weprovidecheckboxeswithwhichauser
makeweakerpre-improvementpasswordstocompensate[16].
can“showpassword&detailedfeedback,”atwhichpointwe
Incontrast,oursuggestedimprovementsareoptional.
usethesensitiveversion(Figure2).
Toavoidoverwhelmingtheuser,weshowatmostthreepieces Specific-FeedbackModal
of feedback at a time. Each of our 21 heuristic functions For the main screen, we designed the feedback specific to
returnseitheronesentenceoffeedbackortheemptystring. a user’s password to be both succinct and action-oriented.
Tochoosewhichfeedbacktodisplay, wemanuallyordered Althoughtherationaleforthesespecificsuggestionsmightbe
thefunctionstoprioritizethosethatwefoundinaformative obvioustomanyusers,weexpecteditwouldnotbeobvious
laboratorystudyprovidedthemostusefulinformationandthat toallusersbasedonpriorworkonpasswordperceptions[41,
3777
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
tacks [19,39], our second point recommends using at least
12charactersinthepassword. Tohelpinspireuserswhoare
unfamiliarwithhowtomakea12+characterpassword,we
use the third point to propose a way of doing so based on
Schneier’s method to use mnemonics or fragments from a
uniquesentenceasthebasisforapassword[33]. Thefourth
pointrecommendsagainstcommonpasswordcharacteristics.
METHODOLOGY
We recruited participants from Amazon’s Mechanical Turk
crowdsourcingserviceforastudyonpasswords. Werequired
thatparticipantsbeage18+andbelocatedintheUnitedStates.
Inaddition,becausewehadonlyverifiedthatthemeterworked
correctlyonFirefox,Chrome/Chromium,Safari,andOpera,
werequiredtheyuseoneofthosebrowsers.
In order to measure both password creation and password
recall, the study comprised two parts. The first part of the
study,whichwetermPart1,includedapasswordcreationtask,
Figure3: Thespecific-feedbackmodal(passwordshown).
asurvey,andapasswordrecalltask. Weassignedparticipants
round-robintoaconditionspecifyingthemetervariantthey
wouldseewhencreatingapassword. Thesecondpartofthe
43]. Togivedetailedexplanationsofwhyspecificsuggestions
study, which we term Part 2, took place at least 48 hours
would improve a password, we created a specific-feedback
afterthefirstpartofthestudyandincludedapasswordrecall
modal,showninFigure3. Whenauserclicks“(Why?)” next
task and a survey. We compensated participants $0.55 for
topassword-specificfeedback,themodalappears.
completingPart1and$0.70forPart2.
Thespecific-feedbackmodal’smainbulletpointsmirrorthose
Afterstudying18conditionsinourfirstexperiment,wehad
fromthemainscreen. Beloweachmainpoint,however,the
lingeringquestions.Wethereforeranasecondexperimentthat
specific-feedbackmodalalsoexplainswhythisactionwould
added8newconditionsandrepeated4existingconditions.
improve the password. Our explanations take two primary
forms. Thefirstformisexplanationsofhowattackerscould
exploit particular characteristics of the candidate password. Part1
Followingtheconsentprocess,wetoldparticipantsthatthey
Forinstance,ifapasswordcontainssimpletransformations
wouldbecreatingapassword.Weaskedthattheyroleplayand
of dictionary words, we explain that attackers often guess
imaginethattheywerecreatingthispasswordfor“anaccount
passwordsbytryingsuchtransformations. Thesecondform
theycarealotabout,suchastheirprimaryemailaccount.” We
ofexplanationisstatisticsabouthowcommondifferentchar-
informedparticipantstheywouldbeinvitedbackinafewdays
acteristicsare(e.g.,30%ofpasswordsthatcontainacapital
torecallthepasswordandaskedthemto“takethestepsyou
letterhaveonlythefirstcharacterofthepasswordcapitalized).
wouldnormallytaketocreateandrememberyourimportant
passwords,andprotectthispasswordasyounormallywould
Generic-AdviceModal
protectyourimportantpasswords.”
Itisnotalwayspossibletogeneratedata-drivenfeedback. For
instance, until a user has typed more than a few characters Theparticipantthencreatedausernameandapassword.While
intothepasswordfield,thestrengthofthecandidatepassword doingso,heorshesawthepassword-strengthmeter(orlack
cannotyetbedefinitivelydetermined. Inaddition,extremely thereof)dictatedbyhisorherassignedcondition,described
predictablecandidatepasswords(e.g.,“password”or“mon- below. Participantsthenansweredasurveyabouthowthey
key1”)requirethatuserscompletelyrethinktheirstrategy. We createdthatpassword. Wefirstaskedparticipantstorespond
thuscreatedageneric-advicemodalthatrecommendsabstract ona5-pointscale(“stronglydisagree,”“disagree,”“neutral,”
strategies for creating passwords. Users access this modal “agree,”“stronglyagree”)tostatementsaboutwhethercreating
byclicking“howtomakestrongpasswords”orbyclicking a password was “annoying,” “fun,” or “difficult.” We also
“(Why?)” next to suggestions against password reuse. The askedwhethertheyreusedapreviouspassword,modifieda
firstoffourpointswemakeonthegeneric-advicemodalad- previouspassword,orcreatedanentirelynewpassword.
visesagainstreusingapasswordacrossaccounts. Wechose
The next three parts of the survey asked about the meter’s
to make this the first point because password reuse is very
coloredbar,textfeedback,andsuggestedimprovements. At
common[9,14,15,38],yetamajorthreattosecurity.
thetopofeachpage,weshowedatextexplanationandvisual
Werecommendconstructive,action-orientedstrategiesasour exampleofthefeatureinquestion. Participantsinconditions
secondandthirdpoints. Reflectingresearchthatshowedthat thatlackedoneormoreofthesefeatureswerenotaskedques-
passwords balancing length and character complexity were tionsaboutthatfeature. Forthefirsttwofeatures,weasked
oftenstrong[34]andbecauseattackerscanoftenexhaustively participantstorateonafive-pointscalewhetherthatfeature
guess all passwords up to at least 9 characters in offline at- “helpedmecreateastrongpassword,”“wasnotinformative,”
3778
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
andcausedthemtocreate“adifferentpasswordthan[they] • BarOnly(Bar)showsacoloredbardisplayingpassword
wouldhaveotherwise.” Wealsoaskedabouttheimportance strength,butwedonotprovideanytypeoftextfeedback
participantsplaceonthemetergivingtheirpasswordahigh otherthanwhichcompositionrequirementshavebeenmet.
score,theirperceptionoftheaccuracyofthestrengthrating, • NoFeedback(None)onlyindicatescompliancewiththe
andwhethertheylearnedanythingnewfromthefeedback. password-compositionrequirements.
Aftertheparticipantcompletedthesurvey,webroughthimor
Dimension3: ScoringStringency
hertoaloginpageandauto-filledhisorherusername. The
Uretal.foundthestringencyofameter’sscoringhasasig-
participantthenattemptedtore-enterhisorherpassword. We
nificantimpactonpasswordstrength[42]. Wethustestedtwo
refer to this final step as Part 1 recall. After five incorrect
scoringstringencies. Thesestringencieschangedthemapping
attempts,welettheparticipantproceed.
betweentheestimatednumberofguessesthepasswordwould
resistandhowmuchofthecoloredbarwasfilled.
Part2
After48hours,weautomaticallyemailedparticipantstore- • Medium(M)One-thirdofthebarfullrepresents106esti-
turn and re-enter their password. We term this step Part 2 matedguessesandtwo-thirdsfullrepresents1012.
recall,anditwasidenticaltoPart1recall. Wethendirected • High(H)One-thirdofthebarfullrepresents108estimated
participantstoasurveyabouthowtheytriedtoremembertheir guessesandtwo-thirdsfullrepresents1016.
password. Inparticular,wefirstaskedhowtheyenteredtheir
AdditionalConditionsforExperiment2
password on the previous screen. We gave multiple-choice
Experiment2addedthefollowingtwosettingsforourfeed-
optionsencompassingautomaticentrybyapasswordmanager
backandstringencydimensions,respectively:
orbrowser,typingthepasswordinentirelyfrommemory,and
lookingapasswordupeitheronpaperorelectronically. • Feedback:Standard,NoBar(NoBar)TheStandardmeter
withoutanycoloredbar. Thetextfeedbackstilldependson
Conditions thepassword’sscore,sostringencystillmatters.
InExperiment1,weassignedparticipantsround-robintoone • Stringency: Low (L) One-third of the bar full represents
of 18 different conditions that differed across three dimen- 104estimatedguessesandtwo-thirdsfullrepresents108.
sions in a full-factorial design. We refer to our conditions
usingthree-partnamesreflectingthedimensions:1)password- To investigate these two settings we introduced eight new
compositionpolicy;2)typeoffeedback;and3)stringency. conditionsandre-ranthefourexistingStd-MandStd-Hcon-
ditionsinafull-factorialdesign.
Dimension1: Password-CompositionPolicy
Weexpectedametertohaveadifferentimpactonpassword Analysis
securityandusabilitydependingonwhetheritwasusedwith Wecollectednumeroustypesofdata.Ourmainsecuritymetric
a minimal or a more complex password-composition pol- wastheguessabilityofeachparticipant’spassword,ascalcu-
icy. We tested the following two policies, which represent lated by CMU’s Password Guessability Service [7], which
awidespread,laxpolicyandamorecomplexpolicy. modelsfourtypesofguessingattacksandwhichtheyfound
to be a conservative proxy for an expert attacker [44]. Our
• 1class8 (1c8) requires that passwords contain 8 or more
usability measurements encompassed both quantitative and
characters,andalsothattheynotbe(case-sensitive)oneof
qualitative data. We recorded participants’ keystrokes, en-
the96,480suchpasswordsthatappearedfourormoretimes
ablingustoanalyzemetricslikepasswordcreationtime. To
intheXatocorpusof10millionpasswords[6].
understandtheuseofdifferentfeatures,weinstrumentedallel-
• 3class12(3c12)requiresthatpasswordscontain12ormore ementsoftheuserinterfacetorecordwhentheywereclicked.
charactersfromatleast3differentcharacterclasses(lower-
ForbothPart1andPart2,wemeasuredwhetherparticipants
caseletters,uppercaseletters,digits,andsymbols). Italso
successfully recalled their password. For participants who
requiresthatthepasswordnotbe(case-sensitive)oneofthe
didsuccessfullyrecalltheirpassword,wealsomeasuredhow
96,926suchpasswordsthatappearedintheXatocorpus[6].
long it took them, as well as how many attempts were re-
Dimension2: TypeofFeedback quired. Because notallparticipantsreturnedfor Part2, we
Ourseconddimensionvariesthetypeoffeedbackweprovide measuredwhatproportionofparticipantsdid,hypothesizing
toparticipantsabouttheirpassword. Whilethefirstsetting thatparticipantswhodidnotremembertheirpasswordmight
representsourstandardmeter,weremovedfeaturesforeach be less likely to return. To only study attempts at recalling
oftheothersettingstotesttheimpactofthosefeatures. apasswordfrommemory,weanalyzedpasswordrecallonly
forparticipantswhosaidtheytypedtheirpasswordinentirely
• Standard(Std)includesallpreviouslydescribedfeatures. frommemory,saidtheydidnotreusetheirstudypassword,
• NoSuggestedImprovement(StdNS)isthesameasStan- andwhosekeystrokesdidnotshowevidenceofcopy-pasting.
dard,exceptitneverdisplaysasuggestedimprovement.
Weaugmentedourobjectivemeasurementswithanalysesof
• Public(Pub)isthesameasStandard,exceptwenevershow responsestomultiple-choicequestionsandqualitativeanalysis
sensitive text feedback (i.e., we never show a suggested of free-text responses. These optional free-text responses
improvementandalwaysshowthelessinformative“public” solicitedparticipants’thoughtsabouttheinterfaceelements,
feedbacknormallyshownwhenthepasswordishidden). aswellaswhytheydid(ordidnot)findthemuseful.
3779
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
Ourprimarygoalwasunderstandingthequantitativeeffects frequently,whileothersareenteredveryrarely,andtheme-
of varying the three meter-design dimensions. Because we ter’simpactonmemorability(orlackthereof)maydifferin
hadmultipleindependentvariables,eachreflectingonedesign thesescenarios. Furthermore,whilewestudiedourmeter’s
dimension,weperformedregressions. Weranalinearregres- impactonthecreationofasinglepassword,userstypically
sionforcontinuousdata(e.g.,thetimetocreateapassword), havealargenumberofpasswords,potentiallyimpactingmulti-
alogisticregressionforbinarydata(e.g.,whetherornotthey accountintereferenceeffects. Ifafuturestudyweretofind
clickedonagivenUIelement),anordinalregressionforor- thatourmeterincreasesmulti-accountinterference,themeter
dinal data (e.g., Likert-scale responses), and a multinomial mightbebestdeployedonlyforhigh-valueaccounts.
logisticregressionforcategoricaldatawithnoclearordering
Ourstudy’secologicalvalidityisalsolimitedinotheraspects.
(e.g.,howtheyenteredtheirpassword).
Wedidnottesthabituationeffects,eithertotheuseofapar-
Foroursecurityanalyses,weperformedaCoxProportional- ticularpasswordortothenovelpassword-meterfeatureswe
Hazards Regression, which is borrowed from the literature tested. Ourpasswordmetermightcauselong-termchangesin
onsurvivalanalysisandhasbeenusedtocomparepassword howuserscreatepasswordsinthewildthatonlyanextended
guessability[31].Becauseweknowthestartingpointofguess- field study could capture. On the one hand, lessons users
ingbutnottheendpoint,weusearight-censoredmodel[17]. learnfromourmeteraboutpasswordcreationcouldhelpthem
In a traditional clinical model using survival analysis, each createstrongerpasswordsforsitesthatdonotprovidefeed-
datapointismarkedas“alive”or“deceased,”alongwiththe back. Ontheotherhand,potentialusabilitydrawbacksmight
timeoftheobservation. Ouranaloguesforpasswordsare“not become more pronounced or have a larger effect over time
guessed”and“guessed,”alongwiththenumberofguessesat inthewild. Inaddition,passwordcreationwastheprimary
whichthepasswordwasguessed,ortheguessingcutoff. taskinourstudy. Ourmetermighthaveadifferentimpactin
themoretypicalscenarioofpasswordcreationbeingusers’
Wealwaysfirstfitamodelwiththethreedesigndimensions
secondarytask,whichmightlowertheirmotivationtocreate
(compositionpolicy,feedback,andstringency)eachtreatedas
strongpasswords.
ordinalvariablesfitlinearly,aswellasinteractiontermsfor
eachpairofdimensions. Tobuildaparsimoniousmodel,we
PARTICIPANTS
removedanyinteractiontermsthatwerenotsignificant,yet
4,509 participants (2,717 in Experiment 1 and 1,792 in Ex-
alwayskeptallthreemaineffects,andre-ranthemodel.
periment 2) completed Part 1, and 84.1% of them finished
We corrected for multiple testing using the Benjamini- Part2.Amongourparticipants,52%identifiedasfemale,47%
Hochberg(BH)procedure[3]. Wechosethisapproach,which asmale,andtheremaining1%identifiedasanothergender
ismorepowerfulandlessconservativethanmethodslikeHolm or preferred not to answer. Participants’ ages ranged from
Correction,becauseweperformedanexploratorystudywitha 18 to 80 years old, with a median of 32 (mean 34.7). We
largenumberofvariables. Wecorrectedallp-valuesforeach askedwhetherparticipantsare“majoringinor...haveadegree
experimentasagroup. Weuseα =0.05. orjobincomputerscience,computerengineering,informa-
tiontechnology,orarelatedfield,”and82%responded“no.”
NotethatweanalyzedExperiments1and2separately. Thus,
Demographicsdidnotvarysignificantlybycondition.
we do not compare conditions between experiments. How-
ever,inthegraphsandtablesthatfollowwehavecombined
RESULTS
ourreportingofthesetwoexperimentsforbrevity. Weonly
Increasinglevelsofdata-drivenfeedback, evenbeyondjust
report“NoBar”andlow-stringencydatafromExperiment2in
acoloredbar,leduserstocreatestronger1class8passwords.
thesetablesandgraphs. Forconditionsthatwerepartofboth
Thatis,detailedtextfeedbackledtomoresecurepasswords
experiments,weonlyreporttheresultsfromExperiment1.
than the colored bar alone. The 3class12 policy also led to
strongerpasswords.Incontrast,varyingthescoringstringency
hadonlyasmallimpact.
Limitations
We based our study’s design on one researchers have used Amongourusabilitymetrics,alldimensionsaffectedtiming
previouslytoinvestigatevariousaspectsofpasswords[21,28, andsentiment,buttheymostlydidnotaffectpasswordmem-
34,42]. However,thepasswordparticipantscreateddidnot orability. Notably,althoughincreasinglevelsofdata-driven
protect anything of value. Beyond our request that they do feedbackledtostrongerpasswords,wedidnotobserveany
so,participantsdidnotneedtoexhibittheirnormalbehavior. significantimpactonmemorability. Table1summarizesour
Mazureketal.[31]andFahletal.[13]examinedtheecological keyfindings. Throughoutthissection,ifwedonotexplicitly
validityofthisprotocol,findingittobeareasonable,albeit calloutametricasrevealingadifferencebetweenvariantsof
imperfect,proxyforhigh-valuepasswordsforrealaccounts. themeter,thenwedidnotobserveasignificantdifference.
Thatsaid,nocontrolledexperimentcancaptureeveryaspect Overall,56.5%ofparticipantssaidtheytypedtheirpassword
ofpasswordcreation. Wedidnotcontrolthedevice[49,53] inentirelyfrommemory. Otherparticipantslookeditupon
onwhichparticipantscreatedapassword,norcouldwecon- paper (14.6%) or on an electronic device (12.8%), such as
trolhowparticipantschosetoremembertheirpassword. We theircomputerorphone. Anadditional11.2%ofparticipants
testedpasswordrecallatonlytwopointsintime. Ourlimited said their password was entered automatically for them by
evaluationofmemorabilitydoesnotcapturemanyaspectsof a password manager or browser, while 4.9% entered their
thepasswordecosystem. Somepasswordsareenteredvery passwordinanotherway(e.g.,lookinguphints).
3780
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
60% 1c8−None password’ssurvivaltermasthedependentvariable. ForEx-
periment 1, we found significant main effects for both the
1c8−Bar−M policyandtypeoffeedback,butwealsoobservedasignificant
interactioneffectbetweenthepolicyandtypeoffeedback.For
d increasedintelligibility,wesubsequentlyranseparateregres-
e40%
ss 1c8−Std−M sions for 1class8 and 3class12 passwords. As the 3class12
e
gu policyrequireslongerpasswordsthan1class8,participantsun-
nt 3c12−None surprisinglycreated3class12passwordsthatweresignificantly
Perce20% 33cc1122−−BSatdr−−MM longer (p<.001) and significantly more secure (p<.001)
than1class8passwords.
Moving from a 1class8 to a 3class12 policy increased the
timeittooktocreatethepassword,measuredfromthefirst
0% keystroketothelastkeystrokeinthepasswordbox(p=.014).
101 103 105 107 109 1011 1013 1015 Italsoimpactedparticipantsentiment. The3class12policy
Guesses led participants to report password creation as significantly
Figure4: Guessabilityofmedium-stringencypasswordscre- moreannoyinganddifficult(both p<.001).
atedwithoutanyfeedback(“None”),withonlyacoloredbar
Passwords created under a 3class12 policy were more se-
(“Bar”),andwithbothacoloredbarandtextfeedback(“Std”).
cure than those created under 1class8, but 3class12 partici-
pants were less likely to remember their passwords during
Part2(p=.025). Acrossconditions,81.3%of1class8partic-
60% 1c8−None ipantsrecalledtheirpasswordduringPart2,whereas75.0%
of3class12participantsdid. Thepolicydidnotsignificantly
1c8−Bar−M impactanyotherrecallmetrics.
1c8−Pub−M
d 1c8−StdNS−M
se40% 1c8−NoBar−M
es 1c8−Std−M
gu ImpactoftheAmountofFeedback
nt For1class8passwords,wefoundthatincreasinglevelsofdata-
e
c
Per20% dpraisvsewnofredesd(bpac<k.l0ed01p)a.rRticeilpaatinvtesttoohcarevaintegsnigonfiefiecdabnatclyk,sttrhoenfguelrl
suiteofdata-drivenfeedbackledto44%strongerpasswords.
AsshowninFigure4,thecoloredbaronitsownledpartici-
0% pantstocreatestrongerpasswordsthanhavingnofeedback,
echoingpriorwork[42]. Thedetailedtextfeedbackweintro-
101 103 105 107 109 1011 1013
Guesses duceinthisworkledtostrongerpasswordsthanjustthebar
(Figure 4). Increasing the amount of feedback also led par-
Figure5: Guessabilityof1class8, medium-stringencypass-
ticipantstocreatelongerpasswords(p<.001). Forexample,
wordscreatedwithdifferentlevelsoffeedback. Thisgraph
themedianlengthof1c8-Nonepasswordswas10characters,
shows data from Experiment 1. However, we have added
whereasthemedianfor1c8-Std-Mwas12characters.
1c8-NoBar-MfromExperiment2forillustrativepurposes.
Notably,thesecurityof1class8passwordscreatedwithour
standardmeter(includingalltextfeedback)wasmoresimilar
Consideringpasswordreuseandcopy-pasting,50.6%ofpar- tothesecurityof3class12passwordscreatedwithoutfeedback
ticipantstriedtorecallanovelstudypasswordfrommemory; thanto1class8passwordscreatedwithoutfeedback(Figure4).
thesearetheparticipantsforwhomweexaminepasswordre-
Figure 5 details the comparative security impact of all six
call. Overall,98.4%oftheseparticipantssuccessfullyrecalled
feedback levels on 1class8 passwords. Whereas suggested
theirpasswordduringPart1,andthemajoritydidsoontheir
improvementshadminimalimpact,havingtheoptiontoshow
firstattempt. Intotal,78.2%ofparticipantswhoreturnedfor
potentiallysensitivefeedbackprovidedsomesecuritybenefits
Part 2 and who tried to recall a novel study password from
overthepublic(“Pub”)variant. Whenweinvestigatedremov-
memorysucceeded,againprimarilyontheirfirstattempt.
ingthecoloredbarfromthestandardmeter,butleavingthe
We first discuss how the three different dimensions of the textfeedback,wefoundthatremovingthecoloredbardidnot
meter’sdesignimpactedsecurityandusability. Wethendetail significantlyimpactpasswordstrength.
howparticipantsinteractedwith,andqualitativelyresponded
For3class12passwords,however,theleveloffeedbackdid
to,themeter’svariousfeatures.
notsignificantlyimpactpasswordstrength. Wehypothesize
that either we are observing a ceiling effect, in which the
ImpactofCompositionPolicy 3class12policybyitselfledparticipantstomakesufficiently
WefirstranaCoxregressionwithallthreedimensionsand strongpasswords,orthatthetextfeedbackdoesnotprovide
theirpairwiseinteractionsasindependentvariables,andthe sufficientlyusefulrecommendationsfora3class12policy.
3781
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
3c12−Std−L 3class12passwordsweretwocharacterslonger. Priorwork
testedameterthatusedonlybasicheuristics(lengthandchar-
3c12−Std−M
acterclasses)toscorepasswords,withaparticularemphasis
20% on length [42]. As a result, participants could fill more of
3c12−Std−H
d thestringentmeterssimplybymakingtheirpasswordlonger.
e
ss Incontrast,ourmeterscorespasswordsfarmorerigorously,
e
gu whichwehypothesizemightaccountforthisdifference.
nt
ce10% Althoughincreasingthescoringstringencyledparticipantsto
er
P createlongerpasswords,varyingbetweenmediumandhigh
stringency did not affect participant sentiment or how long
ittooktocreateapassword. Whenweaddedanadditional
lowstringencylevelinExperiment2,however,participants
0% withmorestringentmeterstooklongertocreateapassword
101 103 105 107 109 1011 1013 1015 (p<.001)anddeletedmorecharactersduringcreation(p<
Guesses .001). Theyalsotooklongertorecalltheirpasswordduring
Figure6: Guessabilityof3class12passwordsbystringency. Part1(p=0.010)andwerelesslikelytotryrecallingtheir
passwordsolelyfrommemory(p=.002),thoughstringency
did not significantly impact other recall metrics. For high-
Increasingtheamountoffeedbackincreasedthetimeittook stringencyconditions, increasedfeedbackledtoevenmore
tocreatethepassword(p=.011). Weobservedaninteraction deletions(p=.002). However,for3class12policies,higher
betweentheamountoffeedbackandthestringency;increasing stringencyledtofewerdeletionsoverall(p=.002).
theamountoffeedbackinhigh-stringencyconditionsledtoa
Increasingthestringencygreatlyimpactedparticipantsenti-
greatertimeincrease(p=.048).
ment. Itledparticipantstoperceivepasswordcreationasmore
Tounderstandhowparticipantschangetheirpasswordsduring annoying, more difficult, and less fun (p<.001, p<.001,
creation,weexaminedthenumberofdeletions,whichwede- p = .010, respectively). It also caused participants to be
finedasaparticipantremovingcharactersfromtheirpassword morelikelytosaythebarhelpedthemcreateastrongerpass-
thattheyaddedinapriorkeystroke. Increasingamountsof word(p=.027), lesslikelytobelievethebarwasaccurate
feedbackledtosignificantlymoredeletions(p<.001), im- (p<.001),andlesslikelytofinditimportantthatthebargives
plyingthatthefeedbackcausesparticipantstochangetheir themahighscore(p=.006). Increasinglevelsofstringency
password-creationdecisions. Forinstance,themediannumber made participants more likely to say the text feedback led
ofdeletionsfor1c8-Nonewaszero,whilethemediannumber themtocreateadifferentpasswordthantheywouldhaveoth-
for1c8-Std-Hwas9. erwise(p=.010),butalsolesslikelytobelievetheylearned
somethingnewfromthetextfeedback(p<.001).
Increasingtheamountoffeedbacknegativelyaffectedpartici-
pantsentiment. Itledparticipantstoreportpasswordcreation
TextFeedback
asmoreannoying(p<.001),moredifficult(p<.001),and
Participants reacted positively to the text feedback. Most
lessfun(p=.025). Foreachsentiment,roughly10%–15%
participants(61.7%)agreedorstronglyagreedthatthetext
oftheparticipantsinthatconditionmovedfromagreementto
feedback made their password stronger. Similarly, 76.9%
disagreement,orviceversa.
disagreed or strongly disagreed that the feedback was not
Eventhoughincreasingtheamountoffeedbackledtosignifi- informative, and 48.7% agreed or strongly agreed that they
cantlymoresecurepasswords,itdidnotsignificantlyimpact createdadifferentpasswordthantheywouldhaveotherwise
anyofourrecallmetrics. becauseofthetextfeedback. Higherstringencyparticipants
were more likely to say they created a different password
(p=.022),butnootherdimensionsignificantlyimpactedany
ImpactofStringency
otheroneofthesesentiments.
Althoughpriorworkonpasswordmetersfoundthatincreased
scoringstringencyledtostrongerpasswords[42],wefound Althoughmostparticipants(68.5%)selected“no”whenwe
thatvaryingbetweenmediumandhighstringencydidnotsig- askediftheylearned“somethingnewaboutpasswords(your
nificantlyimpactsecurity. Becausebothourmediumandhigh password,orpasswordsingeneral)fromthetextfeedback,”
stringency levels were more stringent than most real-world 31.5% selected “yes.” Participants commonly said they
passwordmeters,weinvestigatedanadditionallowstringency learnedaboutmovingcapitalletters, digits, andsymbolsto
settinginExperiment2. Withthesethreelevels,wefoundthat lesspredictablelocationsfromthemeter(e.g.,“Ididn’tknow
increasinglevelsofstringencydidleadtostrongerpasswords, itwashelpfultocapitalizeaninternalletter.”). Manypartici-
butonlyfor3class12passwords(p=.043). Figure6shows pantsalsonotedthatthemeter’srequestsnottousedictionary
the impact of varying the 3class12 stringency. In all cases, entriesorwordsfromWikipediaintheirpasswordtaughtthem
however,increasingthestringencyledparticipantstocreate somethingnew. Oneoftheseparticipantsnotedlearning“that
longerpasswords(p<.001). Forinstance,themedianhigh- hackers use Wikipedia.” Requests to include symbols also
stringency 1class8 password was one character longer than resonated. Asoneparticipantwrote,“Ididn’tknowpreviously
themedianlow-stringency1class8password. High-stringency thatyoucouldinputsymbolsintoyourpasswords.”
3782
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
Table1:Asummaryofhowmovingfroma1class8to3class12 improvement,81.5%saidthatsuggestedimprovementswere
policy,increasingtheamountoffeedback,orincreasingthe useful, while 18.5% said they were not. A slight majority
scoringstringencyimpactedkeymetrics. (50.9%)oftheseparticipantsagreedorstronglyagreedthat
thesuggestedimprovementhelpedthemmakeastrongerpass-
Metric Policy Feedback Stringency word. Thishelp,however,didnotoftencomeintheformof
Security adoptingtheprecisesuggestionoffered. Ineachconditionthat
Passwordshardertoguess ✦ 1class8only 3class12only offeredasuggestedimprovement,atmost7%ofparticipants
usedoneofthemeter’ssuggestedpasswordsverbatim. Our
Passwordcreation
Longerpasswords ✦ ✦ ✦ qualitativefeedbackindicatedthatthesuggestedimprovement
Moretimetocreate ✦ ✦ ✦ oftensparkedotherideasformodifyingthepassword.
Moredeletions – ✦ ✦
Morelikelytoshowonscreen ✦ – ✦ Weaskedparticipantswhofoundthesuggestedimprovements
Lesslikelytoshowonscreen – ✦ – useful to explain why. They wrote that seeing a suggested
Morelikelytoshowsuggested – – ✦ improvement“helpsyoumodifywhatyoualreadyhaveinstead
improvement
ofhavingtothinkofsomethingabsolutelynew”and“breaks
Sentimentaboutcreation
yououtofpatternsyoumighthavewhencreatingpasswords.”
Moreannoying ✦ ✦ ✦
Moredifficult ✦ ✦ ✦ Participantsparticularlylikedthatthesuggestedimprovement
Lessfun – ✦ ✦ wasamodificationoftheirpassword,ratherthananentirely
new one, because it “may help spark ideas about tweaking
Passwordrecall
LessmemorableinPart1 – – – the password versus having to start from scratch.” As one
Part1recalltooklonger – – ✦ participantsummmarized,“Itactuallyofferssomehelpinstead
LessmemorableinPart2 ✦ – – ofjustshuttingyoudownbyessentiallysaying‘no,notgood
Part2recalltooklonger – – –
enough,comeupwithsomethingelse.’ It’sveryhelpful.”
Requiredmoreattempts – – –
Participantlesslikelytotry
– – ✦ Participantswhodidnotfinditusefulexpressedtwostreamsof
recallingfrommemory
reasoning. Thefirstconcernedmemorability. Oneparticipant
explained,“I’mmorelikelytoforgetapasswordifIdon’tuse
Participantsalsotookthetextfeedbackasanopportunityfor familiartechniques.” whileanotherwrote,“Ialreadyhavea
reflectionontheirpassword-creationstrategies. Onepartici- certainformatinmindwhenIcreatemypasswordtohelpme
pantlearned“thatItendtousefullwordswhichisn’tgood,” memorizeitandIdon’tliketostrayfromthat.” Thesecond
whileanotherlearned“don’tbasethepasswordofftheuser- streamconcernedthetrustworthinessofthe“algorithmthat
name.” Participants exercised many of the features of our createsthosesuggestions.” Asoneparticipantwrote,“Idon’t
feedback,includingparticipantswho“learnedtonotuseL33T trustit. Idon’twantacomputerknowingmypasswords.”
tocreateapassword(exchanginglettersforpredictablenum-
bers).” Someparticipantsalsolearnedaboutpasswordreuse,
ModalWindows
notablythat“peoplestealyourpasswordsindatabreachesand
Clickinga“(why?)” linknexttoanyoftheuptothreepieces
theytrytouseittoaccessotheraccounts.”
offeedbackinanyconditionwithtextfeedbackopenedthe
specific-advice modal. Few participants in our experiment
SuggestedImprovement
clickedon“(why?)” links,andthereforefewparticipantssaw
Whenparticipantsinapplicableconditionsshowedtheirpass-
thespecific-advicemodal. Only1.0%ofparticipantsacross
wordorclicked“seeyourpasswordwithourimprovements,”
allconditionsclickedononeoftheselinks.
theywouldseethesuggestedimprovement.Acrossconditions,
37.8%ofparticipantsclickedthe“showpassword”checkbox. Incontrast,8.4%ofparticipantslookedatthegeneric-advice
Participants who made a 3class12 password, saw a higher- modal, though 3class12 participants were less likely to do
stringencymeter,orwhosawlessfeedbackweremorelikely so(p=.025). Participantscouldarriveatthegeneric-advice
to show their password (p<.001, p=.006, and p=.022, modalbyclicking“howtomakestrongpasswords”orclicking
respectively). Whilemostparticipantswhosawasuggested “(why?)”nexttoadmonitionsagainstpasswordreuse. Partici-
improvementdidsobecausetheychecked“showpassword& pantsarrivedatitaboutevenlythroughthesetwomethods.
detailedexplanations,”8.7%ofparticipantsinapplicablecon-
ditionsspecificallyclickedthe“seeyourpasswordwithour
ColoredBar
improvements”button. Higherstringencymadeparticipants
morelikelytoshowthesuggestedimprovement(p=.003). Wealsoanalyzedparticipants’reactionstothecoloredbar.All
threedimensionsimpactedhowmuchofthecoloredbarpar-
When we asked in the survey whether participants in those ticipantsfilled. Participantswhowereassignedthe3class12
conditionshadseenasuggestedimprovement,34.8%selected policy or saw more feedback filled more of the bar, while
“yes,” 55.6%selected“no,” and9.6%chosethe“Idon’tre- participants whose passwords were rated more stringently
member” option. Nearly all participants who said they did unsurprisinglyfilledless(all p<.001). Fewparticipantscom-
notseeasuggestedimprovementindeedwerenevershown pletelyfilledthebar(estimatedguessnumbers1018and1024
a suggested improvement because they never showed their in medium and high stringency, respectively). The median
password or clicked “see your password with our improve- participantoftenfilledhalftotwo-thirdsofthebar,depending
ments.” Amongparticipantswhosaidtheysawasuggested onthestringency. Forinstance,for1c8-Std-M,only16.5%
3783
Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA
completelyfilledthebar,but51.7%filledatleasttwo-thirds, sensitivefeedbackwhentheusershowshisorherpasswordon
and73.1%filledatleasthalf. screen. Whilemuchofthistextfeedbackmightberedundant
forpowerusers,theyarefreetoignoreit.
Overall,participantsfoundthecoloredbaruseful. Themajor-
ityofparticipants(64.0%)agreedorstronglyagreedthatthe Thespecificimprovementstoauser’spasswordsuggestedby
coloredbarhelpedthemcreateastrongerpassword, 42.8% our meter did seem to help some participants, but the over-
agreed or strongly agreed that the bar led them to make a alleffectwasnotstrongandsomeparticipantsdidnottrust
differentpasswordthantheywouldhaveotherwise,and77.2% suggestions from a computer. While the inclusion of such
disagreed or strongly disagreed with the statement that the suggested improvements does not seem to hurt, we would
coloredbarwasnotinformative. Participantsalsolookedto consideritoptional. Similarly, althoughthegeneric-advice
thecoloredbarforvalidation; 50.9%ofparticipantsagreed modalwasvisitedmorethanthespecific-advicemodal,only
or strongly agreed that it is important that the colored bar afractionofparticipantslookedatit. Becausenotallusers
gives their password a high score. High-stringency partici- needtolearnthebasicsofmakingstrongpasswords[41],itis
pants were less likely to care about receiving a high score reasonablethatonlyahandfulofuserswouldneedtoengage
(p=.025). With increasing amounts of feedback, partici- withthesefeatures. Wethusrecommendthattheybeincluded.
pantsweremorelikelytocareaboutreceivingahighscore
(p=.002),morelikelytosaythatthebarhelpedthemcreate Incontrasttopriorworkthatfoundscoringstringencytobe
crucialforpasswordmeters[42],weonlyobservedasignifi-
astrongerpassword(p<.001)thatwasdifferentthanthey
cantsecurityeffectfor3class12passwords,andtheeffectsize
wouldhaveotherwise(p<.001). Theywerealsolesslikely
tobelievethebarwasnotinformative(p=.024). wassmall. Notethatourmeterusedfarmoreadvancedmeth-
odstoscorepasswordsmoreaccuratelythanthebasicheuris-
Participantsmostlyfeltthecoloredbarwasaccurate. Across tics tested in that prior work. Because the high-stringency
conditions,68.2%ofparticipantsfeltthebarscoredtheirpass- settingnegativelyimpactedsomeusabilitymetrics,werecom-
word’sstrengthaccurately,while23.6%feltthebargavetheir mendourmediumsetting.
passwordalowerscorethandeserved. Anadditional4.2%of
Ourrecommendationsdifferfor3class12passwords. Theme-
participantsfeltthebargavetheirpasswordahigherscorethan
terhadminimalimpactonthesecurityof3class12passwords.
deserved,while4.0%didnotrememberhowthebarscored
Whilethemeterintroducedfewusabilitydisadvantages,sug-
theirpassword. Participantsinmorestringentconditionswere
gestingthatitmaynothurttoincludethemeter,wewouldnot
lesslikelytofindthebar’sratingaccurate(p<.001).
recommenditnearlyasstronglyasfor1class8passwords.
Wealsotestedremovingthecoloredbarwhilekeepingalltext
While the studies we report in this paper are a crucial first
feedback. Removingthecoloredbarcausedparticipantstobe
morelikelytoreturnforPart2ofthestudy(p=.020),butdid step in pinpointing the impact of meter design dimensions
andconfigurations,thenextstepistoinvestigatetheseeffects
notimpactanyotherobjectivesecurityorusabilitymetrics.
furtherinafieldstudy. Suchastudycouldimproveecological
Removingthecoloredbardidimpactparticipantsentiment,
validity,allowforexpandedtestingofpasswordmemorability,
however. Participantswhodidnotseeabarfoundpassword
andenablecomparisonswithcurrentlydeployedmeters.
creationmoreannoyinganddifficult(both p<.001).
Tospuradoptionofdata-drivenpasswordmeters,wearere-
leasingourmeter’scodeopen-source.2
DISCUSSIONANDDESIGNRECOMMENDATIONS
Wedescribedourdesignandevaluationofapasswordmeter
thatprovidesdetailed,data-drivenfeedback. Usingacombina- ACKNOWLEDGMENTS
tionofaneuralnetworkand21carefullycombinedheuristics ThisworkwassupportedinpartbyagiftfromthePNCCenter
toscorepasswords,aswellasgivingusersdetailedtextexpla- forFinancialServicesInnovation,NSFgrantCNS-1012763,
nationsofwhatpartsoftheirpasswordarepredictable, our andbyJohn&ClaireBertucci.
metergivesusersaccurateandactionableinformationabout
theirpasswords.
REFERENCES
1. StevenVanAcker,DanielHausknecht,WouterJoosen,
We found that our password-strength meter made 1class8
andAndreiSabelfeld.2015.Passwordmetersand
passwords harder to guess without significantly impacting
generatorsontheweb: Fromlarge-scaleempiricalstudy
memorability. Textfeedbackledtomoresecure1class8pass-
togettingitright.InProc.CODASPY.
words than a colored bar alone; colored bars alone are the
type of meter widely deployed today [10]. Notably, show-
2. AnneAdams,MartinaAngelaSasse,andPeterLunt.
ingdetailedtextfeedbackbutremovingthecoloredbardid
1997.Makingpasswordssecureandusable.InProc.HCI
notsignificantlyimpactthesecurityofthepasswordspartic-
onPeopleandComputers.
ipantscreated. Combinedwithourfindingthatmostpeople
do not feel compelled to fill the bar, this suggests that the 3. YoavBenjaminiandYosefHochberg.1995.Controlling
visualmetaphorhasonlymarginalimpactwhendetailedtext thefalsediscoveryrate: Apracticalandpowerful
feedbackisalsopresent. Asaresult,wehighlyrecommend approachtomultipletesting.JournaloftheRoyal
the use of a meter that provides detailed text feedback for StatisticalSociety,SeriesB57,1(1995),289–300.
commonpassword-compositionpolicieslike1class8. From
our results, we recommend that the meter offer potentially 2Sourcecode:https://github.com/cupslab/password_meter
3784
Description:composition. In Proc. INTERACT. 49. Emanuel von Zezschwitz, Alexander De Luca, and. Heinrich Hussmann. 2014. Honey, I shrunk the keys: Influences of mobile devices on password composition and authentication performance. In Proc. NordiCHI. 50. Kim-Phuong L. Vu, Robert W. Proctor, Abhilasha.