LANGUAGE, COGNITION AND NEUROSCIENCE, 2018
https://doi.org/10.1080/23273798.2018.1431679

REGULAR ARTICLE

When does abstraction occur in semantic memory: insights from distributional models

Michael N. Jones
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN, USA

ABSTRACT
Abstraction is a core principle of Distributional Semantic Models (DSMs) that learn semantic representations for words by applying dimensional reduction to statistical redundancies in language. Although the posited learning mechanisms vary widely, virtually all DSMs are prototype models in that they create a single abstract representation of a word's meaning. This stands in stark contrast to accounts of categorisation that have very much converged on the superiority of exemplar models. However, there is a small but growing group of accounts in psychology, linguistics, and information retrieval that are exemplar-based semantic models. These models borrow many of the ideas that have led to the prominence of exemplar models in fields such as categorisation. Exemplar-based DSMs posit only an episodic store, not a semantic one. Rather than applying abstraction mechanisms at learning, these DSMs posit that semantic abstraction is an emergent artifact of retrieval from episodic memory.

ARTICLE HISTORY
Received 2 December 2016; Accepted 17 January 2018

KEYWORDS
Semantic memory; latent semantic analysis; categorisation; abstraction
Abstraction is an essential mechanism to learn and represent meaning in memory. Theoretical notions of abstraction vary across research domains, but tend to emphasise aggregation across exemplars to a central "average" representation (Reed, 1972), transforming sensorimotor input to a deeper knowledge representation (Barsalou, 1999; Damasio, 1989), or reducing idiosyncratic dimensions to focus on those attributes most common to members of a category (Rosch & Mervis, 1975). In modern computational models of semantic memory, notions of abstraction are formally specified and applied to real-world linguistic data to evaluate the structure of semantic memory that the mechanisms would produce.

Modern distributional semantic models (DSMs; e.g. Landauer & Dumais, 1997) have become immensely popular in the cognitive literature due to their success at fitting human experimental data, their utility in real-world applications, and their insights as models of cognition. In general, DSMs learn distributed representations for word meanings from statistical redundancies across linguistic experience. Because they are often applied to text corpora as learning data, DSMs are also referred to as "corpus-based" models, although, in principle, their learning mechanisms can be applied to covariational structure in any dataset (e.g. perception, speech, etc.).

Despite the wide range of DSMs in the literature, they virtually all share the characteristic that they are prototype models: They attempt to collapse the entire set of a word's linguistic exemplars into a single economical representation of word meaning. However, this practice is in contrast to the literature on categorisation that has largely disposed of prototype representations in favour of exemplar-based models. In this paper, I highlight the contradiction between literatures, and attempt to build a case for exemplar-based models of distributional semantics.

Abstraction is a core mechanistic principle of DSMs. Most DSMs apply some form of dimensional reduction to words' experienced linguistic contexts, essentially abstracting over the dimensions that are idiosyncratic to each context, and converging on the stable higher-order dimensions that optimally explain the covariational pattern of words across contexts. Aggregation is also a core principle of virtually all DSMs – multiple linguistic contexts are averaged across, either explicitly or implicitly, resulting in a single central representation for the word that is stored. A word's vector pattern across these reduced dimensions is thought to represent its generic meaning. Hence, each DSM formally specifies an abstraction mechanism by which episodic memory is transformed into semantic memory; in this sense, DSMs embody the idea of abstraction and allow us to quantitatively evaluate various process explanations of aggregation and dimensional reduction.
CONTACT Michael N. Jones [email protected]
In Press: Language, Cognition, and Neuroscience, Special Issue on "Concepts: What, When, and Where".
The particular mechanisms posited for abstraction in DSMs differ in several theoretically important ways, and include reinforcement learning, probabilistic inference, latent induction, and Hebbian learning. Enumerating the differences between the mechanisms used by each model is beyond the scope of this article (see Jones, Willits, & Dennis, 2015 for a review); but all DSMs essentially specify an abstraction mechanism to formalise the classic notion in linguistics that "you shall know a word by the company it keeps" (Firth, 1957). The theoretical points that follow apply broadly to all DSMs that posit abstraction at learning, regardless of specific learning mechanism. As two examples of DSMs with very different architectures and learning mechanisms,1 briefly consider classic Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) and the newest DSM – Google's word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013).

LSA begins with a word-by-document frequency matrix of a text corpus. This initial "episodic" matrix represents first-order relationships: words are similar if they have frequently co-occurred in contexts. LSA then applies singular value decomposition to this episodic matrix (cf. factor analysis), retaining only the 300 or so dimensions that account for the largest amount of variance in the original matrix. Singular value decomposition serves as an abstraction mechanism, reducing the dimensionality and emphasising second-order relationships that were not obvious in the episodic matrix. In the reduced space, words will be similar if they occur in similar contexts, even if they never directly co-occur (e.g. category exemplars and synonyms). But much of the information idiosyncratic to specific contexts that would be required to reconstruct the full original episodic matrix has now been lost. Hence, LSA achieves an abstracted semantic representation by applying truncated SVD to the history of episodes.
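To make this abstraction step concrete, the following sketch applies truncated SVD to a toy word-by-document count matrix. The miniature corpus, the five words, and the choice of two retained dimensions are illustrative assumptions rather than anything from Landauer and Dumais (1997); the point is only that words with similar document histories end up nearby in the reduced space even when they never directly co-occur.

```python
import numpy as np

# Toy word-by-document count matrix (rows = words, columns = documents).
# The corpus and counts are invented purely for illustration.
words = ["drink", "milk", "juice", "engine", "fuel"]
X = np.array([
    [1, 1, 0, 0],   # drink
    [1, 0, 0, 0],   # milk
    [0, 1, 0, 0],   # juice
    [0, 0, 1, 1],   # engine
    [0, 0, 1, 0],   # fuel
], dtype=float)

# Singular value decomposition of the "episodic" matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values/vectors (LSA typically keeps ~300
# dimensions; k = 2 here only because the toy matrix is tiny).
k = 2
word_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# milk and juice never co-occur in any document, but both occur with drink,
# so they end up close together in the reduced space.
print(cosine(word_vectors[words.index("milk")], word_vectors[words.index("juice")]))
print(cosine(word_vectors[words.index("milk")], word_vectors[words.index("engine")]))
```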
Mikolov et al.'s (2013) word2vec achieves a similar outcome, albeit in a rather different way. Word2vec is a "neural embedding" model that has been extremely successful in computational linguistics. To a cognitive scientist, the model is essentially a feedforward connectionist network (cf. Rumelhart networks explored in Rogers & McClelland, 2004) with some optimisation tricks that allow it to be scaled up to large amounts of text data. Word2vec has localist input and output layers, each with one node for each word in the corpus. The input and output layers are fully connected via a hidden layer of ∼300 nodes, which allows the model to learn nonlinear patterns in the text corpus. When a word is experienced, the other words that it occurs with serve as its context. With the nodes for the context words activated, activation feeds forward to the output layer, with the desired output being the activation of the correct target word and other words being inhibited. The error signal (the difference between the true and observed output pattern) is then backpropagated through the network to increase the likelihood that the correct word will be activated at the output layer given the input words in the future. Hence, the context is used to predict the word.2 After training on a large text corpus, a word's pattern across the hidden layer begins to show higher-order relationships that go beyond the first-order relationships it was being trained to predict. Very much like LSA's reduced representation, the reduced representation across word2vec's hidden layer has now learned similarity between words that are predicted by similar contexts. While LSA used SVD for data reduction, word2vec used backpropagation; but both models essentially abstract semantics from episodes.
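The sketch below illustrates this kind of context-predicts-word training loop. It is a bare-bones illustration rather than Mikolov et al.'s implementation: the toy corpus, hidden-layer size, learning rate, and the use of a full softmax output (instead of word2vec's negative sampling and other optimisation tricks) are all simplifying assumptions.

```python
import numpy as np

# Toy corpus: each "sentence" is a list of word tokens (invented for illustration).
corpus = [
    ["i", "drink", "milk"], ["i", "drink", "juice"], ["i", "drink", "wine"],
    ["we", "start", "the", "engine"], ["we", "fuel", "the", "engine"],
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 10                    # vocabulary size; hidden layer (~300 in practice)

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, H))        # input-to-hidden weights (the word vectors)
W_out = rng.normal(0, 0.1, (H, V))       # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for epoch in range(500):
    for sent in corpus:
        for t, target in enumerate(sent):
            context = [idx[w] for i, w in enumerate(sent) if i != t]
            h = W_in[context].mean(axis=0)     # hidden activation for this context
            p = softmax(h @ W_out)             # predicted distribution over target words
            err = p.copy()
            err[idx[target]] -= 1.0            # cross-entropy error at the output layer
            grad_h = W_out @ err               # error signal reaching the hidden layer
            W_out -= lr * np.outer(h, err)     # backpropagate to both weight layers
            W_in[context] -= lr * grad_h / len(context)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that are predicted by similar contexts (milk, juice, wine) tend to develop
# more similar vectors than words from unrelated contexts (milk vs. engine).
print(cosine(W_in[idx["milk"]], W_in[idx["juice"]]))
print(cosine(W_in[idx["milk"]], W_in[idx["engine"]]))
```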
These similarities can be seen across all of the DSMs – all achieve the desired outcome of a reduced abstraction of word meaning from episodic co-occurrences. The jury is still out on which (if any) mechanism is the most plausible model of how humans construct semantic representations. But one property that is clear to all of these "abstraction at learning" DSMs is that they may be classified as prototype models. The models attempt to create a single abstracted representation of meaning for each word, and this single semantic representation is what is stored and used in downstream fitting of psycholinguistic data.

There are many similarities in the literatures on semantic memory and categorisation, enough that it is likely that the cognitive mechanisms that subserve semantic learning and category learning may be heavily related to each other. But one key contradiction stands out: While the literature on categorisation has very much converged on the superiority of exemplar-based models, DSMs are all essentially prototype models.

Lessons from categorization models

Categorisation and semantic abstraction have many similarities, and it is commonly believed that the process of categorisation may be used to produce semantic structure (see Rogers & McClelland, 2011 for a review). The categorisation literature has been dominated for many years by a debate between prototype and exemplar-based theories. Prototype theories are based largely on principles of cognitive economy championed by Rosch and Mervis (1975). Prototype theories (e.g. Reed, 1972) posit that as category exemplars are experienced, humans gradually abstract generalities across them and construct a single prototypical
representation of the category that is the central tendency of its exemplars; categorisation of a new exemplar depends on its similarity to category prototypes. In contrast, exemplar theories (e.g. Medin & Schaffer, 1978; Nosofsky, 1988) posit that humans store every experienced exemplar in memory, and categorisation of a new exemplar depends on its weighted similarity to all stored exemplars.
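A minimal sketch of the exemplar account, in the spirit of the Generalized Context Model (Nosofsky, 1986), is given below; the two-dimensional stimuli, the category assignments, and the sensitivity parameter are invented purely for illustration.

```python
import numpy as np

# Stored exemplars: each row is a remembered stimulus; labels give its category.
# Stimuli and categories here are invented purely for illustration.
exemplars = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
labels = np.array(["A", "A", "B", "B"])

def categorise(probe, c=5.0):
    """Weight every stored exemplar by its similarity to the probe (an
    exponential decay of distance) and sum the evidence for each category."""
    distances = np.linalg.norm(exemplars - probe, axis=1)
    similarities = np.exp(-c * distances)
    evidence = {cat: similarities[labels == cat].sum() for cat in np.unique(labels)}
    total = sum(evidence.values())
    return {cat: v / total for cat, v in evidence.items()}

print(categorise(np.array([0.15, 0.15])))   # near the stored "A" exemplars
print(categorise(np.array([0.85, 0.85])))   # near the stored "B" exemplars
```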
Perhaps more than any other sub-field of cognition, the categorisation literature has very much converged on the conclusion that exemplars have beaten out prototypes as models of human categorisation (but see Murphy, 2016, for a careful discussion of the limitations of both). In addition to exemplar models providing a better quantitative account of human categorisation data, there are many theoretically differentiating effects that are easily explainable by exemplar models but that are simply impossible under prototype accounts. For example, category structures with nonlinearly separable structure (e.g. XOR) are easily learned by humans, but impossible to account for by single prototype models (Ashby & Maddox, 1993; Nosofsky, 1988). Even when using linear category structures that should be conducive to prototype models, exemplar models still give a superior quantitative fit to human data (Stanton, Nosofsky, & Zaki, 2002). Hence, it is certainly odd that the field of distributional semantics is dominated by prototype models, while the field of categorisation has largely dismissed them in favour of exemplar accounts.

In the typical categorisation experiment, subjects are presented with stimulus patterns – exemplars – accompanied by a category label. At test, the experimenter can present old or new exemplars, and the subject responds with the most appropriate category label for each stimulus. We can think of distributional learning of semantics in an analogous way: The context is the exemplar pattern, and the word is the label of the category to which this particular exemplar belongs.

In word2vec, for example, the other words that occur with a target word are used as the context, or exemplar pattern, and the correct label is the target word. So in the sentence "I am drinking a glass of milk," drinking + glass are used as the context to predict milk. The exemplar pattern for milk in this context is a localist vector with drinking and glass set to one and all other words set to zero. Across many language exemplars that are all of the category milk, word2vec homes in on a pattern of activation across its hidden layer that optimally predicts milk as the label given any language exemplar context that contains milk. In addition, the hidden layer pattern for milk will be very similar to that of other words that are predicted by similar contexts, such as juice and wine. So in all DSMs, an exemplar can be thought of as the context pattern of other words that a target word (the category label) occurs with. This reframing of semantic learning is very similar to current state-of-the-art exemplar-based models of categorisation in which "…a stimulus is stored in memory as a complete exemplar that includes the full combination of other features. Thus the 'context' for a feature is the other features with which it co-occurs" (Kruschke, 2008, p. 273). However, DSMs aggregate over the multiple exemplars to create an economical prototype.
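To spell out that framing, the sketch below builds the localist exemplar vector for the milk example; the six-word vocabulary is an invented stand-in for the full corpus vocabulary that the real input layer would cover.

```python
import numpy as np

# A tiny illustrative vocabulary (the real input layer has one node per corpus word).
vocab = ["drinking", "glass", "milk", "juice", "engine", "fuel"]
idx = {w: i for i, w in enumerate(vocab)}

def exemplar_vector(context_words):
    """Localist exemplar pattern: 1 for each context word, 0 everywhere else."""
    v = np.zeros(len(vocab))
    for w in context_words:
        v[idx[w]] = 1.0
    return v

# "I am drinking a glass of milk": drinking + glass form the context exemplar,
# and the target word milk plays the role of the category label.
exemplar = exemplar_vector(["drinking", "glass"])
label = "milk"
print(exemplar, "->", label)
```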
Hence, most DSMs collapse all instances of a word's context into a single representation, or point in high-dimensional space, very much consistent with representational economy (Rosch, 1973). This process produces huge issues in semantic representation that are known to the field – for example, a homograph like bank has both senses of its meaning collapsed into a single representation, despite the fact that they are very different context patterns. As a result, the representation becomes a weighted average (biased to the more frequent sense) of the multiple senses of bank. A homograph like bank, with multiple unrelated senses, has a similar characteristic structure to experimental stimuli with XOR structure. But the prototype collapsing is a problem for all words with graded amounts of polysemy that would be captured by an exemplar-based model but are abstracted over by a prototype-based DSM. Multiple distinct statistical structures that map onto the same label are collapsed in most DSMs, leading to a range of both theoretical and practical issues for the models. But far from being rare, multiple senses and contextual modulation patterns are really the norm in linguistic information (Jones, Dye, & Johns, 2016; Kintsch, 2001).

Lessons from multiple-trace models

Posner and Keele's (1968) schema abstraction experiment is a classic in semantic memory research, and was a key laboratory phenomenon that led Tulving (1972) to divide declarative memory into separate semantic and episodic stores in his modular taxonomy.3 In their task, Posner and Keele presented subjects with random dot patterns as exemplars of multiple categories. Unbeknownst to subjects, the exemplars they experienced were created from parent prototype patterns for each category. A category exemplar was a random perturbation (low or high distortion) of the prototype pattern, but prototypes were never shown to subjects during learning. There are many interesting effects from the schema abstraction task, but a key finding is that although subjects at test are best at classifying the exemplars they were trained on, they are better at classifying the prototype than new exemplars. In addition, with a delay
between training and test, performance on the prototype (which was never experienced) is better than performance on the old exemplars that subjects were actually trained on. Furthermore, exemplars and prototypes follow different trajectories of decay as a function of retention time.

This pattern of results suggests that subjects are storing experienced exemplars in episodic memory at the same time as they are creating an abstracted prototype for the category. The differential decay patterns also suggest that these information sources are stored by distinct memory systems, and that semantic memory is more resilient to decay than is episodic memory. The pattern would seem to argue in favour of DSMs that use abstraction at encoding to create a prototypical pattern, with episodic memory then explained by a distinct model.

However, Hintzman (1984, 1986) provided a classic demonstration using his MINERVA 2 memory model that questioned whether the effects seen in Posner and Keele's (1968) schema abstraction task suggest the existence of a prototype in memory at all. Briefly, MINERVA 2 is an instance-based memory model: it stores a pattern for each exemplar in episodic memory, but has no semantic memory. Multiple presentations of an exemplar simply lay down multiple memory traces. The model explains a range of episodic memory effects such as recognition, judgments of frequency, etc. But it can also perform the classification task used in the schema abstraction experiments. When presented with a probe pattern (an old or new exemplar), MINERVA 2 simultaneously computes the probe's similarity to all stored exemplar traces in memory, and the retrieved category label for the probe is weighted by the scaled similarity of the probe to all exemplars (cf. Nosofsky's, 1986, exemplar-based model of categorisation). The retrieved pattern is referred to as an "echo" from memory, and is based loosely on the principle of harmonic resonance. Although it has no semantic memory per se, MINERVA 2 reproduces the key phenomena in schema abstraction that had previously been seen as evidence for dual episodic and semantic stores. The model performs better on old exemplars at immediate test (but better on the prototype than new exemplars), and performance on the prototype is better than on the training exemplars after forgetting. Superior performance on the prototype is due simply to the fact that it is the central tendency of the exemplar patterns; hence the prototype's pattern is distributed across the exemplars. The performance trajectories of exemplars and the prototype as a function of delay have distinct slopes.
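As a rough illustration of how an echo can favour a prototype that was never stored, here is a sketch in the spirit of MINERVA 2; the pattern size, number of traces, distortion rate, and simplified similarity rule are illustrative assumptions rather than Hintzman's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy schema-abstraction memory: two category prototypes are random +1/-1
# patterns, and each stored trace is a distorted copy of one prototype.
n_features, n_exemplars, distortion = 20, 10, 0.2
prototype_A = rng.choice([-1, 1], n_features)
prototype_B = rng.choice([-1, 1], n_features)

def distort(prototype):
    flip = rng.random(n_features) < distortion
    return np.where(flip, -prototype, prototype)

memory = np.array([distort(prototype_A) for _ in range(n_exemplars)] +
                  [distort(prototype_B) for _ in range(n_exemplars)])

def echo(probe, memory):
    """Activate every trace by the cube of its similarity to the probe, then
    return the activation-weighted sum of all traces (the retrieved echo)."""
    similarity = memory @ probe / n_features   # normalised dot product
    activation = similarity ** 3               # nonlinear activation preserves sign
    return activation @ memory

# Probing with the never-stored prototype retrieves an echo that matches it
# closely: the "prototype" emerges at retrieval, not from anything stored.
e = echo(prototype_A, memory)
print(np.corrcoef(e, prototype_A)[0, 1])
print(np.corrcoef(e, prototype_B)[0, 1])
```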
Hintzman's (1984, 1986) demonstration is well covered in most contemporary memory textbooks – it is an elegant existence proof that phenomena used to argue for the existence of semantic memory may actually be due to the process of retrieval from episodic memory. In the interest of parsimony, there may be no need to posit an additional semantic store when a model that has only an episodic store can produce all the phenomena that a model with two distinct stores could. This claim bears considerable similarity to other instance-based models of memory and exemplar-based models of categorisation. So why mention historical cases like MINERVA 2 and schema abstraction here? Because one of the first successful exemplar-based DSMs in cognitive science extends MINERVA 2's architecture exactly to a text corpus, and makes the same theoretical claims.

Exemplar-based semantic models

While it is true that most DSMs are prototype models, there is a small family of exemplar-based semantic models that diverge from the usual quest for cognitive economy. Exemplar-based semantic models are also referred to as "retrieval-based" models in the cognitive literature or simply as "memory models" in computational linguistics. Rather than positing abstraction as a dimensional reduction mechanism at learning, they store all of a word's episodic contexts, and abstraction is a consequence of retrieval from episodic memory. Hence, there is no semantic memory per se in these models, only episodic memory. In exemplar-based models, phenomena that have typically been attributed to semantic memory are an emergent artifact of retrieval from episodic memory. The locus of semantics is not at encoding, but at retrieval. These models have grown from exemplar-based models in categorisation, and instance-based models in memory. Intuitively, many people find the idea that we store everything we ever experience, rather than creating and storing an economical abstraction, to be far-fetched. But given the success of exemplar-based semantic models at accounting for an impressive array of semantic behaviours without any semantic memory, and the current resurgence of usage-based theories in linguistics (Goldberg, 2006; Johns & Jones, 2015; Tomasello, 2003), exemplar-based semantic models deserve a closer look.

Kwantes (2005) extended Hintzman's (1986) MINERVA 2 to explain semantic phenomena with words by training it on a text corpus. In his Constructed Semantics Model (CSM),4 each word's representation in memory is a binary vector that reflects whether or not it occurred in a document – its episodic history. Note that memory in CSM is the same word-by-document matrix that LSA and other DSMs learn from. But where LSA applies abstraction to this episodic matrix and stores a higher-order
representation, CSM stores the episodic matrix itself. When a word is presented to CSM, its episodic vector is used as a probe, as in MINERVA 2. Each word in memory is activated relative to its contextual overlap with the probe word, and the echo pattern is then the similarity-weighted sum of all traces in memory, exactly as in Hintzman (1986). Words that have similar contextual histories to the probe word will contribute more of their pattern to the echo than will words with rather independent histories, and the echo for the word is then an ad hoc, probe-specific prototype created by this process of retrieval from episodic memory. To compute the semantic similarity between two words, one simply computes the cosine between their two echo patterns.

It is fairly obvious how CSM can determine similarity for words that frequently co-occur with each other (their echo cosines would be a noisy amplification of the terms' likelihood of co-occurrence relative to chance). But higher-order semantic similarities also emerge from this process of retrieval, even between two words that have zero contextual overlap. For example, two synonyms would not activate each other at all because they have never co-occurred in the text corpus, but they would activate many of the same other words due to their similar contextual usage; as a result, their retrieved echo patterns are extremely similar. Models such as LSA and word2vec accomplish this second-order statistical inference while learning a corpus, whereas CSM does it while retrieving information from episodic memory.
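The sketch below illustrates this retrieval-based route to second-order similarity with a toy word-by-document matrix. The miniature corpus is invented, and the cubic activation rule is borrowed from MINERVA 2 as a simplifying assumption; it is not Kwantes' exact implementation.

```python
import numpy as np

# Toy word-by-document occurrence matrix: rows are words, columns are documents
# (1 = the word occurred in that document). The corpus is invented for illustration.
words = ["drink", "milk", "juice", "engine", "fuel"]
M = np.array([
    [1, 1, 0, 0, 0],   # drink
    [1, 0, 0, 0, 0],   # milk
    [0, 1, 0, 0, 0],   # juice
    [0, 0, 1, 1, 0],   # engine
    [0, 0, 1, 0, 1],   # fuel
], dtype=float)

def echo(probe_word):
    """Retrieval-based abstraction: activate every stored trace by a power of its
    contextual overlap with the probe, then sum the activation-weighted traces."""
    probe = M[words.index(probe_word)]
    overlap = M @ probe / M.shape[1]     # contextual overlap with each stored trace
    activation = overlap ** 3            # nonlinear activation, as in MINERVA 2
    return activation @ M

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# milk and juice never occur in the same document, but both occur with drink,
# so their retrieved echoes are similar: the abstraction happens at retrieval.
print(cosine(echo("milk"), echo("juice")))
print(cosine(echo("milk"), echo("engine")))
```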
Hence, semantic abstraction in CSM is a parallel to schema abstraction in MINERVA 2: the prototype is an emergent property of retrieval from episodic memory. As Kwantes (2005) puts it, CSM "…takes what it knows about a word's contexts and uses retrieval to estimate what other context might also contain the word" (p. 706). The model bears obvious similarity in outcome to prototype-based DSMs, but it differs considerably in the psychological mechanism to which it attributes abstraction; in CSM, it is the well-established process of retrieval that uncovers deeper semantic structure.

As with Hintzman's (1986) demonstration, CSM is a more parsimonious model of semantics – it does not require two separate stores or processes to explain semantic and episodic memory, and serves as an existence proof that semantic phenomena may be explained by a model that only posits an episodic memory store. In addition, the success of the model is reinforced by converging evidence supporting exemplar-based models in the fields of categorisation and recognition. Furthermore, there are real benefits to CSM that allow it to handle phenomena not possible by abstractionist DSMs. For example, it can handle polysemous words because the multiple senses of the words are still represented and are dissociable with nonlinear activation of exemplars (cf. Nosofsky, 1986). Memory traces whose context fits one or the other sense of a word can be differentially activated in CSM. Abstractionist DSMs, on the other hand, collapse multiple senses of a word to a single point in high-dimensional space, losing the distinction in favour of an averaged representation.5

Kwantes' (2005) work suggests that the same basic memory system could underlie both episodic and semantic knowledge, and his work has given rise to a handful of other models that have explored semantic abstraction as a memory retrieval operation rather than a learning mechanism. For example, Dennis (2005; see also Thiessen, 2017) presented a memory-based model of verbal processing, including semantics and syntactic information, as retrieval from long-term memory and constraint satisfaction in working memory. The model mechanisms are based on a Bayesian interpretation of string edit theory from linguistics. Dennis' model posits that processing a word or sentence is at its core a memory-retrieval process.

Johns and Jones (2014, 2015; see also Thiessen & Pavlik, 2013) extended this previous work into an exemplar-based model, based on a hybrid of Hintzman's (1986) MINERVA 2 and Jones and Mewhort's (2007) BEAGLE architectures, that encodes sentences from a natural language text corpus into individual memory exemplars. The retrieval mechanism is used to generate expectancies about the future structure of sentences, much in the same way as Kwantes (2005) constructs a word's meaning as a prediction of the future contexts in which it might occur. Johns and Jones found that such an exemplar-based model successfully accounted for a wide range of sentence processing tasks that had commonly been seen as evidence for rule-based abstraction of linguistic constraints. Johns, Jamieson, Crump, Jones, and Mewhort (2016) extended this model to demonstrate that rule-based grammatical behaviour is a natural emergent property of retrieval from a model that stores exemplars of linguistic experience. Hence, both semantics and syntax may very well be constructed properties of retrieval from episodic memory rather than abstracted structures or rules, per se.6

Exemplar-based models in natural language processing

It is tempting to think of exemplar-based models as a psychology-centric theory with little, if any, practical significance. After all, why would a computing scientist
want to store all data instances? Data compression and abstraction are core goals of information retrieval applications. However, exemplar models are now seeing considerable use in natural language processing (NLP) as well, for the very same reasons that they are preferred in categorisation: the affordance of nonlinear activation of memory exemplars given a probe.

A classic example in NLP was presented by Daelemans, Van Den Bosch, and Zavrel (1999), showing that abstractionist models lose exceptions to common patterns in a variety of language processing tasks. Prototype models offer the best single representation, but the distribution of meanings and usage rules is heavily skewed in natural languages – prototype models discard the tail (cf. Johns & Jones, 2010 in lexical semantics). Daelemans et al. found that retaining exceptional training instances in memory was actually beneficial for generalisation accuracy across a wide range of common NLP tagging tasks, and they argue that the field needs to take exemplar-based memory models much more seriously.

More recently, exemplar-based models have seen a resurgence in NLP, offering better accuracy on applied problems that the field had been deadlocked on with abstractionist models. For example, Erk and Padó (2010) used an exemplar-based memory model in which separate exemplars were encoded for words and sentences (cf. Dennis, 2005; Johns & Jones, 2015). They found superior performance on a practical paraphrase task using this architecture due to the nonlinear activation of related exemplars – this behaviour allowed the model to "ignore" the exemplars that were other senses of a target word, which would have been a collapsed noise source in an abstractionist model. In fact, their exemplar model outperformed all then state-of-the-art paraphrasing models, and has considerable similarity to exemplar-based memory models in cognitive science (e.g. Thiessen & Pavlik, 2013).

However, an issue with the application of exemplar-based models to applied NLP tasks will always be processing time. Exemplar-based models are fast to train, but require substantial and even computationally impractical memory resources, and are slow to retrieve the correct answer. In contrast, prototype models embody data compression, putting all the time into training the single best representation, but then the search time for a similar instance in memory is much more efficient. In applied problems, such as information retrieval, access time is everything. However, there have been many successful hybrid models emerging in NLP that balance accuracy with generalisation and speed. Multiple prototype models (e.g. Reisinger & Mooney, 2010) have become popular to represent the distinct senses of a word without needing to store all exemplars, and are quite similar to multiple prototype theories of categorisation (Minda & Smith, 2001). Similarly, there has been considerable success in NLP with models that represent words as regions, rather than points, in distributional space (Erk, 2009; Vilnis & McCallum, 2015). These models preserve nonlinear activation of exemplars while embedding them in a more reasonable search space with attractor basins. The practice has a similar outcome to setting a threshold on the similarity function in exemplar-based psychological models to reduce the activation of irrelevant items (which is precisely what Kwantes', 2005, model does). This also suggests considerable potential for the application of hybrid rule-and-exception models from human category learning (e.g. Nosofsky, Palmeri, & McKinley, 1994).

Also of interest in practical NLP applications is the recent rise of so-called memory networks (Weston, Chopra, & Bordes, 2015), which have proven very successful at open question answering with complex real-world text materials. Memory networks use a long-term exemplar memory network as a dynamic knowledge base, and have produced state-of-the-art results with difficult tasks such as question answering, summarisation, and text-based inference (Bordes, Usunier, Chopra, & Weston, 2015).

Discussion

Meaning is a fundamental human attribute that permeates all cognition, from low-level perceptual processing to high-level problem solving, and everything in between. Semantic abstraction is what makes us a powerful species – informavores. The idea that humans construct and store abstracted semantic representations for concepts is almost sacred in cognitive science. But it is also at odds with conclusions from other areas of cognition, such as categorisation and recognition, which presumably tap aspects of the same cognitive mechanism as semantic learning. And exemplar-based DSMs suggest that, like Hintzman's (1986) demonstration, we might be able to explain all the same semantic phenomena without a semantic memory. According to exemplar-based DSMs, semantic memory is a process, not a structure.

It is tempting to see exemplar-based DSMs as "cheating": If the model simply stores all data, then it can compute an accurate semantic representation whenever one is needed. But the theoretical claim is profound – it is a frightening proposal that we may not actually have semantic memory. Your interpretation of the words you are reading right now may be constructed on the fly as an artifact of retrieving the visual patterns from episodic memory. Our phenomenology of meaning may be
continuously constructed as the interaction between stimuli, episodic memory, and the memory retrieval mechanism that mediates them (Kintsch & Mangalath, 2011). But exemplar-based DSMs should also put us at ease – they provide converging evidence that performance across multiple cognitive domains (e.g. categorisation, recognition, semantics) may be explained by the same unified cognitive principle. Exploring exemplar-based DSMs also has practical implications for education, where exemplar-based models of categorisation have been successfully applied (Norman, Young, & Brooks, 2007; Nosofsky, 2017). And since humans are both the producers and consumers of linguistic information in all practical NLP tasks, it is also reassuring that recent findings in NLP may suggest that the best models to serve humans in these tasks bear considerable similarity to the models we believe human cognition has evolved to use.

How might exemplar-based semantic models be implemented in neural hardware? A common criticism of exemplar theory is that it is stranded at the computational level of Marr's hierarchy, and that the transition to an implementational account is untenable. The claim that we simply store everything that we experience seems unintuitive and goes against the core principle of cognitive economy. However, Hintzman (1990) has shown how an exemplar-based memory model such as MINERVA 2 can be easily implemented within a neural network framework. In addition, there is a small body of work attempting to understand and formalise biologically plausible exemplar theories of recognition and categorisation, which has typically pointed to a role for the hippocampus and surrounding medial temporal lobe structures (e.g. Pickering, 1997). Furthermore, Becker's (2005) models cleanly demonstrate that hippocampal coding would give rise to distinct memory representations for highly similar items.

More recent work in categorisation is now focusing on the basal ganglia and striatum as giving rise to the operations needed for exemplar models (see Ashby & Rosedahl, 2017, for a review). Ashby and Rosedahl recently introduced a neural implementation of exemplar theory, in which a key role for the formation of category exemplars is assigned to synaptic plasticity at cortical-striatal synapses. Rather than storing strict exemplars, per se, their model adds nodes and manipulates connectivity between striatal and sensory neurons, achieving the same effect as classic exemplar models. Ashby and Rosedahl show that their neural implementation of exemplar theory is mathematically equivalent to classic exemplar theories such as the General Context Model (Nosofsky, 1986), and makes identical predictions. The work of Ashby and Rosedahl establishes an important equivalence between classic exemplar-based models and neural exemplar theories. Not only do exemplar theories provide superior quantitative fits, but increasing biological plausibility also extends predictions to findings from cognitive neuroscience (e.g. Ashby & Valentin, 2016; Hélie, Paul, & Ashby, 2012; Valentin, Maddox, & Ashby, 2014).

The predominance of prototype-based DSMs in the literature may be partially due to Chomskian presumptions in linguistics that the job of the cognitive mechanism is to abstract the rules of a grammar from instances. This abstractionist presumption may have implicitly guided architectural decisions in early DSMs. In addition, the notion of cognitive economy (Rosch & Mervis, 1975) was a guiding principle for models of semantic abstraction. However, both of these theoretical presumptions are currently being revisited given the strength of usage-based theories in linguistics (Tomasello, 2003). But a more likely reason for the preference of prototype- over exemplar-based DSMs in practice is that exemplar models are much more computationally expensive than prototype models. The front-end data compression at the core of prototype DSMs means that they require far less memory to store, and are far more efficient to use, than exemplar models.

Development of DSMs in general has benefited from cross-disciplinary interactions with applied fields, such as information retrieval. Put simply, these models are both theoretically informative to cognitive science and useful for practical NLP tasks. However, the utility of the model should not constrain its theoretical informativeness. Prototype DSMs are needed because they provide an efficient and economical estimate of a word's aggregate meaning. Exemplar-based DSMs contain more information, but the retrieval problem becomes intractable and untenable for practical tasks. Nobody wants to type a search query into Google and have it determine what you mean by activating and weighting exemplars in real time; the time-intensive computation should have been completed and stored long before you type in the query words.

But the constraint of utility in NLP may have had the unwanted effect of guiding our theoretical models of the mind away from exemplar models. There are many differences between the brain and computational databases in how they represent and retrieve information. The search and abstraction processes used in cognition need not be identical to those best for database search. Models of cognition have long assumed that memory exemplars can be activated in parallel, although the code we use to implement this in a model will usually use a loop routine. This is a distinct difference between the two disciplines: Looping through all exemplars is
not an efficient method of, for example, word similarity matching, but it may well be the correct model of how humans do it. The constraints of our current computational hardware should not be used as reasons to discard otherwise superior-fitting models of human cognition, such as exemplar models.

Notes

1. LSA and word2vec are formally equivalent: Levy and Goldberg (2014) demonstrated analytically how the SGNS architecture of word2vec is implicitly factorizing a word-by-context matrix whose cell values are shifted PMI values.
2. The model's direction can also be inverted, using the word to predict the context (SGNS) rather than using the context to predict the word (CBOW).
3. The bulk of the evidence used by Tulving to argue for distinct semantic and episodic memory systems was from neuropsychological patients.
4. The model is simply referred to as the "semantics model" in Kwantes' (2005) original paper, but "Constructed Semantics Model" has become its popular name among semantic modelers because semantic representations are constructed on the fly from episodic memory in the model.
5. An exception here is the topic model, which uses conditional probabilities, so it is not subject to metric restrictions of spatial models (e.g., Griffiths, Steyvers, & Tenenbaum, 2007).
6. And essentially the same architecture has been used by Goldinger (1998) to explain "abstract" qualities of spoken word representation from episodic memory retrieval.

Acknowledgements

This work was supported by NSF BCS-1056744 and IES R305A150546. I would like to thank Randy Jamieson, Melody Dye, and Brendan Johns for helpful input and discussions.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This work was supported by National Science Foundation [grant number BCS-1056744] and Institute of Education Sciences [grant number R305A150546].

References

Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar, and decision bound models of categorization. Journal of Mathematical Psychology, 37(3), 372–400.
Ashby, F. G., & Rosedahl, L. (2017). A neural implementation of exemplar theory. Psychological Review, 124(4), 472–482.
Ashby, F. G., & Valentin, V. V. (2016). Multiple systems of perceptual category learning: Theory and cognitive tests. In H. Cohen & C. Lefebvre (Eds.), Handbook of categorization in cognitive science, second edition (in press). New York: Elsevier.
Barsalou, L. W. (1999). Perceptions of perceptual symbols. Behavioral and Brain Sciences, 22(4), 637–660.
Becker, S. (2005). A computational principle for hippocampal learning and neurogenesis. Hippocampus, 15(6), 722–738.
Bordes, A., Usunier, N., Chopra, S., & Weston, J. (2015). Large-scale simple question answering with memory networks (arXiv preprint arXiv:1506.02075).
Daelemans, W., Van Den Bosch, A., & Zavrel, J. (1999). Forgetting exceptions is harmful in language learning. Machine Learning, 34(1–3), 11–41.
Damasio, A. R. (1989). Time-locked multiregional retroactivation: A systems-level proposal for the neural substrates of recall and recognition. Cognition, 33(1), 25–62.
Dennis, S. (2005). A memory-based theory of verbal cognition. Cognitive Science, 29, 145–193.
Erk, K. (2009). Representing words as regions in vector space. Proceedings of the thirteenth conference on computational natural language learning (pp. 57–65). Association for Computational Linguistics.
Erk, K., & Padó, S. (2010). Exemplar-based models for word meaning in context. Proceedings of the ACL 2010 conference short papers (pp. 92–97). Association for Computational Linguistics.
Firth, J. R. (1957). A synopsis of linguistic theory (pp. 1930–1955). Oxford: Blackwell.
Goldberg, A. E. (2006). Constructions at work: The nature of generalization in language. New York: Oxford University Press.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251–279.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211.
Hélie, S., Paul, E. J., & Ashby, F. G. (2012). A neurocomputational account of cognitive deficits in Parkinson's disease. Neuropsychologia, 50(9), 2290–2302.
Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16(2), 96–101.
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Hintzman, D. L. (1990). Human learning and memory: Connections and dissociations. Annual Review of Psychology, 41(1), 109–139.
Johns, B. T., Jamieson, R. K., Crump, M. J. C., Jones, M. N., & Mewhort, D. J. K. (2016). The combinatorial power of experience. Proceedings of the 37th meeting of the cognitive science society.
Johns, B. T., & Jones, M. N. (2010). Evaluating the random representation assumption of lexical semantics in cognitive models. Psychonomic Bulletin and Review, 17, 662–672.
Johns, B. T., & Jones, M. N. (2014). Generating structure from experience: The role of memory in language. Proceedings of the 35th annual conference of the cognitive science society. Austin, TX: Cognitive Science Society.
Johns, B. T., & Jones, M. N. (2015). Generating structure from experience: A retrieval-based model of language processing. Canadian Journal of Experimental Psychology, 69, 233–251.
Jones, M. N., Dye, M., & Johns, B. T. (2016). Context as an organizing principle of the lexicon. In B. Ross (Ed.), The psychology of learning and motivation (Vol. 67, pp. 239–283). https://doi.org/10.1016/bs.plm.2017.03.008
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1), 1–37.
Jones, M. N., Willits, J. A., & Dennis, S. (2015). Models of semantic memory. In J. R. Busemeyer & J. T. Townsend (Eds.), Oxford handbook of mathematical and computational psychology (pp. 232–254). New York: Oxford University Press.
Kintsch, W. (2001). Predication. Cognitive Science, 25(2), 173–202.
Kintsch, W., & Mangalath, P. (2011). The construction of meaning. Topics in Cognitive Science, 3(2), 346–370.
Kruschke, J. K. (2008). Models of categorization. In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp. 267–301). New York: Cambridge University Press.
Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin & Review, 12, 703–710.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, & C. Cortes (Eds.), Advances in neural information processing systems (pp. 2177–2185). Cambridge, MA: MIT Press.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85(3), 207–238.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Minda, J. P., & Smith, J. D. (2001). Prototypes in category learning: The effects of category size, category structure, and stimulus complexity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(3), 775.
Murphy, G. L. (2016). Is there an exemplar theory of concepts? Psychonomic Bulletin & Review, 23(4), 1035–1042.
Norman, G., Young, M., & Brooks, L. (2007). Non-analytical models of clinical reasoning: The role of experience. Medical Education, 41, 1140–1145.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(4), 700–708.
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101(1), 53–79.
Nosofsky, R. M., Sanders, C. A., & McDaniel, M. A. (2017). Tests of an exemplar-memory model of classification learning in a high-dimensional natural-science category domain. Journal of Experimental Psychology: General. http://psycnet.apa.org/doi/10.1037/xge0000369
Pickering, A. D. (1997). New approaches to the study of amnesic patients: What can a neurofunctional philosophy and neural network methods offer? Memory, 5(1–2), 255–300.
Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77(3p1), 353–363.
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3(3), 382–407.
Reisinger, J., & Mooney, R. J. (2010). Multi-prototype vector-space models of word meaning. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics, Los Angeles, June 2010.
Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition: A parallel distributed processing approach. Cambridge, MA: MIT Press.
Rogers, T. T., & McClelland, J. L. (2011). Semantics without categorization. In E. M. Pothos & A. J. Wills (Eds.), Formal approaches in categorization (pp. 88–119). New York: Cambridge University Press.
Rosch, E. H. (1973). Natural categories. Cognitive Psychology, 4(3), 328–350.
Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7(4), 573–605.
Stanton, R. D., Nosofsky, R. M., & Zaki, S. R. (2002). Comparisons between exemplar similarity and mixed prototype models using a linearly separable category structure. Memory & Cognition, 30(6), 934–944.
Thiessen, E. D. (2017). What's statistical about learning? Insights from modelling statistical learning as a set of memory processes. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711), 20160056.
Thiessen, E. D., & Pavlik, P. I. (2013). iMinerva: A mathematical model of distributional statistical learning. Cognitive Science, 37(2), 310–343.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.
Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of memory (pp. 381–403). New York, NY: Academic Press.
Valentin, V. V., Maddox, W. T., & Ashby, F. G. (2014). A computational model of the temporal dynamics of plasticity in procedural learning: Sensitivity to feedback timing. Frontiers in Psychology, 5, 1–9.
Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. arXiv preprint arXiv:1412.6623.
Weston, J., Chopra, S., & Bordes, A. (2015). Memory networks. arXiv preprint arXiv:1410.3916.