Searching and Browsing Linked Data with SWSE:
the Semantic Web Search Engine 1,2

Aidan Hogan a, Andreas Harth b, Jürgen Umbrich a, Sheila Kinsella a, Axel Polleres a, Stefan Decker a

a Digital Enterprise Research Institute, National University of Ireland, Galway
b AIFB, Karlsruhe Institute of Technology, Germany
Abstract
In this report, we discuss the architecture and implementation of the Semantic Web Search Engine (SWSE). Following traditional search engine architecture, SWSE consists of crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information; unlike traditional search engines, SWSE operates over RDF web data – loosely also known as Linked Data – which implies unique challenges for the system design, architecture, algorithms, implementation and user interface. In particular, many challenges exist in adopting Semantic Web technologies for web data: the unique challenges of the Web – in terms of scale, unreliability, inconsistency and noise – are largely overlooked by the current Semantic Web standards. In this report, we detail the current SWSE system, initially detailing the architecture and later elaborating upon the function, design, implementation and performance of each individual component. In so doing, we also give an insight into how current Semantic Web standards can be tailored, in a best-effort manner, for use on web data. Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions. Later, we also provide candid discussion relating to the difficulties currently faced in bringing such a search engine into the mainstream, and lessons learnt from roughly five years working on the Semantic Web Search Engine project.

Keywords: web search, semantic search, RDF, semantic web, linked data
Email addresses: [email protected] (Aidan Hogan), [email protected] (Andreas Harth), [email protected] (Jürgen Umbrich), [email protected] (Sheila Kinsella), [email protected] (Axel Polleres), [email protected] (Stefan Decker).
1 The work presented in this report has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.
2 Updated: 10 August 2010.
3 Statistics taken from Nielsen MegaView Search for ~11b searches recorded in Aug. 2009: cf. http://searchenginewatch.com/3634991

1. Introduction

Offering a minimalistic and uncluttered user interface, a simple keyword-based user-interaction model, fast response times, and astute prioritisation of results, Google [17] has become the yardstick for web search, servicing ~64.6% of traditional web search queries 3 over billions of web documents. Arguably, Google reaches the imminent limit of
providing the best possible search over the largely HTML data it indexes. However, from the user perspective, the core Google engine (here serving as the archetype for traditional HTML search engines, such as Yahoo, MSN/Bing, AOL, Ask, etc.) is far from the consummate web search solution: Google does not typically produce direct answers to queries, but instead typically recommends a selection of related documents from the Web. Thus, Google is not suitable for complex information gathering tasks requiring aggregation from multiple indexed documents: for such tasks, users must manually aggregate tidbits of pertinent information from various recommended sites, each site presenting information in its own formatting and using its own navigation system.

Google's limitations are predicated on the lack of structure in HTML documents, whose machine interpretability is limited to the use of generic markup tags mainly concerned with document rendering and linking – the real content is contained in prose text which is inherently difficult for machines to interpret. Addressing this inherent problem with HTML web data, the Semantic Web movement provides a stack of technologies for publishing machine-readable data on the Web, the core of the stack being the Resource Description Framework (RDF).

Using URIs to name things – and not just documents – RDF offers a standardised and flexible framework for publishing structured data on the Web such that data can link to, incorporate, extend and re-use other RDF data across the Web, such that heterogeneous data from independent sources can be automatically integrated by software agents, and such that the meaning of data can be well-defined using lightweight ontologies described in RDF using the RDF Schema (RDFS) and Web Ontology Language (OWL) standards.

Thanks largely to the "Linked Open Data" project [14] – which has emphasised more pragmatic aspects of Semantic Web publishing – a rich lode of open RDF data now resides on the Web: this "Web of Data" includes content exported from, for example, Wikipedia, the BBC, the New York Times, Flickr, LastFM, scientific publishing indexes, biomedical information and governmental agencies. This precedent raises an obvious question: assuming large-scale adoption of high-quality RDF publishing on the Web, could a search engine indexing RDF feasibly improve upon current HTML-centric engines? Theoretically at least, such a search engine could offer advanced querying and browsing of structured data, with search results automatically aggregated from multiple documents and rendered directly in a clean and consistent user interface, thus reducing the manual effort required of its users.

Indeed, there has been much research devoted to this topic, with various incarnations of (mostly academic) RDF-centric web search engines emerging – e.g., Swoogle, FalconS, WATSON, Sindice – and in this report, we present the culmination of over five years of research on yet another such engine: the "Semantic Web Search Engine" (SWSE) 4.

4 http://swse.deri.org/

Indeed, the realisation of SWSE has implied two major research challenges: the system must scale to large amounts of data, and must be tolerant to heterogeneous, noisy, and possibly conflicting data collected from a large number of sources. Semantic Web standards and methodologies are not naturally applicable in such an environment; in presenting the design and implementation of SWSE, we show how standard Semantic Web approaches can be tailored to meet these two challenging requirements, often taking cues from traditional information retrieval techniques.

As such, we present the core of a system which we demonstrate to provide scale, and which is distributed over a cluster of commodity hardware. Throughout, we focus on the unique challenges of applying standard Semantic Web techniques and methodologies, and show why the consideration of the source of data is an integral part of creating a system which must be tolerant to web data – in particular, we show how Linked Data principles can be exploited for such purposes. Also, there are many research questions still very much open with respect to the direction of the overall system, as well as improvements to be made in the individual components; we discuss these as they arise, rendering a road-map of past, present and possible future research in the area of web search over RDF data.

More specifically, in this report we:
– present the architecture and modus operandi of our system for offering search and browsing over RDF web data (Section 2);
– present related work in RDF search engines (Section 3);
– detail the design and implementation of the crawling, consolidation, ranking, reasoning, indexing, query processing and user interface components, offering pertinent evaluation and related work throughout (Sections 5-12);
– conclude with discussion of future directions, open research challenges and current limitations of web search over RDF data (Sections 13-14).

2. System Overview

2.1. Application Overview

To put later discussion into context, we now give a brief overview of the lightweight functionality of the SWSE system; please note that although our methods and algorithms are tailored for the specific needs of SWSE, many aspects of their implementation, design and evaluation apply to more general scenarios.

Unlike prevalent document-centric web search engines, SWSE operates over structured data and holds an entity-centric perspective on search: in contrast to returning links to documents containing specified keywords [17], SWSE returns data representations of real-world entities. While current search engines such as Google, Bing and Yahoo return search results in different domain-specific categories (Web, Images, Videos, Shopping, etc.), data on the Semantic Web is flexibly typed and does not need to follow pre-defined categories. Returned objects can represent people, companies, cities, proteins – anything people care to publish data about.

In a manner familiar from traditional web search engines, SWSE allows users to specify keyword queries in an input box and responds with a ranked list of result snippets; however, the results refer to entities, not documents. A user can then click on an entity snippet to derive a detailed description thereof. The descriptions of entities are automatically aggregated from arbitrarily many sources, and users can cross-check the source of particular statements presented; descriptions also include inferred data – data which has not necessarily been published, but has been derived from the existing data through reasoning. Users can subsequently navigate to related entities, as such browsing the Web of Data.

Along these lines, Figure 1 shows a screenshot containing a list of entities returned as a result of the keyword search "bill clinton" – such results pages are familiar from HTML-centric engines, with the addition of result types (e.g., DisbarredAmericanLawyers, AmericanVegitarians, etc.). Results are aggregated from multiple sources. Figure 2 shows a screenshot of the focus (detailed) view of the Bill Clinton entity, with data aggregated from 54 documents spanning six domains (bbc.co.uk, dbpedia.org, freebase.com, nytimes.com, rdfize.com and soton.ac.uk), as well as novel data found through reasoning.

2.2. System Architecture

The high-level system architecture of SWSE loosely follows that of traditional HTML search engines [17]; Figure 3 details the pre-runtime architecture of our system, showing the components involved in achieving a local index of RDF web data amenable for search. Like traditional search engines, SWSE contains components for crawling, ranking and indexing data; however, there are also components specifically designed for handling RDF data, viz.: the consolidation component and the reasoning component. The high-level index building process is as follows (an illustrative code sketch of this pipeline is given at the end of this subsection):
– the crawler accepts a set of seed URIs and retrieves a large set of RDF data from the Web;
– the consolidation component tries to find synonymous (i.e., equivalent) identifiers in the data, and canonicalises the data according to the equivalences found;
– the ranking component performs links-based analysis over the crawled data and derives scores indicating the importance of individual elements in the data (the ranking component also considers URI redirections encountered by the crawler when performing the links-based analysis);
– the reasoning component materialises new data which is implied by the inherent semantics of the input data (the reasoning component also requires URI redirection information to evaluate the trustworthiness of sources of data);
– the indexing component prepares an index which supports the information retrieval tasks required by the user interface.
Subsequently, the query-processing and user-interface components service queries over the index built in the previous steps.

We will detail the design and operation of each of the components in the following sections, but beforehand, we present the distribution framework upon which all of our components are implemented.
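To make the above flow concrete, the following is a minimal sketch of how the pre-runtime pipeline of Figure 3 could be chained together; all interfaces, classes and method names here are illustrative assumptions for exposition and do not correspond to the actual SWSE codebase.

```java
import java.util.List;

// Hypothetical component interfaces: each stage consumes the output of the
// previous one, mirroring the pipeline of Figure 3
// (crawl -> consolidate -> rank -> reason -> index).
interface Crawler { CrawlOutput crawl(List<String> seedUris); }
interface Consolidator { Dataset consolidate(Dataset raw); }
interface Ranker { Ranks rank(Dataset data, Redirects redirects); }
interface Reasoner { Dataset reason(Dataset data, Redirects redirects); }
interface Indexer { Index index(Dataset data, Ranks ranks); }

// Placeholder data holders (assumed types).
class CrawlOutput { Dataset data; Redirects redirects; }
class Dataset { }
class Redirects { }
class Ranks { }
class Index { }

class PreRuntimePipeline {
    // Runs the pre-runtime stages in order and returns the index that the
    // runtime query-processing and user-interface components operate over.
    static Index build(List<String> seeds, Crawler c, Consolidator co,
                       Ranker ra, Reasoner re, Indexer ix) {
        CrawlOutput crawled = c.crawl(seeds);                    // retrieve RDF from the Web
        Dataset consolidated = co.consolidate(crawled.data);     // canonicalise equivalent identifiers
        Ranks ranks = ra.rank(consolidated, crawled.redirects);  // links-based analysis
        Dataset reasoned = re.reason(consolidated, crawled.redirects); // materialise implied data
        return ix.index(reasoned, ranks);                        // build the retrieval index
    }
}
```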
Fig. 1. Results view for keyword query Bill Clinton
Fig. 2. Focus view for entity Bill Clinton
[Figure 3 depicts the system architecture: at pre-runtime, seed URIs feed the Crawl, Consolidate, Rank, Reason and Index components, producing intermediary results (RDF data and redirects, consolidated data, identifier ranks, reasoned data); at runtime, Query Processing operates over the resulting index.]
Fig. 3. System Architecture
2.3. Distribution Abstraction

In order to scale, we deploy each of our components over a distributed framework which we now briefly describe; Figure 4 illustrates the distributed operations possible in our framework. The framework is based on a shared-nothing architecture [108] and consists of one master machine which orchestrates the given tasks, and several slave machines which perform parts of the task in parallel.

The master machine can instigate the following distributed operations:
– scatter: partition a file into chunks given some local split function, and send the chunks to individual machines – usually only used to initialise a task and seed the slave machines with an initial set of data for processing;
– run: request the parallel execution of a task by the slave machines – such a task either involves processing of some local data (embarrassingly parallel execution), or execution of the coordinate method by the slave swarm;
– gather: gather chunks of output data from the slave swarm and perform some local merge function over the data – this is usually performed to create a single output file for a task, or more usually to gather global knowledge required by all slave machines for a future task;
– flood: broadcast global knowledge required by all slave machines for a future task.

The master machine is intended to disseminate input data to the slave swarm, to provide the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), to gather and locally perform tasks on global knowledge which the slave machines would otherwise have to replicate in parallel, and to transmit globally required knowledge. The master machine can also be used to compute the final result for a given distributed task; however, the end goal of our distributed framework is to produce a distributed index over the slave machines, thus this task is never required in our system.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):
– coordinate: local data on each machine is partitioned according to some split function, with the
chunks sent to individual machines in parallel; each machine also gathers the incoming chunks in parallel using some merge function.

The above operation allows slave machines to partition and disseminate intermediary data directly to other slave machines; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all such intermediary data through one machine. In fact, without the coordinate operation, our framework closely resembles the MapReduce framework [28], with scatter corresponding to the Map operation, and gather corresponding to the Reduce operation.

2.4. Instantiation of Distributed Architecture

We instantiate this architecture using the standard Java Remote Method Invocation libraries as a convenient means of development given our Java code-base.
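Given this Java RMI instantiation, the distributed operations of Section 2.3 map naturally onto a remote interface exposed by each slave machine. The following is a minimal, illustrative sketch of what such an interface might look like; the interface and method names are our own assumptions for exposition, not the actual SWSE code.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Illustrative remote interface exposed by each slave machine; the master
// invokes these methods to realise the scatter/run/gather/flood operations,
// while coordinate is carried out slave-to-slave using the same split/merge idea.
public interface SlaveService extends Remote {

    // scatter: receive one chunk of an input file from the master.
    void receiveChunk(String taskId, byte[] chunk) throws RemoteException;

    // run: execute a named task over local data (embarrassingly parallel),
    // or trigger participation in a coordinate step.
    void run(String taskName, List<String> args) throws RemoteException;

    // gather: return locally produced output so the master can merge it.
    byte[] collectOutput(String taskId) throws RemoteException;

    // flood: receive global knowledge required for a future task.
    void broadcast(String key, byte[] globalData) throws RemoteException;

    // coordinate: accept an intermediary chunk sent directly by another slave,
    // to be merged locally according to some merge function.
    void acceptCoordinatedChunk(String taskId, byte[] chunk) throws RemoteException;
}
```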
All of our evaluation is based on nine machines connected by Gigabit ethernet 5, each with uniform specifications, viz.: 2.2GHz Opteron x86-64, 4GB main memory, 160GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4. Please note that much of the evaluation presented in this report assumes that the slave machines have roughly equal specifications in order to ensure that tasks finish in roughly the same time, assuming even data distribution.

5 We observe, e.g., a max FTP transfer rate of 38MB/sec between machines.

We currently do not consider more advanced topics in our architecture – such as load-balancing (with the exception of evenly distributing data), replication, uptime and counteracting hardware failure – and discussion of these is outside of the current scope.

3. Related Work

In this section, we give an overview of related work, firstly detailing distributed architectures for web search, then discussing related systems in the field of "Hidden Web" and "Deep Web", and finally describing current systems that offer search and browsing over RDF web data (for a further survey of the latter, cf. [120]). Please note that we will give further detailed related work in the context of each component throughout the report.

3.1. Distributed Web Search Architectures

Distributed architectures have long been common in traditional information-retrieval-based web search engines, incorporating distributed crawling, ranking, indexing and query-processing components. Although all mainstream search engines are based on distributed architectures, details are not commonly published. Again, one of the most well-known search engine architectures is that previously described for the Google search engine [17]. More recent publications relating to the Google architecture relate to the MapReduce framework previously alluded to [28], and to the underlying BigTable [22] distributed database system.

Similar system architectures have been defined in the literature, including WebBase [65], which includes an incremental crawler, storage manager, indexer and query processor; in particular, the authors focus on hash- and log-based partitioning for storing incrementally-updated vast repositories of web documents. The authors of [86] also describe a system for building a distributed inverted index over a large corpus of web pages, for subsequent analysis and query-processing: they employ an embedded distributed database system.

Much of the work presented herein is loosely inspired by such approaches, and thus constitutes an adaptation of such works for the purposes of search over structured data. Since we consider replication, fault tolerance, incremental indexing, etc., currently out of scope, many of our techniques are more lightweight than those discussed.

3.2. Hidden Web/Deep Web Approaches

So-called "Hidden Web" or "Deep Web" approaches [20] are predicated on the premise that a vast amount of the information available on the Web is veiled behind sites with heavy dynamic content, usually backed by relational databases. Such information is largely impervious to traditional crawling techniques since content is usually generated by means of bespoke flexible queries; thus, traditional search engines can only skim the surface of such information [61]. In fact, such data-rich sources have led to early speculative work on entity-centric search [26].

Approaches to exploit such sources heavily rely on manually constructed, site-specific wrappers to extract structured data from HTML pages [20],
[Figure 4 depicts the distribution methods: scatter, run and flood messages from the master machine (m) to the slave machines (s0...sn), the slave-to-slave split/merge exchange of the coordinate operation, and the gather of outputs back to the master.]
Fig. 4. Distribution Methods Architecture
or to communicate directly with the underlying database of such sites [23]. Some works have also looked into automatically crawling such hidden-web sources by interacting with forms found during traditional crawls [104]; however, this approach is "task-specific" and not appropriate for general crawling.

The Semantic Web may represent a future direction for bringing deep-web information to the surface, leveraging RDF as a common and flexible data model for exporting the content of such databases, leveraging RDFS and OWL as a means of describing the respective schemata, and thus allowing for automatic integration of such data by web search engines. Efforts such as D2R(Q) [13] seem a natural fit for enabling RDF exports of such online databases.

3.3. RDF-centric Search Engines

Early prototypes using the concepts of ontologies and semantics on the Web include Ontobroker [29] and SHOE [62], which can be seen as predecessors to standardisation efforts such as RDFS and OWL, describing how data on the Web can be given in structured form, and subsequently crawled, stored, inferenced and queried over.

Swoogle 6 offers search over RDF documents by means of an inverted keyword index and a relational database [35]. Swoogle calculates metrics that allow ontology designers to check the popularity of certain properties and classes. In contrast to SWSE, which is mainly concerned with entity search over instance data, Swoogle is mainly concerned with more traditional document search over ontologies.

6 http://swoogle.umbc.edu/

WATSON 7 provides a similar effort to provide keyword search facilities over Semantic Web documents, but additionally provides search over entities [106]. However, they do not include components for consolidation or reasoning, and seemingly instead focus on providing APIs to external services.

7 http://watson.kmi.open.ac.uk/WatsonWUI/

Sindice 8 is a registry and lookup service for RDF files based on Lucene and a MapReduce framework [95]. Sindice originally focussed on providing an API for finding documents which reference a given RDF entity or given keywords – again, document-centric search. More recently however, Sindice has begun to offer entity search in the form of Sig.Ma 9 [113]. However, Sig.Ma maintains a one-to-one relationship between keyword search and results, representing a very different user-interaction model to that presented herein.

8 http://sindice.com/
9 http://sig.ma

The Falcons Search engine 10 offers entity-centric searching for entities (and concepts) over RDF data [25]. They map certain keyword phrases to query relations between entities, and also use class hierarchies to quickly restrict initial results. Conceptually, this search engine most closely resembles our approach.

10 http://iws.seu.edu.cn/services/falcons/

Other systems focus on exploiting RDF for the purposes of domain-specific querying; for example, the recent GoWeb system 11 demonstrates the
benefit of searching structured data for the biomedical domain [33].

11 http://gopubmed.org/goweb/

4. Preliminaries

Before we continue, we briefly introduce some standard core notation used throughout the report – relating to RDF terms (constants), triples and quadruples – and also discuss Linked Data principles. Note that in this report, we will generally use bold-face to refer to infinite sets: e.g., G refers to the set of all triples; we will use calligraphic font to denote a subset thereof: e.g., G is a particular set of triples, where G ⊆ G.

4.1. Resource Description Framework

The Resource Description Framework provides a structured means of publishing information describing entities through use of RDF terms and RDF triples, and constitutes the core data model for our search engine. In particular, RDF allows for optionally defining names for entities using URIs and allows for subsequent re-use of URIs across the Web; using triples, RDF allows entities to be grouped into named classes, allows named relations to be defined between entities, and allows for defining named attributes of entities using string (literal) values. We now briefly give some necessary notation.

RDF Constant. Given a set of URI references U, a set of blank nodes B, and a set of literals L, the set of RDF constants is denoted by C = U ∪ B ∪ L. The set of blank nodes B is a set of existentially quantified variables. The set of literals is given as L = L_p ∪ L_d, where L_p is the set of plain literals and L_d is the set of typed literals. A typed literal is the pair l = (s, d), where s is the lexical form of the literal and d ∈ U is a datatype URI. The sets U, B, L_p and L_d are pairwise disjoint.

Please note that in this report, we treat blank nodes as their skolem versions: i.e., not as existential variables, but as denoting their own syntactic form. We also ensure correct merging of RDF graphs [60] by using blank-node labels unique for a given source. For URIs, we use namespace prefixes in this report as common in the literature – the full URIs can be retrieved from the convenient http://prefix.cc service. For space reasons, we sometimes denote owl: as the default namespace.

RDF Triple. A triple t = (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple. In a triple (s, p, o), s is called the subject, p the predicate, and o the object.

RDF Graph. We call a finite set of triples an RDF graph G ⊆ G, where G = (U ∪ B) × U × (U ∪ B ∪ L).

RDF Entity. We refer to the referent of a URI or blank node as an RDF entity, or commonly just entity.

4.2. Linked Data, Data Sources, Quadruples, and Dereferencing

In order to cope with the unique challenges of handling diverse and unverified web data, many of our components and algorithms require inclusion of a notion of provenance: consideration of the source of RDF data found on the Web. Tightly related to such notions are the best practices of Linked Data [9], which give clear guidelines for publishing RDF on the Web, viz.: (LDP1) use URIs to name things; (LDP2) use HTTP URIs so that those names can be looked up; (LDP3) provide useful structured information when a look-up on a URI is made – loosely called dereferencing; and (LDP4) include links using external URIs. In particular, within SWSE, these best practices form the backbone of various algorithms designed to interact with and be tolerant to web data.

We must thus extend RDF triples with context to denote the source thereof [50,54]. We also define some relations between the identifier for a data source and the graph it contains, including a function to represent HTTP redirects prevalently used in Linked Data for LDP3 [9].

Data Source. We define the http-download function http: U → G as the mapping from a URI to an RDF graph it may provide by means of a given HTTP lookup [44] which directly returns status code 200 OK and data in a suitable RDF format. We define the set of data sources S ⊆ U as the set of URIs S = {s ∈ U | http(s) ≠ ∅}. We define the reference function refs: C → P(S) as the mapping from an RDF term to the set of data sources that mention it. 12

12 P(S) refers to the powerset of S.

RDF Triple in Context / RDF Quadruple. A pair (t, c) with a triple t = (s, p, o), c ∈ S and t ∈ http(c) is called a triple in context c. We may also refer to (s, p, o, c) as an RDF quadruple or quad q with context c.
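As a concrete (if simplified) illustration of the notation above, the following sketch shows one possible in-memory representation of RDF constants, triples and quadruples; the class names and fields are assumptions made for exposition and do not correspond to SWSE's actual data structures.

```java
import java.util.Objects;

// Illustrative, simplified representation of the notation above: an RDF
// constant is a URI, a blank node or a literal; a triple is (s, p, o) with
// p a URI; a quad additionally carries the context (data source) c.
abstract class RdfConstant { }

final class Uri extends RdfConstant {
    final String value;
    Uri(String value) { this.value = value; }
}

final class BlankNode extends RdfConstant {
    // Treated as a skolem constant: the label (unique per source) denotes itself.
    final String label;
    BlankNode(String label) { this.label = label; }
}

final class Literal extends RdfConstant {
    final String lexicalForm;
    final Uri datatype; // null for plain literals
    Literal(String lexicalForm, Uri datatype) {
        this.lexicalForm = lexicalForm;
        this.datatype = datatype;
    }
}

final class Triple {
    final RdfConstant subject;  // a URI or blank node
    final Uri predicate;        // always a URI
    final RdfConstant object;   // a URI, blank node or literal
    Triple(RdfConstant s, Uri p, RdfConstant o) {
        this.subject = Objects.requireNonNull(s);
        this.predicate = Objects.requireNonNull(p);
        this.object = Objects.requireNonNull(o);
    }
}

final class Quad {
    final Triple triple;
    final Uri context; // the data source from which the triple was retrieved
    Quad(Triple triple, Uri context) {
        this.triple = triple;
        this.context = context;
    }
}
```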
HTTP Dereferencing. We define dereferencing as the function deref: U → U which maps a given URI to the identifier of the document returned by HTTP lookup operations upon that URI following redirects (for a given finite and non-cyclical path) [44], or which maps a URI to itself in the case of failure. Note that we do not distinguish between the different 30x redirection schemes, and that this function would involve, e.g., stripping the fragment identifier of a URI [11]. Note that all HTTP-level functions {http, refs, deref} are set at the time of the crawl, and are bounded by the knowledge of our crawl: for example, refs will only consider documents accessed by the crawl.
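For illustration, the deref function just defined could be approximated with the standard java.net API roughly as follows; the redirect bound, the use of a HEAD request and the class name are assumptions of this sketch rather than details of the actual SWSE crawler.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Simplified sketch of deref: U -> U. Follows 30x redirects (treated
// uniformly) along a finite, non-cyclical path; returns the input URI
// unchanged if the lookup fails. Illustrative only.
final class Dereferencer {
    private static final int MAX_REDIRECTS = 10; // assumed bound on redirect paths

    static String deref(String uri) {
        try {
            String current = stripFragment(uri);
            for (int i = 0; i < MAX_REDIRECTS; i++) {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(current).openConnection();
                conn.setInstanceFollowRedirects(false);
                conn.setRequestMethod("HEAD");
                conn.setRequestProperty("Accept", "application/rdf+xml");
                int code = conn.getResponseCode();
                if (code >= 300 && code < 400) {
                    String location = conn.getHeaderField("Location");
                    if (location == null) {
                        return current;
                    }
                    // Resolve relative redirects against the current URI.
                    current = stripFragment(
                            new URL(new URL(current), location).toString());
                } else {
                    return current; // non-redirect response: document identifier
                }
            }
            return current; // redirect bound reached
        } catch (Exception e) {
            return uri; // failure: map the URI to itself
        }
    }

    private static String stripFragment(String uri) {
        int hash = uri.indexOf('#');
        return hash >= 0 ? uri.substring(0, hash) : uri;
    }
}
```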
5. Crawling

The first component required for the building of our index is the crawler, whose purpose is to retrieve a large set of RDF documents from the Web. Our crawler starts with a set of seed URIs, retrieves the content of URIs, parses and writes content to disk in the form of quads, and recursively extracts new URIs for crawling; following LDP2 and LDP3, we consider all http: protocol URIs extracted from an RDF document as candidates for crawling.

Like traditional HTML crawlers, we identify the following requirements for crawling:
– Politeness: The crawler must implement politeness restrictions to avoid hammering remote servers with dense HTTP GET requests and to abide by policies identified in the provided robots.txt files 13.
– Throughput: The crawler should crawl as many URIs as possible in as little time as is possible within the bounds of the politeness policies.
– Scale: The crawler should employ scalable techniques, and on-disk indexing as required.
– Quality: The crawler should prioritise crawling URIs it considers to be "high quality".
Thus, the design of our crawler is inspired by related work from traditional HTML crawlers. Additionally – and specific to crawling structured data – we identify the following requirement:
– Structured Data: The crawler should retrieve a high percentage of RDF/XML documents and avoid wasted lookups on unwanted formats: e.g., HTML documents.

Currently, we crawl for RDF/XML syntax documents – RDF/XML is still the most commonly used syntax for publishing RDF on the Web, and we plan in future to extend the crawler to support other formats such as RDFa, N-Triples and Turtle.

The following algorithm details the operation of the crawler, and will be explained in detail throughout this section.

Algorithm 1 Algorithm for crawling
Require: SEEDS, ROUNDS, PLDLIMIT, MINDELAY
 1: frontier ← SEEDS
 2: pld0...n ← new queue
 3: stats ← new stats
 4: while rounds + 1 < ROUNDS do
 5:   put frontier into pld0...n
 6:   while depth + 1 < PLDLIMIT do
 7:     for i = 0 to n do
 8:       prioritise(pld_i, stats)
 9:     end for
10:     start ← current_time()
11:     for i = 0 to n do
12:       cur_i ← calculate_cur(pld_i, stats)
13:       if cur_i > random([0,1]) then
14:         get uri from pld_i
15:         uri_deref ← deref(uri)
16:         if uri_deref = uri then
17:           G ← http(uri)
18:           output G
19:           add new URIs from G to frontier
20:           update stats
21:         else
22:           if uri_deref is unseen then
23:             add uri_deref to frontier
24:           end if
25:         end if
26:       end if
27:     end for
28:     elapsed ← current_time() − start
29:     if elapsed < MINDELAY then
30:       wait(MINDELAY − elapsed)
31:     end if
32:   end while
33: end while

13 http://www.robotstxt.org/orig.html
5.1. Breadth-first Crawling

Traditional web crawlers (cf. [15,64]) typically use a breadth-first crawling strategy: the crawl is conducted in rounds, with each round crawling a frontier. On a high level, Algorithm 1 represents this round-based approach applying ROUNDS number of rounds. The frontier comprises seed URIs for round 0 (Line 1, Algorithm 1), and thereafter novel URIs extracted from documents crawled in the previous round (Line 19, Algorithm 1). Thus, the crawl emulates a breadth-first traversal of interlinked web documents. (Note that the algorithm is further tailored according to requirements we will describe as the section progresses.)

As we will see later in the section, the round-based approach fits well with our distributed framework, allowing crawlers to work independently for each round and coordinating new frontier URIs at the end of each round. Additionally, [91] show that a breadth-first traversal strategy tends to discover high-quality pages early on in the crawl, with the justification that well-linked documents (representing higher-quality documents) are more likely to be encountered in earlier breadth-first rounds; similarly, breadth-first crawling leads to a more diverse dataset earlier on, rather than a depth-first approach which may end up traversing deep paths within a given site. In [81], the authors justify a rounds-based approach to crawling according to observations that writing/reading concurrently and dynamically to a single queue can become the bottleneck in a large-scale crawler.

5.2. Incorporating Politeness

The crawler must be careful not to bite the hands that feed it by hammering the servers of data providers or breaching policies outlined in the provided robots.txt file [111]. We use pay-level domains [81] (PLDs; a.k.a. "root domains"; e.g., bbc.co.uk) to identify individual data providers, and implement politeness on a per-PLD basis. Firstly, when we first encounter a URI for a PLD, we cross-check the robots.txt file to ensure that we are permitted to crawl that site; secondly, we implement a "minimum PLD delay" to avoid hammering servers, viz.: a minimum time-period between subsequent requests to a given PLD. This is given by MINDELAY in Algorithm 1.

In order to accommodate the min-delay policy with minimal effect on performance, we must refine our crawling algorithm: large sites with a large internal branching factor (large numbers of unique intra-PLD outlinks per document) can result in the frontier of each round being dominated by URIs from a small selection of PLDs. Thus, naïve breadth-first crawling can lead to crawlers hammering such sites; conversely, given a politeness policy, a crawler may spend a lot of time idle waiting for the min-delay to pass.

One solution is to reasonably restrict the branching factor [81] – the maximum number of URIs crawled per PLD per round – which ensures that individual PLDs with large internal fan-out are not hammered; thus, in each round of the crawl, we implement a cut-off for URIs per PLD, given by PLDLIMIT in Algorithm 1.

Secondly, to ensure the maximum gap between crawling successive URIs for the same PLD, we implement a per-PLD queue (given by pld0...n in Algorithm 1) whereby each PLD is given a dedicated queue of URIs filled from the frontier, and during the crawl, a URI is polled from each PLD queue in a round-robin fashion. If all of the PLD queues have been polled before the min-delay is satisfied, then the crawler must wait: this is given by Lines 28-31 in Algorithm 1. Thus, the minimum crawl time for a round – assuming a sufficiently full queue – becomes MINDELAY * PLDLIMIT.
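The per-PLD round-robin polling with a minimum delay can be sketched as follows (the min-delay wait corresponds to Lines 28-31 of Algorithm 1); class and method names are illustrative assumptions, and the sketch omits the cur-based probabilistic polling of Algorithm 1.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of per-PLD politeness: each pay-level domain (PLD) has
// its own queue of URIs; queues are polled round-robin, and if a full pass
// over all queues completes before MINDELAY has elapsed, the crawler waits.
// Names are assumptions for exposition, not the actual SWSE crawler code.
final class PerPldScheduler {
    private final long minDelayMillis; // MINDELAY
    private final Map<String, Queue<String>> pldQueues = new LinkedHashMap<>();

    PerPldScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Fill the per-PLD queues from the frontier (PLD extraction simplified here).
    void enqueue(String pld, String uri) {
        pldQueues.computeIfAbsent(pld, k -> new ArrayDeque<>()).add(uri);
    }

    // One polite pass: poll at most one URI per PLD, then enforce the min-delay.
    void pollRound(UriHandler handler) throws InterruptedException {
        long start = System.currentTimeMillis();
        for (Queue<String> queue : pldQueues.values()) {
            String uri = queue.poll();
            if (uri != null) {
                handler.handle(uri); // lookup, parse, extract links, etc.
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed < minDelayMillis) {
            Thread.sleep(minDelayMillis - elapsed); // Lines 28-31 of Algorithm 1
        }
    }

    interface UriHandler { void handle(String uri); }
}
```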
5.3. On-disk Queue

As the crawl continues, the in-memory capacity of the machine will eventually be exceeded by the capacity required for storing URIs [81]. Performing a stress test, we observed that with 2GB of Java heap-space, the crawler could crawl approx. 199k URIs (additionally storing the respective frontier URIs) before throwing an out-of-memory exception. In order to scale beyond the implied main-memory limitations of the crawler, we implement on-disk storage for URIs, with the additional benefit of maintaining a persistent state for the crawl and thus offering a "continuation point" useful for extension of an existing crawl, or recovery from failure.

We implement the on-disk storage of URIs using Berkeley DB, which comprises two indexes: the first provides lookups for URI strings against their status (polled/unpolled); the second offers a key-sorted map which can iterate over unpolled URIs in decreasing order of inlink count. The inlink count reflects the total number of documents from which the URI has been extracted thus far; we deem a higher count to roughly equate to a higher-priority URI.

The crawler utilises both the on-disk index and the in-memory queue to offer similar functionality as above. The on-disk index and in-memory queue are synchronised at the start of each round:
(i) links and respective inlink counts extracted from the previous round (or seed URIs if the first round) are added to the on-disk index;
(ii) URIs polled from the previous round have their status updated on-disk;
(iii) an in-memory PLD queue is filled using an iterator of on-disk URIs sorted by descending inlink count.
Most importantly, the above process ensures that only the URIs active (current PLD queue and frontier URIs) for the current round must be stored in memory. Also, the process ensures that the on-disk index stores the persistent state of the crawler up to the start of the last round; if the crawler unexpectedly dies, the crawl can be resumed from the start of the last round. Finally, the in-memory PLD queue is filled with URIs sorted in order of inlink count, offering a cheap form of intra-PLD URI prioritisation (Line 8, Algorithm 1).
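The round-boundary synchronisation between the on-disk index and the in-memory PLD queues (steps (i)-(iii) above) might be sketched as follows; the OnDiskUriIndex interface stands in for the two Berkeley DB indexes described above, and all names are assumptions for exposition rather than the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of the per-round synchronisation between the on-disk
// URI index (backed by Berkeley DB in SWSE) and the in-memory PLD queues.
interface OnDiskUriIndex {
    void addLink(String uri, int inlinkIncrement);   // step (i): add/update inlink counts
    void markPolled(String uri);                     // step (ii): update polled status
    Iterator<String> unpolledByDescendingInlinks();  // step (iii): key-sorted iteration
}

final class RoundSynchroniser {
    // Returns the in-memory PLD queues for the next round; only these URIs
    // (at most pldLimit per PLD) need to be held in memory during the round.
    static Map<String, Queue<String>> synchroniseRound(
            OnDiskUriIndex index,
            Map<String, Integer> linksExtractedLastRound,
            Iterable<String> urisPolledLastRound,
            int pldLimit) {
        // (i) add links and respective inlink counts from the previous round
        for (Map.Entry<String, Integer> e : linksExtractedLastRound.entrySet()) {
            index.addLink(e.getKey(), e.getValue());
        }
        // (ii) mark URIs polled in the previous round as polled on-disk
        for (String uri : urisPolledLastRound) {
            index.markPolled(uri);
        }
        // (iii) fill per-PLD queues from unpolled URIs, highest inlink count
        // first, respecting the per-PLD cut-off (PLDLIMIT)
        Map<String, Queue<String>> pldQueues = new HashMap<>();
        Iterator<String> it = index.unpolledByDescendingInlinks();
        while (it.hasNext()) {
            String uri = it.next();
            String pld = payLevelDomain(uri);
            Queue<String> q = pldQueues.computeIfAbsent(pld, k -> new ArrayDeque<>());
            if (q.size() < pldLimit) {
                q.add(uri);
            }
        }
        return pldQueues;
    }

    // Crude placeholder for PLD extraction (e.g., "bbc.co.uk"); a real
    // implementation would consult a public-suffix list.
    private static String payLevelDomain(String uri) {
        String host = java.net.URI.create(uri).getHost();
        return host == null ? uri : host;
    }
}
```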
5.4. Multi-threading

The bottleneck for a single-threaded crawler will be the response times of remote servers; the CPU load, I/O throughput and network bandwidth of a crawling machine will not be efficiently exploited by sequential HTTP GET requests over the Web. Thus, crawlers are commonly multi-threaded to mitigate this bottleneck and perform concurrent HTTP lookups. At a certain point of increasing the number of lookup threads operating, the CPU load, I/O load, or network bandwidth becomes an immutable bottleneck; this becomes the optimal number of threads.

In order to find a suitable thread count for our particular setup (with respect to processor/network bandwidth), we conducted some illustrative small-scale experiments comparing a machine crawling with the same setup and input parameters, but with an exponentially increasing number of threads: in particular, we measure the time taken for crawling 1,000 URIs given a seed URI 14 for 1, 2, 4, 8, 16, 32, 64, and 128 threads. Also, to alleviate the possible effects of remote caching on our comparison of increasing thread counts, we pre-crawled all of the URIs before running the benchmark.

14 http://sw.deri.org/~aidanh/foaf/foaf.rdf

Figure 5 displays the time taken in minutes for the crawler to run, whilst Figure 6 shows the average percentage CPU usage (the averages are over readings extracted from the UNIX command ps taken every three seconds during the crawl). Time and CPU% noticeably have an inverse correlation. As the number of threads increases up until 64, the time taken for the crawl decreases – the reduction in time is particularly pronounced in earlier thread increments; similarly, and as expected, the CPU usage increases as a higher density of documents is retrieved and processed. Beyond 64 threads, the effect of increasing threads becomes minimal as the machine reaches the limits of CPU and disk I/O throughput; in fact, the total time taken starts to increase – we suspect that contention between threads for shared resources affects performance. Thus, we settle upon 64 threads as an approximately optimal figure for our setup.

5.5. Crawling RDF/XML

Since our architecture is currently implemented to index RDF/XML, we would feasibly like to maximise the ratio of HTTP lookups which result in RDF/XML content; i.e., given the total HTTP lookups as L and the total number of downloaded RDF/XML pages as R, we would like to maximise the ratio R/L.

In order to reduce the number of HTTP lookups wasted on non-RDF/XML content, we implement the following heuristics:
(i) firstly, we blacklist non-http protocol URIs;
(ii) secondly, we blacklist URIs with common file extensions that are highly unlikely to return RDF/XML (e.g., html, jpg, pdf, etc.), following arguments we previously laid out in [114];
(iii) thirdly, we check the returned HTTP header and only retrieve the content of URIs reporting Content-type: application/rdf+xml; 15
(iv) finally, we use a credible useful ratio when polling PLDs to indicate the probability that

15 Indeed, one advantage RDF/XML has over RDFa is an unambiguous MIME-type useful in such situations.