Searching and Browsing Linked Data with SWSE:
the Semantic Web Search Engine 1,2

Aidan Hogan a, Andreas Harth b, Jürgen Umbrich a, Sheila Kinsella a, Axel Polleres a, Stefan Decker a

a Digital Enterprise Research Institute, National University of Ireland, Galway
b AIFB, Karlsruhe Institute of Technology, Germany
Abstract
In this report, we discuss the architecture and implementation of the Semantic Web Search Engine (SWSE). Following traditional search engine architecture, SWSE consists of crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information; unlike traditional search engines, SWSE operates over RDF web data – loosely also known as Linked Data – which implies unique challenges for the system design, architecture, algorithms, implementation and user interface. In particular, many challenges exist in adopting Semantic Web technologies for web data: the unique challenges of the Web – in terms of scale, unreliability, inconsistency and noise – are largely overlooked by the current Semantic Web standards. In this report, we detail the current SWSE system, initially detailing the architecture and later elaborating upon the function, design, implementation and performance of each individual component. In so doing, we also give an insight into how current Semantic Web standards can be tailored, in a best-effort manner, for use on web data. Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions. Later, we also provide candid discussion relating to the difficulties currently faced in bringing such a search engine into the mainstream, and lessons learnt from roughly five years working on the Semantic Web Search Engine project.

Keywords: web search, semantic search, RDF, semantic web, linked data
Email addresses: [email protected] (Aidan Hogan), [email protected] (Andreas Harth), [email protected] (Jürgen Umbrich), [email protected] (Sheila Kinsella), [email protected] (Axel Polleres), [email protected] (Stefan Decker).
1 The work presented in this report has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.
2 Updated: 10 August 2010.
3 Statistics taken from Nielsen MegaView Search for ~11b searches recorded in Aug. 2009: cf. http://searchenginewatch.com/3634991

1. Introduction

Offering a minimalistic and uncluttered user interface, a simple keyword-based user-interaction model, fast response times, and astute prioritisation of results, Google [17] has become the yardstick for web search, servicing ~64.6% of traditional web search queries 3 over billions of web documents. Arguably, Google reaches the imminent limit of
providing the best possible search over the largely HTML data it indexes. However, from the user perspective, the core Google engine (here serving as the archetype for traditional HTML search engines, such as Yahoo, MSN/Bing, AOL, Ask, etc.) is far from the consummate web search solution: Google does not typically produce direct answers to queries, but instead typically recommends a selection of related documents from the Web. Thus, Google is not suitable for complex information gathering tasks requiring aggregation from multiple indexed documents: for such tasks, users must manually aggregate tidbits of pertinent information from various recommended sites, each site presenting information in its own formatting and using its own navigation system.

Google's limitations are predicated on the lack of structure in HTML documents, whose machine interpretability is limited to the use of generic markup tags mainly concerned with document rendering and linking – the real content is contained in prose text which is inherently difficult for machines to interpret. Addressing this inherent problem with HTML web data, the Semantic Web movement provides a stack of technologies for publishing machine-readable data on the Web, the core of the stack being the Resource Description Framework (RDF).

Using URIs to name things – and not just documents – RDF offers a standardised and flexible framework for publishing structured data on the Web such that data can link to, incorporate, extend and re-use other RDF data across the Web, such that heterogeneous data from independent sources can be automatically integrated by software agents, and such that the meaning of data can be well-defined using lightweight ontologies described in RDF using the RDF Schema (RDFS) and Web Ontology Language (OWL) standards.

Thanks largely to the "Linked Open Data" project [14] – which has emphasised more pragmatic aspects of Semantic Web publishing – a rich lode of open RDF data now resides on the Web: this "Web of Data" includes content exported from, for example, Wikipedia, the BBC, the New York Times, Flickr, LastFM, scientific publishing indexes, biomedical information and governmental agencies. This precedent raises an obvious question: assuming large-scale adoption of high-quality RDF publishing on the Web, could a search engine indexing RDF feasibly improve upon current HTML-centric engines? Theoretically at least, such a search engine could offer advanced querying and browsing of structured data, with search results automatically aggregated from multiple documents and rendered directly in a clean and consistent user interface, thus reducing the manual effort required of its users.

Indeed, there has been much research devoted to this topic, with various incarnations of (mostly academic) RDF-centric web search engines emerging – e.g., Swoogle, FalconS, WATSON, Sindice – and in this report, we present the culmination of over five years of research on yet another such engine: the "Semantic Web Search Engine" (SWSE) 4.

4 http://swse.deri.org/

Indeed, the realisation of SWSE has implied two major research challenges: the system must scale to large amounts of data, and must be tolerant to heterogeneous, noisy, and possibly conflicting data collected from a large number of sources. Semantic Web standards and methodologies are not naturally applicable in such an environment; in presenting the design and implementation of SWSE, we show how standard Semantic Web approaches can be tailored to meet these two challenging requirements, often taking cues from traditional information retrieval techniques.

As such, we present the core of a system which we demonstrate to provide scale, and which is distributed over a cluster of commodity hardware. Throughout, we focus on the unique challenges of applying standard Semantic Web techniques and methodologies, and show why the consideration of the source of data is an integral part of creating a system which must be tolerant to web data – in particular, we show how Linked Data principles can be exploited for such purposes. Also, there are many research questions still very much open with respect to the direction of the overall system, as well as improvements to be made in the individual components; we discuss these as they arise, rendering a road-map of past, present and possible future research in the area of web search over RDF data.

More specifically, in this report we:
– present the architecture and modus operandi of our system for offering search and browsing over RDF web data (Section 2);
– present related work in RDF search engines (Section 3);
– detail the design and implementation of the crawling, consolidation, ranking, reasoning, indexing, query processing and user interface components, offering pertinent evaluation and related work throughout (Sections 5-12);
– conclude with discussion of future directions, open research challenges and current limitations of web search over RDF data (Sections 13-14).

2. System Overview

2.1. Application Overview

To put later discussion into context, we now give a brief overview of the lightweight functionality of the SWSE system; please note that although our methods and algorithms are tailored for the specific needs of SWSE, many aspects of their implementation, design and evaluation apply to more general scenarios.

Unlike prevalent document-centric web search engines, SWSE operates over structured data and holds an entity-centric perspective on search: in contrast to returning links to documents containing specified keywords [17], SWSE returns data representations of real-world entities. While current search engines such as Google, Bing and Yahoo return search results in different domain-specific categories (Web, Images, Videos, Shopping, etc.), data on the Semantic Web is flexibly typed and does not need to follow pre-defined categories. Returned objects can represent people, companies, cities, proteins – anything people care to publish data about.

In a manner familiar from traditional web search engines, SWSE allows users to specify keyword queries in an input box and responds with a ranked list of result snippets; however, the results refer to entities, not documents. A user can then click on an entity snippet to derive a detailed description thereof. The descriptions of entities are automatically aggregated from arbitrarily many sources, and users can cross-check the source of particular statements presented; descriptions also include inferred data – data which has not necessarily been published, but has been derived from the existing data through reasoning. Users can subsequently navigate to related entities, as such browsing the Web of Data.

Along these lines, Figure 1 shows a screenshot containing a list of entities returned as a result of the keyword search "bill clinton" – such results pages are familiar from HTML-centric engines, with the addition of result types (e.g., DisbarredAmericanLawyers, AmericanVegitarians, etc.). Results are aggregated from multiple sources. Figure 2 shows a screenshot of the focus (detailed) view of the Bill Clinton entity, with data aggregated from 54 documents spanning six domains (bbc.co.uk, dbpedia.org, freebase.com, nytimes.com, rdfize.com and soton.ac.uk), as well as novel data found through reasoning.

2.2. System Architecture

The high-level system architecture of SWSE loosely follows that of traditional HTML search engines [17]; Figure 3 details the pre-runtime architecture of our system, showing the components involved in achieving a local index of RDF web data amenable for search. Like traditional search engines, SWSE contains components for crawling, ranking and indexing data; however, there are also components specifically designed for handling RDF data, viz.: the consolidation component and the reasoning component. The high-level index building process is as follows (an illustrative code sketch of this pipeline is given at the end of this subsection):
– the crawler accepts a set of seed URIs and retrieves a large set of RDF data from the Web;
– the consolidation component tries to find synonymous (i.e., equivalent) identifiers in the data, and canonicalises the data according to the equivalences found;
– the ranking component performs links-based analysis over the crawled data and derives scores indicating the importance of individual elements in the data (the ranking component also considers URI redirections encountered by the crawler when performing the links-based analysis);
– the reasoning component materialises new data which is implied by the inherent semantics of the input data (the reasoning component also requires URI redirection information to evaluate the trustworthiness of sources of data);
– the indexing component prepares an index which supports the information retrieval tasks required by the user interface.
Subsequently, the query-processing and user-interface components service queries over the index built in the previous steps.

We will detail the design and operation of each of the components in the following sections, but beforehand, we present the distribution framework upon which all of our components are implemented.
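To make the above flow concrete, the following is a minimal sketch of how the pre-runtime pipeline of Figure 3 could be chained together; all interfaces, classes and method names here are illustrative assumptions for exposition and do not correspond to the actual SWSE codebase.

```java
import java.util.List;

// Hypothetical component interfaces: each stage consumes the output of the
// previous one, mirroring the pipeline of Figure 3
// (crawl -> consolidate -> rank -> reason -> index).
interface Crawler { CrawlOutput crawl(List<String> seedUris); }
interface Consolidator { Dataset consolidate(Dataset raw); }
interface Ranker { Ranks rank(Dataset data, Redirects redirects); }
interface Reasoner { Dataset reason(Dataset data, Redirects redirects); }
interface Indexer { Index index(Dataset data, Ranks ranks); }

// Placeholder data holders (assumed types).
class CrawlOutput { Dataset data; Redirects redirects; }
class Dataset { }
class Redirects { }
class Ranks { }
class Index { }

class PreRuntimePipeline {
    // Runs the pre-runtime stages in order and returns the index that the
    // runtime query-processing and user-interface components operate over.
    static Index build(List<String> seeds, Crawler c, Consolidator co,
                       Ranker ra, Reasoner re, Indexer ix) {
        CrawlOutput crawled = c.crawl(seeds);                    // retrieve RDF from the Web
        Dataset consolidated = co.consolidate(crawled.data);     // canonicalise equivalent identifiers
        Ranks ranks = ra.rank(consolidated, crawled.redirects);  // links-based analysis
        Dataset reasoned = re.reason(consolidated, crawled.redirects); // materialise implied data
        return ix.index(reasoned, ranks);                        // build the retrieval index
    }
}
```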
Fig. 1. Results view for keyword query Bill Clinton
Fig. 2. Focus view for entity Bill Clinton
[Figure 3 depicts the system architecture: at pre-runtime, seed URIs feed the Crawl, Consolidate, Rank, Reason and Index components, producing intermediary results (RDF data and redirects, consolidated data, identifier ranks, reasoned data); at runtime, Query Processing operates over the resulting index.]
Fig. 3. System Architecture
2.3. Distribution Abstraction

In order to scale, we deploy each of our components over a distributed framework which we now briefly describe; Figure 4 illustrates the distributed operations possible in our framework. The framework is based on a shared-nothing architecture [108] and consists of one master machine which orchestrates the given tasks, and several slave machines which perform parts of the task in parallel.

The master machine can instigate the following distributed operations:
– scatter: partition a file into chunks given some local split function, and send the chunks to individual machines – usually only used to initialise a task and seed the slave machines with an initial set of data for processing;
– run: request the parallel execution of a task by the slave machines – such a task either involves processing of some local data (embarrassingly parallel execution), or execution of the coordinate method by the slave swarm;
– gather: gather chunks of output data from the slave swarm and perform some local merge function over the data – this is usually performed to create a single output file for a task, or more usually to gather global knowledge required by all slave machines for a future task;
– flood: broadcast global knowledge required by all slave machines for a future task.

The master machine is intended to disseminate input data to the slave swarm, to provide the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), to gather and locally perform tasks on global knowledge which the slave machines would otherwise have to replicate in parallel, and to transmit globally required knowledge. The master machine can also be used to compute the final result for a given distributed task; however, the end goal of our distributed framework is to produce a distributed index over the slave machines, thus this task is never required in our system.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):
– coordinate: local data on each machine is partitioned according to some split function, with the
chunks sent to individual machines in parallel; each machine also gathers the incoming chunks in parallel using some merge function.

The above operation allows slave machines to partition and disseminate intermediary data directly to other slave machines; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all such intermediary data through one machine. In fact, without the coordinate operation, our framework closely resembles the MapReduce framework [28], with scatter corresponding to the Map operation, and gather corresponding to the Reduce operation.

2.4. Instantiation of Distributed Architecture

We instantiate this architecture using the standard Java Remote Method Invocation libraries as a convenient means of development given our Java code-base.
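Given this Java RMI instantiation, the distributed operations of Section 2.3 map naturally onto a remote interface exposed by each slave machine. The following is a minimal, illustrative sketch of what such an interface might look like; the interface and method names are our own assumptions for exposition, not the actual SWSE code.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Illustrative remote interface exposed by each slave machine; the master
// invokes these methods to realise the scatter/run/gather/flood operations,
// while coordinate is carried out slave-to-slave using the same split/merge idea.
public interface SlaveService extends Remote {

    // scatter: receive one chunk of an input file from the master.
    void receiveChunk(String taskId, byte[] chunk) throws RemoteException;

    // run: execute a named task over local data (embarrassingly parallel),
    // or trigger participation in a coordinate step.
    void run(String taskName, List<String> args) throws RemoteException;

    // gather: return locally produced output so the master can merge it.
    byte[] collectOutput(String taskId) throws RemoteException;

    // flood: receive global knowledge required for a future task.
    void broadcast(String key, byte[] globalData) throws RemoteException;

    // coordinate: accept an intermediary chunk sent directly by another slave,
    // to be merged locally according to some merge function.
    void acceptCoordinatedChunk(String taskId, byte[] chunk) throws RemoteException;
}
```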
All of our evaluation is based on nine machines connected by Gigabit ethernet 5, each with uniform specifications, viz.: 2.2GHz Opteron x86-64, 4GB main memory, 160GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4. Please note that much of the evaluation presented in this report assumes that the slave machines have roughly equal specifications in order to ensure that tasks finish in roughly the same time, assuming even data distribution.

5 We observe, e.g., a max FTP transfer rate of 38MB/sec between machines.

We currently do not consider more advanced topics in our architecture – such as load-balancing (with the exception of evenly distributing data), replication, uptime and counteracting hardware failure – and discussion of these is outside of the current scope.

3. Related Work

In this section, we give an overview of related work, firstly detailing distributed architectures for web search, then discussing related systems in the field of "Hidden Web" and "Deep Web", and finally describing current systems that offer search and browsing over RDF web data (for a further survey of the latter, cf. [120]). Please note that we will give further detailed related work in the context of each component throughout the report.

3.1. Distributed Web Search Architectures

Distributed architectures have long been common in traditional information-retrieval-based web search engines, incorporating distributed crawling, ranking, indexing and query-processing components. Although all mainstream search engines are based on distributed architectures, details are not commonly published. Again, one of the most well-known search engine architectures is that previously described for the Google search engine [17]. More recent publications relating to the Google architecture relate to the MapReduce framework previously alluded to [28], and to the underlying BigTable [22] distributed database system.

Similar system architectures have been defined in the literature, including WebBase [65], which includes an incremental crawler, storage manager, indexer and query processor; in particular, the authors focus on hash- and log-based partitioning for storing incrementally-updated vast repositories of web documents. The authors of [86] also describe a system for building a distributed inverted index over a large corpus of web pages, for subsequent analysis and query-processing: they employ an embedded distributed database system.

Much of the work presented herein is loosely inspired by such approaches, and thus constitutes an adaptation of such works for the purposes of search over structured data. Since we consider replication, fault tolerance, incremental indexing, etc., currently out of scope, many of our techniques are more lightweight than those discussed.

3.2. Hidden Web/Deep Web Approaches

So-called "Hidden Web" or "Deep Web" approaches [20] are predicated on the premise that a vast amount of the information available on the Web is veiled behind sites with heavy dynamic content, usually backed by relational databases. Such information is largely impervious to traditional crawling techniques since content is usually generated by means of bespoke flexible queries; thus, traditional search engines can only skim the surface of such information [61]. In fact, such data-rich sources have led to early speculative work on entity-centric search [26].

Approaches to exploit such sources heavily rely on manually constructed, site-specific wrappers to extract structured data from HTML pages [20],
[Figure 4 depicts the distribution methods: scatter, run and flood messages from the master machine (m) to the slave machines (s0...sn), the slave-to-slave split/merge exchange of the coordinate operation, and the gather of outputs back to the master.]
Fig. 4. Distribution Methods Architecture
or to communicate directly with the underlying database of such sites [23]. Some works have also looked into automatically crawling such hidden-web sources by interacting with forms found during traditional crawls [104]; however, this approach is "task-specific" and not appropriate for general crawling.

The Semantic Web may represent a future direction for bringing deep-web information to the surface, leveraging RDF as a common and flexible data model for exporting the content of such databases, leveraging RDFS and OWL as a means of describing the respective schemata, and thus allowing for automatic integration of such data by web search engines. Efforts such as D2R(Q) [13] seem a natural fit for enabling RDF exports of such online databases.

3.3. RDF-centric Search Engines

Early prototypes using the concepts of ontologies and semantics on the Web include Ontobroker [29] and SHOE [62], which can be seen as predecessors to standardisation efforts such as RDFS and OWL, describing how data on the Web can be given in structured form, and subsequently crawled, stored, inferenced and queried over.

Swoogle 6 offers search over RDF documents by means of an inverted keyword index and a relational database [35]. Swoogle calculates metrics that allow ontology designers to check the popularity of certain properties and classes. In contrast to SWSE, which is mainly concerned with entity search over instance data, Swoogle is mainly concerned with more traditional document search over ontologies.

6 http://swoogle.umbc.edu/

WATSON 7 provides a similar effort to provide keyword search facilities over Semantic Web documents, but additionally provides search over entities [106]. However, they do not include components for consolidation or reasoning, and seemingly instead focus on providing APIs to external services.

7 http://watson.kmi.open.ac.uk/WatsonWUI/

Sindice 8 is a registry and lookup service for RDF files based on Lucene and a MapReduce framework [95]. Sindice originally focussed on providing an API for finding documents which reference a given RDF entity or given keywords – again, document-centric search. More recently however, Sindice has begun to offer entity search in the form of Sig.Ma 9 [113]. However, Sig.Ma maintains a one-to-one relationship between keyword search and results, representing a very different user-interaction model to that presented herein.

8 http://sindice.com/
9 http://sig.ma

The Falcons Search engine 10 offers entity-centric searching for entities (and concepts) over RDF data [25]. They map certain keyword phrases to query relations between entities, and also use class hierarchies to quickly restrict initial results. Conceptually, this search engine most closely resembles our approach.

10 http://iws.seu.edu.cn/services/falcons/

Other systems focus on exploiting RDF for the purposes of domain-specific querying; for example, the recent GoWeb system 11 demonstrates the
benefit of searching structured data for the biomedical domain [33].

11 http://gopubmed.org/goweb/

4. Preliminaries

Before we continue, we briefly introduce some standard core notation used throughout the report – relating to RDF terms (constants), triples and quadruples – and also discuss Linked Data principles. Note that in this report, we will generally use bold-face to refer to infinite sets: e.g., G refers to the set of all triples; we will use calligraphic font to denote a subset thereof: e.g., G is a particular set of triples, where G ⊆ G.

4.1. Resource Description Framework

The Resource Description Framework provides a structured means of publishing information describing entities through use of RDF terms and RDF triples, and constitutes the core data model for our search engine. In particular, RDF allows for optionally defining names for entities using URIs and allows for subsequent re-use of URIs across the Web; using triples, RDF allows entities to be grouped into named classes, allows named relations to be defined between entities, and allows for defining named attributes of entities using string (literal) values. We now briefly give some necessary notation.

RDF Constant. Given a set of URI references U, a set of blank nodes B, and a set of literals L, the set of RDF constants is denoted by C = U ∪ B ∪ L. The set of blank nodes B is a set of existentially quantified variables. The set of literals is given as L = L_p ∪ L_d, where L_p is the set of plain literals and L_d is the set of typed literals. A typed literal is the pair l = (s, d), where s is the lexical form of the literal and d ∈ U is a datatype URI. The sets U, B, L_p and L_d are pairwise disjoint.

Please note that in this report, we treat blank nodes as their skolem versions: i.e., not as existential variables, but as denoting their own syntactic form. We also ensure correct merging of RDF graphs [60] by using blank-node labels unique for a given source. For URIs, we use namespace prefixes in this report as common in the literature – the full URIs can be retrieved from the convenient http://prefix.cc service. For space reasons, we sometimes denote owl: as the default namespace.

RDF Triple. A triple t = (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple. In a triple (s, p, o), s is called the subject, p the predicate, and o the object.

RDF Graph. We call a finite set of triples an RDF graph G ⊆ G, where G = (U ∪ B) × U × (U ∪ B ∪ L).

RDF Entity. We refer to the referent of a URI or blank node as an RDF entity, or commonly just entity.

4.2. Linked Data, Data Sources, Quadruples, and Dereferencing

In order to cope with the unique challenges of handling diverse and unverified web data, many of our components and algorithms require inclusion of a notion of provenance: consideration of the source of RDF data found on the Web. Tightly related to such notions are the best practices of Linked Data [9], which give clear guidelines for publishing RDF on the Web, viz.: (LDP1) use URIs to name things; (LDP2) use HTTP URIs so that those names can be looked up; (LDP3) provide useful structured information when a look-up on a URI is made – loosely called dereferencing; and (LDP4) include links using external URIs. In particular, within SWSE, these best practices form the backbone of various algorithms designed to interact with and be tolerant to web data.

We must thus extend RDF triples with context to denote the source thereof [50,54]. We also define some relations between the identifier for a data source and the graph it contains, including a function to represent HTTP redirects prevalently used in Linked Data for LDP3 [9].

Data Source. We define the http-download function http: U → G as the mapping from a URI to an RDF graph it may provide by means of a given HTTP lookup [44] which directly returns status code 200 OK and data in a suitable RDF format. We define the set of data sources S ⊆ U as the set of URIs S = {s ∈ U | http(s) ≠ ∅}. We define the reference function refs: C → P(S) as the mapping from an RDF term to the set of data sources that mention it. 12

12 P(S) refers to the powerset of S.

RDF Triple in Context / RDF Quadruple. A pair (t, c) with a triple t = (s, p, o), c ∈ S and t ∈ http(c) is called a triple in context c. We may also refer to (s, p, o, c) as an RDF quadruple or quad q with context c.
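As a concrete (if simplified) illustration of the notation above, the following sketch shows one possible in-memory representation of RDF constants, triples and quadruples; the class names and fields are assumptions made for exposition and do not correspond to SWSE's actual data structures.

```java
import java.util.Objects;

// Illustrative, simplified representation of the notation above: an RDF
// constant is a URI, a blank node or a literal; a triple is (s, p, o) with
// p a URI; a quad additionally carries the context (data source) c.
abstract class RdfConstant { }

final class Uri extends RdfConstant {
    final String value;
    Uri(String value) { this.value = value; }
}

final class BlankNode extends RdfConstant {
    // Treated as a skolem constant: the label (unique per source) denotes itself.
    final String label;
    BlankNode(String label) { this.label = label; }
}

final class Literal extends RdfConstant {
    final String lexicalForm;
    final Uri datatype; // null for plain literals
    Literal(String lexicalForm, Uri datatype) {
        this.lexicalForm = lexicalForm;
        this.datatype = datatype;
    }
}

final class Triple {
    final RdfConstant subject;  // a URI or blank node
    final Uri predicate;        // always a URI
    final RdfConstant object;   // a URI, blank node or literal
    Triple(RdfConstant s, Uri p, RdfConstant o) {
        this.subject = Objects.requireNonNull(s);
        this.predicate = Objects.requireNonNull(p);
        this.object = Objects.requireNonNull(o);
    }
}

final class Quad {
    final Triple triple;
    final Uri context; // the data source from which the triple was retrieved
    Quad(Triple triple, Uri context) {
        this.triple = triple;
        this.context = context;
    }
}
```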
HTTP Dereferencing. We define dereferencing as the function deref: U → U which maps a given URI to the identifier of the document returned by HTTP lookup operations upon that URI following redirects (for a given finite and non-cyclical path) [44], or which maps a URI to itself in the case of failure. Note that we do not distinguish between the different 30x redirection schemes, and that this function would involve, e.g., stripping the fragment identifier of a URI [11]. Note that all HTTP-level functions {http, refs, deref} are set at the time of the crawl, and are bounded by the knowledge of our crawl: for example, refs will only consider documents accessed by the crawl.
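For illustration, the deref function just defined could be approximated with the standard java.net API roughly as follows; the redirect bound, the use of a HEAD request and the class name are assumptions of this sketch rather than details of the actual SWSE crawler.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Simplified sketch of deref: U -> U. Follows 30x redirects (treated
// uniformly) along a finite, non-cyclical path; returns the input URI
// unchanged if the lookup fails. Illustrative only.
final class Dereferencer {
    private static final int MAX_REDIRECTS = 10; // assumed bound on redirect paths

    static String deref(String uri) {
        try {
            String current = stripFragment(uri);
            for (int i = 0; i < MAX_REDIRECTS; i++) {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(current).openConnection();
                conn.setInstanceFollowRedirects(false);
                conn.setRequestMethod("HEAD");
                conn.setRequestProperty("Accept", "application/rdf+xml");
                int code = conn.getResponseCode();
                if (code >= 300 && code < 400) {
                    String location = conn.getHeaderField("Location");
                    if (location == null) {
                        return current;
                    }
                    // Resolve relative redirects against the current URI.
                    current = stripFragment(
                            new URL(new URL(current), location).toString());
                } else {
                    return current; // non-redirect response: document identifier
                }
            }
            return current; // redirect bound reached
        } catch (Exception e) {
            return uri; // failure: map the URI to itself
        }
    }

    private static String stripFragment(String uri) {
        int hash = uri.indexOf('#');
        return hash >= 0 ? uri.substring(0, hash) : uri;
    }
}
```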
5. Crawling

The first component required for the building of our index is the crawler, whose purpose is to retrieve a large set of RDF documents from the Web. Our crawler starts with a set of seed URIs, retrieves the content of URIs, parses and writes content to disk in the form of quads, and recursively extracts new URIs for crawling; following LDP2 and LDP3, we consider all http: protocol URIs extracted from an RDF document as candidates for crawling.

Like traditional HTML crawlers, we identify the following requirements for crawling:
– Politeness: The crawler must implement politeness restrictions to avoid hammering remote servers with dense HTTP GET requests and to abide by policies identified in the provided robots.txt files 13.
– Throughput: The crawler should crawl as many URIs as possible in as little time as is possible within the bounds of the politeness policies.
– Scale: The crawler should employ scalable techniques, and on-disk indexing as required.
– Quality: The crawler should prioritise crawling URIs it considers to be "high quality".
Thus, the design of our crawler is inspired by related work from traditional HTML crawlers. Additionally – and specific to crawling structured data – we identify the following requirement:
– Structured Data: The crawler should retrieve a high percentage of RDF/XML documents and avoid wasted lookups on unwanted formats: e.g., HTML documents.

Currently, we crawl for RDF/XML syntax documents – RDF/XML is still the most commonly used syntax for publishing RDF on the Web, and we plan in future to extend the crawler to support other formats such as RDFa, N-Triples and Turtle.

The following algorithm details the operation of the crawler, and will be explained in detail throughout this section.

Algorithm 1 Algorithm for crawling
Require: SEEDS, ROUNDS, PLDLIMIT, MINDELAY
 1: frontier ← SEEDS
 2: pld0...n ← new queue
 3: stats ← new stats
 4: while rounds + 1 < ROUNDS do
 5:   put frontier into pld0...n
 6:   while depth + 1 < PLDLIMIT do
 7:     for i = 0 to n do
 8:       prioritise(pld_i, stats)
 9:     end for
10:     start ← current_time()
11:     for i = 0 to n do
12:       cur_i ← calculate_cur(pld_i, stats)
13:       if cur_i > random([0,1]) then
14:         get uri from pld_i
15:         uri_deref ← deref(uri)
16:         if uri_deref = uri then
17:           G ← http(uri)
18:           output G
19:           add new URIs from G to frontier
20:           update stats
21:         else
22:           if uri_deref is unseen then
23:             add uri_deref to frontier
24:           end if
25:         end if
26:       end if
27:     end for
28:     elapsed ← current_time() − start
29:     if elapsed < MINDELAY then
30:       wait(MINDELAY − elapsed)
31:     end if
32:   end while
33: end while

13 http://www.robotstxt.org/orig.html
5.1. Breadth-first Crawling

Traditional web crawlers (cf. [15,64]) typically use a breadth-first crawling strategy: the crawl is conducted in rounds, with each round crawling a frontier. On a high level, Algorithm 1 represents this round-based approach applying ROUNDS number of rounds. The frontier comprises seed URIs for round 0 (Line 1, Algorithm 1), and thereafter novel URIs extracted from documents crawled in the previous round (Line 19, Algorithm 1). Thus, the crawl emulates a breadth-first traversal of interlinked web documents. (Note that the algorithm is further tailored according to requirements we will describe as the section progresses.)

As we will see later in the section, the round-based approach fits well with our distributed framework, allowing crawlers to work independently for each round and coordinating new frontier URIs at the end of each round. Additionally, [91] show that a breadth-first traversal strategy tends to discover high-quality pages early on in the crawl, with the justification that well-linked documents (representing higher-quality documents) are more likely to be encountered in earlier breadth-first rounds; similarly, breadth-first crawling leads to a more diverse dataset earlier on, rather than a depth-first approach which may end up traversing deep paths within a given site. In [81], the authors justify a rounds-based approach to crawling according to observations that writing/reading concurrently and dynamically to a single queue can become the bottleneck in a large-scale crawler.

5.2. Incorporating Politeness

The crawler must be careful not to bite the hands that feed it by hammering the servers of data providers or breaching policies outlined in the provided robots.txt file [111]. We use pay-level domains [81] (PLDs; a.k.a. "root domains"; e.g., bbc.co.uk) to identify individual data providers, and implement politeness on a per-PLD basis. Firstly, when we first encounter a URI for a PLD, we cross-check the robots.txt file to ensure that we are permitted to crawl that site; secondly, we implement a "minimum PLD delay" to avoid hammering servers, viz.: a minimum time-period between subsequent requests to a given PLD. This is given by MINDELAY in Algorithm 1.

In order to accommodate the min-delay policy with minimal effect on performance, we must refine our crawling algorithm: large sites with a large internal branching factor (large numbers of unique intra-PLD outlinks per document) can result in the frontier of each round being dominated by URIs from a small selection of PLDs. Thus, naïve breadth-first crawling can lead to crawlers hammering such sites; conversely, given a politeness policy, a crawler may spend a lot of time idle waiting for the min-delay to pass.

One solution is to reasonably restrict the branching factor [81] – the maximum number of URIs crawled per PLD per round – which ensures that individual PLDs with large internal fan-out are not hammered; thus, in each round of the crawl, we implement a cut-off for URIs per PLD, given by PLDLIMIT in Algorithm 1.

Secondly, to ensure the maximum gap between crawling successive URIs for the same PLD, we implement a per-PLD queue (given by pld0...n in Algorithm 1) whereby each PLD is given a dedicated queue of URIs filled from the frontier, and during the crawl, a URI is polled from each PLD queue in a round-robin fashion. If all of the PLD queues have been polled before the min-delay is satisfied, then the crawler must wait: this is given by Lines 28-31 in Algorithm 1. Thus, the minimum crawl time for a round – assuming a sufficiently full queue – becomes MINDELAY * PLDLIMIT.
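The per-PLD round-robin polling with a minimum delay can be sketched as follows (the min-delay wait corresponds to Lines 28-31 of Algorithm 1); class and method names are illustrative assumptions, and the sketch omits the cur-based probabilistic polling of Algorithm 1.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of per-PLD politeness: each pay-level domain (PLD) has
// its own queue of URIs; queues are polled round-robin, and if a full pass
// over all queues completes before MINDELAY has elapsed, the crawler waits.
// Names are assumptions for exposition, not the actual SWSE crawler code.
final class PerPldScheduler {
    private final long minDelayMillis; // MINDELAY
    private final Map<String, Queue<String>> pldQueues = new LinkedHashMap<>();

    PerPldScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Fill the per-PLD queues from the frontier (PLD extraction simplified here).
    void enqueue(String pld, String uri) {
        pldQueues.computeIfAbsent(pld, k -> new ArrayDeque<>()).add(uri);
    }

    // One polite pass: poll at most one URI per PLD, then enforce the min-delay.
    void pollRound(UriHandler handler) throws InterruptedException {
        long start = System.currentTimeMillis();
        for (Queue<String> queue : pldQueues.values()) {
            String uri = queue.poll();
            if (uri != null) {
                handler.handle(uri); // lookup, parse, extract links, etc.
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed < minDelayMillis) {
            Thread.sleep(minDelayMillis - elapsed); // Lines 28-31 of Algorithm 1
        }
    }

    interface UriHandler { void handle(String uri); }
}
```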
5.3. On-disk Queue

As the crawl continues, the in-memory capacity of the machine will eventually be exceeded by the capacity required for storing URIs [81]. Performing a stress test, we observed that with 2GB of Java heap-space, the crawler could crawl approx. 199k URIs (additionally storing the respective frontier URIs) before throwing an out-of-memory exception. In order to scale beyond the implied main-memory limitations of the crawler, we implement on-disk storage for URIs, with the additional benefit of maintaining a persistent state for the crawl and thus offering a "continuation point" useful for extension of an existing crawl, or recovery from failure.

We implement the on-disk storage of URIs using Berkeley DB, which comprises two indexes: the first provides lookups for URI strings against their status (polled/unpolled); the second offers a key-sorted map which can iterate over unpolled URIs in decreasing order of inlink count. The inlink count reflects the total number of documents from which the URI has been extracted thus far; we deem a higher count to roughly equate to a higher-priority URI.

The crawler utilises both the on-disk index and the in-memory queue to offer similar functionality as above. The on-disk index and in-memory queue are synchronised at the start of each round:
(i) links and respective inlink counts extracted from the previous round (or seed URIs if the first round) are added to the on-disk index;
(ii) URIs polled from the previous round have their status updated on-disk;
(iii) an in-memory PLD queue is filled using an iterator of on-disk URIs sorted by descending inlink count.
Most importantly, the above process ensures that only the URIs active (current PLD queue and frontier URIs) for the current round must be stored in memory. Also, the process ensures that the on-disk index stores the persistent state of the crawler up to the start of the last round; if the crawler unexpectedly dies, the crawl can be resumed from the start of the last round. Finally, the in-memory PLD queue is filled with URIs sorted in order of inlink count, offering a cheap form of intra-PLD URI prioritisation (Line 8, Algorithm 1).
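The round-boundary synchronisation between the on-disk index and the in-memory PLD queues (steps (i)-(iii) above) might be sketched as follows; the OnDiskUriIndex interface stands in for the two Berkeley DB indexes described above, and all names are assumptions for exposition rather than the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of the per-round synchronisation between the on-disk
// URI index (backed by Berkeley DB in SWSE) and the in-memory PLD queues.
interface OnDiskUriIndex {
    void addLink(String uri, int inlinkIncrement);   // step (i): add/update inlink counts
    void markPolled(String uri);                     // step (ii): update polled status
    Iterator<String> unpolledByDescendingInlinks();  // step (iii): key-sorted iteration
}

final class RoundSynchroniser {
    // Returns the in-memory PLD queues for the next round; only these URIs
    // (at most pldLimit per PLD) need to be held in memory during the round.
    static Map<String, Queue<String>> synchroniseRound(
            OnDiskUriIndex index,
            Map<String, Integer> linksExtractedLastRound,
            Iterable<String> urisPolledLastRound,
            int pldLimit) {
        // (i) add links and respective inlink counts from the previous round
        for (Map.Entry<String, Integer> e : linksExtractedLastRound.entrySet()) {
            index.addLink(e.getKey(), e.getValue());
        }
        // (ii) mark URIs polled in the previous round as polled on-disk
        for (String uri : urisPolledLastRound) {
            index.markPolled(uri);
        }
        // (iii) fill per-PLD queues from unpolled URIs, highest inlink count
        // first, respecting the per-PLD cut-off (PLDLIMIT)
        Map<String, Queue<String>> pldQueues = new HashMap<>();
        Iterator<String> it = index.unpolledByDescendingInlinks();
        while (it.hasNext()) {
            String uri = it.next();
            String pld = payLevelDomain(uri);
            Queue<String> q = pldQueues.computeIfAbsent(pld, k -> new ArrayDeque<>());
            if (q.size() < pldLimit) {
                q.add(uri);
            }
        }
        return pldQueues;
    }

    // Crude placeholder for PLD extraction (e.g., "bbc.co.uk"); a real
    // implementation would consult a public-suffix list.
    private static String payLevelDomain(String uri) {
        String host = java.net.URI.create(uri).getHost();
        return host == null ? uri : host;
    }
}
```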
5.4. Multi-threading

The bottleneck for a single-threaded crawler will be the response times of remote servers; the CPU load, I/O throughput and network bandwidth of a crawling machine will not be efficiently exploited by sequential HTTP GET requests over the Web. Thus, crawlers are commonly multi-threaded to mitigate this bottleneck and perform concurrent HTTP lookups. At a certain point of increasing the number of lookup threads operating, the CPU load, I/O load, or network bandwidth becomes an immutable bottleneck; this becomes the optimal number of threads.

In order to find a suitable thread count for our particular setup (with respect to processor/network bandwidth), we conducted some illustrative small-scale experiments comparing a machine crawling with the same setup and input parameters, but with an exponentially increasing number of threads: in particular, we measure the time taken for crawling 1,000 URIs given a seed URI 14 for 1, 2, 4, 8, 16, 32, 64, and 128 threads. Also, to alleviate the possible effects of remote caching on our comparison of increasing thread counts, we pre-crawled all of the URIs before running the benchmark.

14 http://sw.deri.org/~aidanh/foaf/foaf.rdf

Figure 5 displays the time taken in minutes for the crawler to run, whilst Figure 6 shows the average percentage CPU usage (the averages are over readings extracted from the UNIX command ps taken every three seconds during the crawl). Time and CPU% noticeably have an inverse correlation. As the number of threads increases up until 64, the time taken for the crawl decreases – the reduction in time is particularly pronounced in earlier thread increments; similarly, and as expected, the CPU usage increases as a higher density of documents is retrieved and processed. Beyond 64 threads, the effect of increasing threads becomes minimal as the machine reaches the limits of CPU and disk I/O throughput; in fact, the total time taken starts to increase – we suspect that contention between threads for shared resources affects performance. Thus, we settle upon 64 threads as an approximately optimal figure for our setup.

5.5. Crawling RDF/XML

Since our architecture is currently implemented to index RDF/XML, we would feasibly like to maximise the ratio of HTTP lookups which result in RDF/XML content; i.e., given the total HTTP lookups as L and the total number of downloaded RDF/XML pages as R, we would like to maximise the ratio R/L.

In order to reduce the number of HTTP lookups wasted on non-RDF/XML content, we implement the following heuristics:
(i) firstly, we blacklist non-http protocol URIs;
(ii) secondly, we blacklist URIs with common file extensions that are highly unlikely to return RDF/XML (e.g., html, jpg, pdf, etc.), following arguments we previously laid out in [114];
(iii) thirdly, we check the returned HTTP header and only retrieve the content of URIs reporting Content-type: application/rdf+xml; 15
(iv) finally, we use a credible useful ratio when polling PLDs to indicate the probability that

15 Indeed, one advantage RDF/XML has over RDFa is an unambiguous MIME-type useful in such situations.