INTELLIGENT SYSTEMS REFERENCE LIBRARY
Volume 50
Gabriella Pasi
Gloria Bordogna
Lakhmi C. Jain (Eds.)
Quality Issues in the
Management of Web
Information
Series Editors
J. Kacprzyk, Warsaw, Poland
L. C. Jain, Adelaide, Australia
For further volumes:
http://www.springer.com/series/8578
Editors
Gabriella Pasi
Dipartimento di Informatica Sistemistica e Comunicazione
Università degli Studi di Milano Bicocca
Milano
Italy
Gloria Bordogna
CNR-IDPA – Istituto per la Dinamica dei Processi Ambientali
Dalmine
Italy
Lakhmi C. Jain
University of Canberra
Canberra
Australia
and
University of South Australia
South Australia
Australia
ISSN 1868-4394    ISSN 1868-4408 (electronic)
ISBN 978-3-642-37687-0    ISBN 978-3-642-37688-7 (eBook)
DOI 10.1007/978-3-642-37688-7
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2013934982
© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
Professor Dr. Carlo Batini
Department of Informatics, Systems and Communication
University of Milan Bicocca,
Milan
Italy
A book on quality issues in the management of Web information has to deal with a potentially large number of issues. The concept of quality is pervasive, so pervasive that it is difficult even to provide a shared and usable definition of it. The difficulty is reduced (though not by much) if we delimit the area of the technologies and related resources considered. This book focuses on Web retrieval technologies and on the information resources that these technologies access and manipulate to provide knowledge access services to human beings and computer applications.
Information is in turn a pervasive concept, inherently related to two other concepts, data and knowledge. We can say with Boisot [1999] that “data is discrimination between physical states of things (black, white, etc.) that may convey or not convey information to an agent. Whether it does so or not depends on the agent’s prior stock of knowledge ... thus, whereas data can be characterized by a property of things, knowledge is a property of agents ... information establishes a relationship between things and agents.”
The world is dirty, and so is the Web, which is often an all too vivid and faithful representation of the world. Yet the Web is, and increasingly will be, the most accessed source of knowledge for human beings. From this scenario, we can understand why the issue of information quality is of growing relevance in the Computer Science literature and in Information Systems applications, and across a wide spectrum of research areas and real-life applications.
The issue of quality was historically investigated first in the simplest case: data stored in databases, structured in domains and tables, and managed in transactional applications under the rigid control of the organization. The second age corresponds to the dispersion of the data of interest to the organization across a multiplicity of databases, heterogeneous in format, content and semantics. This typically leads the information systems of the organization to represent the same real-world entity with multiple heterogeneous representations, usually characterized by different levels of quality.
A number of dimensions and related metrics have been proposed to formally characterize the quality of data in these two scenarios. An analysis of the literature on data quality (see Batini and Scannapieco [2006] and Batini et al. [2009]) reports more than 50 dimensions and about 100 metrics, and at least 12 methodologies for the assessment and improvement of data quality in Information Systems using database technology. Among dimensions, the most relevant are accuracy, currency, completeness and consistency; for the definitions, see Batini and Scannapieco [2006]. Techniques range from record linkage and entity identification to data cleansing, quality-driven query answering, edit imputation and correction, and outlier identification. With the advent of networks and the planetary diffusion of the Web, new types of information systems and new data access and usage paradigms had to be considered. Among information systems, cooperative information systems allow different autonomous organizations to share data, applications and services, while peer-to-peer systems are characterized by higher autonomy and heterogeneity and by the absence of common management of data. Among new data access and usage paradigms, the evolution of information systems to cover a wide range of information representations, such as semistructured texts, unstructured texts, maps, images, videos and sounds, led to the development of access mechanisms in which searches are based on metadata, tags and full-text indexing, giving rise to the Information Retrieval research discipline.
In the area of data and information quality, the above diversification resulted in the investigation of dimensions, methodologies and techniques that cover all of the above-mentioned types of information representations, previously in the world of single-organization information systems and cooperative information systems, and now in the different articulations of peer-to-peer information systems and the immense world of the Web. While a common thread can be identified among the dimensions defined for the different types of information representations (see Batini et al. [2012] for a discussion), when the other coordinates are considered (types of information systems and the Web), the need arises to consider new dimensions and new techniques. Among dimensions and related determinants, due to the uncontrolled and “anarchic” character of the Web, attention shifts to dimensions such as trustworthiness, provenance, authority, age and popularity (see, e.g., Ramachandran et al. [2009] for a discussion), which refer to the quality of sources in addition to that of the data and information they convey.
Focusing on the main theme of this volume, techniques span a wide range and cover issues such as quality-driven retrieval, quality-aware similarity search, quality of volunteered geographical information systems, quality-based knowledge discovery in specific domains, and quality of web engines. Such techniques are investigated in several papers of the present volume.
References
1. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies and Techniques. Springer (2006)
2. Batini, C., Cappiello, C., Francalanci, C., Maurino, A.: Methodologies for data quality assessment and improvement. ACM Computing Surveys (2009)
3. Batini, C., Palmonari, M., Viscusi, G.: The many faces of information and their impact on information quality. In: Proceedings of International Conference on Information Quality, Paris (2012)
4. Boisot, M.: Knowledge Assets: Securing Information Advantage in the Information Economy. Oxford University Press (1998)
5. Ramachandran, S., Paulraj, S., Joseph, S., Ramaraj, V.: Enhanced Trustworthy and High-Quality Information Retrieval System for Web Search Engines. IJCSI International Journal of Computer Science Issues 5 (2009)
Preface
The main focus of this book is on the quality issue in the management of information used in Web applications. A variety of tasks are concerned with and affected by the assessment of quality. The chapters included in this book relate to the tasks of Information Retrieval, Geographic Information Retrieval, Information Filtering and Knowledge Extraction. These areas demonstrate that by modelling and exploiting the quality dimensions of the information objects considered, it is possible to improve systems’ effectiveness.
The problem of assessing the quality of textual information has been investigated for a long time, and several distinct proposals have been formulated. There is no single, consensual definition of a text’s suitability for the task at hand. The problem of text quality assessment may be considered in relation to the information content itself (objective criteria) or from the user’s point of view. For example, in the context of Information Retrieval it is clear that the relevance of documents to a request depends on several aspects. These are related to the distinct properties of the documents, the search, the user who formulated the query and the users who accessed the documents previously. Relevance may also depend on other information, such as user ratings and tags, and on the context of both documents and queries. One of the relevance dimensions may be related to the quality of documents. In the case of Web pages, well-known algorithms (such as PageRank) have been defined for this purpose.
This book has been organised into nine chapters. It includes recent contributions related to quality-based information management on the Web. Academic and applied researchers working on the issue of information quality will find the book a valuable reference resource. The methods, models and systems proposed in this book can inspire and motivate further research on important issues. It is hoped that final-year undergraduate, master’s and PhD students in computer science and information systems will find in this book an excellent compiled reference text for their future studies.
We wish to thank all the contributors and referees for their excellent work and assistance, and Springer-Verlag for producing this publication.
Gabriella Pasi, Italy
Gloria Bordogna, Italy
Lakhmi C. Jain, Australia
Editors
Gabriella Pasi
Gabriella Pasi graduated in Computer Science at the Università degli Studi di Milano, Italy, and took a PhD in Computer Science at the Université de Rennes, France. She worked as a researcher at the National Council of Research in Italy until 2005. Since 2005 she has been Associate Professor at the Università degli Studi di Milano Bicocca, Milano, Italy, where, within the Department of Informatics, Systems and Communication, she leads the Information Retrieval research Lab. Her research activity mainly concerns the modelling and design of flexible and personalised systems for the management of and access to information (such as Information Retrieval Systems, Information Filtering Systems and Data Base Management Systems).
She is a member of the organizing and program committees of several international conferences. She has co-edited eight books and several special issues of international journals. She has published more than 180 papers in international journals and books, and in the proceedings of international conferences. She is involved in several activities for the evaluation of research; in particular, she was appointed as an expert of the Computer Science panel for the Starting Grants of the Programme Ideas at the European Research Council. She is the Vice-President of the European Society for Fuzzy Logic and Technologies (EUSFLAT).
She is a member of the Editorial Board of the international journals Fuzzy Sets and Systems, Journal of Computational Intelligence Systems, Web Intelligence and Agent Systems, Intelligent Decision Technology: An International Journal, and ACM Applied Computing Review.
She was the coordinator of the European Project PENG (Personalized News Content Programming), a STREP (Specific Targeted Research or Innovation Project), within the VI Framework Programme, Priority II, Information Society Technology.
She organized several international events, among which the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent