COLLABORATIVE OLAP WITH TAG CLOUDS
Web 2.0 OLAP Formalism and Experimental Evaluation
Kamel Aouiche, Daniel Lemire and Robert Godin
Université du Québec à Montréal, 100 Sherbrooke West, Montreal, Canada
[email protected],[email protected],[email protected]
arXiv:0710.2156v2 [cs.DB] 14 Jan 2008

Keywords: OLAP, Data Warehouse, Business Intelligence, Tag Cloud, Social Web

Abstract: Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.
1 INTRODUCTION

The Web 2.0, or Social Web, is about making available social software applications on the Web in an unrestricted manner. Enabling a wide range of distributed individuals to collaborate on data analysis tasks may lead to significant productivity gains (Heer et al., 2007; Wattenberg and Kriss, 2006). Several companies, like SocialText and IBM, are offering Web 2.0 solutions dedicated to enterprise needs. The data visualization Web sites Many Eyes (IBM, 2007) and Swivel (Swivel, Inc, 2007) have become part of the Web 2.0 landscape: over 1 million data sets were uploaded to Swivel in less than 3 months (Butler, 2007).

These Web 2.0 data visualization sites use traditional pie charts and histograms, but also tag clouds. Tag clouds are a form of histogram which can represent the amplitude of over a hundred items by varying the font size. The use of hyperlinks makes tag clouds naturally interactive. Tag clouds are used by many Web 2.0 sites such as Flickr, del.icio.us and Technorati. Increasingly, e-Commerce sites such as Amazon or O'Reilly Media are using tag clouds to help their users navigate through aggregated data.

Meanwhile, OLAP (On-Line Analytical Processing) (Codd, 1993) is a dominant paradigm in Business Intelligence (BI). OLAP allows domain experts to navigate through aggregated data in a multidimensional data model. Standard operations include drill-down, roll-up, dice, and slice. The data cube (Gray et al., 1996) model provides well-defined semantics and performance optimization strategies. However, OLAP requires much effort from database administrators even after the data has been cleaned, tuned and loaded: schemas must be designed in collaboration with users having fast changing needs and requirements (Body et al., 2002; Morzy and Wrembel, 2004). Vendors such as Spotfire, Business Objects and QlikTech have reacted by proposing a new class of tools allowing end users to customize their applications and to limit the need for centralized schema crafting (Havenstein, 2003).

OLAP itself has never been formally defined though rules have been proposed to recognize an OLAP application (Codd, 1993). In a similar manner, we propose rules to recognize Web 2.0 OLAP applications (see also Table 1):

1. Data and schemas are provided autonomously by users.
2. It is available as a Web application.
3. It supports complete online interaction over aggregated multidimensional data.
4. Users are encouraged to collaborate.

Tag clouds are well suited for Web 2.0 OLAP. They are flexible: a tag cloud can represent a dozen or a hundred different amplitudes. And they are accessible: the only requirement is a browser that can display different font sizes.

We describe a tag-cloud formalism, as an instance of Web 2.0 OLAP. Since we implemented a prototype, technical issues will be discussed regarding application design. In particular, we used iceberg cubes (Carey and Kossmann, 1997) to generate tag clouds online when the data and schema are provided extemporaneously. Because tag clouds are meant to convey a general impression, presenting approximate measures and clustering is sufficient: we propose specific metrics to measure the quality of tag-cloud approximations. We conclude the paper with experimental results on real and synthetic data sets.

Table 1: Conventional OLAP versus Web 2.0 OLAP

Conventional OLAP     Web 2.0 OLAP
recurring needs       ephemeral projects
predefined schemas    spontaneous schemas
centralized design    user initiative
histograms            tag clouds
plots and reports     iframes, wikis, blogs
access control        social networking

2 RELATED WORK

There are decentralized models (Taylor and Ives, 2006) and systems (Green et al., 2007) to support collaborative data sharing without a single schema.

According to Wu et al., it is difficult to navigate an OLAP schema without help; they have proposed a keyword-driven OLAP model (Wu et al., 2007). There are several OLAP visualization techniques including the Cube Presentation Model (CPM) (Maniatis et al., 2005), Multiple Correspondence Analysis (MCA) (Ben Messaoud et al., 2006) and other interactive systems (Techapichetvanich and Datta, 2005).

Tag clouds have been popularized by the Web site Flickr launched in 2004. Several optimization opportunities exist: similar tags can be clustered together (Kaser and Lemire, 2007), tags can be pruned automatically (Hassan-Montero and Herrero-Solana, 2006) or by user intervention (Millen et al., 2006), tags can be indexed (Millen et al., 2006), and so on. Tag clouds can be adapted to spatio-temporal data (Russell, 2006; Jaffe et al., 2006).

3 OLAP FORMALISM

3.1 Conventional OLAP Formalism

Most OLAP engines rely on a data cube (Gray et al., 1996). A data cube $C$ contains a non empty set of $d$ dimensions $\mathcal{D} = \{D_i\}_{1 \le i \le d}$ and a non empty set of measures $\mathcal{M}$. Data cubes are usually derived from a fact table (see Table 2) where each dimension and measure is a column and all rows (or facts) have disjoint dimension tuples. Figure 1(a) gives a three-dimensional representation of the data cube.

Table 2: Fact table example

Dimensions                                  Measures
location   time      salesman  product      cost   profit
Montreal   March     John      shoe         100$   10$
Montreal   December  Smith     shoe         150$   30$
Quebec     December  Smith     dress        175$   45$
Ontario    April     Kate      dress        90$    10$
Paris      March     John      shoe         100$   20$
Paris      March     Marc      table        120$   10$
Paris      June      Martin    shoe         120$   5$
Lyon       April     Claude    dress        90$    10$
New York   October   Joe       chair        100$   10$
New York   May       Joe       chair        90$    10$
Detroit    April     Jim       dress        90$    10$

Measures can be aggregated using several operators such as AVERAGE, MAX, MIN, SUM, and COUNT. All of these measures and dimensions are typically prespecified in a database schema. Database administrators preaggregate views to accelerate queries.

The data cube supports the following operations:

• A slice specifies that you are only interested in some attribute values of a given dimension. For example, one may want to focus on one specific product (see Figure 1(g)). Similarly, a dice selects ranges of attribute values (see Figure 1(e)).

• A roll-up aggregates the measures on coarser attribute values. For example, from the sales given for every store, a user may want to see the sales aggregated per country (see Figure 1(c)). A drill-down is the reverse operation: from the sales per country, one may want to explore the sales per store in one country.

The various specific multidimensional views in Figure 1 are called cuboids.

Figure 1: Conventional OLAP operations vs. tag-cloud OLAP operations. Panels: (a) OLAP data cube; (b) Tag-cloud data cube; (c) OLAP roll-up and (d) Tag-cloud roll-up (roll-up on product); (e) OLAP dice and (f) Tag-cloud dice (dice on the first semester of the year); (g) OLAP slice and (h) Tag-cloud slice (slice where product = 'shoe'). The cube axes are location (grouped by country), time, and product.

3.2 Tag-Cloud OLAP Formalism

A Web 2.0 OLAP application should be supported by a flexible formalism that can adapt to a wide range of data loaded by users. Processing time must be reasonable and batch processing should be avoided.

Figure 2: User-driven schema design

Unlike in conventional data cubes, we do not expect that most dimensions have explicit hierarchies when they are loaded: instead, users can specify how the data is laid out (see Section 5). As a related issue, the dimensions are not orthogonal in general: there might be a “City” dimension as well as a “Climate Zone” dimension. It is up to the user to organize the cities per climate zone or per country.
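To make the idea of a user-organized grouping concrete, here is a minimal Python sketch of a slice, a dice, and a roll-up over the fact table of Table 2. The function names (`slice_cube`, `dice_cube`, `roll_up`) and the `COUNTRY_OF` mapping are hypothetical names chosen for illustration; the mapping follows the country grouping drawn in Figure 1 and stands in for whatever grouping a user would supply. It is not the authors' implementation.

```python
from collections import defaultdict

COLUMNS = ("location", "time", "salesman", "product", "cost", "profit")
ROWS = [  # the facts of Table 2
    ("Montreal", "March", "John", "shoe", 100, 10),
    ("Montreal", "December", "Smith", "shoe", 150, 30),
    ("Quebec", "December", "Smith", "dress", 175, 45),
    ("Ontario", "April", "Kate", "dress", 90, 10),
    ("Paris", "March", "John", "shoe", 100, 20),
    ("Paris", "March", "Marc", "table", 120, 10),
    ("Paris", "June", "Martin", "shoe", 120, 5),
    ("Lyon", "April", "Claude", "dress", 90, 10),
    ("New York", "October", "Joe", "chair", 100, 10),
    ("New York", "May", "Joe", "chair", 90, 10),
    ("Detroit", "April", "Jim", "dress", 90, 10),
]
FACTS = [dict(zip(COLUMNS, row)) for row in ROWS]

# Hypothetical user-supplied grouping of locations (assumption for this sketch).
COUNTRY_OF = {
    "Montreal": "Canada", "Quebec": "Canada", "Ontario": "Canada",
    "Paris": "France", "Lyon": "France",
    "New York": "US", "Detroit": "US",
}


def slice_cube(facts, dimension, value):
    """Slice: keep the facts with one attribute value on a dimension."""
    return [f for f in facts if f[dimension] == value]


def dice_cube(facts, dimension, values):
    """Dice: keep the facts whose attribute value falls in a set of values."""
    allowed = set(values)
    return [f for f in facts if f[dimension] in allowed]


def roll_up(facts, dimension, grouping, measure):
    """Roll-up: SUM a measure over coarser values given by `grouping`."""
    totals = defaultdict(int)
    for f in facts:
        totals[grouping[f[dimension]]] += f[measure]
    return dict(totals)


profit_per_country = roll_up(FACTS, "location", COUNTRY_OF, "profit")
shoes = slice_cube(FACTS, "product", "shoe")
first_semester = dice_cube(
    FACTS, "time", ["January", "February", "March", "April", "May", "June"]
)
```

Swapping `COUNTRY_OF` for a climate-zone mapping would give a different roll-up over the same facts, which is the flexibility the paragraph above asks for.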
Definition 1 (Tag) A tag is a term or phrase describing an object with corresponding non-negative weights determining its relative importance. Hence, a tag is made of a triplet (term, object, weight).

As an example, a picture may have been attributed the tags “dog” (12 times) and “cat” (20 times). In a Business Intelligence context, a tag may describe the current state of a business. For example, the tags “USA” (16,000$) and “Canada” (8,000$) describe the sales of a given product by a given salesman.

We can aggregate several attribute values, such as “Canada” and “March,” into a single term, such as “Canada–March.” A tag composed of k attribute values is called a k-tag. Figure 1(b) shows a tag cloud representation of Table 2 using 3-tags.

Each tag T is represented visually using a font size, font color, background color, area or motif, depending on its measure values.

3.3 Tag-Cloud Operations

In our system, users can upload data, select a data set, and define a schema by choosing dimensions (see Figure 2). Then, users can apply various operations on the data using a menu bar. On the one hand, OLAP operations such as slice, dice, roll-up and drill-down generate new tag clouds and new cuboids from existing cuboids. Figures 1(d), 1(f) and 1(h) show the results of a roll-up, a dice, and a slice as tag clouds. On the other hand, we can apply some operations on an existing tag cloud: sort by either the weights or the terms of tags, remove some tags, remove lesser weighted tags, and so on. We estimate that a tag cloud should not have more than 150 tags.

Tag-cloud layout has measurable benefits when trying to convey a general impression (Rivadeneira et al., 2007). Hence, we wish to optimize the visual arrangement of tags.
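The k-tags of Section 3.2 and the tag-cloud operations listed above (sorting by weight or by term, removing lesser weighted tags, capping the cloud at 150 tags) can be sketched as follows. This is only an illustration under stated assumptions: tags are stored as a plain term-to-weight dictionary (the object of Definition 1 is left implicit), weights are aggregated with SUM, terms are joined with a hyphen as in Figure 4, and the names `k_tags`, `sort_by_weight`, `sort_by_term` and `prune` are hypothetical.

```python
from collections import defaultdict

MAX_TAGS = 150  # upper bound on tag-cloud size suggested in the text


def k_tags(facts, dimensions, measure):
    """Build k-tags: one term per combination of k attribute values,
    weighted by the SUM of a measure (an assumed aggregation)."""
    weights = defaultdict(int)
    for fact in facts:
        term = "-".join(str(fact[d]) for d in dimensions)
        weights[term] += fact[measure]
    return dict(weights)


def sort_by_weight(tags):
    """Heaviest tags first; ties are broken alphabetically."""
    return sorted(tags.items(), key=lambda kv: (-kv[1], kv[0]))


def sort_by_term(tags):
    """Alphabetical order of the terms."""
    return sorted(tags.items())


def prune(tags, max_tags=MAX_TAGS, min_weight=0):
    """Remove lesser weighted tags, then keep at most max_tags tags."""
    kept = {t: w for t, w in tags.items() if w >= min_weight}
    return dict(sort_by_weight(kept)[:max_tags])


facts = [  # a few rows of Table 2, restricted to three columns
    {"location": "Paris", "product": "shoe", "profit": 20},
    {"location": "Paris", "product": "table", "profit": 10},
    {"location": "Paris", "product": "shoe", "profit": 5},
    {"location": "Lyon", "product": "dress", "profit": 10},
    {"location": "Quebec", "product": "dress", "profit": 45},
]
cloud = k_tags(facts, ("location", "product"), "profit")  # 2-tags
top_two = prune(cloud, max_tags=2)
```

Here `cloud` holds four 2-tags such as "Paris-shoe", whose weight is the total profit of the two matching facts.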
Chen et al. propose the computation of similarity measures between cuboids to help users explore data (Chen et al., 2000): we apply this idea to define similarities between tags. First of all, users are asked to provide one or several dimensions they want to use to cluster the tags. Choosing the “Country” dimension would mean that the user wants the tags rearranged by countries so that “Montreal–April” and “Toronto–March” are nearby (see Figure 3). The clustering dimensions selected by the user together with the tag-cloud dimensions form a cuboid: in our example, we have the dimensions “Country,” “City,” and “Time.” Since a tag contains a set of attribute values, it has a corresponding subcuboid defined by slicing the cuboid.

Figure 3: Choosing similarity dimensions

Several similarity measures can be applied between subcuboids: Jaccard, Euclidean distance, cosine similarity, Tanimoto similarity, Pearson correlation, Hamming distance, and so on. Which similarity measure is best depends on the application at hand, so advanced users should be given a choice. Commonly, similarity measures take up values in the interval $[-1,1]$. Similarity measures are expected to be reflexive ($f(a,a)=1$), symmetric ($f(a,b)=f(b,a)$) and transitive: if a is similar to b, and b is similar to c, then a is also similar to c.

Recall that given two vectors $v$ and $w$, the cosine similarity measure is defined as $\cos(v,w) = \sum_i v_i w_i / \sqrt{\sum_i v_i^2 \sum_i w_i^2} = v/|v| \cdot w/|w|$. The Tanimoto similarity is given by $\sum_i v_i w_i / (\sum_i v_i^2 + \sum_i w_i^2 - \sum_i v_i w_i)$; it becomes the Jaccard similarity when the vectors have binary values. Both of these measures are reflexive, symmetric and transitive. Specifically, the cosine similarity is transitive by this inequality: $\cos(v,z) \ge \cos(w,z) - \sqrt{1-\cos(v,w)^2}$. To generalize the formulas from vectors to cuboids, it suffices to replace the single summation by one summation per dimension. Figure 4 shows an example of tag-cloud reordering to cluster similar tags. In this example, the “City–Product” tags were compared according to the “Country” dimension. The result is that the tags are clustered by countries.

Figure 4: Tag-cloud reordering based on similarity
Without similarity: Quebec-dress, Detroit-dress, Paris-table, Ontario-dress, Paris-shoe, Montreal-shoe, Lyon-dress, New York-chair
With similarity: Detroit-dress, New York-chair, Quebec-dress, Ontario-dress, Montreal-shoe, Paris-table, Paris-shoe, Lyon-dress

4 FAST COMPUTATION

Because only a moderate number of tags can be displayed, the computation of tag clouds is a form of top-k query: given any user-specified range of cells, we seek the top-k cells having the largest measures. There is little hope of answering such queries in near constant-time with respect to the number of facts without an index or a buffer. Indeed, finding all and only the elements with frequency exceeding a given frequency threshold (Cormode and Muthukrishnan, 2005) or merely finding the most frequent element (Alon et al., 1996) requires Ω(m) bits where m is the number of distinct items.

Various efficient techniques have been proposed for the related range MAX problem (Chazelle, 1988; Poon, 2003), but they do not necessarily generalize. Instead, for the range top-k problem, we can partition sparse data cubes into customized data structures to speed up queries by an order of magnitude (Luo et al., 2001; Loh et al., 2002a; Loh et al., 2002b). We can also answer range top-k queries using RD-trees (Chung et al., 2007) or R-trees (Seokjin et al., 2005). In tag clouds, precision is not required and accuracy is less important; only the most significant tags are typically needed. Further, if all tags have similar weights, then any subset of tags may form an acceptable tag cloud.

A strategy to speed up top-k queries is to transform them into comparatively easier iceberg queries (Carey and Kossmann, 1997). For example, in computing the top-10 (k=10) best vendors, one could start by finding all vendors with a rating above 4/5. If there are at least 10 such vendors, then sorting this smaller list is enough. If not, one can restart the query, seeking vendors with a rating above 3/5. Given a histogram or selectivity estimates, we can reduce the number of expected iceberg queries (Donjerkovic and Ramakrishnan, 1999). Unfortunately, this approach is not necessarily applicable to multidimensional data since even computing iceberg aggregates once for each query may be prohibitive.
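The vendor example above (try a high threshold, restart with a lower one if fewer than k items qualify) can be sketched in a few lines of Python. This is an illustration only: the in-memory dictionary stands in for a database, and the names `iceberg` and `top_k_via_iceberg`, the vendor ratings, and the list of thresholds are all assumptions of the sketch.

```python
def iceberg(measures, threshold):
    """Iceberg query: keep only the items whose measure exceeds a threshold."""
    return {item: m for item, m in measures.items() if m > threshold}


def top_k_via_iceberg(measures, k, thresholds):
    """Answer a top-k query through successive iceberg queries.

    `thresholds` is a decreasing list of thresholds to try. If none of them
    retains at least k items, fall back to sorting everything.
    """
    candidates = measures
    for threshold in thresholds:
        selected = iceberg(measures, threshold)
        if len(selected) >= k:
            candidates = selected  # enough items: sort this smaller set only
            break
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical vendor ratings out of 5.
ratings = {"a": 4.9, "b": 4.5, "c": 3.9, "d": 3.5, "e": 2.0}

# Only two vendors rate above 4, so the query restarts with threshold 3.
best_three = top_k_via_iceberg(ratings, k=3, thresholds=[4, 3])
```

In a real system each call to `iceberg` would be a query against the data source, which is why the number of restarts matters.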
However, iceberg cuboids can still be put to good use. That is, one materializes the iceberg of a cuboid, small enough to fit in main memory, from which the tag clouds are computed. Intuitively, a cuboid representing the largest measures is likely to provide reasonable tag clouds. Users mostly notice tags with large font sizes (Rivadeneira et al., 2007). A good approximation captures the tags having significantly larger weights. To determine whether a tag cloud has such significant tags, we can compute the entropy.

Definition 2 (Entropy of a tag cloud) Let $T \in \mathcal{T}$ be a tag from a tag cloud $\mathcal{T}$, then $\mathrm{entropy}(\mathcal{T}) = -\sum_{T \in \mathcal{T}} p(T) \log(p(T))$ where $p(T) = \frac{\mathrm{weight}(T)}{\sum_{x \in \mathcal{T}} \mathrm{weight}(x)}$.

The entropy quantifies the disparity of weights between tags. The lower the entropy, the more interesting the corresponding tag cloud is. Indeed, tag clouds with uniform tag weights have maximal entropy and are visually not very informative (see Figure 5).

Figure 5: Example of noninformative tag cloud

We can measure the quality of a low-entropy tag cloud by measuring false positives and negatives: a false positive happens when a tag has been falsely added to a tag cloud whereas a false negative occurs when a tag is missing. These measures of error assume that we limit the number of tags to a moderately small number. We use the following quality indexes; index values are in [0,1] and a value of 0 is ideal; they are not applicable to high-entropy tag clouds.

Definition 3 Given approximate and exact tag clouds $A$ and $E$, the false-positive and false-negative indexes are $\frac{\max_{t \in A, t \notin E} \mathrm{weight}(t)}{\max_{t \in A} \mathrm{weight}(t)}$ and $\frac{\max_{t \in E, t \notin A} \mathrm{weight}(t)}{\max_{t \in E} \mathrm{weight}(t)}$.

5 TAG-CLOUD DRAWING

While we can ensure some level of device-independent displays on the Web, by using images or plugins, text display in HTML may vary substantially from one browser to another. There is no common set of fonts browsers are required to support, and Web standards do not dictate line-breaking algorithms or other typographical issues. It is not practical to simulate the browser on a server. Meanwhile, if we wish to remain accessible and to abide by open standards, producing HTML and ECMAScript is the favorite option.

Given tag-cloud data, the tag-cloud drawing problem is to optimally display the tags, generally using HTML, so that some desirable properties are met, including the following: (1) the screen space usage is minimized; (2) when applicable, similar tags are clustered together. Typically, the width of the tag cloud is fixed, but its height can vary.

For practical reasons, we do not wish for the server to send all of the data to the browser, including a possibly large number of similarity measures between tags. Hence, some of the tag-cloud drawing computations must be server-bound. There are two possible architectures. The first scenario is a browser-aware approach (Kaser and Lemire, 2007): given the tag-cloud data provided by the server, the browser sends back to the server some display-specific data, such as the box dimensions of various tags using different font sizes. The server then sends back an optimized tag cloud. The second approach is browser-oblivious: the server optimizes the display of the tag cloud without any knowledge of the browser by passing simple display hints. The browser can then execute a final and inexpensive display optimization. While browser-oblivious optimization is necessarily limited, it has reduced latency and it is easily cacheable.

Browser-oblivious optimization can take many forms. For example, we could send classes of tags and instruct the browser to display them on separate lines (Hassan-Montero and Herrero-Solana, 2006). In our system, tags are sent to the browser as an ordered list, using the convention that successive tags are similar and should appear nearby. Given a similarity measure $w$ between tags, we want to minimize $\sum_{p,q} w(p,q)\, d(p,q)$ where $d(p,q)$ is a distance function between the two tags in the list and the sum is over all tags. Ideally, $d(p,q)$ should be the physical distance between the tags as they appear in the browser; we model this distance with the index distance: if tag a appears at index i in the list and tag b appears at index j, their distance is the integer $|i-j|$. This optimization problem is an instance of the NP-complete MINIMUM LINEAR ARRANGEMENT (MLA) problem: an optimal linear arrangement of a graph $G=(V,E)$ with $N$ vertices is a map $f$ from $V$ onto $\{1,2,\ldots,N\}$ minimizing $\sum_{(u,v) \in E} |f(u)-f(v)|$.

Proposition 1 The browser-oblivious tag-cloud optimization problem is NP-Complete.

There is an $O(\sqrt{\log n}\,\log\log n)$-approximation for the MLA problem (Feige and Lee, 2007) in some instances. However, for our generic purposes, the greedy NEAREST NEIGHBOR (NN) algorithm might suffice: insert any tag in an empty list, then repeatedly append a tag most similar to the latest tag in the list, until all tags have been inserted. It runs in $O(n^2)$ time where n is the number of tags.
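A minimal Python sketch of the NN heuristic and of the cost it tries to reduce is given below. Several choices are assumptions made for the illustration: the cost sums each unordered pair of tags once (summing ordered pairs would only double it), the first tag of the input is used as the starting tag, ties are broken by input order, and the similarity is a toy function that returns 1 when two "City-Product" tags of Figure 4 belong to the same country. The names `mla_cost`, `nearest_neighbor`, `same_country` and `COUNTRY_OF` are hypothetical.

```python
def mla_cost(order, sim):
    """Sum of sim(p, q) * |i - j| over all pairs of tags at indexes i and j."""
    cost = 0.0
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            cost += sim(order[i], order[j]) * (j - i)
    return cost


def nearest_neighbor(tags, sim):
    """Greedy NN: start from the first tag, then repeatedly append the
    remaining tag most similar to the latest tag in the list."""
    remaining = list(tags)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        latest = order[-1]
        best = max(remaining, key=lambda t: sim(latest, t))
        remaining.remove(best)
        order.append(best)
    return order


# Toy similarity (assumption): tags are similar when their cities share a country.
COUNTRY_OF = {
    "Quebec": "Canada", "Ontario": "Canada", "Montreal": "Canada",
    "Detroit": "US", "New York": "US", "Paris": "France", "Lyon": "France",
}


def same_country(a, b):
    city_a, city_b = a.rsplit("-", 1)[0], b.rsplit("-", 1)[0]
    return 1.0 if COUNTRY_OF[city_a] == COUNTRY_OF[city_b] else 0.0


unordered = [
    "Quebec-dress", "Detroit-dress", "Paris-table", "Ontario-dress",
    "Paris-shoe", "Montreal-shoe", "Lyon-dress", "New York-chair",
]
clustered = nearest_neighbor(unordered, same_country)
```

On this toy input the reordered list groups the tags country by country, and its cost drops from 24 to 9. Each step scans the remaining tags once, which gives the quadratic running time quoted above.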
Another heuristic for the MLA problem is the PAIRWISE EXCHANGE MONTE CARLO (PWMC) method (Bhasker and Sahni, 1987): after applying NN, you repeatedly consider the exchange of two tags chosen at random, permuting them if it reduces the MLA cost. Another MONTE CARLO (MC) heuristic begins with the application of NN (Johnson et al., 2004): cut the list into two blocks at a random location, test if exchanging the two blocks reduces the MLA cost, if so proceed; repeat.

Additional display hints can be inserted in this list. For example, if two tags must absolutely be very close to each other, a GLUED token could be inserted. Also, if two tags can be permuted freely in the list, then a PERMUTABLE token could be inserted: the list could take the form of a PQ tree (Booth and Lueker, 1976).

6 EXPERIMENTS

Throughout these experiments, we used the Java version 1.6.0_02 from Sun Microsystems Inc. on an Apple MacPro machine with 2 Dual-Core Intel Xeon processors running at 2.66 GHz and 2 GiB of RAM.

6.1 Iceberg-Based Computation

To validate the generation of tag clouds from icebergs, we have run tests over the US Income 2000 data set (Hettich and Bay, 2000) (42 dimensions and about $2 \times 10^5$ facts) as well as a synthetic data set (18 dimensions and $2 \times 10^4$ facts) provided by Swivel (http://www.swivel.com/data_sets/show/1002247). Figure 6 shows that while some tag-cloud computations require several minutes, iceberg-based computations can be much faster.

Figure 6: Computing tag clouds from original data vs. icebergs: iceberg limit value set at 150 and tag-cloud size is 9 (US Income 2000). X axis: number of dimensions (3 to 11); Y axis: time in seconds, logarithmic scale; series: Original data, Iceberg.

From each data set, we generated a 4-dimensional data cube. We used the COUNT function to aggregate data. Tag clouds were computed from each data cube using the iceberg approximation with different values of limit: the number of facts retained. We also implemented exact computations using temporary tables. We specified different values for tag-cloud size, limiting the maximum number of tags. For each iceberg limit value and tag-cloud size, we computed the entropy of the tag cloud, the false-positive and false-negative indexes, and the processing time for both the iceberg approximation and the exact computation.

We plotted in Figure 7 the false-positive and false-negative indexes as a function of the relative entropy (entropy/log(tag-cloud size)) using various iceberg limit values (150, 600, 1200, 4800, and 19600) and various tag-cloud sizes (50, 100, 150, and 200), for a total of 20 tag clouds per dimension. The Y axis is in a logarithmic scale. Points having their indexes equal to zero are not displayed. As discussed in Section 4, false-positive and false-negative indexes should be low when the entropy is low. We verify that for low-entropy values (below $\frac{3}{4}\log(\text{tag-cloud size})$), the indexes are always close to zero which indicates a good approximation. Meanwhile, small iceberg cuboids can be processed much faster.

6.2 Similarity Computation

Using our two data sets, we tested the NN, PWMC, and MC heuristics using both the cosine and the Tanimoto similarity measures. From data cubes made of all available dimensions, we used all possible 1-tag clouds, using successively all other dimensions as clustering dimension for a total of $2 \times (18 \times 17 + 42 \times 41) = 4056$ layout optimizations. The iceberg limit value was set at 150. The MC heuristic never fared better than NN, even when considering a very large number of random block permutations: we rejected this heuristic as ineffective. However, as Figure 8 shows, the PWMC heuristic can sometimes significantly outperform NN when a large number (1000) of tag exchanges are considered, but it only outperforms NN by more than 20% in less than 5% of all layout optimizations. Meanwhile, PWMC can be several orders of magnitude slower than NN: NN is 10 times faster than PWMC with 100 exchanges and 70 times faster than PWMC with 1000 exchanges. Computing the similarity function over an iceberg cuboid was moderately expensive (0.07 s) for a small iceberg cuboid (limit set to 150 cells): the exact computation of the similarity function can dwarf the cost of the heuristics (NN and PWMC) over a moderately large data set. Informal tests suggest that NN computed over a small iceberg cuboid provides significant visual layouts.
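For reference, the two similarity measures compared in this section (defined in Section 3.3) can be sketched in Python as below. The sketch works on plain vectors; as the paper notes, the cuboid versions replace the single summation by one summation per dimension. Returning 0 for all-zero vectors is an assumption of this sketch, not something the paper specifies, and the function names are hypothetical.

```python
import math


def cosine(v, w):
    """Cosine similarity: sum(v_i * w_i) / sqrt(sum(v_i^2) * sum(w_i^2))."""
    dot = sum(a * b for a, b in zip(v, w))
    vv = sum(a * a for a in v)
    ww = sum(b * b for b in w)
    if vv == 0 or ww == 0:
        return 0.0  # assumed convention for an empty subcuboid
    return dot / math.sqrt(vv * ww)


def tanimoto(v, w):
    """Tanimoto similarity: dot / (sum(v_i^2) + sum(w_i^2) - dot)."""
    dot = sum(a * b for a, b in zip(v, w))
    denominator = sum(a * a for a in v) + sum(b * b for b in w) - dot
    if denominator == 0:
        return 0.0  # both vectors are all zeros (assumed convention)
    return dot / denominator


# With binary vectors, Tanimoto coincides with the Jaccard similarity:
# the sets {0, 1} and {0, 2} share one element out of three.
v = [1, 1, 0]
w = [1, 0, 1]
jaccard_like = tanimoto(v, w)
```

Both functions run in time linear in the vector length, so the cost of a layout optimization is dominated by how many pairs of tags are compared.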
Figure 7: False-negative and false-positive indexes (0 is best, 1 is worst), values under 0.0001 are not included. Panels: (a) Swivel; (b) US Income 2000. X axis: entropy/log(tag-cloud size), from 0 to 1; Y axis: false-positive and false-negative indexes, logarithmic scale. Series are labelled by dimension, for example State (52) in panel (a) and Country of birth (43) in panel (b).
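The quantities plotted in Figure 7 follow Definitions 2 and 3 and can be computed with a short Python sketch. The assumptions are labelled here and in the comments: tag clouds are term-to-weight dictionaries with at least one positive weight, the logarithm is natural (the relative entropy does not depend on the base), and an index is taken to be 0 when no tag is wrongly added or missing. The function names and the sample clouds are hypothetical.

```python
import math


def entropy(cloud):
    """Entropy of a tag cloud (Definition 2), using natural logarithms."""
    total = sum(cloud.values())
    probabilities = [w / total for w in cloud.values() if w > 0]
    return -sum(p * math.log(p) for p in probabilities)


def relative_entropy(cloud):
    """entropy / log(tag-cloud size); equals 1 for uniform weights."""
    if len(cloud) == 1:
        return 0.0  # a single tag has zero entropy (assumed convention)
    return entropy(cloud) / math.log(len(cloud))


def false_positive_index(approx, exact):
    """Largest weight wrongly added to the approximate cloud, normalized."""
    extra = [w for t, w in approx.items() if t not in exact]
    return max(extra, default=0) / max(approx.values())


def false_negative_index(approx, exact):
    """Largest weight missing from the approximate cloud, normalized."""
    missing = [w for t, w in exact.items() if t not in approx]
    return max(missing, default=0) / max(exact.values())


# Hypothetical clouds: the approximation misses "chair" and adds "table".
exact_cloud = {"shoe": 100, "dress": 60, "chair": 20}
approx_cloud = {"shoe": 100, "dress": 60, "table": 10}
fp = false_positive_index(approx_cloud, exact_cloud)  # 10 / 100
fn = false_negative_index(approx_cloud, exact_cloud)  # 20 / 100
```

A uniform cloud has relative entropy 1, which is the noninformative case illustrated by Figure 5.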
Figure 8: MLA costs for two examples: the PWMC heuristic was applied using 10, 100 and 1000 random exchanges. Panels: (a) Displaying dimension “Given name” and clustering by “State” (Swivel); (b) Displaying dimension “HHDFMX” and clustering by “ARACE” (US Income 2000). Y axis: MLA cost; bars are grouped by similarity measure (COSINE, TANIMOTO); series: No Clustering, NN, PWMC 10, PWMC 100, PWMC 1000.
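The PWMC heuristic whose costs are reported in Figure 8 can be sketched in Python as follows. This is a naive illustration under stated assumptions: it recomputes the full cost after every candidate exchange, it takes the starting order as an argument (the paper starts from the NN ordering), and it uses a seeded generator so that runs are repeatable. The names `mla_cost`, `pwmc` and the toy similarity `same_initial` are hypothetical.

```python
import random


def mla_cost(order, sim):
    """Sum of sim(p, q) * |i - j| over all pairs of tags at indexes i and j."""
    cost = 0.0
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            cost += sim(order[i], order[j]) * (j - i)
    return cost


def pwmc(order, sim, exchanges, seed=0):
    """Pairwise Exchange Monte Carlo: try random swaps of two tags and
    keep a swap only when it reduces the MLA cost."""
    order = list(order)
    if len(order) in (0, 1):
        return order  # nothing to exchange
    rng = random.Random(seed)
    best = mla_cost(order, sim)
    for _ in range(exchanges):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        cost = mla_cost(order, sim)
        if best > cost:
            best = cost  # keep the improving exchange
        else:
            order[i], order[j] = order[j], order[i]  # undo the exchange
    return order


def same_initial(a, b):
    """Toy similarity (assumption): tags sharing a first letter are similar."""
    return 1.0 if a[0] == b[0] else 0.0


start = ["apple", "banana", "avocado", "blueberry", "apricot", "blackberry"]
improved = pwmc(start, same_initial, exchanges=100)
```

Because rejected exchanges are undone, the cost of the returned list never exceeds the cost of the starting list. The repeated full cost evaluation is what makes this naive variant much slower than NN.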
7 CONCLUSION

According to our experimental results, precomputing a single iceberg cuboid per data cube allows us to generate adequate approximate tag clouds online. Combined with modern Web technologies such as AJAX and JSON, it provides a responsive application. However, we plan to make more precise the relationship between iceberg cubes, entropy, dimension sizes, and our quality indexes. Yet another approach to compute tag clouds quickly may be to use a bitmap index (O'Neil and Quass, 1997). While we built a Web 2.0 application with support for numerous collaboration features such as permalinks and tag-cloud embeddings with iframe elements, we still need to experiment with live users. Our approach to multidimensional tag clouds has been to rely on k-tags. However, this approach might not be appropriate when a dimension has a linear flow such as time or latitude. A more appropriate approach is to allow the use of a slider (Russell, 2006) tying several tag clouds, each one corresponding to a given attribute value.

ACKNOWLEDGMENTS

The second author is supported by NSERC grant 261437 and FQRNT grant 112381. The third author is supported by NSERC grant OGP0009184 and FQRNT grant PR-119731. The authors wish to thank Owen Kaser from UNB for his contributions.

REFERENCES

Alon, N., Matias, Y., and Szegedy, M. (1996). The space complexity of approximating the frequency moments. In STOC '96, pages 20–29.

Ben Messaoud, R., Boussaid, O., and Loudcher Rabaséda, S. (2006). Efficient multidimensional data representations based on multiple correspondence analysis. In KDD '06, pages 662–667.

Bhasker, J. and Sahni, S. (1987). Optimal linear arrangement of circuit components. J. VLSI Comp. Syst., 2(1):87–109.

Body, M., Miquel, M., Bédard, Y., and Tchounikine, A. (2002). A multidimensional and multiversion structure for OLAP applications. In DOLAP '02, pages 1–6.

Booth, K. S. and Lueker, G. S. (1976). Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. Journal of Computer and System Sciences, 13:335–379.

Butler, D. (2007). Data sharing: the next generation. Nature, 446(7131):1–10.

Carey, M. J. and Kossmann, D. (1997). On saying “enough already!” in SQL. In SIGMOD '97, pages 219–230.

Chazelle, B. (1988). A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput., 17(3):427–462.

Chen, Q., Dayal, U., and Hsu, M. (2000). OLAP-based data mining for business intelligence applications in telecommunications and e-commerce. In DNIS '00, pages 1–19.

Chung, Y., Yang, W., and Kim, M. (2007). An efficient, robust method for processing of partial top-k/bottom-k queries using the RD-tree in OLAP. Decision Support Systems, 43(2):313–321.

Codd, E. (1993). Providing OLAP (on-line analytical processing) to user-analysis: an IT mandate. Technical report, E. F. Codd and Associates.

Cormode, G. and Muthukrishnan, S. (2005). What's hot and what's not: tracking most frequent items dynamically. ACM Trans. Database Syst., 30(1):249–278.

Donjerkovic, D. and Ramakrishnan, R. (1999). Probabilistic optimization of top n queries. In VLDB '99, pages 411–422.

Feige, U. and Lee, J. R. (2007). An improved approximation ratio for the minimum linear arrangement problem. Inf. Process. Lett., 101(1):26–29.

Gray, J., Bosworth, A., Layman, A., and Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In ICDE '96, pages 152–159.

Green, T. J., Karvounarakis, G., Taylor, N. E., Biton, O., Ives, Z. G., and Tannen, V. (2007). ORCHESTRA: facilitating collaborative data sharing. In SIGMOD '07, pages 1131–1133, New York, NY, USA. ACM.

Hassan-Montero, Y. and Herrero-Solana, V. (2006). Improving tag-clouds as visual information retrieval interfaces. In InSciT '06.

Havenstein, H. (2003). BI vendors seek to tap end-user power: New class of tools built to reap user knowledge for customizing analytic applications. InfoWorld, 22:20–21.

Heer, J., Viégas, F. B., and Wattenberg, M. (2007). Voyagers and voyeurs: supporting asynchronous collaborative information visualization. In CHI '07, pages 1029–1038.

Hettich, S. and Bay, S. D. (2000). The UCI KDD archive. http://kdd.ics.uci.edu. [Online; accessed 21/12/2007].

IBM (2007). Many Eyes. http://services.alphaworks.ibm.com/manyeyes/. [Online; accessed 7-6-2007].

Jaffe, A., Naaman, M., Tassa, T., and Davis, M. (2006). Generating summaries and visualization for large collections of geo-referenced photographs. In MIR '06, pages 89–98.

Johnson, D., Krishnan, S., Chhugani, J., Kumar, S., and Venkatasubramanian, S. (2004). Compressing large boolean matrices using reordering techniques. In VLDB '04, pages 13–23.

Kaser, O. and Lemire, D. (2007). Tag-cloud drawing: Algorithms for cloud visualization. In WWW 2007 – Tagging and Metadata for Social Information Organization.

Loh, Z., Ling, T., Ang, C., and Lee, S. (2002a). Adaptive method for range top-k queries in OLAP data cubes. In DEXA '02, pages 648–657.

Loh, Z. X., Ling, T. W., Ang, C. H., and Lee, S. Y. (2002b). Analysis of pre-computed partition top method for range top-k queries in OLAP data cubes. In CIKM '02, pages 60–67.

Luo, Z., Ling, T., Ang, C., Lee, S., and Cui, B. (2001). Range top/bottom k queries in OLAP sparse data cubes. In DEXA '01, pages 678–687.

Maniatis, A., Vassiliadis, P., Skiadopoulos, S., Vassiliou, Y., Mavrogonatos, G., and Michalarias, I. (2005). A presentation model & non-traditional visualization for OLAP. International Journal of Data Warehousing and Mining, 1:1–36.

Millen, D. R., Feinberg, J., and Kerr, B. (2006). Dogear: Social bookmarking in the enterprise. In CHI '06, pages 111–120.

Morzy, T. and Wrembel, R. (2004). On querying versions of multiversion data warehouse. In DOLAP '04, pages 92–101.

O'Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD '97, pages 38–49.

Poon, C. (2003). Dynamic orthogonal range queries in OLAP. Theoretical Computer Science, 296(3):487–510.

Rivadeneira, A. W., Gruen, D. M., Muller, M. J., and Millen, D. R. (2007). Getting our head in the clouds: toward evaluation studies of tagclouds. In CHI '07, pages 995–998.

Russell, T. (2006). cloudalicious: folksonomy over time. In JCDL '06, pages 364–364.

Seokjin, H., Moon, B., and Sukho, L. (2005). Efficient execution of range top-k queries in aggregate r-trees. IEICE – Transactions on Information and Systems, E88-D(11):2544–2554.

Swivel, Inc (2007). Swivel. http://www.swivel.com. [Online; accessed 7-6-2007].

Taylor, N. E. and Ives, Z. G. (2006). Reconciling while tolerating disagreement in collaborative data sharing. In SIGMOD '06, pages 13–24, New York, NY, USA. ACM.

Techapichetvanich, K. and Datta, A. (2005). Interactive visualization for OLAP. In ICCSA '05, pages 206–214.

Wattenberg, M. and Kriss, J. (2006). Designing for social data analysis. IEEE Transactions on Visualization and Computer Graphics, 12(4):549–557.

Wu, P., Sismanis, Y., and Reinwald, B. (2007). Towards keyword-driven analytical processing. In SIGMOD '07, pages 617–628.