Table of Contents

An Adaptive Partitioning Scheme for Ad-hoc and Time-varying Database Analytics
by
Anil Shanbhag
B.Tech. in Computer Science, Indian Institute of Technology Bombay, 2014

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2016

© Massachusetts Institute of Technology 2016. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 19, 2016
Certified by: Samuel Madden, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Students

Abstract

Data partitioning significantly improves query performance in distributed database systems. A large number of techniques have been proposed to efficiently partition a dataset, often focusing on finding the best partitioning for a particular query workload. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload. Furthermore, workloads change over time as businesses evolve or as analysts gain better understanding of their data. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this thesis, we present Amoeba, an adaptive distributed storage system for data skipping. It does not require an upfront query workload and adapts the data partitioning according to the queries posed by users over time. We present the data structures, partitioning algorithms, and an efficient implementation on top of Apache Spark and HDFS. Our experimental results show that the Amoeba storage system provides improved query performance for ad-hoc workloads, adapts to changes in the query workloads, and converges to a steady state in case of recurring workloads. On a real world workload, Amoeba reduces the total workload runtime by 1.8x compared to Spark with data partitioned and 3.4x compared to unmodified Spark.

Thesis Supervisor: Samuel Madden
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

I would like to thank Alekh Jindal, Qui Nguyen, Aaron Elmore, Jorge Quiane and Divyakanth Agarwal who have contributed many ideas to this work and helped build the system. I would also like to thank Prof. Samuel Madden, my thesis supervisor, for being a constant source of guidance and feedback in this project and outside. Finally, I am always grateful to my family and friends, who encouraged me and supported me along the way.

Contents

1 Introduction
2 Related Work
3 System Overview
4 Upfront Data Partitioning
    4.1 Key Ideas
    4.2 Upfront Partitioning Algorithm
5 Adaptive Repartitioning
    5.1 Workload Monitor and Cost Model
    5.2 Partitioning Tree Transformations
    5.3 Divide-And-Conquer Repartitioning
    5.4 Handling Multiple Predicates
6 Implementation
    6.1 Initial Robust Partitioning
    6.2 Query Execution
7 Discussion
    7.1 Leveraging Replication
    7.2 Handling Joins
8 Evaluation
    8.1 Upfront Partitioning Performance
    8.2 Micro-benchmarks
    8.3 Amoeba on Real Workload
9 Conclusion
Appendices
A Fast Remote Reads

List of Figures

1-1 Example partitioning tree with 8 blocks
3-1 Amoeba Architecture
4-1 Partitioning Techniques.
4-2 Upfront Partitioning Algorithm Example.
5-1 Node swap in the partitioning tree.
5-2 Illustrating adaptive partitioning when predicate A2 appears repeatedly.
5-3 Node pushdown in partitioning tree.
5-4 Node rotation in partitioning tree.
7-1 Heterogenous Replication.
8-1 Ad-hoc query runtimes for different attributes of TPC-H lineitem.
8-2 Comparing the upload time in Amoeba with HDFS
8-3 Comparing performance of upfront partition tree vs kd-tree
8-4 Query runtimes for changing query attributes on TPC-H lineitem.
8-5 Query runtimes for changing predicates on the same attribute of TPC-H lineitem.
8-6 Cumulative Optimizer Runtime Across 100 queries
8-7 Cumulative Repartitioning Cost
8-8 Total runtimes of the different approaches
A-1 Response time with varying data locality (%)

Description:

... suitable for such settings. In this thesis, we present Amoeba, an adaptive distributed ... Adaptive partitioning/indexing is extensively used in modern single node in-memory column stores ... Similar to MAGIC, both Oracle and MySQL ...

An Adaptive Partitioning Scheme for Ad-hoc and Time-varying Database Analytics Anil Shanbhag PDF

62 Pages·2016·0.75 MB·English


