Efficiently Indexing and Querying Big Data in Hadoop MapReduce

Faculty of Informatics - Student Administration Offices

Start date: 22 October 2012

End date: 23 October 2012

The Faculty of Informatics is pleased to announce a seminar given by Jens Dittrich

DATE: Monday, October 22nd, 2012
PLACE: USI Università della Svizzera italiana, room SI-006, Informatics building
TIME: 9.30

ABSTRACT:
Big data analytics has become a reality for many companies, and processing datasets on the order of terabytes or even petabytes is a clear need for many users. In this context, Hadoop MapReduce is a big data processing framework that has rapidly become the de facto standard in both industry and academia. The main reasons for this popularity are the ease of use, scalability, and failover properties of Hadoop MapReduce. However, these features come at a price: the performance of Hadoop MapReduce is usually far from that of a well-tuned parallel database. In previous work, we introduced Hadoop++ [VLDB 2010] and Trojan Layouts [ACM SOCC 2011] to speed up MapReduce jobs by up to a factor of 20. However, Hadoop++ requires long index creation and data preparation steps before any query can be executed. To address this, we propose HAIL (Hadoop Aggressive Indexing Library [VLDB 2012a]; see also our tutorial on Big Data [VLDB 2012b]), an enhancement of HDFS and Hadoop MapReduce. HAIL dramatically improves the runtimes of several classes of MapReduce jobs. It changes the upload pipeline of HDFS in order to create a different clustered index on each data block replica. An interesting feature of HAIL is that it typically creates a win-win situation: it improves both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six different clusters, including physical as well as EC2 clusters of up to 100 nodes. A series of scalability experiments demonstrates the superiority of HAIL.

BIO:
Jens Dittrich is an Associate Professor of Computer Science/Databases at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a CS teaching award for database systems, as well as several presentation and science slam awards. His research focuses on fast access to big data.

HOST: Prof. Mauro Pezzè