Project objective
The goal of this project is to propose a unified user-defined aggregation framework for sharing aggregate data in massively distributed and parallel computing. More precisely, we focus on how to reuse cached aggregation results to compute newly arriving aggregate functions without scanning the base data, which can significantly reduce the evaluation time of queries with aggregation.
Aggregate functions are fundamental operators in data analytics workloads, and partial aggregation is a ubiquitous technique for computing aggregations in a distributed manner. Sharing common data among relational operators has been extensively studied, e.g. finding repeated computations and results of selection and group-by operators. However, aggregate operators have received little attention, especially UDA (user-defined aggregation), which can lead to significant recomputation.
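To make the idea of partial aggregation concrete, the following is a minimal sketch, not DS4A or Spark code. It assumes an average decomposed into a (sum, count) state; the function names init, accumulate, merge and finalize are illustrative only. Each partition is aggregated locally, and only the small partial states are merged.

```python
from functools import reduce

# Illustrative decomposition of AVG into a partial state (sum, count).
# All names here are hypothetical and not part of any real framework.

def init():
    # Empty partial state.
    return (0.0, 0)

def accumulate(state, value):
    # Fold one input value into a partial state.
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    # Combine two partial states computed on different partitions.
    return (left[0] + right[0], left[1] + right[1])

def finalize(state):
    # Turn the merged state into the final aggregate value.
    total, count = state
    return total / count if count else None

def partial_average(partitions):
    # Step 1: aggregate each partition independently (would run on workers).
    partials = [reduce(accumulate, part, init()) for part in partitions]
    # Step 2: merge the partial states (would run after the shuffle).
    return finalize(reduce(merge, partials, init()))

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = partial_average(partitions)
```

Because merge is associative and commutative, the partial states can be combined in any order, which is what allows the computation to be distributed.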
In DS4A, we address the problem of how to systematically share aggregate data (partial aggregate results) while processing UDAs. In particular, DS4A provides a declarative application interface for defining aggregations, from which it can generate partial aggregations and verify sharing possibilities between new aggregates and cached aggregates. We demonstrate that, in DS4A, identical query models with various UDAs can be evaluated in constant time when sharing succeeds.
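The sharing check described above can be illustrated with a small sketch. This is not the DS4A interface: the cache layout, the AGGREGATES table and the functions can_share and evaluate_from_cache are assumptions made for illustration. Each aggregate declares which cached partial states it needs, and a new aggregate is answered from the cache only when all of them are present.

```python
# Hypothetical cache of partial aggregate states per group key.
# The values correspond to the inputs 2, 4, 6, 8 for the group "g1".
cache = {
    "g1": {"count": 4, "sum": 20.0, "sum_sq": 120.0},
}

# Hypothetical declarative definitions: required partial states and
# a function that derives the final value from them.
AGGREGATES = {
    "avg": (("sum", "count"),
            lambda p: p["sum"] / p["count"]),
    "variance": (("sum", "sum_sq", "count"),
                 lambda p: p["sum_sq"] / p["count"] - (p["sum"] / p["count"]) ** 2),
    "max": (("max",),
            lambda p: p["max"]),
}

def can_share(name, group):
    # Sharing succeeds if every required partial state is already cached.
    required, _ = AGGREGATES[name]
    states = cache.get(group, {})
    return all(key in states for key in required)

def evaluate_from_cache(name, group):
    # Answer the aggregate without touching base data, or return None
    # to signal that the caller has to fall back to a full scan.
    if not can_share(name, group):
        return None
    _, derive = AGGREGATES[name]
    return derive(cache[group])

avg_value = evaluate_from_cache("avg", "g1")
variance_value = evaluate_from_cache("variance", "g1")
max_value = evaluate_from_cache("max", "g1")
```

In this sketch the cost of a successful lookup depends only on the number of cached states, not on the size of the base data, which is the intuition behind constant-time evaluation when sharing succeeds. The "max" aggregate cannot be derived from the cached states, so it would require a scan.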
Description of resources
Spark cluster: 1 master node and 6 worker nodes.
• Master node: 6 CPUs (Xeon E5-2630, 2.4 GHz), 16 GB main memory and 160 GB disk space.
• Worker nodes (each): 4 CPUs (Xeon E5-2630, 2.4 GHz), 8 GB main memory and 80 GB disk space.
Description of software and data sources
Software: Ubuntu server 16.04, Spark 2.2 and Hadoop 2.7.4.
Datasets:
• TPC-DS (scale = 100), http://www.tpc.org/tpcds/
• Milan telecommunications (319 million rows), https://doi.org/10.7910/DVN/EGZHFV
Description of the experimental procedure or key points of the project
We run two sequences of aggregate queries, covering three query models, in two scenarios, and compare the results against Spark SQL.
Description of the execution procedure
We run our experiments in two scenarios, no-sharing and sharing, and compare the evaluation time of DS4A with that of Spark SQL. We summarize our results as follows:
• When sharing is unsuccessful, DS4A evaluates queries with UDAFs 3x faster than the Spark UDAF framework.
• When sharing is successful, queries with UDAFs are evaluated in constant time.