The Hytormo Project

16 May 2018

Summary

We propose a hybrid storage model combining row and column stores, called HYTORMO, together with data storage and query processing strategies. First, HYTORMO is designed and implemented to be deployed in a large-scale environment, making it possible to manage big medical data. Second, the data storage strategy combines vertical partitioning with hybrid storage to create data storage configurations that reduce storage space demand and increase query performance for given workloads. To obtain such a configuration, we propose two database design approaches: (1) expert-based design and (2) automated design.
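
As an illustration of the expert-based approach, the sketch below (in Scala) shows what a data storage configuration could look like: a few DICOM attributes are grouped into column groups, and each group is assigned either a row-oriented or a column-oriented layout. The group names, attribute selections and layout decisions are assumptions made for the example, not the actual HYTORMO configuration.

    // Hypothetical storage configuration sketch: groupings and layouts are illustrative only.
    sealed trait Layout
    case object RowStore    extends Layout
    case object ColumnStore extends Layout

    case class ColumnGroup(name: String, attributes: Seq[String], layout: Layout)

    val configuration = Seq(
      // Frequently co-accessed, mostly non-null attributes kept together in a row store.
      ColumnGroup("PatientCore",
        Seq("PatientID", "PatientName", "PatientBirthDate"), RowStore),
      // Sparse, analytics-oriented attributes stored column-wise to save space and scan time.
      ColumnGroup("ImagePixelInfo",
        Seq("Rows", "Columns", "BitsAllocated", "PhotometricInterpretation"), ColumnStore)
    )

    configuration.foreach(g =>
      println(s"${g.name}: ${g.attributes.mkString(", ")} -> ${g.layout}"))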

In the former approach, experts (e.g., database designers) manually create data storage configurations by grouping the attributes of DICOM data and selecting a suitable data storage layout for each column group. In the latter approach, we propose a hybrid automated design framework, called HADF. HADF relies on similarity measures between attributes that take into account the combined impact of workload- and data-specific information to automatically group attributes into column groups and select suitable data storage layouts. Finally, we propose an efficient query processing strategy built on top of HYTORMO. It uses both inner joins and left-outer joins between vertically partitioned tables, avoiding the data loss that would occur if only inner joins were used. Furthermore, an Intersection Bloom filter is applied to remove irrelevant tuples from the input tables of join operations. This reduces disk I/O, network communication, and CPU cost, and thus improves query performance.
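
To illustrate the idea behind the Intersection Bloom filter, the minimal Scala sketch below builds one Bloom filter per input table over the join key, intersects the filters bitwise, and uses the result to discard tuples that cannot contribute to the join. The filter size, hash seeds and table contents are hypothetical; this is a sketch of the general technique, not the HYTORMO implementation.

    import scala.collection.mutable
    import scala.util.hashing.MurmurHash3

    // Minimal Bloom filter; all filters share the same size and hash seeds so that
    // their bit arrays can be ANDed together to form an Intersection Bloom filter.
    class BloomFilter(val numBits: Int, val seeds: Seq[Int]) {
      val bits = new mutable.BitSet(numBits)
      private def positions(key: String): Seq[Int] =
        seeds.map(s => ((MurmurHash3.stringHash(key, s) % numBits) + numBits) % numBits)
      def add(key: String): Unit = positions(key).foreach(bits += _)
      def mightContain(key: String): Boolean = positions(key).forall(bits.contains)
      def intersect(other: BloomFilter): BloomFilter = {
        val r = new BloomFilter(numBits, seeds)
        r.bits ++= (bits & other.bits)   // keys surviving here are (probably) in both tables
        r
      }
    }

    object IntersectionBloomFilterDemo extends App {
      // Two vertically partitioned tables sharing a join key (hypothetical patient IDs).
      val patients = Map("P1" -> "Alice", "P2" -> "Bob", "P3" -> "Carol")
      val studies  = Map("P2" -> "2015-04-01", "P3" -> "2016-11-23", "P4" -> "2017-02-14")

      val seeds = Seq(17, 31, 57)
      val fPatients = new BloomFilter(1 << 16, seeds); patients.keys.foreach(fPatients.add)
      val fStudies  = new BloomFilter(1 << 16, seeds); studies.keys.foreach(fStudies.add)

      // Prune each input with the intersection filter before performing the join:
      // only keys that may appear on both sides are kept (false positives are possible).
      val ibf = fPatients.intersect(fStudies)
      val prunedPatients = patients.filter { case (k, _) => ibf.mightContain(k) }
      val prunedStudies  = studies.filter  { case (k, _) => ibf.mightContain(k) }
      println(prunedPatients.keys.toList.sorted)   // typically List(P2, P3)
    }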

Experimental protocol

We consider two scenarios, depending on the data processing mode. For each scenario, we measure the performance of the evaluated frameworks.

  • Batch Mode: In the Batch Mode scenario, we evaluate HADOOP, SPARK and FLINK by running the WordCount example on a large set of tweets. The tweets were collected with Apache Flume and stored in HDFS. The motivation behind using Apache Flume is its easy integration into the HADOOP ecosystem (especially with HDFS). Moreover, Apache Flume collects data in a distributed way and offers high data availability and fault tolerance. We collected 10 billion tweets and used them to form large tweet files whose size on disk varies from 250 MB to 100 GB (a minimal Spark WordCount sketch is given after this list).

  • Stream Mode: In the Stream Mode scenario, we evaluate the real-time data processing capabilities of STORM, FLINK and SPARK. The Stream Mode scenario is divided into three main steps. The first step is devoted to data storage: we collected 1 billion tweets from Twitter using Apache Flume and stored them in HDFS. These data are then sent to KAFKA, a messaging server that guarantees fault tolerance during streaming as well as message persistence (a sketch of a Kafka consumer for this scenario is also given after this list).
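
For the Batch Mode scenario, a minimal Spark WordCount job of the kind described above might look as follows (Scala, Spark 1.6 API); the HDFS input and output paths are placeholders, not the paths used in the experiment.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        // Read the tweet files collected by Flume from HDFS (placeholder path),
        // split each line into words, and count the occurrences of each word.
        sc.textFile("hdfs:///user/flume/tweets/*")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///user/results/wordcount")
        sc.stop()
      }
    }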
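
For the Stream Mode scenario, the sketch below shows how a Spark Streaming job could consume the tweets published to KAFKA and count words per micro-batch (Scala, receiver-based Kafka API of Spark 1.6). The ZooKeeper address, consumer group, topic name, batch interval and number of receiver threads are assumptions made for the example.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object TweetStream {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("TweetStream"), Seconds(5))
        // Receiver-based Kafka stream: Map(topic -> number of receiver threads).
        val tweets = KafkaUtils
          .createStream(ssc, "zk-host:2181", "tweet-consumers", Map("tweets" -> 2))
          .map(_._2)                                  // keep only the message payload
        // Count the words of every micro-batch and print a sample of the counts.
        tweets.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }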

Resources used

1 master node (Spark)

  • 4 CPUs (XEON E5-2630, 2.4 GHz)
  • RAM: 10 GB
  • Disk: 300 GB

8 worker nodes (Spark)

  • 2 CPUs (XEON E5-2630, 2.4 GHz)
  • RAM: 6 GB
  • Disk: 250 GB

Software used

Hadoop 2.7.1, Hive 1.2.1 and Spark 1.6.0.

Dataset used

Real DICOM datasets are collected and their attributes extracted for storage in HYTORMO.

Queries used

Three types of query workloads will be applied (illustrative example queries are sketched after this list):

  • OLAP workloads.
  • OLTP workloads.
  • Mixed OLAP and OLTP workloads.
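
As an illustration of what such queries could look like, the Scala/Spark SQL sketch below contrasts an OLAP-style aggregation with an OLTP-style point lookup; the table names, attribute names and join condition are hypothetical and are not taken from the actual workloads.

    import org.apache.spark.sql.SQLContext

    // Illustrative queries only; table and attribute names are assumptions.
    def runSampleWorkload(sqlContext: SQLContext): Unit = {
      // OLAP-style: scan and aggregate a few analytic attributes.
      sqlContext.sql(
        """SELECT Modality, COUNT(*) AS nbImages
          |FROM GeneralImage
          |GROUP BY Modality""".stripMargin).show()

      // OLTP-style: point lookup reassembling one patient's record across vertically
      // partitioned tables; the left-outer join keeps the patient even when no
      // matching study exists.
      sqlContext.sql(
        """SELECT p.PatientName, s.StudyDate
          |FROM Patient p LEFT OUTER JOIN Study s ON p.PatientID = s.PatientID
          |WHERE p.PatientID = 'P-0001'""".stripMargin).show()
    }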

Expected results

We provide experimental evaluations on real DICOM datasets to validate the benefits of the proposed methods. We expect the storage space demand to be reduced and the workload execution time to be improved.


Author:
Danh Nguyen Cong

(ncdanh@cit.ctu.edu.vn)
PhD student, Limos