Filter-Based Fuzzy Big Joins

16 May 2018

Summary

In the filter-based fuzzy big joins research project, part of my PhD study, we aim to improve existing fuzzy join algorithms in distributed and parallel frameworks. We compare and evaluate the algorithms analytically, then validate the results on real datasets.

Excerpt

A fuzzy join query combines all pairs of tuples, drawn from one or several relations, whose distance is lower than or equal to a prespecified threshold $\varepsilon$. In this project, we run several fuzzy join queries with many different algorithms and many different thresholds.
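The definition above can be illustrated with a minimal, single-machine sketch. It assumes Hamming distance over equal-length strings (the distance named in the publication listed on this page) and a naive nested-loop strategy; the names `hamming`, `fuzzy_join`, `R`, and `S` are hypothetical and chosen only for illustration. It is not the distributed algorithm studied in the project.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def fuzzy_join(r, s, eps):
    """Naive nested-loop fuzzy join: keep every pair with distance <= eps."""
    return [(x, y) for x, y in product(r, s) if hamming(x, y) <= eps]


# Toy relations (illustrative data only, not from the project's datasets).
R = ["10110", "00000", "11111"]
S = ["10111", "00011"]

print(fuzzy_join(R, S, eps=1))
print(fuzzy_join(R, S, eps=2))
```

Raising `eps` can only add pairs to the result, which is why the same query is worth running at several thresholds. The nested loop compares every pair, so its cost grows with the product of the relation sizes.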

Resources used

Hardware

80 x 2.4 GHz CPU, 160 GB RAM, and 1 TB disk

Software

  • Ubuntu 14.04.3 LTS
  • Java 1.8.0_162

Framework

  • Hadoop 2.7.1
  • Spark 1.6.1

Dataset used

We use real datasets from Semsoft and other benchmark datasets.

Queries used

An example fuzzy join query:

FB(outlet_address_street_1) $\bowtie_{Sim \le d}$ GGP(outlet_address_street_1)

Here $d$ plays the role of the threshold $\varepsilon$ from the excerpt.
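As a rough illustration of how a filter might prune such a query, the sketch below builds a Bloom filter over every string within Hamming distance $d$ of the GGP keys, then discards FB keys that the filter rejects before doing the exact comparison. Several things here are assumptions rather than details from the project: the join attribute is taken to be already encoded as fixed-length binary codes, the Bloom filter sizing is arbitrary, and `BloomFilter`, `ball`, and `filtered_fuzzy_join` are hypothetical names. It is one plausible filtering scheme, not necessarily the one used in the publication.

```python
import hashlib
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class BloomFilter:
    """Minimal Bloom filter; size and hash count are arbitrary assumptions."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def ball(key: str, d: int):
    """Yield every binary string within Hamming distance d of key."""
    for r in range(d + 1):
        for idx in combinations(range(len(key)), r):
            chars = list(key)
            for i in idx:
                chars[i] = "1" if chars[i] == "0" else "0"
            yield "".join(chars)


def filtered_fuzzy_join(fb, ggp, d):
    """Prune FB keys with a Bloom filter, then verify survivors exactly."""
    bf = BloomFilter()
    for key in ggp:
        for neighbour in ball(key, d):
            bf.add(neighbour)
    # The filter has no false negatives, so no true match is lost here;
    # false positives are removed by the exact distance check below.
    candidates = [x for x in fb if x in bf]
    return [(x, y) for x, y in product(candidates, ggp) if hamming(x, y) <= d]


# Toy binary codes standing in for encoded street addresses (illustrative only).
fb_codes = ["10110", "00000", "11111"]
ggp_codes = ["10111", "00011"]
result = filtered_fuzzy_join(fb_codes, ggp_codes, d=1)
print(result)
```

Because the exact check runs after the filter, the output matches a plain nested-loop join; the filter only reduces how many FB keys reach that check. The ball around each key grows quickly with $d$ and the code length, so this scheme is practical only for small thresholds.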

We also use other fuzzy join queries, such as multiway joins and recursive joins.

Publication

Thi-To-Quyen TRAN, Thuong-Cang PHAN, Anne Laurent, Laurent D’Orazio. Improving Hamming distance based fuzzy join in Map Reduce using Bloom Filters. 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Rio de Janeiro, Brazil, July 8-13, 2018.


Author:
Quyen Tran Thi To

(tttquyen01@gmail.com)
PhD student, ENSSAT