Filter-Based Fuzzy Big Joins

16 May 2018


In the Filter-Based Fuzzy Big Joins research project, which is part of my PhD studies, we aim to improve existing fuzzy join algorithms in distributed and parallel frameworks. We compare and evaluate the algorithms analytically and validate the results with real datasets.


A fuzzy join query returns all pairs of tuples, drawn from one or several relations, whose distance is less than or equal to a prespecified threshold $\varepsilon$. In this project, we run a set of fuzzy join queries with many different algorithms and many different thresholds.
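The definition above can be illustrated with a minimal, single-machine sketch. It assumes Hamming distance over equal-length strings as the distance function (the measure named in the publication listed below); the names `hamming` and `fuzzy_join` and the sample data are hypothetical and are not taken from the project's code. The sketch compares every pair, which is the naive baseline rather than any of the distributed algorithms studied here.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def fuzzy_join(r, s, eps, dist=hamming):
    """Return every pair (x, y) from r x s with dist(x, y) <= eps.

    Naive nested-loop version: it evaluates all |r| * |s| pairs.
    """
    return [(x, y) for x, y in product(r, s) if dist(x, y) <= eps]


# Hypothetical toy relations (illustration only).
R = ["10110", "00000", "11111"]
S = ["10111", "00011"]

for eps in (1, 2):
    print(eps, fuzzy_join(R, S, eps))
```

Raising the threshold from 1 to 2 admits one more pair in this toy input, which shows why the same query is run with several thresholds.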

Resources used


80 x 2.4 GHz CPU, 160 GB RAM, and 1 TB disk


  • Ubuntu 14.04.3 LTS
  • Java 1.8.0_162


  • Hadoop 2.7.1
  • Spark 1.6.1

Datasets used

We use real datasets from Semsoft and other benchmark datasets.

Queries used

An example fuzzy join query:

FB(outlet_address_street_1) $\bowtie_{Sim \le d}$ GGP(outlet_address_street_1)
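As a rough sketch of what this query computes, the snippet below joins two small relations on the `outlet_address_street_1` attribute. Several things here are assumptions: `Sim` is treated as a distance, the distance is taken to be Levenshtein edit distance (the page does not say which measure this query uses), and the rows in `FB` and `GGP`, along with the helper names `edit_distance` and `fuzzy_join_on`, are invented for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_join_on(left, right, attr, d):
    """Pairs of records whose values of `attr` are within distance d."""
    return [(l, r) for l in left for r in right
            if edit_distance(l[attr], r[attr]) <= d]


# Hypothetical rows; the real FB and GGP relations are not shown on this page.
FB = [
    {"id": 1, "outlet_address_street_1": "12 rue de Kerampont"},
    {"id": 2, "outlet_address_street_1": "6 rue Victor Hugo"},
]
GGP = [
    {"id": "a", "outlet_address_street_1": "12 rue de Kerampon"},
    {"id": "b", "outlet_address_street_1": "60 avenue Foch"},
]

matches = fuzzy_join_on(FB, GGP, "outlet_address_street_1", d=2)
print([(l["id"], r["id"]) for l, r in matches])
```

Only the pair of addresses that differ by a single missing character falls within the threshold `d=2`.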

We also use other kinds of fuzzy join queries, such as multiway joins and recursive joins.


Thi-To-Quyen TRAN, Thuong-Cang PHAN, Anne Laurent, Laurent D’Orazio. Improving Hamming distance based fuzzy join in Map Reduce using Bloom Filters. 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Rio de Janeiro, Brazil, July 8-13, 2018.

Quyen Tran Thi To

PhD student, ENSSAT