This work aims to provide the first parallel RDF Graph Summarization algorithm based on approximate pattern mining, for summarizing large RDF Knowledge Bases.
RDF Graph Summarization is becoming increasingly important with the rise of Big Data and especially Linked Open Data, which allow large, possibly distributed Knowledge Bases (KBs) to be linked and processed together. Querying all the interlinked KBs can become very expensive, even though the results may come from only a few of them. We therefore propose to first query the summary of each KB, to determine whether the required information exists there and to what extent, before proceeding with query processing. Given the size and complexity of these KBs, a parallel approach seems well suited to this problem, since the methods used are increasingly memory-based and the data usually cannot fit on a single machine.
Speed and scalability are two additional concerns that we want to address.
Up to 128 cores running VMs with 2 GB to 4 GB of memory
Hadoop / Java
A variety of datasets, ranging from small ones to larger ones such as DBpedia.
Standard Hadoop execution
Increased speedup in the processing of the summarization algorithms (including the underlying approximate mining algorithms), while verifying the correctness of the results and comparing them with those of the sequential version.