Journal Home Online First Current Issue Archive For Authors Journal Information 中文版

Frontiers of Information Technology & Electronic Engineering >> 2017, Volume 18, Issue 10 doi: 10.1631/FITEE.1601347

FrepJoin: an efficient partition-based algorithm for edit similarity join

. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.. Guangdong Key Laboratory of Popular High Performance Computers, Key Laboratory of Service Computing and Application, Shenzhen 518000, China

Available online: 2018-01-18

Next Previous

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Related Research