TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching

Publication
Proc. VLDB Endow.

We propose TokenJoin, a method for linking complex records, i.e., identifying similar pairs in a collection of such records. A complex record is a set of simpler text entities, such as a set of addresses. To increase robustness, our approach uses a relaxed match criterion, the fuzzy set similarity join, which computes the similarity of two complex records from a maximum weighted bipartite matching between their elements rather than from their exact overlap.
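
To make the relaxed criterion concrete, the following minimal sketch computes a fuzzy set similarity for two small records. It is an illustration under stated assumptions, not the TokenJoin algorithm itself: the whitespace tokenizer, the use of Jaccard as the element-level similarity, the normalization M / (|R| + |S| - M), and all function names are assumptions not given in the text above, and the matching is found by brute force, which is only practical for tiny inputs.

```python
from itertools import permutations


def tokenize(text):
    """Split an element into a set of lowercase tokens (assumed tokenizer)."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed element-level similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_matching_score(r, s):
    """Score of a maximum weighted bipartite matching between r and s.

    Brute force: tries every injective assignment of the smaller side
    to the larger side. Exponential, so suitable for tiny inputs only.
    """
    small, large = (r, s) if len(r) <= len(s) else (s, r)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(jaccard(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best


def fuzzy_set_similarity(record_r, record_s):
    """Fuzzy similarity of two complex records given as lists of strings.

    Assumed normalization: M / (|R| + |S| - M), where M is the matching score.
    """
    r = [tokenize(e) for e in record_r]
    s = [tokenize(e) for e in record_s]
    if not r and not s:
        return 1.0
    m = max_matching_score(r, s)
    return m / (len(r) + len(s) - m)


def exact_overlap_similarity(record_r, record_s):
    """Plain Jaccard over whole elements, for comparison."""
    return jaccard(frozenset(record_r), frozenset(record_s))


if __name__ == "__main__":
    r = ["12 Main Street", "5 Oak Avenue"]
    s = ["12 Main St", "5 Oak Avenue"]
    print(fuzzy_set_similarity(r, s))
    print(exact_overlap_similarity(r, s))
```

In this example the two records share only one identical address, so the overlap-based score counts a single common element, while the matching-based score also credits the partial agreement between "12 Main Street" and "12 Main St". A practical implementation would replace the brute-force search with a polynomial-time assignment solver.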

Figure: TokenJoin is able to match records where exact matching would fail.