avimallu 47efe7a0fb Add imports, change arg vectorize_string
Added packages and methods to `import` into the class for it to work. The default argument for `typ` is `source`, and if the `source_strings` and `compare_strings` arguments are different, not changing it in the call to `vectorize_strings` will throw an error.
2022-11-12 10:28:55 -06:00

Fast Fuzzy Joins in Python

This is the culmination of a search for an efficient way to perform fuzzy matching in Python at scale (on a single machine, at least). Fuzzy matching in Python does not seem to come anywhere close to what R packages like stringdist provide: multi-threaded string distance matching across a variety of distance metrics.

The primary source that this file draws on for its queries is this Medium blog, which, like most Medium blogs that claim to have focused data science content, is just a hodgepodge of pathetic code borrowed from multiple sources. The original approach to TF-IDF-driven fuzzy joins at scale likely comes from https://github.com/bergvca, who has their own package, StringGrouper, although the optimizations that leverage the NMSLIB library are likely the Medium blog author's.

The Medium blog author's code is particularly poor in documentation, and here is a slightly cleaned-up version that I may revisit to add more information to. The recommended approach for using this class is:

  1. Prepare the list of strings that you would like to compare, along with the source strings that you want to compare them against.
  2. Pass both lists to the class by calling fastfuzzy() with the relevant arguments. Both source_strings and compare_strings must be lists of strings, and they can be identical.
  3. Call the query method on the resulting class instance, specifying the value of k that interests you.
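
To make the underlying idea concrete, here is a minimal, standard-library-only sketch of a TF-IDF fuzzy match over character trigrams. It is not the fastfuzzy class itself: every function name below (char_ngrams, fit_idf, tfidf_vector, cosine, top_k_matches) is hypothetical, and the NMSLIB approximate nearest-neighbour search is swapped for a brute-force cosine similarity scan, which is only reasonable for small lists. The source_strings, compare_strings and k names mirror the steps above, on the assumption that k is the number of matches returned per string.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(strings, n=3):
    """Compute smoothed inverse document frequencies over the source strings."""
    doc_freq = Counter()
    for s in strings:
        doc_freq.update(set(char_ngrams(s, n)))
    total = len(strings)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build an L2-normalised sparse TF-IDF vector (n-gram -> weight)."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def top_k_matches(source_strings, compare_strings, k=1):
    """For each compare string, return its k most similar source strings.

    Brute force (every pair is scored); stands in for the NMSLIB index.
    """
    idf = fit_idf(source_strings)
    source_vecs = [tfidf_vector(s, idf) for s in source_strings]
    results = {}
    for query in compare_strings:
        qv = tfidf_vector(query, idf)
        scored = sorted(
            ((cosine(qv, sv), s) for sv, s in zip(source_vecs, source_strings)),
            key=lambda pair: pair[0],
            reverse=True,
        )
        results[query] = [(s, round(score, 3)) for score, s in scored[:k]]
    return results


# Made-up example data, purely for illustration.
source_strings = ["Acme Corporation", "Globex Inc", "Initech LLC"]
compare_strings = ["ACME Corp", "Initech"]
matches = top_k_matches(source_strings, compare_strings, k=1)
```

The brute-force scan costs one similarity computation per pair of strings, which is exactly the cost that an approximate nearest-neighbour index such as NMSLIB is meant to avoid on large inputs.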