Added the packages and methods that need to be `import`ed for the class to work. The default argument for `typ` is `source`, and if the `source_strings` and `compare_strings` arguments are different, leaving it unchanged in the call to `vectorize_strings` will throw an error.
Fast Fuzzy Joins in Python
This is the culmination of a search for an efficient way to perform fuzzy matching in Python at scale (on a single machine, at least). Fuzzy matching in Python does not seem to be anywhere close to the level that R packages like stringdist provide: multi-threaded string distance matching across a variety of distance metrics.
The primary source that this file draws on for its queries is this Medium blog, which, like most Medium blogs that claim to have focused data science content, is just a hodgepodge of pathetic code borrowed from multiple sources. The original source for performing a TF-IDF driven fuzzy join at scale is likely https://github.com/bergvca, who has their own package, StringGrouper, although the optimizations that leverage the NMSLIB library are likely the Medium blog author's.
The Medium blog author's code is particularly poor in documentation, and here is a slightly cleaned-up version that I may revisit to add more information to. The approach recommended for using this class is:
- Prepare the list of strings that you would like to match, along with the source strings that you want to match them against.
- Send these two to the class by calling `fastfuzzy()`, with relevant arguments. Both `source_strings` and `compare_strings` must be lists of strings, and can be identical.
- Call the `query` method on the resulting class instance, and specify the value of `k` that interests you.
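To make the underlying idea concrete, here is a minimal standard-library sketch of a TF-IDF driven fuzzy match over character n-grams. It is not the `fastfuzzy` class itself: it swaps the NMSLIB approximate nearest-neighbour index for a brute-force cosine-similarity scan, so it only suits small lists. The function names (`char_ngrams`, `fit_idf`, `tfidf_vector`, `top_k_matches`) and the sample company names are hypothetical, chosen only for illustration; the `source_strings`, `compare_strings` and `k` parameters mirror the arguments described above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower().strip()} "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(strings, n=3):
    """Compute smoothed inverse document frequencies over the source strings."""
    doc_freq = Counter()
    for s in strings:
        doc_freq.update(set(char_ngrams(s, n)))
    total = len(strings)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Return an L2-normalised sparse TF-IDF vector as a dict.

    N-grams that never occur in the source strings are dropped.
    """
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def top_k_matches(source_strings, compare_strings, k=1, n=3):
    """For each compare string, return the k most similar source strings.

    Brute force: every compare string is scored against every source string.
    An approximate index such as NMSLIB replaces this loop at scale.
    """
    idf = fit_idf(source_strings, n)
    source_vecs = [tfidf_vector(s, idf, n) for s in source_strings]
    results = {}
    for query in compare_strings:
        q = tfidf_vector(query, idf, n)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(w * sv.get(g, 0.0) for g, w in q.items()), src)
            for sv, src in zip(source_vecs, source_strings)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results[query] = [(src, round(score, 3)) for score, src in scored[:k]]
    return results


# Hypothetical sample data for illustration only.
source = ["Apple Inc.", "Microsoft Corporation", "Alphabet Inc."]
compare = ["apple inc", "microsft corp", "alphabet"]
matches = top_k_matches(source, compare, k=1)
for query, best in matches.items():
    print(query, "->", best)
```

The brute-force scan costs one dot product per (compare, source) pair, which is exactly the step an approximate nearest-neighbour index is meant to avoid when both lists are large.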