13 Commits

Author SHA1 Message Date
avimallu
ea45040f03 Create test_train_split 2023-06-17 08:53:41 -05:00
avimallu
1303be7e57 Create cut.py
From https://github.com/pola-rs/polars/issues/8551
2023-05-16 13:54:53 -05:00
avimallu
c4ded75689 Update Readme.md 2023-02-18 19:20:17 -06:00
avimallu
cc9696abec Initial K-POD file 2022-11-21 10:43:35 -06:00
avimallu
47efe7a0fb Add imports, change arg vectorize_string
Added the packages and methods that the class needs to `import` in order to work. The default value of `typ` is `source`; if the `source_strings` and `compare_strings` arguments are different, leaving `typ` unchanged in the call to `vectorize_strings` will throw an error.
2022-11-12 10:28:55 -06:00
avimallu
e8c6948ae1 Update fastfuzzy.py 2022-10-17 23:41:25 -05:00
avimallu
f2295393c0 Update fastfuzzy.py 2022-10-17 23:34:11 -05:00
avimallu
eaec532056 Update fastfuzzy.py
Add ability to return a `sparse` COO matrix.
2022-10-17 22:57:54 -05:00
avimallu
79b7dba09e Update readme.md 2022-10-17 21:56:40 -05:00
avimallu
366ca46fe0 Update readme.md 2022-10-17 21:55:55 -05:00
avimallu
5db09ff7ab Create readme.md 2022-10-17 21:55:39 -05:00
avimallu
28d60515b5 Fast Fuzzy Matching in Python via tf-idf values
This is the culmination of a search for an efficient way to perform fuzzy matching in Python at scale (on a single machine, at least). Fuzzy matching in Python does not seem to come anywhere close to what R packages like [`stringdist`](https://cran.r-project.org/web/packages/stringdist/) provide: multi-threaded string matching across a variety of distance metrics.

The primary source that this file draws on for its queries is [this Medium blog](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536?gi=9bd6bb8ccfd5), which, like most Medium blogs that claim to have focused data science content, is just a hodgepodge of pathetic code borrowed from multiple sources. The original source for performing a TF-IDF-driven fuzzy join at scale is [https://github.com/bergvca](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html), who has their own package, [StringGrouper](https://github.com/Bergvca/string_grouper), although the optimizations that leverage the NMSLIB library are likely the Medium blog author's.

The Medium blog author's code is particularly poor in documentation, and here is a slightly cleaned-up version that I may revisit to add more information to. The recommended approach for using this class is:

1. Prepare the list of strings that you would like compared, along with the source strings that you want them compared to.
2. Pass both lists to the class by calling `fastfuzzy()` with the relevant arguments. Both `source_strings` and `compare_strings` must be lists of strings, and can be identical.
3. Call the `query` method on the resulting instance, and specify the value of `k` that interests you.
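
The general idea behind the steps above can be sketched in plain Python. This is a minimal illustration of TF-IDF matching over character n-grams, not the code in `fastfuzzy.py`: the helper names (`char_ngrams`, `fit_idf`, `vectorize`, `cosine`) and the free-standing `query` function are hypothetical, the trigram size and the IDF smoothing are assumptions, and the neighbour search is brute force rather than the NMSLIB index mentioned above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lower-cased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


def fit_idf(strings, n=3):
    """Compute smoothed inverse document frequencies over the source strings."""
    doc_freq = Counter()
    for s in strings:
        doc_freq.update(set(char_ngrams(s, n)))
    total = len(strings)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def vectorize(text, idf, n=3):
    """Return an L2-normalised sparse TF-IDF vector as a dict {ngram: weight}."""
    # n-grams never seen in the source strings are ignored (assumption).
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def query(source_strings, compare_strings, k=1, n=3):
    """For each compare string, return its k most similar source strings with scores."""
    idf = fit_idf(source_strings, n)
    source_vecs = [vectorize(s, idf, n) for s in source_strings]
    results = {}
    for c in compare_strings:
        cv = vectorize(c, idf, n)
        scored = sorted(
            ((cosine(cv, sv), s) for sv, s in zip(source_vecs, source_strings)),
            reverse=True,
        )
        results[c] = [(s, round(score, 3)) for score, s in scored[:k]]
    return results


if __name__ == "__main__":
    sources = ["Acme Corporation", "Globex Inc", "Initech LLC"]
    compares = ["acme corp", "globex incorporated"]
    for name, matches in query(sources, compares, k=2).items():
        print(name, "->", matches)
```

The brute-force loop compares every pair, so it only suits small lists; at scale, an approximate nearest-neighbour index over the same vectors would take its place.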
2022-10-17 21:55:15 -05:00
avimallu
d93e48925b Initial commit 2022-10-17 21:41:45 -05:00