www/public/blog/index.xml
Avinash Mallya 57eff46d6c Switch to Hugo
2025-09-13 21:27:23 -05:00
<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>blog on Avinash's Blog</title><link>https://avimallu.dev/blog/</link><description>Recent content in blog on Avinash's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-US</language><copyright>© Avinash Mallya</copyright><lastBuildDate>Fri, 20 Oct 2023 00:00:00 +0000</lastBuildDate><atom:link href="https://avimallu.dev/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>Quick hacks to make client-ready presentations</title><link>https://avimallu.dev/blog/003_powerpointsnap/</link><pubDate>Fri, 20 Oct 2023 00:00:00 +0000</pubDate><guid>https://avimallu.dev/blog/003_powerpointsnap/</guid><description>&lt;h1 id="premise">Premise&lt;/h1>
&lt;p>When I worked in healthcare consulting, I often spent a LOT of my time creating PowerPoint presentations (&lt;em>decks&lt;/em> in consulting lingo - not even &lt;em>slide decks&lt;/em>). However, it was rather repetitive. Thus, was born PowerPointSnap.&lt;/p>
&lt;h1 id="what-is-it">What is it?&lt;/h1>
&lt;p>I&amp;rsquo;ll write this down as pointers.&lt;/p>
&lt;ol>
&lt;li>It&amp;rsquo;s a VBA based PowerPoint add-on. Just a set of commands that work well with each other.&lt;/li>
&lt;li>It&amp;rsquo;s Windows only - it&amp;rsquo;s unlikely to work on macOS.&lt;/li>
&lt;li>It&amp;rsquo;s installation-free and is not an executable, which makes it perfect for locked-down corporate environments, as long as you have the permission to download files.&lt;/li>
&lt;/ol>
&lt;h1 id="how-do-i-get-it">How do I get it?&lt;/h1>
&lt;p>The project is available on this &lt;a href="https://github.com/avimallu/PowerPointSnap">GitHub repo&lt;/a>. The instructions to install it are available there, but here&amp;rsquo;s the lowdown:&lt;/p></description><content:encoded><![CDATA[<h1 id="premise">Premise</h1>
<p>When I worked in healthcare consulting, I often spent a LOT of my time creating PowerPoint presentations (<em>decks</em> in consulting lingo - not even <em>slide decks</em>). However, it was rather repetitive. Thus, was born PowerPointSnap.</p>
<h1 id="what-is-it">What is it?</h1>
<p>I&rsquo;ll write this down as pointers.</p>
<ol>
<li>It&rsquo;s a VBA based PowerPoint add-on. Just a set of commands that work well with each other.</li>
<li>It&rsquo;s Windows only - it&rsquo;s unlikely to work on macOS.</li>
<li>It&rsquo;s installation-free and is not an executable, which makes it perfect for locked-down corporate environments, as long as you have the permission to download files.</li>
</ol>
<h1 id="how-do-i-get-it">How do I get it?</h1>
<p>The project is available on this <a href="https://github.com/avimallu/PowerPointSnap">GitHub repo</a>. The instructions to install it are available there, but here&rsquo;s the lowdown:</p>
<ol>
<li>Download the Snap.ppam file to your system.</li>
<li>Enable the developer options.</li>
<li>Go to the Developer tab, and click on PowerPoint Add-ins.</li>
<li>Click on Add New. Choose the location of the file you just downloaded. Click Close.</li>
<li>To uninstall, repeat the process, and simply click on Remove this time.</li>
</ol>
<h1 id="what-can-i-do-with-it">What can I do with it?</h1>
<p>Frankly, a LOT. The base concept of this tool is:</p>
<ol>
<li>&ldquo;Set&rdquo; a shape as the one you want to copy a property from.</li>
<li>Select any property from the list to automatically apply it.</li>
</ol>
<p>Here&rsquo;s a non-exhaustive list of all the options available.</p>
<h2 id="apply-properties-of-shapes-directly">Apply properties of shapes directly</h2>
<p>This is the part of the interface that can be used for shapes (which include charts and tables).</p>
<p><img src="/blog/003_powerpointsnap/01_Shapes.png" alt="The UI for copying shape properties"></p>
<p>To use, first select a <em>shape</em> object, click on &ldquo;Set&rdquo;. Then, choose the object you want to <em>Snap</em> its properties to (see how I got the inspiration for the name?). You should be able to copy all compatible properties - if something is not copy-able, the tool will show an error, and then let you exit.</p>
<p>Note that it&rsquo;s probably not a good idea to apply a property of a shape to a table - if you want to make the entire table orange, there are probably better built-in ways to do it than to use <em>Snap</em>.</p>
<h2 id="beautify-charts-with-snappable-properties">Beautify charts with <em>Snap</em>pable properties</h2>
<p>Charts are also supported, with dedicated features for it.</p>
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it&rsquo;s capable of, but here&rsquo;s another summary just in case:</p>
<ol>
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the &ldquo;set&rdquo; chart to the one you&rsquo;ve selected. I couldn&rsquo;t put in just $x$ and $y$ here because Microsoft internally doesn&rsquo;t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn&rsquo;t work well yet for 3D charts.</li>
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that&rsquo;s usually difficult if you&rsquo;ve already configured the charts a little - which can be remedied with this option!</li>
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you&rsquo;ve selected with the way it originally is in the &ldquo;set&rdquo; chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
</ol>
<p>The next two options deserve their own section.</p>
<h2 id="customize-the-labels-programmatically">Customize the labels programmatically</h2>
<p>Your immediate senior in a consulting environment would frown at your chart, and then exclaim, &ldquo;I think that&rsquo;s too many labels for the data points. Can you show them every two/three/four labels? I know this is manual work, but it&rsquo;s a one time thing!&rdquo;</p>
<p>It&rsquo;s <strong>never</strong> a one time affair. But don&rsquo;t worry, we have this nice feature to help us. If you click on the <em>Customize Label</em> option, you will get this (without the &ldquo;Set&rdquo; option):</p>
<p><img src="/blog/003_powerpointsnap/DataLabelsScreenshot.JPG" alt="The UI for customizing labels."></p>
<p>Never mind the rather unfriendly legend entries. They&rsquo;re just here to demonstrate that you can do the following kinds of wacky things with your own chart!</p>
<h3 id="screenshots-of-the-chart-snapability">Screenshots of the chart <em>snap</em>ability</h3>
<p>Of course, visuals will do it more justice. For example, look at this image:</p>
<p><img src="/blog/003_powerpointsnap/Revenue_Presentation_1.png" alt="There&rsquo;s a lot wrong with this image. But primarily, the charts are of different sizes, the axes are different, the labels are too clustered, and the titles aren&rsquo;t centered."></p>
<p>Here&rsquo;s what you can do:</p>
<ol>
<li>Click on the left chart. Press &ldquo;Set&rdquo; in the toolbar for <em>Snap</em>.</li>
<li>Click on the right chart, and then go through the following:
<ol>
<li>In <em>Shapes</em>, click on <em>Dim</em>. This will align the shapes of the chart.</li>
<li>Use the guides that you get while moving the chart to align the positions of the two charts now that their shapes are equal.</li>
<li>You&rsquo;ll notice that the chart area still doesn&rsquo;t match, nor does the title.</li>
<li>In <em>Charts</em>, click on <em>Sync Plot Area</em> and <em>Sync Title Area</em>, and watch the magic unfold.</li>
<li>Now, click on the second chart, and click on &ldquo;Set&rdquo;. Let&rsquo;s align the axes of the first chart to the second one.</li>
<li>Click on the first chart, and then in <em>Charts</em>, click <em>Sync Value Axis</em>.</li>
</ol>
</li>
<li>Let&rsquo;s bring that senior&rsquo;s exclamation back into play - (s)he wants you to highlight <em>only</em> Profit labels, and that too every 2 iterations. To do this:
<ol>
<li>Click on <em>Customize Labels</em> after clicking on either chart.</li>
<li>You&rsquo;ll get the screen shown in the previous section. Make sure to adjust the values such that it&rsquo;s exactly like the screenshot there.</li>
<li>Click on &ldquo;Save and Run&rdquo;. This will <em>save</em> the configuration you&rsquo;ve selected, and <em>run</em> it on the chart you&rsquo;ve selected.</li>
<li>Click the other chart. Then, in <em>Charts</em>, click on <em>Rerun Customization</em>.</li>
</ol>
</li>
</ol>
<p>This is what your results should look like:</p>
<p><img src="/blog/003_powerpointsnap/Revenue_Presentation_2.png" alt="Everything is almost consistent. Your senior rests their eyes, and secretly wonders how you managed to do it quickly… maybe they should change some requirements…"></p>
<p>Of course, getting those calculations right is a whole different thing that will need some work.</p>
<h2 id="align-table-dimensions">Align table dimensions</h2>
<p>Oftentimes, you have two tables that show similar values&hellip; you know the drill. Here&rsquo;s what you can do in a scenario such as this:</p>
<p><img src="/blog/003_powerpointsnap/Table_Presentation_1.png" alt="Similar data, but vastly different tables."></p>
<p>This is what the <em>Tables</em> section of the tool looks like:</p>
<p><img src="/blog/003_powerpointsnap/03_Tables.png" alt="The UI for Tables"></p>
<p>To align these tables together,</p>
<ol>
<li>Click on the left table. Press &ldquo;Set&rdquo; in the toolbar for <em>Snap</em>.</li>
<li>Click on the right table.</li>
<li>Click on <em>Shapes</em>, inside it, <em>Dim</em>. Now the shapes of the table are the same.</li>
<li>In <em>Tables</em>, click on <em>Sync Column Widths</em>. Now the columns are also the same.</li>
<li>If you try to align by rows, it fails because the number of rows is not the same in the two tables.</li>
</ol>
<p>Here&rsquo;s what you&rsquo;ll end up with:</p>
<p><img src="/blog/003_powerpointsnap/Table_Presentation_2.png" alt="Similar data, and similar enough tables."></p>
<p>Pretty neat, eh?</p>
]]></content:encoded></item><item><title>Finding representative samples efficiently for large datasets</title><link>https://avimallu.dev/blog/002_representative_samples/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://avimallu.dev/blog/002_representative_samples/</guid><description>&lt;h1 id="premise">Premise&lt;/h1>
&lt;p>In this day and age, we&amp;rsquo;re not short on data. &lt;em>Good&lt;/em> data, on the other hand, is very valuable. When you&amp;rsquo;ve got a large amount of improperly labelled data, it may become hard to find a representative dataset to train a model on such that it generalizes well.&lt;/p>
&lt;p>Let&amp;rsquo;s formalize the problem a little so that a proper approach can be developed. Here&amp;rsquo;s the problem statement:&lt;/p>
&lt;ol>
&lt;li>You have a large-ish set of (imperfectly) labelled data points. These data points can be represented as a 2D matrix.&lt;/li>
&lt;li>You need to train a model to classify these data points on either these labels, or on labels derived from imperfect labels.&lt;/li>
&lt;li>You need a good (but not perfect) representative sample for the model to be generalizable, but there are too many data points for each label to manually pick representative examples.&lt;/li>
&lt;/ol>
&lt;h2 id="in-a-hurry">In a hurry?&lt;/h2>
&lt;p>Here&amp;rsquo;s what you need to do:&lt;/p></description><content:encoded><![CDATA[<h1 id="premise">Premise</h1>
<p>In this day and age, we&rsquo;re not short on data. <em>Good</em> data, on the other hand, is very valuable. When you&rsquo;ve got a large amount of improperly labelled data, it may become hard to find a representative dataset to train a model on such that it generalizes well.</p>
<p>Let&rsquo;s formalize the problem a little so that a proper approach can be developed. Here&rsquo;s the problem statement:</p>
<ol>
<li>You have a large-ish set of (imperfectly) labelled data points. These data points can be represented as a 2D matrix.</li>
<li>You need to train a model to classify these data points on either these labels, or on labels derived from imperfect labels.</li>
<li>You need a good (but not perfect) representative sample for the model to be generalizable, but there are too many data points for each label to manually pick representative examples.</li>
</ol>
<h2 id="in-a-hurry">In a hurry?</h2>
<p>Here&rsquo;s what you need to do:</p>
<ol>
<li>Read the premise and see if it fits your problem.</li>
<li>Go to the <strong>For the folks in a hurry!</strong> section at the end to find the generic solution and how it works.</li>
</ol>
<h2 id="why-do-we-need-representative-samples">Why do we need representative samples?</h2>
<p>Generally, three things come to mind:</p>
<ol>
<li>Allows the model to be generalizable for all <em>kinds</em> of data points <em>within</em> a category.</li>
<li>Allows for faster training of the model - you need <em>fewer</em> data points to get the same accuracy!</li>
<li>Allows maintaining the training set - if your training set needs validation by experts or annotations, this keeps your costs low!</li>
</ol>
<h1 id="define-the-data">Define the data</h1>
<p>This data can be practically anything that can be represented as a 2D matrix.</p>
<p>There are exceptions. Raw image data (as numbers) might get difficult because even if you flatten them, there&rsquo;ll be significant correlation between features. For example, a face can appear practically anywhere in the image, and all pixels centered around the face will be highly correlated, even if they are on different lines. A workaround in this case would be to pipe the image through a CNN model that has been trained on some <em>generic</em> task and produces a 1D representation of a single image in the final hidden layer before the output. Other data will need further processing along similar lines.</p>
<h2 id="get-a-specific-dataset">Get a specific dataset</h2>
<p>For this specific article, I will use the <a href="https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization/data">ShopMania dataset on Kaggle</a>. I apologize in advance for not using a more easily accessible dataset (you need to sign into Kaggle to download it) - and I&rsquo;m not 100% sure if the GPL allows me to create a copy of the data and place it in my own repository. Nevertheless, the data (if you download it and choose to use it instead of some other dataset) will look like this:</p>
<blockquote>
<p><strong>NOTE</strong>: whenever I want to show an output <em>along</em> with the code I used for it, you&rsquo;ll see the characters <code>&gt;&gt;</code> indicating the command used, and the output to be without those prefixes.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="o">&gt;&gt;</span> <span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="o">&gt;&gt;</span> <span class="n">data</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">&#34;archive/shopmania.csv&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="o">&gt;&gt;</span> <span class="n">data</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">shape</span><span class="p">:</span> <span class="p">(</span><span class="mi">313_705</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="err">┌────────────┬──────────────────────────────────────────────────────┬─────────────┬────────────────┐</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="err">│</span> <span class="n">product_ID</span> <span class="err">┆</span> <span class="n">product_title</span> <span class="err">┆</span> <span class="n">category_ID</span> <span class="err">┆</span> <span class="n">category_label</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="err">│</span> <span class="o">---</span> <span class="err">┆</span> <span class="o">---</span> <span class="err">┆</span> <span class="o">---</span> <span class="err">┆</span> <span class="o">---</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="err">│</span> <span class="n">i64</span> <span class="err">┆</span> <span class="nb">str</span> <span class="err">┆</span> <span class="n">i64</span> <span class="err">┆</span> <span class="nb">str</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="err">╞════════════╪══════════════════════════════════════════════════════╪═════════════╪════════════════╡</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="err">│</span> <span class="mi">2</span> <span class="err">┆</span> <span class="n">twilight</span> <span class="n">central</span> <span class="n">park</span> <span class="nb">print</span> <span class="err">┆</span> <span class="mi">2</span> <span class="err">┆</span> <span class="n">Collectibles</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="err">│</span> <span class="mi">3</span> <span class="err">┆</span> <span class="n">fox</span> <span class="nb">print</span> <span class="err">┆</span> <span class="mi">2</span> <span class="err">┆</span> <span class="n">Collectibles</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="err">│</span> <span class="mi">4</span> <span class="err">┆</span> <span class="n">circulo</span> <span class="n">de</span> <span class="n">papel</span> <span class="n">wall</span> <span class="n">art</span> <span class="err">┆</span> <span class="mi">2</span> <span class="err">┆</span> <span class="n">Collectibles</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="err">│</span> <span class="mi">5</span> <span class="err">┆</span> <span class="n">hidden</span> <span class="n">path</span> <span class="nb">print</span> <span class="err">┆</span> <span class="mi">2</span> <span class="err">┆</span> <span class="n">Collectibles</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="err">│</span> <span class="err">…</span> <span class="err">┆</span> <span class="err">…</span> <span class="err">┆</span> <span class="err">…</span> <span class="err">┆</span> <span class="err">…</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="err">│</span> <span class="mi">313703</span> <span class="err">┆</span> <span class="n">deago</span> <span class="n">anti</span> <span class="n">fog</span> <span class="n">swimming</span> <span class="n">diving</span> <span class="n">full</span> <span class="n">face</span> <span class="n">mask</span> <span class="err">┆</span> <span class="mi">229</span> <span class="err">┆</span> <span class="n">Water</span> <span class="n">Sports</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="err">│</span> <span class="err">┆</span> <span class="n">surface</span> <span class="n">snorkel</span> <span class="n">scuba</span> <span class="n">fr</span> <span class="n">gopro</span> <span class="n">black</span> <span class="n">s</span><span class="o">/</span><span class="n">m</span> <span class="err">┆</span> <span class="err">┆</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="err">│</span> <span class="mi">313704</span> <span class="err">┆</span> <span class="n">etc</span> <span class="n">buys</span> <span class="n">full</span> <span class="n">face</span> <span class="n">gopro</span> <span class="n">compatible</span> <span class="n">snorkel</span> <span class="n">scuba</span> <span class="err">┆</span> <span class="mi">229</span> <span class="err">┆</span> <span class="n">Water</span> <span class="n">Sports</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="err">│</span> <span class="err">┆</span> <span class="n">diving</span> <span class="n">mask</span> <span class="n">blue</span> <span class="n">large</span><span class="o">/</span><span class="n">xtralarge</span> <span class="n">blue</span> <span class="err">┆</span> <span class="err">┆</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="err">│</span> <span class="mi">313705</span> <span class="err">┆</span> <span class="n">men</span> <span class="mi">039</span> <span class="n">s</span> <span class="n">full</span> <span class="n">face</span> <span class="n">breathe</span> <span class="n">free</span> <span class="n">diving</span> <span class="n">snorkel</span> <span class="n">mask</span> <span class="err">┆</span> <span class="mi">229</span> <span class="err">┆</span> <span class="n">Water</span> <span class="n">Sports</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="err">│</span> <span class="err">┆</span> <span class="n">scuba</span> <span class="n">optional</span> <span class="n">hd</span> <span class="n">camera</span> <span class="n">blue</span> <span class="n">mask</span> <span class="n">only</span> <span class="n">adult</span> <span class="n">men</span> <span class="err">┆</span> <span class="err">┆</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="err">│</span> <span class="mi">313706</span> <span class="err">┆</span> <span class="n">women</span> <span class="mi">039</span> <span class="n">s</span> <span class="n">full</span> <span class="n">face</span> <span class="n">breathe</span> <span class="n">free</span> <span class="n">diving</span> <span class="n">snorkel</span> <span class="err">┆</span> <span class="mi">229</span> <span class="err">┆</span> <span class="n">Water</span> <span class="n">Sports</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="err">│</span> <span class="err">┆</span> <span class="n">mask</span> <span class="n">scuba</span> <span class="n">optional</span> <span class="n">hd</span> <span class="n">camera</span> <span class="n">black</span> <span class="n">mask</span> <span class="n">only</span> <span class="err">┆</span> <span class="err">┆</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="err">│</span> <span class="err">┆</span> <span class="n">children</span> <span class="ow">and</span> <span class="n">women</span> <span class="err">┆</span> <span class="err">┆</span> <span class="err">│</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="err">└────────────┴──────────────────────────────────────────────────────┴─────────────┴────────────────┘</span></span></span></code></pre></div><p>The data documentation on Kaggle states:</p>
<blockquote>
<p>The first dataset originates from ShopMania, a popular online product comparison platform. It enlists tens of millions of products organized in a three-level hierarchy that includes 230 categories. The two higher levels of the hierarchy include 39 categories, whereas the third lower level accommodates the rest 191 leaf categories. Each product is categorized into this tree structure by being mapped to only one leaf category. Some of these 191 leaf categories contain millions of products. However, shopmania.com allows only the first 10,000 products to be retrieved from each category. Under this restriction, our crawler managed to collect 313,706 products.</p>
</blockquote>
<p>For demonstration, I&rsquo;ll just limit the categories to those that have exactly 10,000 occurrences.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"> <span class="n">data</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">over</span><span class="p">(</span><span class="s2">&#34;category_ID&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="mi">10000</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="p">)</span></span></span></code></pre></div><p>You&rsquo;ll notice that there are only 17 categories in this dataset. Run this to verify that fact.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="n">data</span><span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;category_label&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">shape</span><span class="p">:</span> <span class="p">(</span><span class="mi">17</span><span class="p">,)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">Series</span><span class="p">:</span> <span class="s1">&#39;category_label&#39;</span> <span class="p">[</span><span class="nb">str</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"> <span class="s2">&#34;Kitchen &amp; Dining&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="s2">&#34;Scarves and wraps&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="s2">&#34;Handbags &amp; Wallets&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="s2">&#34;Rugs Tapestry &amp; Linens&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="s2">&#34;Cell Phones Accessories&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"> <span class="s2">&#34;Men&#39;s Clothing&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"> <span class="s2">&#34;Jewelry&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="s2">&#34;Belts&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"> <span class="s2">&#34;Men Lingerie&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"> <span class="s2">&#34;Crafts&#34;</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"> <span class="s2">&#34;Football&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"> <span class="s2">&#34;Medical Supplies&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"> <span class="s2">&#34;Adult&#34;</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"> <span class="s2">&#34;Hunting&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"> <span class="s2">&#34;Women&#39;s Clothing&#34;</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"> <span class="s2">&#34;Pet Supply&#34;</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"> <span class="s2">&#34;Office Supplies&#34;</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="p">]</span></span></span></code></pre></div><p>Note that this is very easy in Polars, which is the package I typically use for data manipulation. I recommend using it over Pandas.</p>
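<p>If you don&rsquo;t use Polars, the category filter above can be sketched in plain Python. This is only an illustration of the logic (count rows per category, then keep rows whose category has exactly the target count), not the code used in this article. The toy <code>rows</code> and the small <code>target_count</code> are made-up stand-ins for the real data and the 10,000 threshold.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from collections import Counter

# Hypothetical toy rows: (product_title, category_ID) pairs.
rows = [
    ("fox print", 2),
    ("hidden path print", 2),
    ("snorkel mask", 229),
    ("diving mask", 229),
    ("scuba mask", 229),
    ("leather belt", 7),
]

# Stand-in for the 10,000 threshold used above.
target_count = 2

# Count rows per category, similar in spirit to a count over "category_ID".
rows_per_category = Counter(category_id for _, category_id in rows)

# Keep a row only if its category has exactly target_count rows.
kept = [row for row in rows if rows_per_category[row[1]] == target_count]
print(kept)
</code></pre></div>
<p>Only the two rows of category 2 survive here, because the other categories have three rows and one row respectively.</p>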
<h2 id="specify-the-task">Specify the task</h2>
<p>Okay - so now we have exactly 10,000 products <em>per</em> category. We only have the title of the product that can be leveraged for categorization. So let me define the task this way:</p>
<blockquote>
<p>Craft a <em>small</em> representative sample for each category.</p>
</blockquote>
<p>Why small? A small sample makes the model faster to train - <em>and</em> keeps the training data manageable in size.</p>
<h1 id="finding-representative-samples">Finding representative samples</h1>
<p>I mentioned earlier that we need to represent data as a 2D matrix for the technique I have in mind to work. How can I translate a list of text to a matrix? The answer&rsquo;s rather simple: use <code>SentenceTransformers</code> to get a string&rsquo;s embedding. You could also use more classic techniques like computing TF-IDF values, or use more advanced transformers, but I&rsquo;ve noticed that <code>SentenceTransformers</code> are able to capture semantic meaning of sentences rather well (assuming you use a good model suited for the language the data is in) - they are trained on sentence similarity after all.</p>
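<p>To make the &ldquo;list of text to 2D matrix&rdquo; idea concrete, here is a minimal sketch of the classic TF-IDF route mentioned above, using only the standard library. It assumes whitespace tokenization and a plain, unsmoothed TF-IDF formula, and the three titles are hypothetical; it is not the embedding approach used in the rest of this article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">import math
from collections import Counter

# Hypothetical product titles; real ones would come from the product_title column.
titles = [
    "fox print",
    "hidden path print",
    "full face snorkel mask",
]

# Assumption: naive whitespace tokenization.
tokenized = [title.split() for title in titles]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: the number of titles that contain each token.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))


def tfidf_row(tokens):
    """Return one TF-IDF value per vocabulary token for a single title."""
    counts = Counter(tokens)
    return [
        (counts[token] / len(tokens)) * math.log(len(titles) / doc_freq[token])
        for token in vocabulary
    ]


# One row per title, one column per vocabulary token: a 2D matrix.
matrix = [tfidf_row(tokens) for tokens in tokenized]
print(len(matrix), len(vocabulary))
</code></pre></div>
<p>Each title becomes one row, so any technique that expects a 2D matrix can consume the result. Sentence embeddings play the same role, with dense columns instead of one column per token.</p>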
<h2 id="getting-sentencetransformer-embeddings">Getting <code>SentenceTransformer</code> embeddings</h2>
<p>This part is rather simple. If you&rsquo;re unable to install SentenceTransformers, <a href="https://www.sbert.net/docs/installation.html">please check their website</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="kn">import</span> <span class="nn">sentence_transformers</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># See list of models at www.sbert.net/docs/pretrained_models.html</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">ST</span> <span class="o">=</span> <span class="n">sentence_transformers</span><span class="o">.</span><span class="n">SentenceTransformer</span><span class="p">(</span><span class="s2">&#34;all-mpnet-base-v2&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">title_embeddings</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"> <span class="n">ST</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"> <span class="n">data</span><span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;product_title&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">to_list</span><span class="p">(),</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"> <span class="n">show_progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"> <span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span></span></span></code></pre></div><p>This process will be slow (~30 minutes) if you don&rsquo;t have a GPU. There are faster approaches, but they are more involved than is useful for a blog post. The wait will be worth it, I promise! With <code>convert_to_tensor=True</code>, <code>encode</code> returns a single tensor, and the call to <code>.cpu().numpy()</code> at the end turns it into a single <code>numpy</code> array - much more efficient than handling a <code>list</code> of <code>numpy</code> arrays. <code>SentenceTransformers</code> will try to run on the GPU if one is available, which is why the <code>.cpu()</code> call is there: it copies the tensor from the GPU to the CPU, and does nothing if the tensor is already on the CPU.</p>
<blockquote>
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you&rsquo;re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It&rsquo;s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
</blockquote>
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
<p>Performing any kind of nearest neighbor search on medium-scale datasets (even those approaching 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that ANN algorithms take shortcuts to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
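<p>To see where the cost comes from, here is a small standard-library sketch of an exact, brute-force search that also counts how many distances it computes. The function name <code>brute_force_neighbors</code> and the toy points are assumptions for illustration, not code from any library.</p>

```python
import math

# Hypothetical helper for illustration; it exists only to count distance computations.
def brute_force_neighbors(points, k):
    """Exact k nearest neighbors for every point, by computing every pairwise distance."""
    n_distance_calls = 0
    neighbors = []
    for i, p in enumerate(points):
        scored = []
        for j, q in enumerate(points):
            if i == j:
                continue
            scored.append((math.dist(p, q), j))
            n_distance_calls += 1
        scored.sort()
        neighbors.append([j for _, j in scored[:k]])
    return neighbors, n_distance_calls

# Made-up 2D points: two tight pairs that are far from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
neighbors, n_distance_calls = brute_force_neighbors(points, k=1)
print(neighbors)         # nearest neighbor index for each point
print(n_distance_calls)  # n * (n - 1) distance computations
```

<p>With <code>n</code> points this computes <code>n * (n - 1)</code> distances, so 10,000 rows already mean roughly 100 million of them, each over every column. That quadratic growth is what ANN algorithms try to avoid.</p>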
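<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>. Note that the flat indexes used below (<code>IndexFlatL2</code> and <code>IndexFlatIP</code>) perform an exact, brute-force search; <code>faiss</code> also provides approximate index types if you need more speed on larger datasets.</p>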
<p>I&rsquo;ll explain why we&rsquo;re in the nearest neighbor territory in due course.</p>
<h3 id="building-the-database">Building the database</h3>
<p>To build the database, all we need is the <code>title_embeddings</code> matrix.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="kn">import</span> <span class="nn">faiss</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="k">def</span> <span class="nf">create_index</span><span class="p">(</span><span class="n">title_embeddings</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"> <span class="n">d</span> <span class="o">=</span> <span class="n">title_embeddings</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># Number of dimensions</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"> <span class="n">ann_index</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatL2</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="c1"># Index using Euclidean (L2) distance</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"> <span class="n">ann_index</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">title_embeddings</span><span class="p">)</span> <span class="c1"># Build the index</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"> <span class="k">return</span> <span class="n">ann_index</span> <span class="c1"># Faiss considers databases an &#34;index&#34;</span></span></span></code></pre></div><p>This does create <em>a</em> database. But remember, we&rsquo;re trying to find <em>representative samples</em> - which means we need to do this <em>by</em> the category (or label). So let&rsquo;s design a function that selects only the data for a particular category, and then creates the database from it. We&rsquo;ll need three pieces of information from this function:</p>
<ol>
<li>The actual <code>faiss</code> database.</li>
<li>The actual subset of data that was used to build this index.</li>
<li>The label indices with respect to the original data that went into the <code>faiss</code> database.</li>
</ol>
<p>(2) and (3) will help us later in rebuilding a &ldquo;network graph&rdquo; that will allow us to reference the original data points.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">faiss</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">def</span> <span class="nf">create_index</span><span class="p">(</span><span class="n">label</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="n">faiss_indices</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="n">data</span> <span class="c1"># this needs to be an argument if you want to create a generic function</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="o">.</span><span class="n">with_row_count</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category_label&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="n">label</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"> <span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"> <span class="o">.</span><span class="n">to_list</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"> <span class="n">faiss_data</span> <span class="o">=</span> <span class="n">title_embeddings</span><span class="p">[</span><span class="n">faiss_indices</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"> <span class="n">d</span> <span class="o">=</span> <span class="n">faiss_data</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># Number of dimensions</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"> <span class="n">faiss_DB</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatIP</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="c1"># Index using Inner Product</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"> <span class="n">faiss</span><span class="o">.</span><span class="n">normalize_L2</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">)</span> <span class="c1"># Normalized L2 with Inner Product search = cosine similarity</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"> <span class="c1"># Why cosine similarity? It&#39;s easier to specify thresholds - they&#39;ll always be between -1 and 1.</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"> <span class="c1"># If using Euclidean or other distance, we&#39;ll have to spend some time finding a good range</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"> <span class="c1"># where distances are reasonable. See https://stats.stackexchange.com/a/146279 for details.</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"> <span class="n">faiss_DB</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">)</span> <span class="c1"># Build the index</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl"> <span class="k">return</span> <span class="n">faiss_DB</span><span class="p">,</span> <span class="n">faiss_data</span><span class="p">,</span> <span class="n">faiss_indices</span></span></span></code></pre></div><h3 id="identifying-the-nearest-neighbors">Identifying the nearest neighbors</h3>
<p>To proceed with getting a representative sample, the next step is to find the nearest neighbors for <strong>all</strong> data points in the database. This isn&rsquo;t too hard - <code>faiss</code> <code>index</code> objects have a built-in <code>search</code> method to find the <code>k</code> nearest neighbors of each query vector, along with the distance to each of them (here, because of the normalized inner product index, the &ldquo;distance&rdquo; is a cosine similarity, so higher means closer). Let&rsquo;s then write a function to get the following information: the label index for which nearest neighbors are being searched, the indices of said nearest neighbors, and the distance between them. In network graph parlance, this kind of data is called an <em>edge list</em>, i.e. a list of pairs of <em>nodes</em> that are connected, along with any additional information that specifies a property (in this case distance) of the <em>edge</em> that connects these <em>nodes</em>.</p>
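<p>As a dependency-free toy version of the same idea, the sketch below normalizes a few made-up vectors, uses the dot product of unit vectors as cosine similarity, and emits one edge per nearest neighbor. The names <code>normalize</code> and <code>toy_edge_list</code> are assumptions for illustration; the <code>faiss</code>-based function that does this on the real data follows.</p>

```python
import math

# Hypothetical helpers for illustration; they are not part of faiss or polars.
def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def toy_edge_list(vectors, k):
    """Return (from, to, similarity) rows for the k most similar neighbors of each vector."""
    unit = [normalize(v) for v in vectors]
    edges = []
    for i, u in enumerate(unit):
        scored = [
            (sum(a * b for a, b in zip(u, v)), j)
            for j, v in enumerate(unit)
            if j != i  # skip the self-match
        ]
        scored.sort(reverse=True)  # most similar first
        edges.extend((i, j, similarity) for similarity, j in scored[:k])
    return edges

# Made-up 2D "embeddings": the first two point in almost the same direction
vectors = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0]]
for row in toy_edge_list(vectors, k=1):
    print(row)
```

<p>Every vector gets a nearest neighbor, even the third one, whose best match is barely similar at all. That is why the edge list is filtered by a minimum similarity later on.</p>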
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">get_edge_list</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"> <span class="n">faiss_DB</span><span class="p">,</span> <span class="n">faiss_data</span><span class="p">,</span> <span class="n">faiss_indices</span> <span class="o">=</span> <span class="n">create_index</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"> <span class="c1"># To map faiss positions back to row indices in the original `data`</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"> <span class="n">faiss_indices_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">x</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">faiss_indices</span><span class="p">)}</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"> <span class="c1"># To map the indices back to the original strings</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="n">title_name_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">x</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">,</span> <span class="s2">&#34;product_title&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">rows</span><span class="p">()}</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="n">distances</span><span class="p">,</span> <span class="n">neighbors</span> <span class="o">=</span> <span class="n">faiss_DB</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="k">return</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"> <span class="s2">&#34;from&#34;</span><span class="p">:</span> <span class="n">faiss_indices</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">,</span> <span class="n">neighbors</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">&#34;distance&#34;</span><span class="p">,</span> <span class="n">distances</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"> <span class="o">.</span><span class="n">explode</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">,</span> <span class="s2">&#34;distance&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"> <span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;from&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">title_name_map</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">faiss_indices_map</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">title_name_map</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;from&#34;</span><span class="p">)</span> <span class="o">!=</span> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"> <span class="p">)</span> </span></span></code></pre></div><h3 id="networkx-and-connected-components">NetworkX and Connected Components</h3>
<p>The next step in the process is to create a network graph using the edge-list. But why?</p>
<p>Remember that we have identified the (k=5) nearest neighbors of <strong>each</strong> data point. Let&rsquo;s say that we have a point A that has a nearest neighbor B. C is <strong>not</strong> a nearest neighbor of A, but it is a nearest neighbor of B. In a network graph, if A and C are both similar enough to B to clear a particular <em>minimum threshold</em>, then A will be connected to C through B! Hopefully the small visual below helps.</p>
<p><img src="/blog/002_representative_samples/001_Network_Cluster_1.png" alt="How a network component is formed."></p>
<p>What happens when such a concept is extended to many data points? Not all of them would be connected - because we&rsquo;re applying a <em>minimum</em> threshold that they have to meet. This is the only heuristic part of this rather fast process. Here&rsquo;s one more helpful visual:</p>
<p><img src="/blog/002_representative_samples/002_Network_Cluster_2.png" alt="How a network cluster is formed."></p>
<p>Very Starry Night-esque vibes here. Let&rsquo;s get to the code.</p>
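<p>Before the NetworkX version, the A-B-C example can be sketched without any dependencies: drop the edges below the threshold, then collect everything reachable from each node with a breadth-first search. The function name <code>connected_components</code> here is my own standard-library stand-in and the edge values are made up.</p>

```python
from collections import defaultdict, deque

# Hypothetical standard-library stand-in for illustration; not the NetworkX function.
def connected_components(edges, min_similarity):
    """Group nodes linked, directly or through other nodes, by edges that pass the threshold."""
    adjacency = defaultdict(set)
    for source, target, similarity in edges:
        if similarity >= min_similarity:
            adjacency[source].add(target)
            adjacency[target].add(source)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

# Made-up edge list in (from, to, similarity) form
edges = [
    ("A", "B", 0.97),  # A and B are near-duplicates
    ("B", "C", 0.96),  # C is close to B, but was never matched to A directly
    ("C", "D", 0.60),  # below the threshold, so D stays out
    ("E", "F", 0.99),
]
print(connected_components(edges, min_similarity=0.95))
```

<p>A, B and C end up in one component even though A and C were never directly matched, while D is left out because its only edge fails the threshold.</p>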
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="nn">nx</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="k">def</span> <span class="nf">get_cluster_map</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">min_cosine_distance</span><span class="o">=</span><span class="mf">0.95</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"> <span class="n">edge_list</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"> <span class="n">get_edge_list</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;distance&#34;</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">min_cosine_distance</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"> <span class="n">graph</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">from_pandas_edgelist</span><span class="p">(</span><span class="n">edge_list</span><span class="o">.</span><span class="n">to_pandas</span><span class="p">(),</span> <span class="n">source</span><span class="o">=</span><span class="s2">&#34;from&#34;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s2">&#34;to&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"> <span class="k">return</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">nx</span><span class="o">.</span><span class="n">connected_components</span><span class="p">(</span><span class="n">graph</span><span class="p">))}</span></span></span></code></pre></div><h1 id="getting-clusters">Getting clusters</h1>
<p>Now that all the parts of the puzzle are together, let&rsquo;s run it to see what kind of clusters you get for <code>Cell Phone Accessories</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">clusters</span> <span class="o">=</span> <span class="n">get_cluster_map</span><span class="p">(</span><span class="s2">&#34;Cell Phones Accessories&#34;</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">)</span></span></span></code></pre></div><p>Tweak the following if your results aren&rsquo;t good enough:</p>
<ol>
<li>Lower the <code>min_cosine_distance</code> value (despite its name, it is a minimum cosine <em>similarity</em>) if you want <em>bigger</em> clusters.</li>
<li>Increase the number of nearest neighbors if you want <em>more</em> matches.</li>
</ol>
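<p>To get a feel for the first knob, the sketch below sweeps the threshold over a small made-up edge list and reports the cluster sizes at each setting. It uses a tiny union-find instead of NetworkX, and the name <code>cluster_sizes</code> and the similarity values are assumptions for illustration only.</p>

```python
# Hypothetical helper for illustration; the edge list and thresholds are made up.
def cluster_sizes(edges, min_similarity):
    """Sizes of the clusters formed by edges at or above the threshold, largest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target, similarity in edges:
        if similarity >= min_similarity:
            parent[find(source)] = find(target)  # merge the two clusters

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)

edges = [
    (0, 1, 0.98), (1, 2, 0.96), (2, 3, 0.91),
    (4, 5, 0.97), (5, 6, 0.88), (3, 4, 0.86),
]
for threshold in (0.95, 0.90, 0.85):
    print(threshold, cluster_sizes(edges, threshold))
```

<p>As the threshold drops, weaker edges survive and separate clusters merge into fewer, larger ones. Push it too low and unrelated products start to chain together, so it is worth checking a few clusters by eye after each change.</p>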
<h2 id="viewing-the-components">Viewing the components</h2>
<p>There will likely be many clusters (you can see how many exactly with <code>len(clusters)</code>). Let&rsquo;s look at a random cluster:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="n">clusters</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">[</span><span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 yellow 72570099&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"> <span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 black 72570093&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"> <span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 lightblue 72570097&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"> <span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 blue 72570095&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"> <span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 green 72570101&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"> <span class="s1">&#39;smartphone lanyard with card slot for any phone up to 6 pink 72570091&#39;</span><span class="p">]</span></span></span></code></pre></div><p>Let&rsquo;s see another cluster that had 172(!) members in my run (the clusters themselves will be stable, but their indices may change in each run owing to some inherent randomness in the process).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="n">clusters</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="p">[</span><span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case snowflakes iphone 8/7 op qq z051a&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 arrows blue op qq a02 58&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7/6s clear printed phone case single iphone 8/7/6s golden pineapple op qq z089a&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7/6s clear printed phone case single iphone 8/7/6s butteryfly delight yellow op qq z029d&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 luck of the irish op qq a01 45&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 brides maid white op qq a02 16&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="o">...</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 flying arrows white op qq hip 20&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 brides maid pink white op qq a02 17&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case iphone 8/7 anemone flowers white op qq z036a&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case mustache iphone 8/7 op qq hip 08&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7 modern clear printed phone case oh snap iphone 8/7 op qq z053a&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"> <span class="s1">&#39;otm essentials iphone 8/7/6s clear printed phone case single iphone 8/7/6s desert cacti orange pink op qq a02 22&#39;</span><span class="p">]</span></span></span></code></pre></div><h2 id="running-for-all-categories">Running for all categories</h2>
<p>This isn&rsquo;t that hard (although it may take more than a moment). Just iterate it for each category!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">clusters</span> <span class="o">=</span> <span class="p">[</span><span class="n">get_cluster_map</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;category_label&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">unique</span><span class="p">()]</span></span></span></code></pre></div><h1 id="for-the-folks-in-a-hurry">For the folks in a hurry!</h1>
<p>I get it - you often want a solution that &ldquo;just works&rdquo;. I can come close to it. See below for code and a succinct explanation. For those of my readers who aren&rsquo;t in a hurry, this also serves as a nice summary (and copy-pastable code)!</p>
<h2 id="the-code">The code</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">sentence_transformers</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">faiss</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># Data is read here. You can download the files from Kaggle here:</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">&#34;archive/shopmania.csv&#34;</span><span class="p">,</span> <span class="n">new_columns</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="s2">&#34;product_ID&#34;</span><span class="p">,</span> <span class="s2">&#34;product_title&#34;</span><span class="p">,</span> <span class="s2">&#34;category_ID&#34;</span><span class="p">,</span> <span class="s2">&#34;category_label&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"> <span class="n">data</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">over</span><span class="p">(</span><span class="s2">&#34;category_ID&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="mi">10000</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"> <span class="o">.</span><span class="n">with_row_count</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># See list of models at www.sbert.net/docs/pretrained_models.html</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">ST</span> <span class="o">=</span> <span class="n">sentence_transformers</span><span class="o">.</span><span class="n">SentenceTransformer</span><span class="p">(</span><span class="s2">&#34;all-mpnet-base-v2&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">title_embeddings</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"> <span class="n">ST</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"> <span class="n">data</span><span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;product_title&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">to_list</span><span class="p">(),</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"> <span class="c1"># I&#39;m on a MacBook, you should use `cuda` or `cpu`</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"> <span class="c1"># if you&#39;ve got different hardware.</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"> <span class="n">device</span><span class="o">=</span><span class="s2">&#34;mps&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl"> <span class="n">show_progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"> <span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1"># Code to create a FAISS index</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="k">def</span> <span class="nf">create_index</span><span class="p">(</span><span class="n">label</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"> <span class="n">faiss_indices</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl"> <span class="n">data</span> <span class="c1"># this needs to be an argument if you want to create a generic function</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category_label&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="n">label</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl"> <span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl"> <span class="o">.</span><span class="n">to_list</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">
</span></span><span class="line"><span class="ln">37</span><span class="cl"> <span class="n">faiss_data</span> <span class="o">=</span> <span class="n">title_embeddings</span><span class="p">[</span><span class="n">faiss_indices</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl"> <span class="n">d</span> <span class="o">=</span> <span class="n">faiss_data</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># Number of dimensions</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl"> <span class="n">faiss_DB</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatIP</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="c1"># Index using Inner Product</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl"> <span class="n">faiss</span><span class="o">.</span><span class="n">normalize_L2</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">)</span> <span class="c1"># Normalized L2 with Inner Product search = cosine similarity</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl"> <span class="n">faiss_DB</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">)</span> <span class="c1"># Build the index</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl">
</span></span><span class="line"><span class="ln">43</span><span class="cl"> <span class="k">return</span> <span class="n">faiss_DB</span><span class="p">,</span> <span class="n">faiss_data</span><span class="p">,</span> <span class="n">faiss_indices</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl">
</span></span><span class="line"><span class="ln">45</span><span class="cl"><span class="c1"># Code to create an edge-list</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl"><span class="k">def</span> <span class="nf">get_edge_list</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">47</span><span class="cl"> <span class="n">faiss_DB</span><span class="p">,</span> <span class="n">faiss_data</span><span class="p">,</span> <span class="n">faiss_indices</span> <span class="o">=</span> <span class="n">create_index</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl"> <span class="c1"># To map the data back to the original `train[b&#39;data&#39;]` array</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl"> <span class="n">faiss_indices_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">x</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">faiss_indices</span><span class="p">)}</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl"> <span class="c1"># To map the indices back to the original strings</span>
</span></span><span class="line"><span class="ln">51</span><span class="cl"> <span class="n">title_name_map</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">x</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&#34;row_idx&#34;</span><span class="p">,</span> <span class="s2">&#34;product_title&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">rows</span><span class="p">()}</span>
</span></span><span class="line"><span class="ln">52</span><span class="cl"> <span class="n">distances</span><span class="p">,</span> <span class="n">neighbors</span> <span class="o">=</span> <span class="n">faiss_DB</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">faiss_data</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">53</span><span class="cl">
</span></span><span class="line"><span class="ln">54</span><span class="cl"> <span class="k">return</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">55</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span>
</span></span><span class="line"><span class="ln">56</span><span class="cl"> <span class="s2">&#34;from&#34;</span><span class="p">:</span> <span class="n">faiss_indices</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">57</span><span class="cl"> <span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">58</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">,</span> <span class="n">neighbors</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">59</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="s2">&#34;distance&#34;</span><span class="p">,</span> <span class="n">distances</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">60</span><span class="cl"> <span class="o">.</span><span class="n">explode</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">,</span> <span class="s2">&#34;distance&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">61</span><span class="cl"> <span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">62</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;from&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">63</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">title_name_map</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">64</span><span class="cl"> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">65</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">faiss_indices_map</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">66</span><span class="cl"> <span class="o">.</span><span class="n">map_dict</span><span class="p">(</span><span class="n">title_name_map</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">67</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;from&#34;</span><span class="p">)</span> <span class="o">!=</span> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;to&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">68</span><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="ln">69</span><span class="cl">
</span></span><span class="line"><span class="ln">70</span><span class="cl"><span class="c1"># Code to extract components from a Network Graph</span>
</span></span><span class="line"><span class="ln">71</span><span class="cl"><span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="nn">nx</span>
</span></span><span class="line"><span class="ln">72</span><span class="cl"><span class="k">def</span> <span class="nf">get_cluster_map</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">min_cosine_distance</span><span class="o">=</span><span class="mf">0.95</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">73</span><span class="cl"> <span class="n">edge_list</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln">74</span><span class="cl"> <span class="n">get_edge_list</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">75</span><span class="cl"> <span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;distance&#34;</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">min_cosine_distance</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">76</span><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="ln">77</span><span class="cl"> <span class="n">graph</span> <span class="o">=</span> <span class="n">nx</span><span class="o">.</span><span class="n">from_pandas_edgelist</span><span class="p">(</span><span class="n">edge_list</span><span class="o">.</span><span class="n">to_pandas</span><span class="p">(),</span> <span class="n">source</span><span class="o">=</span><span class="s2">&#34;from&#34;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s2">&#34;to&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">78</span><span class="cl"> <span class="k">return</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">nx</span><span class="o">.</span><span class="n">connected_components</span><span class="p">(</span><span class="n">graph</span><span class="p">))}</span>
</span></span><span class="line"><span class="ln">79</span><span class="cl">
</span></span><span class="line"><span class="ln">80</span><span class="cl"><span class="c1"># Example call to a single category to obtain its clusters</span>
</span></span><span class="line"><span class="ln">81</span><span class="cl"><span class="n">clusters</span> <span class="o">=</span> <span class="n">get_cluster_map</span><span class="p">(</span><span class="s2">&#34;Cell Phones Accessories&#34;</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">82</span><span class="cl"><span class="c1"># Example call to **all** categories to obtain all clusters</span>
</span></span><span class="line"><span class="ln">83</span><span class="cl"><span class="n">clusters</span> <span class="o">=</span> <span class="p">[</span><span class="n">get_cluster_map</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">get_column</span><span class="p">(</span><span class="s2">&#34;category_label&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">unique</span><span class="p">()]</span></span></span></code></pre></div><h2 id="how-the-code-works">How the code works</h2>
<p>Written out as an algorithm, the approach is:</p>
<ol>
<li>Obtain a 2D representation of the labelled/categorized data. This can be embeddings for strings, the final hidden state output from a generic CNN model for images, or a good ol&rsquo; tabular dataset where all numbers are normalized and can be expressed as such.</li>
<li>Create an ANN database (based on a package such as <code>faiss</code>) that allows fast nearest-neighbor searches. Use cosine similarity, which makes the later thresholding step easier.</li>
<li>Obtain an edge-list of k (from 5 to 100) nearest neighbors for <strong>all</strong> (or a sample of data points in case your dataset is incredibly HUGE) data points in the ANN database.</li>
<li>Apply a minimum threshold on similarity (completely based on heuristics), and obtain the connected components of the network graph from the filtered edge-list you just created.</li>
<li>Map all indices back to their source data points, and pick any number of items from each cluster (I usually pick one element per cluster). You now have your representative sample!</li>
</ol>
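<p>The steps above can be sketched end to end on toy data. This is a minimal, standard-library-only illustration and not the code from this post: the brute-force neighbor search stands in for the <code>faiss</code> index, a small union-find stands in for <code>networkx</code>, and the names <code>representative_sample</code>, <code>toy_vectors</code> and <code>min_similarity</code> are made up for the example.</p>

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def representative_sample(vectors, k=2, min_similarity=0.95):
    """Return one index per cluster of near-duplicate vectors."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest: one tree per connected component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(n):
        # Steps 2-3: brute-force k nearest neighbours (stand-in for the ANN index).
        nearest = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i),
            reverse=True,
        )[:k]
        # Step 4: keep only edges above the threshold and merge their endpoints.
        for similarity, j in nearest:
            if similarity >= min_similarity:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    # Step 5: take the first member of each cluster as its representative.
    return sorted(members[0] for members in clusters.values())


toy_vectors = [
    [1.0, 0.0],    # 0
    [0.99, 0.05],  # 1: near-duplicate of 0
    [0.0, 1.0],    # 2
    [0.05, 0.99],  # 3: near-duplicate of 2
    [0.7, 0.7],    # 4: not close enough to anything
]
sample = representative_sample(toy_vectors)
print(sample)
```

<p>Each pair of near-duplicates collapses into one cluster, and the vector that is not similar enough to anything stays in a cluster of its own, so the sample keeps one item from each.</p>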
]]></content:encoded></item><item><title>Overlap Joins: Number of docked trucks in an interval</title><link>https://avimallu.dev/blog/001_overlap_joins/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://avimallu.dev/blog/001_overlap_joins/</guid><description>&lt;h1 id="premise">Premise&lt;/h1>
&lt;p>I stumbled upon an interesting &lt;a href="https://stackoverflow.com/questions/76488314/polars-count-unique-values-over-a-time-period">Stack Overflow question&lt;/a> that was linked &lt;a href="https://github.com/pola-rs/polars/issues/9467">via an issue&lt;/a> on the Polars GitHub repo. The OP asked for a pure Polars solution. At the time I answered the question, Polars did not support non-equi joins, and any pure Polars solution would have been pretty cumbersome.&lt;/p>
&lt;p>I&amp;rsquo;m more of a right-tool-for-the-job person, so I tried to find a better solution.&lt;/p>
&lt;h1 id="problem-statement">Problem Statement&lt;/h1>
&lt;p>Suppose we have a dataset that captures the arrival and departure times of trucks at a station, along with the truck&amp;rsquo;s ID.&lt;/p></description><content:encoded><![CDATA[<h1 id="premise">Premise</h1>
<p>I stumbled upon an interesting <a href="https://stackoverflow.com/questions/76488314/polars-count-unique-values-over-a-time-period">Stack Overflow question</a> that was linked <a href="https://github.com/pola-rs/polars/issues/9467">via an issue</a> on the Polars GitHub repo. The OP asked for a pure Polars solution. At the time I answered the question, Polars did not support non-equi joins, and any pure Polars solution would have been pretty cumbersome.</p>
<p>I&rsquo;m more of a right-tool-for-the-job person, so I tried to find a better solution.</p>
<h1 id="problem-statement">Problem Statement</h1>
<p>Suppose we have a dataset that captures the arrival and departure times of trucks at a station, along with the truck&rsquo;s ID.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span> <span class="c1"># if you don&#39;t have polars, run </span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"> <span class="c1"># pip install &#39;polars[all]&#39;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">from_repr</span><span class="p">(</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2">┌─────────────────────┬─────────────────────┬─────┐
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s2">│ arrival_time ┆ departure_time ┆ ID │
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s2">│ --- ┆ --- ┆ --- │
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s2">│ datetime[μs] ┆ datetime[μs] ┆ str │
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s2">╞═════════════════════╪═════════════════════╪═════╡
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2">│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2">│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2">│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2">│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2">│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2">│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s2">│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s2">│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s2">└─────────────────────┴─────────────────────┴─────┘
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="p">)</span></span></span></code></pre></div><p>We want to identify the number of trucks docked at any given time within a threshold of 1 minute <em>prior</em> to the arrival time of a truck, and 1 minute <em>after</em> the departure of a truck. Equivalently, this means that we need to calculate the number of trucks within a specific window for each row of the data.</p>
<h1 id="finding-a-solution-to-the-problem">Finding a solution to the problem</h1>
<h2 id="evaluate-for-a-specific-row">Evaluate for a specific row</h2>
<p>Before we find a general solution to this problem, let&rsquo;s consider a specific row to understand the problem better:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">1</span><span class="cl"><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s2">┌─────────────────────┬─────────────────────┬─────┐
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">│ arrival_time ┆ departure_time ┆ ID │
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">│ --- ┆ --- ┆ --- │
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">│ datetime[μs] ┆ datetime[μs] ┆ str │
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s2">╞═════════════════════╪═════════════════════╪═════╡
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s2">│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="s2">└─────────────────────┴─────────────────────┴─────┘
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span></span></span></code></pre></div><p>For this row, we need to find the number of trucks that are there between <code>2023-01-01 06:31:06</code> (1 minute prior to the <code>arrival_time</code> and <code>2023-01-01 06:34:48</code> (1 minute post the <code>departure_time</code>). Manually going through the original dataset, we see that <code>B3</code>, <code>C3</code>, <code>A6</code> and <code>A5</code> are the truck IDs that qualify - they all are at the station in a duration that is between <code>2023-01-01 06:31:06</code> and <code>2023-01-01 06:34:48</code>.</p>
<h2 id="visually-deriving-an-algorithm">Visually deriving an algorithm</h2>
<p>A truck can fall inside the overlap window defined by a particular row in several ways. For the example above, the cases look like this (the visualization generalizes, because the overlap <em>window</em> for any row is easy to compute from its arrival and departure times):</p>
<p><img src="/blog/001_overlap_joins/overlap_algorithm.png" alt="The five different ways a period can overlap."></p>
<p>Take some time to absorb these cases - it&rsquo;s important for the part where we write the code for the solution. Note that we need to actually tell our algorithm to filter only for Cases 2, 3 and 4, since Cases 1 and 5 will not satisfy our requirements.</p>
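<p>One way to express this filter is as a pair of comparisons. The sketch below uses plain integers instead of timestamps; the function name <code>overlaps</code> is made up, and the textual labels are my own descriptions of overlap situations rather than the case numbers from the figure. It also includes an interval that spans the whole window, which is what truck <code>A5</code> does in the worked row above.</p>

```python
def overlaps(start, end, window_open, window_close):
    """True when [start, end] shares at least one instant with the window."""
    # An interval misses the window only if it ended before the window opened
    # or started after the window closed. Negating that leaves two comparisons.
    return start <= window_close and end >= window_open


window = (10, 20)
examples = {
    "ends before the window opens": (2, 8),
    "straddles the window opening": (8, 12),
    "lies inside the window": (12, 18),
    "straddles the window closing": (18, 25),
    "starts after the window closes": (22, 30),
    "spans the whole window": (5, 25),
}
results = {name: overlaps(*interval, *window) for name, interval in examples.items()}
for name, hit in results.items():
    print(f"{name}: {hit}")
```

<p>Only the intervals that end before the window opens or start after it closes are rejected; every other shape passes the same two comparisons.</p>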
<h2 id="writing-an-sql-query-based-on-the-algorithm">Writing an SQL query based on the algorithm</h2>
<p>In theory, we can use any language capable of expressing the rules outlined in the section above. Why choose SQL? It often conveys the logic behind an algorithm elegantly, and while it can be excessively verbose at times, it isn&rsquo;t in this case.</p>
<p>Note here that we run SQL in Python with almost no setup or boilerplate code - so this is a Python-based solution as well (although not quite Pythonic!).</p>
<h3 id="introducing-the-duckdb-package">Introducing the DuckDB package</h3>
<p>Once again, in theory, any SQL package or language can be used. Few, however, match the ease of use that <a href="https://duckdb.org/">DuckDB</a> provides:</p>
<ol>
<li>no expensive set-up time (meaning no need for setting up databases, even temporary ones),</li>
<li>no dependencies (other than DuckDB itself, just <code>pip install duckdb</code>),</li>
<li>some very <a href="https://duckdb.org/2022/05/04/friendlier-sql.html">friendly SQL extensions</a>, and</li>
<li>ability to work directly on Polars and Pandas DataFrames without conversions</li>
</ol>
<p>all with <a href="https://duckdblabs.github.io/db-benchmark/">mind-blowing speed</a> that stands shoulder-to-shoulder with Polars. We&rsquo;ll also use a few advanced SQL concepts noted below.</p>
<h4 id="self-joins">Self-joins</h4>
<p>This should be a familiar, albeit not often used concept - a join of a table with itself is a self join. There are few cases where such an operation would make sense, and this happens to be one of them.</p>
<h4 id="a-bullet-train-recap-of-non-equi-joins">A bullet train recap of non-equi joins</h4>
<p>A key concept that we&rsquo;ll use is the idea of joining on a <em>range</em> of values rather than a specific value. That is, instead of the usual <code>LEFT JOIN ON A.column = B.column</code>, we can do <code>LEFT JOIN ON A.column &lt;= B.column</code> so that one row in table <code>A</code> matches multiple rows in <code>B</code>. DuckDB has a <a href="https://duckdb.org/2022/05/27/iejoin.html">blog post</a> that outlines this join in detail, including a fast implementation.</p>
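<p>To see a non-equi join produce several matches per row, here is a tiny sketch. It uses Python&rsquo;s built-in <code>sqlite3</code> module rather than DuckDB, purely so that it runs without installing anything; the tables <code>A</code> and <code>B</code> and their columns are made up for the example.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (a_val INTEGER);
    CREATE TABLE B (b_val INTEGER);
    INSERT INTO A VALUES (1), (3);
    INSERT INTO B VALUES (1), (2), (3);
""")

# Join on an inequality: each row of A pairs with every row of B that is >= it.
rows = con.execute("""
    SELECT A.a_val, B.b_val
    FROM A
    LEFT JOIN B ON A.a_val <= B.b_val
    ORDER BY A.a_val, B.b_val
""").fetchall()
con.close()
print(rows)
```

<p>The row with <code>a_val = 1</code> matches all three rows of <code>B</code>, while the row with <code>a_val = 3</code> matches only one, which is exactly the one-to-many behavior an overlap join relies on.</p>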
<h4 id="the-concept-of-list-columns">The concept of <code>LIST</code> columns</h4>
<p>DuckDB has first-class support for <code>LIST</code> columns - that is, each row in a <code>LIST</code> column can have a varying length (much like a Python <code>list</code>), but must have the exact same datatype (like R&rsquo;s <code>vector</code>). Using list columns allows us to eschew an additional <code>GROUP BY</code> operation on top of a <code>WHERE</code> filter or <code>SELECT DISTINCT</code> operation, since we can perform those directly on the <code>LIST</code> column itself.</p>
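<p>As a rough mental model, aggregating into a list and then working on that list looks like this in plain Python. It is only an analogy for DuckDB&rsquo;s <code>LIST</code>, <code>LIST_DISTINCT</code> and <code>LIST_UNIQUE</code> functions, not how DuckDB implements them; the <code>matches</code> data is invented, and the sorting is added here only to make the output deterministic.</p>

```python
# Each pair is (row that defines a window, ID of a truck found inside it).
matches = [("row1", "A5"), ("row1", "A6"), ("row1", "A5"), ("row2", "B3")]

# Collect the IDs per row, duplicates included (roughly what LIST(...) does).
grouped = {}
for row_key, truck_id in matches:
    grouped.setdefault(row_key, []).append(truck_id)

# De-duplicate each list (roughly LIST_DISTINCT) ...
docked_trucks = {key: sorted(set(ids)) for key, ids in grouped.items()}
# ... or just count the distinct values in it (roughly LIST_UNIQUE).
docked_truck_count = {key: len(set(ids)) for key, ids in grouped.items()}

print(docked_trucks)
print(docked_truck_count)
```

<p>The point is that de-duplication and counting happen per row, on the list itself, with no second grouping pass.</p>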
<h4 id="date-algebra">Date algebra</h4>
<p>Dates can be rather difficult to handle well in most tools and languages, with several packages purpose-built to make handling them easier - <a href="https://lubridate.tidyverse.org/">lubridate</a> from the <a href="https://www.tidyverse.org/">tidyverse</a> is a stellar example. Thankfully, DuckDB provides a similar Swiss Army knife set of tools to deal with them, including <code>INTERVAL</code>s (a special data type that represents a period of time independent of specific time values), which can be added to or subtracted from <code>TIMESTAMP</code> values.</p>
<h3 id="tell-me-the-query-please">Tell me the query, PLEASE!</h3>
<p>Okay - that was a lot of background. Let&rsquo;s have at it! The query by itself in SQL is (see immediately below for runnable code in Python):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">arrival_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">DATEDIFF</span><span class="p">(</span><span class="s1">&#39;seconds&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">arrival_time</span><span class="p">,</span><span class="w"> </span><span class="n">departure_time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">duration</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">B</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>A small, succinct query such as this will need a bit of explanation to take it all in. Here&rsquo;s one below, reproducible in Python (make sure to install <code>duckdb</code> first!). Expand it to view.</p>
<details markdown="1"><summary>SQL with explanation.</summary>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">duckdb</span> <span class="k">as</span> <span class="nn">db</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">db</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s2"> SELECT
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2"> A.arrival_time
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s2"> ,A.departure_time
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s2"> ,A.window_open
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s2"> ,A.window_close
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s2"> -- LIST aggregates the values into a LIST column
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2"> -- and LIST_DISTINCT finds the unique values in it
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2"> ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2"> -- finally, LIST_UNIQUE counts the number of unique values in it
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2"> ,LIST_UNIQUE(LIST(B.ID)) AS docked_truck_count
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2"> FROM (
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2"> SELECT
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s2"> *
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s2"> ,arrival_time - (INTERVAL 1 MINUTE) AS window_open
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s2"> ,departure_time + (INTERVAL 1 MINUTE) AS window_close
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s2"> FROM data -- remember we defined data as the Polars DataFrame with our truck station data
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="s2"> ) A
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="s2"> LEFT JOIN (
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="s2"> SELECT
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="s2"> *
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="s2"> -- This is the time, in seconds between the arrival and departure of
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="s2"> -- each truck PER ROW in the original data-frame
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="s2"> ,DATEDIFF(&#39;seconds&#39;, arrival_time, departure_time) AS duration
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="s2"> FROM data -- this is where we perform a self-join
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="s2"> ) B
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="s2"> ON (
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="s2"> -- Case 2 in the diagram;
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="s2"> (B.arrival_time &lt;= A.window_open AND
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="s2"> -- Adding the duration here makes sure that the second interval
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="s2"> -- is at least ENDING AFTER the start of the overlap window
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="s2"> (B.arrival_time + TO_SECONDS(B.duration)) &gt;= A.window_open) OR
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="s2"> -- Case 3 in the diagram - the simplest of all five cases
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="s2"> (B.arrival_time &gt;= A.window_open AND
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="s2"> B.departure_time &lt;= A.window_close) OR
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="s2"> -- Case 4 in the diagram;
</span></span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="s2"> (B.arrival_time &gt;= A.window_open AND
</span></span></span><span class="line"><span class="ln">44</span><span class="cl"><span class="s2"> -- Subtracting the duration here makes sure that the second interval
</span></span></span><span class="line"><span class="ln">45</span><span class="cl"><span class="s2"> -- STARTS BEFORE the end of the overlap window.
</span></span></span><span class="line"><span class="ln">46</span><span class="cl"><span class="s2"> (B.departure_time - TO_SECONDS(B.duration)) &lt;= A.window_close)
</span></span></span><span class="line"><span class="ln">47</span><span class="cl"><span class="s2"> )
</span></span></span><span class="line"><span class="ln">48</span><span class="cl"><span class="s2"> GROUP BY 1, 2, 3, 4
</span></span></span><span class="line"><span class="ln">49</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="p">)</span></span></span></code></pre></div></details>
<p>The output of this query is:</p>
<pre tabindex="0"><code>&#34;&#34;&#34;
┌─────────────────────┬─────────────────────┬─────────────────────┬───┬──────────────────┬────────────────────┐
│ arrival_time │ departure_time │ window_open │ … │ docked_trucks │ docked_truck_count │
│ timestamp │ timestamp │ timestamp │ │ varchar[] │ uint64 │
├─────────────────────┼─────────────────────┼─────────────────────┼───┼──────────────────┼────────────────────┤
│ 2023-01-01 06:23:47 │ 2023-01-01 06:25:08 │ 2023-01-01 06:22:47 │ … │ [A1] │ 1 │
│ 2023-01-01 06:26:42 │ 2023-01-01 06:28:02 │ 2023-01-01 06:25:42 │ … │ [A1] │ 1 │
│ 2023-01-01 06:30:20 │ 2023-01-01 06:35:01 │ 2023-01-01 06:29:20 │ … │ [B3, C3, A6, A5] │ 4 │
│ 2023-01-01 06:32:06 │ 2023-01-01 06:33:48 │ 2023-01-01 06:31:06 │ … │ [B3, C3, A6, A5] │ 4 │
│ 2023-01-01 06:33:09 │ 2023-01-01 06:36:01 │ 2023-01-01 06:32:09 │ … │ [B3, C3, A6, A5] │ 4 │
│ 2023-01-01 06:34:08 │ 2023-01-01 06:39:49 │ 2023-01-01 06:33:08 │ … │ [B3, C3, A6, A5] │ 4 │
│ 2023-01-01 06:36:40 │ 2023-01-01 06:38:34 │ 2023-01-01 06:35:40 │ … │ [A5, A6, C3, B3] │ 4 │
│ 2023-01-01 06:37:43 │ 2023-01-01 06:40:48 │ 2023-01-01 06:36:43 │ … │ [A5, A6, C3] │ 3 │
│ 2023-01-01 06:39:48 │ 2023-01-01 06:46:10 │ 2023-01-01 06:38:48 │ … │ [A6, A5, C3] │ 3 │
├─────────────────────┴─────────────────────┴─────────────────────┴───┴──────────────────┴────────────────────┤
│ 9 rows 6 columns (5 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
&#34;&#34;&#34;</code></pre><p>We clearly see the strengths of DuckDB in how succinctly we were able to express this operation. We also see how DuckDB integrates seamlessly with an existing Pandas or Polars pipeline with zero conversion costs. In fact, we can convert the result back to a Polars or Pandas DataFrame by calling <code>db.query(...).pl()</code> or <code>db.query(...).df()</code> respectively.</p>
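<p>If the SQL still feels dense, a plain Python sketch of the same join logic may help. It is not what DuckDB runs internally; it is a nested loop over a few hypothetical rows (the IDs and times below are made up for illustration), with the three join conditions spelled out. Each row is treated as its own group, which assumes no two rows share the same arrival and departure times.</p>

```python
from datetime import datetime, timedelta

def t(hhmmss):
    """Build a timestamp on an arbitrary day from an HH:MM:SS string."""
    return datetime.strptime("2023-01-01 " + hhmmss, "%Y-%m-%d %H:%M:%S")

# Hypothetical rows: (ID, arrival_time, departure_time).
rows = [
    ("T1", t("06:00:00"), t("06:05:00")),
    ("T2", t("06:05:30"), t("06:10:00")),
    ("T3", t("06:30:00"), t("06:40:00")),
]

one_minute = timedelta(minutes=1)

def docked_trucks(rows):
    result = []
    for _, a_arr, a_dep in rows:            # table A
        window_open = a_arr - one_minute
        window_close = a_dep + one_minute
        ids = []
        for b_id, b_arr, b_dep in rows:     # table B (the self-join)
            # arrival + duration is simply the departure time
            case_2 = b_arr <= window_open and b_dep >= window_open
            case_3 = b_arr >= window_open and b_dep <= window_close
            # departure - duration is simply the arrival time
            case_4 = b_arr >= window_open and b_arr <= window_close
            if (case_2 or case_3 or case_4) and b_id not in ids:
                ids.append(b_id)            # like LIST_DISTINCT(LIST(B.ID))
        result.append((a_arr, a_dep, ids, len(ids)))
    return result

for a_arr, a_dep, ids, count in docked_trucks(rows):
    print(a_arr.time(), a_dep.time(), ids, count)
```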
<h2 id="can-we-make-the-sql-simpler">Can we make the SQL simpler?</h2>
<p>Now that we&rsquo;ve understood the logic that goes into the query, let&rsquo;s try to optimize the algorithm. We have the three conditions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Case 2 in the diagram
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Case 3 in the diagram
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Case 4 in the diagram
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span></span></span></code></pre></div><p>What is common between these three conditions? It takes a while to see it, but all three cases require truck B&rsquo;s stay to start <em>before</em> the window closes and to end <em>after</em> the window opens. Since <code>B.arrival_time + TO_SECONDS(B.duration)</code> is just <code>B.departure_time</code>, and <code>B.departure_time - TO_SECONDS(B.duration)</code> is just <code>B.arrival_time</code>, the three conditions collapse to:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span></span></span></code></pre></div><p>making our query much simpler!</p>
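<p>If you would rather not take that on faith, a brute-force check is cheap. The sketch below uses small integers as stand-ins for timestamps and compares the three-case condition with the simplified one over every valid combination. It assumes a truck never departs before it arrives and a window never closes before it opens; the function names are mine, not part of DuckDB.</p>

```python
from itertools import product

def three_cases(b_arr, b_dep, w_open, w_close):
    """The original join condition, with the duration arithmetic folded in."""
    case_2 = b_arr <= w_open and b_dep >= w_open
    case_3 = b_arr >= w_open and b_dep <= w_close
    case_4 = b_arr >= w_open and b_arr <= w_close
    return case_2 or case_3 or case_4

def simplified(b_arr, b_dep, w_open, w_close):
    """The two-comparison overlap test."""
    return b_arr <= w_close and b_dep >= w_open

def conditions_agree(limit=8):
    """Compare both conditions on every valid combination of small integers."""
    for b_arr, b_dep, w_open, w_close in product(range(limit), repeat=4):
        if b_arr > b_dep or w_open > w_close:
            continue  # skip stays or windows that end before they start
        if three_cases(b_arr, b_dep, w_open, w_close) != simplified(
            b_arr, b_dep, w_open, w_close
        ):
            return False
    return True

print(conditions_agree())
```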
<h3 id="simplified-sql-part-1">Simplified SQL: Part 1</h3>
<p>We&rsquo;ve removed the need for the <code>duration</code> calculation altogether now. Therefore, we can write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">arrival_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Can we simplify this even further?</p>
<h3 id="simplification-part-2">Simplification: Part 2</h3>
<p>I think the SQL query in the above section is already very easy to read. However, it is a little clunky overall, and we can lean on DuckDB&rsquo;s extensive optimizations to improve <strong>legibility</strong> by rewriting the query as a cross join:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">arrival_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">window_close</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">window_open</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Why does this work? Before optimization on DuckDB, this is what the query plan looks like:</p>
<details markdown="1"><summary>DuckDB query plan before optimization</summary>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s2">┌───────────────────────────┐
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s2">│ PROJECTION │
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s2">│ 0 │
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s2">│ 1 │
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s2">│ 2 │
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s2">│ 3 │
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2">│ docked_trucks │
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2">│ docked_truck_count │
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2">└─────────────┬─────────────┘
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">│ AGGREGATE │
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2">│ arrival_time │
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s2">│ departure_time │
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s2">│ window_open │
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s2">│ window_close │
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s2">│ list(ID) │
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="s2">└─────────────┬─────────────┘
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="s2">│ FILTER │
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="s2">│ (arrival_time &lt;= │
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="s2">│(departure_time + to_m... │
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="s2">│ AS BIGINT)))) │
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="s2">│ (departure_time &gt;= │
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="s2">│(arrival_time - to_min... │
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="s2">│ AS BIGINT)))) │
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="s2">└─────────────┬─────────────┘
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="s2">│ CROSS_PRODUCT ├──────────────┐
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="s2">└─────────────┬─────────────┘ │
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="s2">│ ARROW_SCAN ││ ARROW_SCAN │
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="s2">└───────────────────────────┘└───────────────────────────┘
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span> </span></span></code></pre></div></details>
<p>After optimization, the <code>CROSS_PRODUCT</code> is <strong>automatically</strong> optimized to an <strong>interval join</strong>!</p>
<details markdown="1"><summary>DuckDB query plan after optimization</summary>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s2">┌───────────────────────────┐
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s2">│ PROJECTION │
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s2">│ 0 │
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s2">│ 1 │
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s2">│ 2 │
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s2">│ 3 │
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2">│ docked_trucks │
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2">│ docked_truck_count │
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2">└─────────────┬─────────────┘
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">│ AGGREGATE │
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2">│ arrival_time │
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s2">│ departure_time │
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s2">│ window_open │
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s2">│ window_close │
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s2">│ list(ID) │
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="s2">└─────────────┬─────────────┘
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="s2">│ COMPARISON_JOIN │
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="s2">│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="s2">│ INNER │
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="s2">│ ((departure_time + &#39;00:01 │
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="s2">│ :00&#39;::INTERVAL) &gt;= ├──────────────┐
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="s2">│ arrival_time) │ │
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="s2">│((arrival_time - &#39;00:01:00&#39;│ │
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="s2">│ ::INTERVAL) &lt;= │ │
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="s2">│ departure_time) │ │
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="s2">└─────────────┬─────────────┘ │
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="s2">┌─────────────┴─────────────┐┌─────────────┴─────────────┐
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="s2">│ ARROW_SCAN ││ ARROW_SCAN │
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="s2">└───────────────────────────┘└───────────────────────────┘
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span> </span></span></code></pre></div></details>
<p>So in effect, we&rsquo;re exploiting a feature of DuckDB: we write the query in a suboptimal but more readable way, and let the optimizer do a good chunk of the work for us. I wouldn&rsquo;t recommend relying on this in general, because not every SQL engine&rsquo;s optimizer will find an efficient route to these calculations for large datasets.</p>
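<p>To make the unoptimized plan concrete, here&rsquo;s a rough pure-Python sketch of what &ldquo;cross product, then filter, then aggregate&rdquo; amounts to on the sample data used in this post. The function name <code>docked_trucks_naive</code> and the <code>pad</code> parameter are made up for illustration, and this is not how DuckDB executes the query internally. It only mirrors the logic, one output row per input row.</p>

```python
from datetime import datetime, timedelta


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# Sample rows from this post: (arrival_time, departure_time, ID).
rows = [
    (parse("2023-01-01 06:23:47"), parse("2023-01-01 06:25:08"), "A1"),
    (parse("2023-01-01 06:26:42"), parse("2023-01-01 06:28:02"), "A1"),
    (parse("2023-01-01 06:30:20"), parse("2023-01-01 06:35:01"), "A5"),
    (parse("2023-01-01 06:32:06"), parse("2023-01-01 06:33:48"), "A6"),
    (parse("2023-01-01 06:33:09"), parse("2023-01-01 06:36:01"), "B3"),
    (parse("2023-01-01 06:34:08"), parse("2023-01-01 06:39:49"), "C3"),
    (parse("2023-01-01 06:36:40"), parse("2023-01-01 06:38:34"), "A6"),
    (parse("2023-01-01 06:37:43"), parse("2023-01-01 06:40:48"), "A5"),
    (parse("2023-01-01 06:39:48"), parse("2023-01-01 06:46:10"), "A6"),
]


def docked_trucks_naive(rows, pad=timedelta(minutes=1)):
    """Hypothetical helper: compare every row A against every row B."""
    result = []
    for a_arrival, a_departure, _ in rows:
        window_open = a_arrival - pad
        window_close = a_departure + pad
        # The inner loop is the cross product plus filter: n * n comparisons.
        ids = {
            b_id
            for b_arrival, b_departure, b_id in rows
            if b_arrival <= window_close and b_departure >= window_open
        }
        result.append((a_arrival, a_departure, sorted(ids)))
    return result


counts = [len(ids) for _, _, ids in docked_trucks_naive(rows)]
print(counts)  # [1, 1, 4, 4, 4, 4, 4, 3, 3]
```

<p>The nested loop is exactly the part that grows quadratically with the number of rows, which is why an engine that cannot rewrite it into an interval join will struggle on large inputs.</p>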
<h3 id="how-to-get-query-plans">How to get query plans?</h3>
<p>I&rsquo;m glad you asked. Here&rsquo;s the DuckDB <a href="https://duckdb.org/docs/guides/meta/explain.html">page explaining <code>EXPLAIN</code></a> (heh). Here&rsquo;s the code I used:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">duckdb</span> <span class="k">as</span> <span class="nn">db</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">db</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&#34;SET EXPLAIN_OUTPUT=&#39;all&#39;;&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2">EXPLAIN
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s2">SELECT
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s2"> A.arrival_time
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s2"> ,A.departure_time
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s2"> ,A.arrival_time - (INTERVAL 1 MINUTE) AS window_open
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2"> ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2"> ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2"> ,LIST_UNIQUE(LIST(B.ID)) AS docked_truck_count
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2">FROM data A, data B
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">WHERE B.arrival_time &lt;= window_close
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2">AND B.departure_time &gt;= window_open
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2">GROUP BY 1, 2, 3, 4
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">pl</span><span class="p">()[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span></span></span></code></pre></div><h1 id="what-are-the-alternatives">What are the alternatives?</h1>
<h2 id="the-datatable-way">The <code>data.table</code> way</h2>
<p><a href="https://github.com/Rdatatable/data.table"><code>data.table</code></a> is a package that has historically been ahead of its time, in both speed and features. Development has taken a hit recently, but will likely <a href="https://github.com/Rdatatable/data.table/issues/5656">pick back up</a>. It&rsquo;s my favourite package on all fronts for data manipulation, but it suffers from the lack of broader R support across the ML and DL space.</p>
<h3 id="the-foverlaps-function">The <code>foverlaps</code> function</h3>
<p>If this kind of overlapping join is common, shouldn&rsquo;t someone have developed a package for it? Turns out, <code>data.table</code> has, and with very specific constraints that make it the perfect solution to our problem (if you don&rsquo;t mind switching over to R, that is).</p>
<p>The <code>foverlaps</code> function has these requirements:</p>
<ol>
<li>The input <code>data.table</code> objects have to be keyed for automatic recognition of columns.</li>
<li>The default match type (<code>any</code>) matches all three cases from the image above. Side note: it also supports <code>within</code> overlaps, as well as matching only the <code>start</code> or only the <code>end</code> of the windows.</li>
<li>The last two matching columns in the join condition (<code>by.x</code> and <code>by.y</code>) must specify the <code>start</code> and <code>end</code> points of the overlapping window. This isn&rsquo;t a problem for us now, but it does restrict future uses where we may want non-equi joins in other cases.</li>
</ol>
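<p>If the match types feel abstract, here&rsquo;s a small Python sketch of how I read them, written as plain predicates on two intervals. The function <code>overlaps</code> and its argument names are invented for this illustration and are not part of <code>data.table</code>; treat the semantics as my interpretation of the documentation rather than the package&rsquo;s implementation.</p>

```python
def overlaps(x_start, x_end, y_start, y_end, match_type="any"):
    """Hypothetical predicate: does interval x match interval y?

    Assumes x_start <= x_end and y_start <= y_end.
    """
    if match_type == "any":
        # Any overlap at all, including partial overlap on either side.
        return x_start <= y_end and x_end >= y_start
    if match_type == "within":
        # x lies entirely inside y.
        return x_start >= y_start and x_end <= y_end
    if match_type == "start":
        # Only the start points need to coincide.
        return x_start == y_start
    if match_type == "end":
        # Only the end points need to coincide.
        return x_end == y_end
    raise ValueError(f"unknown match_type: {match_type!r}")


print(overlaps(1, 5, 4, 8))                       # partial overlap
print(overlaps(5, 6, 4, 8, match_type="within"))  # x inside y
```

<p>Our truck problem only needs the <code>any</code> case: a truck counts as docked if its stay touches the padded window at all.</p>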
<h3 id="the-code-_si_-the-code">The code, <em>si</em>, the code!</h3>
<p>Without further ado:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">data.table</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1">######### BOILERPLATE CODE, NO LOGIC HERE ####################</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">arrival_time</span> <span class="o">=</span> <span class="nf">as_datetime</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="s">&#39;2023-01-01 06:23:47.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:26:42.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="s">&#39;2023-01-01 06:30:20.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:32:06.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="s">&#39;2023-01-01 06:33:09.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:34:08.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="s">&#39;2023-01-01 06:36:40.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:37:43.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"> <span class="s">&#39;2023-01-01 06:39:48.000000&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">departure_time</span> <span class="o">=</span> <span class="nf">as_datetime</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"> <span class="s">&#39;2023-01-01 06:25:08.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:28:02.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"> <span class="s">&#39;2023-01-01 06:35:01.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:33:48.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"> <span class="s">&#39;2023-01-01 06:36:01.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:39:49.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"> <span class="s">&#39;2023-01-01 06:38:34.000000&#39;</span><span class="p">,</span> <span class="s">&#39;2023-01-01 06:40:48.000000&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"> <span class="s">&#39;2023-01-01 06:46:10.000000&#39;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">ID</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#39;A1&#39;</span><span class="p">,</span> <span class="s">&#39;A1&#39;</span><span class="p">,</span> <span class="s">&#39;A5&#39;</span><span class="p">,</span> <span class="s">&#39;A6&#39;</span><span class="p">,</span> <span class="s">&#39;B3&#39;</span><span class="p">,</span> <span class="s">&#39;C3&#39;</span><span class="p">,</span> <span class="s">&#39;A6&#39;</span><span class="p">,</span> <span class="s">&#39;A5&#39;</span><span class="p">,</span> <span class="s">&#39;A6&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">DT</span> <span class="o">=</span> <span class="nf">data.table</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"> <span class="n">arrival_time</span> <span class="o">=</span> <span class="n">arrival_time</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"> <span class="n">departure_time</span> <span class="o">=</span> <span class="n">departure_time</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"> <span class="n">ID</span> <span class="o">=</span> <span class="n">ID</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="c1">######### BOILERPLATE CODE, NO LOGIC HERE ####################</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># A copy(DT) creates a copy of a data.table that isn&#39;t linked</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"># to the original one, so that changes in it don&#39;t reflect in</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="c1"># the original DT object.</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1"># The `:=` allows assignment by reference (i.e. &#34;in place&#34;).</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="n">DT_with_windows</span> <span class="o">=</span> <span class="nf">copy</span><span class="p">(</span><span class="n">DT</span><span class="p">)</span><span class="n">[</span><span class="p">,</span> <span class="nf">`:=`</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"> <span class="n">window_start</span> <span class="o">=</span> <span class="n">arrival_time</span> <span class="o">-</span> <span class="nf">minutes</span><span class="p">(</span><span class="m">1</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl"> <span class="n">window_end</span> <span class="o">=</span> <span class="n">departure_time</span> <span class="o">+</span> <span class="nf">minutes</span><span class="p">(</span><span class="m">1</span><span class="p">))</span><span class="n">]</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="c1"># This step is necessary for the second table, but not the first, but we</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"># key both data.tables to make the foverlaps code very succinct.</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="nf">setkeyv</span><span class="p">(</span><span class="n">DT</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;arrival_time&#34;</span><span class="p">,</span> <span class="s">&#34;departure_time&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="nf">setkeyv</span><span class="p">(</span><span class="n">DT_with_windows</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;window_start&#34;</span><span class="p">,</span> <span class="s">&#34;window_end&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">
</span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="c1"># The foverlaps function returns a data.table, so we can simply apply</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="c1"># the usual data.table syntax on it!</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="c1"># Since we have the same name of some columns in both data.tables,</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="c1"># the latter table&#39;s columns are prefixed with &#34;i.&#34; to avoid conflicts.</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="nf">foverlaps</span><span class="p">(</span><span class="n">DT</span><span class="p">,</span> <span class="n">DT_with_windows</span><span class="p">)</span><span class="n">[</span>
</span></span><span class="line"><span class="ln">43</span><span class="cl"> <span class="p">,</span> <span class="n">.(docked_trucks</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">unique</span><span class="p">(</span><span class="n">i.ID</span><span class="p">)),</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl"> <span class="n">docked_truck_count</span> <span class="o">=</span> <span class="nf">uniqueN</span><span class="p">(</span><span class="n">i.ID</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">45</span><span class="cl"> <span class="p">,</span> <span class="n">.(arrival_time</span><span class="p">,</span> <span class="n">departure_time</span><span class="p">)</span><span class="n">]</span></span></span></code></pre></div><p>provides us the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="ln"> 1</span><span class="cl"> <span class="n">arrival_time</span> <span class="n">departure_time</span> <span class="n">docked_trucks</span> <span class="n">docked_truck_count</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"> <span class="o">&lt;</span><span class="n">POSc</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">POSc</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">list</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">int</span><span class="o">&gt;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="m">1</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">23</span><span class="o">:</span><span class="m">47</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">25</span><span class="o">:</span><span class="m">08</span> <span class="n">A1</span> <span class="m">1</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="m">2</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">26</span><span class="o">:</span><span class="m">42</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">28</span><span class="o">:</span><span class="m">02</span> <span class="n">A1</span> <span class="m">1</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="m">3</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">30</span><span class="o">:</span><span class="m">20</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">35</span><span class="o">:</span><span class="m">01</span> <span class="n">A5</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">B3</span><span class="p">,</span><span class="n">C3</span> <span class="m">4</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="m">4</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">32</span><span class="o">:</span><span class="m">06</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">33</span><span class="o">:</span><span class="m">48</span> <span class="n">A5</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">B3</span><span class="p">,</span><span class="n">C3</span> <span class="m">4</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="m">5</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">33</span><span class="o">:</span><span class="m">09</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">36</span><span class="o">:</span><span class="m">01</span> <span class="n">A5</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">B3</span><span class="p">,</span><span class="n">C3</span> <span class="m">4</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="m">6</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">34</span><span class="o">:</span><span class="m">08</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">39</span><span class="o">:</span><span class="m">49</span> <span class="n">A5</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">B3</span><span class="p">,</span><span class="n">C3</span> <span class="m">4</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="m">7</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">36</span><span class="o">:</span><span class="m">40</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">38</span><span class="o">:</span><span class="m">34</span> <span class="n">B3</span><span class="p">,</span><span class="n">C3</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">A5</span> <span class="m">4</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="m">8</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">37</span><span class="o">:</span><span class="m">43</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">40</span><span class="o">:</span><span class="m">48</span> <span class="n">C3</span><span class="p">,</span><span class="n">A6</span><span class="p">,</span><span class="n">A5</span> <span class="m">3</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="m">9</span><span class="o">:</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">39</span><span class="o">:</span><span class="m">48</span> <span class="m">2023-01-01</span> <span class="m">06</span><span class="o">:</span><span class="m">46</span><span class="o">:</span><span class="m">10</span> <span class="n">C3</span><span class="p">,</span><span class="n">A5</span><span class="p">,</span><span class="n">A6</span> <span class="m">3</span></span></span></code></pre></div><h3 id="considerations-for-using-datatable">Considerations for using <code>data.table</code></h3>
<p>The package offers a wonderful, nearly one-stop solution that doesn&rsquo;t require you to write out the logic for the query or command yourself, but it has a major problem for a lot of users: it requires you to switch your codebase to R, and a lot of your tasks may be in Python or in an SQL pipeline. So, what do you do?</p>
<p>Consider the effort in maintaining an additional dependency for your analytics pipeline (i.e. R), and the effort that you&rsquo;ll need to invest to run R from Python, or run an R script in your pipeline and pull the output from it back into the pipeline, and make your call.</p>
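<p>For a sense of what &ldquo;run an R script in your pipeline&rdquo; might look like from the Python side, here&rsquo;s a minimal sketch using the standard library&rsquo;s <code>subprocess</code> module. Everything specific here is an assumption: the script name <code>overlaps.R</code>, the CSV hand-off, and the helper names are hypothetical, and the R script would need to read its two paths from <code>commandArgs(trailingOnly = TRUE)</code>.</p>

```python
import subprocess


def build_rscript_command(script_path, input_csv, output_csv):
    """Hypothetical helper: assemble the Rscript invocation as an argument list."""
    # --vanilla skips user profiles and saved workspaces for a reproducible run.
    return ["Rscript", "--vanilla", str(script_path), str(input_csv), str(output_csv)]


def run_overlap_in_r(script_path, input_csv, output_csv):
    """Run the (assumed) R script and return the path it should have written to."""
    command = build_rscript_command(script_path, input_csv, output_csv)
    # check=True raises CalledProcessError if R exits with a non-zero status.
    subprocess.run(command, check=True)
    return output_csv


# Only build and show the command here; actually running it requires R.
print(build_rscript_command("overlaps.R", "trucks.csv", "docked.csv"))
```

<p>Even in this small form, the hand-off adds files to manage, an R installation to maintain, and a failure mode outside Python, which is the cost you are weighing against writing the join yourself.</p>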
]]></content:encoded></item></channel></rss>