Compare commits
5 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | fb57176f75 | | |
| | b9098e0e93 | | |
| | c4bff1536c | | |
| | 6f6dc572db | | |
| | 08a9289bb1 | | |
@@ -1,5 +1,5 @@
|
||||
---
|
||||
title: Environment Variables and Multiprocessing
|
||||
title: Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines
|
||||
author: Avinash Mallya
|
||||
date: 2026-01-14
|
||||
tags: [python, numpy, multiprocessing, parallel, environment, variables]
|
||||
@@ -88,7 +88,7 @@ However, we called `numpy` via Python here 10 separate times. Each of those 10 p
|
||||
on the machine. 320 threads on 32 cores, which means that there were significantly more threads than cores, all actively
|
||||
vying for those 32 cores!
|
||||
|
||||
The clearest signal of this is that a *naive attempt at parallelization via multiprocessing* being **SLOWER** than a
|
||||
The clearest signal of this is a _naive attempt at parallelization via multiprocessing_ being **SLOWER** than a
|
||||
sequential set of calls to the same operation. That's why using a system with more cores and trying to parallelize blindly
|
||||
doesn't always speed things up, and in the worst case, such as this, it considerably slows things down.
|
||||
|
||||
@@ -97,13 +97,13 @@ doesn't always speed things up, and in the worst case, such as this, it consider
|
||||
`numpy` uses different libraries for multithreading based on the system. Environment variables can be used to control the
|
||||
number of threads each process creates, a summary of which is provided below.
|
||||
|
||||
| Variable | Backend |
|
||||
| :-: | :-: |
|
||||
| `OMP_NUM_THREADS` |OpenMP (used by many BLAS)
|
||||
| `OPENBLAS_NUM_THREADS` |OpenBLAS
|
||||
| `MKL_NUM_THREADS` |Intel MKL
|
||||
| `BLIS_NUM_THREADS` |BLIS
|
||||
| `VECLIB_MAXIMUM_THREADS` |Apple Accelerate
|
||||
| Variable | Backend |
|
||||
| :----------------------: | :------------------------: |
|
||||
| `OMP_NUM_THREADS` | OpenMP (used by many BLAS) |
|
||||
| `OPENBLAS_NUM_THREADS` | OpenBLAS |
|
||||
| `MKL_NUM_THREADS` | Intel MKL |
|
||||
| `BLIS_NUM_THREADS` | BLIS |
|
||||
| `VECLIB_MAXIMUM_THREADS` | Apple Accelerate |
|
||||
|
||||
To avoid oversubscription, we need to tell `numpy` to use fewer threads than the default (which is all threads), because
|
||||
(1) we are aware that the process running is not compute intensive, and (2) we will handle parallelism ourselves.
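
A minimal sketch of applying the variables from the table inside Python is shown below. It assumes the variables are set before `numpy` is first imported, since the backends typically read them when they load. `limit_blas_threads` is a hypothetical helper name, not something from the original codebase.

```python
import os

# Variables from the table above; which one takes effect depends on
# the BLAS backend that numpy was built against.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "BLIS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
)


def limit_blas_threads(n_threads: int = 1) -> dict[str, str]:
    """Set every known thread-count variable for this process (hypothetical helper)."""
    settings = {name: str(n_threads) for name in THREAD_VARS}
    os.environ.update(settings)
    return settings


# Call this before the first `import numpy`, so that the backend
# sees the limits when it loads.
limit_blas_threads(1)
```

Setting all five is a blunt but safe choice when you don't know which backend a given machine uses. Child processes started afterwards inherit the same environment.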
|
||||
@@ -118,7 +118,7 @@ user 0m4.905s
|
||||
sys 0m0.236s
|
||||
```
|
||||
|
||||
to make the slowest job among a set of 10 *run faster than single call to the script*!
|
||||
to make the slowest job among a set of 10 _run faster than a single call to the script_!
|
||||
|
||||
> The real world impact in the actual codebase was 15x response time, and 30x throughput. What used to take 120 machines earlier took only 4 now!
|
||||
|
||||
@@ -161,8 +161,8 @@ single core might just be inherently better!
|
||||
I've simplified a few things in this blog post:
|
||||
|
||||
1. This demo uses heavy `numpy` ops to clearly show the oversubscription effect. The actual codebase had much lighter
|
||||
usage, but the principle is the same.
|
||||
2. I've used *contention* and *oversubscription* interchangeably here. The former is a case of threads competing for
|
||||
the same shared resource, and the latter is a higher thread count than CPU cores. *Oversubscription* here **led** to *contention*.
|
||||
usage, but the principle is the same.
|
||||
2. I've used _contention_ and _oversubscription_ interchangeably here. The former is a case of threads competing for
|
||||
the same shared resource, and the latter is a higher thread count than CPU cores. _Oversubscription_ here **led** to _contention_.
|
||||
3. Modern systems have hundreds, or thousands of active threads for very few system cores. The difference with most of
|
||||
these threads is that they often are "sleeping", and don't *vie* for attention like the `numpy` ones.
|
||||
these threads is that they often are "sleeping", and don't _vie_ for attention like the `numpy` ones.
|
||||
|
||||
296
content/blog/005_ldmb_as_image_db/index.md
Normal file
@@ -0,0 +1,296 @@
|
||||
---
|
||||
title: Resolving I/O Bottlenecks for 100K Small Files with LMDB
|
||||
author: Avinash Mallya
|
||||
date: 2026-02-10
|
||||
tags: [python, machine-learning, storage, images, dvc, lmdb, bottleneck, io]
|
||||
---
|
||||
|
||||
# Premise
|
||||
|
||||
I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.
|
||||
|
||||
I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn't going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, "correct" text etc.), and they need to be stored along with the images - or at least be easily linkable to them.
|
||||
|
||||
I was primarily aiming for a "simple" solution, and didn't need a productionizable codebase.
|
||||
|
||||
# Potential Solutions
|
||||
|
||||
## Partitioning
|
||||
|
||||
A typical solution for "too many files" is to partition them by their name. It's ideal if their
name is a hash, so you can create a directory for the first character of the hash, a subdirectory
for the second character, and store the actual file inside it. So, for example, the directory changes from:
|
||||
|
||||
```bash
|
||||
files/
|
||||
├── a1b2c3d4e5.txt
|
||||
├── b7f8a9c0d1.txt
|
||||
├── b7e4d2f1a0.txt
|
||||
├── cf1a2b3c4d.txt
|
||||
├── c0d1e2f3a4.txt
|
||||
└── a1f5e6d7c8.txt
|
||||
```
|
||||
|
||||
to
|
||||
|
||||
```bash
files/
├── a/
│   └── 1/
│       ├── a1b2c3d4e5.txt
│       └── a1f5e6d7c8.txt
├── b/
│   └── 7/
│       ├── b7f8a9c0d1.txt
│       └── b7e4d2f1a0.txt
└── c/
    ├── 0/
    │   └── c0d1e2f3a4.txt
    └── f/
        └── cf1a2b3c4d.txt
```
|
||||
|
||||
This isn't novel - `git` and `dvc` both store their objects this way.
|
||||
This limits directory size, so file system look ups take less time.
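
A small sketch of this layout, assuming the file name is the SHA-1 hash of the content (the choice of hash, the `.txt` suffix, and the helper names are illustrative assumptions):

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".txt") -> Path:
    """Return root/<1st hash char>/<2nd hash char>/<hash><suffix>."""
    digest = hashlib.sha1(data).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, data: bytes, suffix: str = ".txt") -> Path:
    """Write the data into its partitioned location, creating directories as needed."""
    path = partitioned_path(root, data, suffix)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With hexadecimal hashes, two levels give at most 16 x 16 = 256 leaf directories, so each one holds roughly 1/256th of the files.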
|
||||
|
||||
This isn't a perfect solution - it required that I store the images as their hash,
|
||||
and handle the directory structure correctly. I will also need to maintain my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its `push`, `diff` and `pull` commands will still be slow.
|
||||
|
||||
## Separate (object) storage, maintain only the index
|
||||
|
||||
Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.
|
||||
|
||||
If you've ever retrieved a list of a large number of files stored on S3, you'd have first encountered the limit
of 1000 objects per request (a limit of the S3 listing API itself, which `boto3` surfaces). You'll need to work around it with pagination, which while standard,
|
||||
is still more work. You would have also realized that even after all this, S3 will take quite some time to give you the
|
||||
list even after you've optimized as much as you can.
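
For reference, the pagination workaround can be sketched as below. It takes the client as an argument so that nothing here depends on credentials; the bucket and prefix in the usage comment are placeholders, not real names.

```python
from typing import Any


def list_all_keys(s3_client: Any, bucket: str, prefix: str = "") -> list[str]:
    """Collect every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage, assuming boto3 is installed and credentials are configured:
# import boto3
# keys = list_all_keys(boto3.client("s3"), "my-bucket", "images/")
```

Each page is still a separate network round trip, which is why listing stays slow for large buckets no matter how the loop is written.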
|
||||
|
||||
However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, an EC2), but that means I needed access to an
|
||||
EC2.
|
||||
|
||||
This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn't reached production status, while demanding production
|
||||
code for an even more nascent pipeline.
|
||||
|
||||
## What about a... database?
|
||||
|
||||
This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn't
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren't optimized for storing a
|
||||
large number of binary blobs.
|
||||
|
||||
I considered, and tested using a more modern embeddable analytical database like DuckDB for this purpose (I'm quite
|
||||
biased to using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn't really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.
|
||||
|
||||
> Note: HuggingFace now provides many image datasets (such as [MNIST](https://huggingface.co/datasets/ylecun/mnist)) in
|
||||
> the Parquet format, with the images stored using Arrow's extension types (but still as binary blobs). My experience with
|
||||
> storing binary data in Parquet hasn't been great, but you could check this out to see if it meets your requirements.
|
||||
|
||||
# The solution I landed on
|
||||
|
||||
## What about a... _different_ kind of database?
|
||||
|
||||
Let's get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon [LMDB](https://www.symas.com/mdb).
|
||||
|
||||
## LMDB
|
||||
|
||||
Wikipedia's [entry](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) on LMDB indicates that it's an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won't pretend to
|
||||
understand how it works, the writeup on the Wiki provides plenty of good detail. I'll focus, rather, on how I used it to
|
||||
solve my problem.
|
||||
|
||||
## Storing and retrieving image data along with its metadata
|
||||
|
||||
LMDB is rather barebones. It exposes few features - the ability to write, and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its [Python bindings](https://lmdb.readthedocs.io/en/release/).
|
||||
|
||||
I wrote a tiny class (~200 LoC) that did the following:
|
||||
|
||||
1. Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.
|
||||
2. Read the image files as bytes, serialize the metadata, and link them to the keys `f"{file_name}_image"` and `f"{file_name}_metadata"` respectively.
|
||||
3. Store these as key-value pairs into the LMDB database, which is a **single** file.
|
||||
4. Provided a method to read the keys to identify all the images present in the database.
|
||||
5. Provided a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.
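
The key scheme from step 2 can be sketched as follows. JSON as the serialization format is my assumption for illustration (the steps above don't depend on it), and a plain `dict` stands in for the LMDB transaction so that the snippet runs without the database.

```python
import json


def image_key(file_name: str) -> bytes:
    return f"{file_name}_image".encode()


def metadata_key(file_name: str) -> bytes:
    return f"{file_name}_metadata".encode()


def to_records(
    file_name: str, image_bytes: bytes, metadata: dict
) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs stored per image."""
    return [
        (image_key(file_name), image_bytes),
        (metadata_key(file_name), json.dumps(metadata, sort_keys=True).encode()),
    ]


def from_store(store: dict[bytes, bytes], file_name: str) -> tuple[bytes, dict]:
    """Look up both entries for one image; `store` stands in for an LMDB read transaction."""
    image_bytes = store[image_key(file_name)]
    metadata = json.loads(store[metadata_key(file_name)].decode())
    return image_bytes, metadata
```

Because both keys are derived from the file name, no separate index is needed to go from an image to its metadata or back.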
|
||||
|
||||
This has many advantages:
|
||||
|
||||
1. LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.
|
||||
2. DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.
|
||||
3. No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.
|
||||
4. Local access, practically zero latency.
|
||||
|
||||
Which solves... all of the problems I had! When new files come in, all I need to do is add them to the DB. LMDB has a few options available - for instance, you can
avoid overwriting an existing key, which keeps the database de-duplicated.
|
||||
|
||||
# The code
|
||||
|
||||
I've provided sample code below that demonstrates storing just the images (not metadata) for the [Oxford 102 Category Flower](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)
|
||||
dataset, which has around 8000 images.
|
||||
|
||||
```py
from pathlib import Path

import lmdb


class ImageDB:
    def __init__(self, env_path: Path, max_size_as_mb: int):
        self.env_path = str(env_path)
        # map_size is given in bytes, so convert from MiB.
        self.env = lmdb.open(self.env_path, map_size=max_size_as_mb * (2**20))
        self.db = self.env.open_db()

    def save_image(
        self,
        name: str,
        image_path: Path,
    ) -> None:
        with self.env.begin(write=True) as txn:
            txn.put(name.encode(), image_path.read_bytes())

    def read_image(self, name: str) -> bytes:
        with self.env.begin(write=False) as txn:
            if image_as_bytes := txn.get(name.encode()):
                return image_as_bytes
            else:
                raise KeyError(name)

    def save_images(
        self,
        name_image: dict[str, Path],
    ) -> None:
        # Note: you might need to enforce a batch size here
        # to avoid running out of memory because this loads
        # all images sent to this function as bytes.
        with self.env.begin(write=True) as txn:
            item_tuples = [
                (k.encode(), image_path.read_bytes())
                for k, image_path in name_image.items()
            ]
            cursor = txn.cursor()
            consumed, added = cursor.putmulti(
                item_tuples, dupdata=False, overwrite=False
            )
            print(
                f"Saved {added:,} out of {len(name_image):,} images to the DB ({consumed - added:,} seem to already exist)."
            )

    def load_images(
        self,
        names: list[str],
    ) -> dict[str, bytes]:
        names_as_bytestrings = [x.encode() for x in names]
        with self.env.begin(write=False) as txn:
            cursor = txn.cursor()
            return {
                k.decode(): image_as_bytes
                for k, image_as_bytes in cursor.getmulti(names_as_bytestrings)
            }

    def delete_image(self, name: str):
        with self.env.begin(write=True) as txn:
            if txn.delete(name.encode()):
                print(f"Image {name} deleted successfully")
            else:
                raise KeyError(name)

    def retrieve_names(self) -> list[str]:
        with self.env.begin(write=False) as txn:
            return [x.decode() for x in txn.cursor().iternext(keys=True, values=False)]


if __name__ == "__main__":
    db = ImageDB(Path("./db/"), 512)
    name_image: dict[str, Path] = dict()

    # Save the results
    for image_path in Path("./data/jpg/").glob("*.jpg"):
        name_image[image_path.name] = image_path
        if len(name_image) >= 1000:
            db.save_images(name_image)
            name_image.clear()
    # Add last batch also
    db.save_images(name_image)

    del name_image
    # How many images have been stored?
    print(f"The DB has {len(db.retrieve_names()):,} images stored")

    name_image: dict[str, bytes] = dict()
    # Load the results from the DB and check if they match the files on disk
    for image_path in Path("./data/jpg").glob("*.jpg"):
        name_image[image_path.name] = image_path.read_bytes()
        if len(name_image) >= 1000:
            saved_name_image = db.load_images(list(name_image.keys()))
            assert name_image == saved_name_image
            name_image.clear()
    # Verify last batch also
    saved_name_image = db.load_images(list(name_image.keys()))
    assert name_image == saved_name_image

    print("All images stored are byte identical to the original ones!")

    db.env.close()
```
|
||||
|
||||
This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.
|
||||
|
||||
# Caveats
|
||||
|
||||
## Avoid PIL, or pay the (small) price
|
||||
|
||||
One gotcha that I initially faced is that the images I saved weren't the same as
|
||||
the images that I retrieved. This wasn't LMDB's fault, this was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).
|
||||
|
||||
Don't encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.
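
To see why a lossless roundtrip can still change the bytes, here is a small sketch using the standard library's `zlib` as a stand-in for an image encoder. This is my simplification, not what PIL does internally, although PNG does compress its pixel data with the same DEFLATE algorithm. The same content, compressed with two different settings, decodes identically but is stored differently.

```python
import zlib

pixels = bytes(range(256)) * 64  # stand-in for decoded pixel data

encoded_fast = zlib.compress(pixels, 1)   # plays the role of the original file
encoded_small = zlib.compress(pixels, 9)  # plays the role of the re-encoded file

# Both decode to exactly the same content...
same_content = zlib.decompress(encoded_fast) == zlib.decompress(encoded_small)
# ...but the stored bytes differ, so a byte-for-byte comparison fails.
same_bytes = encoded_fast == encoded_small

print(same_content, same_bytes)
```

Reading the file with `Path.read_bytes()`, as the sample code does, sidesteps this entirely because nothing is decoded or re-encoded.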
|
||||
|
||||
## The `max_size_as_mb` argument
|
||||
|
||||
LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and if it exceeds this size, it will fail. You can edit this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).
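
One way to pick a value for `max_size_as_mb` is to add up the files you intend to store and pad the total. The sketch below does that; the 1.5x headroom factor is an arbitrary assumption on my part (LMDB has its own storage overhead, and you will want room for new images), and `estimate_max_size_as_mb` is a hypothetical helper.

```python
from pathlib import Path


def estimate_max_size_as_mb(
    image_dir: Path, pattern: str = "*.jpg", headroom: float = 1.5
) -> int:
    """Total size of matching files, padded by `headroom`, in whole MiB (at least 1)."""
    total_bytes = sum(p.stat().st_size for p in image_dir.glob(pattern))
    return max(1, int(total_bytes * headroom / 2**20) + 1)


# Usage with the class above (paths are the ones from the sample code):
# db = ImageDB(Path("./db/"), estimate_max_size_as_mb(Path("./data/jpg/")))
```

If the estimate turns out too small, the Python bindings also expose `Environment.set_mapsize` to raise the limit later, subject to the caveats mentioned above.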
|
||||
|
||||
## Concurrency and LMDB
|
||||
|
||||
LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
[the documentation](https://lmdb.readthedocs.io/en/release/#threads) for details.
|
||||
It may not be suited for distributed workloads.
|
||||
|
||||
# Alternatives
|
||||
|
||||
This article covers a "quick and dirty" solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:
|
||||
|
||||
1. If you're comfortable operating directly on archives, a simple `tar` file will
do - you can build an offset index over it to get random access to the data.
|
||||
2. [Nvidia's WebDataset](https://github.com/webdataset/webdataset). Modern, open
|
||||
source and purpose built for large scale deep learning.
|
||||
3. [LanceDB](https://lancedb.com/), which describes itself as "designed for multimodal"
|
||||
and "built for scale". It's built on top of Arrow, closely related to Parquet.
|
||||
4. As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own [`datasets`](https://huggingface.co/docs/datasets/index).
|
||||
|
||||
Use these if you want to scale to production level training.
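
As an illustration of the first alternative, here is a rough sketch of an offset index over an uncompressed `tar` file using only the standard library. The function names are hypothetical, and it assumes regular (non-sparse) members in an archive that is not compressed, since seeking by byte offset only works on the raw archive.

```python
import tarfile
from pathlib import Path


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Map each member name to (byte offset of its data, size) in an uncompressed tar."""
    with tarfile.open(tar_path, "r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_member(
    tar_path: Path, index: dict[str, tuple[int, int]], name: str
) -> bytes:
    """Random access: seek straight to the member's data without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The index has to be built once (a full scan of the archive) and stored somewhere, which brings back a small version of the "maintain an index" problem discussed earlier.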
|
||||
@@ -1,15 +1,17 @@
|
||||
---
|
||||
title: "projects"
|
||||
menu: "main"
|
||||
weight: 3
|
||||
weight: 4
|
||||
---
|
||||
|
||||
Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.
|
||||
|
||||
# Featured projects
|
||||
|
||||
1. [BorrowChecker](https://avimallu.github.io/BorrowChecker/): A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. [Repository link](https://github.com/avimallu/BorrowChecker).
|
||||
2. [PowerPointSnap](https://github.com/avimallu/PowerPointSnap): A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying [blog post]({{< ref "blog/003_powerpointsnap">}}).
|
||||
1. [ducktabe](https://github.com/avimallu/ducktabe): Short for **Duck**DB **Tab**ular **E**xplorer. A small tool to run SQL on parquet files I generated, for "quick and dirty" analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage.
|
||||
2. [ducktabe-py](https://github.com/avimallu/ducktabe-py): An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I've got significant experience in. It is surprisingly capable, and is [available on PyPI](https://pypi.org/project/ducktabe/).
|
||||
3. [BorrowChecker](https://avimallu.github.io/BorrowChecker/): A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. [Repository link](https://github.com/avimallu/BorrowChecker).
|
||||
4. [PowerPointSnap](https://github.com/avimallu/PowerPointSnap): A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying [blog post]({{< ref "blog/003_powerpointsnap">}}).
|
||||
|
||||
# Other work or contributions
|
||||
|
||||
|
||||
7
content/series/_index.md
Normal file
@@ -0,0 +1,7 @@
|
||||
---
|
||||
title: "series"
|
||||
menu: "main"
|
||||
weight: 3
|
||||
---
|
||||
|
||||
# The By Hand Series
|
||||
349
content/series/byhand/byhand_001_square_roots.md
Normal file
@@ -0,0 +1,349 @@
|
||||
+++
|
||||
date = '2026-02-15'
|
||||
title = 'BasicsByHand: Square Roots'
|
||||
tags = ['mathematics', 'square-roots', 'by-hand']
|
||||
+++
|
||||
|
||||
# Premise
|
||||
|
||||
Long ago in middle school (or even earlier), along with the concept of what a
|
||||
square root is, I was also taught a method to calculate them by hand. This
|
||||
method did not involve Taylor series expansions, nor was it the popular
|
||||
numeric Newton-Raphson method - we didn't have a lick of knowledge of Calculus
|
||||
to use them anyway.
|
||||
|
||||
It worked, and nobody ever quite taught *why* it worked. Given that I'm an adult
|
||||
now and free to do adult things, I wanted to cross this off my (admittedly
|
||||
non-existent) list as a part of something else I was trying to understand.
|
||||
|
||||
# What method?
|
||||
|
||||
This is for my readers who may not be familiar with this method, or just need a
|
||||
refresher. Suppose you need to calculate the square root of 78129.329 while
|
||||
you're stuck on an island. Maybe your captor is a math fanatic. This is how you
|
||||
would go about it.
|
||||
|
||||
Start with grouping the digits in pairs - starting from the decimal point,
|
||||
moving on either side of it:
|
||||
|
||||
```
|
||||
┌─────────────────
|
||||
│ 7,81,29.32,9
|
||||
```
|
||||
|
||||
Now, we begin a method that suspiciously looks a lot like long-form division.
|
||||
Find the largest number whose square is equal to, or less than the left-most
|
||||
group.
|
||||
|
||||
```
|
||||
2
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
```
|
||||
|
||||
Then, you subtract its square from that group:
|
||||
|
||||
```
|
||||
2
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
│ 3
|
||||
```
|
||||
|
||||
and multiply the number at the top with 2 (how arbitrary! or is it...?) and call
|
||||
it the new *divisor* you're going to work with. The next task is to find a value
|
||||
for a **digit** $x$ such that $x \times (40+x)$ is less than or equal to $381$:
|
||||
|
||||
```
|
||||
2 x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
4x │ 3,81
|
||||
```
|
||||
|
||||
This will require some manual multiplication (since we're doing this by hand!).
|
||||
Looks like $7$ meets this requirement (as $47 \times 7 = 329$, while $48 \times 8 = 384$ overshoots). Apply the same
|
||||
subtraction:
|
||||
|
||||
```
|
||||
2 7
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
│ 52
|
||||
```
|
||||
|
||||
and repeat the process once more. We need to first multiply $27$ by $2$, which
|
||||
gives us $54$. Similarly, $x \times (540+x)$ should then be less than or equal
|
||||
to $52,29$:
|
||||
|
||||
```
|
||||
2 7 x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
54x │ 52,29
|
||||
```
|
||||
|
||||
I'll forgive you if you use the calculator here! I'll skip a step here (we end
|
||||
up at $9$ as $549\times 9=49,41<52,29$) to show that we need to now find $x$
|
||||
such that $x \times (5580 + x)$ is less than or equal to $2,88,32$:
|
||||
|
||||
```
|
||||
2 7 9. x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
549 │ 52,29
|
||||
│ -49,41
|
||||
│─────────────────
|
||||
558x │ 2,88,32
|
||||
```
|
||||
|
||||
We settle on 5. Note here that the next number is $2795\times2$, not
|
||||
$279.5\times2$, a small subtlety here. The next $x$ is going to be $1$:
|
||||
|
||||
```
|
||||
2 7 9. 5 x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
549 │ 52,29
|
||||
│ -49,41
|
||||
│─────────────────
|
||||
5585 │ 2,88,32
|
||||
│ -2,79,25
|
||||
│─────────────────
|
||||
5590x │ 9,07,90
|
||||
```
|
||||
|
||||
and we get:
|
||||
|
||||
```
|
||||
2 7 9. 5 1
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
549 │ 52,29
|
||||
│ -49,41
|
||||
│─────────────────
|
||||
5585 │ 2,88,32
|
||||
│ -2,79,25
|
||||
│─────────────────
|
||||
55901 │ 9,07,90
|
||||
│ - 5,59,01
|
||||
│─────────────────
|
||||
│ 3,48,89
|
||||
```
|
||||
|
||||
We could stop here, or go on further. I elect to stop here, and verify my
|
||||
result. Let's go to trusty Python:
|
||||
|
||||
```python
|
||||
>>> from math import sqrt
|
||||
>>> print(sqrt(7_81_29.32_9))
|
||||
279.51624103082094
|
||||
```
|
||||
|
||||
Excellent, so we were right!
|
||||
|
||||
But... why? How did we get it right?
|
||||
|
||||
# Building the Intuition
|
||||
|
||||
|
||||
## Getting our foot in the door
|
||||
|
||||
Let's start with our previous example.
|
||||
|
||||
$$
|
||||
7,81,29.32,9
|
||||
$$
|
||||
|
||||
Note that the commas are intentionally retained every two digits instead of the usual
|
||||
three.
|
||||
|
||||
> Why are commas retained every two digits?
|
||||
|
||||
> This is because squares of numbers have up to twice the number of digits of
> the original number. As an example, consider $999^2$. It will be less than
> $1000^2$, which is $1,000,000$, the smallest number with 7 digits. So $999^2$
> has at most 6 digits, which is twice the 3 digits that $999$ has.
|
||||
|
||||
> On the other hand, smaller numbers like $4$ go to $16$ ($1$ digit to $2$
|
||||
digits). $1$ to $3$ remain at $1$ digit (1, 4, 9).
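
The digit-count claim is easy to check by brute force. The snippet below is only a sanity check (the function name is mine): for every $n$-digit number, it records how many digits the square has.

```python
def digit_counts_of_squares(n_digits: int) -> set[int]:
    """Digit counts seen among the squares of all n-digit positive integers."""
    lo, hi = 10 ** (n_digits - 1), 10**n_digits
    return {len(str(m * m)) for m in range(lo, hi)}


for n in (1, 2, 3):
    # Every n-digit number has a square with either 2n - 1 or 2n digits.
    print(n, sorted(digit_counts_of_squares(n)))
```

This is why each pair of digits in the original number contributes exactly one digit to the root.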
|
||||
|
||||
Remember that we usually also write our numbers in base $10$. Since any power
|
||||
of $10$ is easy to calculate, this makes the **estimate of our first digit**
|
||||
rather easy. Let's start by finding the largest even power of $10$ (a perfect square) that is less
|
||||
than or equal to $7,81,29.329$.
|
||||
|
||||
$$
|
||||
\begin{array}{lrll}
|
||||
&1,00,00 &\leq 7,81,29.329 &\leq 10,00,00\\
|
||||
\Rightarrow &10^4 &\leq 7,81,29.32,9 &\leq 10^5\\
|
||||
\Rightarrow &(10^2)^2 &\leq 7,81,29.32,9 &\leq 10^5\\
|
||||
\end{array}
|
||||
$$
|
||||
|
||||
Why did we write this as $(10^2)^2$? This tells us that our desired square root
|
||||
is greater than $100$. What's the largest multiple of hundred whose square is less than
|
||||
or equal to $7,81,29.32,9$? It's 200, because
|
||||
|
||||
$$
|
||||
\begin{array}{rlr}
|
||||
200^2 &= 4,00,00 &< 7,81,29.32,9\\
|
||||
300^2 &= 9,00,00 &> 7,81,29.32,9
|
||||
\end{array}
|
||||
$$
|
||||
|
||||
In other words,
|
||||
|
||||
$$
|
||||
\begin{array}{lrll}
|
||||
&200^2 &\leq 7,81,29.32,9 &\leq 300^2\\
|
||||
\end{array}
|
||||
$$
|
||||
|
||||
When we calculated that $200$ was the largest multiple of $100$ whose square was less than
|
||||
or equal to $7,81,29.32,9$, we were calculating the first digit of the square root.
|
||||
|
||||
## We continue by applying a common formula...
|
||||
|
||||
...which is derived below to minimize cognitive load, in case it's been a long
|
||||
time since you've touched algebra.
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
(a+x)^2 &=(a+x)(a+x)\\
|
||||
&=a(a+x) + x(a+x)\\
|
||||
&=a^2 + ax + xa + x^2\\
|
||||
&=a^2+2ax+x^2
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
## Connecting the dots
|
||||
|
||||
Let's consider the next digit. Because the true root lies between $200$ and $300$,
|
||||
we can say that the next digit of the root comes from the largest number of the form $200 + 10x$
whose square is less than or equal to $7,81,29.32,9$.
|
||||
|
||||
That means we're trying to find a single **digit** $x$ that meets the criteria:
|
||||
|
||||
$$
|
||||
\begin{array}{rll}
|
||||
&(200+10x)^2 &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &200^2 + 2\cdot200\cdot10x + 100x^2 &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &200^2 + 100\cdot\boldsymbol{(40+x)x} &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &100\cdot\boldsymbol{(40+x)x} &\leq 3,81,29.32,9\\
|
||||
\Rightarrow &\boldsymbol{(40+x)x} &\leq \boldsymbol{3,81}.29,32,9
|
||||
\end{array}
|
||||
$$
|
||||
|
||||
Look closely at the bolded parts of the last equation:
|
||||
|
||||
$$
|
||||
\boldsymbol{(40+x)x} \leq \boldsymbol{3,81}
|
||||
$$
|
||||
|
||||
This is **exactly what we did here**:
|
||||
|
||||
```
|
||||
2 x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
4x │ 3,81
|
||||
```
|
||||
|
||||
The single **digit** $x$ that satisfies this equation is $7$, as we had obtained earlier.
|
||||
|
||||
## One more step for clearer understanding
|
||||
|
||||
Let's do one more round to inscribe the logic in our heads:
|
||||
|
||||
We know that $270$ is the next best approximation of the root of $7,81,29.329$. We also
|
||||
know that to get the next digit, we need the largest whole number between
$270$ and $280$ whose square is less than or equal to $7,81,29.32,9$.
|
||||
|
||||
Once again, in trying to find the single **digit** $x$, the expressions become (note that
|
||||
in this iteration, $x$ is just $x$, not $10x$, because it's a multiple of $1$, not $10$):
|
||||
|
||||
$$
|
||||
\begin{array}{rll}
|
||||
&(270+x)^2 &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &270^2 + 2\cdot270\cdot x + x^2 &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &270^2 + \boldsymbol{(540+x)x} &\leq 7,81,29.32,9\\
|
||||
\Rightarrow &\boldsymbol{(540+x)x} &\leq 52,29.32,9\\
|
||||
\Rightarrow &\boldsymbol{(540+x)x} &\leq \boldsymbol{52,29}.32,9
|
||||
\end{array}
|
||||
$$
|
||||
|
||||
and the bolded parts of the expression are:
|
||||
|
||||
$$
|
||||
\boldsymbol{(540+x)x} \leq \boldsymbol{52,29}
|
||||
$$
|
||||
|
||||
once again, matching exactly to:
|
||||
|
||||
```
|
||||
2 7 x
|
||||
┌─────────────────
|
||||
2 │ 7,81,29.32,9
|
||||
│ -4
|
||||
│─────────────────
|
||||
47 │ 3,81
|
||||
│ -3,29
|
||||
│─────────────────
|
||||
54x │ 52,29
|
||||
```
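
Both rounds follow the same pattern. As a compact restatement (the notation is mine: $a$ is the part of the root found so far, read as a whole number, and $x$ is the next digit), ignoring the overall scale factor of a power of $100$:

$$
\begin{aligned}
(10a+x)^2 &= 100a^2 + 2\cdot10a\cdot x + x^2\\
&= 100a^2 + (20a+x)x
\end{aligned}
$$

With $a=2$ the last term is $(40+x)x$, and with $a=27$ it is $(540+x)x$. The doubling of the number on top, and the digit $x$ appended to the divisor, both come from this $(20a+x)x$ term, while the $100a^2$ part is what has already been subtracted in the earlier rounds.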
|
||||
|
||||
## Estimating to more decimal points
|
||||
|
||||
You also now understand why this process can continue for as many
decimal places as needed.
|
||||
|
||||
The rest of the estimation, however, is, um... [*left as an exercise to the
|
||||
reader*](https://en.wikipedia.org/wiki/Proof_by_intimidation) 😬.
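
If you'd rather let a computer do the exercise, the whole procedure fits in a few lines. This is a sketch of the by-hand method using integer arithmetic only; the function name and the string-based input are my own choices, and extra digits beyond the requested precision are simply dropped rather than rounded.

```python
def sqrt_by_hand(number: str, decimals: int = 2) -> str:
    """Digit-by-digit square root of a non-negative decimal string, truncated."""
    whole, _, frac = number.partition(".")
    if len(whole) % 2:
        whole = "0" + whole  # pad so that the digits pair up from the decimal point
    frac = (frac + "0" * (2 * decimals))[: 2 * decimals]
    digits = whole + frac
    pairs = [int(digits[i : i + 2]) for i in range(0, len(digits), 2)]

    root, remainder = 0, 0
    for pair in pairs:
        remainder = remainder * 100 + pair  # bring down the next group
        x = 9
        # Largest digit x with (20 * root + x) * x <= remainder.
        while (20 * root + x) * x > remainder:
            x -= 1
        remainder -= (20 * root + x) * x
        root = root * 10 + x

    text = str(root).rjust(decimals + 1, "0")
    if decimals == 0:
        return text
    return text[:-decimals] + "." + text[-decimals:]


print(sqrt_by_hand("78129.329", 2))
```

Here `20 * root` is the "multiply the number at the top by 2" step with a zero appended, so `(20 * root + x) * x` is exactly the `4x`, `54x`, `558x` divisor pattern from the diagrams.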
|
||||
|
||||
# Conclusion
|
||||
|
||||
I hope you now understand that the so-called "long-division" method is
just encoding the technique that I laid out above step-by-step. It is
not really doing division alone; it's expanding the binomial identity
$(a+x)^2$ and refining it successively every iteration.
|
||||
|
||||
You will also notice that as we proceed further in calculating the root,
|
||||
the smaller the error becomes, allowing us to ignore it when we see fit.
|
||||
|
||||
# References
|
||||
|
||||
The method laid out in the article is described in a [NIST article](https://xlinux.nist.gov/dads/HTML/squareRoot.html).
|
||||
However, the explanation of the method is my own, as I was not able to grasp how the explanation provided
|
||||
in that link actually worked "all the way" to the end.
|
||||
302
content/series/byhand/byhand_002_logarithms.md
Normal file
@@ -0,0 +1,302 @@
|
||||
+++
|
||||
date = '2026-02-15'
|
||||
title = 'BasicsByHand: Logarithms'
|
||||
tags = ['mathematics', 'logarithms', 'history', 'by-hand', 'euler']
|
||||
+++
|
||||
|
||||
# Premise
|
||||
|
||||
I was reading a book on the
[history of $e$ and why it appears everywhere](https://press.princeton.edu/books/paperback/9780691168487/e-the-story-of-a-number),
in which the author focuses a little on how Napier first developed logarithms. While the
book clarifies that Napierian logarithms are different from the definitions we
use today, and demonstrates how they were calculated, it occurred to me that I
had no idea how to derive their values by hand.
|
||||
|
||||
Searching online yielded limited results, and it took me quite some time to find
|
||||
a version that described a simple method that didn't rely on calculus (something
|
||||
that wasn't developed fully when logarithms were first calculated).
|
||||
|
||||
I write this down almost entirely from that source, primarily as a means to make
|
||||
one more source available on the internet, and for my own curiosity.
|
||||
|
||||
# Prerequisites
|
||||
|
||||
You need to know very little to read this article.
|
||||
|
||||
However, if you truly wish to do it quite literally by hand, you will also need
|
||||
to know how to calculate square roots by hand. I went down this (small) rabbit
|
||||
hole myself while doing this research, and have [written down how (again, no
|
||||
calculus involved)]({{< relref "series/byhand/byhand_001_square_roots.md" >}}).
|
||||
|
||||
# Logarithms and key properties
|
||||
|
||||
My "first principles" knowledge on Logarithms was a little rusty. I hope this
|
||||
helps you the same way it did me. However, if you remember your logarithms well,
|
||||
you should skip this section.
|
||||
|
||||
## What is a logarithm?
|
||||
|
||||
The logarithm of a number $y$ is defined for a base $b$ as the value $x$ that
|
||||
satisfies the equation:
|
||||
|
||||
$$y=b^x$$
|
||||
|
||||
It is more popularly written as:
|
||||
|
||||
$$\log_b y=x$$
|
||||
|
||||
By definition, therefore,
|
||||
|
||||
$$\log_b b^x = x$$
|
||||
|
||||
and conversely
|
||||
|
||||
$$b^{\log_bx}=x$$
|
||||
|
||||
The converse is tricky, but becomes easier if you say $\log_b x=t$, which implies
|
||||
that $b^t=x$, which is exactly what $b^{\log_b x}$ becomes.
|
||||
|
||||
### Restrictions on $b$ and $y$
|
||||
|
||||
It is important to note here that $\log_b y$ is defined only for $y>0$ and $b>0$
|
||||
and $b \ne 1$. These restrictions are necessary for the reasons laid out below:
|
||||
|
||||
1. If $b < 0$, then $b^x$ is not always defined (for example, $(-2)^{0.5}$ is not real)
|
||||
2. If $b = 0$, $0^x$ is 0 for positive $x$ and undefined for $x \leq 0$. It's not invertible.
|
||||
3. If $b = 1$, $1^x = 1$ for all $x$, which doesn't give a useful $\log$ function.
|
||||
4. If $y<0$ for valid $b$, then no real $x$ can satisfy $b^x=y$, because $b^x>0$ for every real $x$.
|
||||
5. If $y=0$ for valid $b$, then no real $x$ can satisfy $b^x=0$.
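
These restrictions can be observed directly in Python's `math.log`, which refuses to return a value outside the allowed domain. The sketch below is only an illustration; the helper name `log_is_defined` is hypothetical and not something the standard library provides.

```python
from math import log


def log_is_defined(y: float, b: float) -> bool:
    """Return True when math.log can evaluate log_b(y) as a real number."""
    try:
        log(y, b)
    except (ValueError, ZeroDivisionError):
        # ValueError: y <= 0 or b <= 0 (no real logarithm exists).
        # ZeroDivisionError: b == 1, because log_b(y) is computed as a ratio
        # of logs and the log of 1 is 0.
        return False
    return True


print(log_is_defined(155, 7))   # valid y and b
print(log_is_defined(-5, 10))   # y < 0
print(log_is_defined(5, 1))     # b = 1
```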
|
||||
|
||||
## Properties of logarithms
|
||||
### Sum of logs to the same base
|
||||
|
||||
A very basic property of exponents is that:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
b^u\cdot b^v &= \underbrace{b\cdot b\cdot\ldots\cdot b}_{u\text{ times}}\cdot \underbrace{b\cdot b\cdot\ldots\cdot b}_{v\text{ times}}\\
|
||||
&=b^{u+v}
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
To extend this to logarithms, define $u=\log_b x$ and $v=\log_b y$:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
b^{u+v} &= b^u\cdot b^v\\
b^{\log_b x+\log_b y}&=b^{\log_b x}\cdot b^{\log_b y}\\
|
||||
b^{\log_b x+\log_b y}&=xy\\
|
||||
b^{\log_b x+\log_b y}&=b^{\log_b{(xy)}}\\
|
||||
\log_bx+\log_by&=\log_b(xy)
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
It is important that all logs are to the same base $b$.
|
||||
|
||||
### Logs of a number raised to an arbitrary exponent
|
||||
|
||||
This follows directly from sum of logs:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
\log_bx^a&=\log_b(\underbrace{x\cdot x\cdot\ldots\cdot x}_{\text{a times}})\\
|
||||
&=\underbrace{\log_bx+\log_b x+\ldots\log_bx}_{\text{a times}}\\
|
||||
&=a\log_bx
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
When $a$ isn't a positive whole number, we instead use the fact that:
|
||||
|
||||
$$
|
||||
(b^x)^a=b^{xa}=b^{ax}
|
||||
$$
|
||||
|
||||
which can be written as:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
b^{\log_b(y^a)}&=y^a=(y)^a\\
|
||||
&=(b^{\log_b y})^a\\
|
||||
&=b^{a\log_b y}\\
|
||||
\Rightarrow\log_b(y^a)&=a\log_b y
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
### Difference of logs
|
||||
|
||||
Following the first two properties:
|
||||
$$
|
||||
\begin{aligned}
|
||||
\log_bx-\log_by&=\log_bx+(-1)\log_by\\
|
||||
&=\log_bx+\log_by^{-1}\\
|
||||
&=\log_bx+\log_b\frac{1}{y}\\
|
||||
&=\log_b\frac{x}{y}
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
### Changing of base
|
||||
|
||||
This is an important one. We need to write:
|
||||
|
||||
$$
|
||||
\log_xy
|
||||
$$
|
||||
|
||||
in the form of just logs to a common base $b$. We can write:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
x^\frac{\log_by}{\log_bx}&=(b^{\log_bx})^\frac{\log_by}{\log_bx}\\
|
||||
&=b^{\frac{\log_bx\cdot\log_by}{\log_bx}}\\
|
||||
&=b^{\log_by}\\
|
||||
&=y\\
|
||||
&=x^{\log_xy}\\
|
||||
\Rightarrow \dfrac{\log_by}{\log_bx}&=\log_xy
|
||||
\end{aligned}
|
||||
$$
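
If you'd like to convince yourself of the four properties numerically before relying on them, a quick check along the following lines may help. It is a sketch only: the function name `check_log_identities` and the sample values are arbitrary choices for illustration, and `math.isclose` is used because floating-point results will not match exactly.

```python
from math import isclose, log


def check_log_identities(b: float, x: float, y: float, a: float) -> dict:
    """Compare both sides of each property for one choice of b, x, y and a."""
    return {
        "sum": isclose(log(x, b) + log(y, b), log(x * y, b)),
        "exponent": isclose(log(x ** a, b), a * log(x, b)),
        "difference": isclose(log(x, b) - log(y, b), log(x / y, b)),
        "change_of_base": isclose(log(y, b) / log(x, b), log(y, x)),
    }


# Arbitrary sample values; any valid base and positive x, y should behave the same.
print(check_log_identities(b=10, x=155, y=7, a=2.5))
```

A check like this proves nothing, of course, but it is a handy guard against sign or base mix-ups when using the identities.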
|
||||
|
||||
# The Euler Method
|
||||
|
||||
Let's say that we want to calculate $\log_7 155$. Euler's method involved first
|
||||
changing the base:
|
||||
|
||||
$$
|
||||
\log_7 155=\dfrac{\log_{10} 155}{\log_{10} 7}
|
||||
$$
|
||||
|
||||
and then using the identity:
|
||||
|
||||
$$
|
||||
\log(\sqrt{x\cdot y})=\dfrac{\log(xy)}{2}=\dfrac{\log x+\log y}{2}
|
||||
$$
|
||||
|
||||
Let's see how. From here on, if only $\log$ is used, assume that it means
|
||||
$\log_{10}$. This will simplify a lot of the visual noise.
|
||||
|
||||
## Simplifying log calculations
|
||||
|
||||
We can first start by writing:
|
||||
|
||||
$$
|
||||
\log155=\log(100\cdot1.55)=2+\log1.55
|
||||
$$
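
The same trick works for any positive number: keep pulling out factors of $10$ until what remains lies between $1$ and $10$. A minimal sketch of that bookkeeping, with the hypothetical helper name `split_power_of_ten` chosen purely for illustration:

```python
def split_power_of_ten(value: float) -> tuple:
    """Write value as 10**exponent * mantissa with 1 <= mantissa < 10."""
    if value <= 0:
        raise ValueError("logarithms are only defined for positive numbers")
    exponent = 0
    while value >= 10:   # too large: divide out a factor of 10
        value /= 10
        exponent += 1
    while value < 1:     # too small: multiply in a factor of 10
        value *= 10
        exponent -= 1
    return exponent, value


exponent, mantissa = split_power_of_ten(155)
print(exponent, mantissa)  # log 155 = exponent + log(mantissa)
```

This leaves only the logarithm of a number between $1$ and $10$ to be found by hand.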
|
||||
|
||||
## Calculating $\log1.55$
|
||||
|
||||
### Iteration 1
|
||||
|
||||
We know that from the identity above:
|
||||
|
||||
$$
|
||||
\log(\sqrt{1\cdot10}) = \dfrac{\log1 +\log10}{2}=\dfrac{1}{2}=0.5
|
||||
$$
|
||||
|
||||
In other words,
|
||||
|
||||
$$
|
||||
\log{\sqrt{10}} = 0.5
|
||||
$$
|
||||
|
||||
We know ([or can calculate by hand]({{< relref "series/byhand/byhand_001_square_roots.md" >}}))
|
||||
that $\sqrt{10}\approx3.16228$. You can hopefully now see where this is going.
|
||||
|
||||
### Iteration 2
|
||||
|
||||
Where does $1.55$ lie? Between $1$ and $\sqrt{10}$, or between $1$ and $3.16228$.
|
||||
Let's now calculate:
|
||||
|
||||
$$
|
||||
\log{\sqrt{1\cdot3.16228}}=\dfrac{\log1+\log3.16228}{2}=\dfrac{0+0.5}{2}=0.25
|
||||
$$
|
||||
|
||||
Since $\sqrt{1\cdot3.16228}=\sqrt{3.16228}\approx1.77828$, we now know that $\log1.77828\approx0.25$.
|
||||
|
||||
### Iteration 3
|
||||
|
||||
Once again, we make more progress by finding the square root of $1.77828$,
|
||||
because $1.55$ lies between $1$ and $1.77828$. Therefore,
|
||||
|
||||
$$
|
||||
\log{\sqrt{1\cdot1.77828}}=\dfrac{\log1+\log{1.77828}}{2}=0.125
|
||||
$$
|
||||
|
||||
$\sqrt{1.77828}\approx1.33352$. We're getting rather close.

### Iteration 4

We have a small change here. $1.55$ now lies between $1.33352$ and $1.77828$.
This means that the new value we need to find is:

$$
\begin{aligned}
\log{\sqrt{1.33352\cdot1.77828}}&=\dfrac{\log{1.33352}+\log{1.77828}}{2}\\
&=\dfrac{0.125+0.25}{2}\\
&=0.1875
\end{aligned}
$$

Also, $\sqrt{1.33352\cdot1.77828}\approx1.53993$.

### Further iterations

The next step would be to determine if $1.53993$ is sufficiently close to $1.55$
for us to approximate that $\log 1.53993\approx\log 1.55$, or whether we need to proceed
further.
|
||||
|
||||
Of course, we can continue to refine the values until we're satisfied
with our accuracy. What does a calculator tell us about $\log 1.55$?
|
||||
|
||||
```python
|
||||
>>> from math import log
|
||||
>>> log(1.55, 10)
|
||||
0.1903316981702915
|
||||
```
|
||||
|
||||
So we were off by around... $0.0028$, which isn't bad at all, considering we did
|
||||
only around 4 iterations (and many more iterations for calculating the square
|
||||
root).
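
For the curious, the iterations above can be sketched as a short loop. This is an illustration under stated assumptions rather than a faithful reproduction of working by hand: `math.sqrt` stands in for the manual square root, the target is assumed to lie between $1$ and $10$, and the function name `log10_by_square_roots` is made up for this example.

```python
from math import sqrt


def log10_by_square_roots(target: float, iterations: int = 4) -> tuple:
    """Narrow down log10(target) for 1 <= target <= 10 using geometric means."""
    low, log_low = 1.0, 0.0      # log 1 = 0
    high, log_high = 10.0, 1.0   # log 10 = 1
    mid, log_mid = low, log_low
    for _ in range(iterations):
        # The log of the geometric mean is the average of the two known logs.
        mid = sqrt(low * high)
        log_mid = (log_low + log_high) / 2
        # Keep whichever half of the interval still contains the target.
        if mid > target:
            high, log_high = mid, log_mid
        else:
            low, log_low = mid, log_mid
    return mid, log_mid


print(log10_by_square_roots(1.55, iterations=4))   # the four iterations above
print(log10_by_square_roots(1.55, iterations=20))  # more iterations, tighter estimate
```

Every pass halves the width of the interval in which the logarithm can lie, so each additional correct decimal digit costs a little over three further iterations.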
|
||||
|
||||
## Back to the Euler method
|
||||
|
||||
We find that $\log_{10}155=2+0.1875$. We calculate $\log_{10}7$ the same way,
|
||||
which should come to $0.845$ (according to my calculator), so
|
||||
|
||||
$$
|
||||
\log_7 155\approx\dfrac{2.1875}{0.845}\approx2.589
$$

and calculators give us:

```python
>>> from math import log
>>> log(155, 7)
2.591
```

This is close to the $2.589$ we got: we're off by only a few thousandths, the same
order as the error in our estimate of $\log 1.55$. If we'd used the true value, $0.1903$,
then we would have got:

$$
\log_7 155\approx\dfrac{2.1903}{0.845}\approx2.592
$$

which is closer still. The error in the final answer follows the error in the
individual logs, so the more decimal places we need, the more iterations (and
square roots) we have to work through.
|
||||
|
||||
## Takeaway
|
||||
|
||||
A key takeaway for a programmer would be that Euler managed to implement binary
|
||||
search to efficiently (especially in terms of human effort) find the logarithm
|
||||
of a number.
|
||||
|
||||
# References
|
||||
|
||||
The primary reference for the material in this article is
|
||||
[Bureau42](https://bureau42.com/view/7398/teaching-tidbit-calculating-logarithms-by-hand).
|
||||
While the link still resolves, the material it used to point to no longer seems to be
there. Some searching led me to a
[backup copy](https://vault.hanover.edu/~vaughnj/Mat%20112%20Calculus%20with%20Review/manual_logarithms.pdf)
and [another source linked from Hacker News](http://eulerarchive.maa.org/hedi/HEDI-2005-07.pdf).
|
||||
@@ -20,6 +20,12 @@ enableRobotsTXT = true
|
||||
[markup.goldmark]
|
||||
[markup.goldmark.renderer]
|
||||
unsafe = true
|
||||
[markup.goldmark.extensions]
|
||||
[markup.goldmark.extensions.passthrough]
|
||||
enable = true
|
||||
[markup.goldmark.extensions.passthrough.delimiters]
|
||||
block = [['\[', '\]'], ['$$', '$$']]
|
||||
inline = [['\(', '\)'], ['$', '$']]
|
||||
|
||||
# Multilingual mode config. More for information about how to setup translation,
|
||||
# see https://gohugo.io/content-management/multilingual/
|
||||
@@ -79,3 +85,4 @@ enableRobotsTXT = true
|
||||
[params.author]
|
||||
# name = "Avinash Mallya" # Your name as shown in the RSS feed metadata
|
||||
# email = "nah@example.com" # Added to the footer so readers can reply to posts
|
||||
|
||||
|
||||
9
layouts/_markup/render-passthrough.html
Normal file
@@ -0,0 +1,9 @@
|
||||
{{- $opts := dict "output" "htmlAndMathml" "displayMode" (eq .Type "block") }}
|
||||
{{- with try (transform.ToMath .Inner $opts) }}
|
||||
{{- with .Err }}
|
||||
{{- errorf "Unable to render mathematical markup to HTML using the transform.ToMath function. The KaTeX display engine threw the following error: %s: see %s." . $.Position }}
|
||||
{{- else }}
|
||||
{{- .Value }}
|
||||
{{- $.Page.Store.Set "hasMath" true }}
|
||||
{{- end }}
|
||||
{{- end -}}
|
||||
65
layouts/baseof.html
Normal file
@@ -0,0 +1,65 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="{{ with .Site.LanguageCode }}{{ . }}{{ else }}en-US{{ end }}">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
{{- partial "favicon.html" . -}}
|
||||
<title>{{- block "title" . }}{{ with .Title }}{{ . }} | {{ end }}{{ .Site.Title }}{{- end }}</title>
|
||||
|
||||
{{- partial "seo_tags.html" . -}}
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
{{ $style := print (default "original" .Site.Params.themeStyle) ".css" | resources.Get | minify }}
|
||||
<link href="{{ $style.RelPermalink }}" rel="stylesheet">
|
||||
|
||||
{{ if (.Page.Store.Get "hasCodeBlock") }}
|
||||
{{ $syntax := resources.Get "syntax.css" | minify }}
|
||||
<link href="{{ $syntax.RelPermalink }}" rel="stylesheet">
|
||||
{{ end }}
|
||||
|
||||
{{ with .Params.style }}
|
||||
{{ $extra := resources.Get . | minify }}
|
||||
<link href="{{ $extra.RelPermalink }}" rel="stylesheet">
|
||||
{{ end }}
|
||||
|
||||
{{ with .OutputFormats.Get "rss" -}}
|
||||
{{ printf `<link rel="%s" type="%s" href="%s" title="%s" />` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
|
||||
{{ end -}}
|
||||
|
||||
<!-- A partial to be overwritten by the user.
|
||||
Simply place a custom_head.html into
|
||||
your local /layouts/partials-directory -->
|
||||
{{- partial "custom_head.html" . -}}
|
||||
|
||||
{{ $noop := .WordCount }}
|
||||
{{ if .Page.Store.Get "hasMath" }}
|
||||
<link
|
||||
rel="stylesheet"
|
||||
href="https://cdn.jsdelivr.net/npm/katex@0.16.25/dist/katex.min.css"
|
||||
integrity="sha384-WcoG4HRXMzYzfCgiyfrySxx90XSl2rxY5mnVY5TwtWE6KLrArNKn0T/mOgNL0Mmi"
|
||||
crossorigin="anonymous"
|
||||
>
|
||||
{{ end }}
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header>
|
||||
{{- partial "header.html" . -}}
|
||||
</header>
|
||||
<main id="main-content">
|
||||
{{- block "main" . }}{{- end }}
|
||||
</main>
|
||||
<footer>
|
||||
{{- partial "footer.html" . -}}
|
||||
</footer>
|
||||
|
||||
<!-- A partial to be overwritten by the user.
|
||||
Simply place a custom_body.html into
|
||||
your local /layouts/partials-directory -->
|
||||
{{- partial "custom_body.html" . -}}
|
||||
</body>
|
||||
|
||||
</html>
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="404 Page not found" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -47,6 +47,9 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -58,6 +61,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -69,6 +69,9 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -80,6 +83,8 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -188,24 +193,24 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">DATEDIFF</span><span class="p">(</span><span class="s1">'seconds'</span><span class="p">,</span><span class="w"> </span><span class="n">arrival_time</span><span class="p">,</span><span class="w"> </span><span class="n">departure_time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">duration</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="k">ON</span><span class="w"> </span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">))</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>A small, succinct query such as this will need a bit of explanation to take it all in. Here’s one below, reproducible in Python (make sure to install <code>duckdb</code> first!). Expand it to view.</p>
|
||||
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>A small, succinct query such as this will need a bit of explanation to take it all in. Here’s one below, reproducible in Python (make sure to install <code>duckdb</code> first!). Expand it to view.</p>
|
||||
<details markdown="1"><summary>SQL with explanation.</summary>
|
||||
|
||||
|
||||
@@ -293,13 +298,13 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Case 2 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Case 3 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1">-- Case 3 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Case 4 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1">-- Case 4 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span></span></span></code></pre></div><p>What is common between these three conditions? It takes a while to see it; but it becomes clear that all these cases require the start of the overlap to be <em>before</em> the window ends, and the end of the overlap to be <em>after</em> the window starts. This can be simplified to just:</p>
|
||||
|
||||
|
||||
@@ -307,7 +312,7 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span></span></span></code></pre></div><p>making our query much simpler!</p>
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span></span></span></code></pre></div><p>making our query much simpler!</p>
|
||||
<h3 id="simplified-sql-part-1">Simplified SQL: Part 1</h3>
|
||||
<p>We’ve removed the need for the <code>duration</code> calculation altogether. Therefore, we can write:</p>
|
||||
|
||||
@@ -323,19 +328,19 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="k">ON</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Can we simplify this even further?</p>
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="p">)</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Can we simplify this even further?</p>
|
||||
<h3 id="simplification-part-2">Simplification: Part 2</h3>
|
||||
<p>I think the SQL query in the above section is already very easy to read. However, it is a little clunky overall, and we can leverage DuckDB’s extensive optimizations to improve <strong>legibility</strong> by rewriting the query as a cross join:</p>
|
||||
|
||||
@@ -350,10 +355,10 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Why does this work? Before optimization on DuckDB, this is what the query plan looks like:</p>
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Why does this work? Before optimization on DuckDB, this is what the query plan looks like:</p>
|
||||
<details markdown="1"><summary>DuckDB query plan before optimization</summary>
|
||||
|
||||
|
||||
@@ -473,7 +478,7 @@ Problem Statement Suppose we have a dataset that captures the arrival and depart
|
||||
<li>By default, the match type covers all three cases from the image above. Side note: it also supports matching <code>within</code> overlaps, as well as matching <code>start</code> and <code>end</code> windows.</li>
|
||||
<li>The last two matching columns in the join condition in <code>by</code> must specify the <code>start</code> and <code>end</code> points of the overlapping window. This isn’t a problem for us now, but it does restrict future uses where we may want non-equi joins in other cases.</li>
|
||||
</ol>
|
||||
<h3 id="the-code-_si_-the-code">The code, <em>si</em>, the code!</h3>
|
||||
<h3 id="the-code-si-the-code">The code, <em>si</em>, the code!</h3>
|
||||
<p>Without further ado:</p>
|
||||
|
||||
|
||||
|
||||
@@ -81,6 +81,9 @@ You have a large-ish set of (imperfectly) labelled data points. These data point
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -92,6 +95,8 @@ You have a large-ish set of (imperfectly) labelled data points. These data point
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -239,7 +244,7 @@ You have a large-ish set of (imperfectly) labelled data points. These data point
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that ANN algorithms take shortcuts to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -76,6 +76,16 @@ It’s a VBA based PowerPoint add-on. Just a set of commands that work well with
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link
|
||||
rel="stylesheet"
|
||||
href="https://cdn.jsdelivr.net/npm/katex@0.16.25/dist/katex.min.css"
|
||||
integrity="sha384-WcoG4HRXMzYzfCgiyfrySxx90XSl2rxY5mnVY5TwtWE6KLrArNKn0T/mOgNL0Mmi"
|
||||
crossorigin="anonymous"
|
||||
>
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -87,6 +97,8 @@ It’s a VBA based PowerPoint add-on. Just a set of commands that work well with
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -145,7 +157,7 @@ It’s a VBA based PowerPoint add-on. Just a set of commands that work well with
|
||||
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
|
||||
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it’s capable of, but here’s another summary just in case:</p>
|
||||
<ol>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just $x$ and $y$ here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span></span></span> here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that’s usually difficult if you’ve already configured the charts a little - which can be remedied with this option!</li>
|
||||
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you’ve selected with the way it originally is in the “set” chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
|
||||
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
|
||||
|
||||
@@ -5,8 +5,8 @@
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Environment Variables and Multiprocessing | Avinash's Blog</title>
|
||||
<meta name="title" content="Environment Variables and Multiprocessing" />
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines | Avinash's Blog</title>
|
||||
<meta name="title" content="Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines" />
|
||||
<meta name="description" content="Premise
|
||||
I needed to use a codebase that had a mix of light numpy operations, coupled with a few heavy mathematical optimization problems that did not use numpy.
|
||||
The API that I had access to proved to be too slow (even asynchronously), so I thought I’d run this locally on a faster system in parallel to speed
|
||||
@@ -22,7 +22,7 @@ things up. Turns out, that was easier said than done, and I picked up the import
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Environment Variables and Multiprocessing">
|
||||
<meta property="og:title" content="Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines">
|
||||
<meta property="og:description" content="Premise I needed to use a codebase that had a mix of light numpy operations, coupled with a few heavy mathematical optimization problems that did not use numpy. The API that I had access to proved to be too slow (even asynchronously), so I thought I’d run this locally on a faster system in parallel to speed things up. Turns out, that was easier said than done, and I picked up the importance of environment variables along the way.">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="article">
|
||||
@@ -42,13 +42,13 @@ things up. Turns out, that was easier said than done, and I picked up the import
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Environment Variables and Multiprocessing">
|
||||
<meta name="twitter:title" content="Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines">
|
||||
<meta name="twitter:description" content="Premise I needed to use a codebase that had a mix of light numpy operations, coupled with a few heavy mathematical optimization problems that did not use numpy. The API that I had access to proved to be too slow (even asynchronously), so I thought I’d run this locally on a faster system in parallel to speed things up. Turns out, that was easier said than done, and I picked up the importance of environment variables along the way.">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Environment Variables and Multiprocessing">
|
||||
<meta itemprop="name" content="Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines">
|
||||
<meta itemprop="description" content="Premise I needed to use a codebase that had a mix of light numpy operations, coupled with a few heavy mathematical optimization problems that did not use numpy. The API that I had access to proved to be too slow (even asynchronously), so I thought I’d run this locally on a faster system in parallel to speed things up. Turns out, that was easier said than done, and I picked up the importance of environment variables along the way.">
|
||||
<meta itemprop="datePublished" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-01-14T00:00:00+00:00">
|
||||
@@ -69,6 +69,9 @@ things up. Turns out, that was easier said than done, and I picked up the import
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -80,6 +83,8 @@ things up. Turns out, that was easier said than done, and I picked up the import
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -94,7 +99,7 @@ things up. Turns out, that was easier said than done, and I picked up the import
|
||||
</header>
|
||||
<main id="main-content">
|
||||
|
||||
<h1>Environment Variables and Multiprocessing</h1>
|
||||
<h1>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</h1>
|
||||
<p class="byline">
|
||||
<time datetime='2026-01-14' pubdate>
|
||||
2026-01-14
|
||||
@@ -252,7 +257,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
407
public/blog/005_ldmb_as_image_db/index.html
Normal file
@@ -0,0 +1,407 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB | Avinash's Blog</title>
|
||||
<meta name="title" content="Resolving I/O Bottlenecks for 100K Small Files with LMDB" />
|
||||
<meta name="description" content="Premise
|
||||
I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.
|
||||
I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them." />
|
||||
<meta name="author" content="Avinash Mallya" />
|
||||
<meta name="keywords" content="python,machine-learning,storage,images,dvc,lmdb,bottleneck,io," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/blog/005_ldmb_as_image_db/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Resolving I/O Bottlenecks for 100K Small Files with LMDB">
|
||||
<meta property="og:description" content="Premise I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.
|
||||
I had several thousand images already. I was expecting several thousand more. My repository was tracking these images via DVC. My computer was also slowing down massively because of the sheer number of files. DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast). I also needed to access files at random for training/evaluating the model (lots of shuffling). Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="article">
|
||||
<meta property="article:section" content="blog">
|
||||
<meta property="article:published_time" content="2026-02-10T00:00:00+00:00">
|
||||
<meta property="article:modified_time" content="2026-02-10T00:00:00+00:00">
|
||||
<meta property="article:tag" content="Python">
|
||||
<meta property="article:tag" content="Machine-Learning">
|
||||
<meta property="article:tag" content="Storage">
|
||||
<meta property="article:tag" content="Images">
|
||||
<meta property="article:tag" content="Dvc">
|
||||
<meta property="article:tag" content="Lmdb">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Resolving I/O Bottlenecks for 100K Small Files with LMDB">
|
||||
<meta name="twitter:description" content="Premise I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.
|
||||
I had several thousand images already. I was expecting several thousand more. My repository was tracking these images via DVC. My computer was also slowing down massively because of the sheer number of files. DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast). I also needed to access files at random for training/evaluating the model (lots of shuffling). Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Resolving I/O Bottlenecks for 100K Small Files with LMDB">
|
||||
<meta itemprop="description" content="Premise I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.
|
||||
I had several thousand images already. I was expecting several thousand more. My repository was tracking these images via DVC. My computer was also slowing down massively because of the sheer number of files. DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast). I also needed to access files at random for training/evaluating the model (lots of shuffling). Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="wordCount" content="1951">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta itemprop="keywords" content="Python,Machine-Learning,Storage,Images,Dvc,Lmdb,Bottleneck,Io">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
<link href="/syntax.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
|
||||
<h1>Resolving I/O Bottlenecks for 100K Small Files with LMDB</h1>
|
||||
<p class="byline">
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
· Avinash Mallya
|
||||
</p>
|
||||
|
||||
<content>
|
||||
<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can nest each file under a directory named after the first character of the hash,
|
||||
then a subdirectory named after the second character. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system lookups take less time.</p>
|
||||
<p>This isn’t a perfect solution - it would require that I store the images under their hash,
|
||||
and handle the directory structure correctly. I would also need to build my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC would still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
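<p>As a rough illustration of the partitioning scheme above, here is a minimal sketch. It assumes the file name is the SHA-256 hash of the file contents; the helper names <code>partitioned_path</code> and <code>store</code>, and the <code>.jpg</code> default suffix, are hypothetical and not taken from <code>git</code> or <code>dvc</code>.</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Return root/<1st hash char>/<2nd hash char>/<full hash><suffix>."""
    # Assumption: the file is named after the SHA-256 digest of its contents.
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Write the bytes to their partitioned location and return the path."""
    target = partitioned_path(root, data, suffix)
    # Create the two levels of partition directories on demand.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

<p>With hexadecimal hashes, two levels of single-character directories give at most 256 leaf directories, so each one holds a small fraction of the files.</p>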
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. My task would then reduce to just correctly maintaining the index so
|
||||
that I know which images are stored on S3, and can link their metadata via their hash.</p>
|
||||
<p>If you’ve ever retrieved a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1000 objects returned per list request, which surfaces in <code>boto3</code>. You’ll need to work around it with pagination, which, while standard,
|
||||
is still more work. You would have also realized that S3 takes quite some time to give you the
|
||||
full list, even after you’ve optimized as much as you can.</p>
|
||||
<p>Using S3 also means I would need an active internet connection to access any data. That introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I could minimize both by storing the
|
||||
images in the same AWS region as the training run (say, on an EC2 instance), but that would mean I needed access to an
|
||||
EC2 instance.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
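<p>For reference, the pagination workaround mentioned above could look roughly like the sketch below. It relies on the <code>list_objects_v2</code> paginator that <code>boto3</code> S3 clients expose; the function name <code>iter_s3_keys</code> and the bucket and prefix values are hypothetical.</p>

```python
from typing import Any, Iterator


def iter_s3_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, one page of results at a time."""
    # `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Hypothetical usage (needs boto3, credentials and network access):
#   import boto3
#   keys = list(iter_s3_keys(boto3.client("s3"), "my-image-bucket", "images/"))
```

<p>This hides the per-request limit, but every page is still a separate network round trip, so listing a large bucket remains slow.</p>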
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - chiefly the ability to write and read a particular key (stored as a bytestring)
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, then link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slowdowns due to the sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - you can
|
||||
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
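<p>The key scheme described above (one image key and one metadata key per file) can be sketched without touching LMDB at all. The snippet below is only an illustration: JSON as the metadata serialization format and the helper names <code>to_items</code>, <code>decode_metadata</code> and <code>image_names</code> are assumptions, not the class I actually wrote.</p>

```python
import json

IMAGE_SUFFIX = "_image"
METADATA_SUFFIX = "_metadata"


def to_items(file_name: str, image: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Build the two (key, value) pairs stored for a single image."""
    return [
        (f"{file_name}{IMAGE_SUFFIX}".encode(), image),
        # Assumption: metadata is JSON-serializable (labels, boxes, text).
        (f"{file_name}{METADATA_SUFFIX}".encode(), json.dumps(metadata).encode()),
    ]


def decode_metadata(raw: bytes) -> dict:
    """Reverse the metadata serialization used in to_items."""
    return json.loads(raw.decode())


def image_names(keys: list[bytes]) -> list[str]:
    """Recover the original file names from the keys present in the store."""
    decoded = (k.decode() for k in keys)
    return [k.removesuffix(IMAGE_SUFFIX) for k in decoded if k.endswith(IMAGE_SUFFIX)]
```

<p>The list returned by <code>to_items</code> has the shape of the key-value tuples that the Python bindings accept for bulk writes, so batches of these pairs can be concatenated and written in one transaction.</p>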
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
the images that I retrieved. This wasn’t LMDB’s fault; it happened because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
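<p>A minimal sketch of how this can be checked, assuming the image was read with <code>Path.read_bytes()</code> and never decoded. The helper names <code>sha256_of</code> and <code>is_byte_identical</code> are hypothetical and not part of the class shown earlier:</p>

```python
import hashlib
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def is_byte_identical(original_path: Path, stored: bytes) -> bool:
    """Compare the raw bytes on disk with the bytes returned by the store.

    `stored` is whatever came back from the key-value store. If the image
    was decoded and re-encoded on the way in, this will usually be False.
    """
    return sha256_of(original_path.read_bytes()) == sha256_of(stored)
```

<p>Comparing digests rather than decoded pixels is deliberate: two files can decode to the same picture and still differ byte for byte.</p>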
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and if the data grows past this size, writes will fail. You can change this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
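<p>One way to pick a value for <code>max_size_as_mb</code> is to add up the size of the files you plan to store and leave some headroom. The sketch below is only an assumption about how you might do that: the function name <code>estimate_max_size_as_mb</code> and the default <code>headroom</code> factor of 2.0 are illustrative, and the real overhead of LMDB depends on your keys and values.</p>

```python
import math
from pathlib import Path


def estimate_max_size_as_mb(
    image_dir: Path, pattern: str = "*.jpg", headroom: float = 2.0
) -> int:
    """Estimate an upper bound (in MiB) for the database size.

    Sums the on-disk size of every file matching `pattern` and multiplies
    it by `headroom` (an arbitrary safety factor, not an LMDB constant).
    Always returns at least 1.
    """
    total_bytes = sum(p.stat().st_size for p in image_dir.glob(pattern))
    return max(1, math.ceil(total_bytes * headroom / 2**20))
```

<p>The result can be passed as the <code>max_size_as_mb</code> argument of the <code>ImageDB</code> class above.</p>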
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution that I built before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - it can provide an offset index to provide random access to data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
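<p>To make the first alternative concrete, here is a rough sketch of an offset index over an <em>uncompressed</em> <code>tar</code> file, using only the standard library. The function names <code>build_offset_index</code> and <code>read_member</code> are hypothetical, and the approach assumes the archive is not modified after the index is built.</p>

```python
import tarfile
from pathlib import Path


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Map each regular file in the archive to (data offset, size in bytes)."""
    index: dict[str, tuple[int, int]] = {}
    # Mode "r:" only accepts uncompressed archives, which is what makes
    # seeking straight to a member's data possible later on.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(
    tar_path: Path, index: dict[str, tuple[int, int]], name: str
) -> bytes:
    """Read one member's bytes directly, without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

<p>The index is a plain dictionary, so it can be serialized next to the archive and reloaded instead of being rebuilt on every run.</p>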
|
||||
|
||||
</content>
|
||||
<p>
|
||||
|
||||
<a class="blog-tags" href="/tags/python/">#python</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/machine-learning/">#machine-learning</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/storage/">#storage</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/images/">#images</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/dvc/">#dvc</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/lmdb/">#lmdb</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/bottleneck/">#bottleneck</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/io/">#io</a>
|
||||
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="blog" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -35,8 +35,8 @@
|
||||
|
||||
|
||||
<meta itemprop="name" content="blog">
|
||||
<meta itemprop="datePublished" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/blog/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -78,6 +83,19 @@
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
@@ -87,7 +105,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
@@ -136,14 +154,36 @@
|
||||
|
||||
<a class="blog-tags" href="/tags/approximate/">#approximate</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/bottleneck/">#bottleneck</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/by-hand/">#by-hand</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/category/">#category</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/dvc/">#dvc</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/environment/">#environment</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/euler/">#euler</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/faiss/">#faiss</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/graph/">#graph</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/history/">#history</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/images/">#images</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/io/">#io</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/lmdb/">#lmdb</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/logarithms/">#logarithms</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/machine-learning/">#machine-learning</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/mathematics/">#mathematics</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/multiprocessing/">#multiprocessing</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/nearest/">#nearest</a>
|
||||
@@ -170,6 +210,10 @@
|
||||
|
||||
<a class="blog-tags" href="/tags/samples/">#samples</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/square-roots/">#square-roots</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/storage/">#storage</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/variables/">#variables</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/vba/">#vba</a>
|
||||
|
||||
@@ -7,10 +7,273 @@
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/blog/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
name is a hash, so you can use the first character of the hash as a directory, the second character
as a subdirectory, and then store the actual file inside it. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system look ups take less time.</p>
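<p>The layout above can be computed with a couple of lines. This is a minimal sketch that assumes file names are at least two characters long. The helper names <code>partitioned_path</code> and <code>hashed_name</code>, and the choice of a 10-character SHA-256 prefix, are illustrative assumptions and not what <code>git</code> or <code>dvc</code> do internally.</p>

```python
import hashlib
from pathlib import Path


def hashed_name(data: bytes, suffix: str) -> str:
    """Name a file after a (truncated) SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()[:10] + suffix


def partitioned_path(root: Path, file_name: str) -> Path:
    """Place a file two directory levels deep, keyed by its first two characters."""
    return root / file_name[0] / file_name[1] / file_name
```

<p>For example, <code>partitioned_path(Path("files"), "a1b2c3d4e5.txt")</code> gives <code>files/a/1/a1b2c3d4e5.txt</code>, matching the tree shown above.</p>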
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
and handle the directory structure correctly. I would also need my own
mechanism to maintain the link between the hash and the metadata, which means creating
some sort of index. Lastly, DVC would still track files individually, which means that
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve ever listed a large number of files stored on S3, you will have run into the limit
of 1000 objects per request that you hit when listing through <code>boto3</code>. You’ll need to work around it with pagination, which, while standard,
is still more work. You will also have noticed that, even after optimizing as much as you can, S3 takes quite some time to
return the full list.</p>
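<p>For reference, the pagination workaround usually looks something like the sketch below. It takes an already constructed client (for example <code>boto3.client("s3")</code>) as an argument so that the function itself has no dependencies. The function name <code>iter_object_keys</code> and the bucket and prefix in the usage comment are hypothetical.</p>

```python
from typing import Any, Iterator


def iter_object_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, one page of results at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   keys = list(iter_object_keys(boto3.client("s3"), "my-bucket", "images/"))
```

<p>This removes the boilerplate, but not the waiting: each page is still a separate request.</p>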
|
||||
<p>However, this approach means I need an active internet connection to access any data. It also introduces latency during training,
and wastes bandwidth when running multiple training sessions for the models. I can minimize both by storing the
images in the same AWS region that I run the training in (say, on an EC2 instance), but that means I need access to an
EC2 instance.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
for maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - mainly the ability to write and read a particular key (stored as a bytestring),
which itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, then link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
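<p>The key scheme in the steps above can be sketched without LMDB at all. The example below uses a plain <code>dict</code> as a stand-in for the database. The helper names are hypothetical, and JSON is only my assumption for the metadata serialization format (any format that produces bytes would do).</p>

```python
import json


def image_key(file_name: str) -> bytes:
    """Key under which the raw image bytes are stored."""
    return f"{file_name}_image".encode()


def metadata_key(file_name: str) -> bytes:
    """Key under which the serialized metadata is stored."""
    return f"{file_name}_metadata".encode()


def to_records(file_name: str, image: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs written for a single image."""
    return [
        (image_key(file_name), image),
        (metadata_key(file_name), json.dumps(metadata).encode()),
    ]


def names_from_keys(keys) -> list[str]:
    """Recover the stored file names by stripping the "_image" suffix."""
    suffix = b"_image"
    return sorted(k[: -len(suffix)].decode() for k in keys if k.endswith(suffix))
```

<p>Because both keys are derived from the file name, no separate index is needed to get from an image to its metadata.</p>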
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I need to do is add them to the DB. LMDB has a few options available here - you can
avoid overwriting an existing key, and ensure that the database is de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
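<p>The batching done by hand in the <code>__main__</code> block above can be factored into a small helper. This is a minimal sketch rather than part of the original code: <code>batched_dict</code> is a hypothetical name, and the default batch size of 1000 simply mirrors the example.</p>

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from pathlib import Path


def batched_dict(
    paths: Iterable[Path], batch_size: int = 1000
) -> Iterator[dict[str, Path]]:
    """Yield {file name: path} dicts holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    # islice pulls the next `batch_size` items; an empty batch means we are done.
    while batch := list(islice(iterator, batch_size)):
        # Assumes file names are unique within a batch (true for a single directory).
        yield {path.name: path for path in batch}
```

<p>Each yielded dict could then be passed to <code>save_images</code>, which also removes the need for a separate call for the last, partially filled batch.</p>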
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
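<p>To see why a lossless roundtrip can still change the bytes, here is an analogy that uses only the standard library (it is not PIL itself). PNG compresses pixel data with DEFLATE, so <code>zlib</code> stands in for an image encoder here; the <code>pixels</code> buffer is made-up data, not a real image.</p>

```python
import zlib

# Stand-in for decoded pixel data; a real image would come from a decoder.
pixels = bytes(range(256)) * 64

# Two lossless "encoders" that differ only in their compression setting.
encoded_fast = zlib.compress(pixels, 1)
encoded_small = zlib.compress(pixels, 9)

# The decoded content is identical...
same_content = zlib.decompress(encoded_fast) == zlib.decompress(encoded_small)
# ...but the encoded bytes that would land on disk are not.
same_bytes = encoded_fast == encoded_small

print(f"same content: {same_content}, same bytes: {same_bytes}")
```

<p>Reading the file with <code>Path.read_bytes()</code>, as the code above does, sidesteps this entirely because no encoder is involved.</p>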
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and writes will fail once the DB would exceed this size. You can change this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
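<p>One way to pick the bound is to derive it from the data you plan to store. The helper below is a rough sketch: <code>estimate_max_size_as_mb</code> is a hypothetical name, and the <code>headroom</code> factor of 1.5 is an assumption to cover keys, page overhead and growth, not a value recommended by LMDB.</p>

```python
import math
from collections.abc import Iterable

BYTES_PER_MB = 1024 * 1024


def estimate_max_size_as_mb(
    sizes_in_bytes: Iterable[int], headroom: float = 1.5
) -> int:
    """Return a DB size bound in MB for payloads of the given sizes.

    `headroom` is an assumed safety factor; tune it for your own data.
    """
    total_bytes = sum(sizes_in_bytes)
    # Round up, and never return less than 1 MB.
    return max(1, math.ceil(total_bytes * headroom / BYTES_PER_MB))
```

<p>For the example above, the sizes could come from <code>p.stat().st_size</code> for each path returned by <code>Path("./data/jpg/").glob("*.jpg")</code>.</p>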
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - an offset index built over it gives you random access to the data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
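<p>As an illustration of the first alternative, an offset index over a plain <code>tar</code> file can be sketched with the standard library alone. The function names <code>build_tar_index</code> and <code>read_member</code> are hypothetical, and the sketch assumes an uncompressed archive, since byte offsets are only directly seekable in that case.</p>

```python
import tarfile


def build_tar_index(tar_path: str) -> dict[str, tuple[int, int]]:
    """Map each regular file name to (data offset, size) inside the archive."""
    index: dict[str, tuple[int, int]] = {}
    # Mode "r:" opens uncompressed archives only and fails on compressed ones.
    with tarfile.open(tar_path, mode="r:") as archive:
        for member in archive:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(
    tar_path: str, index: dict[str, tuple[int, int]], name: str
) -> bytes:
    """Read one file from the archive by seeking straight to its data."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

<p>The index is built once with a single pass over the archive; after that, each lookup is a seek and a read, with no need to scan the preceding members.</p>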
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +427,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
@@ -224,7 +487,7 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
|
||||
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it’s capable of, but here’s another summary just in case:</p>
|
||||
<ol>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just $x$ and $y$ here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span></span></span> here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that’s usually difficult if you’ve already configured the charts a little - which can be remedied with this option!</li>
|
||||
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you’ve selected with the way it originally is in the “set” chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
|
||||
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
|
||||
@@ -411,7 +674,7 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that there are shortcuts that ANN algorithms take to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
@@ -753,24 +1016,24 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">DATEDIFF</span><span class="p">(</span><span class="s1">'seconds'</span><span class="p">,</span><span class="w"> </span><span class="n">arrival_time</span><span class="p">,</span><span class="w"> </span><span class="n">departure_time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">duration</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="k">ON</span><span class="w"> </span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">))</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>A small, succinct query such as this will need a bit of explanation to take it all in. Here’s one below, reproducible in Python (make sure to install <code>duckdb</code> first!). Expand it to view.</p>
|
||||
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>A small, succinct query such as this will need a bit of explanation to take it all in. Here’s one below, reproducible in Python (make sure to install <code>duckdb</code> first!). Expand it to view.</p>
|
||||
<details markdown="1"><summary>SQL with explanation.</summary>
|
||||
|
||||
|
||||
@@ -858,13 +1121,13 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Case 2 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Case 3 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1">-- Case 3 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span><span class="w"> </span><span class="k">OR</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Case 4 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1">-- Case 4 in the diagram
|
||||
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"> </span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">TO_SECONDS</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">duration</span><span class="p">))</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="p">)</span></span></span></code></pre></div><p>What is common between these three conditions? It takes a while to see it; but it becomes clear that all these cases require the start of the overlap to be <em>before</em> the window ends, and the end of the overlap to be <em>after</em> the window starts. This can be simplified to just:</p>
|
||||
|
||||
|
||||
@@ -872,7 +1135,7 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span></span></span></code></pre></div><p>making our query much simpler!</p>
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span></span></span></code></pre></div><p>making our query much simpler!</p>
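To see why the three cases collapse into two comparisons, here is a minimal brute-force sketch in Python. It is not part of the original query: the function names are hypothetical, and timestamps are modelled as small integers. It assumes every truck arrives no later than it departs and every window opens no later than it closes.

```python
from itertools import product


def three_case_overlap(arrival, departure, window_open, window_close):
    """Mirror of the three OR-ed join conditions (cases 2, 3 and 4)."""
    duration = departure - arrival
    case_2 = arrival <= window_open and arrival + duration >= window_open
    case_3 = arrival >= window_open and departure <= window_close
    case_4 = arrival >= window_open and departure - duration <= window_close
    return case_2 or case_3 or case_4


def simplified_overlap(arrival, departure, window_open, window_close):
    """The simplified condition: starts before the window closes, ends after it opens."""
    return arrival <= window_close and departure >= window_open


def conditions_agree(limit=8):
    """Exhaustively compare both predicates on a small integer grid."""
    points = range(limit)
    for arrival, departure, window_open, window_close in product(points, repeat=4):
        # Assumption: intervals are well-formed (start <= end).
        if arrival > departure or window_open > window_close:
            continue
        old = three_case_overlap(arrival, departure, window_open, window_close)
        new = simplified_overlap(arrival, departure, window_open, window_close)
        if old != new:
            return False
    return True


if __name__ == "__main__":
    print(conditions_agree())
```

The check only holds for well-formed intervals. If a row could have a departure earlier than its arrival, the two predicates may disagree, so such rows would need cleaning before relying on the simplified join.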
|
||||
<h3 id="simplified-sql-part-1">Simplified SQL: Part 1</h3>
|
||||
<p>We’ve removed the need for the <code>duration</code> calculation altogether now. Therefore, we can write:</p>
|
||||
|
||||
@@ -888,19 +1151,19 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">arrival_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="p">)</span><span class="w"> </span><span class="n">A</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="k">ON</span><span class="w"> </span><span class="p">(</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_close</span><span class="w"> </span><span class="k">AND</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">A</span><span class="p">.</span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Can we simplify this even further?</p>
|
||||
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="p">)</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Can we simplify this even further?</p>
|
||||
<h3 id="simplification-part-2">Simplification: Part 2</h3>
|
||||
<p>I think the SQL query in the above section is already very easy to read. However, it is a little clunky overall, and we can leverage DuckDB’s extensive optimizations to improve its <strong>legibility</strong> by rewriting the query as a cross join:</p>
|
||||
|
||||
@@ -915,10 +1178,10 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">A</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="nb">INTERVAL</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">MINUTE</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_DISTINCT</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_trucks</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"> </span><span class="p">,</span><span class="n">LIST_UNIQUE</span><span class="p">(</span><span class="n">LIST</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">ID</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">docked_truck_count</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Why does this work? Before optimization on DuckDB, this is what the query plan looks like:</p>
|
||||
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">data</span><span class="w"> </span><span class="n">B</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">arrival_time</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">window_close</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">B</span><span class="p">.</span><span class="n">departure_time</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">window_open</span><span class="w">
|
||||
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span></span></span></code></pre></div><p>Why does this work? Before optimization on DuckDB, this is what the query plan looks like:</p>
|
||||
<details markdown="1"><summary>DuckDB query plan before optimization</summary>
|
||||
|
||||
|
||||
@@ -1038,7 +1301,7 @@ these threads is that they often are “sleeping”, and don’t <em
|
||||
<li>The default match type is that it matches all three cases from the image above. Side note: it also has matches for <code>within</code> overlap, matching <code>start</code> and <code>end</code> windows,</li>
|
||||
<li>The last two matching columns in the join condition in <code>by</code> must specify the <code>start</code> and <code>end</code> points of the overlapping window. This isn’t a problem for us now, but it does restrict future uses where we may want non-equi joins in other cases.</li>
|
||||
</ol>
|
||||
<h3 id="the-code-_si_-the-code">The code, <em>si</em>, the code!</h3>
|
||||
<h3 id="the-code-si-the-code">The code, <em>si</em>, the code!</h3>
|
||||
<p>Without further ado:</p>
|
||||
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Categories" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -48,6 +48,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/categories/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -59,6 +62,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta name="generator" content="Hugo 0.142.0">
|
||||
<meta name="generator" content="Hugo 0.155.3">
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
@@ -10,7 +10,7 @@
|
||||
<meta name="title" content="about" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -45,8 +45,8 @@ A few posts where I show up some creative ways that I’ve solved complex proble
|
||||
<meta itemprop="description" content="Hi there! My name is Avinash Mallya (pronounced Uh-vin-aash Muh-ll-yeah), and I’m a data scientist by profession. This website is a creative outlet, and my piece of the internet where I show off.
|
||||
What’s here? You’ll find the following:
|
||||
A few posts where I show up some creative ways that I’ve solved complex problems. Links to projects that I’ve worked on, or have contributed to. An assortment of random things I’ve found interesting. Contact You can find me on:">
|
||||
<meta itemprop="datePublished" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="wordCount" content="94">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
@@ -61,6 +61,9 @@ A few posts where I show up some creative ways that I’ve solved complex proble
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -72,6 +75,8 @@ A few posts where I show up some creative ways that I’ve solved complex proble
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
989
public/index.xml
File diff suppressed because one or more lines are too long
@@ -10,6 +10,8 @@
|
||||
<meta name="description" content="Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.
|
||||
Featured projects
|
||||
|
||||
ducktabe: Short for DuckDB Tabular Explorer. A small tool to run SQL on parquet files I generated, for “quick and dirty” analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage.
|
||||
ducktabe-py: An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I’ve got significant experience in. It is surprisingly capable, and is available on PyPI.
|
||||
BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link.
|
||||
PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post.
|
||||
|
||||
@@ -32,7 +34,7 @@ I wrote several chapters of the Polars Book, which have since been moved to the
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="projects">
|
||||
<meta property="og:description" content="Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.
|
||||
Featured projects BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quicky churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R and understand, and save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breadth of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
Featured projects ducktabe: Short for DuckDB Tabular Explorer. A small tool to run SQL on parquet files I generated, for “quick and dirty” analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage. ducktabe-py: An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I’ve got significant experience in. It is surprisingly capable, and is available on PyPI. BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quicky churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R and understand, and save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breadth of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="article">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
@@ -44,15 +46,15 @@ Featured projects BorrowChecker: A play on the same concept in Rust, this is a s
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="projects">
|
||||
<meta name="twitter:description" content="Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.
|
||||
Featured projects BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quickly churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R can understand, and that can save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breath of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
Featured projects ducktabe: Short for DuckDB Tabular Explorer. A small tool to run SQL on parquet files I generated, for “quick and dirty” analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage. ducktabe-py: An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I’ve got significant experience in. It is surprisingly capable, and is available on PyPI. BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quickly churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R can understand, and that can save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breath of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="projects">
|
||||
<meta itemprop="description" content="Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.
|
||||
Featured projects BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quickly churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R can understand, and that can save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breath of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
<meta itemprop="wordCount" content="276">
|
||||
Featured projects ducktabe: Short for DuckDB Tabular Explorer. A small tool to run SQL on parquet files I generated, for “quick and dirty” analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage. ducktabe-py: An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I’ve got significant experience in. It is surprisingly capable, and is available on PyPI. BorrowChecker: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. Repository link. PowerPointSnap: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying blog post. Other work or contributions IntelligentReceiptSplitter: A relatively simple predecessor to BorrowChecker that focussed on using an OCR framework followed by an LLM based parser to read receipts that could be further split manually. This combination significantly reduced hallucinations from LLMs but was still very computationally intensive to run. r.data.table.funs: A very small set of R functions that use data.table, that I found very useful earlier in my career to quickly churn out analyses. It is not ground-breaking, but rather something that anybody with sufficient basic skills in R can understand, and that can save an immense amount of time. I wrote several chapters of the Polars Book, which have since been moved to the main Polars repository. Polars was a breath of fresh air in terms of speed and ergonomics, which I had been sorely missing after switching to Python from R (where projects like data.table and dplyr dominated), so I was eager to make it better for everybody making the switch.">
|
||||
<meta itemprop="wordCount" content="361">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
@@ -65,6 +67,9 @@ Featured projects BorrowChecker: A play on the same concept in Rust, this is a s
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -76,6 +81,8 @@ Featured projects BorrowChecker: A play on the same concept in Rust, this is a s
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -94,6 +101,8 @@ Featured projects BorrowChecker: A play on the same concept in Rust, this is a s
|
||||
<p>Most of my work is on private repositories, but I do find some time to learn new topics, contribute back to some of the open source packages I frequently use, or to create interesting tools.</p>
|
||||
<h1 id="featured-projects">Featured projects</h1>
|
||||
<ol>
|
||||
<li><a href="https://github.com/avimallu/ducktabe">ducktabe</a>: Short for <strong>Duck</strong>DB <strong>Tab</strong>ular <strong>E</strong>xplorer. A small tool to run SQL on parquet files I generated, for “quick and dirty” analysis. I am working on improving it over time and packaging it for distribution - currently it is in a pre-alpha stage.</li>
|
||||
<li><a href="https://github.com/avimallu/ducktabe-py">ducktabe-py</a>: An offshoot of the Rust version above, but written entirely by AI. It is my first foray into spec-driven development, in a language and general area that I’ve got significant experience in. It is surprisingly capable, and is <a href="https://pypi.org/project/ducktabe/">available on PyPI</a>.</li>
|
||||
<li><a href="https://avimallu.github.io/BorrowChecker/">BorrowChecker</a>: A play on the same concept in Rust, this is a simple web-app that allows you to split complex receipts with multiple people in a simple manner. Runs entirely in-browser. Made with Dioxus and Rust. <a href="https://github.com/avimallu/BorrowChecker">Repository link</a>.</li>
|
||||
<li><a href="https://github.com/avimallu/PowerPointSnap">PowerPointSnap</a>: A mostly feature complete tool for PowerPoint on VBA that is filled with a lot of tricks to make it easy to consistently format presentations to impress clients - from my consulting days. Written in VBA. See accompanying <a href="https://avimallu.dev/blog/003_powerpointsnap/">blog post</a>.</li>
|
||||
</ol>
|
||||
|
||||
439
public/series/byhand/byhand_001_square_roots/index.html
Normal file
File diff suppressed because one or more lines are too long
516
public/series/byhand/byhand_002_logarithms/index.html
Normal file
File diff suppressed because one or more lines are too long
429
public/series/byhand_001_square_roots/index.html
Normal file
File diff suppressed because one or more lines are too long
500
public/series/byhand_002_logarithms/index.html
Normal file
File diff suppressed because one or more lines are too long
198
public/series/index.html
Normal file
@@ -0,0 +1,198 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>series | Avinash's Blog</title>
|
||||
<meta name="title" content="series" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/series/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="series">
|
||||
<meta property="og:description" content="The By Hand Series">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="series">
|
||||
<meta name="twitter:description" content="The By Hand Series">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="series">
|
||||
<meta itemprop="description" content="The By Hand Series">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="wordCount" content="4">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/series/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_001_square_roots/">BasicsByHand: Square Roots</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
<div>
|
||||
|
||||
<a class="blog-tags" href="/tags/approximate/">#approximate</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/bottleneck/">#bottleneck</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/by-hand/">#by-hand</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/category/">#category</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/dvc/">#dvc</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/environment/">#environment</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/euler/">#euler</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/faiss/">#faiss</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/graph/">#graph</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/history/">#history</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/images/">#images</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/io/">#io</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/lmdb/">#lmdb</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/logarithms/">#logarithms</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/machine-learning/">#machine-learning</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/mathematics/">#mathematics</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/multiprocessing/">#multiprocessing</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/nearest/">#nearest</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/neighbor/">#neighbor</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/network/">#network</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/networkx/">#networkx</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/numpy/">#numpy</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/parallel/">#parallel</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/polars/">#polars</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/powerpoint/">#powerpoint</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/ppt/">#ppt</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/python/">#python</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/representative/">#representative</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/samples/">#samples</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/square-roots/">#square-roots</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/storage/">#storage</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/variables/">#variables</a>
|
||||
|
||||
<a class="blog-tags" href="/tags/vba/">#vba</a>
|
||||
|
||||
</div>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
685
public/series/index.xml
Normal file
File diff suppressed because one or more lines are too long
@@ -3,17 +3,71 @@
|
||||
xmlns:xhtml="http://www.w3.org/1999/xhtml">
|
||||
<url>
|
||||
<loc>https://avimallu.dev/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/blog/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/series/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/projects/</loc>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/environment/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
<loc>https://avimallu.dev/series/byhand/byhand_002_logarithms/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</loc>
|
||||
<loc>https://avimallu.dev/series/byhand/byhand_001_square_roots/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/by-hand/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/euler/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/history/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/logarithms/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/mathematics/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/square-roots/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/</loc>
|
||||
<lastmod>2026-02-15T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/bottleneck/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/dvc/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/images/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/io/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/lmdb/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/machine-learning/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/python/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/blog/005_ldmb_as_image_db/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/storage/</loc>
|
||||
<lastmod>2026-02-10T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/environment/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/multiprocessing/</loc>
|
||||
@@ -25,10 +79,7 @@
|
||||
<loc>https://avimallu.dev/tags/parallel/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/python/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/</loc>
|
||||
<loc>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</loc>
|
||||
<lastmod>2026-01-14T00:00:00+00:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://avimallu.dev/tags/variables/</loc>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Approximate" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/approximate/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
113
public/tags/bottleneck/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Bottleneck | Avinash's Blog</title>
|
||||
<meta name="title" content="Bottleneck" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/bottleneck/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Bottleneck">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Bottleneck">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Bottleneck">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/bottleneck/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Bottleneck"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/bottleneck/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Bottleneck on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/bottleneck/</link>
|
||||
<description>Recent content in Bottleneck on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/bottleneck/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can use the first character of the hash as a directory, the second character as a
|
||||
subdirectory inside it, and then store the actual file there. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system look ups take less time.</p>
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
|
||||
and handle the directory structure correctly. I will also need to maintain my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands will still be slow.</p>
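<p>The partitioning scheme above can be sketched in a few lines. This is a minimal illustration, not the layout <code>git</code> or <code>dvc</code> actually use: the choice of SHA-256, the <code>.jpg</code> suffix and the helper names <code>partitioned_path</code> and <code>store</code> are all assumptions made for the example.</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Return root/<1st hash char>/<2nd hash char>/<full hash><suffix>."""
    # Assumption: the file is named after the SHA-256 of its content.
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Write the bytes to their partitioned location and return the path."""
    target = partitioned_path(root, data, suffix)
    # Create the two levels of partition directories on demand.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

<p>Note that this only decides where a file lives; the link between a hash and its metadata would still need a separate index, which is the drawback described above.</p>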
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1000 objects per request that the S3 listing API imposes (<code>boto3</code> merely surfaces it). You’ll need to work around it with pagination, which while standard,
|
||||
is still more work. You would have also realized that, even with pagination in place, S3 will take quite some time to give you the
|
||||
list, no matter how much you optimize.</p>
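<p>For reference, the pagination workaround might look like the sketch below. It relies on the <code>list_objects_v2</code> paginator that <code>boto3</code> clients provide; the function name <code>iter_keys</code> and the bucket and prefix in the usage comment are hypothetical, and the client is passed in so the function itself does not depend on credentials.</p>

```python
from typing import Any, Iterator


def iter_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under a prefix, one page (up to 1000 keys) at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Hypothetical usage (bucket and prefix are placeholders):
#   import boto3
#   keys = list(iter_keys(boto3.client("s3"), "my-image-bucket", "images/"))
```

<p>Each page is still a separate network round trip, so this removes the bookkeeping but not the waiting.</p>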
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes quite a bit of bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, on an EC2 instance), but that means I need access to an
|
||||
EC2.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased to using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - chiefly the ability to write and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, then link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
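<p>The key scheme from step 2 can be sketched as follows. This is an illustration under stated assumptions rather than the class itself: JSON is assumed as the metadata serialization format (the steps above do not prescribe one), and the helper names <code>image_key</code>, <code>metadata_key</code>, <code>to_items</code> and <code>decode_metadata</code> are made up for the example.</p>

```python
import json


def image_key(file_name: str) -> bytes:
    """LMDB keys are bytestrings, so the suffixed name is encoded."""
    return f"{file_name}_image".encode()


def metadata_key(file_name: str) -> bytes:
    return f"{file_name}_metadata".encode()


def to_items(file_name: str, image_bytes: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Build the two (key, value) pairs that belong to one image."""
    # Assumption: metadata is JSON-serializable (labels, bounding boxes, text).
    serialized = json.dumps(metadata, sort_keys=True).encode()
    return [
        (image_key(file_name), image_bytes),
        (metadata_key(file_name), serialized),
    ]


def decode_metadata(raw: bytes) -> dict:
    """Inverse of the serialization used in to_items."""
    return json.loads(raw.decode())
```

<p>The resulting list of pairs has the shape that a bulk write such as <code>putmulti</code> in the Python bindings accepts, so several images can be written in one transaction.</p>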
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slowdowns due to the sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - you can
|
||||
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
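<p>The shuffled access that training needs can be layered on top of the list of stored names. The sketch below is one possible approach, not part of the class described above; the function name <code>shuffled_batches</code> and its parameters are hypothetical.</p>

```python
import random
from typing import Iterator, Optional


def shuffled_batches(
    names: list[str], batch_size: int, seed: Optional[int] = None
) -> Iterator[list[str]]:
    """Yield the names in random order, batch_size at a time."""
    order = list(names)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start : start + batch_size]
```

<p>The names would come from the method that lists the keys in the database, and each batch would be passed to the method that retrieves a set of images; only one batch of image bytes is held in memory at a time.</p>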
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced was that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and writes will fail once the DB would exceed this size. You can change this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
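<p>One way to pick a value for <code>max_size_as_mb</code> is to size it from the files you plan to store. The sketch below is a rough heuristic and not part of LMDB: the helper name <code>estimate_max_size_as_mb</code> and the 2x headroom factor are assumptions for illustration.</p>

```python
import math
from pathlib import Path


def estimate_max_size_as_mb(
    image_dir: Path, pattern: str = "*.jpg", headroom: float = 2.0
) -> int:
    """Estimate an upper bound (in MiB) for a DB holding the matching files.

    `headroom` is an arbitrary safety factor (an assumption, not an LMDB
    recommendation) to leave room for keys, page overhead and future files.
    """
    total_bytes = sum(p.stat().st_size for p in image_dir.glob(pattern))
    # Convert bytes to MiB (2**20 bytes), round up, and never return 0.
    return max(1, math.ceil(total_bytes * headroom / 2**20))
```

<p>The result can be passed as <code>max_size_as_mb</code> when constructing <code>ImageDB</code>; tune the headroom to how much the dataset is expected to grow.</p>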
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - you can build an offset index over it to get random access to the data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
126
public/tags/by-hand/index.html
Normal file
@@ -0,0 +1,126 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>By-Hand | Avinash's Blog</title>
|
||||
<meta name="title" content="By-Hand" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/by-hand/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="By-Hand">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="By-Hand">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="By-Hand">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/by-hand/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "By-Hand"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_001_square_roots/">BasicsByHand: Square Roots</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
685
public/tags/by-hand/index.xml
Normal file
File diff suppressed because one or more lines are too long
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Category" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/category/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
113
public/tags/dvc/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Dvc | Avinash's Blog</title>
|
||||
<meta name="title" content="Dvc" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/dvc/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Dvc">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Dvc">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Dvc">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/dvc/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Dvc"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/dvc/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Dvc on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/dvc/</link>
|
||||
<description>Recent content in Dvc on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/dvc/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they need to be stored along with the images - or at least easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they need to be stored along with the images - or at least easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can store the first character of the hash, then the second character, and
|
||||
then the actual file. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system look ups take less time.</p>
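<p>A minimal sketch of this two-level layout, assuming file names are hashes of at least two characters. The helper names <code>partitioned_path</code> and <code>move_into_partitions</code> are hypothetical and only illustrate the idea:</p>

```python
from pathlib import Path


def partitioned_path(root: Path, file_name: str) -> Path:
    """Return root/<1st char>/<2nd char>/<file_name>."""
    if len(file_name) < 2:
        raise ValueError("file name must have at least two characters")
    return root / file_name[0] / file_name[1] / file_name


def move_into_partitions(root: Path) -> None:
    """Move every file directly under `root` into its two-level partition."""
    # Materialize the listing first, since we create directories while looping.
    for path in list(root.iterdir()):
        if path.is_file():
            target = partitioned_path(root, path.name)
            target.parent.mkdir(parents=True, exist_ok=True)
            path.rename(target)
```

<p>For example, <code>partitioned_path(Path("files"), "a1b2c3d4e5.txt")</code> gives <code>files/a/1/a1b2c3d4e5.txt</code>, matching the tree above.</p>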
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
|
||||
and handle the directory structure correctly. I would also need to maintain my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands will still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1000 objects per request that S3’s list API enforces (and that <code>boto3</code> surfaces). You’ll need to work around it with pagination, which, while standard,
|
||||
is still more work. You would have also realized that S3 takes quite some time to give you the
|
||||
list even after you’ve optimized as much as you can.</p>
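<p>For reference, the pagination workaround can be sketched roughly as below. It relies on <code>boto3</code>’s <code>list_objects_v2</code> paginator; the function name <code>iter_object_keys</code> is hypothetical, and the client is passed in so the sketch itself needs no AWS setup.</p>

```python
from typing import Any, Iterator


def iter_object_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, following pagination.

    `s3_client` is expected to behave like `boto3.client("s3")`.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

<p>With a real client this would be called as <code>iter_object_keys(boto3.client("s3"), bucket_name, "images/")</code>, where the bucket name and prefix are placeholders. Each page is still a separate request, so listing a large bucket remains slow.</p>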
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, an EC2), but that means I would need access to an
|
||||
EC2.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many images datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on Wikipedia provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - the ability to write, and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, then link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
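<p>The key scheme from the steps above could be sketched as follows. JSON is an assumption here (the serialization format isn’t specified above), and all helper names are hypothetical:</p>

```python
import json
from typing import Iterable


def make_keys(file_name: str) -> tuple[bytes, bytes]:
    """Return the (image key, metadata key) pair for one file, as bytes."""
    return f"{file_name}_image".encode(), f"{file_name}_metadata".encode()


def serialize_metadata(metadata: dict) -> bytes:
    """Serialize metadata to bytes (JSON is an illustrative choice)."""
    return json.dumps(metadata, sort_keys=True).encode("utf-8")


def deserialize_metadata(raw: bytes) -> dict:
    """Inverse of `serialize_metadata`."""
    return json.loads(raw.decode("utf-8"))


def image_names(keys: Iterable[bytes]) -> list[str]:
    """Recover the stored file names from the `_image` keys."""
    suffix = b"_image"
    return sorted(k[: -len(suffix)].decode() for k in keys if k.endswith(suffix))
```

<p>Because both keys derive from the file name, looking up an image’s metadata is just a change of suffix, with no separate index.</p>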
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - you can
|
||||
avoid overwriting the same key, ensuring that the database stays de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl">        <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl">    <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should give you a good starting point for implementing additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
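<p>A minimal sketch of why this happens, using the standard library's <code>zlib</code> as a stand-in for an image codec (an assumption for illustration; PIL itself is not used here). The same payload encoded with two different encoder settings decodes to identical data, yet the encoded bytes differ.</p>

```python
import zlib

# Stand-in for raw pixel data; any bytes would do.
pixels = bytes(range(256)) * 64

# Two lossless encodings of the same data with different encoder settings.
encoded_fast = zlib.compress(pixels, 1)
encoded_small = zlib.compress(pixels, 9)

# Decoding either one gives back exactly the same payload...
same_content = zlib.decompress(encoded_fast) == zlib.decompress(encoded_small)

# ...but the encoded bytes (what would end up in the DB) are not identical.
same_bytes = encoded_fast == encoded_small

print(f"same decoded content: {same_content}")
print(f"same encoded bytes:   {same_bytes}")
```

<p>Storing <code>image_path.read_bytes()</code> directly, as the class above does, sidesteps the issue because no codec is involved.</p>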
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and if the DB grows past this size, writes will fail. You can change this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
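<p>One way to pick a value is to measure the files you plan to store. The helper below is hypothetical: the name <code>estimate_max_size_mb</code>, the 1.5x headroom factor, and the assumption that one MB means 1024 * 1024 bytes are all illustrative choices, not part of LMDB. It sums the file sizes and pads the total to leave room for the DB's own overhead.</p>

```python
import math
import tempfile
from pathlib import Path


def estimate_max_size_mb(
    image_dir: Path, pattern: str = "*.jpg", headroom: float = 1.5
) -> int:
    """Sum the sizes of matching files and pad the total by `headroom`."""
    total_bytes = sum(p.stat().st_size for p in image_dir.glob(pattern))
    # Round up to whole megabytes so the bound is never too small,
    # and never return zero for an empty directory.
    return max(1, math.ceil(total_bytes * headroom / (1024 * 1024)))


if __name__ == "__main__":
    # Demo on a throwaway directory holding two fake 1 MiB "images".
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(2):
            (Path(tmp) / f"{i}.jpg").write_bytes(b"\0" * 1024 * 1024)
        print(estimate_max_size_mb(Path(tmp)))  # 2 MiB * 1.5 -> 3
```

<p>The result is only a starting estimate; if you expect the dataset to grow, a larger bound is the safer choice.</p>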
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - paired with an offset index, it can provide random access to data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
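<p>To illustrate the first option in the list above, here is a minimal sketch of an offset index over an uncompressed <code>tar</code> file, using only the standard library. The function names <code>build_offset_index</code> and <code>read_member</code> are hypothetical, and the approach assumes the archive is not compressed (seeking by offset does not work on a <code>.tar.gz</code>).</p>

```python
import io
import os
import tarfile
import tempfile


def build_offset_index(tar_path: str) -> dict[str, tuple[int, int]]:
    """Map each regular file in the archive to (data offset, size in bytes)."""
    with tarfile.open(tar_path, "r:") as archive:  # "r:" accepts uncompressed only
        return {
            member.name: (member.offset_data, member.size)
            for member in archive
            if member.isfile()
        }


def read_member(tar_path: str, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Fetch one file's bytes with a single seek, without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as raw:
        raw.seek(offset)
        return raw.read(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = os.path.join(tmp, "images.tar")
        # Write two small fake "images" into an uncompressed archive.
        with tarfile.open(tar_path, "w") as archive:
            for name, payload in [("a.jpg", b"first"), ("b.jpg", b"second image")]:
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                archive.addfile(info, io.BytesIO(payload))

        index = build_offset_index(tar_path)
        print(read_member(tar_path, index, "b.jpg"))  # b'second image'
```

<p>The index is built once with a full scan; after that, each lookup costs one seek and one read, which is what makes random access practical.</p>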
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Environment" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/environment/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -89,7 +94,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/environment/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +164,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
113
public/tags/euler/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Euler | Avinash's Blog</title>
|
||||
<meta name="title" content="Euler" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/euler/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Euler">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Euler">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Euler">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/euler/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Euler"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
384
public/tags/euler/index.xml
Normal file
File diff suppressed because one or more lines are too long
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Faiss" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/faiss/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering on 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Graph" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/graph/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that ANN algorithms take shortcuts to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
113
public/tags/history/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>History | Avinash's Blog</title>
|
||||
<meta name="title" content="History" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/history/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="History">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="History">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="History">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/history/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "History"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
384
public/tags/history/index.xml
Normal file
File diff suppressed because one or more lines are too long
113
public/tags/images/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Images | Avinash's Blog</title>
|
||||
<meta name="title" content="Images" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/images/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Images">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Images">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Images">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/images/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Images"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/images/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Images on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/images/</link>
|
||||
<description>Recent content in Images on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/images/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), which needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), which needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can store the first character of the hash, then the second character, and
|
||||
then the actual file. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system lookups take less time.</p>
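<p>As a rough illustration of the layout above, here is a minimal sketch that derives the two-level path from a content hash. The helper names (<code>partitioned_path</code>, <code>store</code>) and the choice of SHA-256 are assumptions made for this example; any stable hash would do.</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".txt") -> Path:
    """Return root/<1st hex char>/<2nd hex char>/<digest><suffix>."""
    # Assumption: the file is named after the SHA-256 of its contents.
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, data: bytes, suffix: str = ".txt") -> Path:
    """Write the bytes to their partitioned location and return the path."""
    target = partitioned_path(root, data, suffix)
    # Create the two partition directories on demand.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

<p>With hexadecimal digests, two levels give at most 16 x 16 = 256 leaf directories, so each one holds roughly 1/256th of the files.</p>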
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
|
||||
and handle the directory structure correctly. I will also need to maintain my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands will still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1,000 objects per request, which the S3 list API enforces and <code>boto3</code> passes through. You’ll need to work around it with pagination, which, while standard,
|
||||
is still more work. You would have also realized that even after all this, S3 will take quite some time to give you the
|
||||
list even after you’ve optimized as much as you can.</p>
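<p>For reference, the pagination workaround could look roughly like the sketch below. It relies on <code>boto3</code>'s <code>get_paginator("list_objects_v2")</code>; the function name <code>iter_keys</code> and the bucket name in the usage comment are hypothetical, and the client is passed in so that the function itself does not import <code>boto3</code>.</p>

```python
from typing import Any, Iterator


def iter_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, following S3's pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage (needs boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   keys = list(iter_keys(boto3.client("s3"), "my-image-bucket", "images/"))
```

<p>Each page still costs one round trip, which is why listing a large bucket stays slow even with pagination handled for you.</p>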
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, an EC2), but that means I needed access to an
|
||||
EC2.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased to using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - the ability to write, and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, and link them to the keys f"{file_name}_image" and f"{file_name}_metadata" respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
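<p>The key scheme in the steps above can be sketched as follows. This is not the original class: the helper names are made up, JSON is assumed as the metadata serialization, and a plain <code>dict</code> stands in for the LMDB transaction so that the snippet runs without the library. With LMDB, the writes and reads would go through <code>txn.put</code> and <code>txn.get</code> instead.</p>

```python
import json


def to_records(file_name: str, image_bytes: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs stored for one image."""
    return [
        (f"{file_name}_image".encode(), image_bytes),
        # Assumption: metadata is JSON-serializable.
        (f"{file_name}_metadata".encode(), json.dumps(metadata).encode()),
    ]


def image_names(keys) -> list[str]:
    """Recover the distinct image names from the stored keys."""
    suffix = b"_image"
    return sorted(k[: -len(suffix)].decode() for k in keys if k.endswith(suffix))


def fetch(store, file_name: str) -> tuple[bytes, dict]:
    """Look up an image and its metadata by swapping the key suffix."""
    image = store[f"{file_name}_image".encode()]
    metadata = json.loads(store[f"{file_name}_metadata".encode()])
    return image, metadata


# A dict stands in for the key-value store in this sketch.
store = dict(to_records("rose_001.jpg", b"\xff\xd8fake-jpeg", {"label": "rose"}))
```

<p>Because both entries share the file name as a prefix, no separate index is needed to pair an image with its metadata.</p>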
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - you can
|
||||
avoid overwriting an existing key and ensure that the database stays de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl">    <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should give you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t decode and re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
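<p>A minimal sketch of how the round trip could be checked, assuming the images are read from disk with <code>Path.read_bytes()</code> as in the listing above. The helper names <code>sha256_hex</code> and <code>is_byte_identical</code> are hypothetical and not part of the original code; comparing digests is just one convenient way to confirm that nothing re-encoded the data along the way.</p>

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_byte_identical(image_path: Path, stored: bytes) -> bool:
    """Check that the bytes retrieved from the DB match the file on disk.

    Hypothetical helper: `stored` is whatever came back from the DB
    for this image, e.g. one value of the dict from load_images().
    """
    return sha256_hex(image_path.read_bytes()) == sha256_hex(stored)
```

<p>If the image had been decoded and re-encoded before it was stored, a check like this would likely return <code>False</code> even when the picture looks the same.</p>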
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and writes will fail once the DB exceeds this size. You can edit this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
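<p>As a rough illustration of the arithmetic involved, here is a sketch that assumes <code>max_size_as_mb</code> is meant as mebibytes and is converted to the byte count that LMDB expects. The helper <code>mb_to_bytes</code> is hypothetical; the actual conversion inside <code>ImageDB</code> may differ.</p>

```python
def mb_to_bytes(size_mb: int) -> int:
    """Convert a size in mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return size_mb * 1024 * 1024


# Assumption: the 512 passed to ImageDB above is a size in MiB.
map_size = mb_to_bytes(512)

# With the lmdb package installed, this value would typically be handed
# over as lmdb.open(path, map_size=map_size). The limit can be raised
# later on an open environment with env.set_mapsize(new_size).
```

<p>Picking a generous upper bound is usually cheap on Linux and macOS, since the file grows only as data is written; the Windows caveat above is the main reason to be conservative.</p>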
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - you can build an offset index over it to get random access to the data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source, and purpose-built for large-scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow-backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
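<p>To make the <code>tar</code> option a little more concrete, here is a minimal sketch using only Python’s standard <code>tarfile</code> module. It assumes an uncompressed archive of regular files; the function names <code>build_offset_index</code> and <code>read_member</code> are hypothetical, and this is not how any of the libraries above implement it.</p>

```python
import tarfile
from pathlib import Path


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Map each regular file in an uncompressed tar to (data offset, size)."""
    index: dict[str, tuple[int, int]] = {}
    # Mode "r:" refuses compressed archives, where byte offsets are not seekable.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(
    tar_path: Path, index: dict[str, tuple[int, int]], name: str
) -> bytes:
    """Read one member by seeking straight to its data, skipping the rest."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

<p>The index has to be built once by scanning the archive, but it can be saved next to the <code>tar</code> file so that later reads go directly to the right offset.</p>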
|
||||
<p>Use these if you want to scale to production-level training.</p>
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Tags" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -35,8 +35,8 @@
|
||||
|
||||
|
||||
<meta itemprop="name" content="Tags">
|
||||
<meta itemprop="datePublished" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -80,6 +85,188 @@
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/by-hand/">By-Hand</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/euler/">Euler</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/history/">History</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/logarithms/">Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/mathematics/">Mathematics</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/square-roots/">Square-Roots</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/bottleneck/">Bottleneck</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/dvc/">Dvc</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/images/">Images</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/io/">Io</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/lmdb/">Lmdb</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/machine-learning/">Machine-Learning</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/python/">Python</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/storage/">Storage</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
@@ -132,19 +319,6 @@
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-01-14' pubdate>
|
||||
2026-01-14
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/tags/python/">Python</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
|
||||
@@ -7,8 +7,120 @@
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<lastBuildDate>Sun, 15 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>By-Hand</title>
|
||||
<link>https://avimallu.dev/tags/by-hand/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/by-hand/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Euler</title>
|
||||
<link>https://avimallu.dev/tags/euler/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/euler/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>History</title>
|
||||
<link>https://avimallu.dev/tags/history/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/history/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Logarithms</title>
|
||||
<link>https://avimallu.dev/tags/logarithms/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/logarithms/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Mathematics</title>
|
||||
<link>https://avimallu.dev/tags/mathematics/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/mathematics/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Square-Roots</title>
|
||||
<link>https://avimallu.dev/tags/square-roots/</link>
|
||||
<pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/square-roots/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Bottleneck</title>
|
||||
<link>https://avimallu.dev/tags/bottleneck/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/bottleneck/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Dvc</title>
|
||||
<link>https://avimallu.dev/tags/dvc/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/dvc/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Images</title>
|
||||
<link>https://avimallu.dev/tags/images/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/images/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Io</title>
|
||||
<link>https://avimallu.dev/tags/io/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/io/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Lmdb</title>
|
||||
<link>https://avimallu.dev/tags/lmdb/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/lmdb/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Machine-Learning</title>
|
||||
<link>https://avimallu.dev/tags/machine-learning/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/machine-learning/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Python</title>
|
||||
<link>https://avimallu.dev/tags/python/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/python/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Storage</title>
|
||||
<link>https://avimallu.dev/tags/storage/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/storage/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Environment</title>
|
||||
<link>https://avimallu.dev/tags/environment/</link>
|
||||
@@ -41,14 +153,6 @@
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Python</title>
|
||||
<link>https://avimallu.dev/tags/python/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/tags/python/</guid>
|
||||
<description></description>
|
||||
<content:encoded><![CDATA[]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Variables</title>
|
||||
<link>https://avimallu.dev/tags/variables/</link>
|
||||
|
||||
113
public/tags/io/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Io | Avinash's Blog</title>
|
||||
<meta name="title" content="Io" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/io/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Io">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Io">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Io">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/io/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Io"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/io/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Io on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/io/</link>
|
||||
<description>Recent content in Io on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/io/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can nest each file under a directory named after the first character of the
|
||||
hash, then a subdirectory named after the second character. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system lookups take less time.</p>
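<p>As a rough illustration of the layout above, here is a minimal sketch that derives a two-level partitioned path from a file's contents. The function name <code>partitioned_path</code>, the choice of SHA-256, and the <code>.jpg</code> suffix are assumptions made for this example; they are not how <code>git</code> or <code>dvc</code> name their objects.</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Return root/<1st hash char>/<2nd hash char>/<full hash><suffix>.

    Hypothetical helper: SHA-256 and the suffix are illustrative choices.
    """
    digest = hashlib.sha256(data).hexdigest()
    # The first two hex characters pick the nested directories, so each
    # first-level directory holds at most 16 subdirectories.
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


if __name__ == "__main__":
    target = partitioned_path(Path("files"), b"example image bytes")
    print(target)  # files/<c1>/<c2>/<64 hex chars>.jpg
```

<p>Writing a file would then be a matter of creating <code>target.parent</code> with <code>mkdir(parents=True, exist_ok=True)</code> before saving the bytes.</p>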
|
||||
<p>This isn’t a perfect solution - it would require that I store the images under their hash,
|
||||
and handle the directory structure correctly. I would also need my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC would still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve ever listed a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1000 objects per request that the S3 list API (and therefore <code>boto3</code>) returns. You’ll need to work around it with pagination, which while standard,
|
||||
is still more work. You would have also realized that S3 takes quite some time to give you the
|
||||
list even after you’ve optimized as much as you can.</p>
|
||||
<p>This approach also means I need an active internet connection to access any data. It introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as the machine I train on (say, an EC2 instance), but that means I need access to an
|
||||
EC2 instance.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased toward using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
for maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the Wikipedia writeup provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - chiefly the ability to write and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, then link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
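<p>The key scheme in the steps above can be sketched roughly as follows. This is an illustration rather than the class itself: the helper names <code>make_records</code> and <code>image_names</code> are made up for this example, and JSON is an assumed serialization format for the metadata. The resulting pairs are plain bytes, which is the shape a key-value store such as LMDB expects.</p>

```python
import json


def make_records(
    file_name: str, image_bytes: bytes, metadata: dict
) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs stored for a single image.

    Hypothetical helper; JSON is an assumed metadata format.
    """
    return [
        (f"{file_name}_image".encode(), image_bytes),
        (f"{file_name}_metadata".encode(), json.dumps(metadata).encode()),
    ]


def image_names(keys: list[bytes]) -> list[str]:
    """Recover the stored file names by looking only at the image keys."""
    suffix = "_image"
    decoded = [key.decode() for key in keys]
    return [key[: -len(suffix)] for key in decoded if key.endswith(suffix)]


if __name__ == "__main__":
    records = make_records("rose_001.jpg", b"\xff\xd8\xff", {"label": "rose"})
    print([key for key, _ in records])
    print(image_names([key for key, _ in records]))
```

<p>Because the image and its metadata differ only in the key suffix, no separate index is needed to link them.</p>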
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slowdowns due to the sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I need to do is add them to the DB. LMDB has a few options available - for example, you can
|
||||
refuse to overwrite an existing key, which keeps the database de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
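<p>The main block above flushes a batch every 1000 images and then handles the leftover batch separately. One way to fold both cases into a single loop is a small batching helper. The sketch below is only an illustration: <code>in_batches</code> and the generated file names are hypothetical and not part of the original class.</p>

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield lists of up to `batch_size` items; the final list may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `batch_size` items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical file names standing in for Path("./data/jpg/").glob("*.jpg")
names = [f"image_{i:05d}.jpg" for i in range(2500)]
batch_lengths = [len(batch) for batch in in_batches(names, 1000)]
print(batch_lengths)  # [1000, 1000, 500]
```

<p>Each yielded batch could then be turned into the <code>name_image</code> dictionary that <code>save_images</code> expects, with no special case for the final, shorter batch.</p>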
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
the images that I retrieved. This wasn’t LMDB’s fault; it happened because I was
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
decodes and encodes the image, so a roundtrip will not necessarily be identical,
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
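<p>If byte-identity matters, one cheap safeguard is to record a digest of the raw file bytes at ingest time and compare it on retrieval. This is a minimal sketch using only the standard library. The payloads are made-up stand-ins rather than real images, and <code>digest</code> is a hypothetical helper.</p>

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in payloads: `original` plays the role of Path.read_bytes() output,
# `re_encoded` plays the role of an image that was decoded and encoded again.
original = b"\xff\xd8\xff\xe0 pretend JPEG payload"
stored_untouched = bytes(original)
re_encoded = original.replace(b"pretend", b"PRETEND")

expected = digest(original)  # record this at ingest time
print(digest(stored_untouched) == expected)  # True
print(digest(re_encoded) == expected)        # False
```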
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and any write that would grow the DB past this size will fail. You can edit this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
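<p>One way to pick a value for <code>max_size_as_mb</code> is to estimate it from the files you plan to store. The sketch below sums the file sizes and adds headroom. The helper name and the 1.5x headroom factor are assumptions chosen for illustration, not a recommendation from LMDB itself.</p>

```python
import math
import tempfile
from pathlib import Path


def estimate_max_size_as_mb(image_dir: Path, pattern: str = "*.jpg", headroom: float = 1.5) -> int:
    """Sum the matching file sizes, apply headroom, and round up to whole MiB."""
    total_bytes = sum(path.stat().st_size for path in image_dir.glob(pattern))
    return max(1, math.ceil(total_bytes * headroom / 2**20))


# Demo on three throwaway 1 MiB files (not real images).
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp)
    for i in range(3):
        (image_dir / f"{i}.jpg").write_bytes(b"\x00" * 2**20)
    estimate = estimate_max_size_as_mb(image_dir)

print(estimate)  # 5
```

<p>The headroom leaves space for the database's own bookkeeping and for files that arrive later. How much is appropriate depends on your data, so treat the factor as a knob to tune.</p>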
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and dates from before more purpose-built
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - it can provide an offset index to provide random access to data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
source and purpose-built for large-scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
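<p>To illustrate the first alternative, the sketch below builds an offset index for an uncompressed <code>tar</code> archive with the standard library and then reads one member by seeking straight to its data. The helper names, archive name and payloads are hypothetical, and the approach assumes the archive is not compressed.</p>

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Map each regular member's name to (offset of its data, size in bytes)."""
    with tarfile.open(tar_path, "r:") as archive:  # "r:" accepts uncompressed archives only
        return {m.name: (m.offset_data, m.size) for m in archive if m.isfile()}


def read_member(tar_path: Path, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Read one member's bytes directly, without scanning the archive."""
    offset, size = index[name]
    with tar_path.open("rb") as handle:
        handle.seek(offset)
        return handle.read(size)


# Made-up payloads standing in for image files.
payloads = {"a.jpg": b"first image", "b.jpg": b"second image", "c.jpg": b"third"}

with tempfile.TemporaryDirectory() as tmp:
    tar_path = Path(tmp) / "images.tar"
    with tarfile.open(tar_path, "w") as archive:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))

    index = build_offset_index(tar_path)
    recovered = read_member(tar_path, index, "b.jpg")

print(recovered == payloads["b.jpg"])  # True
```

<p>The index itself is small, so it could be serialized next to the archive and rebuilt whenever the archive changes.</p>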
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
113
public/tags/lmdb/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Lmdb | Avinash's Blog</title>
|
||||
<meta name="title" content="Lmdb" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/lmdb/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Lmdb">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Lmdb">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Lmdb">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/lmdb/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Lmdb"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/lmdb/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Lmdb on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/lmdb/</link>
|
||||
<description>Recent content in Lmdb on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/lmdb/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
name is a hash, so you can create a directory for the first character of the hash, a subdirectory for the second character, and
then store the actual file inside it. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system look ups take less time.</p>
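<p>The mapping from the flat layout to the nested one is mechanical. A minimal sketch, assuming file names that start with at least two hash characters, follows. <code>partitioned_path</code> is a hypothetical helper that only computes the path; it does not create directories or move files.</p>

```python
from pathlib import Path


def partitioned_path(root: Path, file_name: str, depth: int = 2) -> Path:
    """Nest `file_name` under one directory per leading character, `depth` levels deep."""
    if len(file_name) < depth:
        raise ValueError(f"{file_name!r} is shorter than the partition depth {depth}")
    # Unpacking the prefix string yields one path component per character.
    return root.joinpath(*file_name[:depth], file_name)


for name in ["a1b2c3d4e5.txt", "cf1a2b3c4d.txt"]:
    print(partitioned_path(Path("files"), name).as_posix())
# files/a/1/a1b2c3d4e5.txt
# files/c/f/cf1a2b3c4d.txt
```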
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
and handle the directory structure correctly. I would also need to build my own
mechanism to maintain the link between the hash and the metadata, which means creating
some sort of index. Lastly, DVC would still track files individually, which means that
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
of 1000 objects per listing request (an S3 API limit that surfaces through <code>boto3</code>). You’ll need to work around it with pagination, which while standard,
is still more work. You would have also realized that S3 takes quite some time to give you the
list even after you’ve optimized as much as you can.</p>
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
images in the same AWS region as I would be running the training in (say, on an EC2 instance), but that means I would need access to an
EC2 instance.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
for maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
small (64 kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - the ability to write and read a particular key (stored as a bytestring)
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
<li>Read the image files as bytes, serialize the metadata, and link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> data file.</li>
<li>Provide a method to read the keys to identify all the images present in the database.</li>
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
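<p>The key scheme in the steps above can be sketched without LMDB at all. In the example below, a plain <code>dict</code> stands in for the database, JSON is an assumed serialization format for the metadata, and the helper names are hypothetical. The actual class is not shown in this post, so treat this as an illustration of the idea rather than its implementation.</p>

```python
import json

IMAGE_SUFFIX = "_image"
METADATA_SUFFIX = "_metadata"


def to_records(file_name: str, image_bytes: bytes, metadata: dict) -> dict[bytes, bytes]:
    """Build the two key-value pairs for one image: raw bytes and serialized metadata."""
    return {
        f"{file_name}{IMAGE_SUFFIX}".encode(): image_bytes,
        f"{file_name}{METADATA_SUFFIX}".encode(): json.dumps(metadata).encode(),
    }


def stored_image_names(keys) -> list[str]:
    """Recover the file names from the keys that carry the image suffix."""
    decoded = (key.decode() for key in keys)
    return sorted(k.removesuffix(IMAGE_SUFFIX) for k in decoded if k.endswith(IMAGE_SUFFIX))


def load(store: dict[bytes, bytes], file_name: str) -> tuple[bytes, dict]:
    """Fetch an image and its metadata by swapping the key suffix."""
    image = store[f"{file_name}{IMAGE_SUFFIX}".encode()]
    metadata = json.loads(store[f"{file_name}{METADATA_SUFFIX}".encode()])
    return image, metadata


store: dict[bytes, bytes] = {}  # stand-in for the LMDB database
store.update(to_records("rose.jpg", b"fake image bytes", {"label": "rose", "bbox": [0, 0, 10, 10]}))
store.update(to_records("tulip.jpg", b"more fake bytes", {"label": "tulip"}))

names = stored_image_names(store)
image, metadata = load(store, "rose.jpg")
print(names)              # ['rose.jpg', 'tulip.jpg']
print(metadata["label"])  # rose
```

<p>With LMDB, the dictionary writes and lookups would become <code>put</code> and <code>get</code> calls inside a transaction, as in the sample code in the next section.</p>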
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - for example, you can
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to aovid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
the images that I retrieved. This wasn’t LMDB’s fault: I was
reading the images from disk via PIL and storing the result as bytes in LMDB. PIL
decodes and re-encodes the image, so a roundtrip will not necessarily be byte-identical,
even for lossless file formats (other than bitmap images).</p>
<p>Don’t decode and re-encode the image before you store it, or be prepared for the
stored data to not be byte-identical.</p>
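<p>To see why a decode and re-encode roundtrip can change the bytes of a lossless format, here is a minimal sketch that uses <code>zlib</code> from the standard library as a stand-in for an image codec (PNG compresses its pixel data with the same DEFLATE scheme). The payload and the compression levels are assumptions chosen for illustration; PIL is not involved.</p>

```python
import zlib

# Stand-in for raw pixel data; any compressible payload works.
pixels = bytes(range(256)) * 64

# "Original file": encoded once with a fast compression level.
original_file = zlib.compress(pixels, 1)

# "Roundtrip": decode, then re-encode with a different level, similar to
# what happens when a library re-saves an image with its own defaults.
decoded = zlib.decompress(original_file)
reencoded_file = zlib.compress(decoded, 9)

# The decoded content is unchanged, because the codec is lossless...
content_matches = zlib.decompress(reencoded_file) == pixels
# ...but the encoded bytes are compared separately.
bytes_match = reencoded_file == original_file

print(f"content identical: {content_matches}, bytes identical: {bytes_match}")
```

<p>Reading the file with <code>Path.read_bytes()</code>, as the class above does, sidesteps the issue because the encoded bytes are never touched.</p>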
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify an upper bound on the DB size
upon creation, and writes fail once the DB would grow past it. You can change this
bound later, with some caveats (on Windows, this will actually allocate the full
size).</p>
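<p>As a rough sketch of how a size in MB might be turned into the byte count that <code>lmdb.open</code> takes as <code>map_size</code>: the helper names below are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the constructor of the class above may convert differently.</p>

```python
from pathlib import Path

BYTES_PER_MB = 1024 * 1024  # assumption: "MB" means mebibytes here


def map_size_bytes(max_size_as_mb: int) -> int:
    """Convert a size in MB to the byte count expected by LMDB's map_size."""
    if max_size_as_mb <= 0:
        raise ValueError("max_size_as_mb must be positive")
    return max_size_as_mb * BYTES_PER_MB


def open_environment(path: Path, max_size_as_mb: int):
    """Hypothetical helper: open an LMDB environment with a size bound."""
    # Third-party dependency (py-lmdb), imported lazily so that
    # map_size_bytes stays usable without it.
    import lmdb

    return lmdb.open(str(path), map_size=map_size_bytes(max_size_as_mb))


print(f"{map_size_bytes(512):,} bytes")
```

<p>Picking a generous bound up front is usually simpler than resizing later, given the caveats mentioned above.</p>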
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, comes with some caveats around concurrency. See
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution that dates from before more purpose-built
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
do: you can build an offset index over it to get random access to the data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
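<p>For the first alternative in the list above, an offset index over a <code>tar</code> file could look roughly like this. It is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes an uncompressed archive, since byte offsets are meaningless inside a compressed stream.</p>

```python
import tarfile
from pathlib import Path


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Map each regular file in the archive to (data offset, size in bytes)."""
    index: dict[str, tuple[int, int]] = {}
    # "r:" opens the archive for reading and refuses compressed input.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path: Path, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Read one file's bytes by seeking straight to its data in the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as archive:
        archive.seek(offset)
        return archive.read(size)
```

<p>The index is built with a single sequential pass and can be saved alongside the archive, so later reads only need one seek per image.</p>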
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
113
public/tags/logarithms/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Logarithms | Avinash's Blog</title>
|
||||
<meta name="title" content="Logarithms" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/logarithms/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Logarithms">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Logarithms">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Logarithms">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/logarithms/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Logarithms"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
384
public/tags/logarithms/index.xml
Normal file
File diff suppressed because one or more lines are too long
113
public/tags/machine-learning/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Machine-Learning | Avinash's Blog</title>
|
||||
<meta name="title" content="Machine-Learning" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/machine-learning/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Machine-Learning">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Machine-Learning">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Machine-Learning">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/machine-learning/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Machine-Learning"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/machine-learning/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Machine-Learning on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/machine-learning/</link>
|
||||
<description>Recent content in Machine-Learning on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/machine-learning/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images, or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images, or at least be easily linkable to them.</p>
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
<h1 id="potential-solutions">Potential Solutions</h1>
<h2 id="partitioning">Partitioning</h2>
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
name is a hash, so you can use the first character of the hash as a directory, the second character
as a subdirectory, and then store the actual file inside it. So, for example, the directory changes from:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│   └── 1/
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│       ├── a1b2c3d4e5.txt
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│       └── a1f5e6d7c8.txt
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│   └── 7/
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│       ├── b7f8a9c0d1.txt
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│       └── b7e4d2f1a0.txt
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
</span></span><span class="line"><span class="ln">11</span><span class="cl">    ├── 0/
</span></span><span class="line"><span class="ln">12</span><span class="cl">    │   └── c0d1e2f3a4.txt
</span></span><span class="line"><span class="ln">13</span><span class="cl">    └── f/
</span></span><span class="line"><span class="ln">14</span><span class="cl">        └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects in a similar way, under directories named after the leading characters of the hash.
This limits directory size, so file system lookups take less time.</p>
<p>This isn’t a perfect solution - it would require that I store the images under their hash,
and handle the directory structure correctly. I would also need to maintain my own
mechanism to link each hash to its metadata, which means creating
some sort of index. Lastly, DVC would still track files individually, which means that
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
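<p>To make the partitioning scheme concrete, here is a minimal sketch of how a file could be named after its content hash and placed in the nested layout shown above. The function names, the choice of SHA-256 and the default depth of two are assumptions for illustration; they are not taken from <code>git</code>, <code>dvc</code> or my own code.</p>

```python
import hashlib
from pathlib import Path


def hashed_name(data: bytes, suffix: str) -> str:
    """Name a file after the SHA-256 hex digest of its contents (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest() + suffix


def partitioned_path(root: Path, file_name: str, depth: int = 2) -> Path:
    """Nest the file under one directory per leading character of its name."""
    stem = Path(file_name).stem
    if len(stem) < depth:
        raise ValueError(f"{file_name!r} is too short to partition {depth} levels deep")
    # "a1b2c3d4e5.txt" with depth=2 is placed under root/a/1/
    return root.joinpath(*stem[:depth], file_name)


if __name__ == "__main__":
    print(partitioned_path(Path("files"), "a1b2c3d4e5.txt"))
    print(partitioned_path(Path("files"), hashed_name(b"example bytes", ".jpg")))
```

<p>Note that this only decides <em>where</em> a file goes. The index linking each hash to its metadata would still have to be built separately.</p>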
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
while maintaining random access - for example, S3. My task would then reduce to correctly maintaining the index, so
that I know which images are stored on S3 and can link their metadata via their hash.</p>
<p>If you’ve ever retrieved a list of a large number of files stored on S3, you’ll have first encountered the limit
of 1000 objects per request (the S3 listing API caps each response, and <code>boto3</code> passes that cap through). You’ll need to work around it with pagination, which, while standard,
is still more work. You’ll also have realized that, even after all this, S3 takes quite some time to give you the
list, no matter how much you’ve optimized.</p>
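<p>For reference, the pagination workaround looks roughly like the sketch below. It relies on the <code>list_objects_v2</code> paginator that <code>boto3</code> S3 clients provide; the function name <code>iter_s3_keys</code> is hypothetical, and the client is passed in as an argument so that the snippet itself doesn’t need credentials.</p>

```python
from typing import Any, Iterator


def iter_s3_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, one listing page at a time."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

<p>With real credentials this would be called as <code>iter_s3_keys(boto3.client("s3"), "my-image-bucket", "images/")</code>, where the bucket name and prefix are placeholders. Each page is still a separate network round trip, which is why listing a large bucket stays slow.</p>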
<p>Storing on S3 also means I need an active internet connection to access any data. It introduces latency during training,
and wastes bandwidth when running multiple training sessions for the models. I can minimize both if I store the
images in the same AWS region that I run the training in (say, on an EC2 instance), but that means I need access to an
EC2 instance.</p>
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production-quality
code for an even more nascent pipeline.</p>
<h2 id="what-about-a-database">What about a… database?</h2>
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
large number of binary blobs.</p>
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
for maintaining the index.</p>
<blockquote>
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
</blockquote>
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
to store their metadata. I wanted to access said data quickly. It became clear to me that I was looking for a fast key-value
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
<h2 id="lmdb">LMDB</h2>
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
small (64kB) piece of software that does one thing really well - being a ridiculously fast key-value store. I won’t pretend to
understand how it works; the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
solve my problem.</p>
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
<p>LMDB is rather barebones. It exposes few features - essentially the ability to write and read a particular key (stored as a bytestring),
which itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
<ol>
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
<li>Read the image files as bytes and serialize the metadata, and link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
<li>Store these as key-value pairs into the LMDB database, which keeps all of its data in a <strong>single</strong> file.</li>
<li>Provide a method to read the keys to identify all the images present in the database.</li>
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
</ol>
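<p>The key scheme in step 2 can be sketched as below. This is an illustration rather than the class I actually wrote: the helper names are hypothetical, and JSON is assumed as the metadata serialization format purely because it ships with Python.</p>

```python
import json


def build_records(file_name: str, image_bytes: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Turn one image and its metadata into two (key, value) bytestring pairs."""
    return [
        (f"{file_name}_image".encode(), image_bytes),
        # JSON is an assumption here; any bytes-producing serializer would do.
        (f"{file_name}_metadata".encode(), json.dumps(metadata, sort_keys=True).encode()),
    ]


def parse_metadata(raw: bytes) -> dict:
    """Inverse of the metadata serialization used in build_records."""
    return json.loads(raw.decode())


if __name__ == "__main__":
    records = build_records("flower_0001.jpg", b"<raw image bytes>", {"label": "daisy"})
    for key, value in records:
        print(key, len(value))
```

<p>Because both keys share the file name as a prefix, going from an image to its metadata is just a change of suffix. A list of pairs in this shape is also what the <code>putmulti</code> method of an LMDB cursor accepts.</p>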
<p>This has many advantages:</p>
<ol>
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slowdowns due to the sheer number of files - either for DVC, or my computer.</li>
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
<li>Local access, practically zero latency.</li>
</ol>
<p>Which solves… all of the problems I had! When new files came in, all I needed to do was add them to the DB. LMDB has a few options available - for example, you can
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
<h1 id="the-code">The code</h1>
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
dataset, which has around 8000 images.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
</span></span><span class="line"><span class="ln"> 11</span><span class="cl">    <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 12</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 13</span><span class="cl">        <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 14</span><span class="cl">        <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 15</span><span class="cl">    <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 16</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 17</span><span class="cl">            <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
</span></span><span class="line"><span class="ln"> 19</span><span class="cl">    <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 20</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 21</span><span class="cl">            <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
</span></span><span class="line"><span class="ln"> 22</span><span class="cl">                <span class="k">return</span> <span class="n">image_as_bytes</span>
</span></span><span class="line"><span class="ln"> 23</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 24</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
</span></span><span class="line"><span class="ln"> 26</span><span class="cl">    <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 27</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 28</span><span class="cl">        <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 29</span><span class="cl">    <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 30</span><span class="cl">        <span class="c1"># Note: you might need to enforce a batch size here</span>
</span></span><span class="line"><span class="ln"> 31</span><span class="cl">        <span class="c1"># to avoid running out of memory because this loads</span>
</span></span><span class="line"><span class="ln"> 32</span><span class="cl">        <span class="c1"># all images sent to this function as bytes.</span>
</span></span><span class="line"><span class="ln"> 33</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 34</span><span class="cl">            <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln"> 35</span><span class="cl">                <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
</span></span><span class="line"><span class="ln"> 36</span><span class="cl">                <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 37</span><span class="cl">            <span class="p">]</span>
</span></span><span class="line"><span class="ln"> 38</span><span class="cl">            <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 39</span><span class="cl">            <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 40</span><span class="cl">                <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
</span></span><span class="line"><span class="ln"> 41</span><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="ln"> 42</span><span class="cl">            <span class="nb">print</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 43</span><span class="cl">                <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
</span></span><span class="line"><span class="ln"> 44</span><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
</span></span><span class="line"><span class="ln"> 46</span><span class="cl">    <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 47</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 48</span><span class="cl">        <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 49</span><span class="cl">    <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln"> 50</span><span class="cl">        <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 51</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 52</span><span class="cl">            <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 53</span><span class="cl">            <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 54</span><span class="cl">                <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
</span></span><span class="line"><span class="ln"> 55</span><span class="cl">                <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 56</span><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
</span></span><span class="line"><span class="ln"> 58</span><span class="cl">    <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 59</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 60</span><span class="cl">            <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
</span></span><span class="line"><span class="ln"> 61</span><span class="cl">                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 62</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 63</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
</span></span><span class="line"><span class="ln"> 65</span><span class="cl">    <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln"> 66</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 67</span><span class="cl">            <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 71</span><span class="cl">    <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 72</span><span class="cl">    <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
</span></span><span class="line"><span class="ln"> 74</span><span class="cl">    <span class="c1"># Save the results</span>
</span></span><span class="line"><span class="ln"> 75</span><span class="cl">    <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 76</span><span class="cl">        <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
</span></span><span class="line"><span class="ln"> 77</span><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 78</span><span class="cl">            <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 79</span><span class="cl">            <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 80</span><span class="cl">    <span class="c1"># Add last batch also</span>
</span></span><span class="line"><span class="ln"> 81</span><span class="cl">    <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
</span></span><span class="line"><span class="ln"> 83</span><span class="cl">    <span class="k">del</span> <span class="n">name_image</span>
</span></span><span class="line"><span class="ln"> 84</span><span class="cl">    <span class="c1"># How many images have been stored?</span>
</span></span><span class="line"><span class="ln"> 85</span><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
</span></span><span class="line"><span class="ln"> 87</span><span class="cl">    <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 88</span><span class="cl">    <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
</span></span><span class="line"><span class="ln"> 89</span><span class="cl">    <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 90</span><span class="cl">        <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 91</span><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 92</span><span class="cl">            <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
</span></span><span class="line"><span class="ln"> 93</span><span class="cl">            <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
</span></span><span class="line"><span class="ln"> 94</span><span class="cl">            <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 95</span><span class="cl">    <span class="c1"># Verify last batch also</span>
</span></span><span class="line"><span class="ln"> 96</span><span class="cl">    <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it was because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t decode and re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and if it exceeds this size, it will fail. You can edit this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some caveats around concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - an offset index built alongside it can provide random access to data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
126
public/tags/mathematics/index.html
Normal file
@@ -0,0 +1,126 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Mathematics | Avinash's Blog</title>
|
||||
<meta name="title" content="Mathematics" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/mathematics/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Mathematics">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Mathematics">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Mathematics">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/mathematics/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Mathematics"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_002_logarithms/">BasicsByHand: Logarithms</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_001_square_roots/">BasicsByHand: Square Roots</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
685
public/tags/mathematics/index.xml
Normal file
File diff suppressed because one or more lines are too long
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Multiprocessing" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/multiprocessing/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -89,7 +94,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/multiprocessing/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +164,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Nearest" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/nearest/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Neighbor" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/neighbor/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Network" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/network/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Networkx" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/networkx/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it would suffice to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Numpy" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/numpy/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -89,7 +94,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/numpy/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +164,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Parallel" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/parallel/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -89,7 +94,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/parallel/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +164,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Polars" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/polars/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium-scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that ANN algorithms take shortcuts to give you, if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Powerpoint" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/powerpoint/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -50,7 +50,7 @@
|
||||
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
|
||||
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it’s capable of, but here’s another summary just in case:</p>
|
||||
<ol>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just $x$ and $y$ here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span></span></span> here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that’s usually difficult if you’ve already configured the charts a little - which can be remedied with this option!</li>
|
||||
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you’ve selected with the way it originally is in the “set” chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
|
||||
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Ppt" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/ppt/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -50,7 +50,7 @@
|
||||
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
|
||||
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it’s capable of, but here’s another summary just in case:</p>
|
||||
<ol>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just $x$ and $y$ here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span></span></span> here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that’s usually difficult if you’ve already configured the charts a little - which can be remedied with this option!</li>
|
||||
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you’ve selected with the way it originally is in the “set” chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
|
||||
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Python" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -35,8 +35,8 @@
|
||||
|
||||
|
||||
<meta itemprop="name" content="Python">
|
||||
<meta itemprop="datePublished" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-01-14T00:00:00+00:00">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/python/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -80,6 +85,19 @@
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
@@ -89,7 +107,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -7,10 +7,273 @@
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/python/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can create a directory for the first character of the hash, a subdirectory for the second character, and
|
||||
then place the actual file inside it. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits directory size, so file system lookups take less time.</p>
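<p>The layout above can be sketched in a few lines. This is a minimal illustration, not what <code>git</code> or <code>dvc</code> actually run: the helper names <code>partitioned_path</code> and <code>store</code>, the choice of SHA-256, and the <code>.txt</code> suffix are all assumptions made for the example.</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, content: bytes, suffix: str = ".txt") -> Path:
    """Return root/<1st hash char>/<2nd hash char>/<hash><suffix>.

    Hypothetical helper: SHA-256 is an assumption, any stable hash works.
    """
    digest = hashlib.sha256(content).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, content: bytes, suffix: str = ".txt") -> Path:
    """Write content under its partitioned path, creating directories as needed."""
    target = partitioned_path(root, content, suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

<p>With hexadecimal digests, each directory level has at most 16 entries, so two levels spread the files over up to 256 leaf directories.</p>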
|
||||
<p>This isn’t a perfect solution - it requires that I store the images under their hash,
|
||||
and handle the directory structure correctly. I would also need my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands will still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1000 objects per request that the S3 list API (and hence <code>boto3</code>) enforces. You’ll need to work around it with pagination, which, while standard,
|
||||
is still more work. You would have also realized that even after all this, S3 will take quite some time to give you the
|
||||
list even after you’ve optimized as much as you can.</p>
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, an EC2), but that means I needed access to an
|
||||
EC2.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested using a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works, the writeup on the Wiki provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - the ability to write, and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, and link them to the keys <code>f"{file_name}_image"</code> and <code>f"{file_name}_metadata"</code> respectively.</li>
|
||||
<li>Store these as key-value pairs in the LMDB database, which keeps all of its data in a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
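<p>The key layout from step 2 and the key listing from step 4 can be sketched without touching LMDB at all. This is a rough illustration under stated assumptions: a plain <code>dict</code> stands in for the database, JSON is an assumed serialization format for the metadata, and the names <code>make_records</code>, <code>image_names</code> and <code>rose_001</code> are hypothetical. With LMDB, each pair would be written through <code>txn.put(key, value)</code> instead of a dictionary update.</p>

```python
import json


def make_records(file_name: str, image_bytes: bytes, metadata: dict) -> dict[bytes, bytes]:
    """Build the two key-value pairs for one image (keys and values are bytes)."""
    return {
        f"{file_name}_image".encode(): image_bytes,
        # JSON is an assumption; any bytes-producing serializer would do.
        f"{file_name}_metadata".encode(): json.dumps(metadata).encode(),
    }


def image_names(keys) -> list[str]:
    """Recover the stored image names by filtering keys on the _image suffix."""
    suffix = "_image"
    decoded = (key.decode() for key in keys)
    return sorted(name[: -len(suffix)] for name in decoded if name.endswith(suffix))


# A plain dict stands in for the LMDB database in this sketch.
fake_db: dict[bytes, bytes] = {}
fake_db.update(make_records("rose_001", b"<image bytes>", {"label": "rose"}))
```

<p>Because both keys share the file name as a prefix, switching from an image to its metadata is only a change of suffix, which is why no separate index is needed.</p>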
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - for example, you can
|
||||
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you with a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
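<p>The batching done by hand in the <code>__main__</code> block above can also be factored into a small helper. The following is only a sketch: <code>batched</code> is a hypothetical name that is not part of the original code, and the batch size is whatever fits in your memory.</p>

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most batch_size items per round; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

<p>Each batch of paths could then be turned into a <code>dict</code> and passed to <code>save_images</code>, so that only one batch of image bytes is held in memory at a time. On Python 3.12 and later, <code>itertools.batched</code> offers similar behaviour out of the box.</p>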
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it happened because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
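<p>To see why a lossless roundtrip can still change the bytes, here is an analogy using only the standard library rather than PIL itself. PNG, for instance, stores its pixel data with DEFLATE, and the same data compressed with different settings yields different bytes that decompress to identical content. The <code>pixels</code> buffer below is a made-up stand-in for raw image data.</p>

```python
import zlib

# Stand-in for raw pixel data (an assumption for illustration, not a real image).
pixels = bytes(range(256)) * 64

# Two lossless encodings of the same data with different compression settings.
fast = zlib.compress(pixels, level=1)
small = zlib.compress(pixels, level=9)

# Both decode to exactly the same content...
same_pixels = zlib.decompress(fast) == zlib.decompress(small) == pixels
# ...but the encoded bytes differ, so a byte-for-byte comparison fails.
same_bytes = fast == small
```

<p>An image library that decodes a file and encodes it again is free to pick its own settings, which is why <code>read_bytes()</code> on the original file is the safer choice when byte-identical storage matters.</p>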
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and writes that would push the DB past this size will fail. You can raise this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
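<p>As a rough sketch of how a size given in megabytes might map onto LMDB's <code>map_size</code> (which is expressed in bytes), and how the limit could be raised later: the helper names <code>mb_to_bytes</code>, <code>open_env</code> and <code>grow_env</code> are hypothetical, and the conversion is an assumption about what the <code>ImageDB</code> constructor does, not a copy of it.</p>

```python
from pathlib import Path


def mb_to_bytes(size_mb: int) -> int:
    """Convert megabytes to bytes (1 MB taken as 1024 * 1024 bytes here)."""
    return size_mb * 1024 * 1024


def open_env(db_path: Path, max_size_as_mb: int):
    """Open an LMDB environment with an upper bound on its size."""
    import lmdb  # third-party package: pip install lmdb

    return lmdb.open(str(db_path), map_size=mb_to_bytes(max_size_as_mb))


def grow_env(env, extra_mb: int) -> int:
    """Raise the size limit of an already open environment by `extra_mb`."""
    new_size = env.info()["map_size"] + mb_to_bytes(extra_mb)
    env.set_mapsize(new_size)
    return new_size
```

<p>A write that runs past the limit raises <code>lmdb.MapFullError</code>, so one option is to catch it, call something like <code>grow_env</code>, and retry the batch.</p>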
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do - it can provide an offset index to provide random access to data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source and purpose built for large scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production level training.</p>
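<p>For the <code>tar</code> option in the list above, a minimal sketch of an offset index using only the standard library could look like this. It assumes an uncompressed archive, keeps everything in memory for brevity, and all function names are hypothetical.</p>

```python
import io
import tarfile


def build_tar(files: dict[str, bytes]) -> bytes:
    """Pack name -> payload pairs into an uncompressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def build_index(tar_bytes: bytes) -> dict[str, tuple[int, int]]:
    """Map each member name to (offset of its data, size in bytes)."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return {m.name: (m.offset_data, m.size) for m in archive if m.isfile()}


def read_member(tar_bytes: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Fetch one member directly by offset, without scanning the archive."""
    offset, size = index[name]
    return tar_bytes[offset : offset + size]


# Tiny demonstration with made-up payloads.
archive_bytes = build_tar({"a.jpg": b"first", "b.jpg": b"second image"})
index = build_index(archive_bytes)
```

<p>With a real archive on disk, the same index would be used with <code>seek</code> and <code>read</code> on the open file instead of slicing a bytes object.</p>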
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
<item>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +427,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Representative" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/representative/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Samples" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/samples/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -139,7 +139,7 @@
|
||||
<blockquote>
|
||||
<p><strong>NOTE</strong>: for a proof-of-concept implementation, or if you’re on the CPU, try the <code>all-MiniLM-L6-v2</code> model. It’s a much smaller and much faster model, although you sacrifice a little in terms of accuracy.</p>
|
||||
</blockquote>
|
||||
<h2 id="the-concept-of-_approximate_-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<h2 id="the-concept-of-approximate-nearest-neighbors">The concept of <em>approximate</em> nearest neighbors</h2>
|
||||
<p>Performing any kind of nearest neighbor algorithm on medium scale datasets (even bordering 10,000 rows and tens of columns) tends to be slow. A primary driver of this is the need to calculate all, or nearly all, distances between all data points. <em>Approximate</em> nearest neighbor (ANN) algorithms work around this through various approaches, which warrant their own blog post. For now, it suffices to understand that there are shortcuts that ANN algorithms take to give you if not the exact nearest neighbor, at least <em>one</em> of the nearest neighbors (hence the term <em>approximate</em>).</p>
|
||||
<p>There are several algorithms that you can use - I shall proceed with <code>faiss</code>, because it has a nice Python interface and is rather easy to work with. You can use any algorithm - a full list of the major ones is <a href="https://github.com/erikbern/ann-benchmarks">available here</a>.</p>
|
||||
<p>I’ll explain why we’re in the nearest neighbor territory in due course.</p>
|
||||
|
||||
113
public/tags/square-roots/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Square-Roots | Avinash's Blog</title>
|
||||
<meta name="title" content="Square-Roots" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/square-roots/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Square-Roots">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Square-Roots">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Square-Roots">
|
||||
<meta itemprop="datePublished" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-15T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/square-roots/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Square-Roots"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-15' pubdate>
|
||||
2026-02-15
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/series/byhand/byhand_001_square_roots/">BasicsByHand: Square Roots</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
314
public/tags/square-roots/index.xml
Normal file
File diff suppressed because one or more lines are too long
113
public/tags/storage/index.html
Normal file
@@ -0,0 +1,113 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-US">
|
||||
|
||||
<head>
|
||||
<meta http-equiv="X-Clacks-Overhead" content="GNU Terry Pratchett" />
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Storage | Avinash's Blog</title>
|
||||
<meta name="title" content="Storage" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<meta property="og:url" content="https://avimallu.dev/tags/storage/">
|
||||
<meta property="og:site_name" content="Avinash's Blog">
|
||||
<meta property="og:title" content="Storage">
|
||||
<meta property="og:locale" content="en_US">
|
||||
<meta property="og:type" content="website">
|
||||
<meta property="og:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta name="twitter:card" content="summary_large_image">
|
||||
<meta name="twitter:image" content="https://avimallu.dev/static/favicon.ico">
|
||||
<meta name="twitter:title" content="Storage">
|
||||
|
||||
|
||||
|
||||
|
||||
<meta itemprop="name" content="Storage">
|
||||
<meta itemprop="datePublished" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="dateModified" content="2026-02-10T00:00:00+00:00">
|
||||
<meta itemprop="image" content="https://avimallu.dev/static/favicon.ico">
|
||||
|
||||
<meta name="referrer" content="no-referrer-when-downgrade" />
|
||||
|
||||
|
||||
<link href="/original.min.css" rel="stylesheet">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/storage/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<header><a class="skip-link" href="#main-content">Skip to main content</a>
|
||||
|
||||
<a href="/" class="title"><h1>Avinash's Blog</h1></a>
|
||||
<nav>
|
||||
<a href="/">about</a>
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</nav>
|
||||
</header>
|
||||
<main id="main-content">
|
||||
<content>
|
||||
|
||||
<h3 class="blog-filter">Filtering for "Storage"</h3>
|
||||
|
||||
<ul class="blog-posts">
|
||||
|
||||
<li>
|
||||
<span>
|
||||
<i>
|
||||
<time datetime='2026-02-10' pubdate>
|
||||
2026-02-10
|
||||
</time>
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/005_ldmb_as_image_db/">Resolving I/O Bottlenecks for 100K Small Files with LMDB</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</content>
|
||||
|
||||
</main>
|
||||
<footer><small>
|
||||
© Avinash Mallya | Design via <a href="https://github.com/clente/hugo-bearcub">Bear Cub</a>.
|
||||
</small></footer>
|
||||
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
276
public/tags/storage/index.xml
Normal file
@@ -0,0 +1,276 @@
|
||||
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
|
||||
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
|
||||
<channel>
|
||||
<title>Storage on Avinash's Blog</title>
|
||||
<link>https://avimallu.dev/tags/storage/</link>
|
||||
<description>Recent content in Storage on Avinash's Blog</description>
|
||||
<generator>Hugo -- gohugo.io</generator>
|
||||
<language>en-US</language>
|
||||
<copyright>© Avinash Mallya</copyright>
|
||||
<lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/storage/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
|
||||
<link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
|
||||
<pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
|
||||
<description><h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and this needed to be stored along with the images - or at least be easily linkable to them.</p></description>
|
||||
<content:encoded><![CDATA[<h1 id="premise">Premise</h1>
|
||||
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
|
||||
<p>I had several thousand images already.
|
||||
I was expecting several thousand more.
|
||||
My repository was tracking these images via DVC.
|
||||
My computer was also slowing down massively because of the sheer number of files.
|
||||
DVC itself was slowing down (after all, randomly accessing many files isn’t going to be fast).
|
||||
I also needed to access files at random for training/evaluating the model (lots of shuffling).
|
||||
Lastly, these images had their own associated metadata (labels, bounding boxes, “correct” text etc.), and this needed to be stored along with the images - or at least be easily linkable to them.</p>
|
||||
<p>I was primarily aiming for a “simple” solution, and didn’t need a productionizable codebase.</p>
|
||||
<h1 id="potential-solutions">Potential Solutions</h1>
|
||||
<h2 id="partitioning">Partitioning</h2>
|
||||
<p>A typical solution for “too many files” is to partition them by their name. It’s ideal if their
|
||||
name is a hash, so you can store the first character of the hash, then the second character, and
|
||||
then the actual file. So, for example, the directory changes from:</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
|
||||
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│ └── 1/
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│ ├── a1b2c3d4e5.txt
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│ └── a1f5e6d7c8.txt
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│ └── 7/
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│ ├── b7f8a9c0d1.txt
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│ └── b7e4d2f1a0.txt
|
||||
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
|
||||
</span></span><span class="line"><span class="ln">11</span><span class="cl"> ├── 0/
|
||||
</span></span><span class="line"><span class="ln">12</span><span class="cl"> │ └── c0d1e2f3a4.txt
|
||||
</span></span><span class="line"><span class="ln">13</span><span class="cl"> └── f/
|
||||
</span></span><span class="line"><span class="ln">14</span><span class="cl"> └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn’t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
|
||||
This limits the number of entries per directory, so file system lookups take less time.</p>
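<p>As a minimal sketch of the partitioning scheme above, assuming the files are named after the SHA-256 hash of their contents (the post does not say which hash to use, and <code>partitioned_path</code> and <code>store</code> are hypothetical helper names):</p>

```python
import hashlib
from pathlib import Path


def partitioned_path(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Return root/<1st hex char>/<2nd hex char>/<digest><suffix> for the given bytes."""
    # Assumption: SHA-256 of the file contents is used as the file name.
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[0] / digest[1] / f"{digest}{suffix}"


def store(root: Path, data: bytes, suffix: str = ".jpg") -> Path:
    """Write the bytes under their partitioned path, creating directories as needed."""
    target = partitioned_path(root, data, suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

<p>With hexadecimal digests, two levels of single-character directories give at most 16 x 16 = 256 leaf directories, so each one holds roughly 1/256th of the files.</p>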
|
||||
<p>This isn’t a perfect solution - it required that I store the images as their hash,
|
||||
and handle the directory structure correctly. I would also need to build my own
|
||||
mechanism to maintain the link between the hash and the metadata, which means creating
|
||||
some sort of index. Lastly, DVC will still track files individually, which means that
|
||||
its <code>push</code>, <code>diff</code> and <code>pull</code> commands will still be slow.</p>
|
||||
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
|
||||
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
|
||||
while maintaining random access - for example, S3. Now my task will reduce to just correctly maintaining the index so
|
||||
that I know how many images are stored on S3, and link their metadata via their hash.</p>
|
||||
<p>If you’ve experienced retrieving a list of a large number of files stored on S3, you’d have first encountered the limit
|
||||
of 1,000 objects per request that the S3 list API enforces (and that <code>boto3</code> surfaces). You’ll need to work around it with pagination, which, while standard,
|
||||
is still more work. You would have also realized that even after all this, S3 will take quite some time to give you the
|
||||
list even after you’ve optimized as much as you can.</p>
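<p>For reference, the pagination workaround can be sketched roughly as follows. This assumes a <code>boto3</code> S3 client is passed in; <code>iter_keys</code> is a hypothetical helper name, and the bucket and prefix are placeholders:</p>

```python
from typing import Any, Iterator


def iter_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under a prefix, following pagination page by page."""
    # boto3 clients expose paginators; "list_objects_v2" returns the listing
    # one page at a time (up to 1,000 keys per page).
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage (needs AWS credentials; the bucket name is a placeholder):
#   import boto3
#   keys = list(iter_keys(boto3.client("s3"), "my-image-bucket", "images/"))
```

<p>Each page is still a separate network round trip, which is why listing a large bucket stays slow even with pagination handled correctly.</p>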
|
||||
<p>However, that means I need an active internet connection to access any data. It also introduces latency during training,
|
||||
and wastes bandwidth when running multiple training sessions for the models. I can minimize these if I store the
|
||||
images in the same AWS region as I would be running the training in (say, an EC2), but that means I needed access to an
|
||||
EC2.</p>
|
||||
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
|
||||
maintenance burden for a relatively nascent project that hasn’t reached production status, while demanding production
|
||||
code for an even more nascent pipeline.</p>
|
||||
<h2 id="what-about-a-database">What about a… database?</h2>
|
||||
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
|
||||
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn’t
|
||||
really a quick and painless process, and has many gotchas. For instance, most databases aren’t optimized for storing a
|
||||
large number of binary blobs.</p>
|
||||
<p>I considered, and tested, a more modern embeddable analytical database like DuckDB for this purpose (I’m quite
|
||||
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
|
||||
found out that storing large binary blobs in it causes it to choke (which is fair, it isn’t really designed for that). Storing
|
||||
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
|
||||
of maintaining the index.</p>
|
||||
<blockquote>
|
||||
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
|
||||
the Parquet format, with the images stored using Arrow’s extension types (but still as binary blobs). My experience with
|
||||
storing binary data in Parquet hasn’t been great, but you could check this out to see if it meets your requirements.</p>
|
||||
</blockquote>
|
||||
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
|
||||
<h2 id="what-about-a-different-kind-of-database">What about a… <em>different</em> kind of database?</h2>
|
||||
<p>Let’s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
|
||||
to store their metadata. I wanted to access said data quickly. It became clear to me that I was looking for a fast key-value
|
||||
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
|
||||
<h2 id="lmdb">LMDB</h2>
|
||||
<p>Wikipedia’s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it’s an incredibly
|
||||
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won’t pretend to
|
||||
understand how it works; the writeup on Wikipedia provides plenty of good detail. I’ll focus, rather, on how I used it to
|
||||
solve my problem.</p>
|
||||
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
|
||||
<p>LMDB is rather barebones. It exposes few features - the ability to write, and read a particular key (stored as a bytestring),
|
||||
that itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
|
||||
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
|
||||
<ol>
|
||||
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python. Batched to avoid running out of memory.</li>
|
||||
<li>Read the image files as bytes and serialize the metadata, and link them to the keys f"{file_name}_image" and f"{file_name}_metadata" respectively.</li>
|
||||
<li>Store these as key-value pairs into the LMDB database, which is a <strong>single</strong> file.</li>
|
||||
<li>Provide a method to read the keys to identify all the images present in the database.</li>
|
||||
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
|
||||
</ol>
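<p>The key scheme in step 2 can be sketched as below. This is only an illustration: it assumes the metadata is a JSON-serializable dict (the serialization format is my assumption here), and <code>make_records</code> and <code>image_names</code> are hypothetical helper names, not part of LMDB.</p>

```python
import json


def make_records(
    file_name: str, image_bytes: bytes, metadata: dict
) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs for one image: raw bytes and encoded metadata."""
    return [
        (f"{file_name}_image".encode(), image_bytes),
        # Assumption: metadata is JSON-serializable; any bytes encoding would work.
        (f"{file_name}_metadata".encode(), json.dumps(metadata, sort_keys=True).encode()),
    ]


def image_names(keys: list[bytes]) -> list[str]:
    """Recover the distinct image names from stored keys by stripping the image suffix."""
    suffix = "_image"
    decoded = [k.decode() for k in keys]
    return sorted(k[: -len(suffix)] for k in decoded if k.endswith(suffix))
```

<p>The resulting list of (key, value) byte pairs is the shape that a bulk write into a key-value store expects, and switching the suffix is all it takes to go from an image to its metadata.</p>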
|
||||
<p>This has many advantages:</p>
|
||||
<ol>
|
||||
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
|
||||
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slow downs due to sheer number of files - either for DVC, or my computer.</li>
|
||||
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
|
||||
<li>Local access, practically zero latency.</li>
|
||||
</ol>
|
||||
<p>Which solves… all of the problems I had! When new files come in, all I needed to do was add them to the DB. LMDB has a few options available - you can
|
||||
avoid overwriting an existing key, which keeps the database de-duplicated.</p>
|
||||
<h1 id="the-code">The code</h1>
|
||||
<p>I’ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
|
||||
dataset, which has around 8000 images.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
|
||||
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
|
||||
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
|
||||
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 11</span><span class="cl"> <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 12</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 13</span><span class="cl"> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 14</span><span class="cl"> <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 15</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 16</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 17</span><span class="cl"> <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 19</span><span class="cl"> <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 20</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 21</span><span class="cl"> <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 22</span><span class="cl"> <span class="k">return</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 23</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 24</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 26</span><span class="cl"> <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 27</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 28</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 29</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 30</span><span class="cl"> <span class="c1"># Note: you might need to enforce a batch size here</span>
|
||||
</span></span><span class="line"><span class="ln"> 31</span><span class="cl"> <span class="c1"># to avoid running out of memory because this loads</span>
|
||||
</span></span><span class="line"><span class="ln"> 32</span><span class="cl"> <span class="c1"># all images sent to this function as bytes.</span>
|
||||
</span></span><span class="line"><span class="ln"> 33</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 34</span><span class="cl"> <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
|
||||
</span></span><span class="line"><span class="ln"> 35</span><span class="cl"> <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
|
||||
</span></span><span class="line"><span class="ln"> 36</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 37</span><span class="cl"> <span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 38</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 39</span><span class="cl"> <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 40</span><span class="cl"> <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
|
||||
</span></span><span class="line"><span class="ln"> 41</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 42</span><span class="cl"> <span class="nb">print</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 43</span><span class="cl"> <span class="sa">f</span><span class="s2">"Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist)."</span>
|
||||
</span></span><span class="line"><span class="ln"> 44</span><span class="cl"> <span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 46</span><span class="cl"> <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
|
||||
</span></span><span class="line"><span class="ln"> 47</span><span class="cl"> <span class="bp">self</span><span class="p">,</span>
|
||||
</span></span><span class="line"><span class="ln"> 48</span><span class="cl"> <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
|
||||
</span></span><span class="line"><span class="ln"> 49</span><span class="cl"> <span class="p">)</span> <span class="o">-></span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 50</span><span class="cl"> <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
|
||||
</span></span><span class="line"><span class="ln"> 51</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 52</span><span class="cl"> <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 53</span><span class="cl"> <span class="k">return</span> <span class="p">{</span>
|
||||
</span></span><span class="line"><span class="ln"> 54</span><span class="cl"> <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
|
||||
</span></span><span class="line"><span class="ln"> 55</span><span class="cl"> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 56</span><span class="cl"> <span class="p">}</span>
|
||||
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 58</span><span class="cl"> <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 59</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 60</span><span class="cl"> <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
|
||||
</span></span><span class="line"><span class="ln"> 61</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 62</span><span class="cl"> <span class="k">else</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 63</span><span class="cl"> <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 65</span><span class="cl"> <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
|
||||
</span></span><span class="line"><span class="ln"> 66</span><span class="cl"> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 67</span><span class="cl"> <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
|
||||
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 71</span><span class="cl"> <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">"./db/"</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 72</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 74</span><span class="cl"> <span class="c1"># Save the results</span>
|
||||
</span></span><span class="line"><span class="ln"> 75</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg/"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 76</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
|
||||
</span></span><span class="line"><span class="ln"> 77</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 78</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 79</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 80</span><span class="cl"> <span class="c1"># Add last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 81</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 83</span><span class="cl"> <span class="k">del</span> <span class="n">name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 84</span><span class="cl"> <span class="c1"># How many images have been stored?</span>
|
||||
</span></span><span class="line"><span class="ln"> 85</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 87</span><span class="cl"> <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 88</span><span class="cl"> <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
|
||||
</span></span><span class="line"><span class="ln"> 89</span><span class="cl"> <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">"./data/jpg"</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.jpg"</span><span class="p">):</span>
|
||||
</span></span><span class="line"><span class="ln"> 90</span><span class="cl"> <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 91</span><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1000</span><span class="p">:</span>
|
||||
</span></span><span class="line"><span class="ln"> 92</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 93</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 94</span><span class="cl"> <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||||
</span></span><span class="line"><span class="ln"> 95</span><span class="cl"> <span class="c1"># Verify last batch also</span>
|
||||
</span></span><span class="line"><span class="ln"> 96</span><span class="cl"> <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
|
||||
</span></span><span class="line"><span class="ln"> 97</span><span class="cl"> <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
|
||||
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln"> 99</span><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s2">"All images stored are byte identical to the original ones!"</span><span class="p">)</span>
|
||||
</span></span><span class="line"><span class="ln">100</span><span class="cl">
|
||||
</span></span><span class="line"><span class="ln">101</span><span class="cl"> <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should provide you a good starting point to implement additional features,
|
||||
such as storing metadata, filtering required input by metadata (such as extracting
|
||||
a specific label for evaluation) and so on.</p>
|
||||
<h1 id="caveats">Caveats</h1>
|
||||
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
|
||||
<p>One gotcha that I initially faced is that the images I saved weren’t the same as
|
||||
the images that I retrieved. This wasn’t LMDB’s fault; it happened because I was
|
||||
reading the images from disk via PIL, and storing them as bytes in LMDB. PIL
|
||||
decodes and encodes the image, so a roundtrip will not necessarily be identical,
|
||||
even for lossless file formats (other than bitmap images).</p>
|
||||
<p>Don’t encode/re-encode the image before you store it, or be prepared for the
|
||||
stored data to not be byte-identical.</p>
|
||||
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
|
||||
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
|
||||
upon creation, and writes fail once the DB exceeds it. You can change this
|
||||
later, with some caveats (on Windows, this will actually allocate the full
|
||||
size).</p>
|
||||
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
|
||||
<p>LMDB, while extremely fast, has some considerations with concurrency. See
|
||||
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
|
||||
It may not be suited for distributed workloads.</p>
|
||||
<h1 id="alternatives">Alternatives</h1>
|
||||
<p>This article covers a “quick and dirty” solution, and was written before more purpose-built
|
||||
solutions were available. Some alternatives are:</p>
|
||||
<ol>
|
||||
<li>If you’re comfortable operating directly on archives, a simple <code>tar</code> file will
|
||||
do; an offset index built over its members gives random access to the data.</li>
|
||||
<li><a href="https://github.com/webdataset/webdataset">Nvidia’s WebDataset</a>. Modern, open
|
||||
source, and purpose-built for large-scale deep learning.</li>
|
||||
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as “designed for multimodal”
|
||||
and “built for scale”. It’s built on top of Arrow, closely related to Parquet.</li>
|
||||
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
|
||||
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
|
||||
</ol>
|
||||
<p>Use these if you want to scale to production-level training.</p>
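For the first alternative, a minimal sketch of an offset index over a `tar` file might look like the following. It assumes an uncompressed archive and uses only the standard library `tarfile` module. The helper names (`write_demo_tar`, `build_offset_index`, `read_member`) are hypothetical and chosen only for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_demo_tar(tar_path: Path, payloads: dict[str, bytes]) -> None:
    """Write in-memory payloads into an uncompressed tar (hypothetical helper)."""
    with tarfile.open(tar_path, mode="w") as archive:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))


def build_offset_index(tar_path: Path) -> dict[str, tuple[int, int]]:
    """Scan the archive once and map each file name to (data offset, size)."""
    index: dict[str, tuple[int, int]] = {}
    # "r:" opens the archive strictly without compression, so the offsets
    # refer to byte positions in the file on disk.
    with tarfile.open(tar_path, mode="r:") as archive:
        for member in archive:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path: Path, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Read one member by seeking directly, without re-scanning the archive."""
    offset, size = index[name]
    with tar_path.open("rb") as handle:
        handle.seek(offset)
        return handle.read(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "images.tar"
        write_demo_tar(demo_path, {"a.jpg": b"first image", "b.jpg": b"second image"})
        demo_index = build_offset_index(demo_path)
        print(read_member(demo_path, demo_index, "b.jpg"))
```

The index is plain data, so it could be saved alongside the archive (as JSON, for example) and reused across runs. A compressed archive would not work with this approach, since byte offsets into the compressed stream do not line up with member data.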
|
||||
]]></content:encoded>
|
||||
</item>
|
||||
</channel>
|
||||
</rss>
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Variables" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/variables/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
@@ -89,7 +94,7 @@
|
||||
</i>
|
||||
</span>
|
||||
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Environment Variables and Multiprocessing</a>
|
||||
<a href="/blog/004_environment_variables_and_multiprocessing/">Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</a>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
<lastBuildDate>Wed, 14 Jan 2026 00:00:00 +0000</lastBuildDate>
|
||||
<atom:link href="https://avimallu.dev/tags/variables/index.xml" rel="self" type="application/rss+xml" />
|
||||
<item>
|
||||
<title>Environment Variables and Multiprocessing</title>
|
||||
<title>Resolving Multiprocessing Bottlenecks in Legacy Batch Pipelines</title>
|
||||
<link>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</link>
|
||||
<pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
|
||||
<guid>https://avimallu.dev/blog/004_environment_variables_and_multiprocessing/</guid>
|
||||
@@ -164,7 +164,7 @@ environment variables:</p>
|
||||
|
||||
|
||||
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">seq <span class="m">10</span> <span class="p">|</span> <span class="se">\
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">2</span><span class="cl">parallel -j32 <span class="s2">"OMP_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s2">OPENBLAS_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s2">MKL_NUM_THREADS=1 \
|
||||
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s2">python test.py {}"</span></span></span></code></pre></div><p>An interesting side note is that the parallel, but single core run took 0.61 seconds, <strong>less</strong> than the time it took to run
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
<meta name="title" content="Vba" />
|
||||
<meta name="description" content="" />
|
||||
<meta name="author" content="" />
|
||||
<meta name="keywords" content="approximate,category,environment,faiss,graph,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,variables,vba," />
|
||||
<meta name="keywords" content="approximate,bottleneck,by-hand,category,dvc,environment,euler,faiss,graph,history,images,io,lmdb,logarithms,machine-learning,mathematics,multiprocessing,nearest,neighbor,network,networkx,numpy,parallel,polars,powerpoint,ppt,python,representative,samples,square-roots,storage,variables,vba," />
|
||||
|
||||
|
||||
|
||||
@@ -50,6 +50,9 @@
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" href="https://avimallu.dev/tags/vba/index.xml" title="Avinash's Blog" />
|
||||
|
||||
|
||||
|
||||
|
||||
</head>
|
||||
|
||||
<body>
|
||||
@@ -61,6 +64,8 @@
|
||||
|
||||
<a href="/blog/">blog</a>
|
||||
|
||||
<a href="/series/">series</a>
|
||||
|
||||
<a href="/projects/">projects</a>
|
||||
|
||||
<a href='https://avimallu.dev/index.xml'>rss</a>
|
||||
|
||||
@@ -50,7 +50,7 @@
|
||||
<p><img src="/blog/003_powerpointsnap/02_Charts.png" alt="The UI for copying chart properties"></p>
|
||||
<p>What do these features do? You should be able to hover over the option and get a tooltip that shows what it’s capable of, but here’s another summary just in case:</p>
|
||||
<ol>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just $x$ and $y$ here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Value/Date Axis: this will try to align the range, the ticks, the numeric values etc. of the “set” chart to the one you’ve selected. I couldn’t put in just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">x</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span></span></span> here because Microsoft internally doesn’t label them that way. Try either of these two options (you can undo!) and see what works best for your chart. This doesn’t work well yet for 3D charts.</li>
|
||||
<li>Sync Plot/Title/Legend: often, you want to centre a title, or make sure that multiple charts that show nearly identical things for different variables all <em>look</em> exactly the same from a client perspective. But that’s usually difficult if you’ve already configured the charts a little - which can be remedied with this option!</li>
|
||||
<li>Format Painter: this is simply a helper for the normal format painter to align the formats of the text that you’ve selected with the way it originally is in the “set” chart. The reason for this feature is simply to avoid going back to <em>Home</em> to click on the <em>Format Painter</em> option again.</li>
|
||||
<li>Reset Axes Scales: in case you messed up somewhere, you can use this to revert to PowerPoint defaults.</li>
|
||||
|
||||
@@ -1 +1,4 @@
|
||||
hugo && sudo rsync -av --delete public/ /usr/share/nginx/html/
|
||||
hugo && rsync -av --delete public/ vps:~/www
|
||||
# Also might need
|
||||
ssh vps "chmod -R o+rX www/"
|
||||
|
||||
|
||||