% Fast Key Lookup with a Small Read-Only Database

(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password is already present in a known password
dictionary. While there are several online APIs we can query for this purpose,
I chose to ignore these (apart from the potential privacy implications, I'd
also rather avoid dependencies on external services when possible). As an
alternative, there are several password dictionaries available for download
that can be used for offline querying. I decided to go with the free
[CrackStation password dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format for keeping the dictionary small, but it's a
really shitty format for fast querying. The obvious solution is to import the
dictionary into some database (any RDBMS or NoSQL store of your choice), but
those tend to be relatively heavyweight solutions in terms of code base,
operational dependencies, database size, or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (a password, in this case) is present. Surely I
could quickly hack together a simple database that provides a good
size/performance ratio?

## My Little B-tree

I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies, and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural way to split up the data into
fixed-size blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip over data when it has found
a "larger" key. So I made sure that the leaf block starts with a null byte and
encoded the passwords as a sequence of null-terminated strings. That way we can
reliably find the key by doing a substring search on `"\x00${key}\x00"`.

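To make the null-terminator trick concrete, here's a small sketch of the
leaf-block encoding (in Python rather than Perl, purely for illustration;
`encode_leaf` and `leaf_contains` are hypothetical names, not functions from
the actual code):

```python
def encode_leaf(passwords):
    # A leaf block is a leading null byte followed by null-terminated
    # strings. The passwords must already be sorted.
    return b"\x00" + b"".join(p.encode() + b"\x00" for p in passwords)

def leaf_contains(block, key):
    # Because every password is enclosed by null bytes on both sides,
    # a plain substring search can never match across password boundaries.
    return (b"\x00" + key.encode() + b"\x00") in block

block = encode_leaf(["hunter2", "letmein", "password"])
print(leaf_contains(block, "letmein"))  # True
print(leaf_contains(block, "etmein"))   # False: partial matches don't count
```

The surrounding null bytes are what make the naive substring search reliable:
a match can neither start nor end in the middle of a stored password.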
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys, like this:

```perl
$block1 # Contains all the passwords "less than" $password1
$password1
$block2 # Contains all the passwords "less than" $password2
$password2
$block3 # Contains all the passwords "greater than" $password2
```

Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or an index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with the leaf blocks.

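The exact bit layout of the reference isn't spelled out above, but one
plausible packing looks like this (a Python sketch; the field widths are my
own assumption, not necessarily what the actual code uses):

```python
import struct

def pack_ref(offset, length, is_leaf):
    # Assumed layout: 1 leaf-flag bit, 23 bits of compressed length,
    # 40 bits of file offset. That caps blocks at 8 MiB compressed and
    # files at 1 TiB, which is plenty for a password dictionary.
    assert length < (1 << 23) and offset < (1 << 40)
    return struct.pack(">Q", (int(is_leaf) << 63) | (length << 40) | offset)

def unpack_ref(data):
    v, = struct.unpack(">Q", data)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), v >> 63

off, ln, leaf = unpack_ref(pack_ref(123456, 987, True))
print(off, ln, leaf)  # 123456 987 1
```

Packing everything into one fixed-width integer is what lets the lookup code
treat a reference as an opaque 8-byte field inside an index block.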
Finally, at the end of the file, we store a reference to the *parent* block
(the root of the tree), so that our lookup algorithm knows where to start its
search.

Given a sorted list of passwords[^2], creating a database in this format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks, $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks, bottom-up. The reference to the last
# (top-level) block is what gets written at the end of the file.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($reference);
}

write_parent_node_reference();
```

That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

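In other words, flushing boils down to something like this (a Python sketch,
with zlib standing in for whichever compressor is configured; `flush_block` is
a hypothetical name):

```python
import io
import zlib

def flush_block(dbfile, data, is_leaf):
    # Compress the block, append it to the database file, and return the
    # (offset, compressed length, leaf flag) triple that the parent index
    # block will store as a packed 8-byte reference.
    compressed = zlib.compress(data)
    offset = dbfile.tell()
    dbfile.write(compressed)
    return offset, len(compressed), is_leaf

f = io.BytesIO()
ref1 = flush_block(f, b"\x00hunter2\x00", True)
ref2 = flush_block(f, b"\x00letmein\x00", True)
print(ref1[0], ref2[0] == ref1[1])  # 0 True: blocks are laid out back to back
```

Because blocks are only ever appended, the file can be written in a single
sequential pass.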
Lookup is just as easy. I don't even need pseudocode to demonstrate it; here's
the actual implementation:

```perl
# readblock() reads and decompresses the block identified by $ref; dref()
# decodes a packed 8-byte block reference. Both are small helpers, see the
# full code.
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup
    lookup_rec($q, $F, dref $buf)
}
```

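For readers who don't speak Perl, the whole scheme can be sketched end to end
in Python. This is a rough, simplified translation, not the author's code: the
bit layout of the 8-byte references is my own assumption, zlib stands in for
the configurable compression, and the tiny block size only exists to force a
multi-level tree in the demo:

```python
import io, struct, zlib

BLOCK_SIZE = 64  # unrealistically small, to force a multi-level tree

def pack_ref(offset, length, is_leaf):
    # Assumed layout: leaf flag, 23 bits of length, 40 bits of offset.
    return struct.pack(">Q", (int(is_leaf) << 63) | (length << 40) | offset)

def unpack_ref(data):
    v, = struct.unpack(">Q", data)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), v >> 63

def build(f, passwords):  # passwords must be sorted
    leaf, index = b"\x00", []  # index[j]: partially-built block at depth j

    def flush(data, is_leaf):
        comp = zlib.compress(data)
        ref = pack_ref(f.tell(), len(comp), is_leaf)
        f.write(comp)
        return ref

    def add_pair(j, ref, key):
        # Record "subtree `ref` holds keys below `key`" at index level j.
        if j == len(index):
            index.append(b"")
        if len(index[j]) > BLOCK_SIZE:
            # Close this index block: `ref` becomes its trailing reference
            # and `key` moves up as a separator in the parent.
            closed = flush(index[j] + ref, False)
            index[j] = b""
            add_pair(j + 1, closed, key)
        else:
            index[j] += ref + key + b"\x00"

    for pw in passwords:
        key = pw.encode()
        if len(leaf) + len(key) + 1 > BLOCK_SIZE:
            add_pair(0, flush(leaf, True), key)  # key starts the next leaf
            leaf = b"\x00"
        leaf += key + b"\x00"

    root = flush(leaf, True)
    for j in range(len(index)):
        root = flush(index[j] + root, False)
    f.write(root)  # root reference goes at the very end

def lookup(f, password):
    f.seek(-8, 2)  # the trailing 8 bytes point at the root block
    off, ln, is_leaf = unpack_ref(f.read(8))
    while True:
        f.seek(off)
        buf = zlib.decompress(f.read(ln))
        if is_leaf:
            return b"\x00" + password.encode() + b"\x00" in buf
        pos = 0
        while pos + 8 < len(buf):  # walk (reference, separator) pairs
            child = buf[pos:pos + 8]
            end = buf.index(b"\x00", pos + 8)
            if password.encode() < buf[pos + 8:end]:
                break
            pos = end + 1
        else:  # no separator was larger: follow the trailing reference
            child = buf[-8:]
        off, ln, is_leaf = unpack_ref(child)

db = io.BytesIO()
build(db, [f"pw{i:04d}" for i in range(500)])
print(lookup(db, "pw0123"), lookup(db, "hunter2"))  # True False
```

Note how the builder never revisits a written block: each level of the tree is
a simple append buffer that gets compressed and flushed once full, exactly as
in the pseudocode above.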
The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks

I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and Zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).

Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original size.

| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain)   |           684 |             6,376 | 0.16 (6.1 s)|
| Naive (gzip)    |           246 |             6,340 | 0.12 (8.3 s)|
| B-tree plain 1k |           698 |             6,460 |      17,857 |
| B-tree plain 4k |           687 |             6,436 |       9,345 |
| B-tree gzip 1k  |           261 |            10,772 |       9,345 |
| B-tree gzip 4k  |           244 |            10,572 |       5,076 |
| B-tree zstd 1k  |           291 |             6,856 |      12,345 |
| B-tree zstd 4k  |           268 |             6,724 |       6,944 |
| LMDB            |         1,282 |           590,792 |     333,333 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance. LMDB's memory usage in this benchmark should be taken with a
grain of salt; its use of [mmap()](https://manned.org/mmap) tends to [throw
off memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original size[^3].

| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain)   |        14,968 |            38,536 | 0.01 (110 s)|
| Naive (gzip)    |         4,245 |            38,760 | 0.01 (136 s)|
| B-tree plain 1k |        15,377 |             6,288 |      15,384 |
| B-tree plain 4k |        15,071 |             6,456 |       8,196 |
| B-tree gzip 1k  |         4,926 |            10,780 |       7,352 |
| B-tree gzip 4k  |         4,344 |            10,720 |       4,273 |
| B-tree zstd 1k  |         5,389 |             6,708 |      10,000 |
| B-tree zstd 4k  |         4,586 |             6,692 |       5,917 |
| LMDB            |        26,453 |         3,259,368 |     266,666 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements

Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably.
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block references
  and add a few more strings to the index blocks, allowing parts of a block to
  be skipped when searching for a key. This way we can get some of the
  performance benefits of smaller block sizes without the cost of an increase
  in database size. Or store multiple (smaller) intermediate B-tree nodes
  inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

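The LRU cache from the first bullet is nearly free in many languages. A Python
sketch of the idea (the names are hypothetical, and zlib again stands in for
the configured compressor):

```python
from functools import lru_cache
import io
import zlib

def make_cached_reader(f):
    # Cache up to 1024 decompressed blocks, keyed by (offset, length).
    # Caching the *uncompressed* data saves both the read and the
    # decompression on repeated visits to hot blocks near the root.
    @lru_cache(maxsize=1024)
    def read_block(offset, length):
        f.seek(offset)
        return zlib.decompress(f.read(length))
    return read_block

f = io.BytesIO(zlib.compress(b"\x00hunter2\x00"))
read = make_cached_reader(f)
n = len(f.getvalue())
read(0, n); read(0, n)       # second call is served from the cache
print(read.cache_info().hits)  # 1
```

Since the database is read-only, cache invalidation (the usual hard part) is a
non-issue here.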
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index that array during lookup. This makes it harder to cap
the size of leaf blocks, but with small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

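The core of that hashing idea fits in a few lines (a Python sketch under the
assumption of a fixed, known bucket count; `bucket_of` is a hypothetical
helper):

```python
import hashlib

def bucket_of(password, num_buckets):
    # A stable hash maps each password to one bucket. The same function is
    # used at build time (to group passwords into leaf blocks) and at
    # lookup time (to pick the one block that could contain the key).
    h = hashlib.sha256(password.encode()).digest()
    return int.from_bytes(h[:8], "big") % num_buckets

buckets = {}
for pw in ["hunter2", "letmein", "password"]:
    buckets.setdefault(bucket_of(pw, 4), []).append(pw)

# A lookup touches exactly one bucket instead of walking a tree:
print("hunter2" in buckets.get(bucket_of("hunter2", 4), []))  # True
```

The catch mentioned above is visible here too: the hash decides bucket sizes,
so unlike with a B-tree you can't directly cap how large a leaf block gets.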
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently large) [Bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter).

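A Bloom filter trades exactness for size: it never misses a stored password,
but absent keys are occasionally reported as present. A minimal sketch (around
10 bits per key with 7 hash functions gives roughly a 1% false-positive rate):

```python
import hashlib

class Bloom:
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # No false negatives; false positives with small probability.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))

bf = Bloom(10_000, 7)
for pw in ["hunter2", "letmein", "password"]:
    bf.add(pw)
print("hunter2" in bf)  # True
```

For this use case the occasional false positive is harmless: the worst outcome
is warning a user about a password that wasn't actually in the dictionary.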
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.

[^1]: Yeah, people still use Perl 5 in 2019.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison, so a separate `LC_COLLATE=C sort` pass
is needed to fix that. Also note that sorting a billion strings is a pretty
challenging problem in its own right, but enough has been written about that.
Arguably, enough has been written about B-trees and databases as well.

[^3]: Running these benchmarks was a bit of a nightmare, as I was running low
on free space on my SSD. I had to hastily delete some unused build artefacts
from other projects in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, on which all the other databases are
based. [Ncdu](/ncdu) saved this experiment; its author deserves an extra tasty
pizza for dinner today.