% Fast Key Lookup with a Small Read-Only Database

(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password is already present in a known password
dictionary. While there are several online APIs we can query for this purpose,
I chose to ignore these (apart from the potential privacy implications, I'd
also rather avoid dependencies on external services when possible). As an
alternative, there are several password dictionaries available for download
that can be used for offline querying. I decided to go with the free
[CrackStation password dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format for keeping the dictionary small, but it's a
really shitty format for fast querying. The obvious solution is to import the
dictionary into some database (any RDBMS or NoSQL store of your choice), but
those tend to be relatively heavyweight solutions in terms of code base,
operational dependencies, database size, or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (a password, in this case) is present. Surely I
could quickly hack together a simple database that provides a good
size/performance ratio?

## My Little B-tree

I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies, and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural way to split up the data into
fixed-size blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip over data when it has found
a "larger" key. So I made sure that the leaf block starts with a null byte and
encoded the passwords as a sequence of null-terminated strings. That way we can
reliably find the key by doing a substring search on `"\x00${key}\x00"`.

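To make the null-terminator trick concrete, here's a small sketch of the
leaf-block encoding (in Python rather than Perl, purely for illustration;
`encode_leaf` and `leaf_contains` are hypothetical names, not functions from
the actual code):

```python
def encode_leaf(passwords):
    # A leaf block is a leading null byte followed by null-terminated
    # strings. The passwords must already be sorted.
    return b"\x00" + b"".join(p.encode() + b"\x00" for p in passwords)

def leaf_contains(block, key):
    # Because every password is enclosed by null bytes on both sides,
    # a plain substring search can never match across password boundaries.
    return (b"\x00" + key.encode() + b"\x00") in block

block = encode_leaf(["hunter2", "letmein", "password"])
print(leaf_contains(block, "letmein"))  # True
print(leaf_contains(block, "etmein"))   # False: partial matches don't count
```

The surrounding null bytes are what make the naive substring search reliable:
a match can neither start nor end in the middle of a stored password.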
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys, like this:

```perl
$block1 # Contains all the passwords "less than" $password1
$password1
$block2 # Contains all the passwords "less than" $password2
$password2
$block3 # Contains all the passwords "greater than" $password2
```

Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or an index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with the leaf blocks.

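The exact bit layout of the reference isn't spelled out above, but one
plausible packing looks like this (a Python sketch; the field widths are my
own assumption, not necessarily what the actual code uses):

```python
import struct

def pack_ref(offset, length, is_leaf):
    # Assumed layout: 1 leaf-flag bit, 23 bits of compressed length,
    # 40 bits of file offset. That caps blocks at 8 MiB compressed and
    # files at 1 TiB, which is plenty for a password dictionary.
    assert length < (1 << 23) and offset < (1 << 40)
    return struct.pack(">Q", (int(is_leaf) << 63) | (length << 40) | offset)

def unpack_ref(data):
    v, = struct.unpack(">Q", data)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), v >> 63

off, ln, leaf = unpack_ref(pack_ref(123456, 987, True))
print(off, ln, leaf)  # 123456 987 1
```

Packing everything into one fixed-width integer is what lets the lookup code
treat a reference as an opaque 8-byte field inside an index block.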
Finally, at the end of the file, we store a reference to the *parent* block
(the root of the tree), so that our lookup algorithm knows where to start its
search.

Given a sorted list of passwords[^2], creating a database in this format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks, $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks, bottom-up. The reference to the last
# (top-level) block is what gets written at the end of the file.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($reference);
}

write_parent_node_reference();
```

That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

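In other words, flushing boils down to something like this (a Python sketch,
with zlib standing in for whichever compressor is configured; `flush_block` is
a hypothetical name):

```python
import io
import zlib

def flush_block(dbfile, data, is_leaf):
    # Compress the block, append it to the database file, and return the
    # (offset, compressed length, leaf flag) triple that the parent index
    # block will store as a packed 8-byte reference.
    compressed = zlib.compress(data)
    offset = dbfile.tell()
    dbfile.write(compressed)
    return offset, len(compressed), is_leaf

f = io.BytesIO()
ref1 = flush_block(f, b"\x00hunter2\x00", True)
ref2 = flush_block(f, b"\x00letmein\x00", True)
print(ref1[0], ref2[0] == ref1[1])  # 0 True: blocks are laid out back to back
```

Because blocks are only ever appended, the file can be written in a single
sequential pass.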
Lookup is just as easy. I don't even need pseudocode to demonstrate it; here's
the actual implementation:

```perl
# readblock() reads and decompresses the block identified by $ref; dref()
# decodes a packed 8-byte block reference. Both are small helpers, see the
# full code.
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup
    lookup_rec($q, $F, dref $buf)
}
```

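For readers who don't speak Perl, the whole scheme can be sketched end to end
in Python. This is a rough, simplified translation, not the author's code: the
bit layout of the 8-byte references is my own assumption, zlib stands in for
the configurable compression, and the tiny block size only exists to force a
multi-level tree in the demo:

```python
import io, struct, zlib

BLOCK_SIZE = 64  # unrealistically small, to force a multi-level tree

def pack_ref(offset, length, is_leaf):
    # Assumed layout: leaf flag, 23 bits of length, 40 bits of offset.
    return struct.pack(">Q", (int(is_leaf) << 63) | (length << 40) | offset)

def unpack_ref(data):
    v, = struct.unpack(">Q", data)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), v >> 63

def build(f, passwords):  # passwords must be sorted
    leaf, index = b"\x00", []  # index[j]: partially-built block at depth j

    def flush(data, is_leaf):
        comp = zlib.compress(data)
        ref = pack_ref(f.tell(), len(comp), is_leaf)
        f.write(comp)
        return ref

    def add_pair(j, ref, key):
        # Record "subtree `ref` holds keys below `key`" at index level j.
        if j == len(index):
            index.append(b"")
        if len(index[j]) > BLOCK_SIZE:
            # Close this index block: `ref` becomes its trailing reference
            # and `key` moves up as a separator in the parent.
            closed = flush(index[j] + ref, False)
            index[j] = b""
            add_pair(j + 1, closed, key)
        else:
            index[j] += ref + key + b"\x00"

    for pw in passwords:
        key = pw.encode()
        if len(leaf) + len(key) + 1 > BLOCK_SIZE:
            add_pair(0, flush(leaf, True), key)  # key starts the next leaf
            leaf = b"\x00"
        leaf += key + b"\x00"

    root = flush(leaf, True)
    for j in range(len(index)):
        root = flush(index[j] + root, False)
    f.write(root)  # root reference goes at the very end

def lookup(f, password):
    f.seek(-8, 2)  # the trailing 8 bytes point at the root block
    off, ln, is_leaf = unpack_ref(f.read(8))
    while True:
        f.seek(off)
        buf = zlib.decompress(f.read(ln))
        if is_leaf:
            return b"\x00" + password.encode() + b"\x00" in buf
        pos = 0
        while pos + 8 < len(buf):  # walk (reference, separator) pairs
            child = buf[pos:pos + 8]
            end = buf.index(b"\x00", pos + 8)
            if password.encode() < buf[pos + 8:end]:
                break
            pos = end + 1
        else:  # no separator was larger: follow the trailing reference
            child = buf[-8:]
        off, ln, is_leaf = unpack_ref(child)

db = io.BytesIO()
build(db, [f"pw{i:04d}" for i in range(500)])
print(lookup(db, "pw0123"), lookup(db, "hunter2"))  # True False
```

Note how the builder never revisits a written block: each level of the tree is
a simple append buffer that gets compressed and flushed once full, exactly as
in the pseudocode above.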
The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks

I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and Zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).

Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original size.

| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain)   |           684 |             6,376 | 0.16 (6.1 s)|
| Naive (gzip)    |           246 |             6,340 | 0.12 (8.3 s)|
| B-tree plain 1k |           698 |             6,460 |      17,857 |
| B-tree plain 4k |           687 |             6,436 |       9,345 |
| B-tree gzip 1k  |           261 |            10,772 |       9,345 |
| B-tree gzip 4k  |           244 |            10,572 |       5,076 |
| B-tree zstd 1k  |           291 |             6,856 |      12,345 |
| B-tree zstd 4k  |           268 |             6,724 |       6,944 |
| LMDB            |         1,282 |           590,792 |     333,333 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance. LMDB's memory usage in this benchmark should be taken with a
grain of salt; its use of [mmap()](https://manned.org/mmap) tends to [throw
off memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original size[^3].

| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain)   |        14,968 |            38,536 | 0.01 (110 s)|
| Naive (gzip)    |         4,245 |            38,760 | 0.01 (136 s)|
| B-tree plain 1k |        15,377 |             6,288 |      15,384 |
| B-tree plain 4k |        15,071 |             6,456 |       8,196 |
| B-tree gzip 1k  |         4,926 |            10,780 |       7,352 |
| B-tree gzip 4k  |         4,344 |            10,720 |       4,273 |
| B-tree zstd 1k  |         5,389 |             6,708 |      10,000 |
| B-tree zstd 4k  |         4,586 |             6,692 |       5,917 |
| LMDB            |        26,453 |         3,259,368 |     266,666 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements

Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably.
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block references
  and add a few more strings to the index blocks, allowing parts of a block to
  be skipped when searching for a key. This way we can get some of the
  performance benefits of smaller block sizes without the cost of an increase
  in database size. Or store multiple (smaller) intermediate B-tree nodes
  inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

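The LRU cache from the first bullet is nearly free in many languages. A Python
sketch of the idea (the names are hypothetical, and zlib again stands in for
the configured compressor):

```python
from functools import lru_cache
import io
import zlib

def make_cached_reader(f):
    # Cache up to 1024 decompressed blocks, keyed by (offset, length).
    # Caching the *uncompressed* data saves both the read and the
    # decompression on repeated visits to hot blocks near the root.
    @lru_cache(maxsize=1024)
    def read_block(offset, length):
        f.seek(offset)
        return zlib.decompress(f.read(length))
    return read_block

f = io.BytesIO(zlib.compress(b"\x00hunter2\x00"))
read = make_cached_reader(f)
n = len(f.getvalue())
read(0, n); read(0, n)       # second call is served from the cache
print(read.cache_info().hits)  # 1
```

Since the database is read-only, cache invalidation (the usual hard part) is a
non-issue here.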
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index that array during lookup. This makes it harder to cap
the size of leaf blocks, but with small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

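The core of that hashing idea fits in a few lines (a Python sketch under the
assumption of a fixed, known bucket count; `bucket_of` is a hypothetical
helper):

```python
import hashlib

def bucket_of(password, num_buckets):
    # A stable hash maps each password to one bucket. The same function is
    # used at build time (to group passwords into leaf blocks) and at
    # lookup time (to pick the one block that could contain the key).
    h = hashlib.sha256(password.encode()).digest()
    return int.from_bytes(h[:8], "big") % num_buckets

buckets = {}
for pw in ["hunter2", "letmein", "password"]:
    buckets.setdefault(bucket_of(pw, 4), []).append(pw)

# A lookup touches exactly one bucket instead of walking a tree:
print("hunter2" in buckets.get(bucket_of("hunter2", 4), []))  # True
```

The catch mentioned above is visible here too: the hash decides bucket sizes,
so unlike with a B-tree you can't directly cap how large a leaf block gets.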
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently large) [Bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter).

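A Bloom filter trades exactness for size: it never misses a stored password,
but absent keys are occasionally reported as present. A minimal sketch (around
10 bits per key with 7 hash functions gives roughly a 1% false-positive rate):

```python
import hashlib

class Bloom:
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # No false negatives; false positives with small probability.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))

bf = Bloom(10_000, 7)
for pw in ["hunter2", "letmein", "password"]:
    bf.add(pw)
print("hunter2" in bf)  # True
```

For this use case the occasional false positive is harmless: the worst outcome
is warning a user about a password that wasn't actually in the dictionary.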
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.

[^1]: Yeah, people still use Perl 5 in 2019.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison, so a separate `LC_COLLATE=C sort` pass
is needed to fix that. Also note that sorting a billion strings is a pretty
challenging problem in its own right, but enough has been written about that.
Arguably, enough has been written about B-trees and databases as well.

[^3]: Running these benchmarks was a bit of a nightmare, as I was running low
on free space on my SSD. I had to hastily delete some unused build artefacts
from other projects in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, on which all the other databases are
based. [Ncdu](/ncdu) saved this experiment; its author deserves an extra tasty
pizza for dinner today.