% Fast Key Lookup with a Small Read-Only Database

(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password already appears in a public password
dictionary. While there are several online APIs we could query for this
purpose, I chose to ignore these (apart from the potential privacy
implications, I'd also rather avoid dependencies on external services when
possible). As an alternative, there are several password dictionaries
available for download that can be used for offline querying. I decided to go
with the free [CrackStation password
dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format to keep the size of the dictionary small, but
it's a really shitty format for fast querying. The obvious solution is to
import the dictionary into some database (any RDBMS or NoSQL store of your
choice), but those tend to be relatively heavyweight solutions in terms of code
base, operational dependencies, database size or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (password, in this case) is present. Surely I could
quickly hack together a simple database that provides a good size/performance
ratio?

## My Little B-tree

I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies, and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural way to split up the data into
fixed-size blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to stop processing early once it
has found a "larger" key. So I made sure that the leaf block starts with a null
byte and encoded the passwords as a sequence of null-terminated strings. That
way we can reliably find the key by doing a substring search on
`"\x00${key}\x00"`.

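
To make that concrete, here's a minimal sketch of the leaf encoding and the
corresponding membership test. This is my own illustration, not the actual
pwlookup code, which operates on compressed blocks:

```perl
# Encode a sorted list of passwords as a leaf block: a leading null
# byte followed by one null-terminated string per password.
sub encode_leaf {
    return "\x00" . join '', map "$_\x00", @_;
}

# Check whether $key is present by searching for "\x00$key\x00"; the
# leading null byte of the block ensures the first password can match too.
sub leaf_contains {
    my($buf, $key) = @_;
    return index($buf, "\x00$key\x00") >= 0;
}
```

Note that this assumes passwords can't contain null bytes themselves, which
holds for a plain-text dictionary.
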
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:

```perl
$block1    # Contains all the passwords "less than" $password1
$password1
$block2    # Contains all the passwords "between" $password1 and $password2
$password2
$block3    # Contains all the passwords "greater than" $password2
```

The `$blockN` references are stored as 64-bit integers and encode the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with leaf blocks.

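
As an illustration of such an encoding (the bit layout here is my own
invention; the actual on-disk format may divide the 64 bits differently):

```perl
# Pack a reference into 64 bits: 1 leaf-flag bit, 23 bits of compressed
# length, 40 bits of file offset. Requires a 64-bit build of Perl.
sub eref {
    my($is_leaf, $length, $offset) = @_;
    return pack 'Q>', (($is_leaf ? 1 : 0) << 63) | ($length << 40) | $offset;
}

# Decode 8 bytes back into [ $is_leaf, $length, $offset ].
sub dref {
    my $r = unpack 'Q>', $_[0];
    return [ $r >> 63, ($r >> 40) & 0x7F_FFFF, $r & 0xFF_FFFF_FFFF ];
}
```

A 40-bit offset caps the database at 1 TiB and a 23-bit length caps compressed
blocks at 8 MiB, both comfortably sufficient here.
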
Finally, at the end of the file, we store a reference to the *parent* block
(the root of the tree), so that our lookup algorithm knows where to start its
search.

Given a sorted list of passwords[^2], creating a database with that format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks; $blocks[0] is a leaf block, the
            # higher entries are index blocks.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough, appending their
    # reference (and the password as separator key) to the parent.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks, bottom-up. The topmost block is the
# root, so its reference goes to the end of the file instead.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    if($i < $#blocks) {
        $blocks[$i+1]->append_block_reference($reference);
    } else {
        write_parent_node_reference($reference);
    }
}
```

That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

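
In code, a flush could look roughly like this. It's a sketch under the same
assumptions as before: `eref()` is the illustrative reference encoder from
above, and `compress()` stands in for whatever gzip/zstd wrapper is used:

```perl
# Compress a finished block, append it to the database file and
# return an encoded reference to where it ended up.
sub flush_block {
    my($DB, $buf, $is_leaf) = @_;
    my $compressed = compress($buf);
    my $offset = sysseek $DB, 0, 2 or die $!;  # seek to end of file
    syswrite $DB, $compressed or die $!;
    return eref($is_leaf, length($compressed), $offset);
}
```
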
Lookup is just as easy. I don't even need pseudocode to demonstrate it; here's
the actual implementation:

```perl
# readblock() decompresses the block that $ref points to; dref()
# decodes an 8-byte block reference. See the full source for both.
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        # The query is larger than all keys: descend into the last block.
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes of the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup.
    lookup_rec($q, $F, dref $buf)
}
```

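
For example (the database filename here is hypothetical):

```perl
print lookup('hunter2', 'passwords.db') ? "found\n" : "not found\n";
```
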
The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks

I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and Zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).

Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original (compressed) size.

| Database         | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:-----------------|--------------:|------------------:|-------------:|
| Naive (plain)    |           684 |             6,376 | 0.16 (6.1 s) |
| Naive (gzip)     |           246 |             6,340 | 0.12 (8.3 s) |
| B-tree plain 1k  |           698 |             6,460 |    17,857.14 |
| B-tree plain 4k  |           687 |             6,436 |     9,345.79 |
| B-tree gzip 1k   |           261 |            10,772 |     9,345.79 |
| B-tree gzip 4k   |           244 |            10,572 |     5,076.14 |
| B-tree zstd 1k   |           291 |             6,856 |    12,345.68 |
| B-tree zstd 4k   |           268 |             6,724 |     6,944.44 |
| LMDB             |         1,282 |           590,792 |   333,333.33 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance.

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original (compressed) size[^3].

| Database         | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:-----------------|--------------:|------------------:|-------------:|
| Naive (plain)    |        14,968 |            38,536 | 0.01 (110 s) |
| Naive (gzip)     |         4,245 |            38,760 | 0.01 (136 s) |
| B-tree plain 1k  |        15,377 |             6,288 |    15,384.62 |
| B-tree plain 4k  |        15,071 |             6,456 |     8,196.72 |
| B-tree gzip 1k   |         4,926 |            10,780 |     7,352.94 |
| B-tree gzip 4k   |         4,344 |            10,720 |     4,273.50 |
| B-tree zstd 1k   |         5,389 |             6,708 |    10,000.00 |
| B-tree zstd 4k   |         4,586 |             6,692 |     5,917.16 |
| LMDB             |        26,453 |         3,259,368 |   266,666.67 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements

Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably (see the sketch after this list).
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block
  references and a few more strings to the index blocks, allowing parts of
  a block to be skipped when searching for the key. This allows one to get
  some of the performance benefits of smaller block sizes without paying for
  the increase in database size. Or store multiple (smaller) intermediate
  B-tree nodes inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

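
A minimal sketch of such a cache, wrapping the `readblock()` helper (the
cache size and eviction strategy are arbitrary choices for illustration):

```perl
my %cache;          # encoded reference => [ last-used tick, decompressed block ]
my $tick = 0;
my $max_cached = 1024;

sub readblock_cached {
    my($F, $ref) = @_;
    my $key = join ',', @$ref;
    if(my $ent = $cache{$key}) {
        $ent->[0] = ++$tick;
        return $ent->[1];
    }
    my $buf = readblock $F, $ref;
    if(keys %cache >= $max_cached) {
        # Evict the least recently used block. A linear scan is fine for
        # a sketch; a real cache would keep an ordered structure.
        my($lru) = sort { $cache{$a}[0] <=> $cache{$b}[0] } keys %cache;
        delete $cache{$lru};
    }
    $cache{$key} = [ ++$tick, $buf ];
    return $buf;
}
```
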
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

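
The lookup side of that scheme could be as simple as this sketch (using MD5 as
the hash is an arbitrary choice of mine):

```perl
use Digest::MD5 qw(md5);

# Map a key to one of $num_leaves leaf blocks. Lookup then reads the
# reference at that index from the index block and does the usual
# substring search in the leaf.
sub leaf_index {
    my($key, $num_leaves) = @_;
    return unpack('N', md5($key)) % $num_leaves;
}
```
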
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently) large [Bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter). For a filter with a 1%
false-positive rate you need roughly 10 bits per password, which works out to
under 80 MiB for the smaller dictionary.

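
A sketch of what that could look like, using double hashing on top of MD5
(the parameter choices and hash construction are my own, untuned assumptions):

```perl
use Digest::MD5 qw(md5);

my $n    = 64_000_000;        # expected number of passwords
my $bits = int($n * 9.6);     # ~1% false-positive rate
my $k    = 7;                 # number of probes per key
my $filter = "\x00" x (int($bits / 8) + 1);

# Derive $k bit positions from a single MD5: position i = h1 + i*h2.
sub positions {
    my($h1, $h2) = unpack 'N2', md5($_[0]);
    return map { ($h1 + $_ * $h2) % $bits } 1..$k;
}

sub bloom_add  { vec($filter, $_, 1) = 1 for positions($_[0]) }
sub bloom_test { vec($filter, $_, 1) or return 0 for positions($_[0]); return 1 }
```
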
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.


[^1]: Yeah, people still use Perl 5.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
strings is a pretty challenging problem in its own right, but enough has been
written about that. Arguably, enough has been written about B-trees and
databases as well.

[^3]: Running these benchmarks was a bit of a nightmare, as I was running low
on free space on my SSD. I had to hastily delete some unused build artefacts
from other projects so that `sort` could finish sorting and writing the
*Naive (plain)* database, upon which all the others are based.
[Ncdu](/ncdu) saved this experiment; its author deserves a tasty pizza
for dinner today.