% Fast Key Lookup with a Small Read-Only Database

(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password already appears in a public password
dictionary. While there are several online APIs we could query for this
purpose, I chose to ignore these (apart from the potential privacy
implications, I'd also rather avoid dependencies on external services when
possible). As an alternative, there are several password dictionaries
available for download that can be used for offline querying. I decided to go
with the free [CrackStation password
dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format to keep the size of the dictionary small, but
it's a really shitty format for fast querying. The obvious solution is to
import the dictionary into some database (any RDBMS or NoSQL store of your
choice), but those tend to be relatively heavyweight solutions in terms of code
base, operational dependencies, database size or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (password, in this case) is present. Surely I could
quickly hack together a simple database that provides a good size/performance
ratio?

## My Little B-tree

I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies, and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural way to split up the data into
fixed-size blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to stop processing early once it
has found a "larger" key. So I made sure that the leaf block starts with a null
byte and encoded the passwords as a sequence of null-terminated strings. That
way we can reliably find the key by doing a substring search on
`"\x00${key}\x00"`.

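
To make that concrete, here's a minimal sketch of the leaf encoding and the
corresponding membership test. This is my own illustration, not the actual
pwlookup code, which operates on compressed blocks:

```perl
# Encode a sorted list of passwords as a leaf block: a leading null
# byte followed by one null-terminated string per password.
sub encode_leaf {
    return "\x00" . join '', map "$_\x00", @_;
}

# Check whether $key is present by searching for "\x00$key\x00"; the
# leading null byte of the block ensures the first password can match too.
sub leaf_contains {
    my($buf, $key) = @_;
    return index($buf, "\x00$key\x00") >= 0;
}
```

Note that this assumes passwords can't contain null bytes themselves, which
holds for a plain-text dictionary.
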
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:

```perl
$block1    # Contains all the passwords "less than" $password1
$password1
$block2    # Contains all the passwords "between" $password1 and $password2
$password2
$block3    # Contains all the passwords "greater than" $password2
```

The `$blockN` references are stored as 64-bit integers and encode the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with leaf blocks.

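
As an illustration of such an encoding (the bit layout here is my own
invention; the actual on-disk format may divide the 64 bits differently):

```perl
# Pack a reference into 64 bits: 1 leaf-flag bit, 23 bits of compressed
# length, 40 bits of file offset. Requires a 64-bit build of Perl.
sub eref {
    my($is_leaf, $length, $offset) = @_;
    return pack 'Q>', (($is_leaf ? 1 : 0) << 63) | ($length << 40) | $offset;
}

# Decode 8 bytes back into [ $is_leaf, $length, $offset ].
sub dref {
    my $r = unpack 'Q>', $_[0];
    return [ $r >> 63, ($r >> 40) & 0x7F_FFFF, $r & 0xFF_FFFF_FFFF ];
}
```

A 40-bit offset caps the database at 1 TiB and a 23-bit length caps compressed
blocks at 8 MiB, both comfortably sufficient here.
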
Finally, at the end of the file, we store a reference to the *parent* block
(the root of the tree), so that our lookup algorithm knows where to start its
search.

Given a sorted list of passwords[^2], creating a database with that format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks; $blocks[0] is a leaf block, the
            # higher entries are index blocks.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough, appending their
    # reference (and the password as separator key) to the parent.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks, bottom-up. The topmost block is the
# root, so its reference goes to the end of the file instead.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    if($i < $#blocks) {
        $blocks[$i+1]->append_block_reference($reference);
    } else {
        write_parent_node_reference($reference);
    }
}
```

That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

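
In code, a flush could look roughly like this. It's a sketch under the same
assumptions as before: `eref()` is the illustrative reference encoder from
above, and `compress()` stands in for whatever gzip/zstd wrapper is used:

```perl
# Compress a finished block, append it to the database file and
# return an encoded reference to where it ended up.
sub flush_block {
    my($DB, $buf, $is_leaf) = @_;
    my $compressed = compress($buf);
    my $offset = sysseek $DB, 0, 2 or die $!;  # seek to end of file
    syswrite $DB, $compressed or die $!;
    return eref($is_leaf, length($compressed), $offset);
}
```
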
Lookup is just as easy. I don't even need pseudocode to demonstrate it; here's
the actual implementation:

```perl
# readblock() decompresses the block that $ref points to; dref()
# decodes an 8-byte block reference. See the full source for both.
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        # The query is larger than all keys: descend into the last block.
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes of the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup.
    lookup_rec($q, $F, dref $buf)
}
```

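
For example (the database filename here is hypothetical):

```perl
print lookup('hunter2', 'passwords.db') ? "found\n" : "not found\n";
```
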
The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks

I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and Zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).

Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original (compressed) size.

| Database         | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:-----------------|--------------:|------------------:|-------------:|
| Naive (plain)    |           684 |             6,376 | 0.16 (6.1 s) |
| Naive (gzip)     |           246 |             6,340 | 0.12 (8.3 s) |
| B-tree plain 1k  |           698 |             6,460 |    17,857.14 |
| B-tree plain 4k  |           687 |             6,436 |     9,345.79 |
| B-tree gzip 1k   |           261 |            10,772 |     9,345.79 |
| B-tree gzip 4k   |           244 |            10,572 |     5,076.14 |
| B-tree zstd 1k   |           291 |             6,856 |    12,345.68 |
| B-tree zstd 4k   |           268 |             6,724 |     6,944.44 |
| LMDB             |         1,282 |           590,792 |   333,333.33 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance.

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original (compressed) size[^3].

| Database         | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:-----------------|--------------:|------------------:|-------------:|
| Naive (plain)    |        14,968 |            38,536 | 0.01 (110 s) |
| Naive (gzip)     |         4,245 |            38,760 | 0.01 (136 s) |
| B-tree plain 1k  |        15,377 |             6,288 |    15,384.62 |
| B-tree plain 4k  |        15,071 |             6,456 |     8,196.72 |
| B-tree gzip 1k   |         4,926 |            10,780 |     7,352.94 |
| B-tree gzip 4k   |         4,344 |            10,720 |     4,273.50 |
| B-tree zstd 1k   |         5,389 |             6,708 |    10,000.00 |
| B-tree zstd 4k   |         4,586 |             6,692 |     5,917.16 |
| LMDB             |        26,453 |         3,259,368 |   266,666.67 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements

Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably (see the sketch after this list).
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block
  references and a few more strings to the index blocks, allowing parts of
  a block to be skipped when searching for the key. This allows one to get
  some of the performance benefits of smaller block sizes without paying for
  the increase in database size. Or store multiple (smaller) intermediate
  B-tree nodes inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

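
A minimal sketch of such a cache, wrapping the `readblock()` helper (the
cache size and eviction strategy are arbitrary choices for illustration):

```perl
my %cache;          # encoded reference => [ last-used tick, decompressed block ]
my $tick = 0;
my $max_cached = 1024;

sub readblock_cached {
    my($F, $ref) = @_;
    my $key = join ',', @$ref;
    if(my $ent = $cache{$key}) {
        $ent->[0] = ++$tick;
        return $ent->[1];
    }
    my $buf = readblock $F, $ref;
    if(keys %cache >= $max_cached) {
        # Evict the least recently used block. A linear scan is fine for
        # a sketch; a real cache would keep an ordered structure.
        my($lru) = sort { $cache{$a}[0] <=> $cache{$b}[0] } keys %cache;
        delete $cache{$lru};
    }
    $cache{$key} = [ ++$tick, $buf ];
    return $buf;
}
```
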
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

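
The lookup side of that scheme could be as simple as this sketch (using MD5 as
the hash is an arbitrary choice of mine):

```perl
use Digest::MD5 qw(md5);

# Map a key to one of $num_leaves leaf blocks. Lookup then reads the
# reference at that index from the index block and does the usual
# substring search in the leaf.
sub leaf_index {
    my($key, $num_leaves) = @_;
    return unpack('N', md5($key)) % $num_leaves;
}
```
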
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently) large [Bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter). For a filter with a 1%
false-positive rate you need roughly 10 bits per password, which works out to
under 80 MiB for the smaller dictionary.

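
A sketch of what that could look like, using double hashing on top of MD5
(the parameter choices and hash construction are my own, untuned assumptions):

```perl
use Digest::MD5 qw(md5);

my $n    = 64_000_000;        # expected number of passwords
my $bits = int($n * 9.6);     # ~1% false-positive rate
my $k    = 7;                 # number of probes per key
my $filter = "\x00" x (int($bits / 8) + 1);

# Derive $k bit positions from a single MD5: position i = h1 + i*h2.
sub positions {
    my($h1, $h2) = unpack 'N2', md5($_[0]);
    return map { ($h1 + $_ * $h2) % $bits } 1..$k;
}

sub bloom_add  { vec($filter, $_, 1) = 1 for positions($_[0]) }
sub bloom_test { vec($filter, $_, 1) or return 0 for positions($_[0]); return 1 }
```
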
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.


[^1]: Yeah, people still use Perl 5.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
strings is a pretty challenging problem in its own right, but enough has been
written about that. Arguably, enough has been written about B-trees and
databases as well.

[^3]: Running these benchmarks was a bit of a nightmare, as I was running low
on free space on my SSD. I had to hastily delete some unused build artefacts
from other projects so that `sort` could finish sorting and writing the
*Naive (plain)* database, upon which all the others are based.
[Ncdu](/ncdu) saved this experiment; its author deserves a tasty pizza
for dinner today.