Add doc/pwlookup

This commit is contained in:
Yorhel 2019-05-14 19:43:47 +02:00
parent 9d3d07b590
commit 132f3745e1
6 changed files with 275 additions and 1 deletions

.gitignore

@@ -46,6 +46,7 @@ pub/doc/commvis.html
 pub/doc/dcstats.html
 pub/doc/easyipc.html
 pub/doc/funcweb.html
+pub/doc/pwlookup.html
 pub/doc/sqlaccess.html
 pub/dump.html
 pub/dump/awshrink.html

@@ -19,6 +19,7 @@ PAGES=\
 "doc/dcstats.md"\
 "doc/easyipc.md"\
 "doc/funcweb.md"\
+"doc/pwlookup.md"\
 "doc/sqlaccess.md"\
 "dump.md"\
 "dump/awshrink.md"\

@@ -6,6 +6,10 @@ rare occasions are published on this page.
 ## Articles That May As Well Be Considered Blog Posts
+`2019-05-14` - [Fast Key Lookup with a Small Read-Only Database](/doc/pwlookup)
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary.
+
 `2017-05-28` - [An Opinionated Survey of Functional Web Development](/doc/funcweb)
 : The title says it all.

dat/doc/pwlookup.md (new file, 255 lines)

@@ -0,0 +1,255 @@
% Fast Key Lookup with a Small Read-Only Database
(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password is already in some kind of database.
While there are several online APIs we can query for this purpose, I chose to
ignore these (apart from the potential privacy implications, I'd also rather
avoid dependencies on external services when possible). As an alternative,
there are several password dictionaries available for download that can be used
for offline querying. I decided to go with the free [CrackStation password
dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format to keep the size of the dictionary small, but
it's a really shitty format for fast querying. The obvious solution is to
import the dictionary into some database (any RDBMS or NoSQL store of your
choice), but those tend to be relatively heavyweight solutions in terms of code
base, operational dependencies, database size or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (password, in this case) is present. Surely I could
quickly hack together a simple database that provides a good size/performance
ratio?

## My Little B-tree
I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural solution to split up the data into
fixed-sized blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip over processing when it has
found a "larger" key. So I made sure that the leaf block starts with a null
byte and encoded the passwords as a sequence of null-terminated strings. That
way we can reliably find the key by doing a substring search on
`"\x00${key}\x00"`.

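
For illustration, the leaf-block encoding and the delimiter trick can be
sketched in Python (the article's implementation is Perl 5; the helper names
here are mine):

```python
def encode_leaf(passwords):
    # A leaf block: a leading null byte followed by null-terminated
    # passwords, so every password is delimited by \x00 on both sides.
    return b"\x00" + b"".join(p + b"\x00" for p in passwords)

def leaf_contains(block, password):
    # Exact-match lookup is a plain substring search for "\x00<pw>\x00";
    # prefixes and suffixes of stored passwords can never match by accident.
    return b"\x00" + password + b"\x00" in block
```

The delimiters are what make a naive substring search safe: searching for
`hunter` won't match a block that only contains `hunter2`.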
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:

```perl
$block1 # Contains all the passwords "less than" $password1
$password1
$block2 # Contains all the passwords "less than" $password2
$password2
$block3 # Contains all the passwords "greater than" $password2
```

Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with the leaf blocks.

Finally, at the end of the file, we store a reference to the *parent* block, so
that our lookup algorithm knows where to start its search.

Given a sorted list of passwords[^2], creating a database with that format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks, $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($reference);
}

write_parent_node_reference();
```

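
To make the scheme fully concrete, here is a hedged, runnable Python
translation of both the build and the lookup side (the real implementation is
Perl 5 and lives in the linked repository; the bit layout of the references
and the "query ≤ separator" descent rule are my assumptions):

```python
import struct, zlib

BLOCK_SIZE = 1024  # uncompressed flush threshold (the article tests 1k and 4k)

def pack_ref(off, length, leaf):
    # Assumed packing: leaf flag in the top bit, compressed length next,
    # byte offset in the low 40 bits.
    return struct.pack(">Q", (int(leaf) << 63) | (length << 40) | off)

def unpack_ref(b):
    v, = struct.unpack(">Q", b)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), bool(v >> 63)

def build(sorted_keys, out):
    # blocks[0] is the current leaf block; higher levels are index blocks.
    blocks = [bytearray(b"\x00")]
    last = [b""]  # most recent key per level, used as the separator

    def flush(i):
        comp = zlib.compress(bytes(blocks[i]))
        off = out.tell()
        out.write(comp)
        if i + 1 == len(blocks):
            blocks.append(bytearray())
            last.append(b"")
        # Index entry: packed reference + largest key in the flushed subtree.
        blocks[i + 1] += pack_ref(off, len(comp), i == 0) + last[i] + b"\x00"
        last[i + 1] = last[i]
        blocks[i] = bytearray(b"\x00") if i == 0 else bytearray()

    for key in sorted_keys:
        blocks[0] += key + b"\x00"
        last[0] = key
        i = 0
        while len(blocks[i]) > BLOCK_SIZE:
            flush(i)
            i += 1
    # Flush the remaining blocks bottom-up; flushing the old top level
    # creates a new top holding a single reference to the root block.
    for i in range(len(blocks)):
        flush(i)
    out.write(bytes(blocks[-1][:8]))  # 8-byte trailer: reference to the root

def lookup(data, key):
    off, ln, leaf = unpack_ref(data[-8:])  # trailer points at the root
    while not leaf:
        buf = zlib.decompress(data[off:off + ln])
        pos, found = 0, False
        while pos < len(buf):
            end = buf.index(b"\x00", pos + 8)
            ref, sep = buf[pos:pos + 8], buf[pos + 8:end]
            pos = end + 1
            if key <= sep:  # descend into the child covering this key range
                off, ln, leaf = unpack_ref(ref)
                found = True
                break
        if not found:
            return False  # larger than every separator: not present
    return b"\x00" + key + b"\x00" in zlib.decompress(data[off:off + ln])
```

The flow mirrors the pseudocode above: appends to the leaf level cascade
flushes upward, and the file ends with an 8-byte reference telling the lookup
where to start.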
That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

Lookup is just as easy. I don't even need pseudocode to demonstrate it;
here's the actual implementation:

```perl
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup
    lookup_rec($q, $F, dref $buf)
}
```

The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks
I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).
Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original size.

| Database | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 684 | 6,376 | 0.16 (6.1 s)|
| Naive (gzip) | 246 | 6,340 | 0.12 (8.3 s)|
| B-tree plain 1k | 698 | 6,460 | 17,857.14 |
| B-tree plain 4k | 687 | 6,436 | 9,345.79 |
| B-tree gzip 1k | 261 | 10,772 | 9,345.79 |
| B-tree gzip 4k | 244 | 10,572 | 5,076.14 |
| B-tree zstd 1k | 291 | 6,856 | 12,345.68 |
| B-tree zstd 4k | 268 | 6,724 | 6,944.44 |
| LMDB | 1,282 | 590,792 | 333,333.33 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance.

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original size[^3].

| Database | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 14,968 | 38,536 | 0.01 (110 s)|
| Naive (gzip) | 4,245 | 38,760 | 0.01 (136 s)|
| B-tree plain 1k | 15,377 | 6,288 | 15,384.62 |
| B-tree plain 4k | 15,071 | 6,456 | 8,196.72 |
| B-tree gzip 1k | 4,926 | 10,780 | 7,352.94 |
| B-tree gzip 4k | 4,344 | 10,720 | 4,273.50 |
| B-tree zstd 1k | 5,389 | 6,708 | 10,000.00 |
| B-tree zstd 4k | 4,586 | 6,692 | 5,917.16 |
| LMDB | 26,453 | 3,259,368 | 266,666.67 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements
Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably.
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block references
  and a few more strings to the index blocks, allowing parts of a block to be
  skipped when searching for a key. This allows one to get some of the
  performance benefits of smaller block sizes without paying for the increase
  in database size. Or store multiple (smaller) intermediate B-tree nodes
  inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

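
The first of those bullets can be sketched with `functools.lru_cache`,
assuming zlib-compressed blocks addressed by `(path, offset, length)` (the
helper name and cache size are mine):

```python
import functools, zlib

@functools.lru_cache(maxsize=4096)
def read_block(path, offset, length):
    # Decompress a block at most once; the root and upper index blocks are
    # needed by nearly every lookup, so they effectively stay cached forever.
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))
```
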
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

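
The core of this hash-based layout is replacing the ordered scan of an index
block with direct array indexing; a minimal sketch (the hash choice and name
are mine; note that Python's built-in `hash()` is randomized per process, so a
stable hash is required):

```python
import hashlib

def bucket(key, n_buckets):
    # Map a key to one of n leaf blocks with a stable hash. The index block
    # then becomes a flat array of n_buckets references, read with a single
    # index operation instead of a walk over interleaved separator keys.
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return h % n_buckets
```
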
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently) large [bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter).

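
A minimal Bloom filter sketch (the sizing and double-hashing scheme are mine;
with m bits, k hash functions and n keys the false-positive rate is roughly
(1 − e^(−kn/m))^k, so "sufficiently large" is easy to compute up front):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))
```

Membership tests can return false positives but never false negatives, which
is exactly the right trade-off for a "this password looks compromised"
warning.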
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.

[^1]: Yeah, people still use Perl 5.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
strings is a pretty challenging problem in its own right, but enough has been
written about that. Arguably, enough has been written about B-trees and
databases as well.

[^3]: Running these benchmarks was a bit of a nightmare as I was running low on
free space on my SSD. I had to delete some unused build artefacts from other
projects in an emergency in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, upon which all the others are based.
[Ncdu](/ncdu) has saved this experiment; its author deserves a tasty pizza
for dinner today.

@@ -20,6 +20,10 @@ software or the incidental article on this site. Enjoy your stay!
 <!-- These announcements are parsed by mkfeed.pl, see that file for formatting -->
 ## Announcements <a href="/feed.atom"><img src="/img/feed_icon.png" alt="Atom feed"></a>
+`2019-05-14` - New article: Fast Key Lookup with a Small Read-Only Database <!-- link: /doc/pwlookup -->
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary. Some code and a few benchmarks. [Read more.](/doc/pwlookup)
+
 `2019-04-30` - ncdc 1.22 released <!-- tags: ncdc, link: /ncdc -->
 : There has been a little bit of renewed interest since the 1.21 release, so
 we have a few (small) new features and improvements. It's now possible to

@@ -44,6 +44,8 @@ header p b { display: block; margin-top: 10px; margin-bottom: 2px }
 b, strong { font-weight: bold }
 em, i, i a, em a { font-style: italic }
+sup { font-size: 80%; font-weight: bold }
+a.footnoteRef { text-decoration: none }
 main h1.title { margin-top: 0; font-size: 195% }
 main h1 { font-size: 150%; color: #000; margin: 2em 0 .3em 0; text-decoration: none }
@@ -52,7 +54,7 @@ main h3 { font-size: 120%; color: #000; margin: 1em 0 .3em 0; text-decoration: n
 main code { font-family: monospace }
 main pre { font-family: monospace; font-size: 80%; margin: 0 0 10px 18px; display: block; padding: 0 0 0 15px; border-left: 1px dotted #999; overflow-x: auto }
 main pre * { font-size: inherit; font-family: inherit }
-main p, main figure, main ul, main ol, main dl, main pre, main figure { margin-bottom: 0.7em; margin-left: 1em }
+main p, main figure, main ul, main ol, main dl, main pre, main figure, main table { margin-bottom: 0.7em; margin-left: 1em }
 main ul ul { margin-bottom: 0.5em }
 main ul, main ol { margin-left: 2.5em }
 main li { margin-bottom: .1em }
@@ -61,6 +63,13 @@ main ul p, main ol p, main dl p { margin-left: 0 }
 main ul ul, main dd ul { margin-left: 1em }
 main dt { margin-bottom: .1em; }
 main figcaption { display: none }
+main table th, main table td { font-size: 80%; padding: 1px 7px }
+main table th { font-weight: bold }
+main section.footnotes hr { display: none }
+main section.footnotes { margin: 40px 10px 10px 10px }
+main section.footnotes p, main section.footnotes code { font-size: 80% }
+main section.footnotes em, main section.footnotes a { font-size: inherit }
+
 main img.right { float: right; margin: 0 0 5px 10px }
 main .sig { vertical-align: super }