diff --git a/.gitignore b/.gitignore index e9471d1..486390b 100644 --- a/.gitignore +++ b/.gitignore @@ -46,6 +46,7 @@ pub/doc/commvis.html pub/doc/dcstats.html pub/doc/easyipc.html pub/doc/funcweb.html +pub/doc/pwlookup.html pub/doc/sqlaccess.html pub/dump.html pub/dump/awshrink.html diff --git a/Makefile b/Makefile index 68401a5..e58fc1e 100644 --- a/Makefile +++ b/Makefile @@ -19,6 +19,7 @@ PAGES=\ "doc/dcstats.md"\ "doc/easyipc.md"\ "doc/funcweb.md"\ + "doc/pwlookup.md"\ "doc/sqlaccess.md"\ "dump.md"\ "dump/awshrink.md"\ diff --git a/dat/doc.md b/dat/doc.md index fdbabac..f0fda50 100644 --- a/dat/doc.md +++ b/dat/doc.md @@ -6,6 +6,10 @@ rare occasions are published on this page. ## Articles That May As Well Be Considered Blog Posts +`2019-05-14` - [Fast Key Lookup with a Small Read-Only Database](/doc/pwlookup) +: How to quickly check if a password is in a large (but nicely compressed) + dictionary. + `2017-05-28` - [An Opinionated Survey of Functional Web Development](/doc/funcweb) : The title says it all. diff --git a/dat/doc/pwlookup.md b/dat/doc/pwlookup.md new file mode 100644 index 0000000..86085b6 --- /dev/null +++ b/dat/doc/pwlookup.md @@ -0,0 +1,255 @@ +% Fast Key Lookup with a Small Read-Only Database + +(Published on **2019-05-14**) + +I was recently thinking about how to help users on +[VNDB.org](https://vndb.org/) improve their password security. One idea is to +show a warning when someone's password is already in some kind of database. +While there are several online APIs we can query for this purpose, I chose to +ignore these (apart from the potential privacy implications, I'd also rather +avoid dependencies on external services when possible). As an alternative, +there are several password dictionaries available for download that can be used +for offline querying. I decided to go with the free [CrackStation password +dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm). 
+
+The dictionary comes in the form of a compressed text file with one password
+per line. This is a great format to keep the size of the dictionary small, but
+it's a really shitty format for fast querying. The obvious solution is to
+import the dictionary into some database (any RDBMS or NoSQL store of your
+choice), but those tend to be relatively heavyweight solutions in terms of
+code base, operational dependencies, database size, or all of the above.
+Generic databases are typically optimized to support (moderately) fast
+inserts, updates and deletes, which means they often can't pack data very
+tightly. Some databases do support compression, though.
+
+So all I need is a small, compressed database that supports only a single
+operation: check if a key (a password, in this case) is present. Surely I
+could quickly hack together a simple database that provides a good
+size/performance ratio?
+
+## My Little B-tree
+
+I had some misgivings about implementing a
+[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
+data structure, but I've written enough off-by-one errors in my life to be
+wary of actually implementing one, and given this use case, I felt that there
+had to be an even simpler solution. After thinking up and rejecting some
+alternative strategies, and after realizing that I really only needed to
+support two operations - bulk data insertion and exact key lookup - I came to
+the conclusion that a B-tree isn't such a bad idea after all. One major
+advantage of B-trees is that they provide a natural way to split up the data
+into fixed-size blocks, and that is a *very* useful property if we want to
+use compression.
+
+The database format I came up with is basically a concatenation of compressed
+blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
+blocks*.
+
+The leaf blocks contain all the password strings.
I initially tried to encode
+the passwords as a concatenated sequence of length-prefixed strings, but I
+could not find a way to quickly parse and iterate through that data in Perl 5
+(the target language for this little project[^1]). As it turns out, doing a
+substring match on the full leaf block is faster than trying to work with
+[unpack](https://perldoc.pl/functions/unpack) and
+[substr](https://perldoc.pl/functions/substr), even if a naive substring
+match can't take advantage of the string ordering to skip over processing
+when it has found a "larger" key. So I made sure that the leaf block starts
+with a null byte and encoded the passwords as a sequence of null-terminated
+strings. That way we can reliably find the key by doing a substring search
+on `"\x00${key}\x00"`.
+
+The index blocks are the intermediate B-tree nodes and consist of a sorted
+list of block references interleaved with keys. Like this:
+
+```perl
+$block1    # Contains all the passwords "less than" $password1
+$password1
+$block2    # Contains all the passwords "less than" $password2
+$password2
+$block3    # Contains all the passwords "greater than" $password2
+```
+
+Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
+offset of the block relative to the start of the database file, the
+compressed byte length of the block, and whether it is a leaf or index block.
+Strictly speaking, only the offset is necessary; the length and leaf flag
+could be stored alongside the block itself, but keeping them in the reference
+saves a separate decoding step. The `$passwordN` keys are encoded as
+null-terminated strings, if only for consistency with the leaf blocks.
+
+Finally, at the end of the file, we store a reference to the *parent* block,
+so that our lookup algorithm knows where to start its search.
+
+Given a sorted list of passwords[^2], creating a database with that format is
+relatively easy.
The algorithm goes roughly as
+follows (apologies for the pseudocode; the actual code is slightly messier in
+order to do the proper data encoding):
+
+```perl
+my @blocks; # Stack of blocks, $blocks[0] is a leaf block.
+
+for my $password (@passwords) {
+    $blocks[0]->append_password($password);
+
+    # Flush blocks when they get large enough.
+    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
+        my $reference = $blocks[$i]->flush();
+        $blocks[$i+1]->append_block_reference($reference);
+    }
+}
+
+# Flush the remaining blocks, bottom-up. The reference to the top-most
+# block is where a lookup starts its search.
+my $reference;
+for my $i (0..$#blocks) {
+    $reference = $blocks[$i]->flush();
+    $blocks[$i+1]->append_block_reference($reference) if $i < $#blocks;
+}
+
+write_parent_node_reference($reference);
+```
+
+That's it, really. No weird tree balancing tricks, no need to "modify" index
+blocks in any other way than appending some data. *Flushing* a block is
+nothing more than compressing the thing, appending it to the database file
+and noting its length and byte offset for future reference.
+
+Lookup is just as easy. I don't even need pseudocode to demonstrate it;
+here's the actual implementation:
+
+```perl
+sub lookup_rec {
+    my($q, $F, $ref) = @_;
+    my $buf = readblock $F, $ref;
+    if($ref->[0]) { # Is this a leaf block?
+        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
+    } else {
+        # This is an index block: walk through the block references and
+        # password strings until we find a string that's larger than our query.
+        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
+            return lookup_rec($q, $F, dref $1) if $q lt $2;
+        }
+        return lookup_rec($q, $F, dref substr $buf, -8);
+    }
+}
+
+# Usage: lookup($query, $database_filename)
+# returns true if $query exists in the database.
+sub lookup {
+    my($q, $f) = @_;
+    open my $F, '<', $f or die $!;
+    # Read the last 8 bytes in the file for the reference to the parent block.
+    sysseek $F, -8, 2 or die $!;
+    die $!
if 8 != sysread $F, (my $buf), 8;
+    # Start the recursive lookup
+    lookup_rec($q, $F, dref $buf)
+}
+```
+
+The full code, including that of the benchmarks below, can be found [on
+git](https://g.blicky.net/pwlookup.git/tree/).
+
+## Benchmarks
+
+I benchmarked my little B-tree implementation with a few different
+compression settings (no compression, gzip and zstandard) and block sizes (1k
+and 4k). For comparison I also added a naive implementation that performs a
+simple linear lookup in the sorted dictionary, and another one that uses
+[LMDB](https://symas.com/lmdb/).
+
+Here are the results with the `crackstation-human-only.txt.gz` dictionary,
+containing 63,941,069 passwords at 247 MiB original size:
+
+| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
+|:----------------|--------------:|------------------:|------------:|
+| Naive (plain)   |           684 |             6,376 | 0.16 (6.1 s)|
+| Naive (gzip)    |           246 |             6,340 | 0.12 (8.3 s)|
+| B-tree plain 1k |           698 |             6,460 |   17,857.14 |
+| B-tree plain 4k |           687 |             6,436 |    9,345.79 |
+| B-tree gzip 1k  |           261 |            10,772 |    9,345.79 |
+| B-tree gzip 4k  |           244 |            10,572 |    5,076.14 |
+| B-tree zstd 1k  |           291 |             6,856 |   12,345.68 |
+| B-tree zstd 4k  |           268 |             6,724 |    6,944.44 |
+| LMDB            |         1,282 |           590,792 |  333,333.33 |
+
+Well shit. My little B-tree experiment does have an awesome size/performance
+ratio when compared to the Naive approach (little surprise there), but the
+performance difference with LMDB is *insane*. Although, really, that isn't
+too surprising either: LMDB is written in C and has been *heavily* optimized
+for performance.
+
+I used the default compression levels of zstd and gzip. I expect that a
+slightly higher compression level for zstd could reduce the database sizes
+to below gzip levels without too much of a performance penalty.
+
+What's curious is that the *B-tree gzip 4k* database is smaller than the
+*Naive (gzip)* one. I wonder if I have a bug somewhere that throws away a
+chunk of the original data.
Or if I somehow ended up using a different
+compression level. Or if gzip is just being weird.
+
+Here's the same benchmark with the `crackstation.txt.gz` dictionary,
+containing 1,212,356,398 passwords at 4.2 GiB original size[^3]:
+
+| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
+|:----------------|--------------:|------------------:|------------:|
+| Naive (plain)   |        14,968 |            38,536 | 0.01 (110 s)|
+| Naive (gzip)    |         4,245 |            38,760 | 0.01 (136 s)|
+| B-tree plain 1k |        15,377 |             6,288 |   15,384.62 |
+| B-tree plain 4k |        15,071 |             6,456 |    8,196.72 |
+| B-tree gzip 1k  |         4,926 |            10,780 |    7,352.94 |
+| B-tree gzip 4k  |         4,344 |            10,720 |    4,273.50 |
+| B-tree zstd 1k  |         5,389 |             6,708 |   10,000.00 |
+| B-tree zstd 4k  |         4,586 |             6,692 |    5,917.16 |
+| LMDB            |        26,453 |         3,259,368 |  266,666.67 |
+
+The main conclusion I draw from this benchmark is that the B-tree
+implementation scales pretty well with increasing database sizes, as one
+would expect. I'm not sure why Perl decided to use more memory for the
+*Naive* benchmarks, but it doesn't really matter.
+
+## Improvements
+
+Is this the best we can do? No way! Let's start with some low-hanging fruit:
+
+- The current lookup function reads the database file from scratch on every
+  lookup. An LRU cache for uncompressed blocks ought to speed things up
+  considerably.
+- Keys in index blocks are repeated in leaf blocks; this isn't really
+  necessary.
+- It's possible to add an "offset inside block" field to the block
+  references and a few more strings to the index blocks, allowing parts of
+  a block to be skipped when searching for the key. This allows one to get
+  some of the performance benefits of smaller block sizes without paying for
+  the increase in database size. Or store multiple (smaller) intermediate
+  B-tree nodes inside a single block. Same thing.
+- The lookup function could be rewritten in a faster language (C/C++/Rust);
+  I'm pretty sure this would be a big win, too.
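The first of those improvements is easy to sketch. Something along these lines could work as a block cache (illustrative only: `cache_get`, `%cache` and `$max_entries` are hypothetical names that don't appear in the actual code; in `lookup_rec` one would wrap the `readblock` call as `cache_get($ref->[1], sub { readblock $F, $ref })`, using the block's byte offset as the cache key):

```perl
use strict;
use warnings;

# A minimal LRU cache for decompressed blocks. Sketch only; none of
# these names exist in the real code.
my %cache;              # cache key => decompressed block data
my @lru;                # cache keys, least recently used first
my $max_entries = 1024; # a few MiB worth of decompressed 1k-4k blocks

sub cache_get {
    my($key, $compute) = @_;
    if(exists $cache{$key}) {
        # Cache hit: move the key to the most-recently-used end.
        @lru = ((grep { $_ ne $key } @lru), $key);
        return $cache{$key};
    }
    # Cache miss: compute (i.e. read and decompress) the value, store
    # it, and evict the least recently used entry if the cache is full.
    my $val = $compute->();
    $cache{$key} = $val;
    push @lru, $key;
    delete $cache{shift @lru} if @lru > $max_entries;
    return $val;
}
```

The linear scan through `@lru` on every hit is itself O(cache size), so a real implementation would probably reach for a linked list or an existing CPAN module such as `Tie::Cache::LRU`, but it shows the idea.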
Thinking beyond B-trees, an alternative and likely much more efficient
+approach is to use a hash function to assign strings to leaf blocks, store
+an array of block pointers in the index blocks (without interleaving keys)
+and then use the hash function to index the array for lookup. This makes it
+harder to cap the size of leaf blocks, but with the small password strings
+that's not likely to be a problem. It does significantly complicate creating
+the database in the first place.
+
+Perhaps an even better approach is to not store the strings in the first
+place and simply use a (sufficiently) large [Bloom
+filter](https://en.wikipedia.org/wiki/Bloom_filter).
+
+But this was just a little side project. My goal was to get 1 ms lookups
+with the least amount of code and with a database size that isn't too far
+off from the compressed dictionary. That goal turned out to be pretty
+unambitious.
+
+
+[^1]: Yeah, people still use Perl 5.
+[^2]: But note that the passwords in the CrackStation dictionary are not
+  sorted according to Perl's string comparison algorithm, so a separate
+  `LC_COLLATE=C sort` command is needed to fix that. Also note that sorting
+  a billion strings is a pretty challenging problem in its own right, but
+  enough has been written about that. Arguably, enough has been written
+  about B-trees and databases as well.
+[^3]: Running these benchmarks was a bit of a nightmare as I was running low
+  on free space on my SSD. I had to delete some unused build artefacts from
+  other projects in an emergency in order for `sort` to be able to finish
+  sorting and writing the *Naive (plain)* database, upon which all the
+  others are based. [Ncdu](/ncdu) saved this experiment; its author deserves
+  a tasty pizza for dinner today.
diff --git a/dat/index.md b/dat/index.md
index a94d150..2af34bf 100644
--- a/dat/index.md
+++ b/dat/index.md
@@ -20,6 +20,10 @@ software or the incidental article on this site. Enjoy your stay!
## Announcements Atom feed +`2019-05-14` - New article: Fast Key Lookup with a Small Read-Only Database +: How to quickly check if a password is in a large (but nicely compressed) + dictionary. Some code and a few benchmarks. [Read more.](/doc/pwlookup) + `2019-04-30` - ncdc 1.22 released : There has been a little bit of renewed interest since the 1.21 release, so we have a few (small) new features and improvements. It's now possible to diff --git a/pub/style.css b/pub/style.css index 432623f..0027dc1 100644 --- a/pub/style.css +++ b/pub/style.css @@ -44,6 +44,8 @@ header p b { display: block; margin-top: 10px; margin-bottom: 2px } b, strong { font-weight: bold } em, i, i a, em a { font-style: italic } +sup { font-size: 80%; font-weight: bold } +a.footnoteRef { text-decoration: none } main h1.title { margin-top: 0; font-size: 195% } main h1 { font-size: 150%; color: #000; margin: 2em 0 .3em 0; text-decoration: none } @@ -52,7 +54,7 @@ main h3 { font-size: 120%; color: #000; margin: 1em 0 .3em 0; text-decoration: n main code { font-family: monospace } main pre { font-family: monospace; font-size: 80%; margin: 0 0 10px 18px; display: block; padding: 0 0 0 15px; border-left: 1px dotted #999; overflow-x: auto } main pre * { font-size: inherit; font-family: inherit } -main p, main figure, main ul, main ol, main dl, main pre, main figure { margin-bottom: 0.7em; margin-left: 1em } +main p, main figure, main ul, main ol, main dl, main pre, main figure, main table { margin-bottom: 0.7em; margin-left: 1em } main ul ul { margin-bottom: 0.5em } main ul, main ol { margin-left: 2.5em } main li { margin-bottom: .1em } @@ -61,6 +63,13 @@ main ul p, main ol p, main dl p { margin-left: 0 } main ul ul, main dd ul { margin-left: 1em } main dt { margin-bottom: .1em; } main figcaption { display: none } +main table th, main table td { font-size: 80%; padding: 1px 7px } +main table th { font-weight: bold } + +main section.footnotes hr { display: none } +main section.footnotes { margin: 
40px 10px 10px 10px } +main section.footnotes p, main section.footnotes code { font-size: 80% } +main section.footnotes em, main section.footnotes a { font-size: inherit } main img.right { float: right; margin: 0 0 5px 10px } main .sig { vertical-align: super }