Add doc/pwlookup

This commit is contained in:
Yorhel 2019-05-14 19:43:47 +02:00
parent 9d3d07b590
commit 132f3745e1
6 changed files with 275 additions and 1 deletions

.gitignore

@@ -46,6 +46,7 @@ pub/doc/commvis.html
 pub/doc/dcstats.html
 pub/doc/easyipc.html
 pub/doc/funcweb.html
+pub/doc/pwlookup.html
 pub/doc/sqlaccess.html
 pub/dump.html
 pub/dump/awshrink.html

@@ -19,6 +19,7 @@ PAGES=\
 "doc/dcstats.md"\
 "doc/easyipc.md"\
 "doc/funcweb.md"\
+"doc/pwlookup.md"\
 "doc/sqlaccess.md"\
 "dump.md"\
 "dump/awshrink.md"\

@@ -6,6 +6,10 @@ rare occasions are published on this page.
 ## Articles That May As Well Be Considered Blog Posts
+`2019-05-14` - [Fast Key Lookup with a Small Read-Only Database](/doc/pwlookup)
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary.
+
 `2017-05-28` - [An Opinionated Survey of Functional Web Development](/doc/funcweb)
 : The title says it all.

dat/doc/pwlookup.md (new file, 255 lines)

@@ -0,0 +1,255 @@
% Fast Key Lookup with a Small Read-Only Database
(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password is already in some kind of database.
While there are several online APIs we can query for this purpose, I chose to
ignore these (apart from the potential privacy implications, I'd also rather
avoid dependencies on external services when possible). As an alternative,
there are several password dictionaries available for download that can be used
for offline querying. I decided to go with the free [CrackStation password
dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format to keep the size of the dictionary small, but
it's a really shitty format for fast querying. The obvious solution is to
import the dictionary into some database (any RDBMS or NoSQL store of your
choice), but those tend to be relatively heavyweight solutions in terms of code
base, operational dependencies, database size or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (password, in this case) is present. Surely I could
quickly hack together a simple database that provides a good size/performance
ratio?

## My Little B-tree
I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural solution to split up the data into
fixed-sized blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip over processing when it has
found a "larger" key. So I made sure that the leaf block starts with a null
byte and encoded the passwords as a sequence of null-terminated strings. That
way we can reliably find the key by doing a substring search on
`"\x00${key}\x00"`.

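
For illustration, the leaf-block encoding and the delimiter trick can be
sketched in Python (the article's implementation is Perl 5; the helper names
here are mine):

```python
def encode_leaf(passwords):
    # A leaf block: a leading null byte followed by null-terminated
    # passwords, so every password is delimited by \x00 on both sides.
    return b"\x00" + b"".join(p + b"\x00" for p in passwords)

def leaf_contains(block, password):
    # Exact-match lookup is a plain substring search for "\x00<pw>\x00";
    # prefixes and suffixes of stored passwords can never match by accident.
    return b"\x00" + password + b"\x00" in block
```

The delimiters are what make a naive substring search safe: searching for
`hunter` won't match a block that only contains `hunter2`.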
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:

```perl
$block1 # Contains all the passwords "less than" $password1
$password1
$block2 # Contains all the passwords "less than" $password2
$password2
$block3 # Contains all the passwords "greater than" $password2
```

Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or index block. Strictly
speaking, only the offset is necessary; the length and leaf flag could be
stored alongside the block itself, but this approach saves a separate decoding
step. The `$passwordN` keys are encoded as null-terminated strings, if only
for consistency with the leaf blocks.

Finally, at the end of the file, we store a reference to the *parent* block, so
that our lookup algorithm knows where to start its search.

Given a sorted list of passwords[^2], creating a database with that format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode, the actual code is slightly messier in order to do the proper data
encoding):

```perl
my @blocks; # Stack of blocks, $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($reference);
}

write_parent_node_reference();
```

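
To make the scheme fully concrete, here is a hedged, runnable Python
translation of both the build and the lookup side (the real implementation is
Perl 5 and lives in the linked repository; the bit layout of the references
and the "query ≤ separator" descent rule are my assumptions):

```python
import struct, zlib

BLOCK_SIZE = 1024  # uncompressed flush threshold (the article tests 1k and 4k)

def pack_ref(off, length, leaf):
    # Assumed packing: leaf flag in the top bit, compressed length next,
    # byte offset in the low 40 bits.
    return struct.pack(">Q", (int(leaf) << 63) | (length << 40) | off)

def unpack_ref(b):
    v, = struct.unpack(">Q", b)
    return v & ((1 << 40) - 1), (v >> 40) & ((1 << 23) - 1), bool(v >> 63)

def build(sorted_keys, out):
    # blocks[0] is the current leaf block; higher levels are index blocks.
    blocks = [bytearray(b"\x00")]
    last = [b""]  # most recent key per level, used as the separator

    def flush(i):
        comp = zlib.compress(bytes(blocks[i]))
        off = out.tell()
        out.write(comp)
        if i + 1 == len(blocks):
            blocks.append(bytearray())
            last.append(b"")
        # Index entry: packed reference + largest key in the flushed subtree.
        blocks[i + 1] += pack_ref(off, len(comp), i == 0) + last[i] + b"\x00"
        last[i + 1] = last[i]
        blocks[i] = bytearray(b"\x00") if i == 0 else bytearray()

    for key in sorted_keys:
        blocks[0] += key + b"\x00"
        last[0] = key
        i = 0
        while len(blocks[i]) > BLOCK_SIZE:
            flush(i)
            i += 1
    # Flush the remaining blocks bottom-up; flushing the old top level
    # creates a new top holding a single reference to the root block.
    for i in range(len(blocks)):
        flush(i)
    out.write(bytes(blocks[-1][:8]))  # 8-byte trailer: reference to the root

def lookup(data, key):
    off, ln, leaf = unpack_ref(data[-8:])  # trailer points at the root
    while not leaf:
        buf = zlib.decompress(data[off:off + ln])
        pos, found = 0, False
        while pos < len(buf):
            end = buf.index(b"\x00", pos + 8)
            ref, sep = buf[pos:pos + 8], buf[pos + 8:end]
            pos = end + 1
            if key <= sep:  # descend into the child covering this key range
                off, ln, leaf = unpack_ref(ref)
                found = True
                break
        if not found:
            return False  # larger than every separator: not present
    return b"\x00" + key + b"\x00" in zlib.decompress(data[off:off + ln])
```

The flow mirrors the pseudocode above: appends to the leaf level cascade
flushes upward, and the file ends with an 8-byte reference telling the lookup
where to start.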
That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.

Lookup is just as easy. I don't even need pseudocode to demonstrate it;
here's the actual implementation:

```perl
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup
    lookup_rec($q, $F, dref $buf)
}
```

The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks
I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).
Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original size.

| Database | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 684 | 6,376 | 0.16 (6.1 s)|
| Naive (gzip) | 246 | 6,340 | 0.12 (8.3 s)|
| B-tree plain 1k | 698 | 6,460 | 17,857.14 |
| B-tree plain 4k | 687 | 6,436 | 9,345.79 |
| B-tree gzip 1k | 261 | 10,772 | 9,345.79 |
| B-tree gzip 4k | 244 | 10,572 | 5,076.14 |
| B-tree zstd 1k | 291 | 6,856 | 12,345.68 |
| B-tree zstd 4k | 268 | 6,724 | 6,944.44 |
| LMDB | 1,282 | 590,792 | 333,333.33 |

Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance.

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original size[^3].

| Database | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec |
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 14,968 | 38,536 | 0.01 (110 s)|
| Naive (gzip) | 4,245 | 38,760 | 0.01 (136 s)|
| B-tree plain 1k | 15,377 | 6,288 | 15,384.62 |
| B-tree plain 4k | 15,071 | 6,456 | 8,196.72 |
| B-tree gzip 1k | 4,926 | 10,780 | 7,352.94 |
| B-tree gzip 4k | 4,344 | 10,720 | 4,273.50 |
| B-tree zstd 1k | 5,389 | 6,708 | 10,000.00 |
| B-tree zstd 4k | 4,586 | 6,692 | 5,917.16 |
| LMDB | 26,453 | 3,259,368 | 266,666.67 |

The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.

## Improvements
Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably.
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block references
  and a few more strings to the index blocks, allowing parts of a block to be
  skipped when searching for a key. This allows one to get some of the
  performance benefits of smaller block sizes without paying for the increase
  in database size. Or store multiple (smaller) intermediate B-tree nodes
  inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.

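
The first of those bullets can be sketched with `functools.lru_cache`,
assuming zlib-compressed blocks addressed by `(path, offset, length)` (the
helper name and cache size are mine):

```python
import functools, zlib

@functools.lru_cache(maxsize=4096)
def read_block(path, offset, length):
    # Decompress a block at most once; the root and upper index blocks are
    # needed by nearly every lookup, so they effectively stay cached forever.
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))
```
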
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

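
The core of this hash-based layout is replacing the ordered scan of an index
block with direct array indexing; a minimal sketch (the hash choice and name
are mine; note that Python's built-in `hash()` is randomized per process, so a
stable hash is required):

```python
import hashlib

def bucket(key, n_buckets):
    # Map a key to one of n leaf blocks with a stable hash. The index block
    # then becomes a flat array of n_buckets references, read with a single
    # index operation instead of a walk over interleaved separator keys.
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return h % n_buckets
```
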
Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently) large [bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter).

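
A minimal Bloom filter sketch (the sizing and double-hashing scheme are mine;
with m bits, k hash functions and n keys the false-positive rate is roughly
(1 − e^(−kn/m))^k, so "sufficiently large" is easy to compute up front):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))
```

Membership tests can return false positives but never false negatives, which
is exactly the right trade-off for a "this password looks compromised"
warning.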
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.

[^1]: Yeah, people still use Perl 5.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
strings is a pretty challenging problem in its own right, but enough has been
written about that. Arguably, enough has been written about B-trees and
databases as well.

[^3]: Running these benchmarks was a bit of a nightmare as I was running low on
free space on my SSD. I had to delete some unused build artefacts from other
projects in an emergency in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, upon which all the others are based.
[Ncdu](/ncdu) has saved this experiment; its author deserves a tasty pizza
for dinner today.

@@ -20,6 +20,10 @@ software or the incidental article on this site. Enjoy your stay!
 <!-- These announcements are parsed by mkfeed.pl, see that file for formatting -->
 ## Announcements <a href="/feed.atom"><img src="/img/feed_icon.png" alt="Atom feed"></a>
+`2019-05-14` - New article: Fast Key Lookup with a Small Read-Only Database <!-- link: /doc/pwlookup -->
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary. Some code and a few benchmarks. [Read more.](/doc/pwlookup)
+
 `2019-04-30` - ncdc 1.22 released <!-- tags: ncdc, link: /ncdc -->
 : There has been a little bit of renewed interest since the 1.21 release, so
 we have a few (small) new features and improvements. It's now possible to

@@ -44,6 +44,8 @@ header p b { display: block; margin-top: 10px; margin-bottom: 2px }
 b, strong { font-weight: bold }
 em, i, i a, em a { font-style: italic }
+sup { font-size: 80%; font-weight: bold }
+a.footnoteRef { text-decoration: none }
 main h1.title { margin-top: 0; font-size: 195% }
 main h1 { font-size: 150%; color: #000; margin: 2em 0 .3em 0; text-decoration: none }
@@ -52,7 +54,7 @@ main h3 { font-size: 120%; color: #000; margin: 1em 0 .3em 0; text-decoration: n
 main code { font-family: monospace }
 main pre { font-family: monospace; font-size: 80%; margin: 0 0 10px 18px; display: block; padding: 0 0 0 15px; border-left: 1px dotted #999; overflow-x: auto }
 main pre * { font-size: inherit; font-family: inherit }
-main p, main figure, main ul, main ol, main dl, main pre, main figure { margin-bottom: 0.7em; margin-left: 1em }
+main p, main figure, main ul, main ol, main dl, main pre, main figure, main table { margin-bottom: 0.7em; margin-left: 1em }
 main ul ul { margin-bottom: 0.5em }
 main ul, main ol { margin-left: 2.5em }
 main li { margin-bottom: .1em }
@@ -61,6 +63,13 @@ main ul p, main ol p, main dl p { margin-left: 0 }
 main ul ul, main dd ul { margin-left: 1em }
 main dt { margin-bottom: .1em; }
 main figcaption { display: none }
+main table th, main table td { font-size: 80%; padding: 1px 7px }
+main table th { font-weight: bold }
+main section.footnotes hr { display: none }
+main section.footnotes { margin: 40px 10px 10px 10px }
+main section.footnotes p, main section.footnotes code { font-size: 80% }
+main section.footnotes em, main section.footnotes a { font-size: inherit }
+
 main img.right { float: right; margin: 0 0 5px 10px }
 main .sig { vertical-align: super }