Add doc/pwlookup

parent 9d3d07b590
commit 132f3745e1

6 changed files with 275 additions and 1 deletion
.gitignore (vendored): 1 addition

```diff
@@ -46,6 +46,7 @@ pub/doc/commvis.html
 pub/doc/dcstats.html
 pub/doc/easyipc.html
 pub/doc/funcweb.html
+pub/doc/pwlookup.html
 pub/doc/sqlaccess.html
 pub/dump.html
 pub/dump/awshrink.html
```
Makefile: 1 addition

```diff
@@ -19,6 +19,7 @@ PAGES=\
 	"doc/dcstats.md"\
 	"doc/easyipc.md"\
 	"doc/funcweb.md"\
+	"doc/pwlookup.md"\
 	"doc/sqlaccess.md"\
 	"dump.md"\
 	"dump/awshrink.md"\
```
```diff
@@ -6,6 +6,10 @@ rare occasions are published on this page.
 
 ## Articles That May As Well Be Considered Blog Posts
 
+`2019-05-14` - [Fast Key Lookup with a Small Read-Only Database](/doc/pwlookup)
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary.
+
 `2017-05-28` - [An Opinionated Survey of Functional Web Development](/doc/funcweb)
 : The title says it all.
```
dat/doc/pwlookup.md: new file, 255 lines. The full article follows:
% Fast Key Lookup with a Small Read-Only Database

(Published on **2019-05-14**)

I was recently thinking about how to help users on
[VNDB.org](https://vndb.org/) improve their password security. One idea is to
show a warning when someone's password is already in some kind of database.
While there are several online APIs we can query for this purpose, I chose to
ignore these (apart from the potential privacy implications, I'd also rather
avoid dependencies on external services when possible). As an alternative,
there are several password dictionaries available for download that can be used
for offline querying. I decided to go with the free [CrackStation password
dictionary](https://crackstation.net/crackstation-wordlist-password-cracking-dictionary.htm).

The dictionary comes in the form of a compressed text file with one password
per line. This is a great format to keep the size of the dictionary small, but
it's a really shitty format for fast querying. The obvious solution is to
import the dictionary into some database (any RDBMS or NoSQL store of your
choice), but those tend to be relatively heavyweight solutions in terms of code
base, operational dependencies, database size or all of the above. Generic
databases are typically optimized to support (moderately) fast inserts, updates
and deletes, which means they often can't pack data very tightly. Some
databases do support compression, though.

So all I need is a small, compressed database that supports only a single
operation: check if a key (password, in this case) is present. Surely I could
quickly hack together a simple database that provides a good size/performance
ratio?
## My Little B-tree

I had some misgivings about implementing a
[B-tree](https://en.wikipedia.org/wiki/B-tree). It may be a relatively simple
data structure, but I've written enough off-by-one errors in my life to be wary
of actually implementing one, and given this use case, I felt that there had to
be an even simpler solution. After thinking up and rejecting some alternative
strategies, and after realizing that I really only needed to support two
operations - bulk data insertion and exact key lookup - I came to the
conclusion that a B-tree isn't such a bad idea after all. One major advantage
of B-trees is that they provide a natural way to split up the data into
fixed-size blocks, and that is a *very* useful property if we want to use
compression.

The database format I came up with is basically a concatenation of compressed
blocks. Blocks come in two forms: leaf blocks and, uh, let's call them *index
blocks*.

The leaf blocks contain all the password strings. I initially tried to encode
the passwords as a concatenated sequence of length-prefixed strings, but I
could not find a way to quickly parse and iterate through that data in Perl 5
(the target language for this little project[^1]). As it turns out, doing a
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip further processing once it
has found a "larger" key. So I made sure that the leaf block starts with a null
byte and encoded the passwords as a sequence of null-terminated strings. That
way we can reliably find the key by doing a substring search on
`"\x00${key}\x00"`.
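To make that encoding concrete, here's a tiny self-contained sketch of the
trick; the sample data and the `in_leaf` helper are mine, not from the actual
code:

```perl
use strict;
use warnings;

# Hypothetical leaf block: a leading null byte followed by a sorted
# sequence of null-terminated password strings.
my @passwords = qw(hunter2 letmein password123);
my $leaf = "\x00" . join("\x00", @passwords) . "\x00";

# Membership test: a plain substring search for "\x00$key\x00".
# The surrounding null bytes prevent partial matches.
sub in_leaf {
    my($block, $key) = @_;
    return index($block, "\x00$key\x00") >= 0;
}

print in_leaf($leaf, "letmein") ? "found\n" : "absent\n";  # found
print in_leaf($leaf, "letme")   ? "found\n" : "absent\n";  # absent
```

Note how the null terminators make a prefix like `letme` miss, which is
exactly why the leading null byte on the block matters.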
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:

```perl
$block1     # Contains all the passwords "less than" $password1
$password1
$block2     # Contains all the passwords "less than" $password2
$password2
$block3     # Contains all the passwords "greater than" $password2
```
Each `$blockN` reference is stored as a 64-bit integer that encodes the byte
offset of the block relative to the start of the database file, the compressed
byte length of the block, and whether it is a leaf or index block. Strictly
speaking only the offset is necessary; the length and leaf flag could be stored
alongside the block itself, but this approach saves a separate decoding step.
The `$passwordN` keys are encoded as null-terminated strings, if only for
consistency with the leaf blocks.

Finally, at the end of the file, we store a reference to the *parent* block, so
that our lookup algorithm knows where to start its search.
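The exact bit layout of the reference isn't spelled out above. As an
illustration only, here is one possible way to pack such a reference with
Perl's `pack` (the field widths are my own guess, not the actual format, and a
64-bit Perl is assumed):

```perl
use strict;
use warnings;

# Hypothetical 64-bit reference layout:
# bits 63..24 = byte offset, bits 23..1 = compressed length, bit 0 = leaf flag.
sub encode_ref {
    my($offset, $length, $is_leaf) = @_;
    return pack 'Q>', ($offset << 24) | ($length << 1) | ($is_leaf ? 1 : 0);
}

sub decode_ref {
    my($packed) = @_;
    my $n = unpack 'Q>', $packed;
    return ($n >> 24, ($n >> 1) & 0x7fffff, $n & 1);  # (offset, length, leaf)
}

my($off, $len, $leaf) = decode_ref(encode_ref(123_456, 4096, 1));
print "$off $len $leaf\n";  # 123456 4096 1
```

Because everything fits in one integer, an index block can hold a reference in
exactly 8 bytes, which is what the lookup code below relies on when it matches
`(.{8})` and slices off `substr $buf, -8`.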
Given a sorted list of passwords[^2], creating a database in this format is
relatively easy. The algorithm goes roughly as follows (apologies for the
pseudocode; the actual code is slightly messier in order to do the proper data
encoding):
```perl
my @blocks; # Stack of blocks, $blocks[0] is a leaf block.

for my $password (@passwords) {
    $blocks[0]->append_password($password);

    # Flush blocks when they get large enough.
    for(my $i=0; $blocks[$i]->length() > $block_size; $i++) {
        my $reference = $blocks[$i]->flush();
        $blocks[$i+1]->append_block_reference($reference);
    }
}

# Flush the remaining blocks; the reference produced by the very last
# flush identifies the parent (root) block.
for my $i (0..$#blocks) {
    my $reference = $blocks[$i]->flush();
    $blocks[$i+1]->append_block_reference($reference);
}

write_parent_node_reference();
```
That's it, really. No weird tree balancing tricks, no need to "modify" index
blocks in any other way than appending some data. *Flushing* a block is
nothing more than compressing the thing, appending it to the database file and
noting its length and byte offset for future reference.
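A flush in this scheme could look roughly like the following sketch, using the
bundled Compress::Zlib module for the gzip variant; `flush_block` and its
return convention are my own invention, not the actual code:

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress);

# Hypothetical flush: compress the block data, append it to the end of
# the database file and return what a parent block needs to reference it:
# [byte offset, compressed length, leaf flag].
sub flush_block {
    my($fh, $data, $is_leaf) = @_;
    my $offset = sysseek $fh, 0, 2;    # seek to end-of-file, note the offset
    my $compressed = compress($data);
    syswrite $fh, $compressed;
    return [ $offset, length $compressed, $is_leaf ];
}
```

Since blocks are only ever appended, the file never needs rewriting, which is
what makes the "no modification" property above possible.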
Lookup is just as easy. I don't even need pseudocode to demonstrate it; here's
the actual implementation (`readblock` reads and decompresses a block and
`dref` decodes an 8-byte block reference, both defined in the full code linked
below):

```perl
sub lookup_rec {
    my($q, $F, $ref) = @_;
    my $buf = readblock $F, $ref;
    if($ref->[0]) { # Is this a leaf block?
        return $buf =~ /\x00\Q$q\E\x00/; # Substring search
    } else {
        # This is an index block, walk through the block references and
        # password strings until we find a string that's larger than our query.
        while($buf =~ /(.{8})([^\x00]*)\x00/sg) {
            return lookup_rec($q, $F, dref $1) if $q lt $2;
        }
        return lookup_rec($q, $F, dref substr $buf, -8)
    }
}

# Usage: lookup($query, $database_filename)
# returns true if $query exists in the database.
sub lookup {
    my($q, $f) = @_;
    open my $F, '<', $f or die $!;
    # Read the last 8 bytes in the file for the reference to the parent block.
    sysseek $F, -8, 2 or die $!;
    die $! if 8 != sysread $F, (my $buf), 8;
    # Start the recursive lookup.
    lookup_rec($q, $F, dref $buf)
}
```
The full code, including that of the benchmarks below, can be found [on
git](https://g.blicky.net/pwlookup.git/tree/).

## Benchmarks

I benchmarked my little B-tree implementation with a few different compression
settings (no compression, gzip and zstandard) and block sizes (1k and 4k). For
comparison I also added a naive implementation that performs a simple linear
lookup in the sorted dictionary, and another one that uses
[LMDB](https://symas.com/lmdb/).

Here are the results with the `crackstation-human-only.txt.gz` dictionary,
containing 63,941,069 passwords at 247 MiB original size.
| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:----------------|--------------:|------------------:|-------------:|
| Naive (plain)   |           684 |             6,376 | 0.16 (6.1 s) |
| Naive (gzip)    |           246 |             6,340 | 0.12 (8.3 s) |
| B-tree plain 1k |           698 |             6,460 |    17,857.14 |
| B-tree plain 4k |           687 |             6,436 |     9,345.79 |
| B-tree gzip 1k  |           261 |            10,772 |     9,345.79 |
| B-tree gzip 4k  |           244 |            10,572 |     5,076.14 |
| B-tree zstd 1k  |           291 |             6,856 |    12,345.68 |
| B-tree zstd 4k  |           268 |             6,724 |     6,944.44 |
| LMDB            |         1,282 |           590,792 |   333,333.33 |
Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the *Naive* approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance.

I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
below gzip levels without too much of a performance penalty.

What's curious is that the *B-tree gzip 4k* database is smaller than the *Naive
(gzip)* one. I wonder if I have a bug somewhere that throws away a chunk of the
original data. Or if I somehow ended up using a different compression level. Or
if gzip is just being weird.

Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
1,212,356,398 passwords at 4.2 GiB original size[^3].
| Database        | DB Size (MiB) | Memory (RES, KiB) | Lookups/sec  |
|:----------------|--------------:|------------------:|-------------:|
| Naive (plain)   |        14,968 |            38,536 | 0.01 (110 s) |
| Naive (gzip)    |         4,245 |            38,760 | 0.01 (136 s) |
| B-tree plain 1k |        15,377 |             6,288 |    15,384.62 |
| B-tree plain 4k |        15,071 |             6,456 |     8,196.72 |
| B-tree gzip 1k  |         4,926 |            10,780 |     7,352.94 |
| B-tree gzip 4k  |         4,344 |            10,720 |     4,273.50 |
| B-tree zstd 1k  |         5,389 |             6,708 |    10,000.00 |
| B-tree zstd 4k  |         4,586 |             6,692 |     5,917.16 |
| LMDB            |        26,453 |         3,259,368 |   266,666.67 |
The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
expect. I'm not sure why Perl decided to use more memory for the *Naive*
benchmarks, but it doesn't really matter.
## Improvements

Is this the best we can do? No way! Let's start with some low-hanging fruit:

- The current lookup function reads the database file from scratch on every
  lookup. An LRU cache for uncompressed blocks ought to speed things up
  considerably.
- Keys in index blocks are repeated in leaf blocks; this isn't really
  necessary.
- It's possible to add an "offset inside block" field to the block
  references and add a few more strings to the index blocks, allowing parts of
  a block to be skipped when searching for the key. This allows one to get
  some of the performance benefits of smaller block sizes without paying for
  the increase in database size. Or store multiple (smaller) intermediate
  B-tree nodes inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.
Thinking beyond B-trees, an alternative and likely much more efficient approach
is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.

Perhaps an even better approach is to not store the strings in the first place
and simply use a (sufficiently) large [Bloom
filter](https://en.wikipedia.org/wiki/Bloom_filter).
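For a sense of what that would look like, here is a minimal (and decidedly
unoptimized) Bloom filter sketch; the filter parameters and helper names are
made up for illustration, and the hash positions are simply sliced out of MD5
digests:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $m = 8 * 1024 * 1024;   # filter size in bits (1 MiB of storage)
my $k = 7;                 # number of hash positions per key
my $bits = "\x00" x ($m / 8);

# Derive $k bit positions from two MD5 digests of the key.
sub positions {
    my($key) = @_;
    my @h = unpack 'N*', md5($key) . md5("salt:$key");
    return map { $h[$_] % $m } 0 .. $k - 1;
}

sub bloom_add {
    my($key) = @_;
    vec($bits, $_, 1) = 1 for positions($key);
}

sub bloom_check {
    my($key) = @_;
    for (positions($key)) {
        return 0 unless vec($bits, $_, 1);   # definitely not present
    }
    return 1;                                # present, or a false positive
}

bloom_add($_) for qw(hunter2 letmein);
print bloom_check("hunter2") ? "maybe\n" : "no\n";  # maybe
print bloom_check("zebra42") ? "maybe\n" : "no\n";  # almost certainly "no"
```

The trade-off is that a Bloom filter can return false positives, which for a
"your password is in a dictionary" warning is a perfectly acceptable failure
mode.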
But this was just a little side project. My goal was to get 1 ms lookups with
the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.
[^1]: Yeah, people still use Perl 5.

[^2]: But note that the passwords in the CrackStation dictionary are not sorted
  according to Perl's string comparison algorithm, so it requires a separate
  `LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
  strings is a pretty challenging problem in its own right, but enough has been
  written about that. Arguably, enough has been written about B-trees and
  databases as well.

[^3]: Running these benchmarks was a bit of a nightmare, as I was running low
  on free space on my SSD. I had to hastily delete some unused build artefacts
  from other projects so that `sort` could finish sorting and writing the
  *Naive (plain)* database, upon which all the others are based. [Ncdu](/ncdu)
  saved this experiment; its author deserves a tasty pizza for dinner today.
```diff
@@ -20,6 +20,10 @@ software or the incidental article on this site. Enjoy your stay!
 <!-- These announcements are parsed by mkfeed.pl, see that file for formatting -->
 ## Announcements <a href="/feed.atom"><img src="/img/feed_icon.png" alt="Atom feed"></a>
 
+`2019-05-14` - New article: Fast Key Lookup with a Small Read-Only Database <!-- link: /doc/pwlookup -->
+: How to quickly check if a password is in a large (but nicely compressed)
+  dictionary. Some code and a few benchmarks. [Read more.](/doc/pwlookup)
+
 `2019-04-30` - ncdc 1.22 released <!-- tags: ncdc, link: /ncdc -->
 : There has been a little bit of renewed interest since the 1.21 release, so
   we have a few (small) new features and improvements. It's now possible to
```
```diff
@@ -44,6 +44,8 @@ header p b { display: block; margin-top: 10px; margin-bottom: 2px }
 b, strong { font-weight: bold }
 em, i, i a, em a { font-style: italic }
+sup { font-size: 80%; font-weight: bold }
+a.footnoteRef { text-decoration: none }
 
 main h1.title { margin-top: 0; font-size: 195% }
 main h1 { font-size: 150%; color: #000; margin: 2em 0 .3em 0; text-decoration: none }
@@ -52,7 +54,7 @@ main h3 { font-size: 120%; color: #000; margin: 1em 0 .3em 0; text-decoration: n
 main code { font-family: monospace }
 main pre { font-family: monospace; font-size: 80%; margin: 0 0 10px 18px; display: block; padding: 0 0 0 15px; border-left: 1px dotted #999; overflow-x: auto }
 main pre * { font-size: inherit; font-family: inherit }
-main p, main figure, main ul, main ol, main dl, main pre, main figure { margin-bottom: 0.7em; margin-left: 1em }
+main p, main figure, main ul, main ol, main dl, main pre, main figure, main table { margin-bottom: 0.7em; margin-left: 1em }
 main ul ul { margin-bottom: 0.5em }
 main ul, main ol { margin-left: 2.5em }
 main li { margin-bottom: .1em }
@@ -61,6 +63,13 @@ main ul p, main ol p, main dl p { margin-left: 0 }
 main ul ul, main dd ul { margin-left: 1em }
 main dt { margin-bottom: .1em; }
 main figcaption { display: none }
+main table th, main table td { font-size: 80%; padding: 1px 7px }
+main table th { font-weight: bold }
+
+main section.footnotes hr { display: none }
+main section.footnotes { margin: 40px 10px 10px 10px }
+main section.footnotes p, main section.footnotes code { font-size: 80% }
+main section.footnotes em, main section.footnotes a { font-size: inherit }
 
 main img.right { float: right; margin: 0 0 5px 10px }
 main .sig { vertical-align: super }
```