dat/pwlookup.md: Some fixes and improvements

This commit is contained in:
Yorhel 2019-05-17 12:35:47 +02:00
parent 132f3745e1
commit 0e30dd4d99


@@ -52,11 +52,10 @@ could not find a way to quickly parse and iterate through that data in Perl 5
substring match on the full leaf block is faster than trying to work with
[unpack](https://perldoc.pl/functions/unpack) and
[substr](https://perldoc.pl/functions/substr), even if a naive substring match
can't take advantage of the string ordering to skip over data when it has found
a "larger" key. So I made sure that the leaf block starts with a null byte and
encoded the passwords as a sequence of null-terminated strings. That way we can
reliably find the key by doing a substring search on `"\x00${key}\x00"`.
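As a sketch of that leaf-block lookup (Python here for brevity; the actual implementation is Perl 5, and the block contents below are made up):

```python
def leaf_contains(block: bytes, key: bytes) -> bool:
    # The leaf block starts with a null byte and stores passwords as
    # null-terminated strings, so an exact entry always appears as
    # \x00<password>\x00 somewhere in the block.
    return b"\x00" + key + b"\x00" in block

# Hypothetical decompressed leaf block holding three passwords.
block = b"\x00hunter2\x00letmein\x00password\x00"
assert leaf_contains(block, b"letmein")
assert not leaf_contains(block, b"etmein")  # no partial-string false hits
```

The surrounding null bytes are what make a naive substring match exact: a key that is merely a suffix or prefix of a stored password can never match.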
The index blocks are the intermediate B-tree nodes and consist of a sorted list
of block references interleaved with keys. Like this:
@@ -162,19 +161,21 @@ containing 63,941,069 passwords at 247 MiB original size.
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 684 | 6,376 | 0.16 (6.1 s)|
| Naive (gzip) | 246 | 6,340 | 0.12 (8.3 s)|
| B-tree plain 1k | 698 | 6,460 | 17,857 |
| B-tree plain 4k | 687 | 6,436 | 9,345 |
| B-tree gzip 1k | 261 | 10,772 | 9,345 |
| B-tree gzip 4k | 244 | 10,572 | 5,076 |
| B-tree zstd 1k | 291 | 6,856 | 12,345 |
| B-tree zstd 4k | 268 | 6,724 | 6,944 |
| LMDB | 1,282 | 590,792 | 333,333 |
Well shit. My little B-tree experiment does have an awesome size/performance
ratio when compared to the Naive approach (little surprise there), but the
performance difference with LMDB is *insane*. Although, really, that isn't too
surprising either: LMDB is written in C and has been *heavily* optimized for
performance. LMDB's memory usage in this benchmark should be taken with a grain
of salt; its use of [mmap()](https://manned.org/mmap) tends to [throw off
memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).
I used the default compression levels of zstd and gzip. I expect that a
slightly higher compression level for zstd could reduce the database sizes to
@@ -192,13 +193,13 @@ Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
|:----------------|--------------:|------------------:|------------:|
| Naive (plain) | 14,968 | 38,536 | 0.01 (110 s)|
| Naive (gzip) | 4,245 | 38,760 | 0.01 (136 s)|
| B-tree plain 1k | 15,377 | 6,288 | 15,384 |
| B-tree plain 4k | 15,071 | 6,456 | 8,196 |
| B-tree gzip 1k | 4,926 | 10,780 | 7,352 |
| B-tree gzip 4k | 4,344 | 10,720 | 4,273 |
| B-tree zstd 1k | 5,389 | 6,708 | 10,000 |
| B-tree zstd 4k | 4,586 | 6,692 | 5,917 |
| LMDB | 26,453 | 3,259,368 | 266,666 |
The main conclusion I draw from this benchmark is that the B-tree
implementation scales pretty well with increasing database sizes, as one would
@@ -216,10 +217,10 @@ Is this the best we can do? No way! Let's start with some low-hanging fruit:
necessary.
- It's possible to add an "offset inside block" field to the block
  references and a few more strings to the index blocks, allowing parts of
a block to be skipped when searching for the key. This way we can get some of
the performance benefits of smaller block sizes without the costs of an
increase in database size. Or store multiple (smaller) intermediate B-tree
nodes inside a single block. Same thing.
- The lookup function could be rewritten in a faster language (C/C++/Rust);
  I'm pretty sure this would be a big win, too.
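The index-block search described earlier (a sorted list of block references interleaved with keys) boils down to a binary search for the right child. A minimal sketch, with a made-up in-memory layout standing in for the actual on-disk format:

```python
import bisect

def find_child(keys: list, refs: list, key: bytes) -> int:
    # refs has one more entry than keys: refs[i] points to the child
    # holding all strings strictly below keys[i], refs[-1] to the rest.
    # The real on-disk format interleaves refs and keys in one sorted list.
    return refs[bisect.bisect_right(keys, key)]

# Hypothetical index block with three children, split at "g" and "p".
keys = [b"g", b"p"]
refs = [10, 20, 30]  # made-up block numbers
assert find_child(keys, refs, b"apple") == 10
assert find_child(keys, refs, b"letmein") == 20
assert find_child(keys, refs, b"zebra") == 30
```

A lookup walks this from the root down, reading and decompressing one block per level until it reaches a leaf.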
@@ -228,7 +229,7 @@ is to use a hash function to assign strings to leaf blocks, store an array of
block pointers in the index blocks (without interleaving keys) and then use the
hash function to index the array for lookup. This makes it harder to cap the
size of leaf blocks, but with the small password strings that's not likely to
be a problem. It does significantly complicate creating the database in the
first place.
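A sketch of that hash-based variant (the hash choice and block count are made up; any stable hash would do):

```python
import zlib

def leaf_for(key: bytes, n_leaf_blocks: int) -> int:
    # A stable hash maps each password straight to a leaf block, so the
    # index blocks shrink to a flat array of block pointers, no keys.
    # Downside, as noted above: leaf block sizes can no longer be capped
    # by splitting on sorted key ranges.
    return zlib.crc32(key) % n_leaf_blocks

# Database creation and lookup must agree on n_leaf_blocks.
blk = leaf_for(b"hunter2", 1024)
assert 0 <= blk < 1024
assert blk == leaf_for(b"hunter2", 1024)  # deterministic
```
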
Perhaps an even better approach is to not store the strings in the first place
@@ -240,7 +241,7 @@ the least amount of code and with a database size that isn't too far off from
the compressed dictionary. That goal turned out to be pretty unambitious.
[^1]: Yeah, people still use Perl 5 in 2019.
[^2]: But note that the passwords in the CrackStation dictionary are not sorted
according to Perl's string comparison algorithm, so it requires a separate
`LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
@@ -251,5 +252,5 @@ the compressed dictionary. That goal turned out to be pretty unambitious.
free space on my SSD. I had to delete some unused build artefacts from other
projects in an emergency in order for `sort` to be able to finish sorting and
writing the *Naive (plain)* database, upon which all the others are based.
[Ncdu](/ncdu) has saved this experiment, its author deserves an extra tasty
pizza for dinner today.