dat/pwlookup.md: Some fixes and improvements

commit 0e30dd4d99 (parent 132f3745e1)
1 changed file with 29 additions and 28 deletions
@@ -52,11 +52,10 @@ could not find a way to quickly parse and iterate through that data in Perl 5
 substring match on the full leaf block is faster than trying to work with
 [unpack](https://perldoc.pl/functions/unpack) and
 [substr](https://perldoc.pl/functions/substr), even if a naive substring match
-can't take advantage of the string ordering to skip over processing when it has
-found a "larger" key. So I made sure that the leaf block starts with a null
-byte and encoded the passwords as a sequence of null-terminated strings. That
-way we can reliably find the key by doing a substring search on
-`"\x00${key}\x00"`.
+can't take advantage of the string ordering to skip over data when it has found
+a "larger" key. So I made sure that the leaf block starts with a null byte and
+encoded the passwords as a sequence of null-terminated strings. That way we can
+reliably find the key by doing a substring search on `"\x00${key}\x00"`.
 
 The index blocks are the intermediate B-tree nodes and consist of a sorted list
 of block references interleaved with keys. Like this:
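The null-byte framing in the hunk above lends itself to a short sketch. The original code is Perl 5; this is an illustrative Python rendering with made-up helper names, not the post's actual implementation:

```python
def encode_leaf(passwords):
    # Leaf block layout from the post: a leading null byte, then each
    # password followed by a null terminator, so every entry is
    # delimited by null bytes on both sides.
    return b"\x00" + b"".join(p + b"\x00" for p in passwords)

def leaf_contains(block, key):
    # Because every entry is wrapped in null bytes, a plain substring
    # search for "\x00<key>\x00" can only ever match a complete entry,
    # never a suffix or prefix of a longer password.
    return b"\x00" + key + b"\x00" in block

block = encode_leaf([b"hunter2", b"letmein", b"password1"])
assert leaf_contains(block, b"letmein")
assert not leaf_contains(block, b"etmein")  # no partial matches
```

This is why a naive substring match stays reliable even though it can't exploit the sorted order of the block.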
@@ -162,19 +161,21 @@ containing 63,941,069 passwords at 247 MiB original size.
 |:----------------|--------------:|------------------:|------------:|
 | Naive (plain)   | 684           | 6,376             | 0.16 (6.1 s)|
 | Naive (gzip)    | 246           | 6,340             | 0.12 (8.3 s)|
-| B-tree plain 1k | 698           | 6,460             | 17,857.14   |
-| B-tree plain 4k | 687           | 6,436             | 9,345.79    |
-| B-tree gzip 1k  | 261           | 10,772            | 9,345.79    |
-| B-tree gzip 4k  | 244           | 10,572            | 5,076.14    |
-| B-tree zstd 1k  | 291           | 6,856             | 12,345.68   |
-| B-tree zstd 4k  | 268           | 6,724             | 6,944.44    |
-| LMDB            | 1,282         | 590,792           | 333,333.33  |
+| B-tree plain 1k | 698           | 6,460             | 17,857      |
+| B-tree plain 4k | 687           | 6,436             | 9,345       |
+| B-tree gzip 1k  | 261           | 10,772            | 9,345       |
+| B-tree gzip 4k  | 244           | 10,572            | 5,076       |
+| B-tree zstd 1k  | 291           | 6,856             | 12,345      |
+| B-tree zstd 4k  | 268           | 6,724             | 6,944       |
+| LMDB            | 1,282         | 590,792           | 333,333     |
 
 Well shit. My little B-tree experiment does have an awesome size/performance
 ratio when compared to the Naive approach (little surprise there), but the
 performance difference with LMDB is *insane*. Although, really, that isn't too
 surprising either, LMDB is written in C and has been *heavily* optimized for
-performance.
+performance. LMDB's memory usage in this benchmark should be taken with a grain
+of salt, its use of [mmap()](https://manned.org/mmap) tends to [throw off
+memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).
 
 I used the default compression levels of zstd and gzip. I expect that a
 slightly higher compression level for zstd could reduce the database sizes to
@@ -192,13 +193,13 @@ Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
 |:----------------|--------------:|------------------:|------------:|
 | Naive (plain)   | 14,968        | 38,536            | 0.01 (110 s)|
 | Naive (gzip)    | 4,245         | 38,760            | 0.01 (136 s)|
-| B-tree plain 1k | 15,377        | 6,288             | 15,384.62   |
-| B-tree plain 4k | 15,071        | 6,456             | 8,196.72    |
-| B-tree gzip 1k  | 4,926         | 10,780            | 7,352.94    |
-| B-tree gzip 4k  | 4,344         | 10,720            | 4,273.50    |
-| B-tree zstd 1k  | 5,389         | 6,708             | 10,000.00   |
-| B-tree zstd 4k  | 4,586         | 6,692             | 5,917.16    |
-| LMDB            | 26,453        | 3,259,368         | 266,666.67  |
+| B-tree plain 1k | 15,377        | 6,288             | 15,384      |
+| B-tree plain 4k | 15,071        | 6,456             | 8,196       |
+| B-tree gzip 1k  | 4,926         | 10,780            | 7,352       |
+| B-tree gzip 4k  | 4,344         | 10,720            | 4,273      |
+| B-tree zstd 1k  | 5,389         | 6,708             | 10,000      |
+| B-tree zstd 4k  | 4,586         | 6,692             | 5,917       |
+| LMDB            | 26,453        | 3,259,368         | 266,666     |
 
 The main conclusion I draw from this benchmark is that the B-tree
 implementation scales pretty well with increasing database sizes, as one would
@@ -216,10 +217,10 @@ Is this the best we can do? No way! Let's start with some low-hanging fruit:
   necessary.
 - It's possible to encode an "offset inside block" field to the block
   references and add a few more strings to the index blocks, allowing parts of
-  a block can be skipped when searching for the key. This allows one to get
-  some of the performance benefits of smaller block sizes without paying for
-  the increase in database size. Or store multiple (smaller) intermediate
-  B-tree nodes inside a single block. Same thing.
+  a block to be skipped when searching for the key. This way we can get some of
+  the performance benefits of smaller block sizes without the costs of an
+  increase in database size. Or store multiple (smaller) intermediate B-tree
+  nodes inside a single block. Same thing.
 - The lookup function could be rewritten in a faster language (C/C++/Rust), I'm
   pretty sure this would be a big win, too.
 
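The "offset inside block" idea from the hunk above can be sketched as follows. Python for illustration (the project itself is Perl 5), and every name here is hypothetical, not from the actual code. Each index entry pairs an offset into the leaf block with the first key stored at that offset, so a lookup can start its scan partway into the block:

```python
def find_start_offset(index_entries, key):
    # index_entries: sorted list of (offset_in_block, first_key_at_offset).
    # Pick the largest offset whose first key sorts <= the search key,
    # so the substring scan can start there instead of at byte 0.
    start = 0
    for offset, first_key in index_entries:
        if first_key <= key:
            start = offset
        else:
            break
    return start

# Hypothetical index entries for one 12 KiB leaf block.
entries = [(0, b"aaa"), (4096, b"mmm"), (8192, b"ttt")]
assert find_start_offset(entries, b"pass") == 4096
assert find_start_offset(entries, b"zzz") == 8192
```

The extra keys cost a few bytes per index entry, which is how this trades a small size increase for part of the speed benefit of smaller blocks.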
@@ -228,7 +229,7 @@ is to use a hash function to assign strings to leaf blocks, store an array of
 block pointers in the index blocks (without interleaving keys) and then use the
 hash function to index the array for lookup. This makes it harder to cap the
 size of leaf blocks, but with the small password strings that's not likely to
-be a problem. It does significantly complicates creating the database in the
+be a problem. It does significantly complicate creating the database in the
 first place.
 
 Perhaps an even better approach is to not store the strings in the first place
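A minimal sketch of the hash-based layout described above, in Python for illustration. The helper names and the choice of SHA-1 are assumptions, not from the post. The hash picks the leaf block directly, so the index blocks only need a bare array of block pointers, with no interleaved keys to compare against:

```python
import hashlib

NUM_LEAVES = 4  # illustrative; a real database would use far more blocks

def leaf_index(key, num_leaves=NUM_LEAVES):
    # Hash the password to choose a leaf block. Lookup needs no key
    # comparisons at all, just an array index.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % num_leaves

# Building: distribute passwords over the leaves by hash.
leaves = [[] for _ in range(NUM_LEAVES)]
for pw in [b"hunter2", b"letmein", b"password1"]:
    leaves[leaf_index(pw)].append(pw)

# Lookup: hash the key, then search only that one leaf block.
key = b"letmein"
assert key in leaves[leaf_index(key)]
```

The catch mentioned in the text shows up here too: the hash decides block sizes, so a block can't be capped by simply splitting it where the sorted keys happen to fall.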
@@ -240,7 +241,7 @@ the least amount of code and with a database size that isn't too far off from
 the compressed dictionary. That goal turned out to be pretty unambitious.
 
 
-[^1]: Yeah, people still use Perl 5.
+[^1]: Yeah, people still use Perl 5 in 2019.
 [^2]: But note that the passwords in the CrackStation dictionary are not sorted
   according to Perl's string comparison algorithm, so it requires a separate
   `LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
@@ -251,5 +252,5 @@ the compressed dictionary. That goal turned out to be pretty unambitious.
   free space on my SSD. I had to delete some unused build artefacts from other
   projects in an emergency in order for `sort` to be able to finish sorting and
   writing the *Naive (plain)* database, upon which all the others are based.
-  [Ncdu](/ncdu) has saved this experiment, its author deserves a tasty pizza
-  for dinner today.
+  [Ncdu](/ncdu) has saved this experiment, its author deserves an extra tasty
+  pizza for dinner today.