dat/pwlookup.md: Some fixes and improvements

commit 0e30dd4d99 (parent 132f3745e1)
1 changed file with 29 additions and 28 deletions
@@ -52,11 +52,10 @@ could not find a way to quickly parse and iterate through that data in Perl 5
 substring match on the full leaf block is faster than trying to work with
 [unpack](https://perldoc.pl/functions/unpack) and
 [substr](https://perldoc.pl/functions/substr), even if a naive substring match
-can't take advantage of the string ordering to skip over processing when it has
-found a "larger" key. So I made sure that the leaf block starts with a null
-byte and encoded the passwords as a sequence of null-terminated strings. That
-way we can reliably find the key by doing a substring search on
-`"\x00${key}\x00"`.
+can't take advantage of the string ordering to skip over data when it has found
+a "larger" key. So I made sure that the leaf block starts with a null byte and
+encoded the passwords as a sequence of null-terminated strings. That way we can
+reliably find the key by doing a substring search on `"\x00${key}\x00"`.
 
 The index blocks are the intermediate B-tree nodes and consist of a sorted list
 of block references interleaved with keys. Like this:
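The null-byte framing in the hunk above lends itself to a short sketch. The original code is Perl 5; this is an illustrative Python rendering with made-up helper names, not the post's actual implementation:

```python
def encode_leaf(passwords):
    # Leaf block layout from the post: a leading null byte, then each
    # password followed by a null terminator, so every entry is
    # delimited by null bytes on both sides.
    return b"\x00" + b"".join(p + b"\x00" for p in passwords)

def leaf_contains(block, key):
    # Because every entry is wrapped in null bytes, a plain substring
    # search for "\x00<key>\x00" can only ever match a complete entry,
    # never a suffix or prefix of a longer password.
    return b"\x00" + key + b"\x00" in block

block = encode_leaf([b"hunter2", b"letmein", b"password1"])
assert leaf_contains(block, b"letmein")
assert not leaf_contains(block, b"etmein")  # no partial matches
```

This is why a naive substring match stays reliable even though it can't exploit the sorted order of the block.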
@@ -162,19 +161,21 @@ containing 63,941,069 passwords at 247 MiB original size.
 |:----------------|--------------:|------------------:|------------:|
 | Naive (plain)   | 684           | 6,376             | 0.16 (6.1 s)|
 | Naive (gzip)    | 246           | 6,340             | 0.12 (8.3 s)|
-| B-tree plain 1k | 698           | 6,460             | 17,857.14   |
-| B-tree plain 4k | 687           | 6,436             | 9,345.79    |
-| B-tree gzip 1k  | 261           | 10,772            | 9,345.79    |
-| B-tree gzip 4k  | 244           | 10,572            | 5,076.14    |
-| B-tree zstd 1k  | 291           | 6,856             | 12,345.68   |
-| B-tree zstd 4k  | 268           | 6,724             | 6,944.44    |
-| LMDB            | 1,282         | 590,792           | 333,333.33  |
+| B-tree plain 1k | 698           | 6,460             | 17,857      |
+| B-tree plain 4k | 687           | 6,436             | 9,345       |
+| B-tree gzip 1k  | 261           | 10,772            | 9,345       |
+| B-tree gzip 4k  | 244           | 10,572            | 5,076       |
+| B-tree zstd 1k  | 291           | 6,856             | 12,345      |
+| B-tree zstd 4k  | 268           | 6,724             | 6,944       |
+| LMDB            | 1,282         | 590,792           | 333,333     |
 
 Well shit. My little B-tree experiment does have an awesome size/performance
 ratio when compared to the Naive approach (little surprise there), but the
 performance difference with LMDB is *insane*. Although, really, that isn't too
 surprising either, LMDB is written in C and has been *heavily* optimized for
-performance.
+performance. LMDB's memory usage in this benchmark should be taken with a grain
+of salt, its use of [mmap()](https://manned.org/mmap) tends to [throw off
+memory measurements](https://lmdb.readthedocs.io/en/release/#memory-usage).
 
 I used the default compression levels of zstd and gzip. I expect that a
 slightly higher compression level for zstd could reduce the database sizes to
@@ -192,13 +193,13 @@ Here's the same benchmark with the `crackstation.txt.gz` dictionary, containing
 |:----------------|--------------:|------------------:|------------:|
 | Naive (plain)   | 14,968        | 38,536            | 0.01 (110 s)|
 | Naive (gzip)    | 4,245         | 38,760            | 0.01 (136 s)|
-| B-tree plain 1k | 15,377        | 6,288             | 15,384.62   |
-| B-tree plain 4k | 15,071        | 6,456             | 8,196.72    |
-| B-tree gzip 1k  | 4,926         | 10,780            | 7,352.94    |
-| B-tree gzip 4k  | 4,344         | 10,720            | 4,273.50    |
-| B-tree zstd 1k  | 5,389         | 6,708             | 10,000.00   |
-| B-tree zstd 4k  | 4,586         | 6,692             | 5,917.16    |
-| LMDB            | 26,453        | 3,259,368         | 266,666.67  |
+| B-tree plain 1k | 15,377        | 6,288             | 15,384      |
+| B-tree plain 4k | 15,071        | 6,456             | 8,196       |
+| B-tree gzip 1k  | 4,926         | 10,780            | 7,352       |
+| B-tree gzip 4k  | 4,344         | 10,720            | 4,273      |
+| B-tree zstd 1k  | 5,389         | 6,708             | 10,000      |
+| B-tree zstd 4k  | 4,586         | 6,692             | 5,917       |
+| LMDB            | 26,453        | 3,259,368         | 266,666     |
 
 The main conclusion I draw from this benchmark is that the B-tree
 implementation scales pretty well with increasing database sizes, as one would
@@ -216,10 +217,10 @@ Is this the best we can do? No way! Let's start with some low-hanging fruit:
   necessary.
 - It's possible to encode an "offset inside block" field to the block
   references and add a few more strings to the index blocks, allowing parts of
-  a block can be skipped when searching for the key. This allows one to get
-  some of the performance benefits of smaller block sizes without paying for
-  the increase in database size. Or store multiple (smaller) intermediate
-  B-tree nodes inside a single block. Same thing.
+  a block to be skipped when searching for the key. This way we can get some of
+  the performance benefits of smaller block sizes without the costs of an
+  increase in database size. Or store multiple (smaller) intermediate B-tree
+  nodes inside a single block. Same thing.
 - The lookup function could be rewritten in a faster language (C/C++/Rust), I'm
   pretty sure this would be a big win, too.
 
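The "offset inside block" idea from the hunk above can be sketched as follows. Python for illustration (the project itself is Perl 5), and every name here is hypothetical, not from the actual code. Each index entry pairs an offset into the leaf block with the first key stored at that offset, so a lookup can start its scan partway into the block:

```python
def find_start_offset(index_entries, key):
    # index_entries: sorted list of (offset_in_block, first_key_at_offset).
    # Pick the largest offset whose first key sorts <= the search key,
    # so the substring scan can start there instead of at byte 0.
    start = 0
    for offset, first_key in index_entries:
        if first_key <= key:
            start = offset
        else:
            break
    return start

# Hypothetical index entries for one 12 KiB leaf block.
entries = [(0, b"aaa"), (4096, b"mmm"), (8192, b"ttt")]
assert find_start_offset(entries, b"pass") == 4096
assert find_start_offset(entries, b"zzz") == 8192
```

The extra keys cost a few bytes per index entry, which is how this trades a small size increase for part of the speed benefit of smaller blocks.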
@@ -228,7 +229,7 @@ is to use a hash function to assign strings to leaf blocks, store an array of
 block pointers in the index blocks (without interleaving keys) and then use the
 hash function to index the array for lookup. This makes it harder to cap the
 size of leaf blocks, but with the small password strings that's not likely to
-be a problem. It does significantly complicates creating the database in the
+be a problem. It does significantly complicate creating the database in the
 first place.
 
 Perhaps an even better approach is to not store the strings in the first place
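A minimal sketch of the hash-based layout described above, in Python for illustration. The helper names and the choice of SHA-1 are assumptions, not from the post. The hash picks the leaf block directly, so the index blocks only need a bare array of block pointers, with no interleaved keys to compare against:

```python
import hashlib

NUM_LEAVES = 4  # illustrative; a real database would use far more blocks

def leaf_index(key, num_leaves=NUM_LEAVES):
    # Hash the password to choose a leaf block. Lookup needs no key
    # comparisons at all, just an array index.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % num_leaves

# Building: distribute passwords over the leaves by hash.
leaves = [[] for _ in range(NUM_LEAVES)]
for pw in [b"hunter2", b"letmein", b"password1"]:
    leaves[leaf_index(pw)].append(pw)

# Lookup: hash the key, then search only that one leaf block.
key = b"letmein"
assert key in leaves[leaf_index(key)]
```

The catch mentioned in the text shows up here too: the hash decides block sizes, so a block can't be capped by simply splitting it where the sorted keys happen to fall.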
@@ -240,7 +241,7 @@ the least amount of code and with a database size that isn't too far off from
 the compressed dictionary. That goal turned out to be pretty unambitious.
 
 
-[^1]: Yeah, people still use Perl 5.
+[^1]: Yeah, people still use Perl 5 in 2019.
 [^2]: But note that the passwords in the CrackStation dictionary are not sorted
   according to Perl's string comparison algorithm, so it requires a separate
   `LC_COLLATE=C sort` command to fix that. Also note that sorting a billion
@@ -251,5 +252,5 @@ the compressed dictionary. That goal turned out to be pretty unambitious.
   free space on my SSD. I had to delete some unused build artefacts from other
   projects in an emergency in order for `sort` to be able to finish sorting and
   writing the *Naive (plain)* database, upon which all the others are based.
-  [Ncdu](/ncdu) has saved this experiment, its author deserves a tasty pizza
-  for dinner today.
+  [Ncdu](/ncdu) has saved this experiment, its author deserves an extra tasty
+  pizza for dinner today.