Unfortunately, this can lead to slightly confusing scenarios, because
the exact package of the displayed man page is not very well defined.
It's possible that, when browsing from a package listing to a man page,
you may see an included file that does not come from the package you
browsed from.
E.g. https://manned.org/pwrite/5f2909f6 - that man page simply includes
pread.2, but from the URL it's unclear from which package or system it
should be included.
The only way to fix this is to add the package ID to the link format.
Unlike tar, cpio does not have a separate entry for each directory, so
the link resolution can't assume that directory entries exist for each
path component.
I also mistakenly assumed that cpio handled hardlinks similarly to tar,
but that's clearly not the case. libarchive does help a bit, but these
differences still suck.
- Fix segfault on empty output (bug was in XS code)
- Still better end-of-URL detection
- Recognize a few common multicharacter sections in man references
- Grotty escape sequences are now better interpreted. I feel rather
stupid for not realizing the idea behind how those codes are supposed
to work earlier. It finally hit me when I read the BSD ul(1) source
code.
- URL end detection is slightly better (much better than the old C code)
- Man page references with : are recognized now (common in Perl modules).
- More efficient HTML escaping, no need to escape > and ".
There's still a bunch of improvements to make, but I have much more
confidence in the current implementation already.
The previous C code was troublesome.
- Didn't handle long lines
- I couldn't convince myself that it was free of memory safety issues
- Needed improving anyway, there are some formatting bugs. These are
hard to fix in the current code.
I mostly replicated the formatting bugs of the old C implementation in
Rust, and possibly added a few new bugs as well. It's not a significant
improvement right now, more testing and fixing will be needed.
The performance of both implementations is comparable, with the Rust
version being slightly faster in many cases (and slower in some others).
I did spend more time trying to optimize this Rust version than I did
with the old C code. I initially tried a naive-ish conversion of the C
code to Rust, but that turned out to be much slower and I had to resort
to using regexes and different data structures fix that.
HTTPS isn't used, so removing it saves some space.
The std SipHash API has been deprecated, and since hashing performance
isn't exactly critical in this case I've replaced it with SHA1, which
was already being used in man.rs.
Further improvements can be gained by caching the results of
get_contents(), since the same Contents file is often parsed multiple
times in a single cron run. But this is already a significant
achievement.