Primarily aimed at reducing the size of the old 'man' (now: files)
table, using smaller integers to refer to man contents and text fields,
and storing a shorthash as an integer for quick lookups. This better
normalization also removes the need to keep a separate 'man_index' cache
for the search function.
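
A rough sketch of the shorthash idea, for illustration only. It assumes the shorthash is simply the first four bytes of the full SHA-1 digest, read big-endian as a signed 32-bit integer so that it fits a 4-byte integer column. The function name and the byte order are assumptions, not necessarily what the schema actually does.

```rust
/// Derive a 32-bit "shorthash" from a full 20-byte SHA-1 digest.
/// Assumption: first four bytes, big-endian, signed so it fits a
/// 4-byte SQL integer column.
fn shorthash(digest: &[u8; 20]) -> i32 {
    i32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]])
}
```

Since 32 bits can collide, a lookup along these lines would compare the integer first and then confirm against the full hash.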
The old schema wasn't necessarily bad, but I was in the mood for some
optimizations. And a little cleanup.
Probably introduces a bunch of new bugs; I haven't tested this too well.
Also greatly simplified basename_from_filename() because apparently I
couldn't write regexes back then.
(And the removed REFERENCES line is to sync schema.sql with the actual
state of the DB, which doesn't have that constraint for some reason.
I'll probably fix that later.)
We've got a lot of packages in the DB that have long been removed from
the Arch repos. These are still indexed, but won't clutter the package
listing anymore.
Also fixed an issue with packages.id numbers getting rather large
because the indexer allocates a new ID for every package on every
update.
Performance improvement. The ON CONFLICT DO UPDATE was primarily there to
make sure that old man pages would get fixed whenever the indexer did a
better job at detecting the encoding, but there haven't been any relevant
fixes to the indexer lately, so this forced update won't do much now.
Unlike tar, cpio does not have a separate entry for each directory, so
the link resolution can't assume that directory entries exist for each
path component.
I also mistakenly assumed that cpio handled hardlinks similarly to tar,
but that's clearly not the case. libarchive does help a bit, but these
differences still suck.
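
A minimal sketch of link resolution that tolerates missing directory entries. The `Entry` type, the `resolve()` function and the 32-hop limit are all hypothetical and not the indexer's actual code. Hardlinks are left out entirely. The point is only that a path component without an entry of its own is treated as an implicit directory instead of a failure.

```rust
use std::collections::HashMap;

/// Hypothetical entry kinds, keyed by absolute path in the map below.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    File,
    Dir,
    /// Link target, absolute or relative to the link's own directory.
    Symlink(String),
}

/// Resolve `path` to a regular file, following symlinks in any component.
/// Intermediate components without an entry count as implicit directories.
fn resolve(entries: &HashMap<String, Entry>, path: &str) -> Option<String> {
    // Components still to process; the next one is on top of the stack.
    let mut todo: Vec<String> = path
        .split('/')
        .filter(|c| !c.is_empty())
        .rev()
        .map(String::from)
        .collect();
    let mut done: Vec<String> = Vec::new();
    let mut hops = 0;

    while let Some(comp) = todo.pop() {
        match comp.as_str() {
            "." => continue,
            ".." => {
                done.pop();
                continue;
            }
            _ => {}
        }
        done.push(comp);
        let cur = format!("/{}", done.join("/"));
        match entries.get(&cur) {
            Some(Entry::Symlink(target)) => {
                hops += 1;
                if hops > 32 {
                    return None; // give up on symlink loops
                }
                done.pop(); // the link name is replaced by its target
                if target.starts_with('/') {
                    done.clear();
                }
                for c in target.split('/').filter(|c| !c.is_empty()).rev() {
                    todo.push(c.to_string());
                }
            }
            // A regular file can't be used as a directory.
            Some(Entry::File) if !todo.is_empty() => return None,
            Some(Entry::File) | Some(Entry::Dir) => {}
            // The final component must actually exist.
            None if todo.is_empty() => return None,
            // No entry for this directory (the cpio case): keep going.
            None => {}
        }
    }

    let full = format!("/{}", done.join("/"));
    match entries.get(&full) {
        Some(Entry::File) => Some(full),
        _ => None,
    }
}
```

With only `/usr/man` (a symlink to `share/man`) and `/usr/share/man/man1/ls.1` in the map, resolving `/usr/man/man1/ls.1` still works, even though `/usr`, `/usr/share` and friends have no entries.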
HTTPS isn't used, so removing it saves some space.
The std SipHash API has been deprecated, and since hashing performance
isn't exactly critical in this case, I've replaced it with SHA1, which
was already being used in man.rs.
Further improvements can be gained by caching the results of
get_contents(), since the same Contents file is often parsed multiple
times in a single cron run. But this is already a significant
achievement.
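
One way the get_contents() caching could look, as a sketch under assumptions: `ContentsCache` is a made-up wrapper, the loader closure stands in for whatever get_contents() really does, and the `Vec<String>` result type is a placeholder for the parsed Contents data.

```rust
use std::collections::HashMap;

/// Hypothetical cache keyed by the URL (or path) of a Contents file.
struct ContentsCache<F: Fn(&str) -> Vec<String>> {
    load: F,
    seen: HashMap<String, Vec<String>>,
}

impl<F: Fn(&str) -> Vec<String>> ContentsCache<F> {
    fn new(load: F) -> Self {
        ContentsCache { load, seen: HashMap::new() }
    }

    /// Return the parsed contents for `key`, calling the loader only the
    /// first time a given key is requested.
    fn get(&mut self, key: &str) -> &[String] {
        if !self.seen.contains_key(key) {
            let parsed = (self.load)(key);
            self.seen.insert(key.to_string(), parsed);
        }
        &self.seen[key]
    }
}
```

The cache would only need to live for a single cron run, so there is no invalidation to worry about in this sketch.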
The encoding metadata will be very useful in finding badly decoded man
pages. The package 'arch' is necessary to properly identify which
package was used, which is not obvious now that I'm going to switch more
systems to the (more common) x86_64 arch.