That link broke: the /man.<hash>/<system>/<package> format only works
when <hash> is available in the latest version of <system> and
<package>, and I suspect broadening the search to fix that isn't worth
the extra resources. The canonical permalink format doesn't include
<package> and is resilient against this problem.
Seems to be working alright, and it does clean up a few things. The
biggest missing thing right now is schema-based validation for some
query parameters. I'm also seeing opportunities for FU::Pg to act as a
hash/shorthash codec, simplifying some error-prone manual conversions.
A much simpler approach to dead-marking is to do a periodic
UPDATE packages SET dead = true WHERE system = $1;
before running the indexer. Not as efficient in terms of avoiding
database writes, but there's no need to run this very often anyway.
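
As a rough sketch of the mark-then-index idea, with Python's sqlite3 standing in for the real database: flag everything in a system as dead, then let the indexer clear the flag on each package it still sees. The cut-down `packages` table and the `reindex` helper are made up for illustration, and the un-marking step is an assumption about what the indexer does, not a copy of it.

```python
import sqlite3

def reindex(db, system, seen_names):
    """Mark every package of `system` dead, then revive the ones the indexer saw."""
    # Step 1: the periodic blanket update described above.
    db.execute("UPDATE packages SET dead = 1 WHERE system = ?", (system,))
    # Step 2: hypothetical stand-in for the indexer run, which is assumed
    # to clear the flag on every package it encounters.
    db.executemany(
        "UPDATE packages SET dead = 0 WHERE system = ? AND name = ?",
        [(system, name) for name in seen_names],
    )

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE packages (system INTEGER, name TEXT, dead INTEGER NOT NULL DEFAULT 0)"
)
db.executemany(
    "INSERT INTO packages (system, name) VALUES (?, ?)",
    [(1, "curl"), (1, "oldpkg"), (2, "curl")],
)

# The indexer for system 1 only finds "curl" this time around.
reindex(db, 1, ["curl"])
dead = db.execute("SELECT system, name FROM packages WHERE dead = 1").fetchall()
```

Only the package that was not seen stays flagged; packages in other systems are untouched. The cost is one extra write per package per run, which is the inefficiency mentioned above.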
The frontend always stripped off the encodings already, so no point in
keeping that in the DB indices. The full locale was extracted from the
filename, which we still keep, so no information is lost.
SQL "migration" script:
BEGIN;
CREATE INDEX files_tmp_locale ON files (locale);
INSERT INTO locales (locale) VALUES ('pl_PL'), ('is_IS'), ('ko_KR');
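-- obs: obsolete locales and the language each one collapses into.
WITH obs(id, locale, lang) AS (
    -- Locales with an encoding suffix map to the part before the first '.'.
    SELECT id, locale, regexp_replace(locale, '^([^.]+)\..+$', '\1') FROM locales WHERE locale LIKE '%.%'
    UNION ALL
    -- These pseudo-locales map to no language at all.
    SELECT id, locale, '' FROM locales WHERE locale LIKE 'node%' OR locale = 'common'
), rep(old, new) AS (
    -- new is NULL when no locale matches the stripped name.
    SELECT o.id, x.id FROM obs o LEFT JOIN locales x ON x.locale = o.lang
), upd AS (
    UPDATE files SET locale = new FROM rep WHERE locale = old
) DELETE FROM locales WHERE id IN(SELECT id FROM obs);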
DROP INDEX files_tmp_locale;
COMMIT;
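
For reference, the regexp_replace() call in the script can be tried out in isolation. This is a small Python sketch using the same pattern; the `strip_encoding` name and the sample locales are made up for illustration.

```python
import re

def strip_encoding(locale):
    """Drop everything from the first '.' onwards, e.g. a '.UTF-8' suffix."""
    # Same pattern as in the migration: keep the part before the first dot.
    return re.sub(r"^([^.]+)\..+$", r"\1", locale)

# Locales without a dot do not match the pattern and pass through unchanged.
stripped = [strip_encoding(loc) for loc in ["de_DE.UTF-8", "ja_JP.eucJP", "fr"]]
```

Locales that only ever appeared with an encoding have no bare counterpart to map to, which is presumably why the script inserts a few of them up front.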
Going from average ~100ms to ~10ms or so. The previous query had a
tendency to be much slower sometimes, let's see if this cache also takes
care of those outliers.
Migration script:
ALTER TABLE packages ADD COLUMN c_hasman boolean NOT NULL DEFAULT FALSE;
DROP INDEX packages_system_name_key;
CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name) INCLUDE (id, c_hasman, dead);
UPDATE packages SET c_hasman = NOT c_hasman
WHERE c_hasman <> EXISTS(SELECT 1 FROM package_versions pv WHERE pv.package = packages.id AND EXISTS(SELECT 1 FROM files f WHERE f.pkgver = pv.id));
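
The UPDATE above only writes to rows whose cached flag is out of date, so running it again is a no-op. A minimal sketch of that behaviour, using Python's sqlite3 as a stand-in for PostgreSQL with booleans stored as 0/1; the toy schema and data are assumptions, only the UPDATE itself is taken from the script.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE packages (id INTEGER PRIMARY KEY, c_hasman INTEGER NOT NULL DEFAULT 0);
CREATE TABLE package_versions (id INTEGER PRIMARY KEY, package INTEGER);
CREATE TABLE files (pkgver INTEGER);
-- Package 1: stale (has a file, flag unset). Package 2: stale (no file,
-- flag set). Package 3: already correct.
INSERT INTO packages (id, c_hasman) VALUES (1, 0), (2, 1), (3, 1);
INSERT INTO package_versions (id, package) VALUES (10, 1), (20, 2), (30, 3);
INSERT INTO files (pkgver) VALUES (10), (30);
""")

# Flip the flag only where it disagrees with the actual data.
REFRESH = """
UPDATE packages SET c_hasman = NOT c_hasman
 WHERE c_hasman <> EXISTS(
   SELECT 1 FROM package_versions pv
    WHERE pv.package = packages.id
      AND EXISTS(SELECT 1 FROM files f WHERE f.pkgver = pv.id))
"""

first = db.execute(REFRESH).rowcount   # rows touched on the first run
second = db.execute(REFRESH).rowcount  # rows touched when nothing is stale
flags = db.execute("SELECT id, c_hasman FROM packages ORDER BY id").fetchall()
```

The first run corrects the two stale rows, the second run touches nothing.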
Which I broke in 83ab6c3671. Need to find
an alternative approach to detecting dead packages sometime. The 'dead'
flag isn't super important so it can wait.
This provides an almost 2x speedup in man page rendering time and
removes some heuristics to work around bad guesses by grog(1).
Funnily enough, this also fixes rendering of obscure man pages that
happen to use 'grap' macros; grog detected those correctly but my groff
installation doesn't actually support rendering them.
No doubt I broke rendering of other pages, will have to see.
Whether or not the package name itself or the (category,name) tuple
uniquely identified a package within a system has been a source of
confusion for a long time. Back in 03d278e4ff I ended up playing it
"safe" by going for (category,name), but in practice this doesn't make a
whole lot of sense. While it's *possible* for the same package name to
refer to completely different packages in different "categories", in
reality distributions can't sanely support this anyway.
For distributions where the category referred to a repository, the only
cases where the same package name was used in different repos were when
the package had moved from one repo to another. Those should certainly
not be treated as different packages.
For distributions where the category really referred to a category,
there's the Debian approach where the category is purely a tag and
doesn't help identify the package in any way, and then there's FreeBSD
where the category technically ought to be part of the name. There were
a few cases where FreeBSD used categories to separate out different
versions of the same package (e.g. ipv6 vs non-ipv6), but none were
relevant for man pages so I ended up merging those as well.
Getting rid of the categories simplifies and shortens URLs, unclutters
the UI a little bit and merges the packages in listings that should've
been merged all along.
Migration script:
-- Merge packages that are in multiple categories.
-- All versions are moved to the package with the lowest ID.
-- If the same version already exists in a lower ID, the higher-ID version is deleted.
BEGIN;
-- migrate: for each duplicate (old), the lowest-ID package with the same
-- (system, name) (new) and the highest ID below the duplicate (second).
WITH migrate(old, new, second) AS (
    SELECT q.id, MIN(p.id), MAX(p.id)
      FROM packages p
      JOIN packages q ON q.id > p.id AND p.system = q.system AND p.name = q.name
     GROUP BY q.id
-- ded: the surviving package is not dead if one of its duplicates is alive.
), ded(n) AS (
    UPDATE packages SET dead = false
      FROM migrate m
      JOIN packages q ON q.id = m.old
     WHERE packages.id = m.new AND packages.dead AND NOT q.dead
    RETURNING 1
-- mov: move versions over, skipping versions that already exist there.
), mov(n) AS (
    UPDATE package_versions SET package = m.new
      FROM migrate m
     WHERE package_versions.package = m.old
       AND NOT EXISTS(
            SELECT 1
              FROM package_versions v
             WHERE v.package IN(m.new, m.second)
               AND v.version = package_versions.version)
    RETURNING 1
-- del: remove the duplicates themselves.
), del(n) AS (
    DELETE FROM packages WHERE id IN(SELECT old FROM migrate)
    RETURNING 1
) SELECT (SELECT count(*) FROM migrate) AS migrate,
         (SELECT count(*) FROM ded) AS ded,
         (SELECT count(*) FROM mov) AS mov,
         (SELECT count(*) FROM del) AS del;
ALTER TABLE packages DROP CONSTRAINT packages_system_name_category_key;
CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name);
ALTER TABLE packages DROP COLUMN category;
COMMIT;
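
To illustrate what the `migrate` CTE computes, here is a plain-Python sketch of the old-to-new mapping only. It does not cover moving versions or the dead flag. The `merge_targets` helper and the sample rows are hypothetical, and system names are spelled out as strings purely for readability.

```python
from collections import defaultdict

def merge_targets(packages):
    """Map each duplicate package ID to the lowest ID sharing its (system, name)."""
    groups = defaultdict(list)
    for pkg_id, system, name in packages:
        groups[(system, name)].append(pkg_id)
    targets = {}
    for ids in groups.values():
        keep = min(ids)  # the package with the lowest ID survives
        for pkg_id in ids:
            if pkg_id != keep:
                targets[pkg_id] = keep
    return targets

# (id, system, name) rows, as they would look once the category is ignored.
packages = [
    (1, "arch", "curl"),
    (2, "arch", "curl"),
    (3, "arch", "vim"),
    (4, "debian", "curl"),
    (5, "arch", "curl"),
]
targets = merge_targets(packages)
```

Packages 2 and 5 both fold into package 1; the same name in a different system is left alone.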
Removing the JS requirement and (hopefully) providing a more useful
view into the same data.
This view now also lists other man pages that happen to have the same
contents.