The new format allows for downloading and importing only a part of the
database - useful when only the metadata is required - and doesn't
include the wasteful preformatted HTML cache.
This also ensures that the new import.sql script is actually usable and
in sync with the actual database. The old schema.sql was neither.
(And this simplifies my backup scripts)
The downside is that this consumes significant disk space, requires
recreating the entire cache whenever the way pages are rendered changes,
and removes the flexibility to add dynamic render-influencing settings
in the future.
Alas, crawlers are getting more aggressive and I don't like the idea of
adding more invasive anti-bot tech.
This might not be enough in the long term; we also have a few slow SQL
queries that I'm not yet sure how to optimize. But this ought to give us
more time, at least.
Leaving the rest to be formatted as links to the included man page
instead.
Primary reason for this change is to make it possible to cache formatted
man pages, as they now no longer depend on anything except the raw
source of the page itself.
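A minimal sketch of why that independence matters for caching, assuming
the cache is keyed on a hash of the raw page source. The render()
function and the RenderCache class are hypothetical stand-ins, not the
actual formatter or storage used here:

```python
import hashlib


def render(source: str) -> str:
    """Hypothetical stand-in for the real (expensive) man page formatter."""
    escaped = source.replace("&", "&amp;").replace("<", "&lt;")
    return "<pre>" + escaped + "</pre>"


class RenderCache:
    """Cache of formatted output, keyed only by a hash of the raw source."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times render() actually had to run

    def get(self, source: str) -> str:
        # The key depends on nothing but the page source, so the same
        # page always maps to the same cache entry.
        key = hashlib.sha1(source.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = render(source)
        return self._store[key]
```

If the output depended on anything else (request parameters, other
pages), that input would have to become part of the key as well.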
That link broke: the /man.<hash>/<system>/<package> format only works
when <hash> is available in the latest version of <system> and
<package>, and I suspect broadening the search to fix that isn't worth
the extra resources. The canonical permalink format doesn't include
<package> and is resilient against this problem.
Seems to be working alright, and it does clean up a few things. The
biggest missing thing right now is schema-based validation for some
query parameters. I'm also seeing opportunities for FU::Pg to act as a
hash/shorthash codec, simplifying some error-prone manual conversions.
A much simpler approach to dead-marking is to do a periodic
  UPDATE packages SET dead = true WHERE system = $1;
before running the indexer. Not as efficient in terms of avoiding
database writes, but there's no need to run this very often anyway.
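The flow can be sketched roughly as follows. This is an illustration
under stated assumptions: it uses an in-memory SQLite database in place
of PostgreSQL, a packages table reduced to three columns, and a
hypothetical index_package() step that clears the dead flag for every
package the indexer still finds.

```python
import sqlite3


def mark_all_dead(db, system):
    # The periodic step from the commit message: flag everything as dead.
    db.execute("UPDATE packages SET dead = 1 WHERE system = ?", (system,))


def index_package(db, system, name):
    # Assumed indexer behaviour: a package that is seen again is alive.
    cur = db.execute(
        "UPDATE packages SET dead = 0 WHERE system = ? AND name = ?",
        (system, name),
    )
    if cur.rowcount == 0:
        db.execute(
            "INSERT INTO packages (system, name, dead) VALUES (?, ?, 0)",
            (system, name),
        )


def dead_packages(db, system):
    rows = db.execute(
        "SELECT name FROM packages WHERE system = ? AND dead = 1 ORDER BY name",
        (system,),
    )
    return [r[0] for r in rows]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE packages ("
    " system INTEGER, name TEXT, dead INTEGER NOT NULL DEFAULT 0,"
    " UNIQUE (system, name))"
)

# First indexer run: both packages exist upstream.
for name in ("coreutils", "oldpkg"):
    index_package(db, 1, name)

# Later run: mark everything dead first, then index what is still there.
mark_all_dead(db, 1)
for name in ("coreutils", "newpkg"):
    index_package(db, 1, name)
```

After the second run only packages the indexer no longer encountered
remain flagged, at the cost of rewriting every row for that system once.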
The frontend always stripped off the encodings already, so there is no
point in keeping them in the DB indices. The full locale was extracted
from the filename, which we still keep, so no information is lost.
SQL "migration" script:
  BEGIN;
  CREATE INDEX files_tmp_locale ON files (locale);
  INSERT INTO locales (locale) VALUES ('pl_PL'), ('is_IS'), ('ko_KR');
  WITH obs(id, locale, lang) AS (
    SELECT id, locale, regexp_replace(locale, '^([^.]+)\..+$', '\1') FROM locales WHERE locale LIKE '%.%'
    UNION ALL
    SELECT id, locale, '' FROM locales WHERE locale LIKE 'node%' OR locale = 'common'
  ), rep(old, new) AS (
    SELECT o.id, x.id FROM obs o LEFT JOIN locales x ON x.locale = o.lang
  ), upd AS (
    UPDATE files SET locale = new FROM rep WHERE locale = old
  ) DELETE FROM locales WHERE id IN(SELECT id FROM obs);
  DROP INDEX files_tmp_locale;
  COMMIT;
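For reference, the encoding-stripping step of that migration can be
tried outside the database. The sketch below reuses the same regular
expression in Python; the strip_encoding() name and the sample locale
strings are made up for illustration:

```python
import re


def strip_encoding(locale: str) -> str:
    # Same pattern as the regexp_replace() call in the migration: keep
    # everything before the first dot, drop the dot and what follows.
    return re.sub(r"^([^.]+)\..+$", r"\1", locale)
```

Locales without a dot are returned unchanged, which matches the
migration only touching rows WHERE locale LIKE '%.%'.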
Going from average ~100ms to ~10ms or so. The previous query had a
tendency to be much slower sometimes, let's see if this cache also takes
care of those outliers.
Migration script:
  ALTER TABLE packages ADD COLUMN c_hasman boolean NOT NULL DEFAULT FALSE;
  DROP INDEX packages_system_name_key;
  CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name) INCLUDE (id, c_hasman, dead);
  UPDATE packages SET c_hasman = NOT c_hasman
   WHERE c_hasman <> EXISTS(
           SELECT 1 FROM package_versions pv
            WHERE pv.package = packages.id
              AND EXISTS(SELECT 1 FROM files f WHERE f.pkgver = pv.id));