Commit graph

46 commits

Author SHA1 Message Date
Yorhel
c67091f096 indexer: Drop old package-dead-marking code for Arch
A much simpler approach to dead-marking is to do a periodic

  UPDATE packages SET dead = true WHERE system = $1;

before running the indexer. Not as efficient in terms of avoiding
database writes, but there's no need to run this very often anyway.
2024-05-13 08:57:35 +02:00
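Note on c67091f096: the mark-everything-dead-then-reindex idea can be sketched in Rust with an in-memory map standing in for the packages table. The function name `reindex` and the second step (the indexer clearing the flag for packages it still finds) are assumptions made for illustration, not code from the repository.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the `packages` table of one system: name -> dead flag.
/// Hypothetical helper, only meant to illustrate the two-step idea.
fn reindex(packages: &mut HashMap<String, bool>, seen_in_repo: &[&str]) {
    // Step 1: the periodic "UPDATE packages SET dead = true WHERE system = $1".
    for dead in packages.values_mut() {
        *dead = true;
    }
    // Step 2 (assumed): the indexer run marks every package it still finds as alive,
    // inserting packages it has not seen before.
    for name in seen_in_repo {
        packages.insert((*name).to_string(), false);
    }
}
```

Packages that have disappeared from the repo keep `dead = true` after the run, at the cost of rewriting every row of that system once.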
Yorhel
cd5d2c6a20 Remove encodings from "locales" table + delete incorrect locales
The frontend always stripped off the encodings already, so no point in
keeping that in the DB indices. The full locale was extracted from the
filename, which we still keep, so no information is lost.

SQL "migration" script:

  BEGIN;
  CREATE INDEX files_tmp_locale ON files (locale);

  INSERT INTO locales (locale) VALUES ('pl_PL'), ('is_IS'), ('ko_KR');

  WITH obs(id, locale, lang) AS (
    SELECT id, locale, regexp_replace(locale, '^([^.]+)\..+$', '\1') FROM locales WHERE locale LIKE '%.%'
     UNION ALL
    SELECT id, locale, '' FROM locales WHERE locale LIKE 'node%' OR locale = 'common'
  ), rep(old, new) AS (
    SELECT o.id, x.id FROM obs o LEFT JOIN locales x ON x.locale = o.lang
  ), upd AS (
    UPDATE files SET locale = new FROM rep WHERE locale = old
  ) DELETE FROM locales WHERE id IN(SELECT id FROM obs);

  DROP INDEX files_tmp_locale;

  COMMIT;
2024-05-01 17:04:09 +02:00
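Note on cd5d2c6a20: the encoding-stripping done by `regexp_replace(locale, '^([^.]+)\..+$', '\1')` in the migration can be mirrored in Rust roughly like this. `strip_encoding` is a hypothetical name used for illustration; it is not taken from the frontend or the indexer.

```rust
/// Strip the encoding part from a locale, e.g. "de_DE.UTF-8" -> "de_DE".
/// Like the regex, it only rewrites when there is a non-empty part before
/// the first dot and a non-empty part after it; anything else is returned as-is.
fn strip_encoding(locale: &str) -> &str {
    match locale.split_once('.') {
        Some((lang, rest)) if !lang.is_empty() && !rest.is_empty() => lang,
        _ => locale,
    }
}
```

Locales without a dot, such as the `pl_PL` row inserted by the migration, pass through unchanged.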
Yorhel
1ee5c9c2df SQL: Add packages.c_hasman cache to speed up package listings
Going from average ~100ms to ~10ms or so. The previous query had a
tendency to be much slower sometimes, let's see if this cache also takes
care of those outliers.

Migration script:

  ALTER TABLE packages ADD COLUMN c_hasman boolean NOT NULL DEFAULT FALSE;

  DROP INDEX packages_system_name_key;
  CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name) INCLUDE (id, c_hasman, dead);

  UPDATE packages SET c_hasman = NOT c_hasman
   WHERE c_hasman <> EXISTS(SELECT 1 FROM package_versions pv WHERE pv.package = packages.id AND EXISTS(SELECT 1 FROM files f WHERE f.pkgver = pv.id));
2024-04-29 21:15:40 +02:00
Yorhel
fc9a19e7c4 indexer: Disable broken dead-package checking for Arch
Which I broke in 83ab6c3671. Need to find
an alternative approach to detecting dead packages sometime. The 'dead'
flag isn't super important so it can wait.
2024-04-29 10:45:13 +02:00
Yorhel
83ab6c3671 Get rid of package categories
Whether or not the package name itself or the (category,name) tuple
uniquely identified a package within a system has been a source of
confusion for a long time. Back in
03d278e4ff I ended up playing it
"safe" by going for (category,name), but in practice this doesn't make a
whole lot of sense. While it's *possible* for the same package name to
refer to completely different packages in different "categories", in
reality distributions can't sanely support this anyway.

For distributions where the category referred to a repository, the only
cases where the same package name was used in different repos was when
the package has moved from one repo to another. Those should certainly
not be treated as different packages.

For distributions where the category really referred to a category,
there's the Debian approach where the category is purely a tag and
doesn't help identify the package in any way, and then there's FreeBSD
where the category technically ought to be part of the name.  There were
a few cases where FreeBSD used categories to separate out different
versions of the same package (e.g. ipv6 vs non-ipv6), but none were
relevant for man pages so I ended up merging those as well.

Getting rid of the categories simplifies and shortens URLs, unclutters
the UI a little bit and merges the packages in listings that should've
been merged all along.

Migration script:

  -- Merge packages that are in multiple categories.
  -- All versions are moved to the package with the lowest ID.
  -- If the same version already exists in a lower ID, the higher-ID version is deleted.
  BEGIN;
  WITH migrate(old, new, second) AS (
    SELECT q.id, MIN(p.id), MAX(p.id)
      FROM packages p
      JOIN packages q ON q.id > p.id AND p.system = q.system AND p.name = q.name
     GROUP BY q.id
  ), ded(n) AS (
    UPDATE packages SET dead = false
      FROM migrate m
      JOIN packages q ON q.id = m.old
     WHERE packages.id = m.new AND packages.dead AND NOT q.dead
    RETURNING 1
  ), mov(n) AS (
    UPDATE package_versions SET package = m.new
      FROM migrate m
     WHERE package_versions.package = m.old
       AND NOT EXISTS(
          SELECT 1
            FROM package_versions v
           WHERE v.package IN(m.new, m.second)
             AND v.version = package_versions.version)
    RETURNING 1
  ), del(n) AS (
    DELETE FROM packages WHERE id IN(SELECT old FROM migrate)
    RETURNING 1
  ) SELECT (SELECT count(*) FROM migrate) AS migrate,
           (SELECT count(*) FROM ded) AS ded,
           (SELECT count(*) FROM mov) AS mov,
           (SELECT count(*) FROM del) AS del;

  ALTER TABLE packages DROP CONSTRAINT packages_system_name_category_key;
  CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name);
  ALTER TABLE packages DROP COLUMN category;
  COMMIT;
2024-04-28 10:37:04 +02:00
Yorhel
7302a6408a indexer: Skip some more non-manpage files 2023-05-26 12:16:22 +02:00
Yorhel
d19c56f285 Correctly handle a few more mis-identified locales 2021-12-16 13:44:39 +01:00
Yorhel
f376f1f137 Large-ish SQL schema revamp/optimizations
Primarily aimed at reducing the size of the old 'man' (now: files)
table, using smaller integers to refer to man contents and text fields,
and storing a shorthash as an integer for quick lookups. This better
normalization also removes the need to keep a separate 'man_index' cache
for the search function.

The old schema wasn't necessarily bad, but I was in the mood for some
optimizations. And a little cleanup.

Prolly introduces a bunch of new bugs, I haven't tested this too well.
2021-12-14 15:08:54 +01:00
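Note on f376f1f137: a short hash kept as an integer can be converted to and from its textual form as sketched below. The 8-hex-digit form is taken from URLs such as the one quoted in 8d6e7bc2d8; the function names, and how the schema actually derives or stores the value, are assumptions.

```rust
/// Parse an 8-hex-digit short hash (e.g. "ae5d469f") into a 32-bit integer.
/// Returns None for input of the wrong length or with non-hex characters.
fn shorthash_to_int(s: &str) -> Option<u32> {
    if s.len() != 8 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(s, 16).ok()
}

/// Format the integer back into its zero-padded, lowercase hex form.
fn int_to_shorthash(n: u32) -> String {
    format!("{:08x}", n)
}
```

Comparing 4-byte integers is cheaper than comparing hex strings, which is presumably the point of the "quick lookups" mentioned in the commit.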
Yorhel
7648603685 Recognize .zst-compressed man pages + fix SQL basename_from_filename() to recognize .xz
Also greatly simplified basename_from_filename() because apparently I
couldn't write regexes back then.

(And the removed REFERENCES line is to sync schema.sql with the actual
state of the DB, which doesn't have that constraint for some reason.
I'll prolly fix that later)
2021-12-13 18:16:16 +01:00
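Note on 7648603685: the actual `basename_from_filename()` is an SQL function whose definition is not shown here. The Rust sketch below only illustrates the general idea of turning a man page path into a page name; the function name, the list of compression suffixes (apart from `.xz` and `.zst`, which the commit names) and the exact rules are assumptions.

```rust
/// Hypothetical sketch: "/usr/share/man/man1/ls.1.gz" -> "ls".
fn basename_sketch(path: &str) -> &str {
    // Keep only the final path component.
    let file = path.rsplit('/').next().unwrap_or(path);
    // Drop at most one known compression suffix (assumed list).
    const COMPRESSED: [&str; 5] = [".gz", ".bz2", ".xz", ".zst", ".lzma"];
    let mut name = file;
    for ext in COMPRESSED.iter() {
        if let Some(stripped) = name.strip_suffix(*ext) {
            name = stripped;
            break;
        }
    }
    // Drop the section suffix (".1", ".3pm", ...): everything from the last dot on.
    match name.rfind('.') {
        Some(i) if i > 0 => &name[..i],
        _ => name,
    }
}
```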
Yorhel
b27d55215a Arch: Mark deleted packages as dead and hide them from listings
We've got a lot of packages in the DB that have long been removed from
the Arch repos. These are still indexed, but won't clutter the package
listing anymore.

Also fixed an issue with packages.id numbers getting rather large
because the indexer allocates a new ID for every package on every
update.
2021-12-13 08:18:17 +01:00
Yorhel
82a626b7d4 indexer+www: Support Alpine Linux repos 2021-12-11 16:18:07 +01:00
Yorhel
c9e81a8922 indexer: More crate updates + warning fixes + 2018 edition 2021-12-11 14:56:22 +01:00
Yorhel
c48feedc85 indexer: Switch to ureq + debloat stuff a bit
And stop using the "url" crate directly, its API is too unstable for it
to be worth using.

...that applies to several other crates as well, but meh.
2021-12-11 12:26:57 +01:00
Yorhel
4588e67b64 Make the Rust garbage compile again 2021-12-11 11:53:26 +01:00
Yorhel
ce38ff885f indexer: Don't overwrite man page contents when hash already exists
Performance improvement. The ON CONFLICT DO UPDATE was primarily to make
sure that old man pages would get fixed when the indexer did a better
job at detecting the encoding, but there haven't been any relevant fixes
to the indexer lately so this forced-update won't do much now.
2019-05-25 08:47:21 +02:00
Yorhel
2974ee929e Rust: hyper -> reqwest for the indexer
Since Hyper doesn't provide a synchronous API anymore.
2019-05-25 08:44:45 +02:00
Yorhel
f0df5092c3 Rust dep updates 2019-05-25 08:27:23 +02:00
Yorhel
7aa89145ca indexer: Re-use memory buffer when reading RPM repo data
This avoids reading the entire uncompressed XML into a buffer.
2018-05-04 15:25:35 +02:00
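Note on 7aa89145ca: one way to avoid holding a whole decompressed document in memory is to pull it through a single, re-used buffer and hand each chunk to the consumer. The sketch below uses only `std::io::Read`; `for_each_chunk` is a hypothetical helper and not the indexer's actual code.

```rust
use std::io::{self, Read};

/// Feed everything from `rd` to `handle` in chunks, re-using the caller's
/// buffer instead of collecting the input with read_to_end().
/// Returns the total number of bytes processed.
fn for_each_chunk<R: Read, F: FnMut(&[u8])>(
    mut rd: R,
    buf: &mut [u8],
    mut handle: F,
) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = match rd.read(buf) {
            Ok(0) => return Ok(total), // end of input (or zero-length buffer)
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        handle(&buf[..n]);
        total += n as u64;
    }
}
```

Memory use is bounded by the buffer size regardless of how large the uncompressed input is, as long as the consumer (for example a streaming XML parser) does not keep the chunks around.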
Yorhel
2c7bf1507a indexer: Update crates to latest version
With the exception of Hyper, because the new tokio-based version is...
different.
2018-03-25 10:36:29 +02:00
Yorhel
8235fb28b8 indexer: Fix link resolution and hardlink handling for rpm
Unlike tar, cpio does not have a separate entry for each directory, so
the link resolution can't assume that directory entries exist for each
path component.

I also mistakenly assumed that cpio handled hardlinks similarly to tar,
but that's clearly not the case. libarchive does help a bit, but these
differences still suck.
2017-01-18 13:07:42 +01:00
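Note on 8235fb28b8: when an archive format does not store an entry for every directory, the directories can be derived from the member paths themselves. The sketch below shows one possible way to do that; `implied_dirs` is a hypothetical helper, and the commit does not say that the indexer solves the problem this way.

```rust
use std::collections::BTreeSet;

/// Collect every ancestor directory implied by a list of archive member paths,
/// so that link resolution does not depend on explicit directory entries.
fn implied_dirs(paths: &[&str]) -> BTreeSet<String> {
    let mut dirs = BTreeSet::new();
    for path in paths {
        let mut cur = path.trim_matches('/');
        // Walk upwards one component at a time: "a/b/c" -> "a/b" -> "a".
        while let Some(i) = cur.rfind('/') {
            cur = &cur[..i];
            dirs.insert(cur.to_string());
        }
    }
    dirs
}
```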
Yorhel
608f79eb93 indexer: Add support for indexing RPM repositories
This code hasn't been thoroughly tested, I'll see how things go when
indexing a live repo.

And XML parsing sucks in every language.
2017-01-17 17:05:03 +01:00
Yorhel
f77db5f541 indexer: Add bare RPM directory indexing
This is for a few special cases, most RPM repos will have proper
metadata and all.
2017-01-17 12:50:25 +01:00
Yorhel
d720441fb4 indexer: Rust crate updates 2017-01-17 11:01:11 +01:00
Yorhel
8d6e7bc2d8 indexer: Prioritize 7bit encodings when decoding man pages
Fixes parsing of https://manned.org/xshisen/ae5d469f
2016-12-29 09:27:19 +01:00
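Note on 8d6e7bc2d8: the commit does not spell out the detection order, so the following is only a sketch of what "checking 7-bit encodings first" could look like using the standard library alone. The enum, the function name and the use of the ESC byte as a hint for escape-based 7-bit encodings (such as the ISO-2022 family) are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum Guess {
    Ascii,           // pure 7-bit, no escape sequences
    SevenBitEscaped, // pure 7-bit but contains ESC, e.g. an ISO-2022 style encoding
    Utf8,
    EightBitLegacy,  // needs a real charset detector to narrow down further
}

/// Classify raw man page bytes, testing the 7-bit cases before anything else.
fn guess_encoding(data: &[u8]) -> Guess {
    if data.is_ascii() {
        if data.contains(&0x1b) {
            Guess::SevenBitEscaped
        } else {
            Guess::Ascii
        }
    } else if std::str::from_utf8(data).is_ok() {
        Guess::Utf8
    } else {
        Guess::EightBitLegacy
    }
}
```

Checking the 7-bit cases first matters because such input is also byte-for-byte valid under most 8-bit encodings and under UTF-8, so a detector that tries those first would never report it.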
Yorhel
eac4b6ac77 Don't index ELF binaries + remove some non-man-pages 2016-12-18 16:35:25 +01:00
Yorhel
d153004532 indexer: Support FreeBSD 9.3+; remove now obsolete add_index.pl 2016-12-18 15:08:56 +01:00
Yorhel
b9764fce4a indexer: Remove openssl + replace siphash with sha1 in cache filename
HTTPS isn't used, so removing it saves some space.

The std SipHash API has been deprecated, and since hashing performance
isn't exactly critical in this case I've replaced it with SHA1, which
was already being used in man.rs.
2016-12-11 13:41:10 +01:00
Yorhel
defaa032f8 indexer: Support for indexing FreeBSD <9.3 repositories 2016-12-11 10:59:54 +01:00
Yorhel
1ca0cd4325 Indexer: Remove pointless check 2016-11-27 10:59:31 +01:00
Yorhel
b79ecfb284 indexer: Fix bug in Contents file parsing + decrease cron verbosity
Turns out that not all Contents files have a header.
2016-11-27 10:48:35 +01:00
Yorhel
eb15b6e2c7 indexer: Improve Debian Contents file parsing performance by 5.2x
Further improvements can be gained by caching the results of
get_contents(), since the same Contents file is often parsed multiple
times in a single cron run. But this is already a significant
achievement.
2016-11-26 16:57:05 +01:00
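Note on eb15b6e2c7: the caching suggested in the message could be a simple memoizing wrapper around the parser. The commit only names `get_contents()`; its signature, the `ContentsCache` type and the use of a key string plus a list of paths are assumptions made for this sketch.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Memoizes a (hypothetical) Contents parser for the duration of one run.
struct ContentsCache<F: FnMut(&str) -> Vec<String>> {
    parse: F,
    cache: HashMap<String, Rc<Vec<String>>>,
}

impl<F: FnMut(&str) -> Vec<String>> ContentsCache<F> {
    fn new(parse: F) -> Self {
        ContentsCache { parse, cache: HashMap::new() }
    }

    /// Return the parsed result for `key`, parsing only on the first request.
    fn get(&mut self, key: &str) -> Rc<Vec<String>> {
        if let Some(hit) = self.cache.get(key) {
            return Rc::clone(hit);
        }
        let parsed = Rc::new((self.parse)(key));
        self.cache.insert(key.to_string(), Rc::clone(&parsed));
        parsed
    }
}
```

Sharing the result through `Rc` means repeated lookups of the same Contents file cost a hash lookup and a reference-count increment rather than another parse.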
Yorhel
de28175cd3 Misc. indexing fixes 2016-11-20 16:41:08 +01:00
Yorhel
5d44d0e2ec Indexer: Add --dryrun and workarounds for old deb repos 2016-11-20 11:39:00 +01:00
Yorhel
ecb1a9e25b Indexer: Support reading date from .deb archives 2016-11-20 09:01:33 +01:00
Yorhel
a1e5a2d80d Indexer: Improve logging + cache management 2016-11-20 07:31:55 +01:00
Yorhel
4bdd91f65e Indexer: Initial support for debian repos 2016-11-19 15:27:24 +01:00
Yorhel
50fe17a604 Indexer: Support .deb archives 2016-11-15 21:15:35 +01:00
Yorhel
20141aa980 indexer: Improve charset detection + lower file cache time 2016-11-09 18:41:53 +01:00
Yorhel
7d2abfb3a4 indexer: Fix storing locale as NULL when empty
Perhaps it's better to get rid of NULL and make empty the default value.
But for now this'll do.
2016-11-06 16:24:45 +01:00
Yorhel
cb81bedac1 Add arch/encoding metadata to DB + Fetch Arch Linux x86_64
The encoding metadata will be very useful in finding badly decoded man
pages. The package 'arch' is necessary to properly identify which
package was used, which is not obvious now that I'm going to switch more
systems to the (more common) x86_64 arch.
2016-11-06 16:05:16 +01:00
Yorhel
1ca43665a1 indexer: Add file caching + Arch Linux indexing 2016-11-06 13:34:22 +01:00
Yorhel
35fab522d6 Indexer: Support HTTP fetching + misc improvements 2016-11-06 09:21:53 +01:00
Yorhel
aff68205b0 Add postgres package indexing + cli options 2016-11-05 10:22:31 +01:00
Yorhel
0cab758665 Add support for man page reading & decoding 2016-10-30 11:06:14 +01:00
Yorhel
c8bb4da246 Use libarchive3-sys crate directly + improve archread API
This all should offer a more convenient and robust interface to handle
all sorts of archives.
2016-10-29 09:33:39 +02:00
Yorhel
022e9acc4f WIP: Rewritten man page indexer in Rust
Currently just figuring out how to read archives. Turns out to not be as
simple as I had expected.
2016-10-22 14:54:37 +02:00