manned

Author	SHA1	Message	Date
Yorhel	9ff6bfc93f	indexer: Actually commit the transaction that resets the dead flag	2024-05-13 10:38:48 +02:00
Yorhel	c67091f096	indexer: Drop old package-dead-marking code for Arch A much simpler approach to dead-marking is to do a periodic UPDATE packages SET dead = true WHERE system = $1; before running the indexer. Not as efficient in terms of avoiding database writes, but there's no need to run this very often anyway.	2024-05-13 08:57:35 +02:00
Yorhel	cd5d2c6a20	Remove encodings from "locales" table + delete incorrect locales The frontend always stripped off the encodings already, so no point in keeping that in the DB indices. The full locale was extracted from the filename, which we still keep, so no information is lost. SQL "migration" script: BEGIN; CREATE INDEX files_tmp_locale ON files (locale); INSERT INTO locales (locale) VALUES ('pl_PL'), ('is_IS'), ('ko_KR'); WITH obs(id, locale, lang) AS ( SELECT id, locale, regexp_replace(locale, '^([^.]+)\..+$', '\1') FROM locales WHERE locale LIKE '%.%' UNION ALL SELECT id, locale, '' FROM locales WHERE locale LIKE 'node%' OR locale = 'common' ), rep(old, new) AS ( SELECT o.id, x.id FROM obs o LEFT JOIN locales x ON x.locale = o.lang ), upd AS ( UPDATE files SET locale = new FROM rep WHERE locale = old ) DELETE FROM locales WHERE id IN(SELECT id FROM obs); DROP INDEX files_tmp_locale; COMMIT;	2024-05-01 17:04:09 +02:00
Yorhel	1ee5c9c2df	SQL: Add packages.c_hasman cache to speed up package listings Going from average ~100ms to ~10ms or so. The previous query had a tendency to be much slower sometimes, let's see if this cache also takes care of those outliers. Migration script: ALTER TABLE packages ADD COLUMN c_hasman boolean NOT NULL DEFAULT FALSE; DROP INDEX packages_system_name_key; CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name) INCLUDE (id, c_hasman, dead); UPDATE packages SET c_hasman = NOT c_hasman WHERE c_hasman <> EXISTS(SELECT 1 FROM package_versions pv WHERE pv.package = packages.id AND EXISTS(SELECT 1 FROM files f WHERE f.pkgver = pv.id));	2024-04-29 21:15:40 +02:00
Yorhel	fc9a19e7c4	indexer: Disable broken dead-package checking for Arch Which I broke in `83ab6c3671`. Need to find an alternative approach to detecting dead packages sometime. The 'dead' flag isn't super important so it can wait.	2024-04-29 10:45:13 +02:00
Yorhel	83ab6c3671	Get rid of package categories Whether or not the package name itself or the (category,name) tuple uniquely identified a package within a system has been a source of confusion for a long time. Back in `03d278e4ff` I ended up playing it "safe" by going for (category,name), but in practice this doesn't make a whole lot of sense. While it's possible for the same package name to refer to completely different packages in different "categories", in reality distributions can't sanely support this anyway. For distributions where the category referred to a repository, the only cases where the same package name was used in different repos were when the package had moved from one repo to another. Those should certainly not be treated as different packages. For distributions where the category really referred to a category, there's the Debian approach where the category is purely a tag and doesn't help identify the package in any way, and then there's FreeBSD where the category technically ought to be part of the name. There were a few cases where FreeBSD used categories to separate out different versions of the same package (e.g. ipv6 vs non-ipv6), but none were relevant for man pages so I ended up merging those as well. Getting rid of the categories simplifies and shortens URLs, unclutters the UI a little bit and merges the packages in listings that should've been merged all along. Migration script: -- Merge packages that are in multiple categories. -- All versions are moved to the package with the lowest ID. -- If the same version already exists in a lower ID, the higher-ID version is deleted. BEGIN; WITH migrate(old, new, second) AS ( SELECT q.id, MIN(p.id), MAX(p.id) FROM packages p JOIN packages q ON q.id > p.id AND p.system = q.system AND p.name = q.name GROUP BY q.id ), ded(n) AS ( UPDATE packages SET dead = false FROM migrate m JOIN packages q ON q.id = m.old WHERE packages.id = m.new AND packages.dead AND NOT q.dead RETURNING 1 ), mov(n) AS ( UPDATE package_versions SET package = m.new FROM migrate m WHERE package_versions.package = m.old AND NOT EXISTS( SELECT 1 FROM package_versions v WHERE v.package IN(m.new, m.second) AND v.version = package_versions.version) RETURNING 1 ), del(n) AS ( DELETE FROM packages WHERE id IN(SELECT old FROM migrate) RETURNING 1 ) SELECT (SELECT count(*) FROM migrate) AS migrate, (SELECT count(*) FROM ded) AS ded, (SELECT count(*) FROM mov) AS mov, (SELECT count(*) FROM del) AS del; ALTER TABLE packages DROP CONSTRAINT packages_system_name_category_key; CREATE UNIQUE INDEX packages_system_name_key ON packages (system, name); ALTER TABLE packages DROP COLUMN category; COMMIT;	2024-04-28 10:37:04 +02:00
Yorhel	7302a6408a	indexer: Skip some more non-manpage files	2023-05-26 12:16:22 +02:00
Yorhel	d19c56f285	Correctly handle a few more mis-identified locales	2021-12-16 13:44:39 +01:00
Yorhel	f376f1f137	Large-ish SQL schema revamp/optimizations Primarily aimed at reducing the size of the old 'man' (now: files) table, using smaller integers to refer to man contents and text fields, and storing a shorthash as an integer for quick lookups. This better normalization also removes the need to keep a separate 'man_index' cache for the search function. The old schema wasn't necessarily bad, but I was in the mood for some optimizations. And a little cleanup. Prolly introduces a bunch of new bugs, I haven't tested this too well.	2021-12-14 15:08:54 +01:00
Yorhel	7648603685	Recognize .zst-compressed man pages + fix SQL basename_from_filename() to recognize .xz Also greatly simplified basename_from_filename() because apparently I couldn't write regexes back then. (And the removed REFERENCES line is to sync schema.sql with the actual state of the DB, which doesn't have that constraint for some reason. I'll prolly fix that later)	2021-12-13 18:16:16 +01:00
Yorhel	b27d55215a	Arch: Mark deleted packages as dead and hide them from listings We've got a lot of packages in the DB that have long been removed from the Arch repos. These are still indexed, but won't clutter the package listing anymore. Also fixed an issue with packages.id numbers getting rather large because the indexer allocates a new ID for every package on every update.	2021-12-13 08:18:17 +01:00
Yorhel	82a626b7d4	indexer+www: Support Alpine Linux repos	2021-12-11 16:18:07 +01:00
Yorhel	c9e81a8922	indexer: More crate updates + warning fixes + 2018 edition	2021-12-11 14:56:22 +01:00
Yorhel	c48feedc85	indexer: Switch to ureq + debloat stuff a bit And stop using the "url" crate directly, its API is too unstable for it to be worth using. ...that applies to several other crates as well, but meh.	2021-12-11 12:26:57 +01:00
Yorhel	4588e67b64	Make the Rust garbage compile again	2021-12-11 11:53:26 +01:00
Yorhel	ce38ff885f	indexer: Don't overwrite man page contents when hash already exists Performance improvement. The ON CONFLICT DO UPDATE was primarily to make sure that old man pages would get fixed when the indexer did a better job at detecting the encoding, but there haven't been any relevant fixes to the indexer lately so this forced-update won't do much now.	2019-05-25 08:44:21 +02:00
Yorhel	2974ee929e	Rust: hyper -> reqwest for the indexer Since Hyper doesn't provide a synchronous API anymore.	2019-05-25 08:44:45 +02:00
Yorhel	f0df5092c3	Rust dep updates	2019-05-25 08:27:23 +02:00
Yorhel	7aa89145ca	indexer: Re-use memory buffer when reading RPM repo data This avoids reading the entire uncompressed XML into a buffer.	2018-05-04 15:25:35 +02:00
Yorhel	2c7bf1507a	indexer: Update crates to latest version With the exception of Hyper, because the new tokio-based version is... different.	2018-03-25 10:36:29 +02:00
Yorhel	8235fb28b8	indexer: Fix link resolution and hardlink handling for rpm Unlike tar, cpio does not have a separate entry for each directory, so the link resolution can't assume that directory entries exist for each path component. I also mistakenly assumed that cpio handled hardlinks similarly to tar, but that's clearly not the case. libarchive does help a bit, but these differences still suck.	2017-01-18 13:07:42 +01:00
Yorhel	608f79eb93	indexer: Add support for indexing RPM repositories This code hasn't been thoroughly tested, I'll see how things go when indexing a live repo. And XML parsing sucks in every language.	2017-01-17 17:05:03 +01:00
Yorhel	f77db5f541	indexer: Add bare RPM directory indexing This is for a few special cases, most RPM repos will have proper metadata and all.	2017-01-17 12:50:25 +01:00
Yorhel	d720441fb4	indexer: Rust crate updates	2017-01-17 11:01:11 +01:00
Yorhel	8d6e7bc2d8	indexer: Prioritize 7bit encodings when decoding man pages Fixes parsing of https://manned.org/xshisen/ae5d469f	2016-12-29 09:27:19 +01:00
Yorhel	eac4b6ac77	Don't index ELF binaries + remove some non-man-pages	2016-12-18 16:35:25 +01:00
Yorhel	d153004532	indexer: Support FreeBSD 9.3+; remove now obsolete add_index.pl	2016-12-18 15:08:56 +01:00
Yorhel	b9764fce4a	indexer: Remove openssl + replace siphash with sha1 in cache filename HTTPS isn't used, so removing it saves some space. The std SipHash API has been deprecated, and since hashing performance isn't exactly critical in this case I've replaced it with SHA1, which was already being used in man.rs.	2016-12-11 13:41:10 +01:00
Yorhel	defaa032f8	indexer: Support for indexing FreeBSD <9.3 repositories	2016-12-11 10:59:54 +01:00
Yorhel	1ca0cd4325	Indexer: Remove pointless check	2016-11-27 10:59:31 +01:00
Yorhel	b79ecfb284	indexer: Fix bug in Contents file parsing + decrease cron verbosity Turns out that not all Contents files have a header.	2016-11-27 10:48:35 +01:00
Yorhel	eb15b6e2c7	indexer: Improve Debian Contents file parsing performance by 5.2x Further improvements can be gained by caching the results of get_contents(), since the same Contents file is often parsed multiple times in a single cron run. But this is already a significant achievement.	2016-11-26 16:57:05 +01:00
Yorhel	de28175cd3	Misc. indexing fixes	2016-11-20 16:41:08 +01:00
Yorhel	5d44d0e2ec	Indexer: Add --dryrun and workarounds for old deb repos	2016-11-20 11:39:00 +01:00
Yorhel	ecb1a9e25b	Indexer: Support reading date from .deb archives	2016-11-20 09:01:33 +01:00
Yorhel	a1e5a2d80d	Indexer: Improve logging + cache management	2016-11-20 07:31:55 +01:00
Yorhel	4bdd91f65e	Indexer: Initial support for debian repos	2016-11-19 15:27:24 +01:00
Yorhel	50fe17a604	Indexer: Support .deb archives	2016-11-15 21:15:35 +01:00
Yorhel	20141aa980	indexer: Improve charset detection + lower file cache time	2016-11-09 18:41:53 +01:00
Yorhel	7d2abfb3a4	indexer: Fix storing locale as NULL when empty Perhaps it's better to get rid of NULL and make empty the default value. But for now this'll do.	2016-11-06 16:24:45 +01:00
Yorhel	cb81bedac1	Add arch/encoding metadata to DB + Fetch Arch Linux x86_64 The encoding metadata will be very useful in finding badly decoded man pages. The package 'arch' is necessary to properly identify which package was used, which is not obvious now that I'm going to switch more systems to the (more common) x86_64 arch.	2016-11-06 16:05:16 +01:00
Yorhel	1ca43665a1	indexer: Add file caching + Arch Linux indexing	2016-11-06 13:34:22 +01:00
Yorhel	35fab522d6	Indexer: Support HTTP fetching + misc improvements	2016-11-06 09:21:53 +01:00
Yorhel	aff68205b0	Add postgres package indexing + cli options	2016-11-05 10:22:31 +01:00
Yorhel	0cab758665	Add support for man page reading & decoding	2016-10-30 11:06:14 +01:00
Yorhel	c8bb4da246	Use libarchive3-sys crate directly + improve archread API This all should offer a more convenient and robust interface to handle all sorts of archives.	2016-10-29 09:33:39 +02:00
Yorhel	022e9acc4f	WIP: Rewritten man page indexer in Rust Currently just figuring out how to read archives. Turns out to not be as simple as I had expected.	2016-10-22 14:54:37 +02:00

47 commits