diff --git a/.gitignore b/.gitignore index 486390b..6b5a8e5 100644 --- a/.gitignore +++ b/.gitignore @@ -48,6 +48,7 @@ pub/doc/easyipc.html pub/doc/funcweb.html pub/doc/pwlookup.html pub/doc/sqlaccess.html +pub/doc/sqlobject.html pub/dump.html pub/dump/awshrink.html pub/dump/btrfssize.html diff --git a/Makefile b/Makefile index e58fc1e..963b19d 100644 --- a/Makefile +++ b/Makefile @@ -21,6 +21,7 @@ PAGES=\ "doc/funcweb.md"\ "doc/pwlookup.md"\ "doc/sqlaccess.md"\ + "doc/sqlobject.md"\ "dump.md"\ "dump/awshrink.md"\ "dump/btrfssize.md"\ diff --git a/dat/doc.md b/dat/doc.md index f0fda50..bbce18f 100644 --- a/dat/doc.md +++ b/dat/doc.md @@ -6,6 +6,10 @@ rare occasions are published on this page. ## Articles That May As Well Be Considered Blog Posts +`2019-08-13` - [From SQL to Nested Data Structures](/doc/sqlobject) +: How to easily fetch complex nested data structures from a normalized + relational database. + `2019-05-14` - [Fast Key Lookup with a Small Read-Only Database](/doc/pwlookup) : How to quickly check if a password is in a large (but nicely compressed) dictionary. diff --git a/dat/doc/sqlobject.md b/dat/doc/sqlobject.md new file mode 100644 index 0000000..4fbceb5 --- /dev/null +++ b/dat/doc/sqlobject.md @@ -0,0 +1,470 @@ +% From SQL to Nested Data Structures + +(Published on **2019-08-13**) + +This is the most typical problem you'll find in every application that talks +with an SQL database: You want to fetch a listing of some kind of database +entry, but the entries themselves may contain nested structured data. + +To demonstrate what I mean, let's start with a simple example of a game +database[^1]: We have a table with rows for each top-level game entry and a +table for releases of said games. There is a one-to-many relation between games +and releases. 
+ +```sql +CREATE TABLE games ( + game_id integer PRIMARY KEY, + title text +); + +CREATE TABLE releases ( + release_id integer PRIMARY KEY, + game_id integer REFERENCES games, + release_date date +); +``` + +Now suppose we want to display a listing of games with their release dates. It +would be nice if the database could give us a single JSON document with the +following structure: + +```json +[ + { + "game_id": 7, + "title": "Tsukihime", + "releases": [ + { "release_id": 339, "release_date": "2000-12-29" }, + { "release_id": 341, "release_date": "2006-12-28" } + ] + }, + { ... } +] +``` + +But relational databases are designed to only deal with rows and columns, so +getting that JSON structure with a single query is not going to happen. Some +databases do support structured column types, but dynamically constructing +those can be unwieldy and/or slow. + +A common and simple approach to fetch nested data is to first fetch the list of +games from the database, then iterate over the resulting rows and fetch the +relevant release entries at each iteration. I'll be using Perl 5 and +[DBIx::Simple](https://metacpan.org/pod/DBIx::Simple) in this article, but the +ideas should translate to any other dynamic programming language. Here's an +example of such iteration: + +```perl +my @games = $db->query('SELECT game_id, title FROM games')->hashes; + +for my $game (@games) { + $game->{releases} = [ + $db->query( + 'SELECT release_id, release_date FROM releases WHERE game_id = ?', + $game->{game_id} + )->hashes + ]; +} +# `@games` now has the requested structure. +``` + +This approach works, but it's not super convenient and it can be really slow +when fetching many rows. Of course, this is not a new problem. Almost every +programming language has several libraries and frameworks available to it to +help with fetching complex data structures from SQL databases, ranging from a +few helper functions to complete +[ORMs](https://en.wikipedia.org/wiki/Object-relational_mapping). 
What I propose +here is not a groundbreaking new technique, nor am I going to pitch a +particular library. Instead, I describe a simple yet generic API to deal with +the problem. + + +## Constructing SQL queries in Perl + +But before I get to the solution, let me first briefly cover how one can +dynamically construct and compose SQL queries. Having such an abstraction is +useful in general, but these concepts will also simplify the rest of this +article. + +SQL queries are treated as strings in most programming languages and one can +dynamically construct and compose queries by simple string concatenation. +There are several problems with actually constructing queries that way, but, +for our purposes, it almost suffices. Almost. The biggest problem in terms of +composability is that, not only do you need to concatenate strings to get the +final query, you also need to keep track of bind parameters. And that can get +annoying real quick when your queries are dynamic. + +Here's a simple example of using string interpolation to abstract common +functionality into a separate function, without losing much on flexibility: + +```perl +# Our abstraction to fetch game entries. +sub fetch_games { + my $game_ids = shift; + $db->query("SELECT * FROM games WHERE game_id IN($game_ids)"); + # We could be doing some transformations here. + # Sorting, pagination, fetching extra information, etc. +} + +# We can use that function to fetch games that have already been released. +my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE"; +my $released_games = fetch_games($ids); +``` + +But what if we wanted to replace that `CURRENT_DATE` with a user-provided +value? Doing that without fear of arbitrary SQL injection involves using a bind +parameter, which our `fetch_games()` function doesn't support. Fortunately, +there's a solution. Or rather, there are many solutions, but let's stick with +one I'm familiar with: [SQL::Interp](https://metacpan.org/pod/SQL::Interp). 
+That module provides, among other things, a wonderful function called `sql()` +that takes a list of SQL strings and variable references (which are converted +into bind parameters) and returns a single value that can be interpolated +elsewhere. Let's use it to transform the above example: + +```perl +use SQL::Interp 'sql'; + +sub fetch_games { + my $game_ids = shift; + # Note the use of `iquery` instead of `query`, + # this is a DBIx::Simple wrapper around `sql_interp()`. + $db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")"); +} + +my $latest_date = '2004-01-20'; # Our user input + +# This is our "SQL + bind parameters" in a single variable. +my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date); + +# ...which can be passed to fetch_games(): +my $games_with_release_before_latest_date = fetch_games($ids); +``` + +That gives us safe variable interpolation into our SQL queries and we can +compose queries as if they were normal strings[^2]. You can do a lot more +powerful query construction tricks with SQL::Interp, but I'll not cover that +here. Its documentation has several examples and there are hints at even more +fun tricks in the +[SQL::Interpolate](https://metacpan.org/pod/SQL::Interpolate::Macro#Builtin-Macros) +docs (which is a different and older module, but those examples are easily +ported). + + +## Structured documents + +With that out of the way, we can go back to fetching nested structures from a +relational database. Let's build upon the iteration example given earlier and +improve its performance. Instead of running a separate query to fetch the +releases for each game, let's fetch the releases of all the relevant games in a +single query. 
We select the necessary rows with an `IN(id1,id2,...)` clause +and then merge the returned data back into the `@games` structure, as follows: + +```perl +my @games = $db->query('SELECT game_id, title FROM games')->hashes; + +# List of game_ids, which we can use for our IN() clause. +my @game_ids = map $_->{game_id}, @games; + +# Fetch all the releases linked to @game_ids. +my @releases = $db->iquery( + 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', + \@game_ids +)->hashes; + +# Create a 'game_id' => [releases] lookup table for quick access +my %releases; +for my $release (@releases) { + # Add this release to %releases and remove the 'game_id' column, + # which was only needed for this merging step. + push @{$releases{ delete $release->{game_id} }}, $release; +} + +# Now merge the release information back into @games +for my $game (@games) { + $game->{releases} = $releases{ $game->{game_id} } || []; +} +# `@games` now has the requested structure. + +``` + +It works and it's beautifully fast, but it's also verbose and not something +you'd like to type out every time you want to query the database. Okay, I +admit, it's more verbose than necessary because I wanted this code to be +readable by people who aren't experts in Perl. It can be done with less code +but the number of steps won't change much. Rather than playing code golf, let's +do what every sensible programmer does when they notice repetition: Let's +abstract the repeated bits into a separate function. + +Let's first define the inputs of the function. We'll need: + +- `@games`: The list of objects that we want to extend. +- `"releases"`: The name of the field we are adding to the game objects. +- `"game_id"`: The column used for querying and merging. +- A function that, given a list of game ids, returns a query for fetching the + release information. + +The function would return a modified list of objects with the new information +embedded inside. 
Since we'll be passing the objects to the function as a +reference and since it's easier to modify data in-place in Perl, that's what +we'll be doing. In other (more functional) languages, it may be easier to avoid +mutating the given objects and return a list of fresh new objects instead. + +Here's what that function looks like. It does exactly the same as the code +listed above[^3], so I figure I can get away with a shorter and slightly less +readable version this time: + +```perl +sub enrich { + my($field_name, $merge_field, $sql, @objects) = @_; + my %ids = map +($_->{$merge_field}, []), @objects; + return if !keys %ids; + my @result = $db->iquery( $sql->([keys %ids]) )->hashes; + push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result); + $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects); +} +``` + +And here's how we can rewrite the previous example by using that function: + +```perl +my @games = $db->query('SELECT game_id, title FROM games')->hashes; + +enrich 'releases', 'game_id', sub { + sql 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0] +}, @games; + +# `@games` now has the requested structure. +``` + +It's short, fast and readable. At least, it is once you get used to the +`enrich` function - I admit it may not be very intuitive the first time you see +it. + + +## More nesting + +Let's expand our example database with a new table. Each release may be +available for one or more platforms: + +```sql +CREATE TABLE releases_platforms ( + release_id integer REFERENCES releases, + platform text +); +``` + +We also want that list of platforms to be included in our nested data +structure, so that it looks like this: + +```json +[ + { + "game_id": 7, + "title": "Tsukihime", + "releases": [ + { + "release_id": 339, + "release_date": "2000-12-29", + "platforms": [ + { "platform": "Linux" }, + { "platform": "Windows" } + ] + }, + { ... } + ] + }, + { ... 
} +] +``` + +Fortunately, the `enrich` function can easily handle nested objects: all we +have to do is pass a list of *release* objects to `enrich` instead of *game* +objects, and we can do that with a simple `map`: + +```perl +enrich 'platforms', 'release_id', sub { + sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0] +}, map @{$_->{releases}}, @games; +``` + +That's it, we now have the data structure as described above. It is possible to +nest arbitrarily deep with this approach. + +This works because `enrich` modifies the data structure in-place. A version +that copies the given data structure would be slightly more complex, but that's +a surmountable problem. + + +## Variations + +That array of `platform` objects in the previous example is more verbose than +we'd like. It would be nicer if we could flatten that to an array of platform +strings, so that a release object looks more like this: + +```json +{ + "release_id": 339, + "release_date": "2000-12-29", + "platforms": [ "Linux", "Windows" ] +} +``` + +Of course, such a variation of `enrich` could be written quite easily: + +```perl +sub enrich_flatten { + # This is the same as the original enrich() + my($field_name, $merge_field, $sql, @objects) = @_; + my %ids = map +($_->{$merge_field}, []), @objects; + return if !keys %ids; + my @result = $db->iquery( $sql->([keys %ids]) )->hashes; + + # This is the actual merge strategy, which we'll modify slightly: + push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result); + $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects); +} +``` + +Usage is exactly the same: + + +```perl +enrich_flatten 'platforms', 'release_id', sub { + sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0] +}, map @{$_->{releases}}, @games; +``` + +Other merge strategies could be implemented in the same way. 
For example, one +could imagine a scenario where, instead of a one-to-many relation as in the +previous examples, there is a one-to-one mapping between the list of objects +and the query results. An alternative merge strategy in that case could be to +copy the columns of the new query into the already existing object. You could +achieve the exact same effect in the original SQL query by adding a plain old +`JOIN`, but dynamic application-level joins could still have their uses in +abstracting common functionality or structuring a larger code base. + +This version of `enrich` has a limitation that the `$merge_field` is used to +identify both the field in the original objects and the column in the query +results. This is fine in the example database - I deliberately used full +*game_id* and *release_id* column names rather than calling them just *id* - +but there will no doubt come a time when this limitation starts to get annoying +and you'll want to support different column names for object and result +queries. It's easy to support that, in any case. + + +## A more complex example + +This `enrich` function is cute and all, but the examples I've given so far can +be implemented just as easily with your average ORM. Yet there is an important +difference (apart from having one less dependency): `enrich` provides for +easier ad-hoc flexibility, meaning: you can use the full power of SQL in the +enrichment queries. + +Let's add some joins and filters to our example. Let's say we want the +following structure: + +```json +[ + { + "game_id": 7, + "title": "Tsukihime", + "platforms": [ "Windows", "Linux" ], + "releases": [ + { + "release_id": 339, + "release_date": "2000-12-29", + "platform_count": 2 + } + ] + }, + { ... } +] +``` + +That is, we want to have a per-game "platforms" array rather than per-release, +and instead of the platform list for releases, we just want a single +"platform_count". 
Let's say we want the list of releases ordered by release +date and we also want to be able to filter on the release date. Here's a full +example to achieve all of that: + +```perl +my @games = $db->query('SELECT game_id, title FROM games')->hashes; + +# This is the per-game platforms list. +enrich_flatten 'platforms', 'game_id', sub { + sql 'SELECT DISTINCT game_id, platform + FROM releases + JOIN releases_platforms USING (release_id) + WHERE game_id IN', $_[0] +}, @games; + +my $latest_date = '2004-01-20'; # Our user input + +# This is the releases list. +enrich 'releases', 'game_id', sub { + sql 'SELECT game_id, release_id, release_date, + (SELECT COUNT(*) FROM releases_platforms rp + WHERE rp.release_id = r.release_id) AS platform_count + FROM releases r + WHERE game_id IN', $_[0], ' + AND release_date <=', \$latest_date, ' + ORDER BY release_date' +}, @games; +``` + +It's not the most elegant piece of code ever written, but it does the trick. +The alternatives aren't likely to be much better. + + +## Final notes + +I've been using this approach - with a slightly different API - for the +experimental rewrite of [VNDB.org](https://vndb.org/) and I'm happy with the +result. The main advantage of the `enrich` family of functions is that you can +easily and efficiently run complex queries and merge the results to construct +complex JSON documents - ad-hoc and without any boilerplate, as there's no need +to define, declare or even name your data structure up front. This sometimes +makes it harder to follow the code, as you may have to visualize a monstrously +complex data structure in your mind, but this approach can be incredibly +helpful for quick prototyping. Good documentation and/or a good debugging +strategy can make up for lack of boilerplate. + +The exact API for the `enrich()` functions is something I've not fully decided +on yet. 
I've not put that much thought into the order of its arguments, the +need for "variations" for different scenarios could be reduced by adding the +merge strategy as a separate argument, at which point named arguments may be +necessary to keep things readable, etc. The exact API doesn't matter all that +much, it's the idea that matters. + +As I mentioned in the introduction: while I've been using Perl for these +examples, I'm convinced that this idea can be easily ported to other dynamic +languages. I did in fact experiment with porting this idea to Elixir and it was +every bit as convenient, in a way that +[Ecto](https://hexdocs.pm/ecto/Ecto.html)'s built-in ORM functionality wasn't. + +It's quite possible that this abstraction is already available in some +languages and libraries and that I've just missed it. It's also possible that +this is, in fact, a terrible idea in the long run. Only way to find out is to +give it a try. On that note, if you feel like playing around with the examples +in this article, I wrote [a little script](/download/code/sqlobject.pl) to test +that the examples actually work, which also makes for a good base for quick +experimentation. + + +[^1]: This is a rather simplified schema inspired by + [VNDB.org](https://vndb.org/) - the project that made me experiment with all + this. +[^2]: I'm sure that at this point you'd call me out as a madman for suggesting + that treating SQL as strings is an acceptable strategy for constructing + dynamic queries, and I'd actually be sympathetic to that view - I'm well + aware that it's rather brittle. So now you're going to try and convince me + that your favorite query builder library is the *ultimate* solution to this + problem and that everyone should use it. Except it's not available for Perl + and/or it has severe limitations and/or it has so much boilerplate that it's + an utter waste of time for quick prototyping. 
I've played around with query + builders in various languages, but the only one I actually kinda liked was + Elixir's [Ecto](https://hexdocs.pm/ecto/Ecto.html). But alas, I'm stuck with + Perl for now, and string interpolation isn't all that bad. +[^3]: I lied, this version does two more things: It removes duplicate + identifiers from the `IN` clause and does an early return if there are no + identifiers to fetch. diff --git a/dat/index.md b/dat/index.md index f14da82..127e678 100644 --- a/dat/index.md +++ b/dat/index.md @@ -20,6 +20,10 @@ the incidental article on this site. Enjoy your stay! ## Announcements Atom feed +`2019-08-13` - New article: From SQL to Nested Data Structures +: How to easily fetch complex nested data structures from a normalized + relational database. [Read more.](/doc/sqlobject) + `2019-08-05` - ncdu 1.14.1 released : A minor bugfix release. This fixes the `--exclude-caches` option, intermittend early exit on OS X and improves handling of out-of-memory diff --git a/dat/ncdc/install.md b/dat/ncdc/install.md index cd16bf4..0ef6aad 100644 --- a/dat/ncdc/install.md +++ b/dat/ncdc/install.md @@ -18,9 +18,9 @@ each OS and distribution to make the installation process a bit easier for everyone. If your system is missing from this page or if you're still having trouble, -don't hesitate to join the support hub at `adc://dc.blicky.net/` or send me a -mail at [projects@yorhel.nl](mailto:projects@yorhel.nl). Contributions to this -page are of course highly welcomed as well. :-) +don't hesitate to join the support hub at `adcs://dc.blicky.net:2780/` or send +me a mail at [projects@yorhel.nl](mailto:projects@yorhel.nl). Contributions to +this page are of course highly welcomed as well. 
:-) ## Statically linked binaries diff --git a/pub/download/code/sqlobject.pl b/pub/download/code/sqlobject.pl new file mode 100644 index 0000000..55bb3d5 --- /dev/null +++ b/pub/download/code/sqlobject.pl @@ -0,0 +1,193 @@ +#!/usr/bin/perl + +# For info: https://dev.yorhel.nl/doc/sqlobject +# Date: 2019-08-13 +# License: MIT + +use v5.12; +use warnings; +use DBIx::Simple; +use SQL::Interp 'sql'; +use Data::Dumper 'Dumper'; + +my $db = DBIx::Simple->connect('dbi:Pg:dbname=test', undef, undef, { RaiseError => 1 }); + +$db->query($_) for (split /;/, <<_); +DROP SCHEMA IF EXISTS sqlobject_test CASCADE; +CREATE SCHEMA sqlobject_test; +SET search_path TO sqlobject_test; + +CREATE TABLE games ( + game_id integer PRIMARY KEY, + title text +); + +CREATE TABLE releases ( + release_id integer PRIMARY KEY, + game_id integer REFERENCES games, + release_date date +); + +CREATE TABLE releases_platforms ( + release_id integer REFERENCES releases, + platform text +); + +INSERT INTO games VALUES (7, 'Tsukihime'); +INSERT INTO releases VALUES (339, 7, '2000-12-29'); +INSERT INTO releases VALUES (341, 7, '2006-12-28'); +INSERT INTO releases_platforms VALUES (339, 'Linux'); +INSERT INTO releases_platforms VALUES (339, 'Windows'); +INSERT INTO releases_platforms VALUES (341, 'Windows') +_ + + +{ + my @games = $db->query('SELECT game_id, title FROM games')->hashes; + + for my $game (@games) { + $game->{releases} = [ + $db->query( + 'SELECT release_id, release_date FROM releases WHERE game_id = ?', + $game->{game_id} + )->hashes + ]; + } + print Dumper \@games; +} + + +{ + # Our abstraction to fetch game entries. + my sub fetch_games { + my $game_ids = shift; + $db->query("SELECT * FROM games WHERE game_id IN($game_ids)")->hashes; + # We could be doing some transformations here. + # Sorting, pagination, fetching extra information, etc. + } + + # We can use that function to fetch games that have already been released. 
+ my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE"; + my @released_games = fetch_games($ids); + print Dumper \@released_games; +} + + +{ + + my sub fetch_games { + my $game_ids = shift; + # Note the use of `iquery` instead of `query`, + # this is a DBIx::Simple wrapper around `sql_interp()`. + $db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")")->hashes; + } + + my $latest_date = '2019-01-20'; # Our user input + + # This is our "SQL + bind parameters" in a single variable. + my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date); + + # ...which can be passed to fetch_games(): + my @games_with_release_before_latest_date = fetch_games($ids); + print Dumper \@games_with_release_before_latest_date; +} + + +{ + my @games = $db->query('SELECT game_id, title FROM games')->hashes; + + # List of game_ids, which we can use for our IN() clause. + my @game_ids = map $_->{game_id}, @games; + + # Fetch all the releases related to @game_ids. + my @releases = $db->iquery( + 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', + \@game_ids + )->hashes; + + # Create a 'game_id' => [releases] lookup table for quick access + my %releases; + for my $release (@releases) { + # Add this release to %releases and remove the 'game_id' column, + # which was only needed for this merging step. 
+ push @{$releases{ delete $release->{game_id} }}, $release; + } + + # Now merge the release information back into @games + for my $game (@games) { + $game->{releases} = $releases{ $game->{game_id} } || []; + } + print Dumper \@games; +} + + +sub enrich { + my($field_name, $merge_field, $sql, @objects) = @_; + my %ids = map +($_->{$merge_field}, []), @objects; + return if !keys %ids; + my @result = $db->iquery( $sql->([keys %ids]) )->hashes; + push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result); + $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects); +} + + +sub enrich_flatten { + # This is the same as the original enrich() + my($field_name, $merge_field, $sql, @objects) = @_; + my %ids = map +($_->{$merge_field}, []), @objects; + return if !keys %ids; + my @result = $db->iquery( $sql->([keys %ids]) )->hashes; + + # This is the actual merge strategy, which we'll modify slightly: + push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result); + $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects); +} + + +{ + my @games = $db->query('SELECT game_id, title FROM games')->hashes; + + enrich 'releases', 'game_id', sub { + sql 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0] + }, @games; + + # More nesting + enrich 'platforms', 'release_id', sub { + sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0] + }, map @{$_->{releases}}, @games; + + # Flattened platforms + enrich_flatten 'platforms_flat', 'release_id', sub { + sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0] + }, map @{$_->{releases}}, @games; + + print Dumper \@games; +} + + +{ + my @games = $db->query('SELECT game_id, title FROM games')->hashes; + + # This is the per-game platforms list. 
+ enrich_flatten 'platforms', 'game_id', sub { + sql 'SELECT DISTINCT game_id, platform + FROM releases + JOIN releases_platforms USING (release_id) + WHERE game_id IN', $_[0] + }, @games; + + my $latest_date = '2004-01-20'; # Our user input + + # This is the releases list. + enrich 'releases', 'game_id', sub { + sql 'SELECT game_id, release_id, release_date, + (SELECT COUNT(*) FROM releases_platforms rp + WHERE rp.release_id = r.release_id) AS platform_count + FROM releases r + WHERE game_id IN', $_[0], ' + AND release_date <=', \$latest_date, ' + ORDER BY release_date' + }, @games; + + print Dumper \@games; +}