470 lines
18 KiB
Markdown
470 lines
18 KiB
Markdown
% From SQL to Nested Data Structures
|
|
|
|
(Published on **2019-08-13**)
|
|
|
|
This is the most typical problem you'll find in every application that talks
|
|
with an SQL database: You want to fetch a listing of some kind of database
|
|
entry, but the entries themselves may contain nested structured data.
|
|
|
|
To demonstrate what I mean, let's start with a simple example of a game
|
|
database[^1]: We have a table with rows for each top-level game entry and a
|
|
table for releases of said games. There is a one-to-many relation between games
|
|
and releases.
|
|
|
|
```sql
|
|
CREATE TABLE games (
|
|
game_id integer PRIMARY KEY,
|
|
title text
|
|
);
|
|
|
|
CREATE TABLE releases (
|
|
release_id integer PRIMARY KEY,
|
|
game_id integer REFERENCES games,
|
|
release_date date
|
|
);
|
|
```
|
|
|
|
Now suppose we want to display a listing of games with their release dates. It
|
|
would be nice if the database could give us a single JSON document with the
|
|
following structure:
|
|
|
|
```json
|
|
[
|
|
{
|
|
"game_id": 7,
|
|
"title": "Tsukihime",
|
|
"releases": [
|
|
{ "release_id": 339, "release_date": "2000-12-29" },
|
|
{ "release_id": 341, "release_date": "2006-12-28" }
|
|
]
|
|
},
|
|
{ ... }
|
|
]
|
|
```
|
|
|
|
But relational databases are designed to only deal with rows and columns, so
|
|
getting that JSON structure with a single query is not going to happen. Some
|
|
databases do support structured column types, but dynamically constructing
|
|
those can be unwieldy and/or slow.
|
|
|
|
A common and simple approach to fetch nested data is to first fetch the list of
|
|
games from the database, then iterate over the resulting rows and fetch the
|
|
relevant release entries at each iteration. I'll be using Perl 5 and
|
|
[DBIx::Simple](https://metacpan.org/pod/DBIx::Simple) in this article, but the
|
|
ideas should translate to any other dynamic programming language. Here's an
|
|
example of such iteration:
|
|
|
|
```perl
|
|
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
|
|
|
|
for my $game (@games) {
|
|
$game->{releases} = [
|
|
$db->query(
|
|
'SELECT release_id, release_date FROM releases WHERE game_id = ?',
|
|
$game->{game_id}
|
|
)->hashes
|
|
];
|
|
}
|
|
# `@games` now has the requested structure.
|
|
```
|
|
|
|
This approach works, but it's not super convenient and it can be really slow
|
|
when fetching many rows. Of course, this is not a new problem. Almost every
|
|
programming language has several libraries and frameworks available to it to
|
|
help with fetching complex data structures from SQL databases, ranging from a
|
|
few helper functions to complete
|
|
[ORMs](https://en.wikipedia.org/wiki/Object-relational_mapping). What I propose
|
|
here is not a groundbreaking new technique, nor am I going to pitch a
|
|
particular library. Instead, I describe a simple yet generic API to deal with
|
|
the problem.
|
|
|
|
|
|
## Constructing SQL queries in Perl
|
|
|
|
But before I get to the solution, let me first briefly cover how one can
|
|
dynamically construct and compose SQL queries. Having such an abstraction is
|
|
useful in general, but these concepts will also simplify the rest of this
|
|
article.
|
|
|
|
SQL queries are treated as strings in most programming languages and one can
|
|
dynamically construct and compose queries by simple string concatenation.
|
|
There are several problems with actually constructing queries that way, but,
|
|
for our purposes, it almost suffices. Almost. The biggest problem in terms of
|
|
composability is that, not only do you need to concatenate strings to get the
|
|
final query, you also need to keep track of bind parameters. And that can get
|
|
annoying real quick when your queries are dynamic.
|
|
|
|
Here's a simple example of using string interpolation to abstract common
|
|
functionality into a separate function, without losing much on flexibility:
|
|
|
|
```perl
|
|
# Our abstraction to fetch game entries.
|
|
sub fetch_games {
|
|
my $game_ids = shift;
|
|
$db->query("SELECT * FROM games WHERE game_id IN($game_ids)");
|
|
# We could be doing some transformations here.
|
|
# Sorting, pagination, fetching extra information, etc.
|
|
}
|
|
|
|
# We can use that function to fetch games that have already been released.
|
|
my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE";
|
|
my $released_games = fetch_games($ids);
|
|
```
|
|
|
|
But what if we wanted to replace that `CURRENT_DATE` with a user-provided
|
|
value? Doing that without fear of arbitrary SQL injection involves using a bind
|
|
parameter, which our `fetch_games()` function doesn't support. Fortunately,
|
|
there's a solution. Or rather, there are many solutions, but let's stick with
|
|
one I'm familiar with: [SQL::Interp](https://metacpan.org/pod/SQL::Interp).
|
|
That module provides, among other things, a wonderful function called `sql()`
|
|
that takes a list of SQL strings and variable references (which are converted
|
|
into bind parameters) and returns a single value that can be interpolated
|
|
elsewhere. Let's use it to transform the above example:
|
|
|
|
```perl
|
|
use SQL::Inerp 'sql';
|
|
|
|
sub fetch_games {
|
|
my $game_ids = shift;
|
|
# Note the use of `iquery` instead of `query`,
|
|
# this is a DBIx::Simple wrapper around `sql_interp()`.
|
|
$db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")");
|
|
}
|
|
|
|
my $latest_date = '2004-01-20'; # Our user input
|
|
|
|
# This is our "SQL + bind parameters" in a single variable.
|
|
my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date);
|
|
|
|
# ...which can be passed to fetch_games():
|
|
my $games_with_release_before_latest_date = fetch_games($ids);
|
|
```
|
|
|
|
That gives us safe variable interpolation into our SQL queries and we can
|
|
compose queries as if they were normal strings[^2]. You can do a lot more
|
|
powerful query construction tricks with SQL::Interp, but I'll not cover that
|
|
here. Its documentation has several examples and there are hints at even more
|
|
fun tricks in the
|
|
[SQL::Interpolate](https://metacpan.org/pod/SQL::Interpolate::Macro#Builtin-Macros)
|
|
docs (which is a different and older module, but those examples are easily
|
|
ported).
|
|
|
|
|
|
## Structured documents
|
|
|
|
With that out of the way, we can go back to fetching nested structures from a
|
|
relational database. Let's build upon the iteration example given earlier and
|
|
improve its performance. Instead of running a separate query to fetch the
|
|
releases for each game, let's fetch the releases of all the relevant games in a
|
|
single query. We select the necessary rows with an `IN(id1,id2,,...)` clause
|
|
and then merge the returned data back into the `@games` structure, as follows:
|
|
|
|
```perl
|
|
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
|
|
|
|
# List of game_ids, which we can use for our IN() clause.
|
|
my @game_ids = map $_->{game_id}, @games;
|
|
|
|
# Fetch all the releases linked to @game_ids.
|
|
my @releases = $db->iquery(
|
|
'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN',
|
|
\@game_ids
|
|
)->hashes;
|
|
|
|
# Create a 'game_id' => [releases] lookup table for quick access
|
|
my %releases;
|
|
for my $release (@releases) {
|
|
# Add this release to %releases and remove the 'game_id' column,
|
|
# which was only needed for this merging step.
|
|
push @{$releases{ delete $release->{game_id} }}, $release;
|
|
}
|
|
|
|
# Now merge the release information back into @games
|
|
for my $game (@games) {
|
|
$game->{releases} = $releases{ $game->{game_id} } || [];
|
|
}
|
|
# `@games` now has the requested structure.
|
|
|
|
```
|
|
|
|
It works and it's beautifully fast, but it's also verbose and not something
|
|
you'd like to type out every time you want to query the database. Okay, I
|
|
admit, it's more verbose than necessary because I wanted this code to be
|
|
readable by people who aren't experts in Perl. It can be done with less code
|
|
but the number of steps won't change much. Rather than playing code golf, let's
|
|
do what every sensible programmer does when they notice repetition: Let's
|
|
abstract the repeated bits into a separate function.
|
|
|
|
Let's first define the inputs of the function, we'll need:
|
|
|
|
- `@games`: The list of objects that we want to extend.
|
|
- `"releases"`: The name of the field we are adding to the game objects.
|
|
- `"game_id"`: The column used for querying and merging.
|
|
- A function that, given a list of game ids, returns a query for fetching the
|
|
release information.
|
|
|
|
The function would return a modified list of objects with the new information
|
|
embedded inside. Since we'll be passing the objects to the function as a
|
|
reference and since it's easier to modify data in-place in Perl, that's what
|
|
we'll be doing. In other (more functional) languages, it may be easier to avoid
|
|
mutating the given objects and return a list of fresh new objects instead.
|
|
|
|
Here's what that function looks like. It does exactly the same as the code
|
|
listed above[^3], so I figure I can get away with a shorter and slightly less
|
|
readable version this time:
|
|
|
|
```perl
|
|
sub enrich {
|
|
my($field_name, $merge_field, $sql, @objects) = @_;
|
|
my %ids = map +($_->{$merge_field}, []), @objects;
|
|
return if !keys %ids;
|
|
my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
|
|
push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result);
|
|
$_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
|
|
}
|
|
```
|
|
|
|
And here's how we can rewrite the previous example by using that function:
|
|
|
|
```perl
|
|
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
|
|
|
|
enrich 'releases', 'game_id', sub {
|
|
sql 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0]
|
|
}, @games;
|
|
|
|
# `@games` now has the requested structure.
|
|
```
|
|
|
|
It's short, fast and readable. At least, it is once you get used to the
|
|
`enrich` function - I admit it may not be very intuitive the first time you see
|
|
it.
|
|
|
|
|
|
## More nesting
|
|
|
|
Let's expand our example database with a new table. Each release may be
|
|
available for one or more platforms:
|
|
|
|
```sql
|
|
CREATE TABLE releases_platforms (
|
|
release_id integer REFERENCES releases,
|
|
platform text
|
|
);
|
|
```
|
|
|
|
We also want that list of platforms to be included in our nested data
|
|
structure, so that it looks like this:
|
|
|
|
```json
|
|
[
|
|
{
|
|
"game_id": 7,
|
|
"title": "Tsukihime",
|
|
"releases": [
|
|
{
|
|
"release_id": 339,
|
|
"release_date": "2000-12-29",
|
|
"platforms": [
|
|
{ "platform": "Linux" },
|
|
{ "platform": "Windows" }
|
|
]
|
|
},
|
|
{ ... }
|
|
]
|
|
},
|
|
{ ... }
|
|
]
|
|
```
|
|
|
|
Fortunately, the `enrich` function can easily handle nested objects: all we
|
|
have to do is pass a list of *release* objects to `enrich` instead of *game*
|
|
objects, and we can do that with a simple `map`:
|
|
|
|
```perl
|
|
enrich 'platforms', 'release_id', sub {
|
|
sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
|
|
}, map @{$_->{releases}}, @games;
|
|
```
|
|
|
|
That's it, we now have the data structure as described above. It is possible to
|
|
nest arbitrarily deep with this approach.
|
|
|
|
This works because `enrich` modifies the data structure in-place. A version
|
|
that copies the given data structure would be slightly more complex, but that's
|
|
a surmountable problem.
|
|
|
|
|
|
## Variations
|
|
|
|
That array of `platform` objects in the previous example is more verbose than
|
|
we'd like. It would be nicer if we could flatten that to an array of platform
|
|
strings, so that a release object looks more like this:
|
|
|
|
```json
|
|
{
|
|
"release_id": 339,
|
|
"release_date": "2000-12-29",
|
|
"platforms": [ "Linux", "Windows" ]
|
|
}
|
|
```
|
|
|
|
Of course, such a variation of `enrich` could be written quite easily:
|
|
|
|
```perl
|
|
sub enrich_flatten {
|
|
# This is the same as the original enrich()
|
|
my($field_name, $merge_field, $sql, @objects) = @_;
|
|
my %ids = map +($_->{$merge_field}, []), @objects;
|
|
return if !keys %ids;
|
|
my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
|
|
|
|
# This is the actual merge strategy, which we'll modify slightly:
|
|
push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result);
|
|
$_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
|
|
}
|
|
```
|
|
|
|
Usage is exactly the same:
|
|
|
|
|
|
```perl
|
|
enrich_flatten 'platforms', 'release_id', sub {
|
|
sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
|
|
}, map @{$_->{releases}}, @games;
|
|
```
|
|
|
|
Other merge strategies could be implemented in the same way. For example, one
|
|
could imagine a scenario where, instead of a one-to-many relation as in the
|
|
previous examples, there is a one-to-one mapping between the list of objects
|
|
and the query results. An alternative merge strategy in that case could be to
|
|
copy the columns of the new query into the already existing object. You could
|
|
achieve the exact same effect in the original SQL query by adding a plain old
|
|
`JOIN`, but dynamic application-level joins could still have their uses in
|
|
abstracting common functionality or structuring a larger code base.
|
|
|
|
This version of `enrich` has a limitation that the `$merge_field` is used to
|
|
identify both the field in the original objects and the column in the query
|
|
results. This is fine in the example database - I deliberately used full
|
|
*game_id* and *release_id* column names rather than calling them just *id* -
|
|
but there will no doubt come a time when this limitation starts to get annoying
|
|
and you'll want to support different column names for object and result
|
|
queries. It's easy to support that, in any case.
|
|
|
|
|
|
## A more complex example
|
|
|
|
This `enrich` function is cute and all, but the examples I've given so far can
|
|
be implemented just as easily with your average ORM. Yet there is an important
|
|
difference (apart from having one less dependency): `enrich` provides for
|
|
easier ad-hoc flexibility, meaning: you can use the full power of SQL in the
|
|
enrichment queries.
|
|
|
|
Let's add some joins and filters to our example. Let's say we want the
|
|
following structure:
|
|
|
|
```json
|
|
[
|
|
{
|
|
"game_id": 7,
|
|
"title": "Tsukihime",
|
|
"platforms": [ "Windows", "Linux" ],
|
|
"releases": [
|
|
{
|
|
"release_id": 339,
|
|
"release_date": "2000-12-29",
|
|
"platform_count": 2
|
|
}
|
|
]
|
|
},
|
|
{ ... }
|
|
]
|
|
```
|
|
|
|
That is, we want to have a per-game "platforms" array rather than per-release,
|
|
and instead of the platform list for releases, we just want a single
|
|
"platform_count". Let's say we want the list of releases ordered by release
|
|
date and we also want to be able to filter on the release date. Here's a full
|
|
example to achieve all of that:
|
|
|
|
```perl
|
|
my @games = $db->query('SELECT game_id, title FROM games')->hashes;
|
|
|
|
# This is the per-game platforms list.
|
|
enrich_flatten 'platforms', 'game_id', sub {
|
|
sql 'SELECT DISTINCT game_id, platform
|
|
FROM releases
|
|
JOIN releases_platforms USING (release_id)
|
|
WHERE game_id IN', $_[0]
|
|
}, @games;
|
|
|
|
my $latest_date = '2004-01-20'; # Our user input
|
|
|
|
# This is the releases list.
|
|
enrich 'releases', 'game_id', sub {
|
|
sql 'SELECT game_id, release_id, release_date,
|
|
(SELECT COUNT(*) FROM releases_platforms rp
|
|
WHERE rp.release_id = r.release_id) AS platform_count
|
|
FROM releases r
|
|
WHERE game_id IN', $_[0], '
|
|
AND release_date <=', \$latest_date, '
|
|
ORDER BY release_date'
|
|
}, @games;
|
|
```
|
|
|
|
It's not the most elegant piece of code ever written, but it does the trick.
|
|
The alternatives aren't likely to be much better.
|
|
|
|
|
|
## Final notes
|
|
|
|
I've been using this approach - with a slightly different API - for the
|
|
experimental rewrite of [VNDB.org](https://vndb.org/) and I'm happy with the
|
|
result. The main advantage of the `enrich` family of functions is that you can
|
|
easily and efficiently run complex queries and merge the results to construct
|
|
complex JSON documents - ad-hoc and without any boilerplate, as there's no need
|
|
to define, declare or even name your data structure up front. This sometimes
|
|
makes it harder to follow the code, as you may have to visualize a monstrously
|
|
complex data structure in your mind, but this approach can be incredibly
|
|
helpful for quick prototyping. Good documentation and/or a good debugging
|
|
strategy can make up for lack of boilerplate.
|
|
|
|
The exact API for the `enrich()` functions is something I've not fully decided
|
|
on yet. I've not put that much thought into the order of its arguments, the
|
|
need for "variations" for different scenarios could be reduced by adding the
|
|
merge strategy as a separate argument, at which point named arguments may be
|
|
necessary to keep things readable, etc. The exact API doesn't matter all that
|
|
much, it's the idea that matters.
|
|
|
|
As I mentioned in the introduction: while I've been using Perl for these
|
|
examples, I'm convinced that this idea can be easily ported to other dynamic
|
|
languages. I did in fact experiment with porting this idea to Elixir and it was
|
|
every bit as convenient, in a way that
|
|
[Ecto](https://hexdocs.pm/ecto/Ecto.html)'s built-in ORM functionality wasn't.
|
|
|
|
It's quite possible that this abstraction is already available in some
|
|
languages and libraries and that I've just missed it. It's also possible that
|
|
this is, in fact, a terrible idea in the long run. Only way to find out is to
|
|
give it a try. On that note, if you feel like playing around with the examples
|
|
in this article, I wrote [a little script](/download/code/sqlobject.pl) to test
|
|
that the examples actually work, which also makes for a good base for quick
|
|
experimentation.
|
|
|
|
|
|
[^1]: This is a rather simplified schema inspired by
|
|
[VNDB.org](https://vndb.org/) - the project that made me experiment with all
|
|
this.
|
|
[^2]: I'm sure that at this point you'd call me out as a madman for suggesting
|
|
that treating SQL as strings is an acceptable strategy for constructing
|
|
dynamic queries, and I'd actually be sympathetic to that view - I'm well
|
|
aware that it's rather brittle. So now you're going to try and convince me
|
|
that your favorite query builder library is the *ultimate* solution to this
|
|
problem and that everyone should use it. Except it's not available for Perl
|
|
and/or it has severe limitations and/or it has so much boilerplate that it's
|
|
an utter waste of time for quick prototyping. I've played around with query
|
|
builders in various languages, but the only one I actually kinda liked was
|
|
Elixir's [Ecto](https://hexdocs.pm/ecto/Ecto.html). But alas, I'm stuck with
|
|
Perl for now, and string interpolation isn't all that bad.
|
|
[^3]: I lied, this version does two more things: It removes duplicate
|
|
identifiers from the `IN` clause and does an early return if there are no
|
|
identifiers to fetch.
|