yhdev/dat/doc/sqlobject.md

% From SQL to Nested Data Structures

(Published on **2019-08-13**)

This is the most typical problem you'll find in every application that talks
with an SQL database: You want to fetch a listing of some kind of database
entry, but the entries themselves may contain nested structured data.

To demonstrate what I mean, let's start with a simple example of a game
database[^1]: We have a table with rows for each top-level game entry and a
table for releases of said games. There is a one-to-many relation between games
and releases.

```sql
CREATE TABLE games (
    game_id  integer PRIMARY KEY,
    title    text
);

CREATE TABLE releases (
    release_id    integer PRIMARY KEY,
    game_id       integer REFERENCES games,
    release_date  date
);
```

Now suppose we want to display a listing of games with their release dates. It
would be nice if the database could give us a single JSON document with the
following structure:

```json
[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "releases": [
            { "release_id": 339, "release_date": "2000-12-29" },
            { "release_id": 341, "release_date": "2006-12-28" }
        ]
    },
    { ... }
]
```

But relational databases are designed to only deal with rows and columns, so
getting that JSON structure with a single query is not going to happen. Some
databases do support structured column types, but dynamically constructing
those can be unwieldy and/or slow.

A common and simple approach to fetch nested data is to first fetch the list of
games from the database, then iterate over the resulting rows and fetch the
relevant release entries at each iteration. I'll be using Perl 5 and
[DBIx::Simple](https://metacpan.org/pod/DBIx::Simple) in this article, but the
ideas should translate to any other dynamic programming language. Here's an
example of such iteration:

```perl
my @games = $db->query('SELECT game_id, title FROM games')->hashes;

for my $game (@games) {
    $game->{releases} = [
        $db->query(
            'SELECT release_id, release_date FROM releases WHERE game_id = ?',
            $game->{game_id}
        )->hashes
    ];
}
# `@games` now has the requested structure.
```

This approach works, but it's not super convenient and it can be really slow
when fetching many rows. Of course, this is not a new problem. Almost every
programming language has several libraries and frameworks available to it to
help with fetching complex data structures from SQL databases, ranging from a
few helper functions to complete
[ORMs](https://en.wikipedia.org/wiki/Object-relational_mapping). What I propose
here is not a groundbreaking new technique, nor am I going to pitch a
particular library.  Instead, I describe a simple yet generic API to deal with
the problem.


## Constructing SQL queries in Perl

But before I get to the solution, let me first briefly cover how one can
dynamically construct and compose SQL queries. Having such an abstraction is
useful in general, but these concepts will also simplify the rest of this
article.

SQL queries are treated as strings in most programming languages and one can
dynamically construct and compose queries by simple string concatenation.
There are several problems with actually constructing queries that way, but,
for our purposes, it almost suffices.  Almost. The biggest problem in terms of
composability is that, not only do you need to concatenate strings to get the
final query, you also need to keep track of bind parameters. And that can get
annoying real quick when your queries are dynamic.

Here's a simple example of using string interpolation to abstract common
functionality into a separate function, without losing much on flexibility:

```perl
# Our abstraction to fetch game entries.
sub fetch_games {
    my $game_ids = shift;
    $db->query("SELECT * FROM games WHERE game_id IN($game_ids)");
    # We could be doing some transformations here.
    # Sorting, pagination, fetching extra information, etc.
}

# We can use that function to fetch games that have already been released.
my $ids = "SELECT game_id FROM releases WHERE release_date <= CURRENT_DATE";
my $released_games = fetch_games($ids);
```

But what if we wanted to replace that `CURRENT_DATE` with a user-provided
value? Doing that without fear of arbitrary SQL injection involves using a bind
parameter, which our `fetch_games()` function doesn't support. Fortunately,
there's a solution. Or rather, there are many solutions, but let's stick with
one I'm familiar with: [SQL::Interp](https://metacpan.org/pod/SQL::Interp).
That module provides, among other things, a wonderful function called `sql()`
that takes a list of SQL strings and variable references (which are converted
into bind parameters) and returns a single value that can be interpolated
elsewhere. Let's use it to transform the above example:

```perl
use SQL::Inerp 'sql';

sub fetch_games {
    my $game_ids = shift;
    # Note the use of `iquery` instead of `query`,
    # this is a DBIx::Simple wrapper around `sql_interp()`.
    $db->iquery("SELECT * FROM games WHERE game_id IN(", $game_ids, ")");
}

my $latest_date = '2004-01-20'; # Our user input

# This is our "SQL + bind parameters" in a single variable.
my $ids = sql("SELECT game_id FROM releases WHERE release_date <=", \$latest_date);

# ...which can be passed to fetch_games():
my $games_with_release_before_latest_date = fetch_games($ids);
```

That gives us safe variable interpolation into our SQL queries and we can
compose queries as if they were normal strings[^2]. You can do a lot more
powerful query construction tricks with SQL::Interp, but I'll not cover that
here. Its documentation has several examples and there are hints at even more
fun tricks in the
[SQL::Interpolate](https://metacpan.org/pod/SQL::Interpolate::Macro#Builtin-Macros)
docs (which is a different and older module, but those examples are easily
ported).


## Structured documents

With that out of the way, we can go back to fetching nested structures from a
relational database. Let's build upon the iteration example given earlier and
improve its performance. Instead of running a separate query to fetch the
releases for each game, let's fetch the releases of all the relevant games in a
single query. We select the necessary rows with an `IN(id1,id2,,...)` clause
and then merge the returned data back into the `@games` structure, as follows:

```perl
my @games = $db->query('SELECT game_id, title FROM games')->hashes;

# List of game_ids, which we can use for our IN() clause.
my @game_ids = map $_->{game_id}, @games;

# Fetch all the releases linked to @game_ids.
my @releases = $db->iquery(
    'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN',
    \@game_ids
)->hashes;

# Create a 'game_id' => [releases] lookup table for quick access
my %releases;
for my $release (@releases) {
    # Add this release to %releases and remove the 'game_id' column,
    # which was only needed for this merging step.
    push @{$releases{ delete $release->{game_id} }}, $release;
}

# Now merge the release information back into @games
for my $game (@games) {
    $game->{releases} = $releases{ $game->{game_id} } || [];
}
# `@games` now has the requested structure.

```

It works and it's beautifully fast, but it's also verbose and not something
you'd like to type out every time you want to query the database. Okay, I
admit, it's more verbose than necessary because I wanted this code to be
readable by people who aren't experts in Perl. It can be done with less code
but the number of steps won't change much. Rather than playing code golf, let's
do what every sensible programmer does when they notice repetition: Let's
abstract the repeated bits into a separate function.

Let's first define the inputs of the function, we'll need:

- `@games`: The list of objects that we want to extend.
- `"releases"`: The name of the field we are adding to the game objects.
- `"game_id"`: The column used for querying and merging.
- A function that, given a list of game ids, returns a query for fetching the
  release information.

The function would return a modified list of objects with the new information
embedded inside. Since we'll be passing the objects to the function as a
reference and since it's easier to modify data in-place in Perl, that's what
we'll be doing. In other (more functional) languages, it may be easier to avoid
mutating the given objects and return a list of fresh new objects instead.

Here's what that function looks like. It does exactly the same as the code
listed above[^3], so I figure I can get away with a shorter and slightly less
readable version this time:

```perl
sub enrich {
    my($field_name, $merge_field, $sql, @objects) = @_;
    my %ids = map +($_->{$merge_field}, []), @objects;
    return if !keys %ids;
    my @result = $db->iquery( $sql->([keys %ids]) )->hashes;
    push @{$ids{ delete $_->{$merge_field} }}, $_ for (@result);
    $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}
```

And here's how we can rewrite the previous example by using that function:

```perl
my @games = $db->query('SELECT game_id, title FROM games')->hashes;

enrich 'releases', 'game_id', sub {
    sql 'SELECT game_id, release_id, release_date FROM releases WHERE game_id IN', $_[0]
}, @games;

# `@games` now has the requested structure.
```

It's short, fast and readable. At least, it is once you get used to the
`enrich` function - I admit it may not be very intuitive the first time you see
it.


## More nesting

Let's expand our example database with a new table. Each release may be
available for one or more platforms:

```sql
CREATE TABLE releases_platforms (
    release_id  integer REFERENCES releases,
    platform    text
);
```

We also want that list of platforms to be included in our nested data
structure, so that it looks like this:

```json
[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "releases": [
            {
                "release_id": 339,
                "release_date": "2000-12-29",
                "platforms": [
                    { "platform": "Linux" },
                    { "platform": "Windows" }
                ]
            },
            { ... }
        ]
    },
    { ... }
]
```

Fortunately, the `enrich` function can easily handle nested objects: all we
have to do is pass a list of *release* objects to `enrich` instead of *game*
objects, and we can do that with a simple `map`:

```perl
enrich 'platforms', 'release_id', sub {
    sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
}, map @{$_->{releases}}, @games;
```

That's it, we now have the data structure as described above. It is possible to
nest arbitrarily deep with this approach.

This works because `enrich` modifies the data structure in-place. A version
that copies the given data structure would be slightly more complex, but that's
a surmountable problem.


## Variations

That array of `platform` objects in the previous example is more verbose than
we'd like. It would be nicer if we could flatten that to an array of platform
strings, so that a release object looks more like this:

```json
{
    "release_id": 339,
    "release_date": "2000-12-29",
    "platforms": [ "Linux", "Windows" ]
}
```

Of course, such a variation of `enrich` could be written quite easily:

```perl
sub enrich_flatten {
    # This is the same as the original enrich()
    my($field_name, $merge_field, $sql, @objects) = @_;
    my %ids = map +($_->{$merge_field}, []), @objects;
    return if !keys %ids;
    my @result = $db->iquery( $sql->([keys %ids]) )->hashes;

    # This is the actual merge strategy, which we'll modify slightly:
    push @{$ids{ delete $_->{$merge_field} }}, values %$_ for (@result);
    $_->{ $field_name } = $ids{ $_->{$merge_field} } for (@objects);
}
```

Usage is exactly the same:


```perl
enrich_flatten 'platforms', 'release_id', sub {
    sql 'SELECT release_id, platform FROM releases_platforms WHERE release_id IN', $_[0]
}, map @{$_->{releases}}, @games;
```

Other merge strategies could be implemented in the same way. For example, one
could imagine a scenario where, instead of a one-to-many relation as in the
previous examples, there is a one-to-one mapping between the list of objects
and the query results. An alternative merge strategy in that case could be to
copy the columns of the new query into the already existing object. You could
achieve the exact same effect in the original SQL query by adding a plain old
`JOIN`, but dynamic application-level joins could still have their uses in
abstracting common functionality or structuring a larger code base.

This version of `enrich` has a limitation that the `$merge_field` is used to
identify both the field in the original objects and the column in the query
results. This is fine in the example database - I deliberately used full
*game_id* and *release_id* column names rather than calling them just *id* -
but there will no doubt come a time when this limitation starts to get annoying
and you'll want to support different column names for object and result
queries. It's easy to support that, in any case.


## A more complex example

This `enrich` function is cute and all, but the examples I've given so far can
be implemented just as easily with your average ORM. Yet there is an important
difference (apart from having one less dependency): `enrich` provides for
easier ad-hoc flexibility, meaning: you can use the full power of SQL in the
enrichment queries.

Let's add some joins and filters to our example. Let's say we want the
following structure:

```json
[
    {
        "game_id": 7,
        "title": "Tsukihime",
        "platforms": [ "Windows", "Linux" ],
        "releases": [
            {
                "release_id": 339,
                "release_date": "2000-12-29",
                "platform_count": 2
            }
        ]
    },
    { ... }
]
```

That is, we want to have a per-game "platforms" array rather than per-release,
and instead of the platform list for releases, we just want a single
"platform_count". Let's say we want the list of releases ordered by release
date and we also want to be able to filter on the release date.  Here's a full
example to achieve all of that:

```perl
my @games = $db->query('SELECT game_id, title FROM games')->hashes;

# This is the per-game platforms list.
enrich_flatten 'platforms', 'game_id', sub {
    sql 'SELECT DISTINCT game_id, platform
           FROM releases
           JOIN releases_platforms USING (release_id)
          WHERE game_id IN', $_[0]
}, @games;

my $latest_date = '2004-01-20'; # Our user input

# This is the releases list.
enrich 'releases', 'game_id', sub {
    sql 'SELECT game_id, release_id, release_date,
                (SELECT COUNT(*) FROM releases_platforms rp
                  WHERE rp.release_id = r.release_id) AS platform_count
           FROM releases r
          WHERE game_id IN', $_[0], '
            AND release_date <=', \$latest_date, '
          ORDER BY release_date'
}, @games;
```

It's not the most elegant piece of code ever written, but it does the trick.
The alternatives aren't likely to be much better.


## Final notes

I've been using this approach - with a slightly different API - for the
experimental rewrite of [VNDB.org](https://vndb.org/) and I'm happy with the
result. The main advantage of the `enrich` family of functions is that you can
easily and efficiently run complex queries and merge the results to construct
complex JSON documents - ad-hoc and without any boilerplate, as there's no need
to define, declare or even name your data structure up front. This sometimes
makes it harder to follow the code, as you may have to visualize a monstrously
complex data structure in your mind, but this approach can be incredibly
helpful for quick prototyping. Good documentation and/or a good debugging
strategy can make up for lack of boilerplate.

The exact API for the `enrich()` functions is something I've not fully decided
on yet. I've not put that much thought into the order of its arguments, the
need for "variations" for different scenarios could be reduced by adding the
merge strategy as a separate argument, at which point named arguments may be
necessary to keep things readable, etc. The exact API doesn't matter all that
much, it's the idea that matters.

As I mentioned in the introduction: while I've been using Perl for these
examples, I'm convinced that this idea can be easily ported to other dynamic
languages. I did in fact experiment with porting this idea to Elixir and it was
every bit as convenient, in a way that
[Ecto](https://hexdocs.pm/ecto/Ecto.html)'s built-in ORM functionality wasn't.

It's quite possible that this abstraction is already available in some
languages and libraries and that I've just missed it. It's also possible that
this is, in fact, a terrible idea in the long run. Only way to find out is to
give it a try. On that note, if you feel like playing around with the examples
in this article, I wrote [a little script](/download/code/sqlobject.pl) to test
that the examples actually work, which also makes for a good base for quick
experimentation.


[^1]: This is a rather simplified schema inspired by
  [VNDB.org](https://vndb.org/) - the project that made me experiment with all
  this.
[^2]: I'm sure that at this point you'd call me out as a madman for suggesting
  that treating SQL as strings is an acceptable strategy for constructing
  dynamic queries, and I'd actually be sympathetic to that view - I'm well
  aware that it's rather brittle. So now you're going to try and convince me
  that your favorite query builder library is the *ultimate* solution to this
  problem and that everyone should use it. Except it's not available for Perl
  and/or it has severe limitations and/or it has so much boilerplate that it's
  an utter waste of time for quick prototyping. I've played around with query
  builders in various languages, but the only one I actually kinda liked was
  Elixir's [Ecto](https://hexdocs.pm/ecto/Ecto.html).  But alas, I'm stuck with
  Perl for now, and string interpolation isn't all that bad.
[^3]: I lied, this version does two more things: It removes duplicate
  identifiers from the `IN` clause and does an early return if there are no
  identifiers to fetch.