Rewrite to static site

With a complete reorganisation of the directory structure and most of
the content converted to pandoc-flavoured markdown.

Some TODOs left before this can go live:
- Main page
- Atom feeds
- Bug tracker
Yorhel 2019-03-23 11:52:08 +01:00
parent 5c85a7d32f
commit 6242b2ee9c
291 changed files with 4346 additions and 6141 deletions

60
dat/doc

@@ -1,60 +0,0 @@
=pod
I don't often write stuff. Certainly not enough to warrant a blog. But
sometimes I do feel the need to write down my thoughts. The results of those
rare occasions are published on this page.
=head2 Articles That May As Well Be Considered Blog Posts
=over
=item C<2017-05-28 > - L<An Opinionated Survey of Functional Web Development|https://dev.yorhel.nl/doc/funcweb>
The title says it all.
=item C<2014-07-29 > - L<The Sorry State of Convenient IPC|https://dev.yorhel.nl/doc/easyipc>
A long rant about IPC systems.
=item C<2014-01-09 > - L<Some Measurements on Direct Connect File Lists|https://dev.yorhel.nl/doc/dcstats>
A short measurement study on the file lists obtained from a Direct Connect hub.
Lots of graphs!
=item C<2012-02-15 > - L<A Distributed Communication System for Modular Applications|https://dev.yorhel.nl/doc/commvis>
In this article I explain a vision of mine, and the results of a small research
project aimed at realizing that vision.
=item C<2011-11-26 > - L<Multi-threaded Access to an SQLite3 Database|https://dev.yorhel.nl/doc/sqlaccess>
So you have a single database and some threads. How do you combine these in a
program?
=back
=head2 Longer Reports
=over
=item C<2014-06-10 > - L<Biased Random Periodic Switching in Direct Connect|https://dev.yorhel.nl/download/doc/brpsdc.pdf> (PDF)
My master's thesis.
=item C<2013-04-05 > - L<Peer Selection in Direct Connect|https://dev.yorhel.nl/download/doc/psdc.pdf> (PDF)
The rather long-ish literature study that preceded my master's thesis.
=item C<2010-06-02 > - L<Design and implementation of a compressed linked list library|https://dev.yorhel.nl/download/doc/compll.pdf> (PDF)
The report for the final project of my professional (HBO) bachelor of
Electrical Engineering. I was very liberal with some terminology in this
report. For example, "linked lists" aren't what you think they are, and I
didn't even use the term "locality of reference" where I really should have. It
was also written for an audience with little knowledge on the subject, so I
elaborated on a lot of things that should be obvious for most people in the
field. Then there is a lot of uninteresting overhead about the project itself,
which just happened to be mandatory for this report. Nonetheless, if you can
ignore these faults it's not such a bad read, if I may say so myself. :-)
=back

357
dat/doc/commvis.md Normal file

@@ -0,0 +1,357 @@
% A Distributed Communication System for Modular Applications
(Published on **2012-02-15**)
# Introduction
I have a vision. A vision in which rigid point-to-point IPC is replaced with a
far more flexible and distributed communication system. A vision in which
different components in the same program can interact with each other without
having to worry about each other's internal state. A vision where programs can
be designed in a modular way, without even worrying about whether to use
threads or an event-based model. A vision where every component communicates
with others, and where you can communicate with every component. And more
importantly, a vision in which each component can be implemented in a different
programming language, without the need for specific code to glue everything
together.
If that sounds interesting to you, then please read on. As a small research
project of mine, I've been looking into ways to realize the above vision, and I
believe to have found an answer. In this article I'll try to explain my ideas
and how they may be used to realize this vision.
My ideas have been heavily inspired by
[Linda](https://en.wikipedia.org/wiki/Linda_\(coordination_language\)). If
you're already familiar with that, then what I present here probably won't be
very revolutionary. Still, there are several aspects in which my ideas differ
significantly from Linda, so you won't be bored reading this. :-)
# The Concept
In this section I'll try to introduce the overall concept and some terminology.
This is going to be somewhat abstract and technical, but please bear with me.
I promise that things will get more interesting in the later sections.
Let me first define an abstract communications framework. We have a **network**
and a bunch of **sessions** connected to that network. Sessions can communicate
with each other through this network (that's usually what a network is for,
after all). These sessions do not have to be static: they may come and go.
Keep in mind that, for the purpose of explaining this concept, these terms are
very abstract: a session can be anything. A process, thread, a single function,
an object, or even your mobile phone. Anything. In the same way, the network is
nothing more than an abstract way to connect these sessions. It could be
sockets, pipes, an HTTP server, a broadcast network or just shared memory
between threads. If it allows sessions to communicate I'll call it a network.
Unlike many communication systems, this network does not have the concept of
_addresses_. There is no direct way for one session to identify another, and
indeed there is no need to do so for the purposes of communication. Instead,
the primary means of communication is by using **tuples** and patterns.
A tuple is an ordered set (list, array, whatever terminology you prefer) of
zero or more elements. Each element may have a different type, so it can hold
booleans, integers, floating point numbers, strings and even more complex data
structures such as arrays or maps. You may think of a tuple as an array in
[JSON](https://json.org/) notation, if that makes things easier to understand.
Sessions send and receive tuples to communicate with each other. On the sending
side, a session simply "passes" a tuple to the network. This is a non-blocking,
asynchronous operation. In fact, it makes no sense to make this a blocking
action, because the sender cannot know whether it will be received by any
other session anyway. The tuple may be received by many other sessions, or
there may not even be a single session interested in the tuple at all.
On the receiving side, sessions **register** patterns. A pattern itself is
mostly just a tuple, but with a more limited set of allowed types: only those
types for which exact matching makes sense, like booleans, integers and
strings. A pattern matches an incoming tuple if the first `n` elements of the
tuple exactly match the corresponding elements of the pattern. A special
_wildcard_ element may be used to match any value of any type.
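To make the rule concrete, here is a sketch in Go (the language of the experimental implementation mentioned at the end of this article); the `Match` function, `Wildcard` value and the `[]any` tuple representation are illustrative names I chose for this sketch, not a defined API:

```go
package main

import "fmt"

// wildcard stands in for the special wildcard element described above.
type wildcard struct{}

var Wildcard = wildcard{}

// Match reports whether a tuple matches a pattern: the first len(pattern)
// elements of the tuple must equal the corresponding pattern elements,
// except where the pattern holds Wildcard, which matches any value.
// Pattern elements are assumed to be exactly-matchable types
// (booleans, integers, strings).
func Match(pattern, tuple []any) bool {
	if len(tuple) < len(pattern) {
		return false
	}
	for i, p := range pattern {
		if p == Wildcard {
			continue
		}
		if p != tuple[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(Match([]any{"object", "add"}, []any{"object", "add", 42}))      // true: prefix matches
	fmt.Println(Match([]any{"object", Wildcard}, []any{"object", "delete", 1})) // true: wildcard matches "delete"
	fmt.Println(Match([]any{"object", "add"}, []any{"object", "delete", 1}))    // false
}
```

Note that a pattern shorter than the tuple still matches: only the pattern's own length is compared.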
A session thus only receives tuples if it has registered a matching
pattern for them. As mentioned, it is not illegal to send a tuple
for which no other sessions have registered. In this case, the tuple will just
be discarded. It is also possible that many sessions have registered for a
matching pattern, in which case all of these sessions will receive the tuple.
As an additional rule, if a session sends out a tuple that matches one of its
own patterns, then it will receive its own tuple. (However, programming
interfaces might allow this to be detected and/or disabled if this eases the
implementation of a session).
Finally, there is the concept of a **return-path**. Upon sending out a tuple, a
session may indicate that it is interested in receiving replies. The network
is then responsible for providing a return-path: a way for receivers of the
tuple to reply to it. When a tuple is received, the session has the option to
reply to it: a reply consists of one or more tuples that are sent directly to
the session from which the tuple originated, using this return-path. When a
receiver is done replying to the tuple or when it has no intention of sending
back a reply, it should close the return-path to indicate this. The session
that sent the original tuple is then notified that the return-path is closed,
and no more replies will be received. If there is no session that has
registered for the tuple, the return-path is closed immediately (or at least,
the sending session is notified that there won't be a reply). If the tuple is
received by multiple sessions, then the replies will be interleaved over the
return-path, and the path is closed when all of the receiving sessions have
closed their end.
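A minimal sketch of these return-path semantics, using Go channels; the `send` helper and `Handler` type are invented for illustration (the concept itself does not prescribe any API):

```go
package main

import "fmt"

// Handler receives a tuple and may send zero or more reply tuples on
// the return-path before returning; returning counts as closing its end.
type Handler func(tuple []any, reply chan<- []any)

// send delivers a tuple to all handlers and returns the return-path: a
// channel carrying the interleaved replies, closed once every receiving
// handler has closed its end. With no handlers at all, the channel is
// closed immediately, as described in the text.
func send(tuple []any, handlers []Handler) <-chan []any {
	replies := make(chan []any)
	done := make(chan struct{})
	for _, h := range handlers {
		go func(h Handler) {
			h(tuple, replies)
			done <- struct{}{}
		}(h)
	}
	go func() {
		for range handlers {
			<-done
		}
		close(replies)
	}()
	return replies
}

func main() {
	fs := func(tuple []any, reply chan<- []any) {
		reply <- []any{"ok", tuple[2]} // reply to the delete command
	}
	for r := range send([]any{"fs", "delete", "x"}, []Handler{fs}) {
		fmt.Println(r)
	}
}
```

The sender can either range over the channel (blocking until it is closed) or poll it asynchronously, which mirrors the choice described later in the RPC section.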
# Common design patterns and solutions
The previous section was rather abstract. This section provides several
examples on how to do common tasks and design patterns by using the previously
described concepts.
## Broadcast notifications
This is commonly implemented in OOP systems using the _Observer pattern_.
Implementing the same using tuples and patterns is an order of magnitude
simpler, as broadcast notifications are pretty much the native means of
communication.
In OOP you have the "observers" that can add themselves to the "observer list"
of any "object". This observer list is usually managed by the object that is to
be observed. If something happens to the object, it will walk through the
observer list and notify each observer.
If you represent an object as a session and define a notification as a tuple
that follows a certain pattern, then you can very easily achieve the same
functionality as with an OOP implementation. In fact, there are some advantages
to doing it this way:
- Sessions stay registered to the same notifications even if the "object" (the
session that is being observed) is restarted or replaced with something else.
It's the network itself that keeps track of the registrations, not the
sessions that provide the notifications. Of course, this can be seen as a
drawback, but you can easily emulate OOP behaviour by providing an extra
notification when the "object" is shut down, indicating that the observing
sessions can remove their patterns.
- Since there is no need for the session that is being observed to keep a list
of sessions that are observing it, it also doesn't have to walk the list and
send out multiple notifications. Notifying the observers is as simple as
sending out a single tuple.
- Many implementations of the Observer pattern maintain only a single list of
observers per object, and each listed observer will be notified for every
change to the object. For example, if an object maintains a list and provides
notifications when something is added to or deleted from the list, every observer
will be notified of both the "added" action and the "deleted" action. The use
of tuples and patterns allows observers to register for all actions, or just
for a single one. If an "add" action would be notified with a tuple of
`["object", "add", id]` and a "delete" action with `["object", "delete",
id]`, then an observing session can register with the pattern `["object",
*]` to be notified for both actions, or just `["object", "add"]` to
register only for additions.
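The observer variants above can be sketched with a toy in-process network in Go; `Network`, `Register` and `Send` are names invented for this sketch, and `matches` implements a wildcard-aware prefix match as described in the text:

```go
package main

import "fmt"

type wildcard struct{}

// W is the wildcard pattern element.
var W = wildcard{}

type registration struct {
	pattern []any
	notify  func([]any)
}

// Network is a toy in-process network: Send delivers a tuple to every
// registration whose pattern matches the tuple.
type Network struct{ regs []registration }

func (n *Network) Register(pattern []any, notify func([]any)) {
	n.regs = append(n.regs, registration{pattern, notify})
}

func (n *Network) Send(tuple []any) {
	for _, r := range n.regs {
		if matches(r.pattern, tuple) {
			r.notify(tuple)
		}
	}
}

// matches: the first len(pattern) tuple elements must equal the pattern,
// with W matching anything.
func matches(pattern, tuple []any) bool {
	if len(tuple) < len(pattern) {
		return false
	}
	for i, p := range pattern {
		if p != W && p != tuple[i] {
			return false
		}
	}
	return true
}

func main() {
	var net Network
	net.Register([]any{"object", W}, func(t []any) { fmt.Println("any action:", t) })
	net.Register([]any{"object", "add"}, func(t []any) { fmt.Println("adds only:", t) })
	net.Send([]any{"object", "add", 7})    // delivered to both observers
	net.Send([]any{"object", "delete", 7}) // delivered only to the wildcard observer
}
```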
Of course, this is only one way to implement a notification mechanism. There
are also solutions that more accurately mimic the behaviour of the Observer
pattern in OOP in cases where that is desired.
## Commands
A _command_ is what I call something along the lines of one session telling
another session to do something. Suppose we have a session representing a file
system. A command for this session could then be something like "delete file
X".
In a sense, this isn't much different from a notification as described above.
The file system session would have registered a pattern like
`["fs", "delete", *]`, where the wildcard is used for the file name. If an
other session then wants to have a file deleted, the only thing it will have to
do is send out a tuple matching that pattern, and the file system session will
take care of deleting it.
In the above scenario, the session sending the command has no feedback
whatsoever on whether the command has been successfully executed or not.
Whether this is acceptable depends of course on the specific application. One
way of still providing some form of feedback is to have the file system session
send out a notification tuple, e.g. `["fs", "deleted", "file"]` (Note that the
second element is now `deleted` rather than `delete`. Using the same tuple
for actions and notifications is going to be very messy...). This way the
session sending the command, in addition to any other sessions that happen to
be interested in file deletion, will be notified of the deletion of the file.
An alternative solution is to use the RPC-like method, as described below.
## RPC
[RPC](https://en.wikipedia.org/wiki/Remote_procedure_call) is in essence
nothing more than providing an interface similar to a regular function call to
a component that can't be reached via a regular function call (e.g. because the
object isn't inside the address space of the program). RPC is generally a
request-response type of interaction, and by making use of the return-path
facility as I described earlier, all of the functionality of RPC is also
available with the concept of tuple communication.
### Commands, the RPC-way
Take the previous file system example. Instead of just sending the command
tuple to delete the file, the session could indicate that it is interested in
replies and the network will create a return-path. If the return-path is closed
before any replies have been received, then the commanding session knows that
the file system session is either down or broken. Otherwise, the file system
session has the ability to send back a response. This could be a simple "okay,
file has been deleted" tuple if things went alright, or an error indication if
things didn't go too well. The commanding session has the option to either
block and wait for a reply (or a close of the return-path), or continue doing
whatever it wanted to do and asynchronously check for a reply.
The downside of using the return-path rather than the previously mentioned
notification approach is that other sessions can't easily be notified of file
deletion. Of course, another session can register for the same pattern as the
file system did and thus receive the same command, but it would have no way of
knowing whether the delete was actually successful or not. For other sessions
to be notified as well, the file system session would probably have to send out
a notification tuple. Of course, whether this is necessary depends on the
application; you only have to implement the functionality that is necessary
for your purposes.
### Requesting information
Another use of RPC, and thus also of the return-path, is to allow sessions to
request information from each other. Using the same example again, the file
system session could register for a pattern such as `["fs", "list"]`. Upon
receiving a tuple matching that pattern, the session would send a list of all
its files over the return-path. Other sessions can then request this list by
simply sending out the right tuple and waiting for the replies.
# Advantages over other systems
Now that I've hopefully convinced you that my communication concept is powerful
enough to build applications with it, you may be wondering why you should use
it instead of the other technologies. After all, you can achieve pretty much
the same functionality with just regular OOP, RPC, message passing, or other
systems. Let me present some of the inherent advantages that this system has
compared to others, and why it will help in designing flexible and modular
applications.
## Loose coupling of components
Sessions (representing the components of a system) do not have to have a lot of
knowledge about each other. Sessions implicitly provide abstracted _services_
using tuple communications, in much the same way as interfaces explicitly do in
OOP.
Very much unlike OOP, however, sessions do not even have to know how the
others should be used in threaded or event-based environments. For
example, threading in OOP is a pain: which objects should implement
synchronisation and which shouldn't? The answer to this question is not nearly
as obvious as it should be. With event-based systems, you'll always need to
worry about how long a certain function call will block the caller's thread. Since
communication between the different sessions is completely asynchronous, these
worries are gone.
## Location independence
Sessions can communicate with other sessions without knowing _where_ they are.
The major advantage is that a session can be moved around without having
to change a single line of code in any of the sessions relying on its service.
This allows sessions that communicate a lot with each other to be placed in the
same process, while resource-heavy sessions may be distributed among several
physical devices.
## Programming language independence
All communication is solely done with tuples, which can be represented as
abstract objects and serialized and deserialized (or marshalled/unmarshalled,
whichever terminology you prefer) for communication. I used a JSON array as an
example of a tuple earlier, and perhaps it's not such a bad one: JSON data can
be interchanged between many programming languages, and is often not that
cumbersome to work with. Still, there are many other alternatives (Bencoding, XML,
binary encodings, etc.), and it all depends on the exact data types and values
you wish to use for communication.
Language independence allows each session to be (re)implemented in a different
language, again without affecting any other sessions. Did you write an
application in a high-level language and noticed that performance wasn't as
good as you wanted? Then you can very easily rewrite the most resource-heavy
sessions in a low-level language such as C. Similarly, it allows developers to
hook into your application even when they are not familiar with your favorite
programming language.
## Easy debugging
Not only can other applications and/or plugins hook into your application, you
can also connect a simple debugger to the network. The debugger just has to
register for a pattern and then print out any received tuples, allowing you to
see exactly what is being sent over the network and whether the sessions react
as expected. Similarly, the debugger could allow you to send tuples back to the
network and see whether the sessions react as they should. Unfortunately, what
is being sent over a return-path is generally not visible to anyone but the
receiver of the replies, although a network implementation might allow a
debugging application to look into that as well.
# Where to go from here
What I've described above is nothing more than a bunch of ideas. To actually
use this, there's a lot to be done.
Defining a "tuple"
: What types can be used in tuples? Should a tuple have some maximum size or a
maximum number of elements? Should a `NULL` type be included? What about a
boolean type: why not use the integers 1 and 0 for that? Should it be possible
to interchange binary data, or only UTF-8 strings?
What will be the size of an integer that a session can reasonably assume to be
available? Specifying something like "infinite" is going to be either
inefficient in terms of memory and CPU overhead or require extra overhead
(in terms of code) in usage. Specifying that everything should fit in a 64-bit
integer is a lot more practical, but may be somewhat annoying to cope with in
many dynamically typed languages running on 32-bit architectures. Specifying
that integers are 32 bits will definitely ease the implementation of the network
library in interpreted languages, but lowers the usefulness of the integer type
and is still a pain to use in OCaml (which has 31-bit integers).
These choices greatly affect the ease of implementing a networking library for
specific programming languages and the ease of using the network to actually
develop an application.
The exact semantics of matching
: Somewhat similar to the previous point, the semantics of matching tuples with
patterns should also be defined in some way. One related question is whether
values of different types may be equivalent. For example, is the string
`"1234"` equivalent to an integer with that value? What about NULL and/or
boolean types? If there is a floating point type, you probably won't need exact
matching on those values (floating points are too imprecise for that anyway),
but you might still want the floating point number `10.0` to match the integer
`10` to ease the use in dynamic languages where the distinction between
integer and float is blurred.
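As one illustration of this design space, here is a Go sketch of an equivalence rule that answers these questions one particular way (numbers compare by value across integer and float, while strings never match numbers); the function names are mine, and this is a policy choice, not the definitive answer:

```go
package main

import "fmt"

// looseEqual is one possible policy for element equivalence: integers
// and floats compare by numeric value, everything else must match both
// type and value. Strings never equal numbers under this policy.
// (Converting through float64 loses precision for very large integers,
// which is itself one of the trade-offs discussed in the text.)
func looseEqual(a, b any) bool {
	af, aok := toFloat(a)
	bf, bok := toFloat(b)
	if aok && bok {
		return af == bf
	}
	return a == b
}

func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case float64:
		return n, true
	}
	return 0, false
}

func main() {
	fmt.Println(looseEqual(10.0, 10))     // true: numeric value matches
	fmt.Println(looseEqual("1234", 1234)) // false: string never equals a number
	fmt.Println(looseEqual("a", "a"))     // true
}
```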
Defining the protocol(s)
: Making my vision of modularity and ease of use a reality requires that any
session can easily communicate with another session, even if they have a
vastly different implementation. To do this, we need a protocol to connect
multiple processes together, whether they run on a local machine or over a
physical network.
Coding the stuff
: Obviously, all of this remains as a mere concept if nothing ever gets
implemented. Easy-to-use libraries are needed for several programming
languages. And more importantly, actual applications will have to be developed
using these libraries.
Of course, realizing all of the above is an iterative process. You can't write
an implementation without knowing what data types a tuple is made of, but it is
equally impossible to determine the exact definition of a tuple without having
experimented with an actual implementation.
## What's the plan?
I've been working on documenting the basics of the semantics and the
point-to-point communication protocol, and have started on an early
implementation in the Go programming language to experiment with. I've dubbed
the project **Tanja**, and have published my progress on a
[git repo](https://g.blicky.net/tanja.git/).
My intention is to also write implementations for C and Perl, experiment with
that, and see if I can refine the semantics to make this concept one that is
both efficient and easy to use.
Since I still have no idea whether this concept is actually a convenient one to
write large applications with, I'd love to experiment with that as well. My
original intention has always been to write a flexible client for the Direct
Connect network, possibly extending it to other P2P or chat networks in the
future. So I'd love to write a large application using this concept, and see
how things work out.
In either case, if this article managed to get you interested in this concept
or in project Tanja, and you have any questions, feedback or (gasp!) feel like
helping out, don't hesitate to contact me! I'm available as 'Yorhel' on Direct
Connect at `adc://blicky.net:2780` and IRC at `irc.synirc.net`, or just drop me
a mail at `projects@yorhel.nl`.

215
dat/doc/dcstats.md Normal file

@@ -0,0 +1,215 @@
% Some Measurements on Direct Connect File Lists
(Published on **2014-01-09**)
# Introduction
I've been working on Direct Connect related projects for a while now. This
includes maintaining [ncdc](/ncdc) and [Globster](/globster), and doing a bit
of research into improving the downloading performance and scalability (to be
published at some later date). Whether I'm writing code or trying to set up
experiments for research, there's one thing that helps a lot in making
decisions. Measurements from an actual network.
Because useful measurements are often missing, I decided to do some myself.
There's a lot to measure in an actual P2P network, but I restricted myself to
information that can be gathered quite easily from file lists.
# Obtaining the Data
Different hubs will likely have totally different patterns in terms of what is
being shared. In order to keep this experiment simple, I limited myself to a
single hub. And in order to get as much data as possible, I chose the hub that
is commonly known as "Ozerki", famous for being one of the larger hubs in
existence.
My approach to getting as many file lists as possible from this hub was perhaps
a bit too simple. I simply modified ncdc to have an "add the file list from all
users to the download queue" key, and to save all downloaded lists to a
directory instead of opening them.
I started this downloading process on a Monday around noon when there were a
little over 11k users online. I hit my hacked download-all-filelists-key two
more times later that day in order to get the file lists from those users who
joined the hub at a later time. I let this downloading process run until
the evening.
One thing I learned from this experience was that the downloading algorithm in
ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try
to open a connection with **all** users listed in the download queue. You can
imagine that trying to connect to 11k users simultaneously put a significantly
heavier load on the hub than would have been necessary. Not good. Not something
a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to
mind too much and handled the load fine. This might have been because Mondays
are typically not the busiest days in P2P land. Weekends tend to be busier.
Despite that scalability issue, I successfully managed to download the file
lists of almost everyone who remained online for long enough to finally get
their list downloaded. In total I managed to download 14143 file lists (that's
one list too many for `10000*sqrt(2)`, I should have stopped the process a bit
earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
For obvious reasons, I won't be sharing my modifications to ncdc. I already
tarnished the reputation of ncdc enough in that single day. If you wish to
repeat this experiment, please do so with a scalable downloading
implementation. :-)
# Obtaining the Stats
And then comes the challenge of aggregating statistics on 6.5 GiB of compressed
XML files. This didn't really sound like much of a challenge. After all, all
one needs to do is decompress the file lists, do some XML parsing and update
some values. Most of the CPU time in this process would likely be spent on
bzip2 decompression, so I figured I'd just pipe the output of
[bzcat(1)](https://manned.org/bzcat) to a Perl script and be done with it.
To get the statistics on the sizes and the distribution of unique files, a data
structure containing information on all unique files in the lists was
necessary. Perl being the perfect language for data manipulation, I made use of
its great support for hash tables to store this information. It turned out,
rather unsurprisingly, that Perl isn't all that conservative with respect to
memory usage. Neither my 4GB of RAM nor the extra 4GB of swap turned out to be
enough to run the script to completion. I tried rewriting the script to use a
disk-based data structure, but that slowed things down to a crawl. Some other
solution was needed.
When faced with such a problem, some people will try to optimize the algorithm,
others will throw extra hardware at it, and I did what I do best: Optimize away
the constants. That is, I rewrote the data analysis program in C. Using the
excellent [khash](https://github.com/attractivechaos/klib) hash table library
to keep track of the file information and the equally awesome [yxml](/yxml)
library (a little bit of self-promotion doesn't hurt, right?) to do the XML
parsing, I was able to do all the necessary processing in 30 minutes using at
most 3.6GB of RAM.
Long story short, here's my analysis program:
[dcfilestats.c](https://g.blicky.net/dcstats.git/tree/dcfilestats.c).
# A Look at the Stats
Some lists didn't decompress/parse correctly, so the actual number of file
lists used in these stats is **14137**. The total compressed size of these
lists is **6,945,269,469** bytes (6.5 GiB), and uncompressed **25,533,519,352**
bytes (24 GiB). In total these lists mentioned **197,413,253** files. After
taking duplicate listings in account, there's still **84,131,932** unique
files.
And now for some graphs...
## Size of the File Lists
Behold, the compressed and uncompressed size of the downloaded file lists:
![](/img/dclistsize.png)
Nothing too surprising here, I guess. 100 KiB seems to be a common size for a
compressed file list, but lists of 1 MiB aren't too weird, either. The largest
file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The
uncompressed size of a list tends to be (\*gasp\*) a bit larger, but we can't
easily infer the compression ratio from this graph. Hence, another graph:
![](/img/dclistcomp.png)
Most file lists compress to about 24% - 35% of their original size. This seems
to be consistent with [similar
measurements](http://forum.dcbase.org/viewtopic.php?f=18&t=667) done in 2010.
The raw data for these graphs is found in
[dclistsize](https://g.blicky.net/dcstats.git/tree/dclistsize), which lists the
compressed and uncompressed size, respectively, for each file list. The gnuplot
script for the first graph is
[dclistsize.plot](https://g.blicky.net/dcstats.git/tree/dclistsize.plot) and
[dclistcomp.plot](https://g.blicky.net/dcstats.git/tree/dclistcomp.plot) for the
second.
## Number of Files Per List
So how many files are people sharing? Let's find out.
![](/img/dcnumfiles.png)
As expected, this graph looks very similar to the one about the size of the
file list. The size of a list tends to be linear in the number of items it
holds, after all.
The raw data for this graph is found in
[dcnumfiles](https://g.blicky.net/dcstats.git/tree/dcnumfiles), which lists the
unique and total number of files, respectively, for each file list. The gnuplot
script is
[dcnumfiles.plot](https://g.blicky.net/dcstats.git/tree/dcnumfiles.plot).
## File Sizes
And how large are the files being shared? Well,
![](/img/dcfilesize.png)
This graph is fun, and rather hard to explain without knowing what kind of
files we're dealing with. I'm not going to do any further analysis on what kind
of files these file sizes represent exactly, but I am going to make some
guesses. The files below 1 MiB could be anything, text files, images,
subtitles, source code, etc. And considering that the hub in question doesn't
put a whole lot of effort into weeding out spammers and bots, it's likely that
some malicious users will be sharing small variations of the same virus within
the 100 KiB range. The peak of files between 7 and 10 MiB likely consists of
audio files. The number of files larger than, say, 20 MiB drops significantly,
but there are still a few million files in the 20 MiB to 1 GiB range.
I cut off the graph after 10 GiB, but there's apparently someone who claims to
share a file between 1 and 2 TiB (don't know the exact size due to the
binning). Since I can't imagine why someone would share a file that large, I
expect it to be a fake file list entry. Note that there could be more fakes in
my data set. I can't tell which files are fake and which are genuine from the
information in the file lists, but I don't expect the number of fake files to
be very significant.
The "raw" data for this graph is found in
[dcfilesize](https://g.blicky.net/dcstats.git/tree/dcfilesize). Because I wasn't
interested in dealing with a text file of 84 million lines, the data is already
binned. The first column is the bin number and the second column the number of
unique files in that bin. The file sizes that each bin represents are between
`2^(bin+9)` and `2^(bin+10)`, with the exception of bin 0, which starts at a
file size of 0. The source of the gnuplot script is
[dcfilesize.plot](https://g.blicky.net/dcstats.git/tree/dcfilesize.plot).
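The binning rule can be restated as a small helper function; this Go sketch is mine and not part of the published analysis program or plot scripts:

```go
package main

import "fmt"

// binBounds returns the file-size range [lo, hi) covered by a bin in
// the dcfilesize data: bin b covers 2^(b+9) up to 2^(b+10) bytes,
// except bin 0, which starts at a file size of 0.
func binBounds(bin int) (lo, hi uint64) {
	hi = 1 << (bin + 10)
	if bin > 0 {
		lo = 1 << (bin + 9)
	}
	return
}

func main() {
	for _, b := range []int{0, 1, 14} {
		lo, hi := binBounds(b)
		fmt.Printf("bin %2d: %d .. %d bytes\n", b, lo, hi)
	}
}
```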
## Distribution of Files
Another interesting thing to measure is how often files are shared. That is,
how many users have the same file?
![](/img/dcfiledist.png)
Many files are only available from a single user. That's not really a good sign
when you wish to download such a file, but luckily there are also tons of files
that _are_ available from multiple users. What is interesting in this graph
isn't that it follows the [power law](https://en.wikipedia.org/wiki/Power_law),
but wondering what those outliers could possibly be. There's a collection
of 269 files that is shared among 831 users, and there appears to be a
similar group of around 510-515 files that is shared among 20 or so users. I
honestly have no idea what those collections could be. Well, yes, I could
probably figure that out from the file lists, but my analysis program doesn't
tell me which files it's talking about and I'm too lazy to fix that.
The graph has been clipped to 600, but there's another interesting outlier: a
single file that is shared by 5668 users. I'm going to guess that this is
the empty file. There are so many ways to get an empty file somewhere in your
filesystem, after all.
The raw data for this graph is found in
[dcfiledist](https://g.blicky.net/dcstats.git/tree/dcfiledist), which lists the
number of times shared and the aggregate number of files. The gnuplot script is
[dcfiledist.plot](https://g.blicky.net/dcstats.git/tree/dcfiledist.plot).
# Final Notes
So, erm, what conclusions can we draw from this? That stats are fun, I guess.
If anyone (including me) is going to repeat this experiment on a fresh data
set, make sure to use a more scalable downloading process than I did. My
approach shouldn't be repeated if we wish to keep the Direct Connect network
alive.
Furthermore, keep in mind that this is just a snapshot of a single day on a
single hub. The graphs may look very different when the file lists are
harvested at some other time. And it's also quite likely that different hubs
will have very different share profiles. It could be interesting to try and
graph everything, but I don't have _that_ kind of free time.
% The Sorry State of Convenient IPC
(Published on **2014-07-29**)
# The Problem
How do you implement communication between two or more processes? This is a
question that has been haunting me for at least 6 years now. Of course, this
question is very broad and has many possible answers, depending on your
scenario. So let me get more specific by describing the problem I want to
solve.
What I want is to write a daemon process that runs in the background and can be
controlled from other programs or libraries. The intention is that people can
easily write custom interfaces or quick scripts to control the daemon. The
service that the daemon offers over this communication channel can be thought
of as its primary API; in this way, the daemon acts as a persistent
programming library. This concept is similar to existing programs such as
[btpd](https://github.com/btpd/btpd), [MPD](http://www.musicpd.org/),
[Transmission](https://www.transmissionbt.com/) and
[Telepathy](http://telepathy.freedesktop.org/wiki/) - I'll get back to these
later.
More specifically, the most recent project I've been working on that follows
this pattern is [Globster](/globster), a remotely controllable Direct Connect
client (if you're not familiar with Direct Connect, think of it as IRC with
some additional file sharing capabilities built in). While the problem I
describe is not specific to Globster, it still serves as an important use case.
I see many other projects with similar IPC requirements.
The IPC mechanism should support two messaging patterns: Request/response and
asynchronous notifications. The request/response pattern is what you typically
get in RPC systems - the client requests something of the daemon and the daemon
then replies with a response. Asynchronous notifications are useful in allowing
the daemon to send asynchronous status updates to the client, such as incoming
chat messages or file transfer status. Lack of support for such notifications
would mean that a client needs to continuously poll for updates, which is
inefficient.
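As a purely hypothetical illustration of these two patterns (this is not any real protocol), imagine newline-delimited JSON messages: a request carries an `id` that the matching response echoes back, while a notification carries no `id` and can arrive at any time:

```python
import json

# Request/response: the client tags each request with an id so it can
# match the daemon's reply even when other messages arrive in between.
request = {"id": 1, "method": "transfer_status", "params": []}
response = json.loads('{"id": 1, "result": {"downloads": 3}}')
assert response["id"] == request["id"]

# Asynchronous notification: no id, the daemon pushes it whenever the
# event happens, so the client never needs to poll.
notification = json.loads('{"event": "chat_message", "data": "hello"}')
assert "id" not in notification

print(json.dumps(request), response["result"], notification["event"])
```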
So what I'm looking for is a high-level IPC mechanism that handles this
communication. Solutions are evaluated by the following criteria, in no
particular order.
**Easy**
: And with _easy_ I refer to _ease of use_. As mentioned above, other people
should be able to write applications and scripts to control the daemon. Not
many people are willing to invest days of work just to figure out how to
communicate with the daemon.
**Simple**
: Simplicity refers to the actual protocol and the complexity of the code
necessary to implement it. Complex protocols require complex code, and complex
code is hard to maintain and will inevitably contain bugs. Note that _simple_
and _easy_ are very different things and often even conflict with each other.
**Small**
: The IPC implementation shouldn't be too large, and shouldn't depend on huge
libraries. If you need several megabytes worth of libraries just to send a few
messages over a socket, you're doing it wrong.
**Language independent**
: Control the daemon with whatever programming language you're familiar with.
**Networked**
: A good solution should be accessible from both the local system (daemon running
on the same machine as the client) and from the network (daemon and client
running on different machines).
**Secure**
: There are three parts to a secure IPC mechanism. One part is realizing that
IPC operates at a _trust boundary_: the daemon can't blindly trust
everything the client says and vice versa, so message validation and other
mechanisms to prevent DoS or information disclosure on either side are
necessary.
Then there's the matter of _confidentiality_. On a local system, UNIX sockets
will provide all the confidentiality you can get, so that's trivial. Networked
access, on the other hand, requires some form of transport layer security.
And finally, we need some form of _authentication_. There should be some
mechanism to prevent just about anyone from connecting to the daemon. A
coarse-grained solution such as file permissions on a local UNIX socket or a
password-based approach for networked access will do just fine for most
purposes. Really, just keep it simple.
**Fast**
: Although performance isn't really a primary goal, the communication between the
daemon and the clients shouldn't be too slow or heavyweight. For my purposes,
anything that supports about a hundred messages a second on average hardware
will do perfectly fine. And that shouldn't be particularly hard to achieve.
**Proxy support**
: This isn't really a hard requirement either, but it would be nice to allow
other processes (say, plugins of the daemon, or clients connecting to the
daemon) to export services over the same IPC channel as the main daemon. This
is especially useful in implementing a cross-language plugin architecture. But
again, not a hard requirement, because even if the IPC mechanism doesn't
directly support proxying, it's always possible for the daemon to implement
some custom APIs to achieve the same effect. This, however, requires extra work
and may not be as elegant as a built-in solution.
Now let's discuss some existing solutions...
# Custom Protocol
Why use an existing IPC mechanism in the first place when all you need is
UNIX/TCP sockets? This is the approach taken by
[btpd](https://github.com/btpd/btpd), [MPD](http://www.musicpd.org/)
([protocol spec](http://www.musicpd.org/doc/protocol/index.html)) and older
versions of Transmission (see their [1.2x
spec](https://trac.transmissionbt.com/browser/branches/1.2x/doc/ipcproto.txt)).
Btpd hasn't taken the time to document the protocol format, suggesting it's
not really intended to be used as a convenient API (other than through their
btcli), and Transmission has since changed to a different protocol. I'll mainly
focus on MPD here.
MPD uses a text-based request/response mechanism, where each request is a
simple one-line command and a response consists of one or more lines, ending
with an `OK` or `ACK` line. There's no support for asynchronous
notifications, although that could obviously have been implemented, too. Let's
grade this protocol...
**Easy?** Not really.
: Although MPD has conventions for how messages are formatted, each individual
message still requires custom parsing and validation. This can be automated by
designing an
[IDL](https://en.wikipedia.org/wiki/Interface_description_language) and
accompanying code generator, but writing one specific for a single project
doesn't seem like a particularly fun task.
The protocol, despite its apparent simplicity, is evidently painful enough to
use that there is a special _libmpdclient_ library to abstract away the
communication with MPD, and interfaces to this library are available in many
programming languages. If you have access to such an application-specific
library for your language of choice, then sure, using the IPC mechanism is easy
enough. But that applies to literally any IPC mechanism.
Ideally, such a library needs to be written only once for the IPC mechanism in
use, and after that no additional code is needed to communicate with
services/daemons using that particular IPC mechanism. Code re-use among
different projects is great, yo. The per-project approach also doesn't scale
very well when extending the services offered by the daemon: any addition to
the API will require modifications to all implementations.
**Simple?** Definitely.
: I only needed a quick glance at the MPD protocol reference and I was able to
play a bit with telnet and control my MPD. Writing an implementation doesn't
seem like a complex task. Of course, this doesn't necessarily apply to all
custom protocols, but you can make it as simple or complex as you want it to
be.
**Small?** Sure.
: This obviously depends on how elaborate you design your protocol. If you have a
large or complex API, the size of a generic message parser and validator can
easily compensate for the custom parser and validator needed for each custom
message. But for simple APIs, it's hard to beat a custom protocol in terms of
size.
**Language independent?** Depends.
: Of course, a socket library is available in most programming languages, and in
that sense any IPC mechanism built on sockets is language independent. The real
question, then, is how convenient it is to speak the protocol directly rather
than through a library that abstracts it away.
In the case of MPD, the text-based protocol seems easy enough to use directly
from most languages, yet for some reason most people prefer language-specific
libraries for MPD.
If you design a binary protocol or anything more complex than simple
request/response message types, using your protocol directly is going to be a
pain in certain languages, and people will definitely want a library specific
to your daemon for their favourite programming language. Something you'll want
to avoid, I suppose.
**Networked?** Sure enough.
: Just a switch between UNIX sockets and TCP sockets. Whether a simple solution
like that is a good idea, however, depends on the next point...
**Secure?** Ugh.
: Security is hard to get right, so having an existing infrastructure that takes
care of most security sensitive features will help a lot. Implementing your own
protocol means that you also have to implement your own security, to some
extent at least.
Writing code to parse and validate custom messages is error-prone, and a bug in
this code could make both the daemon and the client vulnerable to crashes and
buffer overflows. A statically-typed abstraction that handles parsing and
validation would help a lot.
For networked communication, you'll need some form of confidentiality. MPD does
not seem to support this, so any networked access to an MPD server is
vulnerable to passive observers and MITM attacks. This may be fine for a local
network (presumably what it is intended to be used for), but certainly doesn't
work for exposing your MPD control interface to the wider internet. Existing
protocols such as TLS or SSH can be used to create a secure channel, but these
libraries tend to be large and hard to use securely. This is especially true
for TLS, but at least there's [stunnel](https://www.stunnel.org/) to simplify
the implementation - at the cost of less convenient deployment.
In terms of authentication, you again need to implement this yourself. MPD
supports authentication using a plain-text password. This is fine for a trusted
network, but on an untrusted network you certainly want confidentiality to
prevent a random observer from reading your password.
**Fast?** Sure.
: Existing protocols may have put more effort into profiling and implementing
various optimizations than one would typically do with a custom and
quickly-hacked-together protocol, but still, it probably takes effort to design
a protocol that isn't fast enough.
**Proxy support?** Depends...
: Really depends on how elaborate you want to be. It can be very simple if all
you want is to route some messages, it can get very complex if you want to
ensure that these messages follow some format or if you want to reserve certain
interfaces or namespaces to certain clients. What surprised me about the MPD
protocol is that it actually has [some support for
proxying](http://www.musicpd.org/doc/protocol/ch03s11.html). But considering the
ad-hoc nature of the MPD protocol, the primitiveness and simplicity of this
proxy support wasn't too surprising. Gets the job done, I suppose.
Overall, and as a rather obvious conclusion, a custom protocol really is what
you make of it. In general, though, it's a lot of work, not always easy to use,
and a challenge to get the security part right.
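That said, the "simple" grade is easy to demonstrate. A client-side parser for MPD-style `key: value` responses fits in a handful of lines (a rough sketch, not taken from libmpdclient):

```python
def parse_response(lines):
    """Parse an MPD-style response: 'key: value' lines terminated by a
    final 'OK' line, with an 'ACK ...' line signalling an error."""
    fields = {}
    for line in lines:
        if line == "OK":
            return fields
        if line.startswith("ACK"):
            raise RuntimeError(line)
        key, _, value = line.partition(": ")
        fields[key] = value
    raise RuntimeError("connection closed mid-response")

# Roughly what a 'status' command returns:
print(parse_response(["volume: 80", "state: play", "OK"]))
```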
# D-Bus
D-Bus is being used in [Transmission](https://www.transmissionbt.com/) and is
what I used for [Globster](/globster).
On a quick glance, D-Bus looks _perfect_. It is high-level, has the messaging
patterns I described, the [protocol
specification](http://dbus.freedesktop.org/doc/dbus-specification.html) does
not seem _overly_ complex (though certainly could be simplified), it has
implementations for a number of programming languages, has support for
networking, proxying is part of normal operation, and it seems fast enough for
most purposes. When you actually give it a closer look, however, reality isn't
as rose-colored.
D-Bus is designed for two very specific use-cases. One is to allow local
applications to securely interact with system-level daemons such as
[HAL](https://en.wikipedia.org/wiki/HAL_\(software\)) (now long dead) and
[systemd](http://freedesktop.org/wiki/Software/systemd/), and the other
use-case is to allow communication between different applications inside one
login session. As such, on a typical Linux system there are two D-Bus daemons
where applications can export interfaces and where messages can be routed
through. These are called the _system bus_ and the _session bus_.
**Easy?** Almost.
: The basic ideas behind D-Bus seem easy enough to use. The fact that it has
type-safe messages, interface descriptions and introspection really helps in
making D-Bus a convenient IPC mechanism.
The main reason I think D-Bus isn't all that easy to use in practice is the
lack of good introductory documentation and the crappy state of
the various D-Bus implementations. There is a [fairly good
article](https://pythonhosted.org/txdbus/dbus_overview.html) providing a
high-level overview to D-Bus, but there isn't a lot of material that covers how
to actually use D-Bus to interact with applications or to implement a service.
As for the implementations, I have had rather bad experiences with the actual
libraries. I've personally used the official libdbus-1, which markets itself as a
"low-level" library designed to facilitate writing bindings for other
languages. In practice, the functionality that it offers appears to be too
high-level for writing bindings ([GDBus](https://developer.gnome.org/glib/)
doesn't use it for this reason), and it is indeed missing a lot of
functionality to make it convenient to use directly. I've also played around
with Perl's [Net::DBus](http://search.cpan.org/perldoc?Net%3A%3ADBus) and was
highly disappointed. Not only is the documentation rather incomplete, the
actual implementation has more bugs than features. And instead of building on
top of one of the many good event loops for Perl (such as
[AnyEvent](http://search.cpan.org/perldoc?AnyEvent)), it chooses to implement
[its own event
loop](http://search.cpan.org/perldoc?Net%3A%3ADBus%3A%3AReactor). The existence
of several different libraries for Python doesn't inspire much confidence,
either.
I was also disappointed in terms of the available tooling to help in the
development, testing and debugging of services. The [gdbus(1)](http://man.he.net/man1/gdbus) tool is useful
for monitoring messages and scripting some things, but is not all that
convenient because D-Bus has too many namespaces and the terrible Java-like
naming conventions make typing everything out a rather painful experience.
[D-Feet](http://live.gnome.org/DFeet/) offers a great way to explore services,
but lacks functionality for quick debugging sessions. I [made an
attempt](http://g.blicky.net/dbush.git/) to write a convenient command-line
shell, but lost interest halfway. :-(
D-Bus has the potential to be an easy and convenient IPC mechanism, but the
lack of any centralized organization to offer good implementations,
documentation and tooling makes it a pain to use.
**Simple?** Not quite.
: D-Bus is conceptually easy and the message protocol is alright, too. Some
aspects of D-Bus, however, are rather more complex than they need to be.
I have once made an attempt to fully understand how D-Bus discovers and
connects to the session bus, but I gave up halfway because there are too
many special cases. To quickly summarize what I found, there's the
`DBUS_SESSION_BUS_ADDRESS` environment variable which could point to the
(filesystem or abstract) path of a UNIX socket or a TCP address. If that
variable isn't set, D-Bus will try to connect to your X server and get the
address from that. In order to avoid linking everything against X
libraries, a separate [dbus-launch](http://man.he.net/man1/dbus-launch)
utility is spawned instead. Then the bus address could also be obtained
from a file in your `$HOME/.dbus/` directory, with added complexity to
still support a different session bus for each X session. I've no idea how
exactly connection initiation to the system bus works, but my impression is
that a bunch of special cases exist there, too, depending on which init
system your OS happens to use.
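The first step of that dance is at least straightforward on its own: a single D-Bus address is a transport name followed by comma-separated `key=value` pairs. A sketch of the parsing (ignoring percent-escapes and the semicolon-separated fallback addresses a full implementation must also handle):

```python
def parse_dbus_address(address):
    """Split a single D-Bus address such as 'unix:path=/run/user/1000/bus'
    into its transport name and key/value parameters. Percent-escapes and
    ';'-separated fallback lists are omitted for brevity."""
    transport, _, rest = address.partition(":")
    params = dict(kv.split("=", 1) for kv in rest.split(",") if kv)
    return transport, params

print(parse_dbus_address("unix:path=/run/user/1000/bus"))
print(parse_dbus_address("tcp:host=localhost,port=4000"))
```

A real client would read this from `DBUS_SESSION_BUS_ADDRESS` and only then fall back to the X server or `$HOME/.dbus/` lookups described above.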
As if all the options in connection initiation aren't annoying enough,
there's also work on [kdbus](https://lwn.net/Articles/580194/), a Linux
kernel implementation to get better performance. Not only will kdbus use a
different underlying communication mechanism, it will also switch to a
completely different serialization format. If/when this becomes widespread
you will have to implement and support two completely different protocols
and pray that your application works with both.
On the design aspect there is, in my opinion, needless complexity with
regards to naming and namespaces. First there is a global namespace for
_bus names_, which are probably better called _application names_, because
that's usually what they represent. Then, there is a separate _object_
namespace local to each bus name. Each object has methods and properties,
and these are associated with an _interface name_, in a namespace specific
to the particular object. Despite these different namespaces, the
convention is to use a full and globally unique path for everything that
has a name. For example, to list the IM protocols that Telepathy supports,
you call the `ListProtocols` method in the
`org.freedesktop.Telepathy.ConnectionManager` interface on the
`/org/freedesktop/Telepathy/ConnectionManager` object at the
`org.freedesktop.Telepathy` bus. Fun times indeed. I can understand the
reasoning behind most of these choices, but in my opinion they found the
wrong trade-off.
Another point of complexity that annoys me is the fact that an XML format
is used to describe interfaces. Supporting XML as an IDL format is alright,
but requiring a separate format for an introspection interface gives me the
impression that the message format wasn't powerful enough for such a simple
purpose. The direct effect of this is that any application wishing to use
introspection data will have to link against an XML parser, and almost all
conforming XML parser implementations are as large as the D-Bus
implementation itself.
**Small?** Kind of.
: `libdbus-1.so.3.8.6` on my system is about 240 KiB. It doesn't cover parsing
interface descriptions or implementing a D-Bus daemon, but still covers most of
what is needed to interact with services and to offer services over D-Bus.
It's not _that_ small, but then again, libdbus-1 was not really written with
small size in mind. There's room for optimization.
**Language independent?** Sure.
: D-Bus libraries exist for a number of programming languages.
**Networked?** Half-assed.
: D-Bus _officially_ supports networked connections to a D-Bus daemon. Actually
using this, however, is painful. Convincing
[dbus-daemon(1)](http://man.he.net/man1/dbus-daemon) to accept connections
on a TCP socket involves disabling all authentication (it expects UNIX
credential passing, normally) and requires adding an undocumented
`<allow_anonymous/>` tag in the configuration (I only figured this out from
reading the source code).
Even when you've gotten that to work, there is the problem that D-Bus isn't
totally agnostic to the underlying socket protocol. D-Bus has support for
passing UNIX file descriptors over the connection, and this of course doesn't
work over TCP. While this feature is optional and easily avoided, some services
(I can't find one now) use UNIX fds in order to keep track of processes that
listen to a certain event. Obviously, those services can't be accessed over the
network.
**Secure?** Only locally.
: D-Bus has statically typed messages that can be validated automatically, so
that's a plus.
For local authentication, there is support for standard UNIX permissions and
credential passing for more fine-grained authorization. For remote
authentication, I think there is support for a shared secret cookie, but I
haven't tried to use this yet.
There is, as with MPD, no support at all for confidentiality, so using
networked D-Bus over an untrusted network would be a very bad idea anyway.
**Fast?** Mostly.
: The messaging protocol is fairly lightweight, so no problems there. I do have
to mention two potential performance issues, however.
The first issue is that the normal mode of operation in D-Bus is to proxy all
messages through an intermediate D-Bus daemon. This involves extra context
switches and message parsing passes in order to get one message from
application A to application B. I believe it is _officially_ supported to
bypass this daemon and to communicate directly between two processes, but after
my experience with networking I am wary of trying anything that isn't part of
how D-Bus is _intended_ to be used. This particular performance issue is what
kdbus addresses, so I suppose it won't apply to future Linux systems.
The other issue is that a daemon that provides a service over D-Bus does not
know whether there exists an application that is interested in receiving its
notifications. This means that the daemon always has to spend resources to send
out notification messages, even if no application is actually interested in
receiving them. In practice this means that the notification mechanism is
avoided for events that may occur fairly often, and an equally inefficient
polling approach has to be used instead. It is possible for a service provider
to keep track of interested applications, but this is not part of the D-Bus
protocol and not something you would want to implement for each possible event.
I've no idea if kdbus addresses this issue, but it would be stupid not to.
**Proxy support?** Yup.
: It's part of normal operation, even.
D-Bus has many faults, some of them are by design, but many are fixable. I
would have contributed to improving the situation, but I get the feeling that
the goals of the D-Bus maintainers are not at all aligned with mine. My
impression is that the D-Bus maintainers are far too focussed on their own
specific needs and care little about projects with slightly different needs.
Especially with the introduction of kdbus, I now consider D-Bus too complex to
be worth the effort of improving. Starting from scratch seems like less work.
# JSON/XML RPC
While I haven't extensively used JSON-RPC or XML-RPC myself, it's still an
interesting alternative to study.
[Transmission](https://www.transmissionbt.com/) uses JSON-RPC
([spec](https://trac.transmissionbt.com/browser/trunk/extras/rpc-spec.txt)) as
its primary IPC mechanism, and [RTorrent](http://rakshasa.github.io/rtorrent/)
has support for an optional XML-RPC interface. (Why do I keep referencing
torrent clients? Surely there are other interesting applications? Oh well.)
The main selling point of HTTP-based IPC is that it is accessible from
browser-based applications, assuming everything has been set up correctly. This
is a nice advantage, but lack of this support is not really a deal-breaker for
me. Browser-based applications can still use any other IPC mechanism, as long
as there are browser plugins or some form of proxy server that converts the
messages of the IPC mechanism to something that is usable over HTTP. For
example, both solutions exist for D-Bus, in the form of the [Browser DBus
Bridge](http://sandbox.movial.com/wiki/index.php/Browser_DBus_Bridge) and
[cloudeebus](https://github.com/01org/cloudeebus). Of course, such solutions
typically aren't as convenient as native HTTP support.
Since HTTP is, by design, purely request-response, JSON-RPC and XML-RPC don't
generally support asynchronous notifications. It's possible to still get
asynchronous notifications by using
[WebSockets](https://en.wikipedia.org/wiki/WebSocket) (Ugh, opaque stream
sockets, time to go back to our [custom protocol](#custom-protocol)) or by
having the client implement an HTTP server itself and send its URL to the
service provider (This is known as a
[callback](https://duckduckgo.com/?q=web%20service%20callback) in the
[SOAP](https://en.wikipedia.org/wiki/SOAP) world. I have a lot of respect for
developers who can put up with that crap). As I already hinted, neither
solution is simple or easy.
Let's move on to the usual grading...
**Easy?** Sure.
: The ubiquity of HTTP, JSON and XML on the internet means that most developers
are already familiar with using it. And even if you aren't, there are so many
easy-to-use and well-documented libraries available that you're ready to go in
a matter of minutes.
Although interface description languages/formats exist for XML-RPC (and
possibly for JSON-RPC, too), I get the impression these are not often used
outside of the SOAP world. As a result, interacting with such a service tends
to be weakly/stringly typed, which, I imagine, is not as convenient in strongly
typed programming languages.
**Simple?** Not really.
: Many people have the impression that HTTP is somehow a simple protocol. Sure,
it may look simple on the wire, but in reality it is a hugely bloated and
complex protocol. I strongly encourage everyone to read through [RFC
2616](https://tools.ietf.org/html/rfc2616) at least once to get an idea of its
size and complexity. To make things worse, there's a lot of recent activity to
standardize on a next generation HTTP
([SPDY](https://en.wikipedia.org/wiki/SPDY) and [HTTP
2.0](https://en.wikipedia.org/wiki/HTTP_2.0)), but I suppose we can ignore these
developments for the foreseeable future for the use case of IPC.
Of course, a lot of the functionality specified for HTTP is optional and can be
ignored for the purpose of IPC, but that doesn't mean that these options don't
exist. When implementing a client, it would be useful to know exactly which
HTTP options the server supports. It would be wasteful to implement compression
support if the server doesn't support it, or keep-alive, or content
negotiation, or ranged requests, or authentication, or correct handling for all
response codes when the server will only ever send 'OK'. What also commonly
happens is that server implementors want to support as much as possible, to the
point that you can have JSON or XML output, depending on what the client
requested.
XML faces a similar problem. The format looks simple, but the specification has
a bunch of features that hardly anyone uses. In contrast to HTTP, however, a
correct XML parser can't just decide to not parse `<!DOCTYPE ..>` stuff,
so it _has_ to implement some of this complexity.
On the upside, JSON is a really simple serialization format, and if you're
careful enough to only implement the functionality necessary for basic HTTP, a
JSON-RPC implementation _can_ be somewhat simple.
**Small?** Not really.
: What typically happens is that implementors take an existing HTTP library and
build on top of that. A generic HTTP library likely implements a lot more than
necessary for IPC, so that's not going to be very small. RTorrent, for example,
makes use of the not-very-small [xmlrpc-c](http://xmlrpc-c.sourceforge.net/),
which in turn uses [libcurl](http://curl.haxx.se/) (400 KiB, excluding TLS
library) and either the bloated [libxml2](http://xmlsoft.org/) (1.5 MiB) or
[libexpat](http://www.libexpat.org/) (170 KiB). In any case, expect your
programs to grow by a megabyte or more if you go this route.
Transmission seems rather less bloated. It uses the HTTP library that is built
into [libevent](http://libevent.org/) (totalling ~500 KiB, but libevent is also
used for other networking parts), and a simple JSON parser can't be that large
either. I'm sure that if you reimplement everything from scratch for the
purpose of building an API, you could get something much smaller. Then again,
even if you manage to shrink the size of the server that way, you can't expect
all your users to do the same.
If HTTPS is to be supported, add ~500 KiB more. TLS isn't the simplest
protocol, either.
**Language independent?** Yes.
: Almost every language has libraries for web stuff.
**Networked?** Definitely.
: In fact, I've never seen anyone use XML/JSON RPC over UNIX sockets.
**Secure?** Alright.
: HTTP has built-in support for authentication, but it also isn't uncommon to use
some other mechanism (based on cookies, I guess?).
Confidentiality can be achieved with HTTPS. There is the problem of verifying
the certificate, since I doubt anyone is going to have certificates of their
local applications signed by a certificate authority, but there's always the
option of trust-on-first use. Custom applications can also include a
fingerprint of the server certificate in the URL for verification, but this
won't work for web apps.
**Fast?** No.
: JSON/XML RPC messages add significant overhead on the wire and require more
parsing than a simple custom solution or D-Bus. I wouldn't really call it
_fast_, but admittedly, it might still be _fast enough_ for most purposes.
**Proxy support?** Sure.
: HTTP has native support for proxying, and it's always possible to proxy some
URI on the main server to another server, assuming the libraries you use
support that. It's not necessarily simple to implement, however.
The lack of asynchronous notifications and the overhead and complexity of
JSON/XML RPC make me stay away from it, but it certainly is a solution that
many client developers will like because of its ease of use.
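That ease of use is easy to see: building and parsing complete JSON-RPC 2.0 messages takes nothing more than a JSON library (the method name and parameters below are made up):

```python
import json

def make_request(req_id, method, params):
    """Serialize a JSON-RPC 2.0 request object."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def parse_response(raw):
    """Return the result of a JSON-RPC 2.0 response, raising on error."""
    msg = json.loads(raw)
    if "error" in msg:
        raise RuntimeError(msg["error"].get("message", "unknown error"))
    return msg["result"]

req = make_request(1, "torrent-status", {"ids": [4]})  # hypothetical method
print(req)
print(parse_response('{"jsonrpc": "2.0", "id": 1, "result": {"done": 0.5}}'))
```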
# Other Systems
There are more alternatives out there than I have described so far. Most of
those were options I dismissed early on because they're either incomplete
solutions or specific to a single framework or language. I'll still mention a
few here.
## Message Queues
In the context of IPC I see that message queues such as
[RabbitMQ](https://www.rabbitmq.com/) and [ZeroMQ](http://zeromq.org/) are
quite popular. I can't say I have much experience with any of these, but these
MQs don't seem to offer a solution to the problem I described in the
introduction. My impression of MQs is that they offer a higher-level and more
powerful alternative to TCP and UDP. That is, they route messages from one
endpoint to another. The contents of the messages are still completely up to
the application, so you're still on your own in implementing an RPC mechanism
on top of that. And for the purpose of building a simple RPC mechanism, I'm
convinced that plain old UNIX sockets or TCP will do just fine.
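To illustrate how much of an RPC mechanism is left to the application, here is a minimal sketch of request/response matching over a raw message-passing layer. In-process queues stand in for the MQ transport; none of this is RabbitMQ or ZeroMQ API:

```python
import itertools
import queue
import threading

# Two in-process queues stand in for the message transport.
requests, responses = queue.Queue(), queue.Queue()
_ids = itertools.count()

def server():
    # Echo-style service: uppercase the payload of each request.
    while True:
        msg = requests.get()
        if msg is None:          # shutdown sentinel
            break
        corr_id, payload = msg
        responses.put((corr_id, payload.upper()))

def call(payload):
    # The RPC layer the application must build itself: tag the request
    # with a correlation id and match it against the response.
    corr_id = next(_ids)
    requests.put((corr_id, payload))
    rid, result = responses.get()
    assert rid == corr_id        # one outstanding call; real code demultiplexes
    return result

t = threading.Thread(target=server)
t.start()
result = call("ping")
requests.put(None)
t.join()
print(result)  # PING
```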
## Cap'n Proto
I probably should be spending a full chapter on [Cap'n
Proto](http://kentonv.github.io/capnproto/) instead of this tiny little section,
but I'm simply not familiar enough with it to offer any deep insights. I can
still offer my blatantly uninformed impression of it: It looks very promising,
but puts, in my opinion, too much emphasis on performance and too little
emphasis on ease of use. It lacks introspection and requires that clients have
already obtained the schema of the service in order to interact with it. It
also uses a capability system to handle authorization, which, despite being
elegant and powerful, increases complexity and cognitive load (though I
obviously need more experience to quantify this). It still lacks
confidentiality for networked access and the number of bindings to other
programming languages is limited, but these problems can be addressed.
Cap'n Proto seems like the ideal IPC mechanism for internal communication
within a single (distributed) application and offers a bunch of unique features
not found in other RPC systems. But it doesn't feel quite right as an easy API
for others to use.
## CORBA
CORBA has been used by the GNOME project in the past, and was later abandoned
in favour of D-Bus, primarily (I think) because CORBA was deemed too [complex
and incomplete](http://dbus.freedesktop.org/doc/dbus-faq.html#corba). A system
that is deemed more complex than D-Bus is an immediate red flag. The [long and
painful history of CORBA](http://queue.acm.org/detail.cfm?id=1142044) also makes
me want to avoid it, if only because that makes it very hard to judge the
quality and modernness of existing implementations.
## Project Tanja
A bit over two years ago I was researching the same problem, but from a much
more generic angle. The result of that was a project that I called Tanja. I
described its concepts [in an earlier article](/doc/commvis), and wrote an
incomplete [specification](https://g.blicky.net/tanja.git/) along with
implementations in [C](https://g.blicky.net/tanja-c.git/),
[Go](https://g.blicky.net/tanja-go.git/) and
[Perl](https://g.blicky.net/tanja-perl.git/). I consider project Tanja a
failure, primarily because of its genericity. It supported too many
communication models, and the lack of a specification as to which model was
being used, together with the lack of any guarantee that the chosen model was
actually followed, made Tanja hard to use in practice. It was a very
interesting experiment, but not
something I would actually use. I learned the hard way that you sometimes have
to move some complexity down into a lower abstraction layer in order to keep
the complexity in check at higher layers of abstraction.
# Conclusions
This must be the longest rant I've written so far.
In any case, there isn't really a perfect IPC mechanism for my use case. A
custom protocol involves reimplementing a lot of stuff, D-Bus is a pain, and
JSON/XML RPC are bloat.
I am still undecided on what to do. I have a lot of ideas as to what a perfect
IPC solution would look like, both in terms of features and in how to implement
it, and I feel like I have enough experience by now to actually develop a
proper solution. Unfortunately, writing a complete IPC system with the required
utilities and language bindings takes **a lot** of time and effort. It's not
really worth it if I am the only one using it.
So here is my plea to you, dear reader: If you know of any existing solutions
I've missed, please tell me. If you empathize with me and want a better
solution to this problem, please get in touch as well! I'd love to hear about
projects which face similar problems and have similar requirements.

dat/doc/funcweb.md Normal file
% An Opinionated Survey of Functional Web Development
(Published on **2017-05-28**)
# Intro
TL;DR: In this article I provide an overview of the frameworks and libraries
available for creating websites in statically-typed functional programming
languages.
I recommend you now skip directly to the next section, but if you're interested
in some context and don't mind a rant, feel free to read on. :-)
**&lt;Rant mode>**
When compared to native desktop application development, web development just
sucks. Native development is relatively simple with toolkits such as
[Qt](https://www.qt.io/), [GTK+](https://www.gtk.org/) and others: You have
convenient widget libraries, and you can describe your entire application, from
interface design to all behavioural aspects, in a single programming language.
You're also largely free to structure code in whichever way makes most sense.
You can describe what a certain input field looks like, what happens when the
user interacts with it and what will happen with the input data, all succinctly
in a single file. There are even drag-and-drop UI builders to speed up
development.
Web development is the exact opposite of that. There are several different
technologies you're forced to work with even when creating the most mundane
website, and there's a necessary but annoying split between code that runs on
the server and code that runs in the browser. Creating a simple input field
requires you to consider and maintain several ends:
- The back end (server-side code) that describes how the input field interacts
with the database.
- Some JavaScript code to describe how the user can interact with the input
field.
- Some CSS to describe what the input field looks like.
- And then there's HTML to act as a glue between the above.
In many web development setups, all four of the above technologies are
maintained in different files. If you want to add, remove or modify an input
field, or just about anything else on a page, you'll be editing at least four
different files with different syntax and meaning. I don't know how other
developers deal with this, but the only way I've been able to keep these places
synchronized is to just edit one or two places, test if it works in a browser,
and then edit the other places accordingly to fix whatever issues I find. This
doesn't always work well: I don't get a warning if I remove an HTML element
somewhere and forget to also remove the associated CSS. Heck, in larger
projects I can't even tell whether it's safe to remove or edit a certain line
of CSS because I have no way to know for sure that it's not still being used
elsewhere. Perhaps this particular case can be solved with proper organization
and discipline, but similar problems exist with the other technologies.
Yet despite that, why do I still create websites in my free time? Because it is
the only environment with high portability and low friction - after all, pretty
much anyone can browse the web. I would not have been able to create a useful
"[Visual Novel Database](https://vndb.org/)" any other way than through a
website. And the entire purpose of [Manned.org](https://manned.org/) is to
provide quick access to man pages from anywhere, which is not easily possible
with native applications.
**&lt;/Rant mode>**
Fortunately, I am not the only one who sees the problems with the "classic"
development strategy mentioned above. There are many existing attempts to
improve on that situation. A popular approach to simplify development is the
[Single-page
application](https://en.wikipedia.org/wiki/Single-page_application) (SPA). The
idea is to move as much code as possible to the front end, and keep only a
minimal back end. Both the HTML and the entire behaviour of the page can be
defined in the same language and same file. With libraries such as
[React](https://facebook.github.io/react/) and browser support for [Web
components](https://developer.mozilla.org/en-US/docs/Web/Web_Components), the
split between files described above can be largely eliminated. And if
JavaScript isn't your favorite language, there are many alternative languages
that compile to JavaScript. (See [The JavaScript
Minefield](http://walkercoderanger.com/blog/2014/02/javascript-minefield/) for
an excellent series of articles on that topic).
While that approach certainly has the potential to make web development more
pleasant, it has a very significant drawback: Performance. For some
applications, such as web based email clients or CRM systems, it can be
perfectly acceptable to have a megabyte of JavaScript as part of the initial
page load. But for most other sites, such as this one, or the two sites I
mentioned earlier, or sites like Wikipedia, a slow initial page load is
something I consider to be absolutely unacceptable. The web can be really fast,
and developer laziness is not a valid excuse to ruin it. (If you haven't seen
or read [The Website Obesity
Crisis](http://idlewords.com/talks/website_obesity.htm) yet, please do so now).
I'm much more interested in the opposite approach to SPA: Move as much code as
possible to the back end, and only send a minimal amount of JavaScript to the
browser. This is arguably how web development has always been done in the past,
and there's little reason to deviate from it. The difference, however, is that
people tend to expect much more "interactivity" from web sites nowadays, so the
amount of JavaScript is increasing. And that is alright, so long as the
JavaScript doesn't prevent the initial page from loading quickly. But this
increase in JavaScript does amplify the "multiple files" problem I ranted about
earlier.
So my ideal solution is a framework where I can describe all aspects of a site
in a single language, and organize the code among files in a way that makes
sense to me. That is, I want the same kind of freedom that I get with native
desktop software development. Such a framework should run on the back end, and
automatically generate efficient JavaScript and, optionally, CSS for the front
end. As an additional requirement (or rather, strong preference), all this
should be in a statically-typed language - because I am seemingly incapable of
writing large reliable applications with dynamic typing - and in a language
from functional heritage - because programming in functional languages has
spoiled me.
I'm confident that what I describe is possible, and it's evident that I'm not
the only person to want this, as several (potential) solutions like this do
indeed exist. I've been looking around for these solutions and have
experimented with a few that looked promising. This article provides an
overview of what I have found so far.
# OCaml
My adventure began with [OCaml](https://ocaml.org/). It's been a few years
since I last used OCaml for anything, but development on the language and its
ecosystem has been anything but halted. [Real World OCaml](https://realworldocaml.org/)
has been a great resource to get me up to speed again.
## Ocsigen
For OCaml there is one project that has it all: [Ocsigen](http://ocsigen.org/).
It comes with an OCaml to JavaScript compiler, a web server, several handy
libraries, and a [framework](http://ocsigen.org/eliom/) to put everything
together. Its [syntax
extension](http://ocsigen.org/eliom/6.2/manual/ppx-syntax) allows you to mix
front and back end code, and you can easily share code between both ends. The
final result is a binary that runs the server and a JavaScript file that
handles everything on the client side.
The framework comes with an embedded DSL with which you can conveniently
generate HTML without actually typing HTML. And best of all, this DSL works on
both the client and the server: On the server side it generates an HTML string
that can be sent to the client, and running the same code on the client side
will result in a DOM element that is ready to be used.
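The combinator idea behind such DSLs can be sketched in a toy form. This is plain Python for illustration only - Ocsigen's actual DSL (TyXML) is a statically-typed OCaml library - but it shows the core trick: element constructors build a tree, and a renderer folds the tree into an HTML string (a client-side interpreter of the same tree could build DOM nodes instead):

```python
from html import escape

# Toy HTML combinators: each element is a (tag, attributes, children) tuple.
def elt(tag):
    def make(*children, **attrs):
        return (tag, attrs, children)
    return make

div, ul, li = elt("div"), elt("ul"), elt("li")

def render(node):
    # Server-side interpretation of the tree: fold it into an HTML string.
    if isinstance(node, str):
        return escape(node)
    tag, attrs, children = node
    a = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
    return f"<{tag}{a}>" + "".join(render(c) for c in children) + f"</{tag}>"

page = div(ul(li("one"), li("two")), id="menu")
print(render(page))  # <div id="menu"><ul><li>one</li><li>two</li></ul></div>
```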
Ocsigen makes heavy use of the OCaml type system to statically guarantee the
correctness of various aspects of the application. The HTML DSL ensures not
only that the generated HTML is well-formed, but also prevents you from
incorrectly nesting certain elements and using the wrong attributes on the
wrong elements. Similarly, an HTML element generated on the server side can be
referenced from client side code without having to manually assign a unique ID
to the element. This prevents accidental typos in the ID naming and guarantees
that the element that the client side code refers to actually exists. URL
routing and links to internal pages are also checked at compile time.
Ocsigen almost exactly matches what I previously described as the perfect
development framework. Unfortunately, it has a few drawbacks:
- The generated JavaScript is quite large, a bit over 400 KiB for a hello
world. In my brief experience with the framework, this also results in a
noticeably slower page load. I don't know if it was done for performance
purposes, but subsequent page views are by default performed via in-browser
XHR requests, which do not require that all the JavaScript is re-parsed and
evaluated, and is thus much faster. This, however, doesn't work well if the
user opens pages in multiple tabs or performs a page reload for whatever
reason. And as I mentioned, I care a lot about the initial page loading time.
- The framework has a steep learning curve, and the available documentation is
nowhere near complete enough to help you. I've found myself wondering many
times how I was supposed to use a certain API and have had to look for
example code for enlightenment. At some point I ended up just reading the
source code instead of going for the documentation. What doesn't help here is
that, because of the heavy use of the type system to ensure code correctness,
most of the function signatures are far from intuitive and are sometimes very
hard to interpret. This problem is made even worse with the generally
unhelpful error messages from the compiler. (A few months with
[Rust](https://www.rust-lang.org/) and its excellent error messages has
really spoiled me on this aspect, I suppose).
- I believe they went a bit too far with the compile-time verification of
certain correctness properties. Apart from making the framework harder to
learn, it also increases the verbosity of the code and removes a lot of
flexibility. For instance, in order for internal links to be checked, you
have to declare your URLs (or _services_, as they are called) somewhere central
such that the view part of your application can access it. Then elsewhere you
have to register a handler to that service. This adds boilerplate and
enforces a certain code structure. And the gain of all this is, in my
opinion, pretty small: In the 15 years that I have been building web sites, I
don't remember a single occurrence where I mistyped the URL in an internal
link. I do suppose that this feature makes it easy to change URLs without
causing breakage, but there is a trivial counter-argument to that: [Cool URIs
don't change](https://www.w3.org/Provider/Style/URI.html). (Also, somewhat
ironically, I have found more dead internal links on the Ocsigen website than
on any other site I have visited in the past year, so perhaps this was indeed
a problem they considered worth fixing. Too bad it didn't seem to work out so
well for them).
Despite these drawbacks, I am really impressed with what the Ocsigen project
has achieved, and it has set a high bar for the future frameworks that I will
be considering.
# Haskell
I have always seen Haskell as that potentially awesome language that I just
can't seem to wrap my head around, despite several attempts in the past to
learn it. Apparently the only thing I was missing in those attempts was a
proper goal: When I finally started playing around with some web frameworks I
actually managed to get productive in Haskell with relative ease. What also
helped me this time was a practical introductory Haskell reference, [What I
Wish I Knew When Learning Haskell](http://dev.stephendiehl.com/hask/), in
addition to the more theoretical [Learn You A Haskell for Great
Good](http://learnyouahaskell.com/).
Haskell itself already has a few advantages when compared to OCaml: For one, it
has a larger ecosystem, so for any task you can think of there is probably
already at least one existing library. As an example, I was unable to find an
actively maintained SQL DSL for OCaml, while there are several available for
Haskell. Another advantage that I found were the much more friendly and
detailed error messages generated by the Haskell compiler, GHC. In terms of
build systems, Haskell has standardized on
[Cabal](https://www.haskell.org/cabal/), which works alright most of the time.
Packaging is still often complex and messy, but it's certainly improving as
[Stack](http://haskellstack.org/) is gaining more widespread adoption. Finally,
I feel that the Haskell syntax is slightly less verbose, and more easily lends
itself to convenient DSLs.
Despite Haskell's larger web development community, I could not find a single
complete and integrated client/server development framework such as Ocsigen.
Instead, there are a whole bunch of different projects focussing on either the
back end or the front end. I'll explore some of them with the idea that,
perhaps, it's possible to mix and match different libraries and frameworks in
order to get the perfect development environment. And indeed, this seems to be
a common approach in many Haskell projects.
## Server-side
Let's start with a few back end frameworks.
Scotty
: [Scotty](https://github.com/scotty-web/scotty) is a web framework inspired by
[Sinatra](http://www.sinatrarb.com/). I have no experience with (web)
development in Ruby and have never used Sinatra, but it has some similarities
to what I have been using for a long time: [TUWF](https://dev.yorhel.nl/tuwf).
Scotty is a very minimalist framework: it does routing (that is, mapping URLs
to Haskell functions), it has some functions to access request data and some
functions to create and modify a response. That's it. No database handling,
session management, HTML generation, form handling or other niceties. But
that's alright, because there are many generic libraries to help you out there.
Thanks to its minimalism, I found Scotty to be very easy to learn and get used
to. Even as a Haskell newbie I had a simple website running within a day. The
documentation is appropriate, but the idiomatic way of combining Scotty with
other libraries is through the use of Monad Transformers, and a few more
examples in this area would certainly have helped.
Spock
: Continuing with the Star Trek franchise, there's
[Spock](https://www.spock.li/). Spock is very similar to Scotty, but comes with
type-safe routing and various other goodies such as session and state
management, [CSRF](https://en.wikipedia.org/wiki/Cross-site_request_forgery)
protection and database helpers.
As with everything that is (supposedly) more convenient, it also comes with a
slightly steeper learning curve. I haven't, for example, figured out yet how to
do regular expression based routing. I don't even know if that's still possible
in the latest version - the documentation isn't very clear. Likewise, it's
unclear to me what the session handling does exactly (Does it store something?
And where? Is there a timeout?) and how that interacts with CSRF protection.
Spock seems useful, but requires more than just a cursory glance.
Servant
: [Servant](http://haskell-servant.github.io/) is another minimalist web
framework, although it is primarily designed for creating RESTful APIs.
Servant distinguishes itself from Scotty and Spock by not only featuring
type-safe routing, it furthermore allows you to describe your complete public
API as a type, and get strongly typed responses for free. This also enables
support for automatically generated documentation and client-side API wrappers.
Servant would be an excellent back end for a SPA, but it does not seem like an
obvious approach to building regular websites.
Happstack / Snap / Yesod
: [Happstack](http://www.happstack.com/), [Yesod](http://www.yesodweb.com/) and
[Snap](http://snapframework.com/) are three large frameworks with many
auxiliary libraries. They all come with a core web server, routing, state and
database management. Many of the libraries are not specific to the framework
and can be used together with other frameworks. I won't go into a detailed
comparison between the three projects because I have no personal experience
with any of them, and fortunately [someone else already wrote a
comparison](http://softwaresimply.blogspot.nl/2012/04/hopefully-fair-and-useful-comparison-of.html)
in 2012 - though I don't know how accurate that still is today.
So there are a fair amount of frameworks to choose from, and they can all work
together with other libraries to implement additional functions. Apart from the
framework, another important aspect of web development is how you generate the
HTML to send to the client. In true Haskell style, there are several answers.
For those who prefer embedded DSLs, there are
[xhtml](http://hackage.haskell.org/package/xhtml),
[BlazeHTML](https://jaspervdj.be/blaze/) and
[Lucid](https://github.com/chrisdone/lucid). The xhtml package is not being
used much nowadays and has been superseded by BlazeHTML, which is both faster
and offers a more readable DSL using Haskell's do-notation. Lucid is heavily
inspired by Blaze, and attempts to [fix several of its
shortcomings](http://chrisdone.com/posts/lucid). Having used Lucid a bit
myself, I can attest that it is easy to get started with and pretty convenient
in use.
I definitely prefer to generate HTML using DSLs as that keeps the entire
application in a single host language and with consistent syntax, but the
alternative approach, templating, is also fully supported in Haskell. The Snap
framework comes with [Heist](https://github.com/snapframework/heist), which
provides run-time interpreted templates, like similar systems in most other
languages.
Yesod comes with [Shakespeare](http://hackage.haskell.org/package/shakespeare),
which is a type-safe templating system with support for inlining the templates
in Haskell code. Interestingly, Shakespeare also has explicit support for
templating JavaScript code. Too bad that this doesn't take away the need to
write the JavaScript yourself, so I don't see how this is an improvement over
some other JavaScript solution that uses JSON for communication with the back
end.
## Client-side
It is rather unusual to have multiple compiler implementations targeting
JavaScript for the same source language, but Haskell has three of them. All
three can be used to write front end code without touching a single line of
JavaScript, but there are large philosophical differences between the three
projects.
Fay
: [Fay](https://github.com/faylang/fay/wiki) compiles Haskell code directly to
JavaScript. The main advantage of Fay is that it does not come with a large
runtime, resulting in small and efficient JavaScript. The main downside is that it
only [supports a subset of
Haskell](https://github.com/faylang/fay/wiki/What-is-not-supported?). The
result is a development environment that is very browser-friendly, but where
you can't share much code between the front and back ends. You're basically
back to the separated front and back end situation in classic web development,
but at least you can use the same language for both - somewhat.
Fay itself doesn't come with many convenient UI libraries, but
[Cinder](http://crooney.github.io/cinder/index.html) covers that with a
convenient HTML DSL and DOM manipulation library.
Fay is still seeing sporadic development activity, but there is not much of
a lively community around it. Most people have moved on to other solutions.
GHCJS
: [GHCJS](https://github.com/ghcjs/ghcjs) uses GHC itself to compile Haskell to a
low-level intermediate language, and then compiles that language to JavaScript.
This allows GHCJS to achieve excellent compatibility with native Haskell code,
but comes, quite predictably, at the high cost of duplicating a large part of
the Haskell runtime into the JavaScript output. The generated JavaScript code
is typically measured in megabytes rather than kilobytes, which is (in my
opinion) far too large for regular web sites. The upside of this high
compatibility, of course, is that you can re-use a lot of code between the
front and back ends, which will certainly make web development more tolerable.
The community around GHCJS seems to be more active than that of Fay. GHCJS
integrates properly with the Stack package manager, and there are a [whole
bunch](http://hackage.haskell.org/packages/search?terms=ghcjs) of libraries
available.
Haste
: [Haste](https://github.com/valderman/haste-compiler) provides a middle ground
between Fay and GHCJS. Like GHCJS, Haste is based on GHC, but instead of
using low-level compiler output, Haste uses a higher-level intermediate
language. This results in good compatibility with regular Haskell code while
keeping the output size in check. Haste has a JavaScript runtime of around 60
KiB and the compiled code is roughly as space-efficient as Fay.
While it should be possible to share a fair amount of code between the front
and back ends, not all libraries work well with Haste. I tried to use Lucid
within a Haste application, for example, but that did not work. Apparently one
of its dependencies (probably the UTF-8 codec, as far as I could debug the
problem) performs some low-level performance optimizations that are
incompatible with Haste.
Haste itself is still being sporadically developed, but not active enough to be
called alive. The compiler lags behind on the GHC version, and the upcoming 0.6
version has stayed unreleased and in limbo for at least four months on the
git repository. The community around Haste is in a similar state. Various
libraries do exist, such as [Shade](https://github.com/takeoutweight/shade)
(HTML DSL, Reactive UI), [Perch](https://github.com/agocorona/haste-perch)
(another HTML DSL), [haste-markup](https://github.com/ajnsit/haste-markup) (yet
another HTML DSL) and
[haste-dome](https://github.com/wilfriedvanasten/haste-dome) (_yet_ another
HTML DSL), but they're all pretty much dead.
Despite having three options available, only Haste provides enough benefit of
code reuse while remaining efficient enough for the kind of site that I
envision. Haste really deserves more love than it is currently getting.
## More Haskell
In my quest for Haskell web development frameworks and tools, I came across a
few other interesting libraries. One of them is
[Clay](http://fvisser.nl/clay/), a CSS preprocessor as a DSL. This will by
itself not solve the CSS synchronisation problem that I mentioned at the start
of this article, but it could still be used to keep the CSS closer to code
implementing the rest of the site.
It also would not do to write an article on Haskell web development and not
mention a set of related projects: [MFlow](https://github.com/agocorona/MFlow),
[HPlayground](https://github.com/agocorona/hplayground) and the more recent
[Axiom](https://github.com/transient-haskell/axiom). These are ambitious
efforts at building a very high-level and functional framework for both front
and back end web development. I haven't spent nearly enough time on these
projects to fully understand their scope, but I'm afraid of these being a bit
too high level. This invariably results in reduced flexibility (i.e. too many
opinions being hard-coded in the API) and less efficient JavaScript output.
Axiom being based on GHCJS reinforces the latter concern.
# Other languages
I've covered OCaml and Haskell now, but there are relevant projects in other
languages, too:
PureScript
: [PureScript](http://www.purescript.org/) is the spiritual successor of Fay -
except it does not try to be compatible with Haskell, and in fact
[intentionally deviates from
Haskell](https://github.com/purescript/documentation/blob/master/language/Differences-from-Haskell.md)
at several points. Like Fay, and perhaps even more so, PureScript compiles down
to efficient and small JavaScript.
Being a not-quite-Haskell language, sharing code between a PureScript front end
and a Haskell back end is not possible; the differences are simply too large.
It is, however, possible to go in the other direction: PureScript could also
run on the back end in a NodeJS environment. I don't really know how well this
is supported by the language ecosystem, but I'm not sure I'm comfortable with
replacing the excellent quality of Haskell back end frameworks with a fragile
NodeJS back end (or such is my perception, I admittedly don't have too much
faith in most JavaScript-heavy projects).
The PureScript community is very active and many libraries are available in the
[Pursuit](https://pursuit.purescript.org/) package repository. Of note is
[Halogen](https://pursuit.purescript.org/packages/purescript-halogen), a
high-level reactive UI library. One thing to be aware of is that not all
libraries are written with space efficiency as their highest priority: the
simple [Halogen
button](https://github.com/slamdata/purescript-halogen/tree/v2.0.1/examples/basic)
example already compiles down to a hefty 300 KB for me.
Elm
: [Elm](http://elm-lang.org/) is similar to PureScript, but rather than trying to
be a generic something-to-JavaScript compiler, Elm focuses exclusively on
providing a good environment to create web UIs. The reactive UI libraries are
well maintained and part of the core Elm project. Elm has a strong focus on
being easy to learn and comes with good documentation and many examples to get
started with.
Ur/Web
: [Ur/Web](http://www.impredicative.com/ur/) is an ML and Haskell inspired
programming language specifically designed for client/server programming. Based
on its description, Ur/Web is exactly the kind of thing I'm looking for: It
uses a single language for the front and back ends and provides convenient
methods for communication between the two.
This has been a low priority on my to-try list because it seems to be primarily
a one-man effort, and the ecosystem around it is pretty small. Using Ur/Web for
practical applications will likely involve writing your own libraries or
wrappers for many common tasks, such as for image manipulation or advanced text
processing. Nonetheless, I definitely should be giving this a try sometime.
(Besides, who still uses frames in this day and age? :-)
Opa
: I'll be moving out of the functional programming world for a bit.
[Opa](http://opalang.org/) is another language and environment designed for
client/server programming. Opa takes a similar approach to "everything in
PureScript": Just compile everything to JavaScript and run the server-side code
on NodeJS. The main difference with other to-JavaScript compilers is that Opa
supports mixing back end code with front end code, and it can automatically
figure out where the code should be run and how the back and front ends
communicate with each other.
Opa, as a language, is reminiscent of a statically-typed JavaScript with
various syntax extensions. While it does support SQL databases, its database
API seems to strongly favor object-oriented use rather than relational database
access.
GWT
: Previously I compared web development to native GUI application development.
There is no reason why you can't directly apply native development structure
and strategies onto the web, and that's exactly what
[GWT](http://www.gwtproject.org/) does. It provides a widget-based programming
environment that eventually runs on the server and compiles the client-side
part to JavaScript. I haven't really considered it further, as Java is not a
language I can be very productive in.
Webtoolkit
: In the same vein, there's [Wt](https://www.webtoolkit.eu/wt). The name might
suggest that it is a web-based clone of Qt, and indeed that's what it looks
like. Wt is written in C++, but there are wrappers for [other
languages](https://www.webtoolkit.eu/wt/other_language). None of the languages
really interest me much, however.
That said, if I had to write a web UI for a resource-constrained device, this
seems like an excellent project to consider.
# To conclude
To be honest, I am a bit overwhelmed at the number of options. On the one hand,
it makes me very happy to see that a lot is happening in this world, and that
alternatives to boring web frameworks do exist. Yet after all this research I
still have no clue what I should use to develop my next website. I do like the
mix and match culture of Haskell, which has the potential to form a development
environment entirely to my own taste and with my own chosen trade-offs. On the
other hand, the client-side Haskell solutions are simply too immature and
integration with the back end frameworks is almost nonexistent.
Almost none of the frameworks I discussed attempt to tackle the CSS problem
that I mentioned in the introduction, so there is clearly room for more
research in this area.
There are a few technologies that I should spend more time on to familiarize
myself with. Ur/Web is an obvious candidate here, but perhaps it is possible to
create a Haskell interface to Wt. Or maybe some enhancements to the Haste
ecosystem could be enough to make that a workable solution instead.
% Multi-threaded Access to an SQLite3 Database
(Published on **2011-11-26**)
(Minor 2013-04-06 update: I abstracted my message passing solution from ncdc
and implemented it in a POSIX C library for general use. It's called _sqlasync_
and is part of my [Ylib library collection](/ylib).)
# Introduction
As I was porting [ncdc](/ncdc) over to use SQLite3 as storage backend, I
stumbled on a problem: The program uses a few threads for background jobs, and
it would be nice to give these threads access to the database.
Serializing all database access through the main thread wouldn't have been very
hard to implement in this particular case, but that would have been far from
optimal. The main thread is also responsible for keeping the user interface
responsive and handling most of the network interaction. Overall responsiveness
of the program would improve significantly if the threads could access the
database without involving the main thread.
Which brought me to the following questions: What solutions are available for
providing multi-threaded access to an SQLite database? What problems might I run
into? I was unable to find a good overview of this area on the net, so I wrote
this article in the hope of improving that situation.
# SQLite3 and threading
Let's first see what SQLite3 itself has to offer in terms of threading support.
The official documentation mentions threading support several times in various
places, but this information is scattered around and no good overview is given.
Someone has tried to organize this before on a [single
page](http://www.sqlite.org/cvstrac/wiki?p=MultiThreading), and while this
indeed gives a nice overview, it has unfortunately not been updated since 2006.
Its advice is therefore a little on the conservative side.
Nonetheless, it is wise to remain portable with different SQLite versions,
especially when writing programs that dynamically link with some random version
installed on someone's system. It should be fairly safe to assume that SQLite
binaries provided by most systems, if not all, are compiled with thread safety
enabled. This doesn't mean all that much, unfortunately: The only thing _thread
safe_ means in this context is that you can use SQLite3 in multiple threads,
but a single database connection should still stay within a single thread.
Since SQLite 3.3.1, which was released in early 2006, it is possible to move a
single database connection along multiple threads. Doing this with older
versions is not advisable, as explained in [the SQLite
FAQ](http://www.sqlite.org/faq.html#q6). But even with 3.3.1 and later there is
an annoying restriction: A connection can only be passed to another thread when
any outstanding statements are closed and finalized. In practice this means
that it is not possible to keep a prepared statement in memory for later
execution.
Since SQLite 3.5.0, released in 2007, a single SQLite connection can be used
from multiple threads simultaneously. SQLite will internally manage locks to
avoid any data corruption. I can't recommend making use of this facility,
however, as there are still many issues with the API. The [error fetching
functions](http://www.sqlite.org/c3ref/errcode.html) and
[sqlite3\_last\_insert\_row\_id()](http://www.sqlite.org/c3ref/last_insert_rowid.html),
among others, are still useless without explicit locking in the application. I
also believe that the previously mentioned restriction on having to finalize
statements has been relaxed in this version, so keeping prepared statements in
memory and passing them among different threads becomes possible.
When using multiple database connections within a single process, SQLite offers
a facility to allow [sharing of its
cache](http://www.sqlite.org/sharedcache.html), in order to reduce memory usage
and disk I/O. The semantics of this feature have changed with different SQLite
versions and appear to have stabilised in 3.5.0. This feature may prove useful
to optimize certain situations, but does not open up new possibilities of
communicating with a shared database.
# Criteria
Before looking at some available solutions, let's first determine the criteria
we can use to evaluate them.
Implementation size
: Obviously, a solution that requires only a few lines of code to implement is
preferable over one that requires several levels of abstraction in order to be
usable. I won't be giving actual implementations here, so the sizes will be
rough estimates for comparison purposes. The actual size of an implementation
is of course heavily dependent on the programming environment as well.
Memory/CPU overhead
: The most efficient solution for a single-threaded application is to simply have
direct access to a single database connection. Every solution is in principle a
modification or extension of this idea, and will therefore add a certain
overhead. This overhead manifests itself in both increased CPU and memory
usage; the magnitude of each varies between solutions.
Prepared statement re-use
: Is it possible to prepare a statement once and keep using it for the lifetime
of the program? Or will prepared statements have to be thrown away and
recreated every time? Keeping statement handles in memory will result in a nice
performance boost for applications that run the same SQL statement many times.
Transaction grouping
: A somewhat similar issue to prepared statement re-use: From a performance point
of view, it is very important to try to batch many UPDATE/DELETE/INSERT
statements within a single transaction, as opposed to running each modify query
separately. Running each query separately will force SQLite to flush the data
to disk separately every time, whereas a single transaction will batch-flush
all the changes to disk in a single go. Some solutions allow for grouping
multiple statements in a single transaction quite easily, while others require
more involved steps.
Background processing
: In certain situations it may be desirable to queue a certain query for later
processing, without explicitly waiting for it to complete. For example, if
something in the database has to be modified as a result of user interaction in
a UI thread, then the application would feel a lot more responsive if the
UPDATE query was simply queued to be processed in a background thread than when
the query had run in the UI thread itself. A database accessing solution with
built-in support for background processing of queries will significantly help
with building a responsive application.
Concurrency
: Concurrency indicates how well the solution allows for concurrent access. The
worst possible concurrency is achieved when a single database connection is
used for all threads, as only a single action can be performed on the database
at any point in time. Maximum concurrency is achieved when each thread has its
own SQLite connection. Note that maximum concurrency doesn't mean that the
database can be accessed in a _fully_ concurrent manner. SQLite uses internal
database-level locks to avoid data corruption, and these will limit the actual
maximum concurrency. I am not too knowledgeable about the inner workings of
these locks, but it is at least possible to have a large number of truly
concurrent database _reads_. Database _writes_ from multiple threads may
still allow for significantly more concurrency than when they are manually
serialized over a single database connection.
Portability
: What is the minimum SQLite version required to implement the solution? Does it
require any special OS features or SQLite compilation settings? As outlined
above, different versions of SQLite offer different features with regards to
threading. Relying on one of the newer features will decrease
portability.
# The Solutions
Here I present four solutions to allow database access from multiple threads.
Note that this list may not be exhaustive; these are just a few solutions that
I am aware of. Also note that none of the solutions presented here is in any
way new: most of these paradigms date back to the beginnings of concurrent
programming and have been applied in software for decades.
## Connection sharing
By far the simplest solution to implement: Keep a single database connection
throughout your program and allow every thread to access it. Of course, you
will need to be careful to always put locks around the code where you access
the database handle. An example implementation could look like the following:
```c
// The global SQLite connection
sqlite3 *db;

int main(int argc, char **argv) {
    if(sqlite3_open("database.sqlite3", &db))
        exit(1);
    // start some threads
    // wait until the threads are finished
    sqlite3_close(db);
    return 0;
}

void *some_thread(void *arg) {
    sqlite3_mutex_enter(sqlite3_db_mutex(db));
    // Perform some queries on the database
    sqlite3_mutex_leave(sqlite3_db_mutex(db));
    return NULL;
}
```
Implementation size
: This is where connection sharing shines: There is little extra code required
when compared to using a database connection in a single-threaded context. All
you need to be careful of is to lock the mutex before using the database, and
to unlock it again afterwards.
Memory/CPU overhead
: As the only addition to the single-threaded case are the locks, this solution
has practically no memory overhead. The mutexes are provided by SQLite,
after all. CPU overhead is also as minimal as it can be: mutexes are the most
primitive type provided by threading libraries to serialize access to a shared
resource, and are therefore very efficient.
Prepared statement re-use
: Prepared statements can be safely re-used inside a single enter/leave block.
However, if you want to remain portable with SQLite versions before 3.5.0, then
any prepared statements **must** be freed before the mutex is unlocked. This can
be a major downside if the enter/leave blocks themselves are relatively short
but accessed quite often. If portability with older versions is not an issue,
then this restriction is gone and prepared statements can be re-used easily.
Transaction grouping
: A reliable implementation will not allow transactions to span multiple
enter/leave blocks. So as with prepared statements, transactions need to be
committed to disk before the mutex is unlocked. Again shared with prepared
statement re-use is that this limitation may prove to be a significant problem
in optimizing application performance, disk I/O in particular. One way to lower
the effects of this limitation is to increase the size of a single enter/leave
block, thus allowing for more work to be done in a single transaction. Code
restructuring may be required in order to efficiently implement this. Another
way to get around this problem is to allow a transaction to span multiple
enter/leave blocks. Implementing this reliably may not be an easy task,
however, and will most likely require application-specific knowledge.
Background processing
: Background processing is not natively supported with connection sharing. It is
possible to spawn a background thread to perform database operations each time
that this is desirable. But care should be taken to make sure that these
background threads will execute dependent queries in the correct order. For
example, if thread A spawns a background thread, say B, to execute an UPDATE
query, and later thread A wants to read that same data back, it must first wait
for thread B to finish execution. This may add more inter-thread communication
than is preferable.
Concurrency
: There is no concurrency at all here. Since the database connection is protected
by an exclusive lock, only a single thread can operate on the database at any
point in time. Additionally, one may be tempted to increase the size of an
enter/leave block in order to allow for larger transactions or better re-use of
prepared statements. However, any time spent on performing operations that do
not directly use the database within such an enter/leave block will lower the
maximum possible database concurrency even further.
Portability
: Connection sharing requires at least SQLite 3.3.1 in order to pass the same
database connection around. SQLite must be compiled with threading support
enabled. If prepared statements are kept around outside of an enter/leave
block, then version 3.5.0 or higher will be required.
## Message passing
An alternative approach is to allow only a single thread to access the
database. Any other thread that wants to access the database in any way will
then have to communicate with this database thread. This communication is done
by sending messages (_requests_) to the database thread, and, when query
results are required, receiving back one or more _response_ messages.
Message passing schemes and libraries are available for many programming
languages and come in many different forms. For this article, I am going to
assume that an asynchronous and unbounded FIFO queue is used to pass around
messages, but most of the following discussion will apply to bounded queues as
well. I'll try to note the important differences between the two where
applicable.
A very simple and naive implementation of a message passing solution is given
below. Here I assume that `queue_create()` will create a message queue (type
`message_queue`), `queue_get()` will return the next message in the queue, or
block if the queue is empty. `thread_create(func, arg)` will run _func_ in a
newly created thread and pass _arg_ as its argument. Error handling has been
omitted to keep this example concise.
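For completeness, here is one way such an unbounded FIFO queue could be built on top of POSIX threads. This is a minimal sketch, not the implementation used in ncdc; the `message` struct is a placeholder for whatever request format your application needs:

```c
#include <pthread.h>
#include <stdlib.h>

// Hypothetical message type; a real one would carry the action,
// query string, bound parameters and an optional reply queue.
typedef struct message {
    int action;
    struct message *next;
} message;

typedef struct message_queue {
    message *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} message_queue;

message_queue *queue_create(void) {
    message_queue *q = calloc(1, sizeof(message_queue));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    return q;
}

// Append a message to the tail; wakes up a blocked queue_get().
void queue_put(message_queue *q, message *m) {
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if(q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

// Remove and return the oldest message, blocking while the queue is empty.
message *queue_get(message_queue *q) {
    pthread_mutex_lock(&q->lock);
    while(!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    message *m = q->head;
    q->head = m->next;
    if(!q->head)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return m;
}
```

A bounded variant would add a second condition variable and have `queue_put()` wait while the queue is at capacity.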
```c
void *db_thread(void *arg) {
    message_queue *q = arg;
    sqlite3 *db;
    if(sqlite3_open("database.sqlite3", &db))
        return ERROR;
    request_msg *m;
    while((m = queue_get(q))) {
        if(m->action == QUIT)
            break;
        if(m->action == EXEC)
            sqlite3_exec(db, m->query, NULL, NULL, NULL);
    }
    sqlite3_close(db);
    return OK;
}

int main(int argc, char **argv) {
    message_queue *db_queue = queue_create();
    thread_create(db_thread, db_queue);
    // Do work.
    return 0;
}
```
This example implementation has a single database thread running in the
background that accepts the messages `QUIT`, to stop processing queries and
close the database, and `EXEC`, to run a certain query on the database. No
support is available yet for passing query results back to the thread that sent
the message. This can be implemented by including a separate `message_queue`
object in the request messages, to which the results can be sent.
Implementation size
: This will largely depend on the programming environment used and the complexity
of the database thread. If your environment already comes with a message queue
implementation, and constructing the request/response messages is relatively
simple, then a simple implementation as shown above will not require much code.
On the other hand, if you have to implement your own message queue or want more
intelligence in the database thread to improve efficiency, then the complete
implementation may be significantly larger than that of connection sharing.
Memory/CPU overhead
: Constructing and passing around messages will incur a CPU overhead, though with
an efficient implementation this should not be significant enough to worry
about. Memory usage is highly dependent on the size of the messages being
passed around and the length of the queue. If messages are queued faster than
they are processed and there is no bound on the queue length, then a process
may quickly run out of memory. On the other hand, if messages are processed
fast enough then the queue will generally not have more than a single message
in it, and the memory overhead will remain fairly small.
Prepared statement re-use
: As the database connection will never leave the database thread, prepared
statements can be kept in memory and re-used without problems.
Transaction grouping
: A naive but robust implementation will handle each message in its own
transaction. A more clever database thread, however, could wait for multiple
messages to be queued and can then batch-execute them in a single transaction.
Correctly implementing this may require some additional information to be
specified along with the request, such as whether the query may be combined in
a single transaction or whether it may only be executed outside of a
transaction. Some threads may want to have confirmation that the data has been
successfully written to disk, in which case responsiveness will not improve if
such actions are queued for later processing. Nonetheless, since the database
thread has all the knowledge about the state of the database and any
outstanding actions, transaction grouping can be implemented quite reliably.
Background processing
: Background processing is supported natively with a message passing
implementation: a thread that isn't interested in query results can simply
queue the action to be performed by the database thread without indicating a
return path for the results. Of course, if a thread queues many messages that
do not require results followed by one that does, it will have to wait for all
earlier messages to be processed before receiving any results for the last one.
In the case that the actions are not dependent on each other, the database
thread may re-order the messages in order to process the last request first.
This requires knowledge about dependencies and may significantly complicate the
implementation, however.
Concurrency
: As with a shared database connection, database access is exclusive: Only a
single action can be performed on the database at a time. Unlike connection
sharing, however, any processing within the application will not further
degrade the maximum attainable concurrency. As long as unbounded asynchronous
queues are used to pass around messages, the database thread will be able to
continue working on the database without waiting for another thread to process
the results.
Portability
: This is where message passing shines: SQLite is only used within the database
thread; no other thread ever needs to call an SQLite function. This
allows any version of SQLite to be used, even those that have not been compiled
with thread safety enabled.
## Thread-local connections
A rather different approach to giving each thread access to a single database
is to simply open a new database connection for each thread. This way each
connection will be local to the specific thread, which in turn has the power to
do with it as it likes without worrying about what the other threads do. The
following is a short example to illustrate the idea:
```c
void *some_thread(void *arg) {
    sqlite3 *db;
    if(sqlite3_open("database.sqlite3", &db))
        return ERROR;
    // Do some work on the database
    sqlite3_close(db);
    return OK;
}

int main(int argc, char **argv) {
    int i;
    for(i=0; i<10; i++)
        thread_create(some_thread, NULL);
    // Wait until the threads are done
    return 0;
}
```
Implementation size
: Giving each thread its own connection is practically not much different from
the single-threaded case where there is only a single database connection. And
as the example shows, this can be implemented quite trivially.
Memory/CPU overhead
: If we assume that threads are not created very often and each thread has a
relatively long life, then the CPU and I/O overhead caused by opening a new
connection for each thread will not be very significant. On the other hand, if
threads are created quite often and lead a relatively short life before they
are destroyed again, then opening a new connection each time will soon require
more resources than running the queries themselves.
There is a significant memory overhead: every new database connection requires
memory. If each connection also has a separate cache, then every thread will
quickly require several megabytes only to interact with the database. Since
version 3.5.0, SQLite allows sharing of this cache with the other threads,
which will reduce this memory overhead.
Prepared statement re-use
: Prepared statements can be re-used without limitations within a single thread.
This will allow full re-use of prepared statements if each thread has a
different task, in which case every thread will have different queries and
access patterns anyway. But when every thread runs the same code, and thus also
the same queries, it will still need its own copy of the prepared statement.
Prepared statements are specific to a single database connection, so they can't
be passed around between the threads. The same argument as for CPU overhead
applies here: as long as threads are long-lived, this will not be a very
large problem.
Transaction grouping
: Each thread has full access to its own database connection, so it can easily
batch many queries in a single transaction. It is not possible, however, to
group queries from the other threads in this same transaction as well. The
grouping may therefore not be as optimal as a message passing solution could
provide, but it is still a large improvement compared to connection sharing.
Background processing
: Background processing is not easily possible. While it is possible to spawn a
separate thread for each query that needs to be processed in the background, a
new database connection will have to be opened every time this is done. This
solution will obviously not be very efficient.
Concurrency
: In general, it is not possible to get better concurrency than by providing each
thread with its own database connection. This solution definitely wins in this
area.
Portability
: Thread-local connections are very portable: the only requirement is that SQLite
has been built with threading support enabled. Connections are not passed
around between threads, so any SQLite version will do. In order to make use of
the shared cache feature, however, SQLite 3.5.0 is required.
## Connection pooling
A common approach in server-like applications is to have a connection pool.
When a thread wishes to have access to the database, it requests a database
connection from a pool of (currently) unused database connections. If no unused
connections are available, it can either wait until one becomes available, or
create a new database connection on its own. When a thread is done with a
connection, it will add it back to the pool to allow it to be re-used in an
other thread.
The following example illustrates a basic connection pool implementation in
which a thread creates a new database connection when no connections are
available. A global `db_pool` is defined, on which any thread can call
`pool_pop()` to get an SQLite connection if there is one available, and
`pool_push()` can be used to push a connection back to the pool. This pool can
be implemented as any kind of set: a FIFO or a stack could do the trick, as
long as it can be accessed from multiple threads concurrently.
```c
// Some global pool of database connections
pool_t *db_pool;

sqlite3 *get_database() {
    sqlite3 *db = pool_pop(db_pool);
    if(db)
        return db;
    if(sqlite3_open("database.sqlite3", &db))
        return NULL;
    return db;
}

void *some_thread(void *arg) {
    // Do some work
    sqlite3 *db = get_database();
    // Do some work on the database
    pool_push(db_pool, db);
    return NULL;
}

int main(int argc, char **argv) {
    int i;
    for(i=0; i<10; i++)
        thread_create(some_thread, NULL);
    // Wait until the threads are done
    return 0;
}
```
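The `pool_pop()` and `pool_push()` helpers are assumed above; a minimal stack-based pool protected by a mutex could look like the following. This sketch uses `void *` payloads so it is not tied to SQLite, and `pool_pop()` returns NULL when the pool is empty:

```c
#include <pthread.h>
#include <stdlib.h>

#define POOL_MAX 16

typedef struct pool_t {
    void *items[POOL_MAX];
    int len;
    pthread_mutex_t lock;
} pool_t;

pool_t *pool_create(void) {
    pool_t *p = calloc(1, sizeof(pool_t));
    pthread_mutex_init(&p->lock, NULL);
    return p;
}

// Take a connection from the pool, or NULL if none are available.
void *pool_pop(pool_t *p) {
    void *item = NULL;
    pthread_mutex_lock(&p->lock);
    if(p->len > 0)
        item = p->items[--p->len];
    pthread_mutex_unlock(&p->lock);
    return item;
}

// Return a connection to the pool. Returns 0 on success, or -1 if the
// pool is full, in which case the caller should close the connection.
int pool_push(pool_t *p, void *item) {
    int r = -1;
    pthread_mutex_lock(&p->lock);
    if(p->len < POOL_MAX) {
        p->items[p->len++] = item;
        r = 0;
    }
    pthread_mutex_unlock(&p->lock);
    return r;
}
```

`POOL_MAX` caps how many idle connections are kept around; an unbounded pool would use a dynamically grown array or linked list instead.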
Implementation size
: A connection pool is in essence not very different from thread-local
connections. The only major difference is that the call to sqlite3\_open() is
replaced with a function call to obtain a connection from the pool and
sqlite3\_close() with one to give it back to the pool. As shown above, these
functions can be fairly simple. Note, however, that unlike with thread-local
connections it is advisable to "open" and "close" a connection more often in
long-running threads, in order to give other threads a chance to use the
connection as well.
Memory/CPU overhead
: This mainly depends on the number of connections you allow to be in memory at
any point in time. If this number is not bounded, as in the above example, then
you can assume that after running your program for a certain time, there will
always be enough unused connections available in the pool. Requesting a
connection will then be very fast, since the overhead of creating a new
connection, as would have been done with thread-local connections, is
completely gone.
In terms of memory usage, however, it would be more efficient to put a maximum
limit on the number of open connections, and have the thread wait until another
thread gives a connection back to the pool. Similarly to thread-local
connections, memory usage can be decreased by using SQLite's cache sharing
feature.
Prepared statement re-use
: Unfortunately, this is where connection pooling borrows from connection
sharing. Prepared statements must be cleaned up before passing a connection to
another thread if one aims to be portable. But even if you remove that
portability requirement, prepared statements are always specific to a single
connection. Since you can't assume that you will always get the same connection
from the pool, caching prepared statements is not practical.
On the other hand, a connection pool does allow you to use a single connection
for a longer period of time than with connection sharing without negatively
affecting concurrency. Unless, of course, there is a limit on the number of
open connections, in which case using a connection for a long period of time
may starve another thread.
Transaction grouping
: Pretty much the same arguments with re-using prepared statements also apply to
transaction grouping: Transactions should be committed to disk before passing a
connection back to the pool.
Background processing
: This is also where a connection pool shares a lot of similarity with
connection sharing. With thread-local connections, creating a worker thread to
perform database operations in the background would be very inefficient. But
since a connection pool tackles that inefficiency by re-using connections, this
is no longer a problem. The same warning about dependent queries still applies,
though.
Concurrency
: Connection pooling gives you fine-grained control over how much concurrency
you'd like to have. For maximum concurrency, don't put a limit on the number of
maximum database connections. If there is a limit, then that will decrease the
maximum concurrency in favor of lower memory usage.
Portability
: Since database connections are being passed among threads, connection pooling
will require at least SQLite 3.3.1 compiled with thread safety enabled. Making
use of its cache sharing capabilities to reduce memory usage will require
SQLite 3.5.0 or higher.
# Final notes
As for what I used in ncdc: I initially chose connection sharing for its
simplicity. Then when I noticed that the UI became less responsive than I found
acceptable I started adding a simple queue for background processing of
queries. Later I stumbled upon the main problem with that solution: I wanted to
read back a value that was written in a background thread, and had no way of
knowing whether the background thread had finished executing that query or not.
I then decided to expand the background thread to allow for passing back query
results, and transformed everything into a full message passing solution. This
appears to be working well at the moment, and my current implementation has
support for both prepared statement re-use and transaction grouping, which
measurably increased performance.
To summarize, there isn't really a _best_ solution that works for every
application. Connection sharing works well for applications where
responsiveness and concurrency aren't of major importance. Message passing works
well for applications that aim to be responsive, and is flexible enough for
optimizing CPU and I/O by re-using prepared statements and grouping queries in
larger transactions. Thread-local connections are suitable for applications
that have a relatively fixed number of threads, whereas connection pooling
works better for applications with a varying number of worker threads.