Rewrite to static site

With a complete reorganisation of the directory structure and most of
the content converted to pandoc-flavoured markdown.

Some TODOs left before this can go live:
- Main page
- Atom feeds
- Bug tracker
This commit is contained in:
Yorhel 2019-03-23 11:52:08 +01:00
parent 5c85a7d32f
commit 6242b2ee9c
291 changed files with 4346 additions and 6141 deletions

dat/doc/sqlaccess.md
% Multi-threaded Access to an SQLite3 Database
(Published on **2011-11-26**)
(Minor 2013-04-06 update: I abstracted my message passing solution from ncdc
and implemented it in a POSIX C library for general use. It's called _sqlasync_
and is part of my [Ylib library collection](/ylib).)

# Introduction
As I was porting [ncdc](/ncdc) over to use SQLite3 as storage backend, I
stumbled on a problem: The program uses a few threads for background jobs, and
it would be nice to give these threads access to the database.
Serializing all database access through the main thread wouldn't have been very
hard to implement in this particular case, but that would have been far from
optimal. The main thread is also responsible for keeping the user interface
responsive and handling most of the network interaction. Overall responsiveness
of the program would improve significantly if the threads could access the
database without involving the main thread.
Which brought me to the following questions: What solutions are available for
providing multi-threaded access to an SQLite database? What problems may I run
into? I was unable to find a good overview of this area on the net, so I wrote
this article in the hope of improving that situation.

# SQLite3 and threading
Let's first see what SQLite3 itself has to offer in terms of threading support.
The official documentation mentions threading support several times in various
places, but this information is scattered around and no good overview is given.
Someone has tried to organize this before on a [single
page](http://www.sqlite.org/cvstrac/wiki?p=MultiThreading), and while this
indeed gives a nice overview, it has unfortunately not been updated since 2006.
The advice there is therefore a little on the conservative side.
Nonetheless, it is wise to remain portable with different SQLite versions,
especially when writing programs that dynamically link with some random version
installed on someone's system. It should be fairly safe to assume that SQLite
binaries provided by most systems, if not all, are compiled with thread safety
enabled. This doesn't mean all that much, unfortunately: The only thing _thread
safe_ means in this context is that you can use SQLite3 in multiple threads,
but a single database connection should still stay within a single thread.
Since SQLite 3.3.1, which was released in early 2006, it is possible to move a
single database connection along multiple threads. Doing this with older
versions is not advisable, as explained in [the SQLite
FAQ](http://www.sqlite.org/faq.html#q6). But even with 3.3.1 and later there is
an annoying restriction: A connection can only be passed to another thread when
any outstanding statements are closed and finalized. In practice this means
that it is not possible to keep a prepared statement in memory for later
executions.
Since SQLite 3.5.0, released in 2007, a single SQLite connection can be used
from multiple threads simultaneously. SQLite will internally manage locks to
avoid any data corruption. I can't recommend making use of this facility,
however, as there are still many issues with the API. The [error fetching
functions](http://www.sqlite.org/c3ref/errcode.html) and
[sqlite3\_last\_insert\_row\_id()](http://www.sqlite.org/c3ref/last_insert_rowid.html),
among others, are still useless without explicit locking in the application. I
also believe that the previously mentioned restriction on having to finalize
statements has been relaxed in this version, so keeping prepared statements in
memory and passing them among different threads becomes possible.
When using multiple database connections within a single process, SQLite offers
a facility to allow [sharing of its
cache](http://www.sqlite.org/sharedcache.html), in order to reduce memory usage
and disk I/O. The semantics of this feature have changed with different SQLite
versions and appear to have stabilised in 3.5.0. This feature may prove useful
to optimize certain situations, but does not open up new possibilities of
communicating with a shared database.

# Criteria
Before looking at some available solutions, let's first determine the criteria
we can use to evaluate them.
Implementation size
: Obviously, a solution that requires only a few lines of code to implement is
preferable over one that requires several levels of abstraction in order to be
usable. I won't be giving actual implementations here, so the sizes will be
rough estimates for comparison purposes. The actual size of an implementation
is of course heavily dependent on the programming environment as well.
Memory/CPU overhead
: The most efficient solution for a single-threaded application is to simply have
direct access to a single database connection. Every solution is in principle a
modification or extension of this idea, and will therefore add a certain
overhead. This overhead manifests itself in both increased CPU and memory
usage. The magnitude of each varies between solutions.
Prepared statement re-use
: Is it possible to prepare a statement once and keep using it for the lifetime
of the program? Or will prepared statements have to be thrown away and
recreated every time? Keeping statement handles in memory will result in a nice
performance boost for applications that run the same SQL statement many times.
Transaction grouping
: A somewhat similar issue to prepared statement re-use: From a performance point
of view, it is very important to try to batch many UPDATE/DELETE/INSERT
statements within a single transaction, as opposed to running each modify query
separately. Running each query separately will force SQLite to flush the data
to disk separately every time, whereas a single transaction will batch-flush
all the changes to disk in a single go. Some solutions allow for grouping
multiple statements in a single transaction quite easily, while others require
more involved steps.
Background processing
: In certain situations it may be desirable to queue a certain query for later
processing, without explicitly waiting for it to complete. For example, if
something in the database has to be modified as a result of user interaction in
a UI thread, then the application would feel a lot more responsive if the
UPDATE query was simply queued to be processed in a background thread than when
the query had run in the UI thread itself. A database accessing solution with
built-in support for background processing of queries will significantly help
with building a responsive application.
Concurrency
: Concurrency indicates how well the solution allows for concurrent access. The
worst possible concurrency is achieved when a single database connection is
used for all threads, as only a single action can be performed on the database
at any point in time. Maximum concurrency is achieved when each thread has its
own SQLite connection. Note that maximum concurrency doesn't mean that the
database can be accessed in a _fully_ concurrent manner. SQLite uses internal
database-level locks to avoid data corruption, and these will limit the actual
maximum concurrency. I am not too knowledgeable about the inner workings of
these locks, but it is at least possible to have a large number of truly
concurrent database _reads_. Database _writes_ from multiple threads may
still allow for significantly more concurrency than when they are manually
serialized over a single database connection.
Portability
: What is the minimum SQLite version required to implement the solution? Does it
require any special OS features or SQLite compilation settings? As outlined
above, different versions of SQLite offer different features with regards to
threading. Relying on one of the relatively new features will decrease
portability.

# The Solutions
Here I present four solutions to allow database access from multiple threads.
Note that this list may not be exhaustive, these are just a few solutions that
I am aware of. Also note that none of the solutions presented here are in any
way new. Most of these paradigms are as old as the notion of concurrent
programming itself, and have been applied in software for decades.

## Connection sharing
By far the simplest solution to implement: Keep a single database connection
throughout your program and allow every thread to access it. Of course, you
will need to be careful to always put locks around the code where you access
the database handle. An example implementation could look like the following:

```c
// The global SQLite connection
sqlite3 *db;

int main(int argc, char **argv) {
    if(sqlite3_open("database.sqlite3", &db))
        exit(1);
    // Start some threads
    // Wait until the threads are finished
    sqlite3_close(db);
    return 0;
}

void *some_thread(void *arg) {
    sqlite3_mutex_enter(sqlite3_db_mutex(db));
    // Perform some queries on the database
    sqlite3_mutex_leave(sqlite3_db_mutex(db));
    return NULL;
}
```
Implementation size
: This is where connection sharing shines: There is little extra code required
when compared to using a database connection in a single-threaded context. All
you need to be careful of is to lock the mutex before using the database, and
to unlock it again afterwards.
Memory/CPU overhead
: As the only addition to the single-threaded case are the locks, this solution
has practically no memory overhead. The mutexes are provided by SQLite,
after all. CPU overhead is also as minimal as it can be: mutexes are the most
primitive type provided by threading libraries to serialize access to a shared
resource, and are therefore very efficient.
Prepared statement re-use
: Prepared statements can be safely re-used inside a single enter/leave block.
However, if you want to remain portable with SQLite versions before 3.5.0, then
any prepared statements **must** be freed before the mutex is unlocked. This can
be a major downside if the enter/leave blocks themselves are relatively short
but accessed quite often. If portability with older versions is not an issue,
then this restriction is gone and prepared statements can be re-used easily.
Transaction grouping
: A reliable implementation will not allow transactions to span multiple
enter/leave blocks. So as with prepared statements, transactions need to be
committed to disk before the mutex is unlocked. Again shared with prepared
statement re-use is that this limitation may prove to be a significant problem
in optimizing application performance, disk I/O in particular. One way to lower
the effects of this limitation is to increase the size of a single enter/leave
block, thus allowing for more work to be done in a single transaction. Code
restructuring may be required in order to efficiently implement this. Another
way to get around this problem is to allow a transaction to span multiple
enter/leave blocks. Implementing this reliably may not be an easy task,
however, and will most likely require application-specific knowledge.
Background processing
: Background processing is not natively supported with connection sharing. It is
possible to spawn a background thread to perform database operations each time
that this is desirable. But care should be taken to make sure that these
background threads will execute dependent queries in the correct order. For
example, if thread A spawns a background thread, say B, to execute an UPDATE
query, and later thread A wants to read that same data back, it must first wait
for thread B to finish execution. This may add more inter-thread communication
than is preferable.
Concurrency
: There is no concurrency at all here. Since the database connection is protected
by an exclusive lock, only a single thread can operate on the database at any
point in time. Additionally, one may be tempted to increase the size of an
enter/leave block in order to allow for larger transactions or better re-use of
prepared statements. However, any time spent on performing operations that do
not directly use the database within such an enter/leave block will lower the
maximum possible database concurrency even further.
Portability
: Connection sharing requires at least SQLite 3.3.1 in order to pass the same
database connection around. SQLite must be compiled with threading support
enabled. If prepared statements are kept around outside of an enter/leave
block, then version 3.5.0 or higher will be required.

## Message passing
An alternative approach is to allow only a single thread to access the
database. Any other thread that wants to access the database in any way will
then have to communicate with this database thread. This communication is done
by sending messages (_requests_) to the database thread, and, when query
results are required, receiving back one or more _response_ messages.
Message passing schemes and libraries are available for many programming
languages and come in many different forms. For this article, I am going to
assume that an asynchronous and unbounded FIFO queue is used to pass around
messages, but most of the following discussion will apply to bounded queues as
well. I'll try to note the important differences between the two where
applicable.
A very simple and naive implementation of a message passing solution is given
below. Here I assume that `queue_create()` will create a message queue (type
`message_queue`), `queue_get()` will return the next message in the queue, or
block if the queue is empty. `thread_create(func, arg)` will run _func_ in a
newly created thread and pass _arg_ as its argument. Error handling has been
omitted to keep this example concise.

```c
void *db_thread(void *arg) {
    message_queue *q = arg;
    sqlite3 *db;
    if(sqlite3_open("database.sqlite3", &db))
        return ERROR;
    request_msg *m;
    while((m = queue_get(q)) != NULL) {
        if(m->action == QUIT)
            break;
        if(m->action == EXEC)
            sqlite3_exec(db, m->query, NULL, NULL, NULL);
    }
    sqlite3_close(db);
    return OK;
}

int main(int argc, char **argv) {
    message_queue *db_queue = queue_create();
    thread_create(db_thread, db_queue);
    // Do work.
    return 0;
}
```
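
For completeness, here is one way the queue primitives assumed above could be sketched with POSIX threads. This is a minimal unbounded FIFO, an illustration rather than a production implementation: `queue_push()` is the sending half that the example leaves implicit, and error handling is again omitted.

```c
#include <pthread.h>
#include <stdlib.h>

// Minimal unbounded FIFO message queue: a singly linked list guarded
// by a mutex, with a condition variable so queue_get() can block
// while the queue is empty.
typedef struct node { void *msg; struct node *next; } node;
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    node *head, *tail;
} message_queue;

message_queue *queue_create(void) {
    message_queue *q = calloc(1, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    return q;
}

void queue_push(message_queue *q, void *msg) {
    node *n = malloc(sizeof(*n));
    n->msg = msg;
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if(q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->nonempty);  // wake one blocked reader
    pthread_mutex_unlock(&q->lock);
}

void *queue_get(message_queue *q) {
    pthread_mutex_lock(&q->lock);
    while(!q->head)                     // block until a message arrives
        pthread_cond_wait(&q->nonempty, &q->lock);
    node *n = q->head;
    q->head = n->next;
    if(!q->head)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    void *msg = n->msg;
    free(n);
    return msg;
}
```

A bounded queue would differ only in `queue_push()`, which would wait on a second condition variable while the queue is full.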
This example implementation has a single database thread running in the
background that accepts the messages `QUIT`, to stop processing queries and
close the database, and `EXEC`, to run a certain query on the database. No
support is available yet for passing query results back to the thread that sent
the message. This can be implemented by including a separate `message_queue`
object in the request messages, to which the results can be sent.
Implementation size
: This will largely depend on the used programming environment and the complexity
of the database thread. If your environment already comes with a message queue
implementation, and constructing the request/response messages is relatively
simple, then a simple implementation as shown above will not require much code.
On the other hand, if you have to implement your own message queue or want more
intelligence in the database thread to improve efficiency, then the complete
implementation may be significantly larger than that of connection sharing.
Memory/CPU overhead
: Constructing and passing around messages will incur a CPU overhead, though with
an efficient implementation this should not be significant enough to worry
about. Memory usage is highly dependent on the size of the messages being
passed around and the length of the queue. If messages are queued faster than
they are processed and there is no bound on the queue length, then a process
may quickly run out of memory. On the other hand, if messages are processed
fast enough then the queue will generally not have more than a single message
in it, and the memory overhead will remain fairly small.
Prepared statement re-use
: As the database connection will never leave the database thread, prepared
statements can be kept in memory and re-used without problems.
Transaction grouping
: A naive but robust implementation will handle each message in its own
transaction. A more clever database thread, however, could wait for multiple
messages to be queued and can then batch-execute them in a single transaction.
Correctly implementing this may require some additional information to be
specified along with the request, such as whether the query may be combined in
a single transaction or whether it may only be executed outside of a
transaction. Some threads may want to have confirmation that the data has been
successfully written to disk, in which case responsiveness will not improve if
such actions are queued for later processing. Nonetheless, since the database
thread has all the knowledge about the state of the database and any
outstanding actions, transaction grouping can be implemented quite reliably.
Background processing
: Background processing is supported natively with a message passing
implementation: a thread that isn't interested in query results can simply
queue the action to be performed by the database thread without indicating a
return path for the results. Of course, if a thread queues many messages that
do not require results followed by one that does, it will have to wait for all
earlier messages to be processed before receiving any results for the last one.
In the case that the actions are not dependent on each other, the database
thread may re-order the messages in order to process the last request first.
This requires knowledge about dependencies and may significantly complicate the
implementation, however.
Concurrency
: As with a shared database connection, database access is exclusive: Only a
single action can be performed on the database at a time. Unlike connection
sharing, however, any processing within the application will not further
degrade the maximum attainable concurrency. As long as unbounded asynchronous
queues are used to pass around messages, the database thread will be able to
continue working on the database without waiting for another thread to process
the results.
Portability
: This is where message passing shines: SQLite is only used within the database
thread, no other thread will have a need to call any SQLite function. This
allows any version of SQLite to be used, even those that have not been compiled
with thread safety enabled.

## Thread-local connections
A rather different approach to giving each thread access to a single database
is to simply open a new database connection for each thread. This way each
connection will be local to the specific thread, which in turn has the power to
do with it as it likes without worrying about what the other threads do. The
following is a short example to illustrate the idea:
```c
void *some_thread(void *arg) {
    sqlite3 *db;
    if(sqlite3_open("database.sqlite3", &db))
        return ERROR;
    // Do some work on the database
    sqlite3_close(db);
    return OK;
}

int main(int argc, char **argv) {
    int i;
    for(i=0; i<10; i++)
        thread_create(some_thread, NULL);
    // Wait until the threads are done
    return 0;
}
```
Implementation size
: Giving each thread its own connection is practically not much different from
the single-threaded case where there is only a single database connection. And
as the example shows, this can be implemented quite trivially.
Memory/CPU overhead
: If we assume that threads are not created very often and each thread has a
relatively long life, then the CPU and I/O overhead caused by opening a new
connection for each thread will not be very significant. On the other hand, if
threads are created quite often and lead a relatively short life before they
are destroyed again, then opening a new connection each time will soon require
more resources than running the queries themselves.
There is a significant memory overhead: every new database connection requires
memory. If each connection also has a separate cache, then every thread will
quickly require several megabytes only to interact with the database. Since
version 3.5.0, SQLite allows sharing of this cache with the other threads,
which will reduce this memory overhead.
Prepared statement re-use
: Prepared statements can be re-used without limitations within a single thread.
This will allow full re-use of prepared statements if each thread has a
different task, in which case every thread will have different queries and
access patterns anyway. But when every thread runs the same code, and thus also
the same queries, it will still need its own copy of the prepared statement.
Prepared statements are specific to a single database connection, so they can't
be passed around between the threads. The same argument as for CPU overhead
applies here: as long as threads are long-lived, this will not be a very large
problem.
Transaction grouping
: Each thread has full access to its own database connection, so it can easily
batch many queries in a single transaction. It is not possible, however, to
group queries from the other threads in this same transaction as well. The
grouping may therefore not be as optimal as a message passing solution could
provide, but it is still a large improvement compared to connection sharing.
Background processing
: Background processing is not easily possible. While it is possible to spawn a
separate thread for each query that needs to be processed in the background, a
new database connection will have to be opened every time this is done. This
solution will obviously not be very efficient.
Concurrency
: In general, it is not possible to get better concurrency than by providing each
thread with its own database connection. This solution definitely wins in this
area.
Portability
: Thread-local connections are very portable: the only requirement is that SQLite
has been built with threading support enabled. Connections are not passed
around between threads, so any SQLite version will do. In order to make use of
the shared cache feature, however, SQLite 3.5.0 is required.

## Connection pooling
A common approach in server-like applications is to have a connection pool.
When a thread wishes to have access to the database, it requests a database
connection from a pool of (currently) unused database connections. If no unused
connections are available, it can either wait until one becomes available, or
create a new database connection on its own. When a thread is done with a
connection, it will add it back to the pool to allow it to be re-used in
another thread.
The following example illustrates a basic connection pool implementation in
which a thread creates a new database connection when no connections are
available. A global `db_pool` is defined, on which any thread can call
`pool_pop()` to get an SQLite connection if there is one available, and
`pool_push()` can be used to push a connection back to the pool. This pool can
be implemented as any kind of set: a FIFO or a stack could do the trick, as
long as it can be accessed from multiple threads concurrently.
```c
// Some global pool of database connections
pool_t *db_pool;

sqlite3 *get_database() {
    sqlite3 *db = pool_pop(db_pool);
    if(db)
        return db;
    if(sqlite3_open("database.sqlite3", &db))
        return NULL;
    return db;
}

void *some_thread(void *arg) {
    // Do some work
    sqlite3 *db = get_database();
    // Do some work on the database
    pool_push(db_pool, db);
    return OK;
}

int main(int argc, char **argv) {
    int i;
    for(i=0; i<10; i++)
        thread_create(some_thread, NULL);
    // Wait until the threads are done
    return 0;
}
```
Implementation size
: A connection pool is in essence not very different from thread-local
connections. The only major difference is that the call to sqlite3\_open() is
replaced with a function call to obtain a connection from the pool and
sqlite3\_close() with one to give it back to the pool. As shown above, these
functions can be fairly simple. Note, however, that unlike with thread-local
connections it is advisable to "open" and "close" a connection more often in
long-running threads, in order to give other threads a chance to use the
connection as well.
Memory/CPU overhead
: This mainly depends on the number of connections you allow to be in memory at
any point in time. If this number is not bounded, as in the above example, then
you can assume that after running your program for a certain time, there will
always be enough unused connections available in the pool. Requesting a
connection will then be very fast, since the overhead of creating a new
connection, as would have been done with thread-local connections, is
completely gone.
In terms of memory usage, however, it would be more efficient to put a maximum
limit on the number of open connections, and have the thread wait until another
thread gives a connection back to the pool. Similarly to thread-local
connections, memory usage can be decreased by using SQLite's cache sharing
feature.
Prepared statement re-use
: Unfortunately, this is where connection pooling borrows from connection
sharing. Prepared statements must be cleaned up before passing a connection to
another thread if one aims to be portable. But even if you remove that
portability requirement, prepared statements are always specific to a single
connection. Since you can't assume that you will always get the same connection
from the pool, caching prepared statements is not practical.
On the other hand, a connection pool does allow you to use a single connection
for a longer period of time than with connection sharing without negatively
affecting concurrency. Unless, of course, there is a limit on the number of
open connections, in which case using a connection for a long period of time
may starve another thread.
Transaction grouping
: Pretty much the same arguments as for re-using prepared statements apply to
transaction grouping: Transactions should be committed to disk before passing a
connection back to the pool.
Background processing
: This is also where a connection pool shares a lot of similarity with connection
sharing. With thread-local connections, creating a worker thread to perform
database operations in the background would be very inefficient. But since this
inefficiency is being tackled by allowing connection re-use with a connection
pool, it's not a problem. Still the same warning applies with regard to
dependent queries, though.
Concurrency
: Connection pooling gives you fine-grained control over how much concurrency
you'd like to have. For maximum concurrency, don't put a limit on the number of
open database connections. If there is a limit, then that will decrease the
maximum concurrency in favor of lower memory usage.
Portability
: Since database connections are being passed among threads, connection pooling
will require at least SQLite 3.3.1 compiled with thread safety enabled. Making
use of its cache sharing capabilities to reduce memory usage will require
SQLite 3.5.0 or higher.
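
The bounded variant discussed above can be sketched with a mutex and a condition variable. The element type is `void *` only to keep the sketch self-contained; in practice the slots would hold `sqlite3 *` handles, pre-opened and pushed into the pool at startup.

```c
#include <pthread.h>
#include <stdlib.h>

// Sketch of a bounded pool: a fixed-capacity stack of idle connections
// guarded by a mutex, with a condition variable so pool_pop() waits
// while all connections are in use.
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t available;
    void **slot;   // stack of idle connections
    int count;     // idle connections currently in the pool
    int cap;       // maximum number of connections
} pool_t;

pool_t *pool_create(int cap) {
    pool_t *p = malloc(sizeof(*p));
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->available, NULL);
    p->slot = malloc(cap * sizeof(void *));
    p->count = 0;
    p->cap = cap;
    return p;
}

// Blocks until a connection is available.
void *pool_pop(pool_t *p) {
    pthread_mutex_lock(&p->lock);
    while(p->count == 0)
        pthread_cond_wait(&p->available, &p->lock);
    void *db = p->slot[--p->count];
    pthread_mutex_unlock(&p->lock);
    return db;
}

void pool_push(pool_t *p, void *db) {
    pthread_mutex_lock(&p->lock);
    p->slot[p->count++] = db;
    pthread_cond_signal(&p->available);  // wake one waiting thread
    pthread_mutex_unlock(&p->lock);
}
```

A thread then simply calls `pool_pop()`, which blocks while all connections are in use, and `pool_push()` when it is done with the connection.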

# Final notes
As for what I used in ncdc: I initially chose connection sharing for its
simplicity. Then, when I noticed that the UI became less responsive than I found
acceptable I started adding a simple queue for background processing of
queries. Later I stumbled upon the main problem with that solution: I wanted to
read back a value that was written in a background thread, and had no way of
knowing whether the background thread had finished executing that query or not.
I then decided to expand the background thread to allow for passing back query
results, and transformed everything into a full message passing solution. This
appears to be working well at the moment, and my current implementation has
support for both prepared statement re-use and transaction grouping, which
measurably increased performance.
To summarize, there isn't really a _best_ solution that works for every
application. Connection sharing works well for applications where
responsiveness and concurrency aren't of major importance. Message passing works
well for applications that aim to be responsive, and is flexible enough for
optimizing CPU and I/O by re-using prepared statements and grouping queries in
larger transactions. Thread-local connections are suitable for applications
that have a relatively fixed number of threads, whereas connection pooling
works better for applications with a varying number of worker threads.