
=pod

I<*But see the L<Bugs and Limitations|/Bugs and Limitations> and L<Conformance Issues|/Conformance Issues> below.>

Yxml is a small (C<6 KiB>) L<non-validating|/Validating vs. non-validating> yet
mostly conforming XML parser written in C.  Its primary goals are small binary
size, simplicity and correctness. It also happens to be L<pretty
fast|/Comparison>.

The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
is available under a permissive MIT license. The only two files you need are
L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h>, which can easily be
included and compiled as part of your project. Complete API documentation is
available in L<the manual|http://dev.yorhel.nl/yxml/man>.

The API follows a simple and mostly buffer-less design, and only consists of
three functions:

  void yxml_init(yxml_t *x, void *buf, size_t bufsize);
  yxml_ret_t yxml_parse(yxml_t *x, int ch);
  yxml_ret_t yxml_eof(yxml_t *x);

Be aware that I<simple> is not necessarily I<easy> or I<convenient>. The API is
relatively low-level and designed to integrate into pretty much any application
and for any use case. This includes incrementally parsing data from a socket in
an event-driven fashion and parsing large XML files on memory-restricted
devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.
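
To illustrate the event-driven use case: a parser that consumes one byte per
call and keeps all of its state in a single structure can be fed in whatever
chunks happen to arrive. The following is a minimal sketch of that pattern, not
code from yxml itself. The names C<step_fn>, C<feed_chunk> and C<count_step>
are hypothetical. C<count_step> is only a stand-in so that the example compiles
without yxml; a real program would pass a thin wrapper around C<yxml_parse()>
instead, assuming that a negative return value signals an error.

```c
#include <stddef.h>

/* Hypothetical per-byte callback: returns a negative value on error. */
typedef int (*step_fn)(void *state, int ch);

/* Feed one chunk to the parser, byte by byte.
 * Returns the number of bytes consumed. This is less than len only if the
 * step function reported an error, in which case *err holds its return
 * value; otherwise *err is 0. */
static size_t feed_chunk(step_fn step, void *state,
                         const char *buf, size_t len, int *err)
{
    size_t i;
    *err = 0;
    for (i = 0; i < len; i++) {
        /* Cast through unsigned char so bytes >= 0x80 are passed as
         * positive values. */
        int r = step(state, (unsigned char)buf[i]);
        if (r < 0) {
            *err = r;
            return i;
        }
    }
    return len;
}

/* Stand-in for a real parser: counts '<' characters, rejects NUL bytes. */
static int count_step(void *state, int ch)
{
    if (ch == 0)
        return -1;
    if (ch == '<')
        ++*(size_t *)state;
    return 0;
}
```

Because the state lives outside C<feed_chunk()>, calling it once per received
chunk gives the same result as a single call on the concatenated input, which
is the property an event loop relies on.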

There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.

=head3 Features

=over

=item * Simple and low-level API.

=item * Does not require C<malloc()>.

=item * Pure C, should be very portable.

=item * Recognizes and consumes the UTF-8 BOM.

=item * Parses entity references (C<&amp;>) and character references (C<&#x26;>).

=item * Verifies most well-formedness constraints, including the correct
nesting of elements.

=item * Parses XML documents in any ASCII-compatible encoding.

=back
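
Checking element nesting without C<malloc()> is possible with a stack of names
kept in a caller-supplied buffer. The sketch below shows one way to do this and
is an assumption about the general technique, not a copy of how yxml implements
it. All names (C<name_stack>, C<ns_init>, C<ns_push>, C<ns_pop>) are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Stack of NUL-terminated names stored back to back in a caller buffer. */
typedef struct {
    char *buf;
    size_t size, used;
} name_stack;

static void ns_init(name_stack *s, void *buf, size_t size)
{
    s->buf = buf;
    s->size = size;
    s->used = 0;
}

/* Push an element name; returns -1 if the buffer is too small. */
static int ns_push(name_stack *s, const char *name)
{
    size_t n = strlen(name) + 1;
    if (n > s->size - s->used)
        return -1;
    memcpy(s->buf + s->used, name, n);
    s->used += n;
    return 0;
}

/* Pop the top name if it equals `name`. Returns -1 on a mismatch or when
 * the stack is empty (a close tag without a matching open tag); the stack
 * is left unchanged in that case. */
static int ns_pop(name_stack *s, const char *name)
{
    size_t start;
    if (s->used == 0)
        return -1;
    /* Start at the terminating NUL of the top name and walk back to the
     * first byte of that name. */
    start = s->used - 1;
    while (start > 0 && s->buf[start - 1] != '\0')
        start--;
    if (strcmp(s->buf + start, name) != 0)
        return -1;
    s->used = start;
    return 0;
}
```

The size of the buffer bounds the total length of all open element names, so a
deeply nested document fails with an error instead of allocating more memory.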

But let's not be I<too> optimistic, because there are also...

=head3 Bugs and Limitations

=over

=item * A conditional section in a C<< <!DOCTYPE ..> >> declaration will result
in a parse error.

=item * Allows multiple C<< <!DOCTYPE ..> >> declarations.

=item * Information encoded in the XML and doctype declarations is currently
not available through the API.

=back

I hope to have these issues fixed in the near future.

=head3 Conformance Issues

=over

=item * Does not verify that non-ASCII characters in element names, element
content, attribute names and attribute values are within the allowed Unicode
character ranges.

=item * Does not verify that attribute names within the same element are unique.

=item * Does not verify that the contents of a C<< <!DOCTYPE ..> >> declaration
follow the XML grammar.

=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
to convert it to UTF-8 or something similar first.

=item * No support for custom entity references, neither through the API nor
using C<< <!ENTITY> >>.

=back

These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above-mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.
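
As an illustration of what such a wrapper might do, the sketch below covers two
of the checks listed above: whether a decoded code point falls within the
C<Char> range of the XML 1.0 specification, and whether an attribute name has
already been seen in the current element. None of this is part of yxml.
C<is_xml_char>, C<attr_set> and the fixed limits are hypothetical, and the
application is assumed to decode UTF-8 and to collect complete attribute names
itself. Names are subject to stricter rules (C<NameStartChar>, C<NameChar>)
that are not checked here.

```c
#include <stdint.h>
#include <string.h>

/* XML 1.0 "Char" production: the code points allowed in a document. */
static int is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

#define MAX_ATTRS 32   /* arbitrary limits chosen for this sketch */
#define MAX_NAME  64

/* Attribute names seen so far in the current element. */
typedef struct {
    char names[MAX_ATTRS][MAX_NAME];
    int count;
} attr_set;

/* Call when a new element starts. */
static void attr_set_reset(attr_set *s)
{
    s->count = 0;
}

/* Record an attribute name.
 * Returns 0 on success, -1 if the name is a duplicate,
 * -2 if the name is too long or the set is full. */
static int attr_set_add(attr_set *s, const char *name)
{
    int i;
    if (strlen(name) >= MAX_NAME)
        return -2;
    for (i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return -1;
    if (s->count >= MAX_ATTRS)
        return -2;
    strcpy(s->names[s->count++], name);
    return 0;
}
```

Like yxml itself, the sketch uses fixed-size storage and no C<malloc()>; an
application with different limits would adjust C<MAX_ATTRS> and C<MAX_NAME>.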

=head3 Non-features

And now follows a list of things that are not part of the core XML
specification and are not directly supported.  As with the conformance issues,
these features can be implemented on top of yxml.

=over

=item * No helper functions to deal with namespaces. Yxml will parse XML files
with namespaces just fine, but it's up to the application to do the rest.

=item * No DTD or XML Schema validation.

=item * No XSLT.

=item * No XPath.

=item * Doesn't do your household chores.

=back
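
For example, separating a qualified name into its namespace prefix and local
part takes only a few lines of application code. A minimal sketch, in which
C<split_qname> is a hypothetical helper and not part of yxml. Resolving the
prefix to a namespace URI through the C<xmlns> attributes in scope is left out.

```c
#include <stddef.h>
#include <string.h>

/* Split "prefix:local" at the first colon.
 * Sets *local to the local part and returns the length of the prefix,
 * or 0 if the name has no prefix. The input string is not modified. */
static size_t split_qname(const char *qname, const char **local)
{
    const char *colon = strchr(qname, ':');
    if (colon == NULL) {
        *local = qname;
        return 0;
    }
    *local = colon + 1;
    return (size_t)(colon - qname);
}
```

The helper returns a length instead of copying the prefix, so it needs no
extra buffer; the prefix is the first C<n> bytes of the original name.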


=head2 Comparison

The following benchmark compares L<expat|http://expat.sourceforge.net/>,
L<libxml2|http://xmlsoft.org/> and
L<Mini-XML|http://www.msweet.org/projects.php?Z3> with yxml. A L<strlen(3)>
implementation is also included as an indication of the "theoretical" minimum.

                                    SIZE                  PERFORMANCE
  LIB      VER    LICENSE              OBJ    STATIC       WIKI  DISCOGS
  strlen                                      25 816       0.16     0.09
  expat    2.1.0  MIT              162 139   194 432       1.47     1.09
  libxml2  2.9.1  MIT              464 328   518 816       2.53     1.75
  mxml     2.7    LGPL2+static      32 733    75 832      12.38     7.80
  yxml     git    MIT                5 971    31 416       1.15     0.74

The code for these benchmarks is available in the
L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
explanatory notes:

=over

=item * C<OBJ> is the total size of all object code of the library, measured
with L<size(1)>.

=item * C<STATIC> is the file size of a minimal statically linked binary when
linked against L<musl|http://www.musl-libc.org/> 0.9.13, measured with
L<wc(1)> after running L<strip(1)>.

=item * The performance is the time, in seconds, to load a large XML file.
C<WIKI> refers to C<enwiki-20130805-abstract5.xml> (162 MiB) from a L<Wikipedia
Dump|http://dumps.wikimedia.org/enwiki/>, C<DISCOGS> refers to
C<discogs_20130801_labels.xml> (94 MiB) from a L<Discogs Data
Dump|http://www.discogs.com/data/>.

=item * Libxml2 has been compiled with most of its features disabled with
C<./configure>, but it still manages to be the very definition of bloat.

=item * Everything has been compiled with gcc 4.8.1 at C<-O2>.

=item * Benchmarks are run on Linux 3.10.7 with a 3 GHz Intel Core Duo E8400
and with 4 GB RAM.

=back

And just for fun, here's the same comparison when compiled with C<-Os>, i.e.
optimized for small size. Interestingly enough, Mini-XML actually runs faster
with C<-Os> than with C<-O2>.

                                    SIZE                  PERFORMANCE
  LIB       VER    LICENSE             OBJ    STATIC       WIKI  DISCOGS
  strlen                                      25 816       0.16     0.09
  expat     2.1.0  MIT             113 314   145 632       1.58     1.20
  libxml2   2.9.1  MIT             356 948   412 256       3.01     2.08
  mxml      2.7    LGPL2+static     27 725    71 704      11.70     7.44
  yxml      git    MIT               4 955    30 392       1.67     1.02


=head2 Validating vs. non-validating

TL;DR: yxml does I<not> accept garbage XML documents. It detects and reports
input that does not strictly follow the XML grammar.

The terms I<validating> and I<non-validating> have specific meanings within the
context of XML. A validating parser is one that reads the document type
definition (DTD) given in or referenced by a document's doctype declaration,
and validates that the contents of the document follow the rules described in
the DTD. A DTD may also include instructions on how to parse the document,
including the definition of custom entity references (C<&whatever;>) and
instructions on how attribute values or element contents should be normalized
before their data is passed to the application.

A non-validating parser is one that ignores the DTD and happily parses
documents that do not follow the rules described in that DTD. Such parsers
(usually) don't support custom entity references and will not normalize
attribute values or element contents. A non-validating parser still has to
verify that the XML document follows the XML syntax rules.

Note that many XML documents found in the wild are not described with a DTD,
but instead use an alternative technology such as XML Schema. Wikipedia L<has
more information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on
this. Using a validating parser for such documents would only add bloat and may
introduce L<security
vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.