=pod

I<*But see the L<Bugs and Limitations|/Bugs and Limitations> and L<Conformance Issues|/Conformance Issues> below.>

Yxml is a small (C<6 KiB>) L<non-validating|/Validating vs. non-validating> yet
mostly conforming XML parser written in C. Its primary goals are small binary
size, simplicity and correctness. It also happens to be L<pretty
fast|/Comparison>.

The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
is available under a permissive MIT license. The only two files you need are
L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h>, which can easily be
included and compiled as part of your project. Complete API documentation is
available in L<the manual|http://dev.yorhel.nl/yxml/man>.

The API follows a simple and mostly buffer-less design, and only consists of
three functions:

  void yxml_init(yxml_t *x, void *buf, size_t bufsize);
  yxml_ret_t yxml_parse(yxml_t *x, int ch);
  yxml_ret_t yxml_eof(yxml_t *x);

Be aware that I<simple> is not necessarily I<easy> or I<convenient>. The API is
relatively low-level and designed to integrate into pretty much any application
and for any use case. This includes incrementally parsing data from a socket in
an event-driven fashion and parsing large XML files on memory-restricted
devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.

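As a rough illustration of what this byte-at-a-time, buffer-less calling
convention means for the caller, the following self-contained sketch mimics the
init/parse/eof shape with a toy that only tracks element nesting. It is I<not>
yxml and does not use its API: C<toy_t>, C<toy_init()>, C<toy_parse()>,
C<toy_eof()> and C<toy_feed()> are hypothetical names, and the toy deliberately
ignores attributes, comments, names and everything else a real parser handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy parser, NOT yxml: it only tracks how deeply elements
 * are nested and does not understand attributes, comments or names. */
typedef enum { TOY_OK = 0, TOY_ERR = -1 } toy_ret_t;
typedef enum { T_TEXT, T_LT, T_OPEN, T_SLASH, T_CLOSE } toy_state_t;

typedef struct {
    toy_state_t state;
    unsigned depth;   /* number of currently open elements */
} toy_t;

void toy_init(toy_t *p) {
    assert(p != NULL);
    p->state = T_TEXT;
    p->depth = 0;
}

/* Consume exactly one byte. All state lives in *p, so the caller may
 * stop and resume between any two bytes. */
toy_ret_t toy_parse(toy_t *p, int ch) {
    switch (p->state) {
    case T_TEXT:                       /* between tags */
        if (ch == '<') p->state = T_LT;
        return TOY_OK;
    case T_LT:                         /* just saw '<' */
        if (ch == '/') { p->state = T_CLOSE; return TOY_OK; }
        if (ch == '>' || ch == '<') return TOY_ERR;
        p->state = T_OPEN;
        return TOY_OK;
    case T_OPEN:                       /* inside an opening tag */
        if (ch == '/') { p->state = T_SLASH; return TOY_OK; }
        if (ch == '<') return TOY_ERR;
        if (ch == '>') { p->depth++; p->state = T_TEXT; }
        return TOY_OK;
    case T_SLASH:                      /* saw '/' of a self-closing tag */
        if (ch != '>') return TOY_ERR;
        p->state = T_TEXT;
        return TOY_OK;
    case T_CLOSE:                      /* inside a closing tag */
        if (ch == '<') return TOY_ERR;
        if (ch == '>') {
            if (p->depth == 0) return TOY_ERR;  /* nothing left to close */
            p->depth--;
            p->state = T_TEXT;
        }
        return TOY_OK;
    }
    return TOY_ERR;
}

/* Input is complete only if no tag and no element is left open. */
toy_ret_t toy_eof(const toy_t *p) {
    return (p->state == T_TEXT && p->depth == 0) ? TOY_OK : TOY_ERR;
}

/* Feed one chunk, e.g. whatever a single read() call returned. */
toy_ret_t toy_feed(toy_t *p, const char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (toy_parse(p, (unsigned char)buf[i]) != TOY_OK)
            return TOY_ERR;
    }
    return TOY_OK;
}
```

Because all parser state lives in C<toy_t>, input may arrive in arbitrary
pieces: feeding C<< <a><b >> and later C<< />text</a> >> through C<toy_feed()>
gives the same result as feeding the whole document at once, which is the
property that makes this style usable from event-driven socket code.
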

There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.

=head3 Features

=over

=item * Simple and low-level API.

=item * Does not require C<malloc()>.

=item * Pure C, should be very portable.

=item * Recognizes and consumes the UTF-8 BOM.

=item * Parses entity references (C<&amp;>) and character references (C<&#x20;>).

=item * Verifies most well-formedness constraints, including the correct
nesting of elements.

=item * Parses XML documents in any ASCII-compatible encoding.

=back

But let's not be I<too> optimistic, because there are also...

=head3 Bugs and Limitations

=over

=item * A conditional section in a C<< <!DOCTYPE ..> >> declaration will result
in a parse error.

=item * Allows multiple C<< <!DOCTYPE ..> >> declarations.

=item * Information encoded in the XML and doctype declarations is currently
not available through the API.

=back

I hope to have these issues fixed in the near future.

=head3 Conformance Issues

=over

=item * Does not verify that non-ASCII characters in element names, element
content, attribute names and attribute values are within the allowed Unicode
character ranges.

=item * Does not verify that attribute names within the same element are unique.

=item * Does not verify that the contents of a C<< <!DOCTYPE ..> >> declaration
follow the XML grammar.

=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
to convert them to UTF-8 or something similar first.

=item * No support for custom entity references, either through the API or
using C<< <!ENTITY> >>.

=back

These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above-mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.

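Two of the checks in the list above lend themselves to small application-side
helpers. The sketch below is an assumption about how an application might do
this, not part of yxml: C<xml_char_allowed()> tests an already decoded code
point against the general C<Char> production of XML 1.0 (names use narrower
ranges that are not shown, and UTF-8 decoding is left to the caller), while
C<attr_set_t> and C<attr_set_add()> detect repeated attribute names using a
fixed-size table, so no C<malloc()> is needed here either. All names and the
C<MAX_ATTRS> and C<MAX_NAME> limits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* XML 1.0 "Char" production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
int xml_char_allowed(uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Arbitrary illustrative limits, chosen only to avoid dynamic allocation. */
#define MAX_ATTRS 16
#define MAX_NAME  64

typedef struct {
    char names[MAX_ATTRS][MAX_NAME];  /* attribute names seen so far */
    size_t count;
} attr_set_t;

/* Forget all names; meant to be called when a new element starts. */
void attr_set_reset(attr_set_t *s) {
    assert(s != NULL);
    s->count = 0;
}

/* Record an attribute name for the current element.
 * Returns 0 if the name is new, 1 if it is a duplicate,
 * and -1 if the name is too long or the table is full. */
int attr_set_add(attr_set_t *s, const char *name) {
    assert(s != NULL && name != NULL);
    size_t len = strlen(name);
    if (len >= MAX_NAME)
        return -1;
    for (size_t i = 0; i < s->count; i++) {
        if (strcmp(s->names[i], name) == 0)
            return 1;
    }
    if (s->count == MAX_ATTRS)
        return -1;
    memcpy(s->names[s->count++], name, len + 1);  /* copy including NUL */
    return 0;
}
```

An application would call C<attr_set_reset()> whenever an element starts and
C<attr_set_add()> for each attribute name the parser reports, treating a
return value of 1 as a well-formedness error.
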

=head3 Non-features

And now follows a list of things that are not part of the core XML
specification and are not directly supported. As with the conformance issues,
these features can be implemented on top of yxml.

=over

=item * No helper functions to deal with namespaces. Yxml will parse XML files
with namespaces just fine, but it's up to the application to do the rest.

=item * No DTD or XML Schema validation.

=item * No XSLT.

=item * No XPath.

=item * Doesn't do your household chores.

=back

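As a hedged sketch of the first step an application might take for namespaces,
the following helper splits a qualified name such as C<svg:rect> into its
prefix and local part. C<qname_t> and C<qname_split()> are hypothetical names
and not part of yxml; mapping the prefix to a namespace URI by tracking the
C<xmlns> attributes in scope is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *prefix;   /* NOT NUL-terminated; NULL if there is no prefix */
    size_t prefix_len;    /* length of the prefix in bytes */
    const char *local;    /* NUL-terminated, points into the original name */
} qname_t;

/* Split "prefix:local" or "local" without copying anything.
 * Returns 0 on success and -1 if the name is empty, has an empty prefix
 * or local part, or contains more than one colon. */
int qname_split(const char *name, qname_t *out) {
    assert(name != NULL && out != NULL);
    const char *colon = strchr(name, ':');
    if (colon == NULL) {
        out->prefix = NULL;
        out->prefix_len = 0;
        out->local = name;
        return *name ? 0 : -1;
    }
    if (colon == name || colon[1] == '\0' || strchr(colon + 1, ':') != NULL)
        return -1;
    out->prefix = name;
    out->prefix_len = (size_t)(colon - name);
    out->local = colon + 1;
    return 0;
}
```

The result only points into the caller's string, so it stays valid exactly as
long as the name passed in does.
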

=head2 Comparison

The following benchmark compares L<expat|http://expat.sourceforge.net/>,
L<libxml2|http://xmlsoft.org/> and
L<Mini-XML|http://www.msweet.org/projects.php?Z3> with yxml. A L<strlen(3)>
implementation is also included as an indication of the "theoretical" minimum.

                                      SIZE         PERFORMANCE
  LIB      VER    LICENSE           OBJ   STATIC   WIKI  DISCOGS
  strlen                             25      816   0.16     0.09
  expat    2.1.0  MIT           162 139  194 432   1.47     1.09
  libxml2  2.9.1  MIT           464 328  518 816   2.53     1.75
  mxml     2.7    LGPL2+static   32 733   75 832  12.38     7.80
  yxml     git    MIT             5 971   31 416   1.15     0.74

The code for these benchmarks is available in the
L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
explanatory notes:

=over

=item * C<OBJ> is the total size of all object code of the library, measured
with L<size(1)>.

=item * C<STATIC> is the file size of a minimal statically linked binary when
linked against L<musl|http://www.musl-libc.org/> 0.9.13, measured with
L<wc(1)> after running L<strip(1)>.

=item * The performance is the time, in seconds, to load a large XML file.
C<WIKI> refers to C<enwiki-20130805-abstract5.xml> (162 MiB) from a L<Wikipedia
Dump|http://dumps.wikimedia.org/enwiki/>, C<DISCOGS> refers to
C<discogs_20130801_labels.xml> (94 MiB) from a L<Discogs Data
Dump|http://www.discogs.com/data/>.

=item * Libxml2 has been compiled with most of its features disabled with
C<./configure>, but it still manages to be the very definition of bloat.

=item * Everything has been compiled with gcc 4.8.1 at C<-O2>.

=item * Benchmarks are run on Linux 3.10.7 with a 3 GHz Intel Core Duo E8400
and with 4 GB RAM.

=back

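To put the timings into perspective, they can be converted into throughput and
into a slowdown factor relative to the C<strlen> baseline. The helpers below
are only a worked-arithmetic sketch with hypothetical names; the inputs are
taken from the table above. For the C<-O2> C<WIKI> run, yxml's 1.15 seconds
for 162 MiB works out to roughly 141 MiB/s, or about 7.2 times the C<strlen>
time of 0.16 seconds.

```c
#include <assert.h>

/* Throughput in MiB/s for a file of size_mib MiB loaded in `seconds`. */
double throughput_mib_s(double size_mib, double seconds) {
    assert(seconds > 0.0);
    return size_mib / seconds;
}

/* How many times longer a parser takes than the strlen baseline
 * on the same input file. */
double slowdown_vs_baseline(double parser_seconds, double baseline_seconds) {
    assert(baseline_seconds > 0.0);
    return parser_seconds / baseline_seconds;
}
```

For example, C<throughput_mib_s(162.0, 1.15)> gives the yxml C<WIKI> figure
and C<slowdown_vs_baseline(1.15, 0.16)> the corresponding factor.
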

And just for fun, here's the same comparison when compiled with C<-Os>, i.e.
optimized for small size. Interestingly enough, Mini-XML actually runs faster
with C<-Os> than with C<-O2>.

                                      SIZE         PERFORMANCE
  LIB      VER    LICENSE           OBJ   STATIC   WIKI  DISCOGS
  strlen                             25      816   0.16     0.09
  expat    2.1.0  MIT           113 314  145 632   1.58     1.20
  libxml2  2.9.1  MIT           356 948  412 256   3.01     2.08
  mxml     2.7    LGPL2+static   27 725   71 704  11.70     7.44
  yxml     git    MIT             4 955   30 392   1.67     1.02

=head2 Validating vs. non-validating

TL;DR: yxml does I<not> accept garbage XML documents; it will correctly handle
and report issues if the input does not strictly follow the XML grammar.

The terms I<validating> and I<non-validating> have specific meanings within the
context of XML. A validating parser is one that reads the doctype declaration
(DTD) associated with a document, and validates that the contents of the
document follow the rules described in the DTD. A DTD may also include
instructions on how to parse the document, including the definition of custom
entity references (C<&whatever;>) and instructions on how attribute values or
element contents should be normalized before the data is passed to the
application.

A non-validating parser is one that ignores the DTD and happily parses
documents that do not follow the rules described in that DTD. Such parsers
(usually) don't support custom entity references and will not normalize
attribute values or element contents. A non-validating parser still has to
verify that the XML document follows the XML syntax rules.

It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
Schema. Wikipedia L<has more
information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on this.
Using a validating parser for such documents would only add bloat and may
introduce L<potential security
vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.

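The linked attack works by nesting entity definitions so that a tiny document
expands to an enormous string. The following sketch, using the hypothetical
function C<expanded_size()>, computes the expanded size for a given nesting
depth and fan-out. In the commonly cited form of the attack, a 3-byte string
is referenced through 9 levels of 10 references each, which yields
3 000 000 000 bytes. A parser that never expands custom entities, as is
currently the case for yxml, has nothing to amplify.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the fully expanded text when each of `levels` nested
 * entities references the previous one `fanout` times and the innermost
 * entity expands to `base_len` bytes.
 * Returns UINT64_MAX if the result does not fit in 64 bits. */
uint64_t expanded_size(uint64_t base_len, unsigned fanout, unsigned levels) {
    assert(fanout > 0);
    uint64_t size = base_len;
    for (unsigned i = 0; i < levels; i++) {
        if (size > UINT64_MAX / fanout)
            return UINT64_MAX;   /* multiplication would overflow */
        size *= fanout;
    }
    return size;
}
```

Calling C<expanded_size(3, 10, 9)> reproduces the figure above; each extra
level multiplies the result by the fan-out again.
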