% Yxml - A small, fast and correct\* XML parser

_\*But see the [Bugs and Limitations](#bugs-and-limitations) and [Conformance Issues](#conformance-issues) below._

Yxml is a small (`6 KiB`) [non-validating](#validating-vs.-non-validating) yet
mostly conforming XML parser written in C. Its primary goals are small binary
size, simplicity and correctness. It also happens to be [pretty
fast](#comparison).

The code can be obtained from the [git repo](https://g.blicky.net/yxml.git) and
is available under a permissive MIT license. The only two files you need are
[yxml.c](https://g.blicky.net/yxml.git/plain/yxml.c) and
[yxml.h](https://g.blicky.net/yxml.git/plain/yxml.h), which can easily be
included and compiled as part of your project. Complete API documentation is
available in [the manual](/yxml/man).

The API follows a simple and mostly buffer-less design, and only consists of
three functions:

```c
void yxml_init(yxml_t *x, void *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
```

Be aware that _simple_ is not necessarily _easy_ or _convenient_. The API is
relatively low-level and designed to integrate into pretty much any application
and for any use case. This includes incrementally parsing data from a socket in
an event-driven fashion and parsing large XML files on memory-restricted
devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.
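
To make the event-driven use case a bit more concrete, here is a minimal sketch of how one chunk of bytes (say, the result of a single `read()` or `recv()`) could be pushed through a byte-at-a-time parser. The `byte_sink` callback type, `feed_chunk()` and `counting_sink()` are hypothetical names made up for this illustration and are not part of yxml. In a real application the sink would presumably be a thin wrapper around `yxml_parse()`, assuming that errors are reported as negative return values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-byte callback: returns a negative value on error and
 * zero or a positive value otherwise. */
typedef int (*byte_sink)(void *ctx, int ch);

/* Feed one chunk to the sink, one byte at a time.
 * Returns 0 if the whole chunk was accepted, or the sink's negative error
 * code. *consumed is set to the number of bytes accepted before stopping. */
static int feed_chunk(byte_sink sink, void *ctx,
                      const char *data, size_t len, size_t *consumed)
{
    size_t i;
    assert(sink != NULL && consumed != NULL);
    for (i = 0; i < len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        int r = sink(ctx, (unsigned char)data[i]);
        if (r < 0) {
            *consumed = i;
            return r;
        }
    }
    *consumed = len;
    return 0;
}

/* Stand-in sink for demonstration only: counts bytes and rejects NUL. */
static int counting_sink(void *ctx, int ch)
{
    size_t *count = ctx;
    if (ch == 0)
        return -1;
    (*count)++;
    return 0;
}
```

Since `feed_chunk()` keeps no state of its own, it can simply be called again with the next chunk whenever more data arrives; all parser state lives behind `ctx`.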

There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.

### Features

- Simple and low-level API.
- Does not require `malloc()`.
- Pure C, should be very portable.
- Recognizes and consumes the UTF-8 BOM.
- Parses entity references (`&amp;`) and character references (`&#38;`).
- Verifies most well-formedness constraints, including the correct nesting of
  elements.
- Parses XML documents in any ASCII-compatible encoding.
- Extensively fuzzed.

But let's not be _too_ optimistic, because there are also...

### Bugs and Limitations

- A conditional section in a `<!DOCTYPE ..>` declaration will result in a parse
  error.
- Allows multiple `<!DOCTYPE ..>` declarations.
- Information encoded in the XML and doctype declarations is currently not
  available through the API.

These issues may be addressed in future versions.

### Conformance Issues

- Does not verify that non-ASCII characters in element names, element content,
  attribute names and attribute values are within the allowed Unicode character
  ranges.
- Does not verify that attribute names within the same element are unique.
- Does not verify that the contents of a `<!DOCTYPE ..>` declaration follow the
  XML grammar.
- Can't parse documents in a non-ASCII-compatible encoding. You'll have to
  convert it to UTF-8 or something similar first.
- No support for custom entity references, neither through the API nor using
  `<!ENTITY>`.
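
As an illustration of the first point, an application that needs the character range check could run it on the strings it gets back from the parser. The sketch below assumes the document is encoded in UTF-8 and tests every decoded code point against the `Char` production of XML 1.0. The stricter rules for characters in names are not covered, and all function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XML 1.0 "Char" production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
static int xml_is_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Decode one UTF-8 sequence from s (len bytes available).
 * Returns the number of bytes consumed and stores the code point in *cp,
 * or returns 0 for a truncated, overlong or otherwise malformed sequence. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */
    if (len < n) return 0;          /* sequence is cut off */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[n]) return 0;    /* overlong encoding */
    *cp = c;
    return n;
}

/* Returns 1 if buf is well-formed UTF-8 consisting only of XML Chars. */
static int xml_chars_ok(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *)buf;
    while (len > 0) {
        uint32_t cp;
        size_t n = utf8_decode(s, len, &cp);
        if (n == 0 || !xml_is_char(cp)) return 0;
        s += n;
        len -= n;
    }
    return 1;
}
```

Surrogates and code points above `U+10FFFF` need no special casing in the decoder here, since `xml_is_char()` rejects them anyway.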

These conformance issues are the result of the byte-oriented and minimal design
of yxml and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above-mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.
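
The attribute uniqueness check is a good example of something that is easy to bolt on. Here is a minimal sketch, assuming the application calls `attr_set_reset()` whenever the parser reports a new element and `attr_set_add()` for every attribute name it reports. The `attr_set` type, both functions and the two size limits are hypothetical and chosen only to keep the example free of `malloc()`.

```c
#include <assert.h>
#include <string.h>

#define ATTR_MAX      32   /* assumed cap on attributes per element */
#define ATTR_NAME_MAX 64   /* assumed cap on name length, including the NUL */

typedef struct {
    char names[ATTR_MAX][ATTR_NAME_MAX];
    int count;
} attr_set;

/* Forget all names; call this when a new element starts. */
static void attr_set_reset(attr_set *s)
{
    assert(s != NULL);
    s->count = 0;
}

/* Record an attribute name of the current element.
 * Returns 0 if the name is new, -1 if it was already seen on this element,
 * and -2 if the name or the set exceeds the assumed limits. */
static int attr_set_add(attr_set *s, const char *name)
{
    int i;
    assert(s != NULL && name != NULL);
    if (strlen(name) >= ATTR_NAME_MAX)
        return -2;
    for (i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return -1;
    if (s->count >= ATTR_MAX)
        return -2;
    strcpy(s->names[s->count++], name);   /* length was checked above */
    return 0;
}
```

A linear scan is used on the assumption that elements rarely carry more than a handful of attributes; a wrapper that expects hundreds would want a hash table instead.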

### Non-features

And now follows a list of things that are not part of the core XML
specification and are not directly supported. As with the conformance issues,
these features can be implemented on top of yxml.

- No helper functions to deal with namespaces. Yxml will parse XML files with
  namespaces just fine, but it's up to the application to do the rest.
- No DTD or XML Schema validation.
- No XSLT.
- No XPath.
- Doesn't do your household chores.
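
To give an idea of what "the rest" means for namespaces, here is a small sketch of two helpers an application might write itself: one splits a qualified name such as `svg:rect` into prefix and local part, the other recognizes namespace declarations among attribute names. Both `qname_split()` and `is_xmlns_decl()` are hypothetical names for this example; mapping prefixes to namespace URIs and tracking their scope is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split a qualified name at the first colon without modifying it.
 * *prefix points at the start of the name and *prefix_len is the length of
 * the prefix (0 when there is none). Returns a pointer to the local part. */
static const char *qname_split(const char *qname,
                               const char **prefix, size_t *prefix_len)
{
    const char *colon;
    assert(qname != NULL && prefix != NULL && prefix_len != NULL);
    colon = strchr(qname, ':');
    *prefix = qname;
    if (colon == NULL) {
        *prefix_len = 0;
        return qname;
    }
    *prefix_len = (size_t)(colon - qname);
    return colon + 1;
}

/* Returns 1 if an attribute name declares a namespace: either the default
 * one ("xmlns") or a prefixed one ("xmlns:foo"). */
static int is_xmlns_decl(const char *attr)
{
    assert(attr != NULL);
    return strcmp(attr, "xmlns") == 0 || strncmp(attr, "xmlns:", 6) == 0;
}
```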

## Users

Yxml is used in a few products. Let me know if I missed one.

- [FreeBSD's PKG](https://wiki.freebsd.org/pkgng) uses it to parse
  [VuXML](https://www.vuxml.org/) metadata
  ([src](https://github.com/freebsd/pkg/blob/master/libpkg/pkg_audit.c)).
- [getdns](https://getdnsapi.net/) uses it to parse DNSSEC trust anchor
  metadata
  ([src](https://github.com/getdnsapi/getdns/blob/develop/src/anchor.c)).
- [Fuchsia](https://fuchsia.dev/) uses it to parse SVG images
  ([src](https://fuchsia.googlesource.com/fuchsia/+/refs/heads/master/src/graphics/lib/compute/svg/svg.c)).
- [ncdc](https://dev.yorhel.nl/ncdc) uses it to parse XML-encoded file lists
  ([src](https://g.blicky.net/ncdc.git/tree/src/fl_load.c)).
- [BTstack](https://github.com/bluekitchen/btstack/) - apparently Bluetooth
  uses XML somewhere.
- [A MATLAB GIfTI library](https://www.artefact.tk/software/matlab/gifti/)
  ([src](https://github.com/gllmflndn/gifti/blob/master/%40gifti/private/xml_parser.c)).
- [RetroArch](https://github.com/libretro/RetroArch)
  ([src](https://github.com/libretro/RetroArch/blob/master/libretro-common/formats/xml/rxml.c)).
- [radare2](https://www.radare.org/n/radare2.html) uses it to parse information
  out of XNU binaries
  ([src](https://github.com/radareorg/radare2/blob/160fc95e66e82d844ef0f5a258c03de844524a6e/libr/bin/format/xnu/r_cf_dict.c)).
- [Crank Software's Storyboard](https://www.cranksoftware.com/storyboard) uses it
  to parse runtime configurations
  ([license](https://resources.cranksoftware.com/cranksoftware/v6.0.0/license/CrankStoryboardLicensing.html#idm45747345758896)).

## Comparison

The following benchmark compares [expat](http://expat.sourceforge.net/),
[libxml2](http://xmlsoft.org/) and
[Mini-XML](http://www.msweet.org/projects.php?Z3) with yxml. A
[strlen(3)](http://man.he.net/man3/strlen) implementation is also included as
an indication of the "theoretical" minimum.

                                     SIZE (bytes)     PERFORMANCE (s)
    LIB      VER    LICENSE            OBJ    STATIC    WIKI  DISCOGS
    strlen                              25       816    0.16     0.09
    expat    2.1.0  MIT            162 139   194 432    1.47     1.09
    libxml2  2.9.1  MIT            464 328   518 816    2.53     1.75
    mxml     2.7    LGPL2+static    32 733    75 832   12.38     7.80
    yxml     git    MIT              5 971    31 416    1.15     0.74

The code for these benchmarks is available in the
[bench/](https://g.blicky.net/yxml.git/tree/bench) directory of the git repo.
Some explanatory notes:

- `OBJ` is the total size of all object code of the library, measured with
  [size(1)](https://manned.org/size.1).
- `STATIC` is the file size of a minimal statically linked binary when linked
  against [musl](http://www.musl-libc.org/) 0.9.13, measured with
  [wc(1)](https://manned.org/wc.1) after running
  [strip(1)](https://manned.org/strip.1).
- The performance is the time, in seconds, to load a large XML file. `WIKI`
  refers to `enwiki-20130805-abstract5.xml` (162 MiB) from a [Wikipedia
  Dump](http://dumps.wikimedia.org/enwiki/), `DISCOGS` refers to
  `discogs_20130801_labels.xml` (94 MiB) from a [Discogs Data
  Dump](http://www.discogs.com/data/).
- Libxml2 has been compiled with most of its features disabled with
  `./configure`, but it still manages to be the very definition of bloat.
- Everything has been compiled with gcc 4.8.1 at `-O2`.
- Benchmarks are run on Linux 3.10.7 with a 3 GHz Intel Core Duo E8400 and with
  4 GB RAM.

And just for fun, here's the same comparison when compiled with `-Os`, i.e.
optimized for small size. Interestingly enough, Mini-XML actually runs faster
with `-Os` than with `-O2`.

                                     SIZE (bytes)     PERFORMANCE (s)
    LIB      VER    LICENSE            OBJ    STATIC    WIKI  DISCOGS
    strlen                              25       816    0.16     0.09
    expat    2.1.0  MIT            113 314   145 632    1.58     1.20
    libxml2  2.9.1  MIT            356 948   412 256    3.01     2.08
    mxml     2.7    LGPL2+static    27 725    71 704   11.70     7.44
    yxml     git    MIT              4 955    30 392    1.67     1.02

## Validating vs. non-validating

TL;DR: yxml does _not_ accept garbage XML documents; it will correctly handle
and report issues if the input does not strictly follow the XML grammar.

The terms _validating_ and _non-validating_ have specific meanings within the
context of XML. A validating parser is one that reads the doctype declaration
(DTD) associated with a document, and validates that the contents of the
document follow the rules described in the DTD. A DTD may also include
instructions on how to parse the document, including the definition of custom
entity references (`&whatever;`) and instructions on how attribute values or
element contents should be normalized before they are passed to the
application.

A non-validating parser is one that ignores the DTD and happily parses
documents that do not follow the rules described in that DTD. Such parsers
(usually) don't support custom entity references and will not normalize
attribute values or element contents. A non-validating parser still has to
verify that the XML document follows the XML syntax rules.

It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
schema. Wikipedia [has more
information](https://en.wikipedia.org/wiki/XML#Schemas_and_validation) on this.
Using a validating parser for such documents would only add bloat and may
introduce [potential security
vulnerabilities](https://en.wikipedia.org/wiki/Billion_laughs).
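
For a sense of scale, the arithmetic behind that attack is easy to work out. The sketch below computes how large a chain of nested entity definitions becomes once fully expanded. The function name and parameters are made up for this example; the numbers in the usage note correspond to the commonly cited form of the attack, where a 3-byte string is referenced ten times per level across nine levels.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a fully expanded entity chain: a base string of base_len
 * bytes, referenced `fanout` times per level, nested `levels` deep.
 * Returns UINT64_MAX if the result does not fit in 64 bits. */
static uint64_t expanded_size(uint64_t base_len, uint64_t fanout, unsigned levels)
{
    uint64_t size = base_len;
    unsigned i;
    assert(fanout > 0);
    for (i = 0; i < levels; i++) {
        if (size > UINT64_MAX / fanout)
            return UINT64_MAX;      /* would overflow */
        size *= fanout;
    }
    return size;
}
```

`expanded_size(3, 10, 9)` comes out at 3 000 000 000 bytes, so a document of less than a kilobyte asks the parser for roughly 3 GB of memory. A parser that never expands custom entities is not exposed to this in the first place.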