% Yxml - A small, fast and correct\* XML parser

_\*But see the [Bugs and Limitations](#bugs-and-limitations) and [Conformance Issues](#conformance-issues) below._

Yxml is a small (`6 KiB`) [non-validating](#validating-vs.-non-validating) yet
mostly conforming XML parser written in C. Its primary goals are small binary
size, simplicity and correctness. It also happens to be [pretty
fast](#comparison).

The code can be obtained from the [git repo](https://g.blicky.net/yxml.git) and
is available under a permissive MIT license. The only two files you need are
[yxml.c](https://g.blicky.net/yxml.git/plain/yxml.c) and
[yxml.h](https://g.blicky.net/yxml.git/plain/yxml.h), which can easily be
included and compiled as part of your project. Complete API documentation is
available in [the manual](/yxml/man).

The API follows a simple and mostly buffer-less design, and only consists of
three functions:

```c
void yxml_init(yxml_t *x, void *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
```

Be aware that _simple_ is not necessarily _easy_ or _convenient_. The API is
relatively low-level and designed to integrate into pretty much any application
and for any use case. This includes incrementally parsing data from a socket in
an event-driven fashion and parsing large XML files on memory-restricted
devices. It is possible to implement a more convenient and high-level API on
top of yxml, but I'm not very fond of libraries that do more than what I
strictly need.

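To illustrate what a byte-at-a-time, buffer-less API means in practice, here is
a minimal sketch of the same push-style pattern. It is a toy and not yxml
itself: `toy_t`, `toy_init()`, `toy_feed()`, `toy_feed_chunk()` and
`toy_demo()` are hypothetical names, and the state machine only counts start
tags (it ignores quoting, comments and CDATA sections). The point is that all
state lives in a caller-owned struct, so the input can be split at arbitrary
chunk boundaries, as happens when reading from a socket.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy parser state; this is NOT yxml's yxml_t. */
typedef struct {
    int in_tag;      /* 1 while between '<' and '>' */
    int first;       /* 1 if the next byte directly follows '<' */
    unsigned opens;  /* number of start tags seen so far */
} toy_t;

static void toy_init(toy_t *t) {
    t->in_tag = 0;
    t->first = 0;
    t->opens = 0;
}

/* Feed a single byte. Returns 1 if this byte begins a start tag, else 0. */
static int toy_feed(toy_t *t, int ch) {
    if (!t->in_tag) {
        if (ch == '<') {
            t->in_tag = 1;
            t->first = 1;
        }
        return 0;
    }
    if (t->first) {
        t->first = 0;
        if (ch == '>') {              /* "<>": leave the tag again */
            t->in_tag = 0;
            return 0;
        }
        /* '/', '?' and '!' introduce end tags, PIs and declarations. */
        if (ch != '/' && ch != '?' && ch != '!') {
            t->opens++;
            return 1;
        }
        return 0;
    }
    if (ch == '>')
        t->in_tag = 0;
    return 0;
}

/* Feed a whole chunk, as one might do after each read() on a socket. */
static void toy_feed_chunk(toy_t *t, const char *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        toy_feed(t, (unsigned char)buf[i]);
}

/* The document arrives in two chunks, split in the middle of a tag. */
static unsigned toy_demo(void) {
    const char *chunk1 = "<a><b at";
    const char *chunk2 = "tr='1'/></a>";
    toy_t t;
    toy_init(&t);
    toy_feed_chunk(&t, chunk1, strlen(chunk1));
    toy_feed_chunk(&t, chunk2, strlen(chunk2));
    return t.opens;
}
```

With yxml the loop has the same shape: call `yxml_parse()` for each byte as it
arrives and `yxml_eof()` once after the last one.
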
There are no tarball releases available at the moment. The API is relatively
stable, but I won't currently promise any ABI stability. Dynamic linking
against yxml is therefore not a very good idea.

### Features

- Simple and low-level API.
- Does not require `malloc()`.
- Pure C, should be very portable.
- Recognizes and consumes the UTF-8 BOM.
- Parses entity references (`&amp;`) and character references (`&#38;`).
- Verifies most well-formedness constraints, including the correct nesting of
  elements.
- Parses XML documents in any ASCII-compatible encoding.

But let's not be _too_ optimistic, because there are also...

### Bugs and Limitations

- A conditional section in a `<!DOCTYPE ..>` declaration will result in a parse
  error.
- Allows multiple `<!DOCTYPE ..>` declarations.
- Information encoded in the XML and doctype declarations is currently not
  available through the API.

I hope to have these issues fixed in the near future.

### Conformance Issues

- Does not verify that non-ASCII characters in element names, element content,
  attribute names and attribute values are within the allowed Unicode character
  ranges.
- Does not verify that attribute names within the same element are unique.
- Does not verify that the contents of a `<!DOCTYPE ..>` declaration follow the
  XML grammar.
- Can't parse documents in a non-ASCII-compatible encoding. You'll have to
  convert them to UTF-8 or something similar first.
- No support for custom entity references, neither through the API nor using
  `<!ENTITY>`.

These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. The
intention is to make sure that all of the above-mentioned issues can be fixed
on top of yxml (by the application, or by a wrapper) if strict conformance is
required, but the required functionality to support custom entity references
and DTD handling has not been implemented yet.

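As an illustration of how an application or wrapper could add two of the
missing checks, here is a minimal sketch. The names `attrset_t`,
`attrset_reset()`, `attrset_add()` and `xml_is_char()` and the limits
`MAX_ATTRS` and `MAX_NAME` are assumptions made for this example and not part
of yxml. The first helper detects duplicate attribute names within one element
without using `malloc()`. The second tests an already decoded code point
against the `Char` production of XML 1.0, which applies to element content and
attribute values; names follow stricter productions that are not shown, and
neither is the UTF-8 decoding.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ATTRS 32   /* hypothetical limit on attributes per element */
#define MAX_NAME  64   /* hypothetical limit on name length, including NUL */

/* Names of the attributes seen so far in the current element. */
typedef struct {
    char names[MAX_ATTRS][MAX_NAME];
    size_t count;
} attrset_t;

/* Call whenever a new element starts. */
static void attrset_reset(attrset_t *s) {
    s->count = 0;
}

/* Call for every attribute name of the current element.
 * Returns 0 if the name is new, -1 if it is a duplicate and
 * -2 if it does not fit within the fixed limits above. */
static int attrset_add(attrset_t *s, const char *name) {
    size_t len = strlen(name);
    if (len >= MAX_NAME || s->count >= MAX_ATTRS)
        return -2;
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return -1;
    memcpy(s->names[s->count], name, len + 1);
    s->count++;
    return 0;
}

/* XML 1.0 "Char" production: 1 if the code point may appear in a document. */
static int xml_is_char(uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

A wrapper would reset the set at the start of each element and add every
attribute name the parser reports, treating a return value of `-1` as a
well-formedness error.
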
### Non-features

And now follows a list of things that are not part of the core XML
specification and are not directly supported. As with the conformance issues,
these features can be implemented on top of yxml.

- No helper functions to deal with namespaces. Yxml will parse XML files with
  namespaces just fine, but it's up to the application to do the rest.
- No DTD or XML Schema validation.
- No XSLT.
- No XPath.
- Doesn't do your household chores.

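Since namespaces are left to the application, a first step is usually to split
a qualified name such as `svg:rect` into its prefix and local part. A minimal
sketch, where `qname_local()` is a hypothetical helper and not something yxml
provides:

```c
#include <stddef.h>
#include <string.h>

/* Splits a qualified name at the first ':' without modifying it.
 * Stores the length of the prefix in *prefix_len (0 if there is none)
 * and returns a pointer to the local part inside qname. */
static const char *qname_local(const char *qname, size_t *prefix_len) {
    const char *colon = strchr(qname, ':');
    if (colon == NULL) {
        *prefix_len = 0;
        return qname;
    }
    *prefix_len = (size_t)(colon - qname);
    return colon + 1;
}
```

Mapping the prefix to a namespace URI still requires the application to track
the `xmlns` and `xmlns:prefix` attributes of the enclosing elements.
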
## Comparison

The following benchmark compares [expat](http://expat.sourceforge.net/),
[libxml2](http://xmlsoft.org/) and
[Mini-XML](http://www.msweet.org/projects.php?Z3) with yxml. A
[strlen(3)](http://man.he.net/man3/strlen) implementation is also included as
an indication of the "theoretical" minimum.

                                    SIZE (bytes)   PERFORMANCE (s)
    LIB      VER    LICENSE           OBJ   STATIC   WIKI  DISCOGS
    strlen                             25      816   0.16     0.09
    expat    2.1.0  MIT           162 139  194 432   1.47     1.09
    libxml2  2.9.1  MIT           464 328  518 816   2.53     1.75
    mxml     2.7    LGPL2+static   32 733   75 832  12.38     7.80
    yxml     git    MIT             5 971   31 416   1.15     0.74

The code for these benchmarks is available in the
[bench/](https://g.blicky.net/yxml.git/tree/bench) directory on git. Some
explanatory notes:

- `OBJ` is the total size of all object code of the library, measured with
  [size(1)](https://manned.org/size.1).
- `STATIC` is the file size of a minimal statically linked binary when linked
  against [musl](http://www.musl-libc.org/) 0.9.13, measured with
  [wc(1)](https://manned.org/wc.1) after running
  [strip(1)](https://manned.org/strip.1).
- The performance is the time, in seconds, to load a large XML file. `WIKI`
  refers to `enwiki-20130805-abstract5.xml` (162 MiB) from a [Wikipedia
  Dump](http://dumps.wikimedia.org/enwiki/), `DISCOGS` refers to
  `discogs_20130801_labels.xml` (94 MiB) from a [Discogs Data
  Dump](http://www.discogs.com/data/).
- Libxml2 has been compiled with most of its features disabled with
  `./configure`, but it still manages to be the very definition of bloat.
- Everything has been compiled with gcc 4.8.1 at `-O2`.
- Benchmarks are run on Linux 3.10.7 with a 3 GHz Intel Core Duo E8400 and with
  4 GB RAM.

And just for fun, here's the same comparison when compiled with `-Os`, i.e.
optimized for small size. Interestingly enough, Mini-XML actually runs faster
with `-Os` than with `-O2`.

                                    SIZE (bytes)   PERFORMANCE (s)
    LIB      VER    LICENSE           OBJ   STATIC   WIKI  DISCOGS
    strlen                             25      816   0.16     0.09
    expat    2.1.0  MIT           113 314  145 632   1.58     1.20
    libxml2  2.9.1  MIT           356 948  412 256   3.01     2.08
    mxml     2.7    LGPL2+static   27 725   71 704  11.70     7.44
    yxml     git    MIT             4 955   30 392   1.67     1.02

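To put the timings in perspective, they can be converted to throughput by
dividing the file size by the load time. A small sketch using the `WIKI`
column of the `-O2` table; `throughput_mib_s()` and `print_wiki_throughput()`
are just names picked for this example:

```c
#include <stddef.h>
#include <stdio.h>

/* Throughput in MiB/s for a file of size_mib MiB loaded in the given time. */
static double throughput_mib_s(double size_mib, double seconds) {
    return size_mib / seconds;
}

/* Prints the throughput of each library for the 162 MiB WIKI file (-O2). */
static void print_wiki_throughput(void) {
    static const struct { const char *lib; double seconds; } rows[] = {
        { "strlen",   0.16 },
        { "expat",    1.47 },
        { "libxml2",  2.53 },
        { "mxml",    12.38 },
        { "yxml",     1.15 },
    };
    const double wiki_mib = 162.0;
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-8s %7.1f MiB/s\n", rows[i].lib,
               throughput_mib_s(wiki_mib, rows[i].seconds));
}
```

That works out to roughly 141 MiB/s for yxml and 110 MiB/s for expat, against
about 1 GiB/s for the `strlen` baseline.
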
## Validating vs. non-validating

TL;DR: yxml does _not_ accept garbage XML documents; it will correctly handle
and report issues if the input does not strictly follow the XML grammar.

The terms _validating_ and _non-validating_ have specific meanings within the
context of XML. A validating parser is one that reads the document type
definition (DTD) associated with a document, and validates that the contents of
the document follow the rules described in the DTD. A DTD may also include
instructions on how to parse the document, including the definition of custom
entity references (`&whatever;`) and instructions on how attribute values or
element contents should be normalized before passing their data to the
application.

A non-validating parser is one that ignores the DTD and happily parses
documents that do not follow the rules described in that DTD. Such parsers
(usually) don't support custom entity references and will not normalize
attribute values or element contents. A non-validating parser still has to
verify that the XML document follows the XML syntax rules.

It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
Schema. Wikipedia [has more
information](https://en.wikipedia.org/wiki/XML#Schemas_and_validation) on this.
Using a validating parser for such documents would only add bloat and may
introduce [potential security
vulnerabilities](https://en.wikipedia.org/wiki/Billion_laughs).
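
The size of that particular attack is easy to estimate. In its classic form, a
3-byte entity is referenced ten times by a second entity, which is in turn
referenced ten times by a third, and so on for nine levels. A quick sketch of
the arithmetic, with `expanded_size()` being a hypothetical helper that does
not guard against overflow for larger inputs:

```c
#include <stdint.h>

/* Number of bytes produced by fully expanding an entity that is nested
 * `levels` deep, where every level references the previous one `fanout`
 * times and the innermost entity is `base_len` bytes long. */
static uint64_t expanded_size(uint64_t base_len, unsigned fanout,
                              unsigned levels) {
    uint64_t size = base_len;
    for (unsigned i = 0; i < levels; i++)
        size *= fanout;
    return size;
}
```

`expanded_size(3, 10, 9)` gives 3 billion bytes, or about 3 GB, from a very
small document. A parser without support for custom entities, such as yxml,
never performs this expansion.
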