yxml updates

Yorhel 2013-10-02 16:27:05 +02:00
parent 4e7538db99
commit 94469e44e3


@ -1,10 +1,11 @@
=pod
I<*But see the L<Bugs and Limitations|/Bugs and Limitations> below.>
I<*But see the L<Bugs and Limitations|/Bugs and Limitations> and L<Conformance Issues|/Conformance Issues> below.>
Yxml is a small (C<6 KiB>) non-validating yet mostly conforming XML parser
written in C. Its primary goals are small binary size, simplicity and
correctness. It also happens to be L<pretty fast|/Comparison>.
Yxml is a small (C<6 KiB>) L<non-validating|/Validating vs. non-validating> yet
mostly conforming XML parser written in C. Its primary goals are small binary
size, simplicity and correctness. It also happens to be L<pretty
fast|/Comparison>.
The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
is available under a permissive MIT license. The only two files you need are
@ -60,11 +61,6 @@ But let's not be I<too> optimistic, because there are also...
=over
=item * Element and Attribute names may only consist of ASCII characters.
=item * Does not verify that non-ASCII characters in attribute values or
element contents are within the allowed character ranges.
=item * A conditional section in a C<< <!DOCTYPE ..> >> declaration will result
in a parse error.
@ -77,33 +73,51 @@ not available through the API.
I hope to have these issues fixed in the near future.
=head3 Non-features
And now follows a list of things that are not supported and probably never will
be. Most items on this list can be implemented on top of yxml.
=head3 Conformance Issues
=over
=item * Does not verify all well-formedness constraints. In particular, does
not verify that attribute names within the same element are unique, and does
not verify that the contents of a C<< <!DOCTYPE ..> >> declaration follow the
XML grammar.
=item * Does not verify that non-ASCII characters in element names, element
content, attribute names and attribute values are within the allowed Unicode
character ranges.
=item * No helper functions to deal with namespaces. Yxml will parse XML files
with namespaces just fine, but it's up to the application to do the rest.
=item * Does not verify that attribute names within the same element are unique.
=item * Does not verify that the contents of a C<< <!DOCTYPE ..> >> declaration
follow the XML grammar.
=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
to convert it to UTF-8 or something similar first.
=item * No support for custom entity references, neither through the API nor
using C<< <!ENTITY> >>.
=back
These conformance issues are the result of the byte-oriented and minimal design
of yxml, and I do not intend to fix these directly within the library. If strict
conformance is required, all of the above-mentioned issues except custom entity
references can be fixed on top of yxml (by the application, or by a wrapper). I
have a simple idea on how to support custom entity references in the future,
too.
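As an illustration of what "on top of yxml" could look like, here is a minimal
sketch of an attribute uniqueness check. None of these names are part of yxml:
C<attr_set>, C<attr_set_reset> and C<attr_set_add> are hypothetical helpers, and
the fixed limits are arbitrary. The sketch assumes the application calls
C<attr_set_reset> whenever the parser reports a new element and C<attr_set_add>
for every attribute name it reports; that wiring is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ATTRS 32     /* arbitrary limit for this sketch */
#define MAX_NAME_LEN 64  /* arbitrary limit, includes the NUL terminator */

/* Names of the attributes seen so far on the current element. */
typedef struct {
    char names[MAX_ATTRS][MAX_NAME_LEN];
    size_t count;
} attr_set;

/* Forget all names; call at the start of every element. */
void attr_set_reset(attr_set *s) {
    assert(s != NULL);
    s->count = 0;
}

/* Record one attribute name.
 * Returns 0 if the name is new, 1 if it was already seen on this element,
 * and -1 if the name or the set does not fit in the fixed buffers. */
int attr_set_add(attr_set *s, const char *name) {
    size_t i, len;
    assert(s != NULL && name != NULL);
    len = strlen(name);
    if (len >= MAX_NAME_LEN)
        return -1;
    for (i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return 1;
    if (s->count == MAX_ATTRS)
        return -1;
    memcpy(s->names[s->count++], name, len + 1);
    return 0;
}
```

A return value of 1 is the case a strictly conforming parser would report as a
well-formedness error. The linear scan is a deliberate choice for this sketch:
elements rarely carry more than a handful of attributes, so a hash table would
mostly add code size.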
=head3 Non-features
And now follows a list of things that are not part of the core XML
specification and are not directly supported. As with the conformance issues,
these features can be implemented on top of yxml.
=over
=item * No helper functions to deal with namespaces. Yxml will parse XML files
with namespaces just fine, but it's up to the application to do the rest.
=item * No DTD or XML Schema validation.
=item * No XSLT.
=item * No XPath.
=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
to convert it to UTF-8 or something similar first.
=item * Doesn't do your household chores.
=back
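To give an idea of what "the rest" might involve for namespaces, the following
is a minimal sketch of the first step: splitting a qualified name such as
C<svg:rect> into its prefix and local part. C<qname_parts> and C<qname_split>
are hypothetical names and not part of yxml. Mapping a prefix to a namespace
URI through C<xmlns> attributes is a separate step that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of splitting a qualified name such as "svg:rect". */
typedef struct {
    const char *prefix;  /* start of the prefix, NOT NUL-terminated */
    size_t prefix_len;   /* 0 when the name has no prefix */
    const char *local;   /* NUL-terminated local part */
} qname_parts;

/* Split a qualified name at its colon. Both pointers in 'out' point into
 * 'qname', so no memory is allocated.
 * Returns 0 on success, or -1 if the name is empty, has an empty prefix or
 * local part, or contains more than one colon. */
int qname_split(const char *qname, qname_parts *out) {
    const char *colon;
    assert(qname != NULL && out != NULL);
    colon = strchr(qname, ':');
    if (colon == NULL) {
        /* Unprefixed name: the whole string is the local part. */
        if (*qname == '\0')
            return -1;
        out->prefix = qname;
        out->prefix_len = 0;
        out->local = qname;
        return 0;
    }
    if (colon == qname || colon[1] == '\0' || strchr(colon + 1, ':') != NULL)
        return -1;
    out->prefix = qname;
    out->prefix_len = (size_t)(colon - qname);
    out->local = colon + 1;
    return 0;
}
```

An application would call this on each element and attribute name it receives
from the parser and then look up the prefix in its own table of in-scope
namespace declarations.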
@ -122,7 +136,7 @@ implementation is also included as an indication of the "theoretical" minimum.
expat 2.1.0 MIT 162 139 194 432 1.47 1.09
libxml2 2.9.1 MIT 464 328 518 816 2.53 1.75
mxml 2.7 LGPL2+static 32 733 75 832 12.38 7.80
yxml git MIT 6 015 31 448 1.18 0.73
yxml git MIT 5 935 31 384 1.14 0.74
The code for these benchmarks is available in the
L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
@ -164,3 +178,32 @@ with C<-Os> than with C<-O2>.
libxml2 2.9.1 MIT 356 948 412 256 3.01 2.08
mxml 2.7 LGPL2+static 27 725 71 704 11.70 7.44
yxml git MIT 4 835 30 264 1.72 1.05
=head2 Validating vs. non-validating
TL;DR: yxml does I<not> accept garbage XML documents; it will correctly detect
and report an error if the input does not strictly follow the XML grammar.
The terms I<validating> and I<non-validating> have specific meanings within the
context of XML. A validating parser is one that reads the document type
definition (DTD) associated with a document, and validates that the contents of
the document follow the rules described in the DTD. A DTD may also include
instructions on how to parse the document, including the definition of custom
entity references (C<&whatever;>) and instructions on how attribute values or
element contents should be normalized before the data is passed to the
application.
A non-validating parser is one that ignores the DTD and happily parses
documents that do not follow the rules described in that DTD. Such parsers
(usually) don't support custom entity references and will not normalize
attribute values or
element contents. A non-validating parser still has to verify that the XML
document follows the XML syntax rules.
It should be noted that a lot of XML documents found in the wild are not
described with a DTD, but instead use an alternative technology such as XML
Schema. Wikipedia L<has more
information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on this.
Using a validating parser for such documents would only introduce bloat and may
introduce L<potential security
vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.